Live Migration and Synchronous Replicated Storage With Xen, DRBD and LVM


Xen LVM & DRBD Overview


The Xen Hypervisor provides a great deal of flexibility and high availability options when it comes to deploying virtual machines. One of the most attractive features it offers is called live migration. Live migration is the ability to take a running virtual machine (“domU”) and move it from one Xen host server (“dom0”) to another. As you might expect, it is called “Live” because it is done while the virtual machine is on, without any noticeable degradation in performance or availability.

LVM is a logical volume manager for Linux. It allows you to take one or more disks and carve them up into dynamic volumes. Logical volumes are like really flexible partitions. They can be grown, shrunk, snapshotted, renamed, copied and moved around with minimal overhead and without ever needing to re-calculate partition sizes.
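
For example, growing a volume and taking a point-in-time snapshot are each a single command (the volume name "mydata" here is purely illustrative):

# grow the volume by 10G (online, if the filesystem on it supports it)
[root@mk1 ~]# lvextend -L +10G /dev/vg0/mydata

# create a 1G copy-on-write snapshot of the volume
[root@mk1 ~]# lvcreate -s -L 1G -n mydata_snap /dev/vg0/mydata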

The DRBD Project provides storage replication at the block level; essentially, it is a network-enabled RAID-1 driver. With DRBD we can take an LVM volume and synchronize its contents across two servers. DRBD also supports a multi-master architecture which, as you’ll read, is perfect for Xen.

Combining these technologies gives us a serious feature set: dynamic volume management, snapshotting, synchronous replication, virtualization and, best of all, live migration. These are the fundamentals of massively expensive enterprise solutions and we’re going to implement them for free.

Architecture and Scalability


For the purposes of this howto I’ll be working with only two systems; however, it is fairly easy to scale this concept up by deploying additional servers. At a point of critical mass you could also begin decoupling the components into a highly available replicated storage system and Xen servers which connect to it via iSCSI.

Hardware

1x Intel Pentium 4
2x 80G SATA Hard Disks
2GB DDR2 RAM
2x Gigabit Ethernet Adapters

Software

CentOS 5.2
kernel-xen-2.6.18-92.1.6.el5
xen-3.0.3-64.el5_2.1
drbd82-8.2.6-1.el5.centos
kmod-drbd82-xen-8.2.6-1.2.6.18_92.1.6.el5


Setting up the DRBD Environment


The first thing DRBD requires is a backing block device, so let’s create an LVM volume for this purpose. Note that DRBD needs a backing device on each node, so create the same volume on both mk1 and mk2.

[root@mk1 ~]# lvcreate -L 512M -n vm_shaolin vg0
 
[root@mk1 ~]# lvs
 
LV         VG   Attr   LSize Origin Snap%  Move Log Copy%  Convert
vm_shaolin vg0  -wi-ao 512M

Next, we tell drbd.conf about the LVM block device.

/etc/drbd.conf is DRBD’s main configuration file. It is made up of a number of sub-sections and supports many more options than I currently utilize. man drbd.conf (also available online at http://www.drbd.org/users-guide-emb/re-drbdconf.html) does a great job of explaining every possible configuration option and should provide you with hours of entertainment.

#/etc/drbd.conf
common {
  protocol C;
}

resource vm_shaolin {

  disk      /dev/vg0/vm_shaolin;
  device    /dev/drbd1;
  meta-disk internal;

  syncer {
    rate 500M;
    verify-alg sha1;
  }

  on mk1 {
    address   10.0.0.81:7789;
  }

  on mk2 {
    address   10.0.0.82:7789;
  }

}
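
One caveat worth flagging: the DRBD Users Guide recommends that a resource used for Xen live migration be allowed to run Primary on both nodes at once, since during a migration the destination host attaches the device while the source is still running the domain. If your migrations fail because the device refuses to go Primary on the second node, try adding a net section to the resource; this is a hedged sketch based on the Users Guide, not part of the config above:

resource vm_shaolin {
  ...
  net {
    allow-two-primaries;
  }
}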

drbd.conf explained

common {
protocol C;
}

“protocol” defines the method which is used to guarantee that data is synchronized. Protocol C is the safest: it ensures that a write has completed on both sides before reporting success. The other methods are defined in man drbd.conf.
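
For reference, here is how the three protocols compare, written as comments you could keep in the config:

# protocol A;   # asynchronous: a write is complete once it has reached the
#               # local disk and the local TCP send buffer
# protocol B;   # memory synchronous: complete once the peer holds it in RAM
protocol C;     # synchronous: complete once the peer has written it to disk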

resource vm_shaolin {
disk      /dev/vg0/vm_shaolin;
device    /dev/drbd1;
meta-disk internal;

“resource” starts the section which describes a specific volume. The “disk” is the block device where data is actually stored; in our case this is an LVM volume. “device” is the name of the device presented by DRBD. “meta-disk” defines where the DRBD meta-data is stored. I chose internal because it is the most automatic option. If you want to squeeze out every ounce of IO performance or are paranoid about combining DRBD meta-data and filesystem data on the same block device, you may want to investigate using a separate block device for meta-data storage.
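
If you do go the external route, the directive takes a block device and an index in place of “internal” (the device name below is purely illustrative):

meta-disk /dev/vg0/drbd_metadata[0];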

syncer {
rate 500M;
verify-alg sha1;
}

“syncer” defines the parameters of the synchronization system. I have upped the rate to 500M (megabytes per second) in order to let DRBD fully utilize the dual gigabit network interfaces I’ve given it. “verify-alg” defines the hashing method used to compare blocks between systems.
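
Since verify-alg is set, you can kick off an online verification pass at any time (available since DRBD 8.2.5) and watch its progress in /proc/drbd:

[root@mk1 ~]# drbdadm verify vm_shaolin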

on mk1 {
address   10.0.0.81:7789;
}
 
on mk2 {
address   10.0.0.82:7789;
}

The “on” statements define parameters specific to each host involved in the DRBD cluster. Be aware that these names do need to resolve properly on the hosts in order for DRBD to start. “address” defines the IP and port DRBD will both listen on and connect to on each server. Make sure that your iptables rules allow access to these ports.
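
For example, on mk1 a minimal rule allowing the peer in looks like this (mirror it on mk2 with the address swapped):

[root@mk1 ~]# iptables -A INPUT -p tcp -s 10.0.0.82 --dport 7789 -j ACCEPT
[root@mk1 ~]# service iptables save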

Initializing The DRBD Volume

Now that we’ve defined the working parameters of the DRBD replication subsystem we’re ready to initialize the volume.

We start off by initializing the meta data for the DRBD devices on the first node:

[root@mk1 ~]# drbdadm create-md vm_shaolin
 
v08 Magic number not found
v07 Magic number not found
v07 Magic number not found
v08 Magic number not found
Writing meta data...
initialising activity log
NOT initialized bitmap
New drbd meta data block successfully created.

Do the same on the second node.

[root@mk2 ~]# drbdadm create-md vm_shaolin
 
v08 Magic number not found
v07 Magic number not found
v07 Magic number not found
v08 Magic number not found
Writing meta data...
initialising activity log
NOT initialized bitmap
New drbd meta data block successfully created.

With the meta-data created we can now attach the DRBD device to its backing block device and establish the network connection between the two sides. This must be performed on both nodes.

[root@mk1 ~]# drbdadm up vm_shaolin
[root@mk2 ~]# drbdadm up vm_shaolin

Note: “up” is a shorthand command which runs the “attach” command followed by the “connect” command behind the scenes.
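
The equivalent long form would be:

[root@mk1 ~]# drbdadm attach vm_shaolin
[root@mk1 ~]# drbdadm connect vm_shaolin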

Let’s check the status of our volume through the /proc/drbd interface.

[root@mk1 ~]# cat /proc/drbd
1: cs:Connected st:Secondary/Secondary ds:Inconsistent/Inconsistent C r---
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 oos:524236

This shows us that the volume is in a connected state and that both nodes are showing up as secondary and inconsistent. This is what we expect to see, as we have not yet put any actual data on the volume. Now we need to synchronize our array; this command only needs to be run on one node.

[root@mk1 ~]# drbdadm -- --overwrite-data-of-peer primary vm_shaolin

Now when we look at /proc/drbd we see progress as the volume is synchronized over the network.

[root@mk1 ~]# cat /proc/drbd
 
1: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---
ns:19744 nr:0 dw:0 dr:19744 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 oos:504492
[>....................] sync'ed:  4.0% (504492/524236)K
finish: 0:00:03 speed: 40,944 (40,944) K/sec

Once the volume has finished its synchronization we should see that both sides are showing “UpToDate” device status.

[root@mk1 ~]# cat /proc/drbd
1: cs:Connected st:Secondary/Secondary ds:UpToDate/UpToDate C r---
ns:458700 nr:0 dw:0 dr:458700 al:0 bm:28 lo:0 pe:0 ua:0 ap:0 oos:0

Now that we’ve verified the device status we’re ready to promote the volume to “Primary” status on the primary server.

[root@mk1 ~]# drbdadm primary vm_shaolin

We should see this Primary/Secondary status reflected in the /proc/drbd interface.

[root@mk1 ~]# cat /proc/drbd
1: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
ns:458700 nr:0 dw:0 dr:458700 al:0 bm:28 lo:0 pe:0 ua:0 ap:0 oos:0


Setting up the Xen Environment


Our volume is now ready for data; we can format it with mkfs or populate it with a pristine block device image. Since this is going to be the root file system of a Xen system I usually start with a file system image from stacklet. I’ll leave it up to you to get your favorite OS installed on this block device.
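
Either way, work against the DRBD device on the node where the resource is Primary. For example (the image filename below is hypothetical):

# format the device with a fresh filesystem...
[root@mk1 ~]# mkfs.ext3 /dev/drbd1

# ...or write out a pre-built root filesystem image
[root@mk1 ~]# dd if=centos5-root.img of=/dev/drbd1 bs=1M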

DomU configuration

The Xen virtual machine configuration file is pretty standard. The important piece is the “drbd:” disk prefix, which specifies the resource name and invokes the Xen block script provided by the DRBD distribution. The block script takes care of promoting the resource to Primary on the host where the domain is started.

#/etc/xen/configs/shaolin
name    = 'shaolin';
memory     = 512;
maxmem  = 4096;
kernel  = '/boot/xenU/vmlinuz-2.6.16.49-xenU-r1';
disk = [ 'drbd:vm_shaolin,hda1,w' ];
root = '/dev/hda1 ro';
vif = [ 'bridge=xenbr0, mac=a0:00:00:01:00:01' ];

Xend configuration

In order for live migration to work we need to enable it and define the network interfaces, ports and permitted hosts for xend. These configuration steps must be completed on both hosts.

#/etc/xen/xend-config.sxp
(xend-relocation-server yes)
(xend-relocation-port 8002)
(xend-relocation-address '')

“xend-relocation-server” switches the live migration functionality on or off. “xend-relocation-port” defines the TCP port used for incoming relocation; 8002 looks good. “xend-relocation-address” defines the local address xend listens on for relocation connections. I leave this empty so xend listens on all interfaces and then restrict access using iptables.
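
If you would rather have xend do the filtering itself, there is also a “xend-relocation-hosts-allow” directive which takes a space-separated list of regular expressions matched against connecting hostnames, for example:

#/etc/xen/xend-config.sxp
(xend-relocation-hosts-allow '^localhost$ ^mk1$ ^mk2$')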

Once live migration has been enabled in your xend config you’ll need to restart xend:

# /etc/init.d/xend restart
 
restart xend:                                              [  OK  ]

Verify that xend is listening for relocation connections:

# netstat -nlp |grep 8002
 
tcp        0      0 0.0.0.0:8002                0.0.0.0:*                   LISTEN      4109/python

Starting the Virtual Machine

We can now start our virtual machine using the “xm create” command. This is critical: your virtual machine must be run on only one host at a time. Two virtual machines using the same block device simultaneously will severely corrupt your file system.

[root@mk1 ~]# xm create /etc/xen/configs/shaolin
Using config file "/etc/xen/configs/shaolin".
Started domain shaolin

To check in on it after we have started it we use the “xm list” command.

[root@mk1 ~]# xm list
 
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0      491     2 r-----    176.6
shaolin                                    1      511     1 -b----      7.1

Migrating the Virtual Machine

Once the virtual machine has been started on the first node we can migrate it over to the second. Run the following command on the first node:

[root@mk1 ~]# xm migrate --live shaolin mk2

It’s normal for this command to produce no output. You should now see that your VM is no longer running on the first node.

[root@mk1 ~]# xm list
 
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0      489     2 r-----    208.7

Let’s verify that it migrated over…

[root@mk2 ~]# xm list
 
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0     1509     2 r-----    263.8
shaolin                                    3      511     1 -b----     26.2

There it is. You may notice that the domain ID and time counters have been reset.

A few notes and precautions about live migration:

  • In order to ensure that your switch learns that your virtual machine’s MAC address has moved to a different port, it is best to generate traffic by pinging out from the virtual machine continually during the migration, as shown below.
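
From inside the domU, something as simple as a continuous ping does the trick (the gateway address here is an assumption about your network):

[root@shaolin ~]# ping 10.0.0.1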