Live Migration and Synchronous Replicated Storage With Xen, DRBD and LVM
Tags: drbd, high availability, infrastructure, linux, lvm, redhat, scalability, storage, systems administration, unix, Virtualization, Xen
Xen LVM & DRBD Overview
The Xen Hypervisor provides a great deal of flexibility and high availability options when it comes to deploying virtual machines. One of the most attractive features it offers is live migration: the ability to take a running virtual machine (“domU”) and move it from one Xen host server (“dom0”) to another. As you might expect, it is called “live” because it is done while the virtual machine is running, without any noticeable degradation in performance or availability.
LVM is a logical volume manager for Linux. It allows you to take one or more disks and carve them up into dynamic volumes. Logical volumes are like really flexible partitions. They can be grown, shrunk, snapshotted, renamed, copied and moved around with minimal overhead and without ever needing to re-calculate partition sizes.
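As a quick illustration of that flexibility, here is a minimal sketch of a few common LVM operations. The volume group and volume names (vg0, vm_example) are placeholders for this example rather than anything used later in the howto.

# Create a 4G logical volume in volume group vg0 (names are examples)
lvcreate -L 4G -n vm_example vg0

# Grow it by 2G later, no re-partitioning required
lvextend -L +2G /dev/vg0/vm_example

# Take a point-in-time snapshot with 512M of copy-on-write space
lvcreate -s -L 512M -n vm_example_snap /dev/vg0/vm_example

# Rename or remove volumes just as easily
lvrename vg0 vm_example vm_example_old
lvremove /dev/vg0/vm_example_snap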
The DRBD Project provides storage replication at the block level; essentially it is a network-enabled RAID driver. With DRBD we can take an LVM volume and synchronize its contents between two servers. DRBD supports a multi-master architecture which, as you’ll read, is perfect for Xen.
Combining these technologies gives us a serious feature set: dynamic volume management, snapshotting, synchronous replication, virtualization and, best of all, live migration. These are the fundamentals of massively expensive enterprise solutions and we’re going to implement them for free.
Architecture and Scalability
For the purposes of this howto I’ll be working with only two systems; however, it is fairly easy to scale this concept up by deploying additional servers. At a point of critical mass you could also begin decoupling the components, building a highly available replicated storage cluster which the Xen servers connect to via iSCSI.
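To make that idea a little more concrete, here is a rough sketch of how a DRBD-backed device could be exported over iSCSI using the scsi-target-utils (tgtadm) tooling shipped with CentOS 5. The target IQN and access policy below are illustrative assumptions and not part of the setup described in this howto.

# On the storage node: export the replicated device as an iSCSI target
# (the IQN and target id are made up for this example)
tgtadm --lld iscsi --op new --mode target --tid 1 \
       -T iqn.2008-08.local.storage:vm_shaolin
tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/drbd1
# open the target to all initiators here and restrict access with iptables,
# in keeping with the approach used for the other ports in this howto
tgtadm --lld iscsi --op bind --mode target --tid 1 -I ALL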
Hardware
1x Intel Pentium 4
2x 80G SATA Hard Disks
2GB DDR2 RAM
2x Gigabit Ethernet Adapters
Software
CentOS 5.2
kernel-xen-2.6.18-92.1.6.el5
xen-3.0.3-64.el5_2.1
drbd82-8.2.6-1.el5.centos
kmod-drbd82-xen-8.2.6-1.2.6.18_92.1.6.el5
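If you don’t already have these installed, something along the following lines should pull them in on CentOS 5; this assumes the CentOS Extras repository is enabled (which is where the drbd82 packages live), and exact package versions will vary with your kernel.

# Install the Xen hypervisor, the Xen-enabled kernel and the DRBD 8.2
# userland plus the matching kernel module package
yum install xen kernel-xen drbd82 kmod-drbd82-xen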
Setting up the DRBD Environment
The first thing DRBD requires is a backing block device, so let’s create an LVM volume for this purpose.
[root@mk1 ~]# lvcreate -L 512M -n vm_shaolin vg0
[root@mk1 ~]# lvs
  LV         VG   Attr   LSize  Origin Snap%  Move Log Copy%  Convert
  vm_shaolin vg0  -wi-ao 512M
Next, we tell drbd.conf about the LVM block device.
/etc/drbd.conf is DRBD’s main configuration file. It is made up of a number of sub-sections and supports many more options than I currently utilize. man drbd.conf (http://www.drbd.org/users-guide-emb/re-drbdconf.html) does a great job of explaining every possible configuration option and should provide you with hours of entertainment.
#/etc/drbd.conf
common {
  protocol C;
}

resource vm_shaolin {
  disk      /dev/vg0/vm_shaolin;
  device    /dev/drbd1;
  meta-disk internal;

  syncer {
    rate 500M;
    verify-alg sha1;
  }

  on mk1 {
    address 10.0.0.81:7789;
  }
  on mk2 {
    address 10.0.0.82:7789;
  }
}
drbd.conf explained
common {
  protocol C;
}
“protocol” defines the replication mode, i.e. at what point a write is considered synchronized. Protocol C is the safest: it does not report a write as complete until it has reached both nodes. The other protocols are described in man drbd.conf.
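For reference, here is a short annotated sketch of the three protocols; the comments paraphrase the DRBD documentation.

common {
  # Protocol A: write is complete once it hits the local disk and the
  #             local TCP send buffer (asynchronous, fastest, least safe).
  # Protocol B: write is complete once it has reached the peer's memory
  #             (memory synchronous).
  # Protocol C: write is complete only when it has reached stable storage
  #             on both nodes (fully synchronous, safest).
  protocol C;
}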
resource vm_shaolin {
  disk      /dev/vg0/vm_shaolin;
  device    /dev/drbd1;
  meta-disk internal;
“resource” starts the section which describes a specific volume. “disk” is the block device where data is actually stored, in our case an LVM volume. “device” is the name of the device presented by DRBD. “meta-disk” defines where the DRBD meta-data is stored. I chose internal because it is the most automatic option. If you want to squeeze every ounce of I/O performance, or are paranoid about combining DRBD meta-data and filesystem data on the same block device, you may want to investigate using a separate block device for meta-data storage.
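As a sketch of that alternative, external meta-data can be pointed at its own device; the drbd_md volume name below is hypothetical and not part of this setup.

resource vm_shaolin {
  disk      /dev/vg0/vm_shaolin;
  device    /dev/drbd1;
  # store DRBD meta-data on a dedicated volume instead of inline;
  # the [0] selects index 0 so one device can hold meta-data for
  # several resources
  meta-disk /dev/vg0/drbd_md[0];
  # ... syncer and on sections as before ...
}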
  syncer {
    rate 500M;
    verify-alg sha1;
  }
“syncer” defines the parameters of the synchronization subsystem. I have upped the rate to 500M (megabytes per second) in order to let DRBD fully utilize the dual gigabit network interfaces I’ve given it. “verify-alg” defines the hashing algorithm used to compare blocks between systems.
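The verify-alg setting is what enables DRBD’s on-line verification; a minimal sketch of how it is typically invoked and monitored, run on one node only:

# Kick off an on-line verification pass of the resource; blocks whose
# sha1 digests differ between the nodes are reported in the kernel log
drbdadm verify vm_shaolin

# Watch progress the same way as a normal resync
cat /proc/drbd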
  on mk1 {
    address 10.0.0.81:7789;
  }
  on mk2 {
    address 10.0.0.82:7789;
  }
The “on” statements define parameters specific to each host involved in the DRBD cluster. Be aware that these names need to match the actual hostnames (the output of uname -n) in order for DRBD to start. “address” defines the IP and port DRBD will both listen on and connect to on each server. Make sure that your iptables rules allow access to these ports.
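A sketch of what such a rule might look like on mk1, assuming 10.0.0.82 is the peer and your INPUT chain otherwise drops traffic; adjust to your own firewall layout and mirror it on mk2.

# Allow the DRBD peer (mk2) to reach this resource's replication port
iptables -A INPUT -p tcp -s 10.0.0.82 --dport 7789 -j ACCEPT
service iptables save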
Initializing The DRBD Volume
Now that we’ve defined the working parameters of the DRBD replication subsystem we’re ready to initialize the volume.
We start off by initializing the meta-data for the DRBD device on the first node:
[root@mk1 ~]# drbdadm create-md vm_shaolin
v08 Magic number not found
v07 Magic number not found
v07 Magic number not found
v08 Magic number not found
Writing meta data...
initialising activity log
NOT initialized bitmap
New drbd meta data block sucessfully created.
Do the same on the second node.
[root@mk2 ~]# drbdadm create-md vm_shaolin
v08 Magic number not found
v07 Magic number not found
v07 Magic number not found
v08 Magic number not found
Writing meta data...
initialising activity log
NOT initialized bitmap
New drbd meta data block sucessfully created.
With the meta-data created we can now attach the DRBD device to its backing block device and establish the network connection between the two sides. This must be performed on both nodes.
[root@mk1 ~]# drbdadm up vm_shaolin
[root@mk2 ~]# drbdadm up vm_shaolin
Note: “up” is a shorthand command which runs the “attach” command followed by the “connect” command behind the scenes.
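In other words, running the two lower-level commands by hand should be roughly equivalent:

# roughly what "drbdadm up" does for this resource
drbdadm attach vm_shaolin
drbdadm connect vm_shaolin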
Let’s check the status of our volume through the /proc/drbd interface.
[root@mk1 ~]# cat /proc/drbd
 1: cs:Connected st:Secondary/Secondary ds:Inconsistent/Inconsistent C r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 oos:524236
This shows us that the volume is in a connected state and that both nodes are showing up as Secondary and Inconsistent. This is what we expect to see, as we have not yet put any actual data on the volume. Now we need to run the initial synchronization; the following command only needs to be run on one node.
[root@mk1 ~]# drbdadm -- --overwrite-data-of-peer primary vm_shaolin
Now when we look at /proc/drbd we see progress as the volume is synchronized over the network.
[root@mk1 ~]# cat /proc/drbd
 1: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---
    ns:19744 nr:0 dw:0 dr:19744 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 oos:504492
    [>....................] sync'ed:  4.0% (504492/524236)K
    finish: 0:00:03 speed: 40,944 (40,944) K/sec
Once the volume has finished its synchronization we should see that both sides are showing “UpToDate” device status.
[root@mk1 ~]# cat /proc/drbd
 1: cs:Connected st:Secondary/Secondary ds:UpToDate/UpToDate C r---
    ns:458700 nr:0 dw:0 dr:458700 al:0 bm:28 lo:0 pe:0 ua:0 ap:0 oos:0
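If you prefer not to read /proc/drbd directly, the same information can be queried per resource with drbdadm; a quick sketch:

# connection state (e.g. Connected)
drbdadm cstate vm_shaolin
# roles of the local/peer node (e.g. Secondary/Secondary)
drbdadm state vm_shaolin
# disk state of the local/peer node (e.g. UpToDate/UpToDate)
drbdadm dstate vm_shaolin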
Now that we’ve verified the device status we’re ready to promote the volume to “Primary” status on the primary server.
[root@mk1 ~]# drbdadm primary vm_shaolin
We should see this Primary/Secondary status reflected in the /proc/drbd interface.
[root@mk1 ~]# cat /proc/drbd
 1: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
    ns:458700 nr:0 dw:0 dr:458700 al:0 bm:28 lo:0 pe:0 ua:0 ap:0 oos:0
Setting up the Xen Environment
Our volume is now ready for data; we can format it with mkfs or populate it with a pristine block device image. Since this is going to be the root file system of a Xen domU, I usually start with a file system image from stacklet. I’ll leave it up to you to get your favorite OS installed on this block device.
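As one sketch of populating the volume by hand, assuming an ext3 root filesystem and a hypothetical image tarball path:

# format the replicated device and unpack a root filesystem image onto it
mkfs.ext3 /dev/drbd1
mount /dev/drbd1 /mnt
tar -xzf /tmp/centos-5-rootfs.tar.gz -C /mnt   # hypothetical image path
umount /mnt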
DomU configuration
The Xen virtual machine configuration file is pretty standard. The important piece is that we specify the DRBD resource name on the disk line using the “drbd:” prefix, which invokes the Xen block script provided by the DRBD distribution.
#/etc/xen/configs/shaolin
name   = 'shaolin';
memory = 512;
maxmem = 4096;
kernel = '/boot/xenU/vmlinuz-2.6.16.49-xenU-r1';
disk   = [ 'drbd:vm_shaolin,hda1,w' ];
root   = '/dev/hda1 ro';
vif    = [ 'bridge=xenbr0, mac=a0:00:00:01:00:01' ];
Xend configuration
In order for live migration to work we need to enable it and define the network interfaces, ports and permitted hosts for xend. These configuration steps must be completed on both hosts.
#/etc/xen/xend-config.sxp
(xend-relocation-server yes)
(xend-relocation-port 8002)
(xend-relocation-address '')
“xend-relocation-server” switches the live migration functionality on or off. “xend-relocation-port” defines the TCP port used for incoming relocation; 8002 looks good. “xend-relocation-address” is the address xend listens on for relocation connections; I leave this empty so it listens on all interfaces, and then restrict access using iptables.
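A sketch of such a restriction on mk1, assuming mk2 (10.0.0.82) is the only host that should be allowed to migrate domains in; mirror the rule on mk2 with mk1’s address.

# only accept xend relocation connections from the other Xen host
iptables -A INPUT -p tcp -s 10.0.0.82 --dport 8002 -j ACCEPT
iptables -A INPUT -p tcp --dport 8002 -j DROP
service iptables save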
Once live migration has been enabled in your xend config you’ll need to restart xend:
# /etc/init.d/xend restart
restart xend:                                              [  OK  ]
Verify that xend is listening for relocation connections:
# netstat -nlp | grep 8002
tcp    0    0 0.0.0.0:8002    0.0.0.0:*    LISTEN    4109/python
Starting the Virtual Machine
We can now start our virtual machine using the “xm create” command. This is critical: your virtual machine must be run on only one host at a time. Two instances of the virtual machine using the same block device simultaneously will severely corrupt your file system.
[root@mk1 ~]# xm create /etc/xen/configs/shaolin
Using config file "/etc/xen/configs/shaolin".
Started domain shaolin
To check in on it after we have started it we use the “xm list” command.
[root@mk1 ~]# xm list
Name            ID  Mem(MiB)  VCPUs  State   Time(s)
Domain-0         0       491      2  r-----    176.6
shaolin          1       511      1  -b----      7.1
Migrating the Virtual Machine
Once the virtual machine has been started on the first node we can migrate it over to the second. Run the following command on the first node:
[root@mk1 ~]# xm migrate --live shaolin mk2
It’s normal for this command to produce no output. You should now see that your VM is no longer running on the first node.
[root@mk1 ~]# xm list
Name            ID  Mem(MiB)  VCPUs  State   Time(s)
Domain-0         0       489      2  r-----    208.7
Let’s verify that it migrated over…
[root@mk2 ~]# xm list
Name            ID  Mem(MiB)  VCPUs  State   Time(s)
Domain-0         0      1509      2  r-----    263.8
shaolin          3       511      1  -b----     26.2
There it is; you may notice the counters have been reset.
A few notes and precautions about live migration: