Live Migration and Synchronous Replicated Storage With Xen, DRBD and LVM



Xen LVM & DRBD Overview


The Xen Hypervisor provides a great deal of flexibility and high availability options when it comes to deploying virtual machines. One of its most attractive features is live migration: the ability to take a running virtual machine (“domU”) and move it from one Xen host server (“dom0”) to another. As you might expect, it is called “live” because it is done while the virtual machine keeps running, without any noticeable degradation in performance or availability.

LVM is a logical volume manager for Linux. It allows you to take one or more disks and carve them up into dynamic volumes. Logical volumes are like really flexible partitions. They can be grown, shrunk, snapshotted, renamed, copied and moved around with minimal overhead and without ever needing to re-calculate partition sizes.
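As a quick sketch (the volume and volume group names here are just examples), growing a logical volume and taking a snapshot of it each take a single command:

# hypothetical volume lv_example in volume group vg0
[root@mk1 ~]# lvextend -L +10G /dev/vg0/lv_example
[root@mk1 ~]# lvcreate -s -L 1G -n lv_example_snap /dev/vg0/lv_example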

The DRBD Project provides storage replication at the block level; essentially, it is a network-enabled RAID driver. With DRBD we can take an LVM volume and synchronize its contents across two servers. DRBD supports a multi-master architecture which, as you’ll read, is perfect for Xen.

Combining these technologies provides us with a serious feature set: dynamic volume management, snapshotting, synchronous replication, virtualization and, best of all, live migration. These are the fundamentals of massively expensive enterprise solutions, and we’re going to implement them for free.

Architecture and Scalability


For the purposes of this howto I’ll be working with only two systems; however, it is fairly easy to scale this concept up by deploying additional servers. At a point of critical mass you could also begin decoupling the components into a highly available replicated storage system and Xen servers which connect to it via iSCSI.

Hardware

1x Intel Pentium 4 @
2x 80G SATA Hard Disks
2GB DDR2 RAM
2x Gigabit Ethernet Adapters

Software

CentOS 5.2
kernel-xen-2.6.18-92.1.6.el5
xen-3.0.3-64.el5_2.1
drbd82-8.2.6-1.el5.centos
kmod-drbd82-xen-8.2.6-1.2.6.18_92.1.6.el5


Setting up the DRBD Environment


The first thing DRBD requires is a backing block device, so let’s create an LVM volume for this purpose.

[root@mk1 ~]# lvcreate -L 512M -n vm_shaolin vg0
 
[root@mk1 ~]# lvs
 
LV         VG   Attr   LSize Origin Snap%  Move Log Copy%  Convert
vm_shaolin vg0  -wi-ao 512M
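DRBD needs a backing device of the same size on both nodes, so assuming mk2 carries the same vg0 volume group, repeat the lvcreate there:

[root@mk2 ~]# lvcreate -L 512M -n vm_shaolin vg0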

Next, we tell drbd.conf about the LVM block device.

/etc/drbd.conf is DRBD’s main configuration file. It is made up of a number of sub-sections and supports many more options than I currently utilize. man drbd.conf (http://www.drbd.org/users-guide-emb/re-drbdconf.html) does a great job of explaining every possible configuration option and should provide you with hours of entertainment.

#/etc/drbd.conf
common {
protocol C;
 
}
 
resource vm_shaolin {
 
disk      /dev/vg0/vm_shaolin;
device    /dev/drbd1;
meta-disk internal;
 
syncer {
rate 500M;
verify-alg sha1;
}
 
on mk1 {
address   10.0.0.81:7789;
}
 
on mk2 {
address   10.0.0.82:7789;
}
 
}

drbd.conf explained

common {
protocol C;
}

“protocol” defines how DRBD decides that a write has been safely replicated. Protocol C is the safest: it does not report a write as complete until it has reached the disk on both nodes. The other protocols are described in man drbd.conf.
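For reference, this is roughly how the three protocols compare; see man drbd.conf for the authoritative definitions:

# protocol A;  asynchronous: a write completes once it is on the local disk
#              and in the local TCP send buffer
# protocol B;  memory synchronous: a write completes once the peer has
#              received the data
# protocol C;  synchronous: a write completes only once it is on disk on
#              both nodes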

resource vm_shaolin {
disk      /dev/vg0/vm_shaolin;
device    /dev/drbd1;
meta-disk internal;

“resource” starts the section which describes a specific volume. The “disk” is the block device where data is actually stored; in our case this is an LVM volume. “device” is the name of the device presented by DRBD. “meta-disk” defines where the DRBD meta-data is stored. I chose internal because it is the most automatic option. If you want to squeeze every ounce of IO performance, or are paranoid about combining DRBD meta-data and filesystem data on the same block device, you may want to investigate using a separate block device for meta-data storage.
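If you do go with external meta-data, the disk section would look something along these lines; the meta volume name and index here are hypothetical:

disk      /dev/vg0/vm_shaolin;
device    /dev/drbd1;
meta-disk /dev/vg0/vm_shaolin_meta[0];   # external meta-data on its own LV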

syncer {
rate 500M;
verify-alg sha1;
}

“syncer” defines the parameters of the synchronization system. I have upped the rate to 500M (megabytes per second), which is more than enough to let DRBD fully utilize the dual gigabit network interfaces I’ve given it. “verify-alg” defines the hashing algorithm used to compare blocks between the systems.
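Having verify-alg set also lets you kick off an online verification pass whenever you like and watch its progress through /proc/drbd:

[root@mk1 ~]# drbdadm verify vm_shaolin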

on mk1 {
address   10.0.0.81:7789;
}
 
on mk2 {
address   10.0.0.82:7789;
}

The “on” statements define parameters specific to each host involved in the DRBD cluster. Be aware that these names need to match each host’s actual hostname (what uname -n reports) in order for DRBD to start. “address” defines the IP and port DRBD will both listen on and connect to on each server. Make sure that your iptables rules allow access to these ports.
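For example, a minimal rule on mk1 that permits the DRBD traffic from mk2 might look like this (mirror it on mk2 using 10.0.0.81); adjust to taste if you use a different chain layout:

[root@mk1 ~]# iptables -A INPUT -p tcp -s 10.0.0.82 --dport 7789 -j ACCEPT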

Initializing The DRBD Volume

Now that we’ve defined the working parameters of the DRBD replication subsystem we’re ready to initialize the volume.

We start off by initializing the meta data for the DRBD devices on the first node:

[root@mk1 ~]# drbdadm create-md vm_shaolin
 
v08 Magic number not found
v07 Magic number not found
v07 Magic number not found
v08 Magic number not found
Writing meta data...
initialising activity log
NOT initialized bitmap
New drbd meta data block sucessfully created.

Do the same on the second node.

[root@mk2 ~]# drbdadm create-md vm_shaolin
 
v08 Magic number not found
v07 Magic number not found
v07 Magic number not found
v08 Magic number not found
Writing meta data...
initialising activity log
NOT initialized bitmap
New drbd meta data block sucessfully created.

With the meta-data created we can now attach the drbd device to the backing block device and establish a network connection on both sides. This must be performed on both nodes.

[root@mk1 ~]# drbdadm up vm_shaolin
[root@mk2 ~]# drbdadm up vm_shaolin

Note: “up” is a shorthand command which runs the “attach” command followed by the “connect” command behind the scenes.
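In other words, running the two commands separately would have the same effect:

[root@mk1 ~]# drbdadm attach vm_shaolin
[root@mk1 ~]# drbdadm connect vm_shaolin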

Let’s check the status of our volume through the /proc/drbd interface.

[root@mk1 ~]# cat /proc/drbd
1: cs:Connected st:Secondary/Secondary ds:Inconsistent/Inconsistent C r---
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 oos:524236

This shows us that the volume is in a connected state and that both nodes are showing up as secondary and inconsistent. This is what we expect to see, as we have not yet put any actual data on the volume. Now we need to synchronize our array; this only needs to be run on one node.

[root@mk1 ~]# drbdadm -- --overwrite-data-of-peer primary vm_shaolin

Now when we look at /proc/drbd we see progress as the volume is synchronized over the network.

[root@mk1 ~]# cat /proc/drbd
 
1: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---
ns:19744 nr:0 dw:0 dr:19744 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 oos:504492
[>....................] sync'ed:  4.0% (504492/524236)K
finish: 0:00:03 speed: 40,944 (40,944) K/sec

Once the volume has finished its synchronization we should see that both sides are showing “UpToDate” device status.

[root@mk1 ~]# cat /proc/drbd
1: cs:Connected st:Secondary/Secondary ds:UpToDate/UpToDate C r---
ns:458700 nr:0 dw:0 dr:458700 al:0 bm:28 lo:0 pe:0 ua:0 ap:0 oos:0

Now that we’ve verified the device status we’re ready to promote the volume to “Primary” status on the primary server.

[root@mk1 ~]# drbdadm primary vm_shaolin

We should see this Primary/Secondary status reflected in the /proc/drbd interface.

[root@mk1 ~]# cat /proc/drbd
1: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
ns:458700 nr:0 dw:0 dr:458700 al:0 bm:28 lo:0 pe:0 ua:0 ap:0 oos:0


Setting up the Xen Environment


Our volume is now ready for data; we can format it with mkfs or populate it with a pristine block device image. Since this is going to be the root file system of a Xen virtual machine, I usually start with a file system image from stacklet. I’ll leave it up to you to get your favorite OS installed on this block device.
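For example, either of the following will do; run them on the node where the resource is currently Primary, and note that the image path is just a placeholder:

[root@mk1 ~]# mkfs.ext3 /dev/drbd1
# or write a pre-built root file system image onto the device
[root@mk1 ~]# dd if=/path/to/rootfs.img of=/dev/drbd1 bs=1M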

DomU configuration

The Xen virtual machine configuration file is pretty standard. The important piece is that we specify the resource name on the disk line using the drbd Xen block script provided by the DRBD distribution.

#/etc/xen/configs/shaolin
name    = 'shaolin';
memory     = 512;
maxmem  = 4096;
kernel  = '/boot/xenU/vmlinuz-2.6.16.49-xenU-r1';
disk = [ 'drbd:vm_shaolin,hda1,w' ];
root = '/dev/hda1 ro';
vif = [ 'bridge=xenbr0, mac=a0:00:00:01:00:01' ];
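The “drbd:” prefix on the disk line is what invokes DRBD’s Xen block script. The DRBD packages should install it alongside Xen’s other hotplug scripts; on my systems a quick sanity check looks like this:

[root@mk1 ~]# ls -l /etc/xen/scripts/block-drbd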

Xend configuration

In order for live migration to work we need to enable it and define the network interfaces, ports and permitted hosts for xend. These configuration steps must be completed on both hosts.

#/etc/xen/xend-config.sxp
(xend-relocation-server yes)
(xend-relocation-port 8002)
(xend-relocation-address '')

“xend-relocation-server” switches the live migration functionality on or off. “xend-relocation-port” defines the TCP port used for incoming relocations; 8002 looks good. “xend-relocation-address” is the address xend listens on for relocation connections; I leave it empty so that it listens on all interfaces and then restrict access using iptables.
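If you would rather have xend itself restrict which peers may push migrations to this host, there is also a xend-relocation-hosts-allow directive which takes a space-separated list of regular expressions, something like:

(xend-relocation-hosts-allow '^localhost$ ^mk1$ ^mk2$')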

Once live migration has been enabled in your xend config you’ll need to restart xend:

# /etc/init.d/xend restart
 
restart xend:                                              [  OK  ]

Verify that xend is listening for relocation connections:

# netstat -nlp |grep 8002
 
tcp        0      0 0.0.0.0:8002                0.0.0.0:*                   LISTEN      4109/python

Starting the Virtual Machine

We can now start our virtual machine using the “xm create” command. This is critical: your virtual machine must be run on only one host at a time. Two virtual machines using the same block device simultaneously will severely corrupt your file system.

[root@mk1 ~]# xm create /etc/xen/configs/shaolin
Using config file "/etc/xen/configs/shaolin".
Started domain shaolin

To check in on it after we have started it we use the “xm list” command.

[root@mk1 ~]# xm list
 
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0      491     2 r-----    176.6
shaolin                                    1      511     1 -b----      7.1

Migrating the Virtual Machine

Once the virtual machine has been started on the first node we can migrate it over to the second. Run the following command on the first node:

[root@mk1 ~]# xm migrate --live shaolin mk2

It’s normal for this command to have no output. You should now see that your VM is no longer running on the first node.

[root@mk1 ~]# xm list
 
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0      489     2 r-----    208.7

Let’s verify that it migrated over…

[root@mk2 ~]# xm list
 
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0     1509     2 r-----    263.8
shaolin                                    3      511     1 -b----     26.2

There it is; you may notice the counters have been reset.

A few notes and precautions about live migration:

  • In order to ensure that your switch learns that your virtual machine’s MAC address has moved to a different port, it is best to generate traffic by pinging out from the virtual machine continually during the migration; see the example below.
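Something as simple as leaving a ping running inside the domU for the duration of the migration does the job; the gateway address below is just an example:

[root@shaolin ~]# ping 10.0.0.1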

16 Responses to “Live Migration and Synchronous Replicated Storage With Xen, DRBD and LVM”

    1. Jon Says:

      This is awesome! I’m deploying a xen environment within the next few weeks, and I’ve been looking for a replication solution. Not only did I find what I was looking for, but I get live migration too! Woohoo!


    2. Zachery Dunny Says:

      Would be really interesting to see your traffic spike if you had any stats enabled 🙂


    3. Ab Circle Pro Review Says:

      Good day, website owner. Might I make use of some of the data from this post if I provide a link back to your internet site?


    4. keith Says:

      Sure


    5. Line Says:

      Check it out


    6. Stuart Gathman Says:

      I know it’s asking a lot, but this doesn’t work on Centos5.5 migrating from a 32-bit dom0 to a 64-bit dom0 or vice versa. The save image is copied to the target, the drbd primary is switched, but the xen restore chokes on the save image from the other arch. This has been fixed at xensource, but the patch hasn’t percolated to EL5 yet. We can still quickly switch using drbd by shutting down and rebooting on the other machine, but the live migration would have been cool.


    7. keith Says:

      Stuart, I think you would still run into that bug if you were using a storage method other than DRBD, right? I don’t usually use the Xen version supplied with EL5; it’s just too old! I do try to run their kernels (especially in domU), but if you need to patch the dom0 kernel until the fix makes its way into the Red Hat kernels, I would go for it!


    8. paul Says:

      I’m confused about how this can work. The DRBD docs say that in a Primary/Secondary resource, the secondary side can be neither read from nor written to. Is there a DRBD mode change going on that’s not explicitly in your article?

      http://www.drbd.org/users-guide-emb/ch-admin.html#s-roles


      Keith Reply:

      The shift from Secondary to Primary is handled by the xen drbd block script. In the VM config we specify to use this on the disk line.

      “disk = [ ‘drbd:vm_shaolin,hda1,w’ ];”

      You’re right though, without the xen block script, the block device would remain as a secondary and writing to it would not work without manual intervention.


    9. todd bailey Says:

      I’m a bit confused about whether this technology will provide what I’m looking for.
      I have 2 machines running Mint 13 and 15; both have 8 TB of data storage.

      I want to set up some form of synchronous data replication so updates made to array 1 will be replicated to array 2 and vice versa. If/when a node goes down, I want the data to resync once the node is back online; basically a networked RAID 1 array.


      Keith Reply:

      Hey Todd,

      That’s what the DRBD portion of the stack is there to handle. It acts as an abstraction layer and effectively implements a networked RAID 1 across different hardware.


      todd bailey Reply:

      Basically what I want to set up is a network-based RAID 1 that will automatically re-sync itself, so that writes to volume 1 on machine 1 and volume 2 on machine 2 will reflect each other’s changes. I’m currently using 2 jfs partitions and have to run rsync to update.


    10. todd bailey Says:

      Am I reading the setup guide correctly? For every partition I want to mirror, do I need an identically sized partition to be used as lower-level storage?

      “Preparing your lower-level storage

      After you have installed DRBD, you must set aside a roughly identically sized storage area on both cluster nodes. This will become the lower-level device for your DRBD resource. You may use any type of block device found on your system for this purpose. Typical examples include: …”


    11. todd bailey Says:

      What I mean is: do I need 2 identically sized arrays per machine I want to mirror?

      In other words, if I have 2 machines, each with 8 TB of storage, do I need to add an additional 8 TB array per machine to implement DRBD?


    12. Miguel Says:

      Hi,

      I am trying to make this kind of setup work with Xen 4.2.2 and DRBD 8.4.3 but it seems that it is not supported anymore.

      A Virtual Machine cannot start if the disk is configured as [drbd:name_of_resource]. (I am using the XL toolstack)

      If anybody has experience about how to make this setup work with recent versions of Xen and DRBD I appreciate the information.

      Thanks a lot.

      Miguel


    13. m p patel Says:

      Sir! I have been working on this for a couple of days but I am not able to solve it… tomorrow I will try again and let you know about LVM on top of DRBD.
      I am using Ubuntu 14.04 + Xen 4.4 and have created a VM using gvnviewer. I have tried NFS and other approaches but did not succeed; then the Xen community pointed me toward DRBD for use with live migration.

      I will surely report back, because I have been trying with DRBD for many days… but drbdadm keeps telling me that no resources are allocated…

