Xen block iSCSI script with multipath support
Tags: iSCSI, linux, Multipath, redhat, Xen
When connecting a server to a storage area network (SAN) it’s important to make certain that your hosts are prepared for the occasional blip in SAN connectivity. Device mapper multipath to the rescue! Multipath is an abstraction layer between you and the raw block devices which allows for multiple I/O paths or networks (I/O multipathing) and gives you an increased level of control over what happens should a block device start reporting errors. Best of all, it’s built right into the modern Linux kernel.
I maintain a cluster of Xen servers that store VM images on an EqualLogic PS6000 Series iSCSI SAN as raw LUNs. It’s super-stable and makes it very simple to manage, snapshot and replicate storage. The only drawback is EqualLogic’s limitation of 512 connections per storage pool. This means that for every LUN (read: VM) created we consume a connection. Multiply this by the number of dom0s and you’ll quickly see that the available connections would get eaten up in no time. In order to step around this boundary I made some significant modifications to the block-iscsi Xen block script I found on an e-mail thread. Sorry, I don’t remember where it came from and there are many variations floating around.
I’ve tested this script on RHEL5 running Xen 3.1.4. Your mileage may vary, but as always, I’d love to hear your feedback!
/etc/xen/scripts/block-iscsi
#!/bin/bash
# block-iscsi - 2009 Keith Herron <keith@backdrift.org>
#
# multipath enabled block-iscsi xen block script.
#
# Note: This script depends on a block-iscsi.conf file
#       located in the same directory.  This file contains
#       an array of available iSCSI target IPs
#

dir=$(dirname "$0")
. "$dir/block-common.sh"
. "$dir/block-iscsi.conf"

# Log which mode we are in
logger -t block-iscsi "*** Beginning device $command ***"

# Fetch the iqn we specify in the domu config file
#
IQN=$(xenstore_read "$XENBUS_PATH/params")
logger -t block-iscsi "IQN: ${IQN}"

# We define portal ip in order to support new luns which don't yet have
# /var/lib/iscsi/node entrys yet, not dynamic but avoids manual discovery
#
for PORTAL in ${PORTALS[@]}; do
  logger -t block-iscsi `iscsiadm -m discovery -t st -p $PORTAL`
done

# Using the iscsi node directory we can determine the ip and port of
# our iscsi target on a lun by lun basis
#
IP=`ls /var/lib/iscsi/nodes/${IQN} | cut -d , -f 1`
PORT=`ls /var/lib/iscsi/nodes/${IQN} | cut -d , -f 2`
logger -t block-iscsi "TARGET: ${IP}:${PORT}"

# This is called by each command to determine which multipath map to use
#
function get_mpath_map {
  # Re-run multipath to ensure that maps are up to date
  #
  multipath
  sleep 2

  # Now we determine which /dev/sd* device belongs to the iqn
  #
  SCSI_DEV="/dev/`basename \`/usr/bin/readlink /dev/disk/by-path/ip-${IP}:${PORT}-iscsi-${IQN}-lun-0\``"
  logger -t block-iscsi "scsi device: ${SCSI_DEV}"

  # And using the /dev/sd* device we can determine its corresponding multipath entry
  #
  MPATH_MAP="/dev/mapper/`multipath -ll ${SCSI_DEV} | head -1 | awk '{ print $1}'`"
  logger -t block-iscsi "mpath device: ${MPATH_MAP}"
}

case $command in
  add)
    # Login to the target
    logger -t block-iscsi "logging in to ${IQN} on ${IP}:${PORT}"
    sleep 5
    #FIXME needs more advanced race condition logic
    iscsiadm -m node -T ${IQN} -p ${IP}:${PORT} --login | logger -t block-iscsi
    sleep 5
    #FIXME needs more advanced race condition logic
    get_mpath_map
    if [ -a ${MPATH_MAP} ]; then
      logger -t block-iscsi "${command}ing device: ${MPATH_MAP}"
      write_dev ${MPATH_MAP}
    fi
    ;;
  remove)
    get_mpath_map
    if [ -a ${MPATH_MAP} ]; then
      logger -t block-iscsi "flushing buffers on ${MPATH_MAP}"
      blockdev --flushbufs ${MPATH_MAP}
      logger -t block-iscsi "attempting logout of ${IQN} on ${IP}:${PORT}"
      iscsiadm -m node -T ${IQN} -p ${IP}:${PORT} --logout | logger -t block-iscsi
      sleep 10
      #FIXME needs more advanced race condition logic
    fi
    sleep 5
    #FIXME needs more advanced race condition logic
    ;;
esac
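The fixed sleep calls marked FIXME in the script could in principle be replaced by polling for the device node. The following is a minimal sketch of that idea, not something I have tested on a production dom0; the helper name wait_for_path and the 30 second default timeout are hypothetical choices made for illustration.

```bash
#!/bin/bash
# Hypothetical helper, not part of the original block-iscsi script:
# poll for a path instead of sleeping for a fixed interval.
# Usage: wait_for_path PATH [TIMEOUT_SECONDS]
wait_for_path() {
  local path=$1
  local timeout=${2:-30}   # assumed default, adjust to taste
  local waited=0
  # Check first, then sleep, so an existing path returns immediately.
  until [ -e "$path" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1             # gave up: path never appeared
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

In the add branch this could be called after the login, for example as wait_for_path "/dev/disk/by-path/ip-${IP}:${PORT}-iscsi-${IQN}-lun-0" 30, and the device would only be written to xenstore once the call succeeds.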
/etc/xen/scripts/block-iscsi.conf
# block-iscsi.conf - 2009 Keith Herron <keith@backdrift.org>
#
# Note: Config file for block-iscsi xen block script /etc/xen/scripts/block-iscsi

# Define iSCSI portal addresses here, necessary for discovery
PORTALS[0]="10.241.34.100"
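To make use of this script you’ll need to update your Xen guest config file to specify “iscsi” as the disk type in the disk line instead of “phy” or similar. The part after “iscsi:” is the IQN of the volume, which the script reads back from xenstore.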
domU configuration example
#
disk = [ 'iscsi:iqn.2001-05.com.equallogic:0-8a0906-23fe93404-c82797962054a96d-examplehost,xvda,w' ];
#
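To show which part of the disk line ends up where, here is a small sketch that splits a disk spec into its pieces. It only imitates the split for illustration and is an assumption about how the toolstack behaves, not its actual code; the function name parse_disk_spec is hypothetical.

```bash
#!/bin/bash
# Illustrative only: split a Xen disk spec of the form
#   type:params,device,mode
# to show which part reaches the block script as "params".
parse_disk_spec() {
  local spec=$1
  local source_spec dev mode
  IFS=, read -r source_spec dev mode <<< "$spec"
  # Text before the first colon selects the script, block-<type>.
  echo "script=block-${source_spec%%:*}"
  # Everything after the first colon is what the script reads from
  # $XENBUS_PATH/params, here the IQN.
  echo "params=${source_spec#*:}"
  echo "dev=${dev}"
  echo "mode=${mode}"
}

parse_disk_spec 'iscsi:iqn.2001-05.com.equallogic:0-8a0906-23fe93404-c82797962054a96d-examplehost,xvda,w'
```

Note that only the first colon is treated as the separator, so the colon inside the IQN itself is preserved.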
April 13th, 2010 at 7:32 pm
Very nice, but I’m missing how you get beyond the 512 LUN limit.
Out of curiosity, did you consider using LVM on top of the LUNs? If you’re careful on each Xen server that’s sharing the storage (so you don’t run parallel LVM commands) you should be able to simplify your config a bit, no?
April 13th, 2010 at 11:49 pm
Hi Some Guy, thanks for your comment.
It’s not so much a limit of 512 LUNs but rather a limit of 512 simultaneous connections to the iSCSI SAN. The script makes the most of that limit by keeping open only the connections that are actually needed at a given time.
As an example, let’s say I have 25 LUNs (read: virtual machine disks) and 5 Xen dom0 machines, with the VMs spread evenly across them, and I want to be able to live migrate any VM to any of the dom0s.
Without block-iscsi I would need to keep iSCSI connections open to all 25 volumes from all 5 hosts at all times (resulting in 25 * 5 = 125 connections). With block-iscsi the iSCSI connection is only open on the host where the VM is active (resulting in 25 * 1 = 25 connections).
So using block-iscsi saves us a lot of connections, and on the iSCSI SAN I’m using this is critical.
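The arithmetic is easy to rerun for other cluster sizes. A minimal sketch using the figures from the calculation above (25 volumes, 5 dom0s, a 512 connection limit); the function names are made up for this example.

```bash
#!/bin/bash
# Connections held open when every dom0 stays logged in to every volume.
always_connected() {
  local volumes=$1 hosts=$2
  echo $((volumes * hosts))
}

# Connections held open when only the dom0 running a VM is logged in
# to that VM's volume (one connection per volume).
on_demand() {
  local volumes=$1
  echo "$volumes"
}

volumes=25
hosts=5
limit=512
echo "always connected: $(always_connected "$volumes" "$hosts") of $limit"
echo "on demand: $(on_demand "$volumes") of $limit"
```

With these inputs the first line reports 125 connections and the second 25.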
With regard to LVM, yes I did consider it but I decided that it introduced too much complexity and performance issues for this deployment. LVM snapshots significantly degrade performance and the complexity involved in recovering an LVM physical volume from a SAN snapshot or replica to a host where the same physical volume may already be serving live data did not appeal to me. After all, a SAN is a logical volume manager, right?
April 14th, 2010 at 3:46 am
Ahh, I see. That makes sense. When I first read your posting I had thought you magically increased that limit.
In my case, when we were evaluating EQL we had over 1000 small VMs so we had to move away from the 1 LUN per VM approach. LVM has worked ok so far, but it would have been nice to not need it…
Out of curiosity, how many VMs are you running per machine and how many paths?
April 14th, 2010 at 11:14 am
I see what you mean. Yeah, the only way I know of to raise the limit is to use multiple EqualLogic shelves, which I think maxes out at 2048 connections per group, assuming you had 4 shelves to work with.
I’m running approx 25 VMs per machine and there is actually only one path to the disk presented over a 2×1G LACP bond. I’m using multipath solely for the “queue_if_no_path” feature which allows our VMs to survive extended connectivity problems on the storage network. It has prevented so much frustration!
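For reference, queueing is normally switched on in /etc/multipath.conf. The fragment below is a minimal sketch rather than the configuration used on this cluster; putting the setting in the defaults section is an assumption, and it can also be set per device.

```
# /etc/multipath.conf (fragment, illustrative)
defaults {
    # Queue I/O instead of failing it while no path is available
    features "1 queue_if_no_path"
}
```

With this in place, I/O to a map with no usable path is held until a path returns instead of being returned to the guest as an error.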
April 14th, 2010 at 6:55 pm
Interesting use of queue_if_no_path! If I’m understanding, you are using this to cause your VMs to block in the event that the storage is down for a long time? Can you not simply bump up the iscsi timeout instead? I appreciate your conversation. It’s so nice to see how other people do things and solve the same problems.
April 14th, 2010 at 7:20 pm
You definitely could adjust the iSCSI timeouts to achieve a similar result but I felt that it was more appropriate to implement this logic in a central place with multipath. I want to be as flexible as possible and ready for the day where I do have multiple independent paths to storage. I also used to depend heavily on multipath aliasing to identify volumes.
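For anyone who prefers the timeout route, the relevant open-iscsi setting is the session replacement timeout. A sketch of what that could look like; the value 600 is purely illustrative and not a recommendation.

```
# /etc/iscsi/iscsid.conf (fragment, illustrative)
# Seconds to wait for a failed session to recover before I/O errors
# are passed back up the stack.
node.session.timeo.replacement_timeout = 600
```

As far as I know this file only provides defaults for newly discovered nodes, so existing node records would need to be updated separately.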
Thanks for your feedback, it’s much appreciated!
April 21st, 2010 at 1:50 am
A very interesting article!! I have a couple of questions I would like you to answer:
When you create a new VPS in the cluster, what do you do in order to create the corresponding volume? Do you create it manually? Do you use the scripts (Host Scripting Tools) that EqualLogic offers to its clients?
Those scripts let users connect to the SAN (through Telnet) in order to perform tasks such as creating volumes, taking snapshots, listing users, etc.
April 21st, 2010 at 8:22 am
Delegating access to users is a great idea and definitely something I want to incorporate down the road. However, I don’t feel comfortable with the security implications of opening up the telnet interface on the equallogic directly to end users. I was planning to write a small web front end using the host scripting tools to delegate access but with the recent release of the Xen Cloud Platform and its free equallogic storage driver this may all change!
May 31st, 2010 at 5:11 am
Nice Post!!
Do you use jumbo frames for your iSCSI connections?
When I try to start my DomUs with jumbo frames enabled, it throws an error, but with the MTU set to 1500 they start correctly.
Best regards!!!
May 31st, 2010 at 7:26 am
Nice work!
I am planning to get an EqualLogic PS4000 for our virtualization setup and your block-iscsi script is a great help in setting up a proof of concept (using old surplus hardware and the IETD).
I’ll still have to extend my proof of concept to have real multipath capability but I’ll get there.
Since we will be using Debian, I adapted your script to Debian Lenny and made a couple of changes in order to get it to work. Some are rather cosmetic but others may help to make it more robust in your setup too.
I’ll try to post a unified diff in here and add some words of explanation below. Let’s see how WordPress messes up the formatting.
============
--- block-iscsi-20100527 2010-05-31 12:29:39.000000000 +0200
+++ block-iscsi 2010-05-31 13:20:33.000000000 +0200
@@ -30,8 +30,8 @@
 # Using the iscsi node directory we can determine the ip and port of
 # our iscsi target on a lun by lun basis
 #
- IP=`ls /var/lib/iscsi/nodes/${IQN} | cut -d , -f 1`
-PORT=`ls /var/lib/iscsi/nodes/${IQN} | cut -d , -f 2`
+ IP=`ls /etc/iscsi/nodes/${IQN} | cut -d , -f 1`
+PORT=`ls /etc/iscsi/nodes/${IQN} | cut -d , -f 2`
 logger -t block-iscsi "TARGET: ${IP}:${PORT}"
@@ -45,12 +45,12 @@
 # Now we determine which /dev/sd* device belongs to the iqn
 #
- SCSI_DEV="/dev/`basename \`/usr/bin/readlink /dev/disk/by-path/ip-${IP}:${PORT}-iscsi-${IQN}-lun-0\``"
+ SCSI_DEV="/dev/`basename \`readlink /dev/disk/by-path/ip-${IP}:${PORT}-iscsi-${IQN}-lun-0\``"
 logger -t block-iscsi "scsi device: ${SCSI_DEV}"
 # And using the /dev/sd* device we can determine its corresponding multipath entry
 #
- MPATH_MAP="/dev/mapper/`multipath -ll ${SCSI_DEV} | head -1 | awk '{ print $1}'`"
+ MPATH_MAP="/dev/mapper/`multipath -ll -v 1 ${SCSI_DEV}`"
 logger -t block-iscsi "mpath device: ${MPATH_MAP}"
 }
@@ -65,7 +65,7 @@
 #FIXME needs more advanced race condition logic
 get_mpath_map
- if [ -a ${MPATH_MAP} ]; then
+ if [ -e ${MPATH_MAP} ]; then
 logger -t block-iscsi "${command}ing device: ${MPATH_MAP}"
 write_dev ${MPATH_MAP}
 fi
@@ -73,7 +73,7 @@
 remove)
 get_mpath_map
- if [ -a ${MPATH_MAP} ]; then
+ if [ -e ${MPATH_MAP} ]; then
 logger -t block-iscsi "flushing buffers on ${MPATH_MAP}"
 blockdev --flushbufs ${MPATH_MAP}
 logger -t block-iscsi "attempting logout of ${IQN} on ${IP}:${PORT}"
===========
The first change is due to Debian saving the known targets in /etc instead of /var. That’s ugly but that’s the way it (currently) is.
The “readlink” binary on Debian is in /bin/, so the second change drops the hard-coded /usr/bin/ path.
Changing the multipath call was important because the output at the default verbosity (-v 2) was broken on my system: a very long UUID ran into the next field:
# multipath -ll /dev/sdc
149455400000000000000000001000000f71300000d000000dm-3 IET ,VIRTUAL-DISK
[size=2.0G][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][active]
\_ 48:0:0:0 sdc 8:32 [active][ready]
Fortunately “-v 1” is a way of getting just the multipath name, and according to the man page it is there for use by other tools. So it will probably be a stable interface.
Changing the test from “-a” to “-e” is rather cosmetic as you insist on #!/bin/bash, but it is still nicer to use “-e” as it is more portable if somebody should decide to port the script to a generic /bin/sh.
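The path difference could also be handled at run time instead of patching the script per distribution. A minimal sketch of that idea, with hypothetical helper names (find_nodes_dir, split_node_entry) and an assumed node entry name of the form “ip,port”:

```bash
#!/bin/bash
# Hypothetical helper: print the first existing directory from a list of
# candidates (/var/lib/iscsi/nodes on RHEL5, /etc/iscsi/nodes on Lenny).
find_nodes_dir() {
  local candidate
  for candidate in "$@"; do
    if [ -d "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

# Split a node entry name such as "10.241.34.100,3260" into ip:port,
# the same two fields the script extracts with cut. Any further
# comma-separated fields are ignored.
split_node_entry() {
  local entry=$1 ip port rest
  IFS=, read -r ip port rest <<< "$entry"
  echo "${ip}:${port}"
}

NODES_DIR=$(find_nodes_dir /var/lib/iscsi/nodes /etc/iscsi/nodes) \
  || echo "no iscsi node directory found" >&2
split_node_entry "10.241.34.100,3260"
```

The script would then list "${NODES_DIR}/${IQN}" instead of a hard-coded path.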
June 1st, 2010 at 12:38 pm
Hi Carlos, I too ran into issues when attempting to utilize jumbo frames from the dom0. I am currently using an MTU of 1500, and additionally my network interfaces are LACP bonded and use VLAN tags to present different networks to different VMs. iSCSI is just using its own dedicated VLAN, contending for physical bandwidth with whatever else is on the wire at the time. This has worked remarkably well to date!
I plan to experiment with more recent Xen and kernel versions as time allows. I’d be really interested to hear if you are able to get it working! Let me know!
June 1st, 2010 at 12:46 pm
Henrik, This is excellent work. Thank you for providing a diff of your modifications! I will incorporate your optimizations into the el5 script and post your contributed Debian compatible version as well. If you would like, send me your name, e-mail, etc. so that I can give you proper credit.
Thanks again!
June 2nd, 2010 at 6:49 am
Thanks for the info, Keith. I’m testing with the latest versions of Xen and the dom0 pv_ops kernel from the git repository:
Xen 4.0
Xen 4.1-unstable.
Dom0 Kernel xen/stable-2.6.32.x pv_ops git
Dom0 Kernel xen/stable-2.6.33.x pv_ops git
I have 6 NICs in each server.
For the iSCSI connections I’m using four 1 Gb/s NICs bonded in balance-alb mode, which gives an aggregate bandwidth of 4 Gb/s.
I use another NIC for server administration (SSH, live migration, etc.).
The last one is used for the DomUs’ connectivity. This NIC carries two VLANs, each attached to its own bridge: one for the internet connection and the other for the DomUs’ private LAN.
I’m interested in testing http://openvswitch.org/ for managing VLANs. XCP uses Open vSwitch.
I’ll send you feedback.