Xen block iSCSI script with multipath support



When connecting a server to a storage area network (SAN) it's important to make certain that your hosts are prepared for the occasional blip in SAN connectivity. Device mapper multipath to the rescue! Multipath is an abstraction layer between you and the raw block devices that allows for multiple I/O paths or networks (I/O multipathing) and gives you an increased level of control over what happens should a block device start reporting errors. Best of all, it's built right into the modern Linux kernel.

I maintain a cluster of Xen servers that store VM images on an EqualLogic PS6000 Series iSCSI SAN as raw LUNs. It's super-stable and makes it very simple to manage, snapshot and replicate storage. The only drawback is EqualLogic's limit of 512 connections per storage pool. This means that for every LUN (read: VM) created we consume a connection. Multiply this by the number of dom0s and you'll quickly see that the available connections would get eaten up in no time. In order to step around this boundary I made some significant modifications to the block-iscsi Xen block script I found on an e-mail thread. Sorry, I don't remember where it came from, and there are many variations floating around.

I've tested this script on RHEL5 running Xen 3.1.4. Your mileage may vary, but as always, I'd love to hear your feedback!

/etc/xen/scripts/block-iscsi

#!/bin/bash
# block-iscsi  -  2009 Keith Herron <[email protected]>
#
# multipath enabled block-iscsi xen block script.
#
# Note: This script depends on a block-iscsi.conf file
#       located in the same directory.  This file contains
#       an array of available iSCSI target IPs
#
 
dir=$(dirname "$0")
. "$dir/block-common.sh"
. "$dir/block-iscsi.conf"
 
# Log which mode we are in
logger -t block-iscsi "*** Beginning device $command ***"
 
# Fetch the iqn we specify in the domu config file
#
IQN=$(xenstore_read "$XENBUS_PATH/params")
logger -t block-iscsi "IQN: ${IQN}"
 
# We define portal IPs in order to support new LUNs which don't yet have
# /var/lib/iscsi/nodes entries; not dynamic, but it avoids manual discovery
#
for PORTAL in ${PORTALS[@]}; do
  logger -t block-iscsi `iscsiadm -m discovery -t st -p $PORTAL`
done
 
# Using the iscsi node directory we can determine the ip and port of 
# our iscsi target on a lun by lun basis
#
  IP=`ls /var/lib/iscsi/nodes/${IQN} | cut -d , -f 1`
PORT=`ls /var/lib/iscsi/nodes/${IQN} | cut -d , -f 2`
 
logger -t block-iscsi "TARGET: ${IP}:${PORT}"
 
# This is called by each command to determine which multipath map to use
#
function get_mpath_map {
   # Re-run multipath to ensure that maps are up to date
   #
   multipath
   sleep 2
 
   # Now we determine which /dev/sd* device belongs to the iqn
   #
   SCSI_DEV="/dev/`basename \`/usr/bin/readlink /dev/disk/by-path/ip-${IP}:${PORT}-iscsi-${IQN}-lun-0\``"
   logger -t block-iscsi "scsi device: ${SCSI_DEV}"
 
   # And using the /dev/sd* device we can determine its corresponding multipath entry
   #
   MPATH_MAP="/dev/mapper/`multipath -ll ${SCSI_DEV} | head -1 | awk '{ print $1}'`"
   logger -t block-iscsi "mpath device: ${MPATH_MAP}"
}
 
case $command in
   add)
      # Login to the target
      logger -t block-iscsi "logging in to ${IQN} on ${IP}:${PORT}"
      sleep 5
      #FIXME needs more advanced race condition logic
      iscsiadm -m node -T ${IQN} -p ${IP}:${PORT} --login | logger -t block-iscsi
      sleep 5
      #FIXME needs more advanced race condition logic
      get_mpath_map
 
      if [ -a ${MPATH_MAP} ]; then
         logger -t block-iscsi "${command}ing device: ${MPATH_MAP}"
         write_dev ${MPATH_MAP}
      fi
   ;;
 
   remove)
      get_mpath_map
      if [ -a ${MPATH_MAP} ]; then
         logger -t block-iscsi "flushing buffers on ${MPATH_MAP}"
         blockdev --flushbufs ${MPATH_MAP}
         logger -t block-iscsi "attempting logout of ${IQN} on ${IP}:${PORT}"
         iscsiadm -m node -T ${IQN} -p ${IP}:${PORT} --logout | logger -t block-iscsi
         sleep 10
         #FIXME needs more advanced race condition logic
      fi
      sleep 5
      #FIXME needs more advanced race condition logic
   ;;
esac
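The IP and PORT derivation above leans on open-iscsi's node directory layout: each discovered portal gets a subdirectory named "ip,port,tpgt" under /var/lib/iscsi/nodes/<iqn>/. A minimal sketch of the same cut logic, using an illustrative portal address:

```shell
# open-iscsi keeps one subdirectory per discovered portal, named
# "<ip>,<port>,<tpgt>" -- the script's `cut` calls split that name apart.
# The portal value below is illustrative.
node="192.168.1.190,3260,1"
ip=$(printf '%s' "$node" | cut -d , -f 1)    # portal IP
port=$(printf '%s' "$node" | cut -d , -f 2)  # portal TCP port
echo "TARGET: ${ip}:${port}"                 # TARGET: 192.168.1.190:3260
```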

/etc/xen/scripts/block-iscsi.conf

# block-iscsi.conf  -  2009 Keith Herron <[email protected]>
# 
# Note: Config file for block-iscsi xen block script /etc/xen/scripts/block-iscsi
 
# Define iSCSI portal addresses here, necessary for discovery
PORTALS[0]="10.241.34.100"

To make use of this script you'll need to update your Xen guest config file to specify "iscsi:" in the disk line instead of "phy:" or similar.

domU configuration example

#
disk = [ 'iscsi:iqn.2001-05.com.equallogic:0-8a0906-23fe93404-c82797962054a96d-examplehost,xvda,w' ];
#

41 Responses to “Xen block iSCSI script with multipath support”

  1. Some Guy Says:

    Very nice, but I’m missing how you get beyond the 512 LUN limit.

    Out of curiosity, did you consider using LVM on top of the LUNs? If you’re careful on each Xen server that’s sharing the storage (so you don’t run parallel LVM commands) you should be able to simplify your config a bit, no?

    [Reply]

  2. admin Says:

    Hi Some Guy, thanks for your comment.

    It's not so much a limit of 512 LUNs but rather a limit of 512 simultaneous iSCSI connections to the SAN at any one time. This approach makes the most of that limit by keeping only the minimum number of connections open at a given moment.

    As an example, let's say I have 25 LUNs (read: virtual machine disks) spread evenly across 5 Xen dom0 machines, and I want to be able to live migrate to any of the dom0s.

    Without using block-iscsi I would need to keep iSCSI connections open to all 25 volumes from all 5 hosts at all times (resulting in 25 * 5 = 125 connections). But when using block-iscsi the iSCSI connection is only open on the host where the VM is active (resulting in 25 * 1 = 25 connections).

    So using block-iscsi saves us a lot of connections, and on the iSCSI SAN I’m using this is critical.
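    The session arithmetic above can be checked with plain shell, using the numbers from the example:

```shell
# Session counts from the example above: 25 LUNs, 5 dom0s.
luns=25 hosts=5
echo "persistent sessions: $((luns * hosts))"  # every dom0 logged in to every LUN
echo "on-demand sessions:  $((luns * 1))"      # only the dom0 hosting each VM
```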

    With regard to LVM, yes I did consider it but I decided that it introduced too much complexity and performance issues for this deployment. LVM snapshots significantly degrade performance and the complexity involved in recovering an LVM physical volume from a SAN snapshot or replica to a host where the same physical volume may already be serving live data did not appeal to me. After all, a SAN is a logical volume manager, right? 🙂

    [Reply]

  3. Some Guy Says:

    Ahh, I see. That makes sense. When I first read your posting I had thought you magically increased that limit. 🙂 In my case, when we were evaluating EQL we had over 1000 small VMs so we had to move away from the 1 LUN per VM approach. LVM has worked ok so far, but it would have been nice to not need it…

    Out of curiosity, how many VMs are you running per machine and how many paths?

    [Reply]

  4. admin Says:

    I see what you mean. Yeah, the only way I know of to raise the limit is to use multiple EqualLogic shelves, which I think maxes out at 2048 connections per group, assuming you had 4 shelves to work with.

    I’m running approx 25 VMs per machine and there is actually only one path to the disk presented over a 2x1G LACP bond. I’m using multipath solely for the “queue_if_no_path” feature which allows our VMs to survive extended connectivity problems on the storage network. It has prevented so much frustration!
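    For reference, the queue_if_no_path behavior mentioned here is normally enabled in /etc/multipath.conf. A sketch only; the vendor/product strings are illustrative and must match whatever `multipath -ll` reports for your array:

```
# /etc/multipath.conf -- sketch; vendor/product strings are illustrative
# and must match what `multipath -ll` reports for the array in use.
devices {
    device {
        vendor       "EQLOGIC"
        product      "100E-00"
        features     "1 queue_if_no_path"   # queue I/O instead of failing it
        path_checker readsector0
    }
}
```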

    [Reply]

  5. Some Guy Says:

    Interesting use of queue_if_no_path! If I’m understanding, you are using this to cause your VMs to block in the event that the storage is down for a long time? Can you not simply bump up the iscsi timeout instead? I appreciate your conversation. It’s so nice to see how other people do things and solve the same problems.

    [Reply]

  6. admin Says:

    You definitely could adjust the iSCSI timeouts to achieve a similar result but I felt that it was more appropriate to implement this logic in a central place with multipath. I want to be as flexible as possible and ready for the day where I do have multiple independent paths to storage. I also used to depend heavily on multipath aliasing to identify volumes.
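    The iSCSI-timeout alternative discussed here would typically be tuned in /etc/iscsi/iscsid.conf; a sketch with an illustrative value:

```
# /etc/iscsi/iscsid.conf -- sketch of the timeout alternative discussed above.
# Seconds the iSCSI layer waits for a lost session to return before failing
# I/O up the stack; raising it approximates queue_if_no_path per session.
node.session.timeo.replacement_timeout = 120
```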

    Thanks for your feedback, it's much appreciated!

    [Reply]

  7. chewaka Says:

    A very interesting article!! I have a couple of questions I'd like you to answer:
    When you create a new VPS in the cluster, what do you do to create the corresponding partition? Do you create it manually, or do you use the scripts (Host Scripting Tools) EqualLogic offers to its clients?

    Those scripts let users connect to the SAN in order to perform tasks through Telnet (such as creating volumes, snapshots, listing users, etc.).

    [Reply]

  8. keith Says:

    Delegating access to users is a great idea and definitely something I want to incorporate down the road. However, I don’t feel comfortable with the security implications of opening up the telnet interface on the equallogic directly to end users. I was planning to write a small web front end using the host scripting tools to delegate access but with the recent release of the Xen Cloud Platform and its free equallogic storage driver this may all change!

    [Reply]

  9. Carlos Says:

    Nice Post!!

    Do you use Jumbo Frames to perform ISCSI connections?

    When I try to start my domUs with jumbo frames enabled it throws an error, but with the MTU set to 1500 they start correctly.

    Best regards!!!

    [Reply]

  10. henrik Says:

    Nice work!
    I am planning to get an EqualLogic PS4000 for our virtualization setup, and your block-iscsi script is a great help in setting up a proof of concept (using old surplus hardware and IETD).

    I’ll still have to extend my proof of concept to have real multipath capability but I’ll get there.

    Since we will be using Debian, I adapted your script to Debian Lenny and made a couple of changes in order to get it to work. Some are rather cosmetic but others may help to make it more robust in your setup too.
    I'll try to post a unified diff in here and add some words of explanation below. Let's see how WordPress messes up the formatting 🙂

    ============
    --- block-iscsi-20100527 2010-05-31 12:29:39.000000000 +0200
    +++ block-iscsi 2010-05-31 13:20:33.000000000 +0200
    @@ -30,8 +30,8 @@
     # Using the iscsi node directory we can determine the ip and port of
     # our iscsi target on a lun by lun basis
     #
    -  IP=`ls /var/lib/iscsi/nodes/${IQN} | cut -d , -f 1`
    -PORT=`ls /var/lib/iscsi/nodes/${IQN} | cut -d , -f 2`
    +  IP=`ls /etc/iscsi/nodes/${IQN} | cut -d , -f 1`
    +PORT=`ls /etc/iscsi/nodes/${IQN} | cut -d , -f 2`
     
     logger -t block-iscsi "TARGET: ${IP}:${PORT}"
     
    @@ -45,12 +45,12 @@
     
        # Now we determine which /dev/sd* device belongs to the iqn
        #
    -   SCSI_DEV="/dev/`basename \`/usr/bin/readlink /dev/disk/by-path/ip-${IP}:${PORT}-iscsi-${IQN}-lun-0\``"
    +   SCSI_DEV="/dev/`basename \`readlink /dev/disk/by-path/ip-${IP}:${PORT}-iscsi-${IQN}-lun-0\``"
        logger -t block-iscsi "scsi device: ${SCSI_DEV}"
     
        # And using the /dev/sd* device we can determine its corresponding multipath entry
        #
    -   MPATH_MAP="/dev/mapper/`multipath -ll ${SCSI_DEV} | head -1 | awk '{ print $1}'`"
    +   MPATH_MAP="/dev/mapper/`multipath -ll -v 1 ${SCSI_DEV}`"
        logger -t block-iscsi "mpath device: ${MPATH_MAP}"
     }
     
    @@ -65,7 +65,7 @@
           #FIXME needs more advanced race condition logic
           get_mpath_map
     
    -      if [ -a ${MPATH_MAP} ]; then
    +      if [ -e ${MPATH_MAP} ]; then
              logger -t block-iscsi "${command}ing device: ${MPATH_MAP}"
              write_dev ${MPATH_MAP}
           fi
    @@ -73,7 +73,7 @@
     
        remove)
           get_mpath_map
    -      if [ -a ${MPATH_MAP} ]; then
    +      if [ -e ${MPATH_MAP} ]; then
              logger -t block-iscsi "flushing buffers on ${MPATH_MAP}"
              blockdev --flushbufs ${MPATH_MAP}
              logger -t block-iscsi "attempting logout of ${IQN} on ${IP}:${PORT}"
    ===========

    The first change is due to Debian saving the known targets in /etc instead of /var. That's ugly, but that's the way it (currently) is.

    The “readlink” binary on Debian is in /bin/.

    Changing the multipath call was important, as the output with the default verbosity (-v 2) was broken on my system due to a very long uuid running into the next field:

    # multipath -ll /dev/sdc
    149455400000000000000000001000000f71300000d000000dm-3 IET ,VIRTUAL-DISK
    [size=2.0G][features=0][hwhandler=0]
    \_ round-robin 0 [prio=1][active]
    \_ 48:0:0:0 sdc 8:32 [active][ready]

    Fortunately “-v 1” is a way of getting just the multipath name, and according to the man page it is there for the use by other tools. So it will probably be a stable interface.
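    The field run-on described above is easy to reproduce against the captured output: the first awk field returns the uuid fused with the dm name rather than a usable map name.

```shell
# First line of the `multipath -ll` output captured above: the long uuid
# runs into the "dm-3" column, so the original `awk '{ print $1 }'`
# extraction yields the fused token, not a valid /dev/mapper name.
line='149455400000000000000000001000000f71300000d000000dm-3 IET ,VIRTUAL-DISK'
printf '%s\n' "$line" | awk '{ print $1 }'
```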

    Changing the test from "-a" to "-e" is rather cosmetic as long as you insist on #!/bin/bash, but it is still nicer to use "-e" as it is more portable should somebody decide to port the script to a generic /bin/sh.

    [Reply]

  11. keith Says:

    Hi Carlos, I too ran into issues when attempting to utilize jumbo frames from the dom0. I am currently using an MTU of 1500 and additionally my network interfaces are lacp bonded and use vlan tags to present different networks to different VMs. iSCSI is just using its own dedicated vlan, contending for physical bandwidth with whatever else is on the wire at the time. This has worked remarkably well to date!

    I plan to experiment with more recent Xen and kernel versions as time allows. I’d be really interested to hear if you are able to get it working! let me know!

    [Reply]

  12. keith Says:

    Henrik, This is excellent work. Thank you for providing a diff of your modifications! I will incorporate your optimizations into the el5 script and post your contributed Debian compatible version as well. If you would like, send me your name, e-mail, etc. so that I can give you proper credit.

    Thanks again!

    [Reply]

  13. Carlos Says:

    Thanks for the info Keith :). I'm testing with the latest versions of Xen and the dom0 pv_ops kernel from the git repository.

    Xen 4.0
    Xen 4.1-unstable.

    Dom0 Kernel xen/stable-2.6.32.x pv_ops git
    Dom0 Kernel xen/stable-2.6.33.x pv_ops git

    I have 6 NIC cards in the servers.

    In order to perform iSCSI connections, I'm using 4 NICs bonded in balance-alb mode, so I've got an available bandwidth of 4 Gb/s.

    I use another NIC for server administration (SSH, live migration, etc.).

    The last one is used for the domUs' connectivity. This NIC carries two VLANs tied to two bridges, one for internet connectivity and the other for a private domU LAN.

    I'm interested in testing http://openvswitch.org/ for managing VLANs. XCP uses openvswitch.

    Sending you feedback 🙂

    [Reply]

  14. Kenneth Kalmer Says:

    This really looks like an awesome script, I just cannot get it working with Xen 4. It doesn’t even appear to execute, no information is logged, and “xm create” terminates immediately with a “Error: Disk isn’t accessible” error.

    Any pointers? Carlos did you have to change anything for Xen 4 ?

    Best

    [Reply]

  15. Carlos Says:

    I think Xen 4 isn’t the problem.

    I had to make changes to the script paths. For example, on Debian/Ubuntu the nodes path is "/etc/iscsi/nodes".

    You can add "-x" to the #!/bin/bash line at the top of the script to obtain more debug info.

    This is my script; I don't use multipath.

    http://pastebin.com/xpn1APN9

    [Reply]

  16. henrik Says:

    @Kenneth: Are you sure your /etc/xen/scripts/block-iscsi file is executable?

    [Reply]

  17. henrik Says:

    Hi keith

    I finally got around to doing some more testing with my PS4000, and to get it rock-solid I tried some VM ping-pong.

    Basically what I do is to have one virtual machine migrated back and forth all night while under heavy load (both CPU and disc IO).

    Xenhost1 runs:

    while true ; do sleep 10 ; if xm list | grep foo | grep 'r\-\-\-\-\-' ; then echo found foo ; sleep 30 ; echo migrating foo ; xm migrate -l foo.example.com 192.168.0.30 ; else echo no foo ; fi ; done

    Xenhost2 runs:

    while true ; do sleep 10 ; if xm list | grep foo | grep 'r\-\-\-\-\-' ; then echo found foo ; sleep 30 ; echo migrating foo ; xm migrate -l foo.example.com 192.168.0.40 ; else echo no foo ; fi ; done

    all the while vm1 (which has 2 VCPUs) runs two instances of cpuburn (burnP6) and badblocks on a 1 GB test file to generate I/O load:

    while true ; do badblocks -svw test1gb ; done

    I highly recommend something like that to anybody who wants to run this kind of setup.

    The problem I have is that apparently sometimes io requests get stuck in the queue while the virtual machine is long gone over to the other xen host.

    dmesg says something like this:

    [96592.604813] scsi 331:0:0:0: rejecting I/O to dead device
    [96592.604901] scsi 335:0:0:0: rejecting I/O to dead device
    [96592.605227] device-mapper: table: 254:2: multipath: error getting device
    [96592.605327] device-mapper: ioctl: error adding target to table
    [96593.655816] scsi 331:0:0:0: rejecting I/O to dead device
    [96593.655887] scsi 335:0:0:0: rejecting I/O to dead device
    [96593.659854] device-mapper: table: 254:2: multipath: error getting device
    [96593.659854] device-mapper: ioctl: error adding target to table
    [96594.711813] scsi 331:0:0:0: rejecting I/O to dead device
    [96594.711873] scsi 335:0:0:0: rejecting I/O to dead device
    [96594.712098] device-mapper: table: 254:2: multipath: error getting device
    [96594.712131] device-mapper: ioctl: error adding target to table
    [96595.753012] scsi 331:0:0:0: rejecting I/O to dead device
    [96595.753144] scsi 335:0:0:0: rejecting I/O to dead device
    [96595.753321] device-mapper: table: 254:2: multipath: error getting device
    [96595.753353] device-mapper: ioctl: error adding target to table
    [96596.795550] scsi 331:0:0:0: rejecting I/O to dead device
    [96596.795609] scsi 335:0:0:0: rejecting I/O to dead device
    [96596.795826] device-mapper: table: 254:2: multipath: error getting device
    [96596.795858] device-mapper: ioctl: error adding target to table

    The output of multipath -ll will look like this:

    # multipath -ll
    36090a068302e8e60eed3b40100002035dm-3 ,
    [size=525M][features=1 queue_if_no_path][hwhandler=0]
    \_ round-robin 0 [prio=0][active]
    \_ #:#:#:# - #:# [active][faulty]
    \_ #:#:#:# - #:# [active][faulty]
    36090a068302ecec3eed3e401000040d4dm-4 ,
    [size=4.0G][features=1 queue_if_no_path][hwhandler=0]
    \_ round-robin 0 [prio=0][active]
    \_ #:#:#:# - #:# [active][faulty]
    \_ #:#:#:# - #:# [active][faulty]

    Even after doing “multipath -F” the scsi subsystem seems to try to finish some io:

    [96855.050022] scsi 331:0:0:0: rejecting I/O to dead device
    [96855.050077] scsi 335:0:0:0: rejecting I/O to dead device
    [96856.056534] scsi 331:0:0:0: rejecting I/O to dead device
    [96856.056604] scsi 335:0:0:0: rejecting I/O to dead device
    [96857.060426] scsi 331:0:0:0: rejecting I/O to dead device
    [96857.060475] scsi 335:0:0:0: rejecting I/O to dead device
    [96858.065496] scsi 331:0:0:0: rejecting I/O to dead device
    [96858.065556] scsi 335:0:0:0: rejecting I/O to dead device
    [96859.070750] scsi 331:0:0:0: rejecting I/O to dead device
    [96859.070808] scsi 335:0:0:0: rejecting I/O to dead device

    syslog seems to indicate that multipathd is still trying to work on a dead device:

    Jan 26 17:10:20 janus03 multipathd: sdb: checker msg is "readsector0 checker reports path is down"
    Jan 26 17:10:20 janus03 multipathd: sde: checker msg is "readsector0 checker reports path is down"
    Jan 26 17:10:20 janus03 multipathd: 36090a068302ecec3eed3e401000040d4: failed in domap for addition of new path sde
    Jan 26 17:10:21 janus03 kernel: [97460.880818] scsi 331:0:0:0: rejecting I/O to dead device
    Jan 26 17:10:21 janus03 kernel: [97460.880878] scsi 335:0:0:0: rejecting I/O to dead device
    Jan 26 17:10:21 janus03 multipathd: sdb: checker msg is "readsector0 checker reports path is down"
    Jan 26 17:10:21 janus03 multipathd: sde: checker msg is "readsector0 checker reports path is down"
    Jan 26 17:10:21 janus03 multipathd: 36090a068302ecec3eed3e401000040d4: failed in domap for addition of new path sde
    Jan 26 17:10:21 janus03 kernel: [97461.890697] scsi 331:0:0:0: rejecting I/O to dead device
    Jan 26 17:10:21 janus03 kernel: [97461.890767] scsi 335:0:0:0: rejecting I/O to dead device

    The vm seems to go on doing its job while constantly being migrated, but that condition in the xen hosts system remains. It doesn’t seem to affect the VM, but i wonder if it will eventually cause problems.

    Did you see anything like that in your setup?

    [Reply]

    Keith Reply:

    Hi Henrik,

    You know, I have seen that sort of error, but it was never the result of a live migration for me. This is what would happen when an iSCSI session was logged out or the network connection severed while still in use, or something of that nature. If iSCSI came back I could just run 'multipath' and things would come back to life. You might have run into a race condition here where the iSCSI session is being torn down while it is still in use.

    I wonder if these errors, for example "scsi 331:0:0:0: rejecting I/O to dead device", could be cleaned up by forcing a removal of the failing SCSI devices. Does running something like echo "scsi remove-single-device 331 0 0 0" > /proc/scsi/scsi help?

    [Reply]

  18. henrik Says:

    Hi Keith,

    I don’t think it is a result of migration either. At least not a direct result. It seems to be a timing issue that has to do with the removal of iscsi devices by xen (calling block-iscsi).

    There is no synchronization when removing a device. When adding a device the block script has to call

    write_dev /dev/somedevice

    but there is nothing like this for removing devices.

    I’ll try to provoke that error situation again and then I’ll try your suggestion for removal of defunct scsi devices. However while writing this I realized that the pending i/o might very well be caused by multipath itself. Or rather by multipathd’s path checks.

    The removal part of block-iscsi does try to flush the multipath block device before iscsiadm --logout, but there is a chance of multipathd initiating a path check right between those commands.

    In my setup path check is done by path_checker readsector0 !

    In fact the multipath device is not taken down by the script at all.
    That task is left to the multipathd and might take place some time later.
    If I run

    xm migrate -l foo.example.com 192.168.0.30 ; multipath -ll

    I still see both devices that belong to this vm

    36090a068302e8e60eed3b40100002035dm-3 EQLOGIC ,100E-00
    [size=525M][features=1 queue_if_no_path][hwhandler=0]
    \_ round-robin 0 [prio=2][active]
    \_ 51:0:0:0 sdd 8:48 [active][ready]
    \_ 50:0:0:0 sdc 8:32 [active][ready]
    36090a068302ecec3eed3e401000040d4dm-4 EQLOGIC ,100E-00
    [size=4.0G][features=1 queue_if_no_path][hwhandler=0]
    \_ round-robin 0 [prio=2][active]
    \_ 52:0:0:0 sde 8:64 [active][ready]
    \_ 53:0:0:0 sdf 8:80 [active][ready]

    So it might be that a simple multipath -f ${MPATH_MAP} will avoid the error.
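    The ordering being suggested can be sketched as a dry run (the run() wrapper only echoes, so the sequence is visible without touching real devices; the map name and target values are illustrative):

```shell
# Sketch of the remove) path with `multipath -f` added before logout, as
# suggested above. run() only echoes its arguments (dry run); drop it to
# execute for real. All values below are illustrative.
run() { echo "$@"; }
MPATH_MAP=/dev/mapper/36090a068302e8e60eed3b40100002035
IQN=iqn.2001-05.com.equallogic:example
IP=10.241.34.100 PORT=3260

run blockdev --flushbufs "$MPATH_MAP"
run multipath -f "$MPATH_MAP"        # flush the map before the session goes
run iscsiadm -m node -T "$IQN" -p "$IP:$PORT" --logout
```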

    [Reply]

  19. henrik Says:

    In regard to removing dead devices… (getting some of those happens easily if you log in and out of many iSCSI targets at the same time while multipath is trying to get it all up and running)


    [ 6554.704173] scsi 24:0:0:0: rejecting I/O to dead device
    [ 6554.704229] scsi 45:0:0:0: rejecting I/O to dead device
    [ 6555.708163] scsi 24:0:0:0: rejecting I/O to dead device
    [ 6555.708220] scsi 45:0:0:0: rejecting I/O to dead device
    janus04:~# echo "scsi remove-single-device 24 0 0 0" > /proc/scsi/scsi
    -bash: echo: write error: No such device or address

    No luck there, and it seems that parts of the kernel don't know the device anymore:

    janus04:~# cat /proc/scsi/scsi
    Attached devices:
    Host: scsi0 Channel: 00 Id: 00 Lun: 00
    Vendor: ATA Model: WDC WD5002ABYS-1 Rev: 02.0
    Type: Direct-Access ANSI SCSI revision: 05
    Host: scsi1 Channel: 00 Id: 00 Lun: 00
    Vendor: ATA Model: WDC WD5002ABYS-1 Rev: 02.0
    Type: Direct-Access ANSI SCSI revision: 05
    Host: scsi3 Channel: 00 Id: 00 Lun: 00
    Vendor: TEAC Model: DVD-ROM DV-28SW Rev: R.2A
    Type: CD-ROM ANSI SCSI revision: 05
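    Since the old /proc/scsi/scsi interface refuses the write above, kernels with sysfs expose a per-device delete node instead; a sketch (the helper name and the H:C:T:L tuple are illustrative):

```shell
# Hypothetical helper: build the sysfs delete node for a SCSI H:C:T:L tuple.
# Writing "1" to that node asks the kernel to drop the device, e.g.:
#   echo 1 > "$(scsi_delete_path 24 0 0 0)"
scsi_delete_path() {
  local host=$1 channel=$2 target=$3 lun=$4
  echo "/sys/class/scsi_device/${host}:${channel}:${target}:${lun}/device/delete"
}
scsi_delete_path 24 0 0 0
```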

    [Reply]

  20. Eduardo Bragatto Says:

    Hi,

    I’ve been trying to use your script on CentOS 5.5 with Xen 3.0.3:

    # cat /etc/redhat-release
    CentOS release 5.5 (Final)

    # rpm -qa | grep xen
    xen-3.0.3-105.el5_5.5
    kernel-xen-2.6.18-194.32.1.el5
    kernel-xen-devel-2.6.18-194.32.1.el5
    xen-libs-3.0.3-105.el5_5.5

    I’ve also tried using the packages from gitco.de to update the hypervisor to 3.4.3:

    # cat /etc/redhat-release
    CentOS release 5.5 (Final)

    # rpm -qa |grep xen
    kernel-xen-2.6.18-194.32.1.el5
    xen-libs-3.4.3-4.el5
    kernel-xen-devel-2.6.18-194.32.1.el5
    xen-3.4.3-4.el5

    In both cases, the “xm create” command ends immediately with the following error:

    Error: Disk isn’t accessible

    As reported by Kenneth.

    Using “strace”, I noticed “xm create” never touches the script /etc/xen/scripts/block-iscsi (yes, it is executable and I’ve added a simple line “echo $(date -R) >> /tmp/iscsi.log” in the beginning that never got executed). “xm create” never tries to open or execute the script.

    According to this URL:

    http://lists.xensource.com/archives/html/xen-devel/2010-02/msg01209.html

    Looks like we have to patch xenstore.c in order to be able to use your script. Can you please elaborate on how to properly install block-iscsi? Simply placing the script under /etc/xen/scripts is not sufficient.

    [Reply]

  21. witek Says:

    It would be nice if this script were improved in quality so it could be included in the standard Xen release: better error handling (e.g. a missing iscsiadm), better race condition handling. This script is much better than manually importing iSCSI devices, and it helps with migration.

    @Eduardo: Your disk = [ 'iscsi:/dev/disk/by-path/ip-10.0.0.1:3260-iscsi-iqn.2001-01.ro.test:test3-lun-1,hda,w' ] is incorrect. Look at the example in the article.

    [Reply]

  22. Keith Says:

    @witek I agree, that would be great! I created a github repo for my xen scripts some time ago and have been meaning to update this page to refer to it. This way people could contribute improvements to the script like you are saying.

    https://github.com/keithherron/xen/tree/master/scripts

    [Reply]

  23. witek Says:

    henrik & Carlos, thanks for the Debian Squeeze and Xen 4.x compatible block-iscsi! Really helpful.

    [Reply]

  24. tommics Says:

    Hi,

    We experience the same problems as Kenneth and Eduardo had. We are running a Debian Squeeze Xen installation. I added the missing hotplugpath.sh to /etc/xen/scripts, but it seems Xen never touches the block-iscsi script.

    I added a touch /tmp/XXXX at the top of it and the file never gets created. The script is executable and my domU config uses the exact same syntax as described in the article. We are running an older version of the block-iscsi script without a problem on Lenny (Xen 3.1).

    anyone found a solution for this issue?

    Regards
    Thomas

    [Reply]

  25. tommics Says:

    I found out that this only happens in VM configs using pygrub as the bootloader. When I add kernel and ramdisk to the config itself there is no problem. Seems like pygrub is not capable of handling non-phy/file backends 🙁

    [Reply]

  26. tommics Says:

    The next thing I tried is the safer route of pygrub -> pv-grub. Because pv-grub runs in the domU rather than acting in dom0, the block-iscsi script is triggered to log in to the iSCSI targets as expected. So everything should be working now.

    You will not find pv-grub in any Debian distribution. There's a wishlist entry filed, but it isn't shipped in any xen-utils package, so you have to compile it yourself using make stubdom from the Xen sources.

    [Reply]

  27. Baylink Says:

    > Without using block-iscsi I would need to keep iSCSI connections open to all 25 volumes from all 5 hosts at all times (resulting in 25 * 5 = 125 connections). But when using block-iscsi the iSCSI connection is only open on the host where the VM is active (resulting in 25 * 1 = 25 connections).

    I *think* that by that you mean “if I didn’t do it this way, I’d have to map each of my target volumes into *all* of my possible domUs, with the resultant combinatorial explosion of connections, and this lets me shove the iSCSI connection inside the domU config file, so it’s only connected when the domU boots”.

    Is that correct?

    'Cause that is the problem I'd foreseen; so this would be my solution. 🙂

    [Reply]

    Keith Reply:

    @Baylink, I think so… The SAN in use can only handle so many simultaneous iSCSI sessions (512), and all of my dom0s need access to every iSCSI volume in order to support live migration. Letting the Xen block scripts set up and tear down the iSCSI session on demand was the most efficient way to use those limited resources without making significant sacrifices to the architecture.

    [Reply]

  28. Baylink Says:

    > Letting the Xen block scripts set up and tear down the iSCSI session on demand

    Yeah, that sounds like what I was chasing after. I read TBOX over the weekend, but it merely hinted at this kind of depth. I’ll snag your script and look at it in more detail tomorrow. Thanks a bunch for posting this. You may have saved my butt, by providing as an add-on a facility I would have assumed would be baked-in. 🙂

    [Reply]

  29. Baylink Says:

    Well, maybe not so much.

    Any ideas what might have caused this, oh person-who-knows-xen-well-enough-to-write-block-scripts? 🙂

    Sep 27 23:39:41 banner kernel: [288438.536745] blkback: ring-ref 8, event-channel 70, protocol 1 (x86_32-abi)
    Sep 27 23:39:41 banner kernel: [288438.539718] blkback: ring-ref 8, event-channel 70, protocol 1 (x86_32-abi)
    Sep 27 23:39:41 banner logger: /etc/xen/scripts/block: add XENBUS_PATH=backend/vbd/0/268441856
    Sep 27 23:39:41 banner logger: /etc/xen/scripts/block-iscsi: add XENBUS_PATH=backend/vbd/0/268441856
    Sep 27 23:39:41 banner block-iscsi: *** Beginning device add ***
    Sep 27 23:39:41 banner block-iscsi: IQN: iqn.2006-01.com.openfiler:zing_atlas.lv-vmroot0
    Sep 27 23:39:41 banner block-iscsi: 192.168.1.190:3260,1 iqn.2006-01.com.openfiler:zing_atlas.lv-vmroot0 192.168.1.190:3260,1 iqn.2006-01.com.openfiler:zing_atlas.lv-vmroot1
    Sep 27 23:39:41 banner block-iscsi: TARGET: :
    Sep 27 23:39:41 banner block-iscsi: logging in to iqn.2006-01.com.openfiler:zing_atlas.lv-vmroot0 on :
    Sep 27 23:39:51 banner logger: /etc/xen/scripts/block: Writing backend/vbd/0/268441856/hotplug-error /etc/xen/scripts/block failed; error detected. backend/vbd/0/268441856/hotplug-status error to xenstore.
    Sep 27 23:39:51 banner logger: /etc/xen/scripts/block: /etc/xen/scripts/block failed; error detected.
    Sep 27 23:39:51 banner logger: /etc/xen/scripts/block: remove XENBUS_PATH=backend/vbd/0/268441856
    Sep 27 23:39:51 banner logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vbd/0/268441856

    I’d just copied a local diskimage file to an openfiler iSCSI share the same size; the entire subnet has access permissions on it — and the dom0 in particular, cause I mounted it by hand to (successfully) dd it over to the filer.

    [Reply]

  30. Baylink Says:

    I guess I should have included this:

    name="centos57base"
    vcpus=1
    memory=1024
    # disk=['tap:aio:/appl/xen/images/centos57base/centos57base.img,xvda,w']
    disk=['iscsi:iqn.2006-01.com.openfiler:zing_atlas.lv-vmroot0,xvda,w']
    vif=['ip=67.78.195.9,bridge=br0,mac=00:16:3E:FF:00:4E']
    bootloader = '/usr/bin/pygrub'
    extra="(hd0,0)/boot/grub/menu.lst"

    The tap image is the one I dd’d over; it works when local. The IQN is what comes out of the discovery, copied and pasted.

    The image is CentOS5.7, installed via HTTP from … OSU, I think. Some mirror.


    Keith Reply:

    @Baylink, I notice that your $IP and $PORT variables don’t appear to be working.

    This snippet from the script:
    # Login to the target
    logger -t block-iscsi "logging in to ${IQN} on ${IP}:${PORT}"

    Is yielding the following output:
    Sep 27 23:39:41 banner block-iscsi: logging in to iqn.2006-01.com.openfiler:zing_atlas.lv-vmroot0 on :

    As you can see, ${IP}:${PORT} is coming out blank (is /etc/xen/scripts/block-iscsi.conf in place?). The section below is where the IP and port are assigned; I would try running these commands by hand to ensure that they produce the expected output:

    # We define portal ip in order to support new luns which don't yet have
    # /var/lib/iscsi/node entries, not dynamic but avoids manual discovery
    #
    for PORTAL in ${PORTALS[@]}; do
      logger -t block-iscsi `iscsiadm -m discovery -t st -p $PORTAL`
    done

    # Using the iscsi node directory we can determine the ip and port of
    # our iscsi target on a lun by lun basis
    #
    IP=`ls /var/lib/iscsi/nodes/${IQN} | cut -d , -f 1`
    PORT=`ls /var/lib/iscsi/nodes/${IQN} | cut -d , -f 2`

    logger -t block-iscsi "TARGET: ${IP}:${PORT}"
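To illustrate the parsing step in isolation (a standalone sketch, not part of the script — the sample entry name stands in for what `ls` prints for a node directory, based on the layout discussed above):

```shell
# The node directory for an IQN contains an entry named "IP,PORT,TPGT",
# e.g. "192.168.1.190,3260,1"; cut splits out the first two fields.
entry="192.168.1.190,3260,1"   # stand-in for: ls ${NODES}/nodes/${IQN}
IP=$(echo "$entry" | cut -d , -f 1)
PORT=$(echo "$entry" | cut -d , -f 2)
echo "TARGET: ${IP}:${PORT}"
```

If either variable comes out empty here, the node directory listing is the thing to check.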


  31. Baylink Says:

    /etc/xen/scripts/block-iscsi.conf:

    # block-iscsi.conf - 2009 Keith Herron
    #
    # Note: Config file for block-iscsi xen block script /etc/xen/scripts/block-iscsi

    # Define iSCSI portal addresses here, necessary for discovery
    PORTALS[0]="192.168.1.190"

    It isn’t executable, though, and neither was the script. Since it’s picking up the portal, I assume that’s not the problem.

    run manually:

    banner:/etc/xen/scripts # iscsiadm -m discovery -t st -p 192.168.1.190
    192.168.1.190:3260,1 iqn.2006-01.com.openfiler:zing_atlas.lv-vmroot0
    192.168.1.190:3260,1 iqn.2006-01.com.openfiler:zing_atlas.lv-vmroot1

    and as I said: I can log in to the filer manually, and mount one of those — I did so to copy the local image over to the filer with dd, logged out again afterwards.

    So I know the connection proper is working.

    My open-iscsi:

    open-iscsi-2.0.870-37.38.1.i586

    which is the distro package on SuSE 11.4.


    Keith Reply:

    @Baylink, what do you see when you run the commands from the second part manually?

    IP=`ls /var/lib/iscsi/nodes/${IQN} | cut -d , -f 1`
    PORT=`ls /var/lib/iscsi/nodes/${IQN} | cut -d , -f 2`


  32. Baylink Says:

    I see that your script assumes directories not in evidence for SuSE’s implementation. 🙂

    /var/lib/iscsi doesn’t exist.

    It’s /etc/iscsi/nodes in this build.

    banner:/etc/xen/scripts # cd /etc/iscsi/nodes/
    banner:/etc/iscsi/nodes # l
    total 16
    drw------- 4 root root 4096 Sep 28 15:09 ./
    drwxr-xr-x 5 root root 4096 Sep 28 15:09 ../
    drw------- 3 root root 4096 Sep 28 15:09 iqn.2006-01.com.openfiler:zing_atlas.lv-vmroot0/
    drw------- 3 root root 4096 Sep 28 15:09 iqn.2006-01.com.openfiler:zing_atlas.lv-vmroot1/

    Let me go change that path and try again.
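One way to keep the script portable across both layouts, rather than hard-coding either path, is to probe for whichever nodes directory exists. This is a sketch of my own, not code from the original script; the helper name is made up, and the demo runs against a throwaway tree so the logic is checkable anywhere:

```shell
# Hypothetical helper: print the first existing iSCSI nodes directory.
# RHEL uses /var/lib/iscsi/nodes; this SUSE build uses /etc/iscsi/nodes.
find_nodes_dir() {
  for d in "$@"; do
    if [ -d "$d" ]; then
      echo "$d"
      return 0
    fi
  done
  return 1
}

# Demo against a temporary tree mimicking the SUSE layout:
demo=$(mktemp -d)
mkdir -p "$demo/etc/iscsi/nodes"
NODES_DIR=$(find_nodes_dir "$demo/var/lib/iscsi/nodes" "$demo/etc/iscsi/nodes")
echo "$NODES_DIR"
```

On a real host you would call it with the two system paths and fail loudly if neither exists.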


  33. Baylink Says:

    Ok. I abstracted out ${NODES} at the top of the script (patch to follow), and now I get this instead:

    Sep 28 15:18:26 banner kernel: [344763.765160] blkback: ring-ref 8, event-channel 70, protocol 1 (x86_32-abi)
    Sep 28 15:18:26 banner kernel: [344763.767690] blkback: ring-ref 8, event-channel 70, protocol 1 (x86_32-abi)
    Sep 28 15:18:26 banner logger: /etc/xen/scripts/block: add XENBUS_PATH=backend/vbd/0/268441856
    Sep 28 15:18:26 banner logger: /etc/xen/scripts/block-iscsi: add XENBUS_PATH=backend/vbd/0/268441856
    Sep 28 15:18:26 banner block-iscsi: *** Beginning device add ***
    Sep 28 15:18:26 banner block-iscsi: IQN: iqn.2006-01.com.openfiler:zing_atlas.lv-vmroot0
    Sep 28 15:18:26 banner block-iscsi: 192.168.1.190:3260,1 iqn.2006-01.com.openfiler:zing_atlas.lv-vmroot0 192.168.1.190:3260,1 iqn.2006-01.com.openfiler:zing_atlas.lv-vmroot1
    Sep 28 15:18:26 banner block-iscsi: TARGET: 192.168.1.190:3260
    Sep 28 15:18:26 banner block-iscsi: logging in to iqn.2006-01.com.openfiler:zing_atlas.lv-vmroot0 on 192.168.1.190:3260
    Sep 28 15:18:31 banner kernel: [344769.234912] scsi6 : iSCSI Initiator over TCP/IP
    Sep 28 15:18:32 banner kernel: [344769.491819] scsi 6:0:0:0: Direct-Access Openfile Virtual disk 0 PQ: 0 ANSI: 4
    Sep 28 15:18:32 banner block-iscsi: Logging in to [iface: default, target: iqn.2006-01.com.openfiler:zing_atlas.lv-vmroot0, portal: 192.168.1.190,3260]
    Sep 28 15:18:32 banner block-iscsi: Login to [iface: default, target: iqn.2006-01.com.openfiler:zing_atlas.lv-vmroot0, portal: 192.168.1.190,3260]: successful
    Sep 28 15:18:32 banner kernel: [344769.492445] sd 6:0:0:0: Attached scsi generic sg2 type 0
    Sep 28 15:18:32 banner kernel: [344769.494970] sd 6:0:0:0: [sda] 16777216 512-byte logical blocks: (8.58 GB/8.00 GiB)
    Sep 28 15:18:32 banner kernel: [344769.495315] sd 6:0:0:0: [sda] Write Protect is off
    Sep 28 15:18:32 banner kernel: [344769.495322] sd 6:0:0:0: [sda] Mode Sense: 77 00 00 08
    Sep 28 15:18:32 banner kernel: [344769.497558] sd 6:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
    Sep 28 15:18:32 banner kernel: [344769.520537] sda: sda1 sda2 sda3
    Sep 28 15:18:32 banner kernel: [344769.524294] sd 6:0:0:0: [sda] Attached SCSI disk
    Sep 28 15:18:32 banner iscsid: connection2:0 is operational now
    Sep 28 15:18:37 banner logger: /etc/xen/scripts/block: Writing backend/vbd/0/268441856/hotplug-error /etc/xen/scripts/block failed; error detected. backend/vbd/0/268441856/hotplug-status error to xenstore.
    Sep 28 15:18:37 banner logger: /etc/xen/scripts/block: /etc/xen/scripts/block failed; error detected.
    Sep 28 15:18:37 banner logger: /etc/xen/scripts/block: remove XENBUS_PATH=backend/vbd/0/268441856
    Sep 28 15:18:37 banner logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vbd/0/268441856

    The modified script:

    banner:/appl/xen/cent/57 # cat /etc/xen/scripts/block-iscsi
    #!/bin/bash
    # block-iscsi - 2009 Keith Herron
    #
    # multipath enabled block-iscsi xen block script.
    #
    # Note: This script depends on a block-iscsi.conf file
    #       located in the same directory.  This file contains
    #       an array of available iSCSI target IPs
    #

    dir=$(dirname "$0")
    . "$dir/block-common.sh"
    . "$dir/block-iscsi.conf"

    # Where do your node files live?
    #NODES=/var/lib/iscsi
    NODES=/etc/iscsi

    # Log which mode we are in
    logger -t block-iscsi "*** Beginning device $command ***"

    # Fetch the iqn we specify in the domu config file
    #
    IQN=$(xenstore_read "$XENBUS_PATH/params")
    logger -t block-iscsi "IQN: ${IQN}"

    # We define portal ip in order to support new luns which don't yet have
    # ${NODES}/node entries, not dynamic but avoids manual discovery
    #
    for PORTAL in ${PORTALS[@]}; do
      logger -t block-iscsi `iscsiadm -m discovery -t st -p $PORTAL`
    done

    # Using the iscsi node directory we can determine the ip and port of
    # our iscsi target on a lun by lun basis
    #
    IP=`ls ${NODES}/nodes/${IQN} | cut -d , -f 1`
    PORT=`ls ${NODES}/nodes/${IQN} | cut -d , -f 2`

    logger -t block-iscsi "TARGET: ${IP}:${PORT}"

    # This is called by each command to determine which multipath map to use
    #
    function get_mpath_map {
      # Re-run multipath to ensure that maps are up to date
      #
      multipath
      sleep 2

      # Now we determine which /dev/sd* device belongs to the iqn
      #
      SCSI_DEV="/dev/`basename \`/usr/bin/readlink /dev/disk/by-path/ip-${IP}:${PORT}-iscsi-${IQN}-lun-0\``"
      logger -t block-iscsi "scsi device: ${SCSI_DEV}"

      # And using the /dev/sd* device we can determine its corresponding multipath entry
      #
      MPATH_MAP="/dev/mapper/`multipath -ll ${SCSI_DEV} | head -1 | awk '{ print $1 }'`"
      logger -t block-iscsi "mpath device: ${MPATH_MAP}"
    }

    case $command in
      add)
        # Login to the target
        logger -t block-iscsi "logging in to ${IQN} on ${IP}:${PORT}"
        sleep 5
        #FIXME needs more advanced race condition logic
        iscsiadm -m node -T ${IQN} -p ${IP}:${PORT} --login | logger -t block-iscsi
        sleep 5
        #FIXME needs more advanced race condition logic
        get_mpath_map

        if [ -a ${MPATH_MAP} ]; then
          logger -t block-iscsi "${command}ing device: ${MPATH_MAP}"
          write_dev ${MPATH_MAP}
        fi
        ;;

      remove)
        get_mpath_map
        if [ -a ${MPATH_MAP} ]; then
          logger -t block-iscsi "flushing buffers on ${MPATH_MAP}"
          blockdev --flushbufs ${MPATH_MAP}
          logger -t block-iscsi "attempting logout of ${IQN} on ${IP}:${PORT}"
          iscsiadm -m node -T ${IQN} -p ${IP}:${PORT} --logout | logger -t block-iscsi
          sleep 10
          #FIXME needs more advanced race condition logic
        fi
        sleep 5
        #FIXME needs more advanced race condition logic
        ;;
    esac


    Keith Reply:

    @Baylink, I see, that’s progress! So I would then move on to the get_mpath_map function, testing those commands and resolving any potential differences between Red Hat and SUSE. These two variables are particularly important:

    SCSI_DEV="/dev/`basename \`/usr/bin/readlink /dev/disk/by-path/ip-${IP}:${PORT}-iscsi-${IQN}-lun-0\``"

    MPATH_MAP="/dev/mapper/`multipath -ll ${SCSI_DEV} | head -1 | awk '{ print $1 }'`"


  34. Baylink Says:

    Well, the first one appears to have mismatched quoting. Let me go back to my local copy of the script.

    Oh. Nope; just a typography problem.

    banner:/appl/xen/cent/57 # set -vx
    banner:/appl/xen/cent/57 # SCSI_DEV="/dev/`basename \`/usr/bin/readlink /dev/disk/by-path/ip-${IP}:${PORT}-iscsi-${IQN}-lun-0\``"
    SCSI_DEV="/dev/`basename \`/usr/bin/readlink /dev/disk/by-path/ip-${IP}:${PORT}-iscsi-${IQN}-lun-0\``"
    basename `/usr/bin/readlink /dev/disk/by-path/ip-${IP}:${PORT}-iscsi-${IQN}-lun-0`
    /usr/bin/readlink /dev/disk/by-path/ip-${IP}:${PORT}-iscsi-${IQN}-lun-0
    +++ /usr/bin/readlink /dev/disk/by-path/ip-:-iscsi--lun-0
    ++ basename
    basename: missing operand
    Try `basename --help' for more information.
    + SCSI_DEV=/dev/

    Aha.

    The symlinks in the /dev/disk/by-path directory on SUSE are *relative*; they’re “../../sda”. Is that what’s causing the problem?
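For what it’s worth, one way to make the lookup robust to relative symlinks (my own sketch, not the original script’s code; the demo builds a throwaway /dev-like tree rather than touching real devices) is `readlink -f`, which canonicalizes the target before `basename` ever sees it:

```shell
# Build a throwaway tree with a *relative* symlink, as on this SUSE box.
demo=$(mktemp -d)
mkdir -p "$demo/dev/disk/by-path"
touch "$demo/dev/sda"
ln -s ../../sda "$demo/dev/disk/by-path/ip-demo-lun-0"

# readlink -f resolves "../../sda" to an absolute path, so basename
# always gets a usable operand regardless of how the link was written.
SCSI_DEV="/dev/$(basename "$(readlink -f "$demo/dev/disk/by-path/ip-demo-lun-0")")"
echo "$SCSI_DEV"
```

Note, though, that the `set -vx` trace above also shows ${IP}, ${PORT}, and ${IQN} expanding to nothing in the interactive shell, so the empty variables are worth ruling out first.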


  35. Baylink Says:

    So, as it turns out, I’ve been sort of leading you down the garden path for not much return.

    I am not actually set up yet to take advantage of the multipath part of the equation… and I didn’t realize (because *none* of the doco sources I’ve been looking at bothered to say so) that the stock install comes with a block-iscsi script that does *not* do multipath, but apparently works ok.

    Oops. :-}

    I’ll come back to this, but probably not for a week or two. I’ll send you patches when I figure them out. Thanks for the help, though, sir.


