Howto: Boot from ipxe/iscsi target using ibft with debian and dracut based Linux

Intro

This guide is a short writeup of what I learned about booting from iscsi using ipxe.

Some time has passed since I implemented these settings on my systems, so a few details may be incorrect or missing. Always check the documentation yourself. I'll try to include as many reference links as I can remember.

The goal of my setup was remote booting of diskless hypervisor nodes running dracut or debian based distros, with the specific requirement that no host/node specific settings (ip addresses, paths, iscsi initiators or targets) need to be configured on the diskless node.

This enables an infrastructure where physical baremetal nodes can be added to and removed from a cluster with no configuration other than setting the node bios to boot from nic via pxe/ipxe, and adding the MAC address of the booting nic to your DHCP server. The names of the iscsi initiator and target and all ip settings are configured only once in the ipxe menu, after which they automatically follow through the whole boot process thanks to ibft.
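To check that the ibft values really made it through to the running system, you can look under /sys/firmware/ibft, where the kernel exposes them once the iscsi_ibft module is loaded. Below is a minimal sketch that prints a few of them. The function name show_ibft and its optional sysfs root argument are made up for illustration, and the ethernet0/target0 indexes are an assumption that may differ on your hardware.

```shell
#!/bin/sh
# Print the iBFT values the kernel exposes, if any.
# The sysfs root is a parameter (hypothetical, for testing against a
# scratch directory); on a real system it defaults to /sys.
show_ibft() {
    ibft_dir="${1:-/sys}/firmware/ibft"
    if [ ! -d "$ibft_dir" ]; then
        echo "no iBFT found under $ibft_dir" >&2
        return 1
    fi
    # ethernet0/target0 are assumed indexes; check the directory listing
    # on your own node if nothing is printed for them.
    for attr in initiator/initiator-name ethernet0/mac ethernet0/ip-addr \
                target0/target-name target0/ip-addr; do
        if [ -r "$ibft_dir/$attr" ]; then
            printf '%s: %s\n' "$attr" "$(cat "$ibft_dir/$attr")"
        fi
    done
}
```

Calling show_ibft with no arguments on a node booted as described in this guide should list the same initiator name and ip that were set in the ipxe menu. If it reports that no iBFT was found, either the module is not loaded or the node was not booted through ipxe sanboot.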

A standard iscsi boot process requires 4 individual logins to the iscsi target hosting the root fs. Following this guide, all 4 logins will come from the same ip with a predictable initiator name, allowing for less complicated NAS/SAN firewall rules and less convoluted ACLs on istgt style initiator groups.

The administrator has full remote control of the physical node without needing any ipmi-like remote bios access. Using just wake-on-lan magic packets, ipxe MAC/hostname based boot menu defaults, and iscsi ACLs and target groups, the administrator can instruct the node to boot from any type of block device that can be exposed via iscsi, and set different boot defaults based on various conditions using shell scripts and/or ipxe scripting, never needing to physically visit the node.

After hundreds of reboots I can say that these methods, when working correctly, are 100% predictable from the viewpoint of the remote admin: every single boot has been successful and no non-reproducible booting errors or failures have occurred yet, barring hardware failures.

IMPORTANT NOTE:

Using nics of a brand other than intel can mess with nic interface ordering, especially in debian. If you have non-intel nics and need to boot debian, you must read the important notice at the end of the guide in addition to the debian section. You also need a debian installer that includes non-OSS firmware, or you need to include the drivers for your nic in the installer, or you need to add the kernel module manually afterwards (see the debian firmware wiki url in the references).

Step 1:

Set up robinsmidsrod's ipxe scripts at https://gist.github.com/robinsmidsrod/2234639. Ask in #ipxe @ freenode or this thread if you have issues. You need to define the variables and defaults, and have the /boot directory for adding node defaults. Don't worry, that's actually the hard part. Setting up debian and dracut is the easy part :)

Debian Jessie and Wheezy based distros:

Jessie:

1. Install a standard debian distro. It doesn't matter how: bootstrap, install to iscsi target via debian installer, install to iscsi backed KVM volume, install to raw file, install to disk. As long as you can export the resulting block device via iscsi you're good.

The easiest way is to start the debian net-installer in a KVM/qemu VM with your working iscsi LUN target directly attached as kvm block storage. (You can also run the installer with no local VM disk configured. The only difference is that after install you need to manually log in to the iscsi target, mount it, and chroot into it to activate ibft booting.)

2. After installation reboot into system and do as root:

apt-get install open-iscsi -y
echo "ISCSI_AUTO=true" > /etc/iscsi/iscsi.initramfs
nano /etc/initramfs-tools/initramfs.conf

scroll down and set DEVICE=eth0

IMPORTANT

A: As far as I know this setting is only needed when you have more than one nic in the system, but I would set it anyway, as the system would otherwise become unbootable if more nics are added later.

B: This setting is CRITICAL if you run more than one nic AND/OR ESPECIALLY if you run a nic of a brand other than intel, in which case you must read the important notice at the end of this guide.

3.

update-initramfs -u

Done, your system will now boot from iscsi with static or dhcp based ip settings, which are defined only once, server-side, in the ipxe boot scripts. (Edit: this might not be entirely true with static ip, but a static ip setup kinda defeats the whole idea of this guide.)
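If you want to script step 2 instead of editing the file in nano, something like the following could work. It is only a sketch: the function name set_initramfs_device is made up, and it assumes GNU sed (for -i), which is what debian ships.

```shell
#!/bin/sh
# Set DEVICE=<iface> in an initramfs.conf-style file.
# Replaces an existing DEVICE= line, or appends one if none exists.
# (set_initramfs_device is a hypothetical helper name.)
set_initramfs_device() {
    conf="$1"
    iface="$2"
    if grep -q '^DEVICE=' "$conf"; then
        # rewrite the existing line in place
        sed -i "s/^DEVICE=.*/DEVICE=$iface/" "$conf"
    else
        # no DEVICE= line yet, add one at the end
        printf 'DEVICE=%s\n' "$iface" >> "$conf"
    fi
}
```

On a real node you would call it as root with /etc/initramfs-tools/initramfs.conf and the interface name (eth0, or eth1 in the mixed-brand case described in the important notice), and then run update-initramfs -u as in step 3.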

Wheezy:

The procedure is exactly the same as above, with the addition that changes are needed in the halt and reboot scripts to prevent them from hanging while waiting for the iscsi initiator to disconnect before unmounting root, which of course is impossible since root is on the remote target it wants to disconnect.

This is the part where you have to do your own googling, as I can't remember exactly which changes are needed and have no wheezy to test on at the time of writing. All I remember is that one switch, probably -f (force), needed to be added to the reboot command in /etc/init.d/reboot (the command is 'reboot -d -f -i' in jessie), and that NETDOWN=yes or no needed to be flipped in /etc/init.d/halt. I can't remember what the default was.

Dracut:

1. Install as usual to any block device, as with debian.

2. after installation do:

yum update && yum install -y iscsi-initiator-utils
echo 'add_dracutmodules+=" iscsi "' > /etc/dracut.conf.d/ibft.conf
nano /etc/default/grub

add the following 3 entries to GRUB_CMDLINE_LINUX="rd.iscsi.firmware=1 rd.iscsi.ibft=1 netroot=iscsi:ibft"

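grub2-mkconfig -o /boot/grub2/grub.cfg
dracut -f -v

(The grub.cfg path assumes a BIOS-booted yum based system; update-grub only exists on debian style systems. The -f is needed for dracut to overwrite the existing initramfs.)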

Done. (I'm not sure whether rd.iscsi.firmware=1 is needed on machines with OSS nic drivers in the kernel, nor whether that entry causes problems in configurations with only intel nics, due to lack of testing at the time of writing.)
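The grub edit can also be scripted so that it is safe to run more than once. The following is a rough sketch, not a tested part of my setup: the function name add_grub_arg is made up, it assumes GNU sed, and it assumes the file already contains a GRUB_CMDLINE_LINUX="..." line on a single line (if not, nothing is changed).

```shell
#!/bin/sh
# Append one kernel argument to GRUB_CMDLINE_LINUX="..." unless it is
# already there. (add_grub_arg is a hypothetical helper name.)
add_grub_arg() {
    file="$1"
    arg="$2"
    # Skip if the argument already appears on the GRUB_CMDLINE_LINUX line.
    if grep '^GRUB_CMDLINE_LINUX=' "$file" | grep -qF -- "$arg"; then
        return 0
    fi
    # Keep whatever is inside the quotes and add the new argument at the end.
    sed -i "s|^GRUB_CMDLINE_LINUX=\"\(.*\)\"|GRUB_CMDLINE_LINUX=\"\1 $arg\"|" "$file"
}
```

You would call it once per entry against /etc/default/grub, for example add_grub_arg /etc/default/grub rd.iscsi.ibft=1, and then regenerate the grub config and the initramfs as described above. Existing entries such as quiet are left alone.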

notes:

Note 1: Assigning a hostname to the MAC address in the DHCP server makes robinsmidsrod's script use the hostname in the iscsi initiator name (for example iqn.2007-09.jp.ne.peach.istgt:debian), which in conjunction with ibft further improves and beautifies (!) the infrastructure: hostname and MAC are the only variables that need to be configured, once, on the backend for each new node added to the network. Without a DHCP assigned hostname the initiator name will contain the MAC address (iqn.2007-09.jp.ne.peach.istgt:XX:XX:XX:XX:XX:XX).

Either way, set the ipxe defaults for each node in the ipxe menu /boot dir. You can maintain multiple default boot profiles for all nodes using multiple directories with appropriate default entries, or simply keep the defaults static and edit the LUN in menu.ipxe, or do both and add automation with CGI/Perl scripting (and publish your scripts). An example menu.ipxe entry:

:iscsi-node-01
echo Booting iscsi node 01 for ${initiator-iqn}
set base-iscsi iscsi:${iscsi-server}::::${base-iqn}
set root-path ${base-iscsi}:iscsi-LUN-01
sanboot ${root-path} || goto failed
goto start
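The initiator naming rule from Note 1 can be sketched in shell, which is handy if you want to generate matching ACL entries on the target side. This only mirrors the behaviour described above and is not taken from robinsmidsrod's script; the function name initiator_iqn and its argument order are made up for illustration.

```shell
#!/bin/sh
# Build the initiator name a node is expected to log in with:
# base iqn plus the DHCP hostname if there is one, else plus the MAC.
# (initiator_iqn is a hypothetical helper, not part of any ipxe script.)
initiator_iqn() {
    base="$1"
    hostname="$2"
    mac="$3"
    if [ -n "$hostname" ]; then
        printf '%s:%s\n' "$base" "$hostname"
    else
        printf '%s:%s\n' "$base" "$mac"
    fi
}
```

For example, initiator_iqn iqn.2007-09.jp.ne.peach.istgt debian 52:54:00:12:34:56 gives the hostname form, and passing an empty string as the hostname gives the MAC form.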

Note 2: If you want to increase the fault tolerance of the process, you can add a loop to ipxe (in undionly.kpxe or a custom ROM) that keeps querying the dhcp and/or http servers, like this:

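#!ipxe
goto dhcp_retry
# keep asking for a lease until the DHCP server answers
:dhcp_retry
sleep 3
dhcp && goto chain_retry || goto dhcp_retry
# keep fetching the boot script until the http server answers
:chain_retry
sleep 3
chain http://[your http server]/boot.ipxe && goto exit || goto chain_retry
:exit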

This is for the case of a power outage: when the power comes back on, the DHCP/tftp/http servers might not be ready yet. The loop allows the nodes to wait instead of failing to boot with a timeout. If chainloading undionly.kpxe, the DHCP loop only helps if you can control how the nic ROM and/or BIOS handles DHCP timeouts. The only drawback is a delay of a few seconds in the script even if all servers are up (removing the sleep command does not look pretty on the screen, as it floods requests as fast as it can).

Currently I chainload undionly.kpxe for my integrated nics, which can't be flashed, and I have replaced the dhcp loop with just 'ifopen net0' to reduce the total number of DHCP queries for the whole boot process to 2 (the nic PXE rom dhcp query, and another after grub, before chroot). This is because I can't control BIOS behaviour on DHCP timeout, so without dhcp my nodes would fail to boot and require physical access anyway. However, if the DHCP/tftp server is up and just the http server isn't, the loop waits until boot.ipxe becomes available. If you flash a custom ipxe with the above loops included, you can configure a node that waits indefinitely for the dhcp and http servers to become available, reducing the risk of having to physically visit the node after a power outage or similar. In extreme cases loops can be added to the sanboot commands in menu.ipxe as well, to keep polling iscsi targets until they are available.

Note 3: Chainloading undionly.kpxe from tftp and setting all clients in the infrastructure to boot via pxe/ipxe regardless of local HDD, then defining a default that exits the ipxe menu and continues the local BIOS boot for nodes that are not supposed to boot over the network, would let you control the boot process of those nodes remotely as well, if needed.

Note 4: For even more redundancy, host the iscsi LUNs on an RBD or DRBD mirrored pool and set up the tftp/http/iscsi servers for High Availability.

Note 5: If you are booting hypervisor nodes, be aware that you cannot bridge the ibft/iscsi interface. Doing so breaks the connection to the root volume and catastrophically halts the process.

Note 6: Help is needed to make the iscsi connection hosting the root volume more robust. Currently, running iscsiadm -m node -u in debian or dracut makes the OS happily disconnect the root volume, and the system halts. In dracut the ibft interface is renamed to ibft0, which is a step in the right direction I guess, but more work is needed.

Note 7: If your network interface names are messed up because of udev renaming during boot ( dmesg | grep renamed ), delete the file named 70-* (typically 70-persistent-net.rules) in /etc/udev/rules.d/ and reboot.

Note 8: Both debian and dracut installations remain bootable from a local disk or KVM hosted virtio device as well, as no changes are made to the grub root device definitions.

Note 9: It's scary how many times I have typed 'boot' and 'iscsi' during this article.


references:

http://linux.die.net/man/8/dracut

http://pve.proxmox.com/wiki/Proxmox_ISCSI_installation

https://wiki.debian.org/Firmware

IMPORTANT NOTICE

For example, adding a broadcom based nic to a node with one integrated intel nic will probably flip the order of the nics at grub kernel boot, making the broadcom nic eth0 and the intel nic eth1. You need to be aware of this because in that case you need to change which MAC address is granted the DHCP configured hostname at boot. Also, ipxe will see the intel MAC first, but after the grub phase the linux kernel will switch the interfaces, so the MAC and ip address of the client logging into the iscsi target _after_ the grub phase will be different, meaning 2 entries are needed for each node in the SAN firewall and/or iscsi target LUN access permissions.

During my testing this affected only debian, and only when not all nics were intel. Luckily you can tell initramfs which interface to use for netbooting, so combined with MAC settings on the DHCP server it's possible to mitigate this issue, which can be a real pain otherwise.

Remember when setting DEVICE= in /etc/initramfs-tools/initramfs.conf that it is the device used AFTER the grub kernel stage. If you boot via PXE/IPXE using the integrated intel nic, that nic would be eth1 in initramfs.conf, and the additional nic of another brand would be eth0. Thus, setting DEVICE=eth1 in a machine with more than one nic, of which some are non-intel, combined with switching the MAC address assigned to the hostname on the DHCP server, results in a "clean" boot process where the same MAC address requests every dhcp lease, keeping the configuration consistent across the infra. Any deviation from this and you will probably have some issue somewhere. Of course, the best solution is to only use intel nics.
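To find out which ethN the kernel actually gave to which nic, you can list the interface names next to their MAC addresses from sysfs and compare them with the MAC that ipxe booted from. A small sketch follows; the function name list_nic_macs and its optional directory argument are made up for illustration.

```shell
#!/bin/sh
# Print "<interface> <mac>" for every network interface except lo.
# The directory argument is hypothetical (useful for testing);
# on a real system it defaults to /sys/class/net.
list_nic_macs() {
    net_dir="${1:-/sys/class/net}"
    for dev in "$net_dir"/*; do
        # skip entries without a readable MAC address file
        [ -r "$dev/address" ] || continue
        name=$(basename "$dev")
        if [ "$name" != "lo" ]; then
            printf '%s %s\n' "$name" "$(cat "$dev/address")"
        fi
    done
}
```

Run list_nic_macs once after booting from a local disk or VM: the name printed next to the MAC of the nic you PXE boot from is the value to put in DEVICE=.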
