My team is running a small OpenStack cluster with responsibility for providing bare metal nodes via Ironic. Currently, we have a handful of nodes that are not usable. They show up as “clean failed.” I’m learning how to debug this process.
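To see which nodes are affected, I start with a plain node list; the Provisioning State column shows the stuck nodes:
openstack baremetal node list -c UUID -c "Provisioning State"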
Tools
The following ipmitool commands allow us to set the machine to PXE boot, remotely power cycle the machine, and view what happens during the boot process.
Power stuff:
ipmitool -H $H -U $U -I lanplus -P $P chassis power status
ipmitool -H $H -U $U -I lanplus -P $P chassis power on
ipmitool -H $H -U $U -I lanplus -P $P chassis power off
ipmitool -H $H -U $U -I lanplus -P $P chassis power cycle
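These commands assume $H, $U, and $P hold the BMC address and IPMI credentials for the node; a minimal setup might look like this (placeholder values, not our real credentials):
export H=10.76.97.171   # BMC (IPMI) address of the node -- example value
export U=admin          # IPMI user name -- placeholder
export P=admin          # IPMI password -- placeholder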
Serial over LAN (SOL)
ipmitool -H $H -U $U -I lanplus -P $P sol activate
PXE Boot
ipmitool -H $H -U $U -I lanplus -P $P chassis bootdev pxe  # Set Boot Device to pxe
Conductor Log
To tail the log and only see entries relevant to the UUID of the node I am cleaning:
tail -f /var/log/kolla/ironic/ironic-conductor.log | grep $UUID
OpenStack baremetal node commands
What is the IPMI address for a node?
openstack baremetal node show fab1bcf7-a7fc-4c19-9d1d-fc4dbc4b2281 -f json | jq '.driver_info | .ipmi_address'
"10.76.97.171"
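Since I look this up constantly, a small helper function saves typing; node_ipmi_addr is a name I made up, not part of the client:
# Hypothetical helper: print the IPMI address for a node UUID or name.
node_ipmi_addr() {
    openstack baremetal node show "$1" -f json | jq -r '.driver_info.ipmi_address'
}
# Example: node_ipmi_addr fab1bcf7-a7fc-4c19-9d1d-fc4dbc4b2281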
Cleaning Commands
We have a script that prepares the PXE server to accept a cleaning request from a node. It performs the following three actions (don’t do these yet):
openstack baremetal node maintenance unset ${i}
openstack baremetal node manage ${i}
openstack baremetal node provide ${i}
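For reference, applying those three steps to every node currently in the clean failed state might look like the sketch below; this is my own expansion, not the script's exact contents:
# Sketch: re-run cleaning for all nodes in "clean failed".
for i in $(openstack baremetal node list --provision-state "clean failed" -f value -c UUID) ; do
    openstack baremetal node maintenance unset ${i}
    openstack baremetal node manage ${i}
    openstack baremetal node provide ${i}
done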
Getting IPMI addresses for nodes
To look at the IPMI power status (and confirm that IPMI is set up right for the nodes):
for node in `openstack baremetal node list -f json | jq -r '.[] | select(."Provisioning State"=="clean failed") | .UUID' ` ; do echo $node ; METAL_IP=`openstack baremetal node show $node -f json | jq -r '.driver_info | .ipmi_address' ` ; echo $METAL_IP ; ipmitool -I lanplus -H $METAL_IP -L ADMINISTRATOR -U admin -R 12 -N 5 -P admin chassis power status ; done
Yes, I did that all on one line, hence the semicolons.
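Broken across lines, the same loop is a bit easier to read (functionally identical):
for node in $(openstack baremetal node list -f json | \
        jq -r '.[] | select(."Provisioning State"=="clean failed") | .UUID') ; do
    echo $node
    METAL_IP=$(openstack baremetal node show $node -f json | jq -r '.driver_info | .ipmi_address')
    echo $METAL_IP
    ipmitool -I lanplus -H $METAL_IP -L ADMINISTRATOR -U admin -R 12 -N 5 -P admin chassis power status
done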
A couple of other one-liners. This one selects all active nodes and gives you their node ID and IPMI IP address.
for node in `openstack baremetal node list -f json | jq -r '.[] | select(."Provisioning State"=="active") | .UUID' ` ; do echo $node ; openstack baremetal node show $node -f json | jq -r '.driver_info | .ipmi_address' ;done
And you can swap out active for other values. For example, if you want to see which nodes are in either the error or manageable states:
openstack baremetal node list -f json | jq -r '.[] | select(."Provisioning State"=="error" or ."Provisioning State"=="manageable") | .UUID'
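When I find myself doing that a lot, a small function parameterized on the state keeps it readable; nodes_in_state is my own name for this sketch:
# Hypothetical helper: list node UUIDs in a given provisioning state.
nodes_in_state() {
    openstack baremetal node list -f json | \
        jq -r --arg state "$1" '.[] | select(."Provisioning State"==$state) | .UUID'
}
# Examples:
# nodes_in_state "clean failed"
# nodes_in_state "active"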
Troubleshooting
PXE outside of OpenStack
If I want to ensure I can PXE boot outside of the OpenStack operations, I can track the state in a console. I like to have this running in a dedicated terminal: open the SOL.
ipmitool -H 10.76.97.176 -U ADMIN -I lanplus -P ADMIN sol activate
In another terminal, set the machine to PXE boot, then power cycle it:
ipmitool -H 10.76.97.176 -U ADMIN -I lanplus -P ADMIN chassis bootdev pxe
Set Boot Device to pxe
[ayoung@ayoung-home keystone]$ ipmitool -H 10.76.97.176 -U ADMIN -I lanplus -P ADMIN chassis power cycle
Chassis Power Control: Cycle
If the Ironic server is not ready to accept the PXE request, your server will let you know with a message like this one:
>>Checking Media Presence......
>>Media Present......
>>Start PXE over IPv4 on MAC: 1C-34-DA-51-D6-C0.
PXE-E18: Server response timeout.
ERROR: Boot option loading failed
PXE inside of a clean
To find the nodes that failed cleaning:
openstack baremetal node list --provision-state "clean failed" -f value -c UUID
Produces output like this:
8470e638-0085-470c-9e51-b2ed016569e1
5411e7e8-8113-42d6-a966-8cacd1554039
08c14110-88aa-4e45-b5c1-4054ac49115a
3f5f510c-a313-4e40-943a-366917ec9e44
Clean wait log entries
I’ll track what is going on in the log for a specific node by running tail -f and grepping for the uuid of the node:
tail -f /var/log/kolla/ironic/ironic-conductor.log | grep 5411e7e8-8113-42d6-a966-8cacd1554039
If you run the three commands I showed above, the Ironic server should be prepared for cleaning and will accept the PXE request. I can execute these one at a time and track the state in the conductor log. If I kick off a clean, eventually, I see entries like this in the conductor log (I’m removing the time stamps and request ids for readability):
ERROR ironic.conductor.task_manager [] Node 5411e7e8-8113-42d6-a966-8cacd1554039 moved to provision state "clean failed" from state "clean wait"; target provision state is "available"
INFO ironic.conductor.utils [] Successfully set node 5411e7e8-8113-42d6-a966-8cacd1554039 power state to power off by power off.
INFO ironic.drivers.modules.network.flat [] Removing ports from cleaning network for node 5411e7e8-8113-42d6-a966-8cacd1554039
INFO ironic.common.neutron [] Successfully removed node 5411e7e8-8113-42d6-a966-8cacd1554039 neutron ports.
Manual abort
And I can trigger this transition to "clean failed" manually if a run is taking too long by running:
openstack baremetal node abort $UUID
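After the abort, the node should land back in the clean failed state, which I can confirm with:
openstack baremetal node show $UUID -f value -c provision_state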
Kick off clean process
The command to kick off the clean process is:
openstack baremetal node provide $UUID
In the conductor log, that should show messages like this (again, edited for readability):
Node 5411e7e8-8113-42d6-a966-8cacd1554039 moved to provision state "cleaning" from state "manageable"; target provision state is "available"
Adding cleaning network to node 5411e7e8-8113-42d6-a966-8cacd1554039
For node 5411e7e8-8113-42d6-a966-8cacd1554039 in network de931fcc-32a0-468e-8691-ffcb43bf9f2e, successfully created ports (ironic ID: neutron ID): {'94306ff5-5cd4-4fdd-a33e-a0202c34d3d0': 'd9eeb64d-468d-4a9a-82a6-e70d54b73e62'}.
Successfully set node 5411e7e8-8113-42d6-a966-8cacd1554039 power state to power on by rebooting.
Node 5411e7e8-8113-42d6-a966-8cacd1554039 moved to provision state "clean wait" from state "cleaning"; target provision state is "available"
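Rather than watching the conductor log the whole time, a simple polling loop over the provision state works too; the 30-second sleep is arbitrary:
# Sketch: poll the node until cleaning finishes one way or the other.
while true ; do
    state=$(openstack baremetal node show $UUID -f value -c provision_state)
    echo "$(date +%T) $state"
    case $state in
        available|"clean failed") break ;;
    esac
    sleep 30
done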
PXE during a clean
At this point, the most interesting thing is to see what is happening on the node. ipmitool sol activate provides a running log. If you are lucky, the PXE process kicks off and a Debian-based kernel should start booting. My company has a specific login set for the machines.
Debugging on the Node
After logging in, I use sudo -i to run as root.
$ sudo -i
...
# ps -ef | grep ironic
root      2369     1  1 14:26 ?        00:00:02 /opt/ironic-python-agent/bin/python3 /usr/local/bin/ironic-python-agent --config-dir /etc/ironic-python-agent.d/
Looking for logs:
ls /var/log/
btmp    ibacm.log    opensm.0x9a039bfffead6720.log    private
chrony  lastlog      opensm.0x9a039bfffead6721.log    wtmp
No ironic log. Is this thing even on the network?
# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp1s0f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 98:03:9b:ad:67:20 brd ff:ff:ff:ff:ff:ff
3: enp1s0f1np1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 98:03:9b:ad:67:21 brd ff:ff:ff:ff:ff:ff
4: enxda90910dd11e: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether da:90:91:0d:d1:1e brd ff:ff:ff:ff:ff:ff
Nope. OK, let's get it on the network:
# dhclient
[ 486.508054] mlx5_core 0000:01:00.1 enp1s0f1np1: Link down
[ 486.537116] mlx5_core 0000:01:00.1 enp1s0f1np1: Link up
[ 489.371586] mlx5_core 0000:01:00.0 enp1s0f0np0: Link down
[ 489.394050] IPv6: ADDRCONF(NETDEV_CHANGE): enp1s0f1np1: link becomes ready
[ 489.400646] mlx5_core 0000:01:00.0 enp1s0f0np0: Link up
[ 489.406226] IPv6: ADDRCONF(NETDEV_CHANGE): enp1s0f0np0: link becomes ready
root@debian:~# [ 500.596626] sr 0:0:0:0: [sr0] CDROM not ready. Make sure there is a disc in the drive.
ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp1s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 98:03:9b:ad:67:20 brd ff:ff:ff:ff:ff:ff
    inet 192.168.97.178/24 brd 192.168.97.255 scope global dynamic enp1s0f0np0
       valid_lft 86386sec preferred_lft 86386sec
    inet6 fe80::9a03:9bff:fead:6720/64 scope link
       valid_lft forever preferred_lft forever
3: enp1s0f1np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 98:03:9b:ad:67:21 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::9a03:9bff:fead:6721/64 scope link
       valid_lft forever preferred_lft forever
4: enxda90910dd11e: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 1000
    link/ether da:90:91:0d:d1:1e brd ff:ff:ff:ff:ff:ff
    inet6 fe80::d890:91ff:fe0d:d11e/64 scope link
       valid_lft forever preferred_lft forever
And…quite shortly thereafter in the conductor log:
Agent on node 5411e7e8-8113-42d6-a966-8cacd1554039 returned cleaning command success, moving to next clean step
Node 5411e7e8-8113-42d6-a966-8cacd1554039 moved to provision state "cleaning" from state "clean wait"; target provision state is "available"
Executing cleaning on node 5411e7e8-8113-42d6-a966-8cacd1554039, remaining steps: []
Successfully set node 5411e7e8-8113-42d6-a966-8cacd1554039 power state to power off by power off.
Removing ports from cleaning network for node 5411e7e8-8113-42d6-a966-8cacd1554039
Successfully removed node 5411e7e8-8113-42d6-a966-8cacd1554039 neutron ports.
Node 5411e7e8-8113-42d6-a966-8cacd1554039 cleaning complete
Node 5411e7e8-8113-42d6-a966-8cacd1554039 moved to provision state "available" from state "cleaning"; target provision state is "None"
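At that point the node should be back in the available pool, which I can confirm with:
openstack baremetal node list --provision-state available -f value -c UUID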
Cause of Failure
So, in our case, the issue seems to be that the IPA (ironic-python-agent) image does not have DHCP enabled.
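One possible fix, assuming the ramdisk is built with ironic-python-agent-builder and diskimage-builder (an assumption about our build pipeline that I have not verified yet), would be to rebuild it with the dhcp-all-interfaces element so every NIC requests an address at boot:
# Sketch only: rebuild the Debian-based IPA ramdisk with DHCP on all interfaces.
# Assumes ironic-python-agent-builder and diskimage-builder are installed.
ironic-python-agent-builder debian -o ipa-debian -e dhcp-all-interfaces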