Debugging a Clean Failure in Ironic

My team runs a small OpenStack cluster that is responsible for providing bare metal nodes via Ironic. Currently, we have a handful of nodes that are not usable; they show up as “Cleaning failed.” I’m learning how to debug this process.

Tools

The following ipmitool commands allow us to set the machine to PXE boot, remotely power cycle the machine, and watch what happens during the boot process.

Power stuff:

ipmitool -H $H -U $U -I lanplus -P $P chassis power status
ipmitool -H $H -U $U -I lanplus -P $P chassis power on
ipmitool -H $H -U $U -I lanplus -P $P chassis power off
ipmitool -H $H -U $U -I lanplus -P $P chassis power cycle
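These commands assume $H, $U, and $P are set to the node’s BMC address, username, and password. Something like this (placeholder values, taken from examples later in this post):

H=10.76.97.171   # BMC address from the node's driver_info
U=ADMIN          # IPMI username
P=ADMIN          # IPMI password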

Serial over LAN (SOL)

ipmitool -H $H -U $U -I lanplus -P $P sol activate

PXE Boot

ipmitool -H $H -U $U -I lanplus -P $P chassis bootdev pxe
# Set Boot Device to pxe

Conductor Log

To tail the log and only see entries relevant to the UUID of the node I am cleaning:

tail -f /var/log/kolla/ironic/ironic-conductor.log | grep $UUID

OpenStack baremetal node commands

What is the IPMI address for a node?

openstack baremetal node show fab1bcf7-a7fc-4c19-9d1d-fc4dbc4b2281 -f json | jq '.driver_info | .ipmi_address'
"10.76.97.171"

Cleaning Commands

We have a script that prepares the PXE server to accept a cleaning request from a node. It performs the following three actions (don’t do these yet):

 openstack baremetal node maintenance unset ${i}
 openstack baremetal node manage ${i}
 openstack baremetal node provide ${i}
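For reference, a wrapper that re-queues every node stuck in “clean failed” might look something like this. This is a hypothetical sketch, not our actual script:

# Sketch: re-queue every "clean failed" node for cleaning.
for i in $(openstack baremetal node list --provision-state "clean failed" -f value -c UUID); do
    openstack baremetal node maintenance unset ${i}
    openstack baremetal node manage ${i}
    openstack baremetal node provide ${i}
done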

Getting IPMI addresses for nodes

To look at the IPMI power status (and confirm that IPMI is set up correctly for the nodes):

for node in `openstack baremetal node list -f json | jq -r '.[] | select(."Provisioning State"=="clean failed")  | .UUID' ` ; 
do    
echo $node ; 
METAL_IP=`openstack baremetal node show  $node -f json | jq -r  '.driver_info | .ipmi_address' ` ; 
echo $METAL_IP  ; 
ipmitool -I lanplus -H  $METAL_IP  -L ADMINISTRATOR -U admin -R 12 -N 5 -P admin chassis power status   ; 
done

Yes, I did that all on one line, hence the semicolons.
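If you would rather have it as a readable script, the same loop can be laid out like this (the same commands, just reformatted; the IPMI credentials are still assumed to be admin/admin):

# Same loop as above, formatted as a script for readability.
for node in $(openstack baremetal node list -f json \
    | jq -r '.[] | select(."Provisioning State"=="clean failed") | .UUID'); do
    echo "$node"
    METAL_IP=$(openstack baremetal node show "$node" -f json \
        | jq -r '.driver_info | .ipmi_address')
    echo "$METAL_IP"
    ipmitool -I lanplus -H "$METAL_IP" -L ADMINISTRATOR -U admin -R 12 -N 5 -P admin \
        chassis power status
done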

A couple of other one-liners. This one selects all active nodes and gives you their node ID and IPMI IP address.

for node in `openstack baremetal node list -f json | jq -r '.[] | select(."Provisioning State"=="active")  | .UUID' ` ; do echo $node ;  openstack baremetal node show  $node -f json | jq -r  '.driver_info | .ipmi_address' ;done

And you can swap out active for other values. For example, if you want to see which nodes are in either the error or manageable state:

openstack baremetal node list -f json | jq -r '.[] | select(."Provisioning State"=="error" or ."Provisioning State"=="manageable")  | .UUID'
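To avoid retyping that filter, it can be wrapped in a small shell function (the function name is my own invention):

# Hypothetical helper: list node UUIDs in a given provisioning state.
nodes_in_state () {
    openstack baremetal node list -f json \
        | jq -r --arg state "$1" '.[] | select(."Provisioning State"==$state) | .UUID'
}

nodes_in_state "clean failed"
nodes_in_state "error"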

Troubleshooting

PXE outside of OpenStack

If I want to ensure I can PXE boot outside of the OpenStack operations, I can track the state in a console. I like to have this running in a dedicated terminal: open the SOL.

ipmitool -H 10.76.97.176 -U ADMIN -I lanplus -P ADMIN sol activate

and in another, set the machine to PXE boot, then power cycle it:

ipmitool -H 10.76.97.176 -U ADMIN -I lanplus -P ADMIN chassis bootdev pxe
Set Boot Device to pxe
[ayoung@ayoung-home keystone]$ ipmitool -H 10.76.97.176 -U ADMIN -I lanplus -P ADMIN chassis power cycle
Chassis Power Control: Cycle

If the Ironic server is not ready to accept the PXE request, your server will let you know with a message like this one:

>>Checking Media Presence......
>>Media Present......
>>Start PXE over IPv4 on MAC: 1C-34-DA-51-D6-C0.
PXE-E18: Server response timeout.
 
ERROR: Boot option loading failed

PXE inside of a clean

openstack baremetal node list --provision-state "clean failed"  -f value -c UUID

Produces output like this:

8470e638-0085-470c-9e51-b2ed016569e1
5411e7e8-8113-42d6-a966-8cacd1554039
08c14110-88aa-4e45-b5c1-4054ac49115a
3f5f510c-a313-4e40-943a-366917ec9e44

Clean wait log entries

I’ll track what is going on in the log for a specific node by running tail -f and grepping for the UUID of the node:

tail -f /var/log/kolla/ironic/ironic-conductor.log | grep 5411e7e8-8113-42d6-a966-8cacd1554039

If you run the three commands I showed above, the Ironic server should be prepared for cleaning and will accept the PXE request. I can execute these one at a time and track the state in the conductor log. If I kick off a clean, eventually I see entries like this in the conductor log (I’m removing the timestamps and request IDs for readability):

ERROR ironic.conductor.task_manager [] Node 5411e7e8-8113-42d6-a966-8cacd1554039 moved to provision state "clean failed" from state "clean wait"; target provision state is "available"
INFO ironic.conductor.utils [] Successfully set node 5411e7e8-8113-42d6-a966-8cacd1554039 power state to power off by power off.
INFO ironic.drivers.modules.network.flat [] Removing ports from cleaning network for node 5411e7e8-8113-42d6-a966-8cacd1554039
INFO ironic.common.neutron [] Successfully removed node 5411e7e8-8113-42d6-a966-8cacd1554039 neutron ports.

Manual abort

And I can trigger this failure manually if a run is taking too long by running:

openstack baremetal node abort  $UUID
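After an abort (or any cleaning failure), the node’s last_error field usually explains what went wrong, so it is worth checking:

openstack baremetal node show $UUID -f value -c last_error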

Kick off clean process

The command to kick off the clean process is:

openstack baremetal node provide $UUID

In the conductor log, that should show messages like this (again, edited for readability):

Node 5411e7e8-8113-42d6-a966-8cacd1554039 moved to provision state "cleaning" from state "manageable"; target provision state is "available"
Adding cleaning network to node 5411e7e8-8113-42d6-a966-8cacd1554039
For node 5411e7e8-8113-42d6-a966-8cacd1554039 in network de931fcc-32a0-468e-8691-ffcb43bf9f2e, successfully created ports (ironic ID: neutron ID): {'94306ff5-5cd4-4fdd-a33e-a0202c34d3d0': 'd9eeb64d-468d-4a9a-82a6-e70d54b73e62'}.
Successfully set node 5411e7e8-8113-42d6-a966-8cacd1554039 power state to power on by rebooting.
Node 5411e7e8-8113-42d6-a966-8cacd1554039 moved to provision state "clean wait" from state "cleaning"; target provision state is "available"

PXE during a clean

At this point, the most interesting thing is to see what is happening on the node. ipmitool sol activate provides a running console log. If you are lucky, the PXE process kicks off and a Debian-based kernel should start booting. My company has a specific login set for the machines:

debian login: ampere
Password: 
Linux debian 5.10.0-6-arm64 #1 SMP Debian 5.10.28-1 (2021-04-09) aarch64

Debugging on the Node

After this, I use sudo -i to run as root.

$ sudo -i 
...
# ps -ef | grep ironic
root        2369       1  1 14:26 ?        00:00:02 /opt/ironic-python-agent/bin/python3 /usr/local/bin/ironic-python-agent --config-dir /etc/ironic-python-agent.d/

Looking for logs:

ls /var/log/
btmp	ibacm.log  opensm.0x9a039bfffead6720.log  private
chrony	lastlog    opensm.0x9a039bfffead6721.log  wtmp

No ironic log. Is this thing even on the network?

# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp1s0f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 98:03:9b:ad:67:20 brd ff:ff:ff:ff:ff:ff
3: enp1s0f1np1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 98:03:9b:ad:67:21 brd ff:ff:ff:ff:ff:ff
4: enxda90910dd11e: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether da:90:91:0d:d1:1e brd ff:ff:ff:ff:ff:ff

Nope. OK, let’s get it on the network:

# dhclient
[  486.508054] mlx5_core 0000:01:00.1 enp1s0f1np1: Link down
[  486.537116] mlx5_core 0000:01:00.1 enp1s0f1np1: Link up
[  489.371586] mlx5_core 0000:01:00.0 enp1s0f0np0: Link down
[  489.394050] IPv6: ADDRCONF(NETDEV_CHANGE): enp1s0f1np1: link becomes ready
[  489.400646] mlx5_core 0000:01:00.0 enp1s0f0np0: Link up
[  489.406226] IPv6: ADDRCONF(NETDEV_CHANGE): enp1s0f0np0: link becomes ready
root@debian:~# [  500.596626] sr 0:0:0:0: [sr0] CDROM not ready.  Make sure there is a disc in the drive.
ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp1s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 98:03:9b:ad:67:20 brd ff:ff:ff:ff:ff:ff
    inet 192.168.97.178/24 brd 192.168.97.255 scope global dynamic enp1s0f0np0
       valid_lft 86386sec preferred_lft 86386sec
    inet6 fe80::9a03:9bff:fead:6720/64 scope link 
       valid_lft forever preferred_lft forever
3: enp1s0f1np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 98:03:9b:ad:67:21 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::9a03:9bff:fead:6721/64 scope link 
       valid_lft forever preferred_lft forever
4: enxda90910dd11e: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 1000
    link/ether da:90:91:0d:d1:1e brd ff:ff:ff:ff:ff:ff
    inet6 fe80::d890:91ff:fe0d:d11e/64 scope link 
       valid_lft forever preferred_lft forever

And…quite shortly thereafter in the conductor log:

Agent on node 5411e7e8-8113-42d6-a966-8cacd1554039 returned cleaning command success, moving to next clean step
Node 5411e7e8-8113-42d6-a966-8cacd1554039 moved to provision state "cleaning" from state "clean wait"; target provision state is "available"
Executing cleaning on node 5411e7e8-8113-42d6-a966-8cacd1554039, remaining steps: []
Successfully set node 5411e7e8-8113-42d6-a966-8cacd1554039 power state to power off by power off.
Removing ports from cleaning network for node 5411e7e8-8113-42d6-a966-8cacd1554039
Successfully removed node 5411e7e8-8113-42d6-a966-8cacd1554039 neutron ports.
Node 5411e7e8-8113-42d6-a966-8cacd1554039 cleaning complete
Node 5411e7e8-8113-42d6-a966-8cacd1554039 moved to provision state "available" from state "cleaning"; target provision state is "None"
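To confirm the node really is back in the pool, you can check its provision state directly; it should now report available:

openstack baremetal node show 5411e7e8-8113-42d6-a966-8cacd1554039 -f value -c provision_state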

Cause of Failure

So, in our case, the issue seems to be that the IPA (ironic-python-agent) image does not have DHCP enabled.
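One way to fix that would be to bake a DHCP configuration into the image. For a Debian-based image using ifupdown, a stanza like the following might do it (a sketch only; the interface name has to match the hardware, and we have not applied this yet):

# Hypothetical /etc/network/interfaces.d/ fragment baked into the IPA image.
# The interface name (here, the one from this node) varies per machine.
allow-hotplug enp1s0f0np0
iface enp1s0f0np0 inet dhcp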
