After you get something working, you often find you missed a step in documenting how you got there. You might have installed a package and forgotten about it. Or maybe you set up a network connection. In my case, I have often brute-forced the SSH setup for later provisioning. Since this is done once and then forgotten, often in the push to “just get work done,” I have had to go back and redo it (again, usually manually) whenever I get to a new machine.
To avoid this, I am documenting what I can do to get a new machine up and running in a state where SSH connections (and forwarding) can be reliably run. This process should be automatable, but at a minimum, it should be understood.
To start with, I want to PXE boot the machine and reinstall the OS. Unless you are using a provisioning system like Ironic or Cobbler, this is probably a manual process, but you can still automate a good bit of it. The first step is to tell the IPMI-based BMC (we are not on Redfish yet) to boot to PXE upon the next reboot.
This is the first place to introduce some automation. All of our ipmitool commands take mostly the same parameters, so we can take the easy step of storing the common part of the command in a variable, and use environment variables to fill in the repeated values.
export CMD_IPMITOOL="ipmitool -H $DEV_BMCIP -U $IPMI_USER -I lanplus -P $IPMI_PASS"
One benefit of this is that you can extract the variables into an environment file that you source separately from the functions. That makes the command reusable for other machines.
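A minimal sketch of what such an environment file might look like. The file name dev_machine.env and every value in it are placeholders I made up for illustration (the addresses come from the documentation range), so substitute whatever your lab uses:

```shell
# dev_machine.env: hypothetical per-machine settings; all values are placeholders.
export DEV_BMCIP="192.0.2.10"      # address of the BMC
export DEV_SYSTEMIP="192.0.2.11"   # address of the host OS once it is installed
export IPMI_USER="admin"
export IPMI_PASS="changeme"

# Rebuild the command prefix from the values above.
export CMD_IPMITOOL="ipmitool -H $DEV_BMCIP -U $IPMI_USER -I lanplus -P $IPMI_PASS"
```

Sourcing a different file of the same shape before calling the functions points them at a different machine.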
To force PXE booting on the next boot, we also make use of a function that power cycles the system. Note that I echo each command before running it, so that the user gets a little feedback and does not get impatient.
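# Each helper echoes the command before running it, for feedback.
dev_power_on(){
    CMD_="$CMD_IPMITOOL power on"
    echo $CMD_
    $CMD_
}
dev_power_off(){
    CMD_="$CMD_IPMITOOL power off"
    echo $CMD_
    $CMD_
}
# Assumed helper: dev_power_cycle calls it, so it is defined here
# as a thin wrapper around "power status".
dev_power_status(){
    CMD_="$CMD_IPMITOOL power status"
    echo $CMD_
    $CMD_
}
dev_power_cycle(){
    dev_power_off
    dev_power_on
    dev_power_status
}
dev_pxe(){
    # Ask the BMC to boot from PXE on the next boot, then power cycle.
    CMD_="$CMD_IPMITOOL chassis bootdev pxe"
    echo $CMD_
    $CMD_
    dev_power_cycle
}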
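The functions above repeat the same echo-then-run pattern, and the power cycle sends "on" right after "off" without checking that the chassis has actually powered down. The following is one possible way to factor that out, not what I currently run. The names dev_ipmi, dev_wait_power, and dev_power_cycle_wait are hypothetical, the sketch assumes bash or a POSIX sh (where an unquoted variable is split into words), and it assumes the status output contains the text "is on" or "is off":

```shell
# Assumes CMD_IPMITOOL is already exported.

# Echo an ipmitool subcommand, then run it.
dev_ipmi(){
    # $CMD_IPMITOOL is deliberately unquoted so the shell splits it into words.
    echo $CMD_IPMITOOL "$@"
    $CMD_IPMITOOL "$@"
}

# Poll until the chassis reports the wanted power state ("on" or "off").
# Returns non-zero if the state is not seen within the given number of tries.
dev_wait_power(){
    want="$1"
    tries="${2:-30}"
    while [ "$tries" -gt 0 ]; do
        case "$($CMD_IPMITOOL power status)" in
            *"is $want"*) return 0 ;;
        esac
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Power cycle that waits for "off" before sending "on".
dev_power_cycle_wait(){
    dev_ipmi power off
    dev_wait_power off || echo "warning: chassis is still not reporting off"
    dev_ipmi power on
    dev_ipmi power status
}
```

With this in place, dev_pxe could shrink to `dev_ipmi chassis bootdev pxe` followed by `dev_power_cycle_wait`.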
Once this function is executed, the machine will boot to PXE mode. What this looks like depends heavily on your setup. Two things tend to vary. One is how you connect to the machine in order to handle the PXE setup. If you are lucky, you have a simple interface. We have a serial console concentrator here, so I can connect to the machine using a telnet command that I get from our lab manager. In other stages of life, I have had to use minicom to connect to a physical UART (serial port) to handle PXE boot configuration. I highly recommend the serial concentrator route if you can swing it.
But usually you have an IPMI-based option to open the serial console. Just beware that this might conflict with, and thus disable, a UART-based way of connecting. For me, I can do this using:
$CMD_IPMITOOL sol activate
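A Serial-over-LAN session that was not closed cleanly can block the next attempt to connect. As a sketch, the command above could be wrapped in a function that first deactivates any leftover session. The name dev_console is hypothetical, and the `~.` exit sequence is ipmitool's default escape, which may be configured differently on your system:

```shell
# Sketch: open the Serial-over-LAN console, clearing any stale session first.
# Assumes CMD_IPMITOOL is already exported.
dev_console(){
    # Deactivate a leftover session; ignore the error when none is active.
    $CMD_IPMITOOL sol deactivate >/dev/null 2>&1 || true
    echo "Opening SOL console; type ~. to exit"
    $CMD_IPMITOOL sol activate
}
```

If you are connected through SSH, `~.` is also the SSH escape, so the tilde needs to be doubled (`~~.`) to reach ipmitool.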
The other thing that varies is your PXE setup. We have a PXE menu that allows us to select between many different Linux distributions with various configurations. My usual preference is to do a minimal install, just enough to get the machine up, on the network, and accessible via SSH. This is because I will almost always do an upgrade of all packages (.deb/.rpm) on the system once it is booted. I also try to make sure I don’t have any major restrictions on disk space. Some of the automated provisioning approaches make the root filesystem or the home filesystem arbitrarily small. For development, I need to be able to build a Linux kernel and often a lot of other software. I don’t want to run out of disk space. A partitioning scheme that is logical for a production system may not work for me. My Ops team provides an option that has Fedora 39 Server GA + Updates, Minimal, Big Root. This serves my needs.
I tend to reuse the same machines, and thus have their SSH host keys recorded in ~/.ssh/known_hosts. After a reprovision, this information is no longer accurate and needs to be replaced. In addition, the newly provisioned machine will not have an SSH public key on it that corresponds to my private key. If only they used FreeIPA…but I digress…
If I try to connect to the reprovisioned machine, I get:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ED25519 key sent by the remote host is
SHA256:EcPh9oazsjRaC9q8fJqJc8OjHPoF4vtXQljrHJhKDZ8.
Please contact your system administrator.
Add correct host key in /home/ayoung/.ssh/known_hosts to get rid of this message.
Offending ED25519 key in /home/ayoung/.ssh/known_hosts:186
remove with:
ssh-keygen -f "/home/ayoung/.ssh/known_hosts" -R "10.76.111.73"
Host key for 10.76.111.73 has changed and you have requested strict checking.
Host key verification failed.
The easiest way to wipe the information is:
ssh-keygen -R $DEV_SYSTEMIP
Coupling this with provisioning the public key makes sense. And, as I wrote in the past, I need to set up SSH key forwarding for GitLab access. Thus, this is my current SSH prep function:
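# Only run this once per provisioning
dev_prep_ssh(){
    # Drop the stale host key left over from the previous install.
    ssh-keygen -R "$DEV_SYSTEMIP"
    # Install my public key; the new host key is accepted without prompting.
    ssh-copy-id -o StrictHostKeyChecking=no "root@$DEV_SYSTEMIP"
    # Add GitLab's host keys to the new machine's known_hosts.
    ssh-keyscan gitlab.com 2>&1 | grep -v \# | ssh "root@$DEV_SYSTEMIP" "cat >> .ssh/known_hosts"
}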
The first two steps could be done via Ansible as well. I need to find a better way to do the last step via Ansible (the lineinfile module seems to be broken by this), or to script it in bash so that it is idempotent.
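One way the last step might be made idempotent in bash is to append only those lines that are not already in the target file. This is a sketch rather than something I have settled on, and the helper name append_missing_lines is hypothetical:

```shell
# Append each line read from stdin to the target file,
# skipping lines that are already present, so repeated runs add nothing.
append_missing_lines(){
    target="$1"
    touch "$target"
    while IFS= read -r line; do
        # Skip blank lines.
        [ -z "$line" ] && continue
        # -x matches the whole line, -F treats the key as a fixed string.
        grep -qxF -- "$line" "$target" || printf '%s\n' "$line" >> "$target"
    done
}

# Possible use for the GitLab host keys. This needs bash locally, since
# declare -f is a bashism, and a POSIX-compatible login shell on the remote side:
#   ssh-keyscan gitlab.com 2>/dev/null | grep -v '^#' |
#       ssh root@$DEV_SYSTEMIP "$(declare -f append_missing_lines); append_missing_lines .ssh/known_hosts"
```

Shipping the function definition inside the remote command keeps everything in one script, at the cost of assuming that the remote shell accepts the same function syntax.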