Something I did in a commit broke my code. I have a git bisect that shows me the last good commit and the first bad one.
The code in question is a device driver for MCTP over PCC. MCTP is a network protocol used to have different components on a single system talk to each other. PCC is a shared buffer mechanism with a “doorbell” or register-that-triggers-an-interrupt. Another team needs to use this driver.
In looking through the output, a co-worker stated “it looks like you are sending the wrong buffer.” I suspected he was correct.
The buffers are pulled from ACPI information in a table specific to PCC, called the PCCT.
This is from the PCCT, which is common to both the good and bad run, and shows the machine (physical) addresses of the memory.
03 [Extended PCC Master Subspace]
Base Address = 00004000000D4004
04 [Extended PCC Slave Subspace]
Base Address = 00004000000D4404
Here is output data I get from a run of code from tip of tree.
[ 644.338469] remap_pcc_comm_addresses outbox pcc_chan->shmem_base_addr = 4000000d4004 mapped to ffff800080621004
[ 644.348632] remap_pcc_comm_addresses inbox pcc_chan->shmem_base_addr = 4000000d4404 mapped to ffff80008061f404
[ 644.338469] remap_pcc_comm_addresses outbox pcc_chan->shmem_base_addr = 4000000d4004 mapped to ffff800080621004
[ 644.348632] remap_pcc_comm_addresses inbox pcc_chan->shmem_base_addr = 4000000d4404 mapped to ffff80008061f404
[ 828.014307] _mctp_get_buffer pcc_chan->shmem_base_addr = 1 ptr = 000000007f3ab582
[ 828.021950] mctp_pcc_tx buffer = ffff80008061f404
What I see is that I am sending data on the inbox address, not the outbox. So where did I swap it?
541392a553a2c2b7f741d89cff0ab8014a367299 is the first bad commit.
Lets take a look at the code in that commit. The diff shows a lot of changes in _mctp_get_buffer. Here’s what it looks like in non-diff form
static unsigned char * _mctp_get_buffer (struct mbox_chan * chan){ void __iomem * pcc_comm_addr; struct pcc_mbox_chan * pchan = (struct pcc_mbox_chan *) chan; if (pchan == mctp_pcc_dev->out_chan){ pcc_comm_addr = mctp_pcc_dev->pcc_comm_outbox_addr; }else{ pcc_comm_addr = mctp_pcc_dev->pcc_comm_inbox_addr; } pr_info("%s pcc_chan->shmem_base_addr = %llx ptr = %p \n", __func__, pchan->shmem_base_addr, pcc_comm_addr ); return (unsigned char * )pcc_comm_addr; } |
How could this fail? Well, what if the cast from chan to pchan is bogus? If that happens, the if block will not match, and we will end up in the else block.
Why would I cast like that? Assume for a moment (as I did) that the definitions looked like this:
struct pcc_mbox_chan { struct mbox_chan chan; } |
Ignore anything after the member variable chan. This is a common pattern in C when you need to specialize a general case.
This is not what the code looks like. Instead, it looks like this:
struct pcc_mbox_chan { struct mbox_chan* chan; } |
The structure that chan points to was allocated outside of this code, and thus cannot be cast.
The else block always matched.
The reason the old code ran without crashing is it was getting a valid buffer, just not the right one. I was trying to send the “outbox” buffer, but instead sent the “inbox” buffer.
Here is the corrected version of the if block.
if (chan == mctp_pcc_dev->out_chan->mchan){ pcc_comm_addr = mctp_pcc_dev->pcc_comm_outbox_addr; }else if (chan == mctp_pcc_dev->in_chan->mchan){ pcc_comm_addr = mctp_pcc_dev->pcc_comm_inbox_addr; } if (pcc_comm_addr){ pr_info("%s buffer ptr = %px \n",__func__, pcc_comm_addr ); return (unsigned char * )pcc_comm_addr; } |
What made this much harder to debug is the fact that the “real” system I was testing against had changed and I was not able to test my code against it for a week or so, and by then I had added additional commits on top of the original one. This shows the importance of testing early, testing often, and testing accurately.
To give a sense of how things got more complicated, here is the current version of the get_buffer function:
static unsigned char * _mctp_get_buffer (struct mbox_chan * chan){ struct mctp_pcc_ndev * mctp_pcc_dev = NULL; struct list_head *ptr; void __iomem * pcc_comm_addr = NULL; list_for_each(ptr, &mctp_pcc_ndevs) { mctp_pcc_dev = list_entry(ptr,struct mctp_pcc_ndev, head); if (chan == mctp_pcc_dev->out_chan->mchan){ pcc_comm_addr = mctp_pcc_dev->pcc_comm_outbox_addr; }else if (chan == mctp_pcc_dev->in_chan->mchan){ pcc_comm_addr = mctp_pcc_dev->pcc_comm_inbox_addr; } if (pcc_comm_addr){ pr_info("%s buffer ptr = %px \n", __func__, pcc_comm_addr ); return (unsigned char * )pcc_comm_addr; } } //TODO this should never happen, but if it does, it will //Oops. Handle the error better pr_err("%s unable to match mctp_pcc_ndev", __func__); return 0; } |
Why is it now looking through a list? One of the primary changes I had to make was to move the driver from only supporting a single device to supporting multiple. Finding which buffer to get from struct mbox_chan is again going from general to specific.
This code is still in flight. The list lookup may well not survive the final cut, as there is possibly a better way to get the buffer from the original structure, but it is not clear cut. This works, and lets the unit tests continue to pass. This allows the team to make progress. It might be technical debt, but that is necessary part of a complicated development integration process.