My first several months at Penguin (Scyld, actually) were spent chasing down a really nasty bug. Our software ran a modified version of the 2.4 Linux Kernel along with a kernel module of our own. The problem was that it would periodically freeze the system. This was such a pronounced problem that my boss, Richard, had put a tape outline of a bug on the floor of his cube. Every time they thought they had squashed the bug, it would reemerge. While the system had many issues to address, none were more critical than solving this stability issue.
When a system freezes, the primary debugging tool is a message that starts with the word ‘oops’, and such messages are therefore called ‘oops’-es. This is the Linux equivalent of the Blue Screen of Death: the kernel spits out a message and freezes.
The Linux Kernel has come a long way since 2.4.25, the version Scyld shipped when I started (or thereabouts). Nowadays, when the kernel oopses, it spits out what is called a stack trace, showing which functions had been called at the time of the problem. By tracing down through the function calls, you can usually figure out fairly quickly what the major symptom of the problem is, and from there work backwards to the root cause. Under the 2.4 kernel, we didn’t get a stack trace. Instead, we got a dump of the stack in its raw form, which we then had to run through a post-processor (ksymoops) that looked at the data and the layout of the kernel and gave a best guess at the call stack.
There were two problems right out of the gate: we needed to reproduce the problem, and we needed to capture the oops message. Since the bug happened intermittently, it usually took several hours to reproduce. Because we ran our head nodes with a graphical front end, we didn’t necessarily even see the oops on a customer system. Periodically someone would get a glimpse of it, or send a digital photo of some segment of it.
The easier problem to solve was capturing the oops message. You can modify the options given to the kernel so that all console output is also echoed to the serial port. We connected another computer to the serial port by cable and ran a program called minicom, which let us watch what was happening on the head node’s console.
The harder problem was reproducing it. Early on we knew that the problem was related to connection tracking: on certain systems, when connection tracking was turned on, booting a compute node would oops the master.
For people in the enterprise world, it may come as some surprise that this type of behavior was tolerable in a shipped product. However, the high performance computing world is a little different. Our customers are pushing machines to their limit in an attempt to get the most number crunching done in the least amount of time. Much of the system was designed to limit overhead so that as few CPU cycles as possible would be wasted on administrative tasks.
The bug was triggered when the system was running our code, Broadcom’s custom gigabit Ethernet driver (bcm5700), and IP Masquerading.
Since we knew what situation caused it, it was easy to tell people how to work around it. They did so, grudgingly. The workaround was to either run a slower Ethernet driver or not run IP Masquerading. The slower driver meant slower performance; not running IP Masquerading meant no firewall around the system. Most people did without the firewall.
I mentioned this problem to a friend of mine, who pointed out that I was running in the Linux Kernel, and that stack space was limited. Each process in the (2.4) Linux Kernel has two pages allocated to hold both the stack (for kernel-mode, not user-mode, execution) and the structure that represents the process (struct task_struct). The task_struct is allocated at the beginning of these two pages, and the stack starts at the end and grows toward the beginning. If the stack gets too large, it overwrites the task_struct.
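To make the layout concrete, here is roughly what the 2.4 headers did, reconstructed from memory and simplified (the real definitions lived in include/linux/sched.h and the x86 current.h, and exact names and sizes varied by architecture):

```c
/*
 * One two-page (8K on x86) allocation holds both the task_struct and
 * the kernel stack: the struct sits at the low addresses, and the
 * stack pointer starts at the top of the block and grows down toward it.
 */
union task_union {
	struct task_struct task;	/* a couple of K of bookkeeping */
	unsigned long stack[8192 / sizeof(unsigned long)];
};

/*
 * The kernel exploited this layout to find the current task: mask the
 * stack pointer down to the 8K boundary and you are looking at the
 * task_struct.  The flip side is that a stack that grows too deep
 * silently scribbles over that same structure.
 */
static inline struct task_struct *get_current(void)
{
	struct task_struct *tsk;
	__asm__("andl %%esp,%0" : "=r" (tsk) : "0" (~8191UL));
	return tsk;
}
```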
This became our hypothesis. To test it, we took two approaches. I attempted to trigger a debug break during an overflow; I never succeeded in getting this to work. Meanwhile, a co-worker implemented what we called a DMZ: the area immediately after the task_struct was painted with a magic value (0xDEADBEEF). On each context switch, we checked to see if this area had been corrupted. If it had, we knew that the high-water mark of the stack was dangerously close to the task_struct. Practically speaking, we knew we had an overflow. This worked, and our bug was cornered.
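In sketch form, the check looked something like this (dmz_init, dmz_check, and the size of the guard region are my illustration here, not the actual patch):

```c
#include <linux/kernel.h>
#include <linux/sched.h>

#define DMZ_WORDS 64			/* illustrative guard-region size */
#define DMZ_MAGIC 0xDEADBEEFUL

/* Paint the region immediately above the task_struct when the task
 * is set up. */
static void dmz_init(struct task_struct *tsk)
{
	unsigned long *dmz = (unsigned long *)(tsk + 1);
	int i;

	for (i = 0; i < DMZ_WORDS; i++)
		dmz[i] = DMZ_MAGIC;
}

/* Called from the context-switch path: any word that has lost its
 * magic value means the stack's high-water mark reached the guard
 * region, i.e. we have (nearly) overflowed. */
static void dmz_check(struct task_struct *tsk)
{
	unsigned long *dmz = (unsigned long *)(tsk + 1);
	int i;

	for (i = 0; i < DMZ_WORDS; i++) {
		if (dmz[i] != DMZ_MAGIC) {
			printk(KERN_EMERG "DMZ corrupted for %s (pid %d): "
			       "kernel stack overflow\n", tsk->comm, tsk->pid);
			break;
		}
	}
}
```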
To avoid the bug, we had to shrink the amount of memory placed on the stack. Static analysis of our code showed that we were allocating a 2K scratch space on the stack. Since a page is only 4K, and each process gets less than two pages of stack, this meant we were throwing away over a quarter of our free space. Changing this to a vmalloc-ed buffer made the bug go away. For a while.
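The change amounted to something like this (hypothetical names, and I'm glossing over details like error handling and exactly where the buffer was set up; the point is that the 2K moves off the stack and into vmalloc-ed memory):

```c
#include <linux/errno.h>
#include <linux/vmalloc.h>

/* Before: a 2K scratch buffer carved out of the roughly 6K of usable
 * kernel stack, over a quarter of the free space gone in one frame. */
static int handle_work_old(void)
{
	char scratch[2048];

	scratch[0] = 0;
	/* ... build and use the scratch data ... */
	return 0;
}

/* After: the scratch space lives in vmalloc-ed memory, allocated once
 * when the module loads. */
static char *scratch_buf;

static int scratch_buf_init(void)
{
	scratch_buf = vmalloc(2048);
	return scratch_buf ? 0 : -ENOMEM;
}

static int handle_work_new(void)
{
	scratch_buf[0] = 0;
	/* ... build and use scratch_buf; nothing large lands on the stack ... */
	return 0;
}
```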
Our code was not the only offender when it came to sloppy stack allocations. Eventually, even with the reduced footprint from our code, the other modules (bcm5700, ipmasq, etc.) were enough to cause the stack to overflow again. Fortunately, by now we had the stack overflow detection code available, and we were able to identify the problem quickly. The end solution was to implement a second stack for our code. Certain deep code paths that used a lot of stack space were modified to use values stored in a vmalloc-ed portion of memory instead of local variables. The only local variable required for each stack frame was the offset into the vmalloc-ed area. This was a fairly costly effort, but it removed the bug for good.
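A stripped-down illustration of the idea (all names here are hypothetical, and the real implementation had to hand out a region per task and manage it carefully; the point is that bulky per-frame state lives in vmalloc-ed memory, and the only automatic variable each frame carries is its offset into that area):

```c
#include <linux/errno.h>
#include <linux/vmalloc.h>

#define SIDE_STACK_SIZE (16 * 1024)

/* State that used to be a large local variable in a deep code path. */
struct big_frame {
	char scratch[1024];
	int flags;
};

static char *side_stack;	/* one region here; the real code was per-task */

static int side_stack_init(void)
{
	side_stack = vmalloc(SIDE_STACK_SIZE);
	return side_stack ? 0 : -ENOMEM;
}

static void side_stack_exit(void)
{
	vfree(side_stack);
}

/* Each "frame" claims its state by bumping the offset; nothing large
 * ever lands on the real kernel stack. */
static int deep_helper(unsigned long off)
{
	struct big_frame *f = (struct big_frame *)(side_stack + off);

	f->flags = 0;
	/* ... do the work, calling further helpers with
	 *     off + sizeof(*f) so they get the next slot ... */
	return 0;
}
```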
I hope someone else can learn from our experiences here.