Working on the Beowulf Process

I am currently listed as one of the maintainers of the BProc project on Sourceforge. Unfortunately, my current life has left me little enough time to do my job and be a father, so other projects fall by the wayside.

BPRoc portrays a cluster of computers a single system from an operating system perspective. A process running anywhere one the cluster shows up in the process tree on the head node. Signals sent on any machine were forwarded to the machine where the process was actually running. A process could voluntarily migrate from one machine to another. All of these techniques take place in the Linux Kernel. Maintaining this code requires understand of operating system concepts such as signal delivery, page table organization, dynamic library linking, as well as network programming. I’ve never had more fun coding.

Linux kernel development is done by taking a copy of the kode published at kernel.org and applying a file that contains the differences between how it looks at the start and how you want it to look at the end. This file is called a patch. THe major Linux distributions all have a version of the Linux Kernel that they select as a starting point, and then a series of patches that they apply to deal with issues they care about. For instance, I am running Red Hat Enterprise Linux 4 machine with a kernel version of 2.6.9-55.0.9. The 2.6.9 is the version that they got from kernel.org. The 55.0.9 indicates the number of major and minor upgrades they have made to that kernel. The number patches applied when last I looked was in the neighborhood of 200. All of the changes we applied to the Linux kernel was maintained in a single patch. As we maintained succeeding version of the kernel, we continued to generate newer versions of that patch. In addition to this code, we had a separate, and much larger, portion of code that was compiled into a binary format that could loaded into the Linux Kernel on demand. The majority of the code in the patch was merely hooks into the code that called out to the loadable kernel modules.

Penguin had branched from the Sourceforge BPRoc before I joined. As such, Sourceforge had already moved on to the 2.6 series Linux Kernel while we were still on the 2.4 series. This was a major difference in the code base, and there was little grounds for sharing. When we did finally start moving to the 2.6 code base, we had one Marketing requirement that the Sourceforge project did not: We needed to interoperate with the Linux Kernel shipped by RedHat for there Enterprise Linux product (RHEL). I spent a long time in research mode trying to make this happen. Two major decisions came out of this. First, PID masquerading had to go. Second, we needed to use binary patching in place of many of the source level patches.

Every process in an operating system has an integer process identifier (PID) that other processes and the kernel can use to access that process. A major mechanism in BProc was the ability to migrate a process from one physical machine to another. PID masquerading is a technique that ensures that the process identified does not have to change during migration. Instead, each process has two identifiers. The first is the ID as allocated on the head node, and is used when reporting information to the head node, other nodes, or user land functions. The second ID is the PID allocated on the local machine, and only used inside to local machines Kernel. When a function like getpid is called, the process identifier returned is the masqueraded PID, not the local PID. PID masquerading has both positive and negative implications. With PID masquerading, a given compute node can actually have two completely separate pools of processes that cannot communicate with each other. each of the pools of processes can be driven from a different head node. This allows the sharing of compute nodes between head nodes. A given machine can actually act as both a head node and a compute node. This was a requirement in early Beowulf clusters, but was no longer necessary by the time I worked on them. The negative impact of PID masquerading was the amount of code required to support it. Every PID reference in the Linux kernel had to be scrutinized for whether it should be a local or remote PID. If it needed to be translated, a hook was inserted that said “If the module is loaded, and this process is a masqueraded process, return the masqueraded PID, otherwise return the real PID.” This type of logic composed approximately a quarter of the BProc Linux Kernel patch. There was no practical way we could inject all of this code without source level patching the kernel.

Binary patching means changing the machine code on a running system. There are two assembly instructions we looked for to see if we could change code. They are CALL and JUMP. Actually, there are two types of jumps, long and short, and we can use either of them. We did analysis of the compiled Linux kernel for places with these instructions near our current set of hooks. The CALL instruction is what maps to a function call in C. In assembly it looks like CALL 0x00000000, where the zeros will be replaced by the linker or the loaded with an address in memory where the function resides. Once we know where the call operation takes place, we can replace the value with our own function. This technique is often used with malicious intent in virus and root kits, but really is not much different than how a debugger or many security software packages work. During the replacement process, we record the original value of the function, so that we can unload our module and return it to it’s original flow. The compiler will often use a JMP instruction in the place of a CALL instruction as an optimization called a “tail call.” All this means is that when the called function returns, instead of returning to the location it was called from, it continues up the call stack. I discussed this in the CS GRE problem set post.

One place that we had to hook to make this work was the function and structure that allocated the process identifiers. The function alloc_pidmap gets a PID from a bitmap. The bitmap is just a page of memory treated as an array of bytes. Bit zero of page[0] represents the PID 0, Bit 1 represents PID 1, and so on. If a given bit is set to 1, there exists a structure in memory that is using that process ID. In standard configuration, a page in Linux is 4k bytes. 1024*4*8=32768, which is the largest number that can be held in a 16 bit signed integer. PIDs have traditionally been 16 bit signed integers in Unix, and Linux. We used a couple tricks to help out here. On the Head node, we set all PIDs less than some threshold (we chose 1000) to be 1, indicating to the system that it should not allocate those pids. On compute nodes, we set all PIDs greater than the threshold to be 1. PIDs to be visible across the entire cluster were allocated on the head node. PIDs allocated for local work on the compute node would be guaranteed not to class with PIDs from the head node.

Aside:Â recent versions of the Linux kernel have expanded PIDs to be 32bit signed integers.Â At first it was tempting to expand the allowable PIDs, statically partition the PID space amongst the compute nodes, and allow local allocation of PIDs.Â We origianlly pursued this approach, but rejected it for several reasons.Â First, the Linux Kernel set an arbitrary limit of 4*1024*1024 on the number of PIDS.Â We wanted to be able to support clusters of 1024.Â This means that any node on the cluster had only 4*1024 PIDs to allocate.Â Since the vast majority of PIDs were handed out on the head node anyway, we had to do some unbalanced scheme where the head node go something in the neighborhood of 16000 PIDs, leaving a very small pool to be handed out on each the compute nodes.Â Additionally, a compute node crash erased all record of the PIDs that had been handed out on that machine.Â Replacing a node meant rebuilding the pidmap from the existing process tree, a very error prone and time consuming activity.Â Also, many applications still assumed a 16 bit PID, and we did not want to break those applications.

We found that there were several race conditions that hit us if we relied solely on the pidmap structure to control PIDs. This we ended up hookingÂ alloc_pidmap, checking for a compute node or head node, and checking that the returned PID was withing the appropriate range.Â This type of code pathis frowned up in the mainline Linux kernel, but we found no noticable performance hit in our benchmark applications.

One benefit to this approach was that we could then slowly remove the PID Masquerading code.Â We continued to track both the masqueraded and real PIDs, but they were assigned the same value.Â Thus we never broke the system as we restructured.

Adam Young's Web Log

The Notebook of a Programmer Climber Musician Ex-Soldier Woodworker and a few other things

Leave a Reply