The Army Team’s The Pride and Dream…

In two hours I board a plane for Baltimore.  Tomorrow I will watch the Army-Navy Football Game live for the first time since I was a Cadet.  In November of 1992, Army beat Navy.  I lost my keys, and was the last person besides the cleaning staff in Veteran’s Memorial Stadium in Philadelphia. This year, I am meeting up with three other classmates for the weekend.  I know that Army has had a tough time the last several years.  I suspect it is easier for Navy to Recruit than it is for Army.  Still, anything can happen in this game.  I just hope it is interesting to watch.

Go ARMY!  Beat NAVY!

Back to BProc

The last check in to the BProc CVS repository on sourceforge happened 16 Months ago. I recently checked out the top of tree and found I was unable to build. Looks like what is there is a mix of 2.6.10 and something in the vicinity of 2.6.20 Linux code bases. I am starting again, this time with the code in the tarball. I’ve built this before and know it compiles. Here is my general plan forward:

1. Get a 2.6.9 Kernel with the BPRoc patch applied to boot on a RHEL4 System.

2. Build the BPRoc and VMADump Kernel modules and load them into the kernel.

3. Build the BPMaster and BPSlave binaries. Make sure BPMaster runs.

4. Build the beoboot code.

This is where it gets tricky. At Penguin we had our own PXE Server (Beoserv) that handled provisioning a compute node. Part of the Beoboot package there was creating the root file system and bring up the slave node binary. So here is a tentative plan instead.

1. Deploy the standard redhat PXE and DHCP servers on my head node. Ensure that the DHCP server only responds to requests from the subnet where the compute node resides. Probably best to unplug from the company network when I do this.

2. Set the PXE server to support the booting of a stripped down RHEL4 system. Really, all I want is to get as far as running init.

3. Replace the init in the PXE IMage with the beoboot binary. Have it bring up BPSlave and see if it can talk to BPMaster on the head node.

If I can get this far, I will consider it a great success.

Update 1: I built a 2.6.9 Linux Kernel with the bproc patch applied. makeoldconfig, selected BProc but none of the other options. Upon BootingI got a panic when it could not find device mapper. Looks like device-mapper got added in the 2.6.10 kernel. Since I have already built that kernel, I guess I’ll start by trying the tarball kernel module code against the 2.6.19 patch.

Update 2: Um, nope. TASK_ZOMBIE and mmlist_nr are showing up as undefined symbols. mmlist_nr seems to be acount of the number of memory managers out there. I suspect that this is something that changed between 2.6.9 and 2.6.10. Probably some better way to keep the ref counts was introduced. I Vaguly remember something about the TASK_ZOMBIE.

Update 3: This was bogus and I removed it.
Update 4: Replaced TASK_ZOMBIE with EXIT_ZOMBIE. Commented out the decrement as it seems like it has just been removed.

Update 5: Error accessing rlim in task_strcut. This is now in the signal struct:

– unsigned long gap = current->rlim[RLIMIT_STACK].rlim_cur;
+ unsigned long gap = current->signal->rlim[RLIMIT_STACK].rlim_cur;

Update 6: OK, back to the point I found before. THe hook for kill_pg_info is now kill_pgrp info, and the hook for kill_proc_info is now kill_pid info. This is a change in the patch, so I have to get the module code in line with the new function call parameters. Looks like the header has been changed, but the old function call names are using in kernel/signal.c. Changing, rebuilding, and redeploying kernel.

Update 7:  Success through building and running bpmaster.    I had to create a config directory, but other than that, nothing was too far out of the ordinary.

Graph Theory

The next step in my preparation for the CS GRE is graph theory. The cool thing about graph theory is it is the basis for representing anything to do with relationships, especially spacial relationships.

I have GPS system in my car that routinely gets me to locations via routes I didn’t know. It is certainly not always the fastest route, but it works. If only I had that as an Platoon Leader. Actually, I did have a GPS, but it didn’t do route finding. Just told you where you were. I wonder how many people have tried to sneak a GPS in to Ranger School.

Back to graph theory. Take any set of entities and relationships between them, and you can represent the whole thing as a graph. Well, to an extent. According to the strict definition of a graph, two nodes can only be connected by at max one edge. Thus if you were representing a relationship between people, showing both family and work relationships, then it would not be possible to show that I worked for my Dad.

Graphs show up all over CS. The places they are discussed the most seem to be Networks, where the nodes are machines and the edges are network connections between them. But graphs show up in language parsing , data structures, graphics, object oriented design, databases. Graphs get hit heavily in algorithm courses as many of the most interesting problems come from graph theories.

Two basic problems in graph theory come up time and again: Minimum spanning tree and traveling salesman. Both problems assume that each edge has a weight. Minimum spanning tree attempts to connect all nodes in a graph into a tree with minimum total weight. Traveling salesman attempts to visit each node with a minimum weight path.

Let’s start with a simpler problem: shortest path. This is the problem that my GPS has to deal with when I say “Plot me a course to the Gym.” Given a weighted graph with one node designated the start node and a different node designated the end node, find the path with the least total weight.

There are several things to keep in mind. First is cycles. If your path visits the same node more than once, there exists a shorter path that visits that node only once. Note that this may not be true of driving in Boston, where you can’t take a left from Commonwealth Ave onto the BU bridge, but instead have to go another block, take a right, and loop around to cross the bridge from the south. But in a simple graph, you have to identify if you are visiting a node that you have visited before. There are two ways to do this depending on whether you are willing to spend more time or more memory on the problem. The time intensive way is to use a list, and add each node to the list. When you visit a new node, you can perform a search of the list to see if you have been there before. If so, disregard that path segment. If not, add it to the list. The alternative is to record a state in each node. When you visit the node, you check to see if the state reflects that you have already visited it. If not, mark the node as visited and continue.

That is what I can remember off the top of my head. Obviously, this is a much discussed topic. There are many Wikipedia articles on this and related topics, so I won’t attempt to reproduce that here.

Deploying for Operation Uphold Democracy-1995

CPT DiChairo called me into his office. “Adam, 4-87 is going to Haiti. They need Lieutenants.”

“Sir, I volunteer.”

The words might have been slightly different. It has been 12 years. It was an important moment in my life, so it has taken on a bit of it’s own weight. Memories like this that are often chewed over end up showing the teethmarks of selective memory. Still, the exchange left me tagged to move from a Platoon in C Company, 4th Battalion, 27th Infantry (Wolfhounds) to fill a slot for the understrength unit tagged to deploy to Haiti for Operation Uphold Democracy. This was early December, 1994.

The media had been showing images of Haitians arriving in Florida on makeshift rafts. Three months earlier, the Forces Armed D’Haiti (FADH, the Army of Haiti) had fled pending a United States lead invasion. The US was staged at Guantanamo Bay and on aircraft carriers in the Caribbean. The FADH had taken control of the country in a coup that prevented the elected President, Jean-Bertrand Aristead, from taking office. Conditions in the impoverished country had deteriorated to the point that the haitians preferred to risk setting off in makeshift rafts to to try and reach the coast of the USA then stay in Haiti.

Haiti has had a rough history. After the Colonies shucked off English rule to establish ourselves as the United States of America, the small Island colony did the same, throwing off French rule. America still practiced slavery itself, and was unwilling to support a slave uprising so close to it’s own borders. Over the years America has intervened in Haiti with military force several times. I remember reading about the early career of Marine General Chesty Puller who first saw combat in Haiti early in the 20th Century.

J.P. Green, my friend and house-mate, as well as a fellow LT in the Wolfhounds, was also tagged, but for the task of Brigade S5, Public Affairs officer. We were both tab-less bastards, having failed to complete the Ranger course the previous winter. I suspect that the selection process took that into account. I still am not sure what he did to get that posting, if it was a good thing or a bad thing. As we had a compressed time-frame, we decided to take a compressed Christmas vacation. Along with Mike Pickett, the Battalion Assistant S-2 (Intelligence officer), JP and I headed over to Maui from Oahu for 5 day weekend. We partied in Lahaina, drove over to the Sacred Pools, flirted with mormon girls. One night, I fell asleep in the back of a night club after a long night. JP and Mike found me there.

With that out of my system, I settled in for the train up. Until this point, the majority of my military training had been for war. As a light infantry platoon leader, I was trained to lead a group of 35 soldiers to “Close with and destroy the enemy by means of fire an maneuver in order to destroy or capture him, or to repel his assault by fire, close combat and counter attack.” as I memorized as a plebe. Now I was training for “Operations other than War” or OOW. We trained in convoy movements, crowd control, cordon and search, and defending food distribution. At the firing range we trained in “Quick Kill” exercises. We learned the graduated responses to riot, from issuing vocal commands, through pepper spray, helicopter down wash, and deadly force. We did a live fire exercise in a range on Schofield where certain targets wore civilian clothes. I donated a shirt of mine to the training, and recieved it back with a bullet hole.

We did a video conference with the folks from the 10th Mountain Division. The conference washeld at the USARPAC Headquarters on Oahu, Fort Shafter. On the way down, the Bde Commander asked what I was going to ask. I replied that I was going to listen and keep my mouth shut. He explained that this was the wrong attitude. I realized I didn’t know what we were doing this for, so I asked for more info. Not the first time I was unprepared for what the Army threw at me.

I started hanging around with Erik Anderson, on of the two other line platoon leaders in D Co, 4-87. We raced around Diamond Heights in our suits looking to find the secluded site that our Company First Sergent was getting married at. We showed up late, but still managed to with Top best of luck. I have a photo of us at the beach, on Christmas day.

Lots of the details have been lost. I know we did deployment prep at the stadium across from D quad on Schofield, took busses down to Hickam AFB and flew Tower Air to Port-Au-Prince (PAP in Milspeak), and then took smaller planes, probably C130s up to Cap Haitian Haiti. Due to the war powers resolution, we knew that the president only had 89 days before having to redeploy us. We shook hands with the outgoing 10th Mountain Division and settled in for the Haitian Vacation.

Build systems

For the non programmers, the word ‘make’ in the following paragraphs refers to a tool used to control the compilation of a software product.

At work we are in transition to using scons as our build driver. I guess enough people were frustrated with Make that they were able to change things over. Here is the problem.

You can never get away from make.

Any build system that uses code from some where else is, eventually going to have to deal with make. Your standard GNU package is a tarball that you unpack and then run:

./configure

make

make install

Or some variation on this. At work, the build team wanted to standardize on the version of autotools that they use. Makes sense, you should not check generated files into a repository if it makes more sense to regenerate them…but does it? Well, you need to be able to modify the Makefile.in and Configure.in files, so yes, every developer needs to run Automake/Libtoolize/Autoconf/etc with the same options and the same version of the tools. I’m not certain that autotools really make sense for a Linux only project, which was why we didn’t use them at Penguin and why we have to use them for the packages we have here.

People don’t like make because they either don’t like rules based programming or they don’t like debugging make files. Fair enough…but deal with it. Instead of trying to rewrite the tool, extend it. I am not talking about your average developers, just the ones that decided they can do it better using X, where X is some other language or some other library.

Personally, I’ve never found a replacement of make that didn’t end up sporting some of make’s warts. If your goal is to do incremental compilation, then the Ant approach, depending on the compiler to identify which file to build, won’t work for most languages. This is a case where you need to recognize in a Makefile that you don’t want to have a .SUFFIXES rule (.java.class) and instead, if you depend on the Java code, you let javac build everything every time. See the difference? One depends on a tool support for something, and fails when it isn’t there. The other allows you to take advantage of the tool support if it is there.

One reason that people don’t like make is that it is sensitive to white space. SCons is built in Python, also sensitive to white space. I guess those people will keep looking for something else.

Renaming Army Units

One thing I have never quite liked is the fact that in the modern US Army, there are Divisions, and there are Brigades, but there are no regiments.  And a Brigade is commanded by a Colonel.

Origianally, a Regiment was commanded by a Colonel, a Brigade by a 1 start, or Brigadier General, and a Division by a two star General.  At some point, the sizing was wrong and they collapse the Regiment and Brigade into a single unit.  But they kept the term Brigade, not Regiment.

I suppose they are able to keep more regiment guidon’s in active usage this way, but it seems to me that they should just rename the modern Army Brigades to Regiments, and then assignon historical Regiment per.  The Colonel would be the regimental commander, just like in history.

Whe I was with the 25th ID, our brigades had battalions from multiple Regiments.  Thrid Brigade, 25tjh ID was st1 and 4 th Battalion, 27th Infantry, and then we had 1-487 th Infantry in between.  We would have been a hell of a lot more cohesive as the Wolfhound Regiment 1,2,3.

I suppose the places where they have separate infantry Brigades commanded by a Brigadier General (I’ve seen this in the National Guard) would be a little awkward, but really, is it any worse than we have now?  We could even keep those units as SIBs.  Now A brigade goes back to being commanded by a General, and a Regiment by a Colonel.  It is even more straightforward.

Duty, Honor, Country

Is there any part of the Military that has a better slogan? While I know West Point jealously guards this motto, it really should be the motto for the entire Army. I mean, it isn’t just the officers who need to live to this standard. An Army of One. Be All That You Can Be. Too much focus on the individual, not on why you are doing what you do. The Army recruiting posters should use this. The Marines have it right. The Few, the Proud, the Marines. We are an exclusive club, we are good, and, if you show us you have what it takes, we’ll accept you. But who in this modern world can talk of Honor with a straight face? With our country so divided over Iraq, who talks of Patriotism not as a sling in a political battle, but with the quiet reverence due to a set of ideals that we strive to attain. What is Duty, and how does it differ from Country. Why does it come first?

I recently read a book call Absolutely American, by David Lipski. David is a reporter for Rolling Stone. I remember reading his article when it first came out. Seems he took a house in Highland Falls and stayed with the Call of 2002 for the three years (Yearling, Cow, Firstie) Until they graduated. For the past several nights I have had trouble falling asleep as the memories it awoke kept running through my head. I graduated from The United States Military Academy at West Point, class of 1993.
Digression: If you ever see someone wearing a shirt that says USMA, you are looking at someone affiliated with my Alma Mater. The funny thing is most people will read it as USMC. The Marine Corps is gets a lot of great Advertising from the Army. when I was at the recruiting station in Boston when my brother was enlisting, a kid told me he was joining the Marines because he identified with what he saw in Washington DC, them guarding the Tomb of the Unknown Soldier and so on. I told him that was the Old Guard, 3rd Regiment, United States Army. I recently Heard that the Army was returning back to wearing Army Blue as the Everyday uniform as well as the Formal. About time. But people will just keep think that Blue means Marines. Oh well. Back to the Main Thought.
Lipsky makes his shares of mistakes, but they are minor. He gets the jargon right, captures how we talked, as it was only 5 years between my graduation and the start of his research. He didn’t discuss academics at all, which is too bad, as it is the major activity at West Point, but maybe he thought that it was too similar to any other college. Too bad that, as I suspect that is not the case. There is a certain degree of academic freedom elsewhere that you just don’t have at USMA. If a student at Harvard writes an inflammatory satire claiming that your teacher is having affairs with both the men and women in his class, the first Amendment and attitude of the instructors fully protects you. This is not libel, as it isn’t published. At West Point, you get written up, punished, and possibly expelled. When I was a cadet, A couple guys did a spoof on the R.E.M. song Losing My Religion called Losing My Femininity and dedicated it to the Women’s Power Lifting Team. A segment I remember:

That’s me in the corner

That’s Me on the Squat-Rack

Losing my Femininity.

The officer in charge of the team, LTC Christopher, wrote them all up. When my class protested to our English Teacher ( a Military Police Officer, CPT P—Damn my memory is shaky.) He invited LTC Christopher to the class. He came in mad and basically bullied us into silence. I understand his desire to protect the people working for him, but there was no chance to have a dialog. At USMA, probably even more than in the Regular Army, when a superior says some thing is so, the is no argument about it.

I once got written up for running the screen save program at a 2 second interval. I had been poking around in our Unix systems and ran a program (probably named something the like xsc). It responded with a usage message (must be run with -s=X where X is a number between 1 and 10000 or something) and I re-ran it like this xsc -s=2. It ran and returned and I though nothing of it until I got called into my Tac’s office. I had been written up by my Networking instructor. My tac didn’t even understand the quill. I went down and explained myself to my P and that was the end of it. He accepted my word (West Point Honor code) that I was not trying to play a joke or anything, and I stopped poking around.

There are two problems here. The first is that a program like a screensaver should not be runnable against the X windows session unless you own the session. If it is supposed to be run regardless of who is logged in, it should require root privileges. Poor security design. But the second is that a school should not shut down the inquisitiveness of it’s students. I’m sure stuff like this happens at many colleges, but it is the force of discipline at West Point that makes it so fearsome.

The memories keep coming, almost overwhelming now. West Point was an intense four year, more so than any that came before, and only rivaled by some of my later experiences in the Army. I was in no way an model Cadet, but I was not a complete dirt bag. I came to believe in what we were there to do, and did my job. I was and still am a fierce individualist, and that worked against me at USMA. I think it is normal for someone as a professional programmer. I was not drawn inside the camaraderie of our Company as closely as some. Where others played Football or joined the drill team, I joined the Band. Music had been a huge part of my life and I wanted to hold on to it. But it was considered a second tier activity, a way to “Get Over” and avoid drill and ceremony and do something dorky. I did a lot of things like that: medieval studies, Jewish Chapel Choir, and so on. Yes, I identified with Cadet George Rash to a degree. But I had four years on the high school wrestling team, and many years working for Jim Young (Thanks Dad!) to get me physically and mentally tough.

I think West Point and the officer Corp in general would be better served by sending Cadets out to regular Army units as junior enlisted before they they make the commitment Junior Year. I was a drill cadet, and got fired up by the opportunity, but I think I would have been really well served by a six month or one year stint as a line infantry soldier. I realize Camp Buckner is tradition and all, but it would be a hell of a lot better if we went to AIT (follows after Basic Training) at Benning and Knox, then shipped to various posts around the country and be a “Joe” for the remainder of the year. North Eastern University has a Co-Op program, and it doesn’t seem to hurt their admissions process. Why can’t West Point graduate people in five years? One added benefit is that people would drop at the end of the year that really didn’t want to serve in the Army. At the end of four years, you get promoted to E-4. If you want to stay in the Army, you can keep that rank. Knock a year off of what you get for college. When you graduate West Point, you have already earned one year toward retirement. There is no reason ROTC shouldn’t/couldn’t do this as well. If you did it at the end of Plebe year, the Cadets would have a year at WP vested as motivation to come back. And, of course, the officers in their unit jacking them up all the harder if they decided to quit. But everyone would know what IT was about. Maybe they could make an exception for the prior service cadets, or let them retain their old rank out there in the Army.

Duty is strange word and concept. We do our duty as an attempt to pay back the debt we owe. Like original sin, we have a debt just by being born American. This is an interesting concept. What debt do we have? To the men of the Armed Forces that defended our freedoms in the past? To the Aboriginal Americans who died from smallpox and wars with the Europeans? To the African tribesman that were kidnapped and shipped to the agricultural hell of the cotton plantation? This is what you get as an American: the opportunity for a public education, good roads to drive on, a police and fire department that work to keep you safe, a legal system that allows you to challenge it, an economic system that allows you to move up in the world, and a medical system that keeps you healthy. Yes, these are unequally distributed. The higher your social standing, the better the community you can live in, and the better health care you can afford. Many Americans have much fewer opportunities. Perhaps we have a debt to them as well. Not to pull them upwards, but to make sure that the opportunities, the pursuit of happiness, are available to them as well.

With slavery, we destroyed the social structures of the Africans when they were brought to America. While many other ethnic groups have come here, and have been treated badly (I heard something on NPR this morning about Chinese Workers in America) they have, for them most part, come willingly. At a minimum, we owe a debt to the African American community to help repair the social devastation we created.

I think we also have a duty to the people of Iraq to find the best possible solution to the crisis there. I won’t get into whether the invasion was right or wrong. The past is prologue. We now stand at the start of the next page of Iraqi history. How can we minimize damage and maximize opportunity for the average Iraqi? Does an American presence in Iraq deter more violence than it incites? I wish I knew the answer to this question. Is the debt we are incurring in trying to support Iraq detracting from our ability to do our Duty elsewhere?

A favorite interview question of mine

Given the following API:

int bproc_numnodes() ;

void bproc_move(int nodenum);

long gettime();

Write a program that will tell me the average clock skew in a cluster.

I gave this problem to a lot of people back when I was at Penguin. The API and simple math functions really are all you need. No malloc, or socket calls. I did allow people to #include <stdio.h> for output.

If anyone reads this blog, and has an answer, post it to the comments section. I’ll post an answer with a full explanation in a bit.

Working on the Beowulf Process

I am currently listed as one of the maintainers of the BProc project on Sourceforge. Unfortunately, my current life has left me little enough time to do my job and be a father, so other projects fall by the wayside.

BPRoc portrays a cluster of computers a single system from an operating system perspective. A process running anywhere one the cluster shows up in the process tree on the head node. Signals sent on any machine were forwarded to the machine where the process was actually running. A process could voluntarily migrate from one machine to another. All of these techniques take place in the Linux Kernel. Maintaining this code requires understand of operating system concepts such as signal delivery, page table organization, dynamic library linking, as well as network programming. I’ve never had more fun coding.

Linux kernel development is done by taking a copy of the kode published at kernel.org and applying a file that contains the differences between how it looks at the start and how you want it to look at the end. This file is called a patch. THe major Linux distributions all have a version of the Linux Kernel that they select as a starting point, and then a series of patches that they apply to deal with issues they care about. For instance, I am running Red Hat Enterprise Linux 4 machine with a kernel version of 2.6.9-55.0.9. The 2.6.9 is the version that they got from kernel.org. The 55.0.9 indicates the number of major and minor upgrades they have made to that kernel. The number patches applied when last I looked was in the neighborhood of 200. All of the changes we applied to the Linux kernel was maintained in a single patch. As we maintained succeeding version of the kernel, we continued to generate newer versions of that patch. In addition to this code, we had a separate, and much larger, portion of code that was compiled into a binary format that could loaded into the Linux Kernel on demand. The majority of the code in the patch was merely hooks into the code that called out to the loadable kernel modules.

Penguin had branched from the Sourceforge BPRoc before I joined. As such, Sourceforge had already moved on to the 2.6 series Linux Kernel while we were still on the 2.4 series. This was a major difference in the code base, and there was little grounds for sharing. When we did finally start moving to the 2.6 code base, we had one Marketing requirement that the Sourceforge project did not: We needed to interoperate with the Linux Kernel shipped by RedHat for there Enterprise Linux product (RHEL). I spent a long time in research mode trying to make this happen. Two major decisions came out of this. First, PID masquerading had to go. Second, we needed to use binary patching in place of many of the source level patches.

 

Every process in an operating system has an integer process identifier (PID) that other processes and the kernel can use to access that process. A major mechanism in BProc was the ability to migrate a process from one physical machine to another. PID masquerading is a technique that ensures that the process identified does not have to change during migration. Instead, each process has two identifiers. The first is the ID as allocated on the head node, and is used when reporting information to the head node, other nodes, or user land functions. The second ID is the PID allocated on the local machine, and only used inside to local machines Kernel. When a function like getpid is called, the process identifier returned is the masqueraded PID, not the local PID. PID masquerading has both positive and negative implications. With PID masquerading, a given compute node can actually have two completely separate pools of processes that cannot communicate with each other. each of the pools of processes can be driven from a different head node. This allows the sharing of compute nodes between head nodes. A given machine can actually act as both a head node and a compute node. This was a requirement in early Beowulf clusters, but was no longer necessary by the time I worked on them. The negative impact of PID masquerading was the amount of code required to support it. Every PID reference in the Linux kernel had to be scrutinized for whether it should be a local or remote PID. If it needed to be translated, a hook was inserted that said “If the module is loaded, and this process is a masqueraded process, return the masqueraded PID, otherwise return the real PID.” This type of logic composed approximately a quarter of the BProc Linux Kernel patch. There was no practical way we could inject all of this code without source level patching the kernel.

 

 

 

 

 

Binary patching means changing the machine code on a running system. There are two assembly instructions we looked for to see if we could change code. They are CALL and JUMP. Actually, there are two types of jumps, long and short, and we can use either of them. We did analysis of the compiled Linux kernel for places with these instructions near our current set of hooks. The CALL instruction is what maps to a function call in C. In assembly it looks like CALL 0x00000000, where the zeros will be replaced by the linker or the loaded with an address in memory where the function resides. Once we know where the call operation takes place, we can replace the value with our own function. This technique is often used with malicious intent in virus and root kits, but really is not much different than how a debugger or many security software packages work. During the replacement process, we record the original value of the function, so that we can unload our module and return it to it’s original flow. The compiler will often use a JMP instruction in the place of a CALL instruction as an optimization called a “tail call.” All this means is that when the called function returns, instead of returning to the location it was called from, it continues up the call stack. I discussed this in the CS GRE problem set post.

One place that we had to hook to make this work was the function and structure that allocated the process identifiers. The function alloc_pidmap gets a PID from a bitmap. The bitmap is just a page of memory treated as an array of bytes. Bit zero of page[0] represents the PID 0, Bit 1 represents PID 1, and so on. If a given bit is set to 1, there exists a structure in memory that is using that process ID. In standard configuration, a page in Linux is 4k bytes. 1024*4*8=32768, which is the largest number that can be held in a 16 bit signed integer. PIDs have traditionally been 16 bit signed integers in Unix, and Linux. We used a couple tricks to help out here. On the Head node, we set all PIDs less than some threshold (we chose 1000) to be 1, indicating to the system that it should not allocate those pids. On compute nodes, we set all PIDs greater than the threshold to be 1. PIDs to be visible across the entire cluster were allocated on the head node. PIDs allocated for local work on the compute node would be guaranteed not to class with PIDs from the head node.

Aside:  recent versions of the Linux kernel have expanded PIDs to be 32bit signed integers.  At first it was tempting to expand the allowable PIDs, statically partition the PID space amongst the compute nodes, and allow local allocation of PIDs.  We origianlly pursued this approach, but rejected it for several reasons.  First, the Linux Kernel set an arbitrary limit of 4*1024*1024 on the number of PIDS.  We wanted to be able to support clusters of 1024.  This means that any node on the cluster had only 4*1024 PIDs to allocate.  Since the vast majority of PIDs were handed out on the head node anyway, we had to do some unbalanced scheme where the head node go something in the neighborhood of 16000 PIDs, leaving a very small pool to be handed out on each the compute nodes.  Additionally, a compute node crash erased all record of the PIDs that had been handed out on that machine.  Replacing a node meant rebuilding the pidmap from the existing process tree, a very error prone and time consuming activity.  Also, many applications still assumed a 16 bit PID, and we did not want to break those applications.

We found that there were several race conditions that hit us if we relied solely on the pidmap structure to control PIDs. This we ended up hooking  alloc_pidmap, checking for a compute node or head node, and checking that the returned PID was withing the appropriate range.  This type of code pathis frowned up in the mainline Linux kernel, but we found no noticable performance hit in our benchmark applications.

One benefit to this approach was that we could then slowly remove the PID Masquerading code.  We continued to track both the masqueraded and real PIDs, but they were assigned the same value.  Thus we never broke the system as we restructured.