High Availability and dealing with failures

Time to use my pubic forum to muddle through some design issues I’m struggling to lay straight.

A Data center is made up of several objects: Servers ( computers, usually horizontal), racks(hold the computers), switches(a network device that connects two or more computers together), power sources, and cables (both electric and network cables, to include fiber optic for storage devices). A server in the data center can serve on or more roles: storage host, computation host, administration, or user interface. If an application is hosted in a data cetner, it is usually important enough that it requires some guarantee of availability. This application will resided on a computation host, and require access to other types of hosts. An email server stores the received messages in a storage host, probably connected to the computation host via fiber optics. It receives and sends messages via a network connection that goes to the outside world. It may also talk to a User Interface machine that runs a web server and an application that allows web access to email. If the computation host loses connectivity with either the public network or the storage host, it cannot process mail. If the web server loses connectivity to the mail server, certain users cannot access their mail.

There are many reasons that connectivity can fail. The major links in the chain are: OS failure, Network interface card (NIC) failure, bad cable, disconnected cable, bad switch, unplugged switch, switch turned off. Once you pass the switch, the same set of potential problems exist on to the other host. To increase reliability, a network will often have two switches, and each server will have two NICs, one plugged into each switch. The same set up goes for storage, although different technologies are used. As a general rule, you don’t want to have things running in hot standby mode. It is a waste of resources, and it doesn’t get tested until an emergency hits. Thus, the double network connectivity usually gets set up also as a way to double bandwidth. Now if one of the cables breaks, that server merely operates in a degraded mode. The second cable has been passing network traffic already, now it just gets all of it.

A typical data center has many machines. Typical server loads are very low, sometimes in the 1-3% range of overall capacity. Thus, if a machine fails, a data center often has plenty of servers that could absorb the load from the failed server. Figuring out how to cross load services in a data center has been a major topic in the IT world over the past decade. This goes by many names, one of which is grid computing. I’ll use that term myself here. There are several problems any grid system has to solve, but most can be clumped under the term application provisioning. This means getting all of the resources together that a given application requires so that they available on the computation host. These resources include the network and storage connections described above, as well as the executables, data files, licenses, and security permissions required to run the application.

When a host fails, some remote monitoring system needs to act. First, it needs to know that the host has failed. This is typically performed through a heartbeat sensor. This is a simple network communication sent by the computation host saying “I’m still alive.” Cue Mike McCready. When a heartbeat fails, the monitor needs to make sure that the application is up online somewhere as soon as possible. Now, the reason the heartbeat failed might have been because of a problem on the heartbeat network, and the application is actually up and running just fine. An advanced solution is to test the system through some alternative method. In the case of the email server, it may be to connect to the email port and send a sample message. This delays the restart of the application, but may minimize downtime.

Sometimes, two copies of the applications can’t run at the same time. In this case, you have to be sure that the original one is gone. To achieve this, you shut off the original server. This is called “Shoot the other node in the head.” or STONITH. Sometimes the word node is replaced with guy and you get STOGITH. If you do this incorrectly, you may take yourself down, a situation referred to as SMITH. Or you take all servers down, and this is called SEITH. But I digest…

Here’s the part that I am currently trying to decide. If an application depends on a resource, and that resource fails, you can bring the application up on a different server. It will take a non-trivial amount of time (estimate it a minute) to shut down the old instance and bring up the new instance. If, on the other hand, the disconnect is temporary, we can have minimal down time by just waiting for the network to come back up. If someone disconnects a cable by accident, that person can just plug the cable back in. If the network is redundant, removing one cable may result in degraded performance, but it may not.
If the failure is due to the switch being down, just selecting another host connected to the same switch will result in downtime and a situation that is no better than the original. If the problem is the storage host being down, there probably is nothing you can do to recover outside of human or divine intervention.

If a switch goes down but there is another set of hosts on a different switch, you can migrate applications to the new host. But you may end up overloading the new switch. This is referred to as the stampeding herd effect. If the lost switch is degrading performance for all applications dependent on that switch, you best strategy is to migrate a subset of applications to balance the load. After each application is moved, recheck the network traffic to determine if you’ve done enough. Or done too much.

A malfunctioning NIC or switch might manifest in intermittent connectivity.  In this case, you want to get the application off of the original server and on to a new server.  The problem is in distinguishing this from the case where the cable just got unplugged once, and then plugged back in.  From the server’s perspective, the network is gone, and then it is back.  This leads to a a lot of questions. What interval, and how many iterations do you let go by before you decide to bail from that server?  If you have redundancy, does the failing connection impact the user’s experience, or does proper routing ensure that they have seamless connectivity?

Doing Unspeakable things with Templates

The C++ Template language is Turing complete.  That means it should easily be able to build an Acceptor for a Regular expression.  Yes, this is headed where you think it is.  Get out while you still can.

Still with me?  Too bad for you:

 #include <string>

using std::string;

namespace acceptor{

template <typename Next> class Expression{
static bool accept(string& s){
return Next::accept(s, s.begin());

class EndType {
static bool accept(string& s, string::iterator itr){
return(itr == s.end());

class NoOp {
static bool accept(string& s, string::iterator itr){
return true;

template <char valid,typename Next = NoOp > class Char{
static bool accept(string& s, string::iterator  itr){
return (valid == *itr) && Next::accept(s,++itr);

template <char first,char last,typename Next = NoOp >
class RangeAcceptor{
static bool accept(string& s, string::iterator itr){
return ((*itr >= first)
&& (*itr <= last)
&& Next::accept(s,++itr));

template <int count, typename Acceptor ,typename Next = NoOp>
class Count{
static bool accept(string& s, string::iterator itr){
int i;
for ( i = 0; i < count ; ++i){
if (Acceptor::accept(s,itr)){
return false;
return Next::accept(s, itr);

template <typename First, typename Second >
class Or {
static bool accept(string& s, string::iterator itr){
if ( First::accept(s,itr)) {
return true;
} else {
return Second::accept(s,itr);

template <typename First, typename Second >
class And {
static bool accept(string& s, string::iterator itr){
return (First::accept(s,itr) && Second::accept(s,itr));

template <typename Next = NoOp >
class Digit: public RangeAcceptor<‘0′,’9’,Next>{};

template <typename Next>
class LowerLetter : public RangeAcceptor<‘a’,’z’,Next>{};

template <typename Next>
class UpperLetter : public RangeAcceptor<‘A’,’Z’,Next>{};

template <typename Next = NoOp >
class Letter : public Or < LowerLetter<NoOp> , UpperLetter<NoOp> > {};

How would you use this?

 void testDigit(){
Digit <
Digit <
Digit <
Char  < ‘-‘ ,
Digit <
Digit <
Char  < ‘-‘ ,
Digit <
Digit <
Digit <
Digit <
> > > > > > > > > > > >ssn;

string s(“000-22-9999”);
assertTrue(s, ssn.accept(s));

s = “abcdefghij”;
assertFalse(s, ssn.accept(s));

Not shown:  Implementation of the Kleene Star, as that would require a writing a greedy algorithm, something I have no desire to do right now.

Graph Theory

The next step in my preparation for the CS GRE is graph theory. The cool thing about graph theory is it is the basis for representing anything to do with relationships, especially spacial relationships.

I have GPS system in my car that routinely gets me to locations via routes I didn’t know. It is certainly not always the fastest route, but it works. If only I had that as an Platoon Leader. Actually, I did have a GPS, but it didn’t do route finding. Just told you where you were. I wonder how many people have tried to sneak a GPS in to Ranger School.

Back to graph theory. Take any set of entities and relationships between them, and you can represent the whole thing as a graph. Well, to an extent. According to the strict definition of a graph, two nodes can only be connected by at max one edge. Thus if you were representing a relationship between people, showing both family and work relationships, then it would not be possible to show that I worked for my Dad.

Graphs show up all over CS. The places they are discussed the most seem to be Networks, where the nodes are machines and the edges are network connections between them. But graphs show up in language parsing , data structures, graphics, object oriented design, databases. Graphs get hit heavily in algorithm courses as many of the most interesting problems come from graph theories.

Two basic problems in graph theory come up time and again: Minimum spanning tree and traveling salesman. Both problems assume that each edge has a weight. Minimum spanning tree attempts to connect all nodes in a graph into a tree with minimum total weight. Traveling salesman attempts to visit each node with a minimum weight path.

Let’s start with a simpler problem: shortest path. This is the problem that my GPS has to deal with when I say “Plot me a course to the Gym.” Given a weighted graph with one node designated the start node and a different node designated the end node, find the path with the least total weight.

There are several things to keep in mind. First is cycles. If your path visits the same node more than once, there exists a shorter path that visits that node only once. Note that this may not be true of driving in Boston, where you can’t take a left from Commonwealth Ave onto the BU bridge, but instead have to go another block, take a right, and loop around to cross the bridge from the south. But in a simple graph, you have to identify if you are visiting a node that you have visited before. There are two ways to do this depending on whether you are willing to spend more time or more memory on the problem. The time intensive way is to use a list, and add each node to the list. When you visit a new node, you can perform a search of the list to see if you have been there before. If so, disregard that path segment. If not, add it to the list. The alternative is to record a state in each node. When you visit the node, you check to see if the state reflects that you have already visited it. If not, mark the node as visited and continue.

That is what I can remember off the top of my head. Obviously, this is a much discussed topic. There are many Wikipedia articles on this and related topics, so I won’t attempt to reproduce that here.

Working on the Beowulf Process

I am currently listed as one of the maintainers of the BProc project on Sourceforge. Unfortunately, my current life has left me little enough time to do my job and be a father, so other projects fall by the wayside.

BPRoc portrays a cluster of computers a single system from an operating system perspective. A process running anywhere one the cluster shows up in the process tree on the head node. Signals sent on any machine were forwarded to the machine where the process was actually running. A process could voluntarily migrate from one machine to another. All of these techniques take place in the Linux Kernel. Maintaining this code requires understand of operating system concepts such as signal delivery, page table organization, dynamic library linking, as well as network programming. I’ve never had more fun coding.

Linux kernel development is done by taking a copy of the kode published at kernel.org and applying a file that contains the differences between how it looks at the start and how you want it to look at the end. This file is called a patch. THe major Linux distributions all have a version of the Linux Kernel that they select as a starting point, and then a series of patches that they apply to deal with issues they care about. For instance, I am running Red Hat Enterprise Linux 4 machine with a kernel version of 2.6.9-55.0.9. The 2.6.9 is the version that they got from kernel.org. The 55.0.9 indicates the number of major and minor upgrades they have made to that kernel. The number patches applied when last I looked was in the neighborhood of 200. All of the changes we applied to the Linux kernel was maintained in a single patch. As we maintained succeeding version of the kernel, we continued to generate newer versions of that patch. In addition to this code, we had a separate, and much larger, portion of code that was compiled into a binary format that could loaded into the Linux Kernel on demand. The majority of the code in the patch was merely hooks into the code that called out to the loadable kernel modules.

Penguin had branched from the Sourceforge BPRoc before I joined. As such, Sourceforge had already moved on to the 2.6 series Linux Kernel while we were still on the 2.4 series. This was a major difference in the code base, and there was little grounds for sharing. When we did finally start moving to the 2.6 code base, we had one Marketing requirement that the Sourceforge project did not: We needed to interoperate with the Linux Kernel shipped by RedHat for there Enterprise Linux product (RHEL). I spent a long time in research mode trying to make this happen. Two major decisions came out of this. First, PID masquerading had to go. Second, we needed to use binary patching in place of many of the source level patches.


Every process in an operating system has an integer process identifier (PID) that other processes and the kernel can use to access that process. A major mechanism in BProc was the ability to migrate a process from one physical machine to another. PID masquerading is a technique that ensures that the process identified does not have to change during migration. Instead, each process has two identifiers. The first is the ID as allocated on the head node, and is used when reporting information to the head node, other nodes, or user land functions. The second ID is the PID allocated on the local machine, and only used inside to local machines Kernel. When a function like getpid is called, the process identifier returned is the masqueraded PID, not the local PID. PID masquerading has both positive and negative implications. With PID masquerading, a given compute node can actually have two completely separate pools of processes that cannot communicate with each other. each of the pools of processes can be driven from a different head node. This allows the sharing of compute nodes between head nodes. A given machine can actually act as both a head node and a compute node. This was a requirement in early Beowulf clusters, but was no longer necessary by the time I worked on them. The negative impact of PID masquerading was the amount of code required to support it. Every PID reference in the Linux kernel had to be scrutinized for whether it should be a local or remote PID. If it needed to be translated, a hook was inserted that said “If the module is loaded, and this process is a masqueraded process, return the masqueraded PID, otherwise return the real PID.” This type of logic composed approximately a quarter of the BProc Linux Kernel patch. There was no practical way we could inject all of this code without source level patching the kernel.






Binary patching means changing the machine code on a running system. There are two assembly instructions we looked for to see if we could change code. They are CALL and JUMP. Actually, there are two types of jumps, long and short, and we can use either of them. We did analysis of the compiled Linux kernel for places with these instructions near our current set of hooks. The CALL instruction is what maps to a function call in C. In assembly it looks like CALL 0x00000000, where the zeros will be replaced by the linker or the loaded with an address in memory where the function resides. Once we know where the call operation takes place, we can replace the value with our own function. This technique is often used with malicious intent in virus and root kits, but really is not much different than how a debugger or many security software packages work. During the replacement process, we record the original value of the function, so that we can unload our module and return it to it’s original flow. The compiler will often use a JMP instruction in the place of a CALL instruction as an optimization called a “tail call.” All this means is that when the called function returns, instead of returning to the location it was called from, it continues up the call stack. I discussed this in the CS GRE problem set post.

One place that we had to hook to make this work was the function and structure that allocated the process identifiers. The function alloc_pidmap gets a PID from a bitmap. The bitmap is just a page of memory treated as an array of bytes. Bit zero of page[0] represents the PID 0, Bit 1 represents PID 1, and so on. If a given bit is set to 1, there exists a structure in memory that is using that process ID. In standard configuration, a page in Linux is 4k bytes. 1024*4*8=32768, which is the largest number that can be held in a 16 bit signed integer. PIDs have traditionally been 16 bit signed integers in Unix, and Linux. We used a couple tricks to help out here. On the Head node, we set all PIDs less than some threshold (we chose 1000) to be 1, indicating to the system that it should not allocate those pids. On compute nodes, we set all PIDs greater than the threshold to be 1. PIDs to be visible across the entire cluster were allocated on the head node. PIDs allocated for local work on the compute node would be guaranteed not to class with PIDs from the head node.

Aside:  recent versions of the Linux kernel have expanded PIDs to be 32bit signed integers.  At first it was tempting to expand the allowable PIDs, statically partition the PID space amongst the compute nodes, and allow local allocation of PIDs.  We origianlly pursued this approach, but rejected it for several reasons.  First, the Linux Kernel set an arbitrary limit of 4*1024*1024 on the number of PIDS.  We wanted to be able to support clusters of 1024.  This means that any node on the cluster had only 4*1024 PIDs to allocate.  Since the vast majority of PIDs were handed out on the head node anyway, we had to do some unbalanced scheme where the head node go something in the neighborhood of 16000 PIDs, leaving a very small pool to be handed out on each the compute nodes.  Additionally, a compute node crash erased all record of the PIDs that had been handed out on that machine.  Replacing a node meant rebuilding the pidmap from the existing process tree, a very error prone and time consuming activity.  Also, many applications still assumed a 16 bit PID, and we did not want to break those applications.

We found that there were several race conditions that hit us if we relied solely on the pidmap structure to control PIDs. This we ended up hooking  alloc_pidmap, checking for a compute node or head node, and checking that the returned PID was withing the appropriate range.  This type of code pathis frowned up in the mainline Linux kernel, but we found no noticable performance hit in our benchmark applications.

One benefit to this approach was that we could then slowly remove the PID Masquerading code.  We continued to track both the masqueraded and real PIDs, but they were assigned the same value.  Thus we never broke the system as we restructured.

Questions for GRE Prep

This list of question was from a Slashdot post. Here are my answers.

Do you remember what LALR stands for?

This is a form of grammer parser. Look Ahead Left to Right. It accepts/parses Context sensitive grammers.
Can you calculate the cost of a pipeline bubble or a branch misprediction? What information would you need to do so?

You can calculate the cost of a specific misprediction under some architectures. You would need to know what the instructions to be executed in the incorrect case are, how long they take, and whether you can undo them or not. For instance in the sample code:

if     (i){   do_x();  } else{  do_y(); }

If both do_x and do_y are non, inlined functions, you know that a function call will happen.  Assume i has to fetched from memory, if  the addresses of do_x and do_y must also be fetched from memory, there is a likelihood of fetching the wrong address for the call operation.  If, on the other hand, do_y is inlined, and the compiler determines this to be the likely path, the cost will be whatever work do_y executes while waiting on the fetch of (i).

Deeper pipelines have even worse issues, as the “most likely path” may only be executed in it’s entirety a relativelyfew number of times, meaning much of the work done by deeper pipeline stages gets discarded.

How is depth-first/breadth-first related to stacks/queues related to LIFO/FIFO?

a stack is a last in first out (LIFO) data structure

a queue is a first in first out (FIFO) data structure.

In a depth first search, traversing down the the search tree is analogous to pushing values onto a stack. A depth first search is easily implemented using recursion, where the stack is the actual program stack.

In a bredth first search, you traverse across all of the children of the start node. As you search each child, you add it to a queue. Once the current node has no more children, you remove the item at the head of the queue and process it in the same manner. Continue until you meet the search criteria or there is nothing left in the queue.

What is the stack pointer?

The element of the register file that indicates where the top of the stack is for a process/thread.

What is the frame pointer?

On some architectures, a separate register points to a location on the stack that indicates where the currently executing function’s scope begins. On x86 and later architectures, there is no frame pointer register. However, the concept of frame pointer is still valid, as the return location of the current function will be pushed onto the stack with a call operation. To return from a function, this value is popped off the stack and put in the instruction register.

An interesting note is that this value may not always refer to the calling function. For example, assume you have three functions main, outer and inner. Outer and inner have identical parameter lists, for this example, a single integer value x.

void inner(int x){


void outer(x){




int x;



The generated assembly for main will be something like:

mov X,%ebp
call <outer>

But for main it can be

jmp <inner>

The reason is that there is no need to clean up the paramters after the call to inner. When inner completes, the return value stored on the stack will be from main, one after the call to outer. This is called a tail call and is a very common compiler optimization.

What is an interrupt?

An indication of an asynchronous event that the operating system must handle. There are two forms of interrupts, hardware and software. Examples of events which generated interrupts are: request I/O is avaialble from a device, a clock tick has occured, a user has typed a key on a keyboard. Typically, the OS handles the interrupt in two steps, a top half and a bottom half. The Top half is everything that must happen immediately, the bottom half is everything that must be done eventually. In the top half, the systems does things like recording the type of the interrupt, and scheduling the bottom half by placing it on a queue. The bottom half is what would be responsible for reading the data from the network device or linking up the key pressed to the appropriate TTY.

What are the three primary elements of OOP?

I don’t know if I would say there are three. Really, there are only two. Encapsulation and Polymorphism. Inheritance is in most forms of object oriented programming languages, but I agree with people who call this “Class Oriented Programming.” Protype based object oriented languages such as javascript may or may not be described as having inheritance. In this kind of language, a new object is created only by cloning the original object, and then modifying it. Since this is a shallow copy, it will have the same set of functions listed as methods of the object. These can be overloaded by replacing them. This is a instance of the Prototype design pattern.

Encapsulation means that you only have access to the interface of the object, not the underlying implementation. I would argue that Javascript provides no method to hide a method or property, and thus does not really encapsulate. Then again, java provides the entire introspection API, so there are often ways to get around encapsulation using dirty VM tricks as well. In C++, you can alwyas cast the object instance to get access to the underlying implementation. It is safe to say no languages encapsulation is effective against a truely determined effort to violate it.

There are two distinct forms of polymorphism. The first is that the same symbol can point to multiple functions. In C++ this is done using a table of function pointers. The second form is that two functions of the same name can take different sets of parameters and point to differnent implementations. This requires compiler support. In C++ this is implemented via Name mangling, where the types of the parameters and the return value are added to the symbol name. In Java, all inheritance is performed via a V table, and is scoped on the class name of the class that originally defined it.

What is modus ponens/tolens (sp)?

Forms of logical arguments. A->B A Therefore B is a modus ponens argument.

A modus tollens is a logical argument of the form A->B not(B) therefore not(A).

The truth table for A->B is:

A B==A->B

0 0 ==1

1 0 ==0 (logical contradiction)

1 1 ==1

0 1 ==1 (not(a) does not imply not(b))

What kinds of race conditions are there?

Wikipedia has a write up of these, but I never learned these terms.

What is a critical section?

A section of code that can only have a single thread of execution at a given point in time. Access to a critical section is controlled by a lock.

What is the difference between a monitor and a semaphore?

Both are methods of concurancy control. A semaphore performs concurrency control by keeping a counter of requests. THe counter is implemented using an atomic test and set. When the semaphore is 0, another process is granted the permission requested. All others will be required to wait.

A monitor is a compiler generated construct that controls access to a resource. A call to code that is a critical section will have code generated that acquires the lock, and then releases the lock after exiting the critical section.

The third form of concurrency control is message passing. These three forms are all equivalent. Each can be implemented in terms of the other two.

Define NP-complete.

A problem is NP if there is a solution that can be performed in polynomial time by a non-determinisitic turing machine. A solution to the problem can be verified in polynomial time by a deterministic Turing machine. A problem is NP complete if every other problem in NP can be reduced to that problem. NP-complete problems are the hardest problems in NP.

Computer Science GRE Prep

Once again I feel myself being draw to take that damn test. I am a software professional. This means I do about .001% computer science. Computer Science is math. It has nothing to do with computers except that it maybe explains what a computer would theoretically be able to do if it were infinitely big.

So I am going to try to use the blog as my notes for reviewing. One issue that comes up time and time again is combinations. Combinations are based on permutations. If you have n distinct items, they can be ordered in in n! ways. For the non math people, n! is n times n-1 times n-2 … times 1. 0! is defined as 1, as they claim there is only one way to order zero items…suspect, but it keeps the algorithms easy.

If you want to figure out how many permutations of length m you can make out of n items, where n>m, you have n (n-1)(n-2)…(n-m+1). This is equal to n!/(n-m)!

Combinations are like permutations, but order does not matter. Basically, a combination is a subset. If you have three items, and you want a subset of two items, you can take {1,2} {2,3} or {1,3}. This is called n choose m, where n is the number of items in the set, and m is the number of items in the subset. The formula for this is n!/(n-m)!m!. This makes sense as you take the formula for permutations and divide out the number of redundant combinations of length m.

One place where combinations are useful is in graph theory. There are many comparable problems in graph theory which can all be converted into each other. One of these is the clique problem. A clique is a completely connected subgraph.

META Comment: This version of WordPress as a blogging tool does not make it easy to draw pictures.

The clique problem is to determine if there is a clique of size m in a graph whose number of nodes is n, n>m. A brute force algorithm iterates through all subsets of nodes of size m and checks to see if them make a clique. Testing for a clique is not computationally intensive, but generating all of the subsets is O(n choose m).