Interview Question for Distributed Computing

This is an updated version of my interview question, rewritten to be non-BProc specific.

Using only the following API:

  • printf(…)
  • int get_node_count() // number of compute nodes attached to the head node; 0 means no other nodes
  • int get_current_node() // 0 for the head node, 1-n for the compute nodes
  • int remote_fork(int node) // like fork, but returns an fd connected to the child/parent process
  • void send_long_sync(int fd, long value) // send and wait; blocks until receipt
  • long recv_long_sync(int fd) // block until a value is available
  • long gettime()

Calculate the average clock skew on the cluster.  Return 0 on success, and -1 on any failures.

Assume that all nodes are up and running. The language is a C-like subset of C++. Each of these functions throws an exception upon failure. remote_fork has the same semantics as fork: when it returns, there are two copies of the program running, just on separate machines. The next line of code to execute will be the line immediately after the fork on both machines. However, the returned value is not a process ID, and the parent process does not need to wait for the remote process to finish: the child process is automatically reaped by init.
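
For what it's worth, here is one way to attack the core of it, sketched as runnable C++ with stand-ins of my own devising (fork() plus a socketpair() playing the role of remote_fork, and a per-process offset simulating the skew — none of this is part of the question itself). The idea: the head pings a node, the node answers with its clock reading, and the head assumes that reading happened at the midpoint of the round trip.

```cpp
#include <cassert>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

static long g_skew_ms = 0;          // this process's simulated clock skew

// Stand-in for gettime(): wall clock in milliseconds, plus the skew.
long gettime_ms() {
    timeval tv;
    gettimeofday(&tv, nullptr);
    return tv.tv_sec * 1000L + tv.tv_usec / 1000 + g_skew_ms;
}

// Stand-ins for the synchronous send/receive calls.
void send_long_sync(int fd, long v) { (void)!write(fd, &v, sizeof v); }
long recv_long_sync(int fd) { long v = 0; (void)!read(fd, &v, sizeof v); return v; }

// Estimate one "node's" skew. The child reads its (skewed) clock when
// pinged; the parent assumes that read happened at the midpoint of the
// round trip, so skew ~= remote reading - (t0 + t1) / 2.
long measure_skew(long child_skew_ms) {
    int sv[2];
    socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    if (fork() == 0) {              // the "compute node"
        g_skew_ms = child_skew_ms;
        recv_long_sync(sv[1]);                  // wait for the ping
        send_long_sync(sv[1], gettime_ms());    // answer with our clock
        _exit(0);
    }
    long t0 = gettime_ms();
    send_long_sync(sv[0], 0);       // ping
    long remote = recv_long_sync(sv[0]);
    long t1 = gettime_ms();
    wait(nullptr);
    return remote - (t0 + t1) / 2;
}
```

The real answer loops this over nodes 1 through get_node_count(), averages the per-node estimates, and wraps the whole thing in a try/catch that returns -1, since every call in the API can throw.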

Faking out PAM Authentication

I am working on a server application that uses Pluggable Authentication Modules (PAM) for authentication support.  This application  must run as root. As part of development, people need to log in to this server.  I don’t want to give out the root password of my development machine to people.  My hack was to create a setup in /etc/pam.d/emo-auth that always allows the login to succeed, provided the account exists.  The emo-auth configuration is what the application looks up to authenticate network connections.

$ cat /chroot/etc/pam.d/emo-auth

account   required   pam_permit.so
auth      required   pam_permit.so
session   required   pam_permit.so

Now people log in as root, and any password will allow them to get in.
Since this is only for development, this solution works fine, and does not require any code changes.

Musings

Don’t hit publish on the blog when you just want to save a draft.

Big Builds are Bad. Software should be developed and distributed in small packages. Linux is successful due to things like apt, yum, and yast.

Interface Specifications need to be more specific.  Just saying that something is a string is not really helpful if that something needs to conform to a pattern.

Programming and blogging requires sugar in the brain.

Interviews are tricky…on both sides of the table. Career fairs are worse.

C++ Has a lot of magic in it. Can we make type level programming more transparent?

Microsoft purchasing Yahoo would be good for Google, but bad for just about everyone else.

Being a Dad is really cool. Even when it sucks, it is great. Sometimes kids refuse to go to sleep. This leads to sleep deprivation, but also leads to really wonderful moments in rocking chair in the middle of the night.

Pool is a great Geek game. Lower left-hand English is neat.

Snowshoes are good off the trail. Not so good on the trail. If you're going on the trail, take the cross country skis. Snowmobiles smell funny.

New Hampshire winter weather is still as brutal today as it was when I left the area in the early ’90s.

It is hard to sing a Jazzy version of Old MacDonald had a Farm.  It is harder to do after the tenth repetition while trying to get a child to fall asleep.
If you listen to Children’s CDs long enough, you will develop favorite children’s songs. I like the hippo song.

Is there really a difference between the Ethernet and SCSI protocols? I don’t know, but it would be fun to find out.

The compiler is your friend. Let it check your work for you.

Why write code on a white board if you have a computer available? Especially if you have an overhead projector?

Where do the local peregrine falcons sleep? Where would they be sleeping if we hadn’t built up the whole area?

If I could have a redo on which language to take as a Sophomore, I probably would have liked to take Chinese. Russian and Arabic would also do. German was not a good choice for me.

If Bush Senior had insisted on pushing to Baghdad, it would have been my generation in this mess as opposed to the current set of junior officers. Instead of Haiti, I would have gone to Basra or something.

There are too many interesting topics in the world to pursue them all, or even a small fraction of them.

Every philosopher I’ve read, especially the ones I disagree with, have said something that is valuable and true.

No matter how old you are, when you get together with your parents, you revert to teenager status.

This list should never see the light of day.

High Availability and dealing with failures

Time to use my public forum to muddle through some design issues I’m struggling to lay straight.

A data center is made up of several objects: servers (computers, usually horizontal), racks (which hold the computers), switches (network devices that connect two or more computers together), power sources, and cables (both electric and network cables, to include fiber optic for storage devices). A server in the data center can serve one or more roles: storage host, computation host, administration, or user interface. If an application is hosted in a data center, it is usually important enough that it requires some guarantee of availability. This application will reside on a computation host, and require access to other types of hosts. An email server stores the received messages in a storage host, probably connected to the computation host via fiber optics. It receives and sends messages via a network connection that goes to the outside world. It may also talk to a user interface machine that runs a web server and an application that allows web access to email. If the computation host loses connectivity with either the public network or the storage host, it cannot process mail. If the web server loses connectivity to the mail server, certain users cannot access their mail.

There are many reasons that connectivity can fail. The major links in the chain are: OS failure, network interface card (NIC) failure, bad cable, disconnected cable, bad switch, unplugged switch, switch turned off. Once you pass the switch, the same set of potential problems exists on the way to the other host. To increase reliability, a network will often have two switches, and each server will have two NICs, one plugged into each switch. The same setup goes for storage, although different technologies are used. As a general rule, you don’t want to have things running in hot standby mode. It is a waste of resources, and it doesn’t get tested until an emergency hits. Thus, the double network connectivity usually gets set up also as a way to double bandwidth. Now if one of the cables breaks, that server merely operates in a degraded mode. The second cable has been passing network traffic already; now it just gets all of it.

A typical data center has many machines. Typical server loads are very low, sometimes in the 1-3% range of overall capacity. Thus, if a machine fails, a data center often has plenty of servers that could absorb the load from the failed server. Figuring out how to cross load services in a data center has been a major topic in the IT world over the past decade. This goes by many names, one of which is grid computing. I’ll use that term myself here. There are several problems any grid system has to solve, but most can be clumped under the term application provisioning. This means getting all of the resources together that a given application requires so that they are available on the computation host. These resources include the network and storage connections described above, as well as the executables, data files, licenses, and security permissions required to run the application.

When a host fails, some remote monitoring system needs to act. First, it needs to know that the host has failed. This is typically performed through a heartbeat sensor. This is a simple network communication sent by the computation host saying “I’m still alive.” Cue Mike McCready. When a heartbeat fails, the monitor needs to make sure that the application is up online somewhere as soon as possible. Now, the reason the heartbeat failed might have been because of a problem on the heartbeat network, and the application is actually up and running just fine. An advanced solution is to test the system through some alternative method. In the case of the email server, it may be to connect to the email port and send a sample message. This delays the restart of the application, but may minimize downtime.
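
The decision logic above can be boiled down to a small sketch (names, thresholds, and the probe callback are all my own inventions for illustration): a missed heartbeat alone doesn't trigger a restart; the secondary probe gets a vote first.

```cpp
#include <cassert>
#include <functional>

enum Action { KeepWaiting, Failover };

// Called on each monitor tick. A stale heartbeat alone isn't enough:
// run the secondary probe (e.g. connect to the mail port and send a
// sample message), and fail over only if that also fails.
Action on_tick(long now, long last_heartbeat, long timeout,
               const std::function<bool()>& service_probe) {
    if (now - last_heartbeat <= timeout)
        return KeepWaiting;          // heartbeat is current
    if (service_probe())
        return KeepWaiting;          // heartbeat path is down, service is up
    return Failover;                 // both checks failed: restart elsewhere
}
```

The probe buys accuracy at the cost of delay, which is exactly the trade-off described above.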

Sometimes, two copies of the application can’t run at the same time. In this case, you have to be sure that the original one is gone. To achieve this, you shut off the original server. This is called “Shoot the other node in the head,” or STONITH. Sometimes the word node is replaced with guy and you get STOGITH. If you do this incorrectly, you may take yourself down, a situation referred to as SMITH. Or you take all servers down, and this is called SEITH. But I digest…

Here’s the part that I am currently trying to decide. If an application depends on a resource, and that resource fails, you can bring the application up on a different server. It will take a non-trivial amount of time (estimate it at a minute) to shut down the old instance and bring up the new instance. If, on the other hand, the disconnect is temporary, we can have minimal down time by just waiting for the network to come back up. If someone disconnects a cable by accident, that person can just plug the cable back in. If the network is redundant, removing one cable may result in degraded performance, but it may not.
If the failure is due to the switch being down, just selecting another host connected to the same switch will result in downtime and a situation that is no better than the original. If the problem is the storage host being down, there probably is nothing you can do to recover outside of human or divine intervention.

If a switch goes down but there is another set of hosts on a different switch, you can migrate applications to the new host. But you may end up overloading the new switch. This is referred to as the stampeding herd effect. If the lost switch is degrading performance for all applications dependent on that switch, your best strategy is to migrate a subset of applications to balance the load. After each application is moved, recheck the network traffic to determine if you’ve done enough. Or done too much.

A malfunctioning NIC or switch might manifest in intermittent connectivity.  In this case, you want to get the application off of the original server and on to a new server.  The problem is in distinguishing this from the case where the cable just got unplugged once, and then plugged back in.  From the server’s perspective, the network is gone, and then it is back.  This leads to a lot of questions. What interval, and how many iterations do you let go by before you decide to bail from that server?  If you have redundancy, does the failing connection impact the user’s experience, or does proper routing ensure that they have seamless connectivity?

Dynamic Frequency Assignment Protocol

A book about the war in Iraq had a brief passage about an Army unit that needed to stop, dismount, and reset their frequency because they were getting stepped on by a Marine Corps unit using the same freq. The Marines were in contact and were not about to change frequencies. This led me to thinking about a way to solve this problem via network protocols and public/private key exchanges.

Each unit has a radio. These radios already can encrypt traffic and frequency hop. In fact, frequency hopping would have prevented the unit from getting stepped on. But let’s assume that freq hopping is out of the picture for now. Instead, we want to set up a protocol where a unit can reassemble on the network without having to expose itself and meet face to face.

1. Each machine would get a private key. This allows it to be uniquely identified on the network.

2. Each machine gets a list of public keys and their assigned units (a heads up display could then show who was talking, removing the need for “Black 8 this is Black 6 over.”)

3. A set of frequencies would be set aside for meta-data commo. I’ll call this the MDF. The MDF would be shared by all units within a theater of operations, and would be redundant. Just like name servers, a radio would get a list of MDFs to try. Thus four MDFs could be assigned, and different units would get a different MDF to use as their primary.

4. The radio would monitor both its main frequencies (the radios I used when I was in could monitor two; I assume the new ones can do better) and the MDF. MDF traffic would all be binary data. Listening to it would be like listening to a 300 baud modem.

5. A “Lost” Unit would broadcast a connection request on the metadata frequency. Any other unit in the area could answer.

6. Higher would send down the new frequency using the lost unit’s public key.

7. The lost unit’s radio would set the new frequency and drive on.

One glaring weakness is that the MDF itself could be jammed. One faulty radio could crap-flood the frequency.

It also adds overhead to the whole system by removing frequencies that could otherwise be used. An optional interactive authentication step could provide for the soldier entering a pin or password if there is some question of the radio being compromised. Two valid responses could be provided: one that means “I’m me” and one that means “I’m me and doing this under duress.”

Note that none of this would prevent manual resets, just provide an additional capability for automated resets.

Of course, this is yet more electronics subject to the beating an infantryman gives it.

Update:  Of course the Military is way ahead of this.

http://en.wikipedia.org/wiki/AN/PRC-148

And

http://en.wikipedia.org/wiki/Joint_Tactical_Radio_System

Linux init process

The BProc project supports an older protocol called RARP to assign an IP address for a compute node. While this made sense when BProc was written, it has been made obsolete by DHCP. Since I really don’t want to write a DHCP server, I’ve decided to try to use the DHCP and TFTP servers that come with CentOS to boot the compute nodes. Here’s what I’ve (re)learned:

The initrd image that gets built for the Linux kernel has a file in its / directory called init. This is a shell script that executes in the lash interpreter. It does a modprobe for a set of modules, creates a file in /dev for the root file system, mounts it, and performs a switchroot.

Aside: Anyone on a linux system can find this out by running:

zcat /boot/initrd-<version>.img | cpio -di

I would suggest doing this in an empty directory.

My thinking is that I should hack this script to do a tftp fetch before creating the /dev file. What I plan on fetching is a file that contains an ext2 file system that can be mounted as a ram disk. This ramdisk can be created by creating a (large) file, then running mke2fs. This file will not dynamically resize, so I need to make it large enough to fit all my files needed for booting, but not so large that it is going to eat up a significant portion of ram on the compute node. I know I am going to need the bproc kernel modules (bproc.ko, vmadump.ko), bpmaster, some process to act as init (I’ll use bash to start) and the support libraries:

  • /lib/libncurses.so.5 374024
  • /lib/libdl.so.2 14624
  • /lib/libc.so.6 1367432
  • /lib64/ld-linux-x86-64.so.2 119536
  • bproc.ko 1929345
  • vmadump.ko 285821
  • /bin/bash 797208
  • bpmaster 112920

Turning to my old friend the binary calculator:

echo "( 374024 + 14624 + 1367432 + 119536 + 1929345 + 285821 + 112920 + 797208 ) / ( 1024 * 1024 )" | bc

4

So roughly 4 MB. I’ll make it an odd 5 to start.

To create the file:

$ dd if=/dev/zero of=/tmp/ramdisk bs=1024 count=5110
5110+0 records in
5110+0 records out
5232640 bytes (5.2 MB) copied, 0.024132 seconds, 217 MB/s

I’ll take the defaults for ext2 for now. Notice that I have to type ‘y’ when asked to proceed.

$ mke2fs /tmp/ramdisk
mke2fs 1.40-WIP (14-Nov-2006)
/tmp/ramdisk is not a block special device.
Proceed anyway? (y,n) y
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
1280 inodes, 5108 blocks
255 blocks (4.99%) reserved for the super user
First data block=1
Maximum filesystem blocks=5242880
1 block group
8192 blocks per group, 8192 fragments per group
1280 inodes per group

Writing inode tables: done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 38 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.

Now that I have a ramdisk, I can copy files to it:

$ sudo mkdir /mnt/ramdisk
Password:
$ sudo mount -o loop /tmp/ramdisk /mnt/ramdisk/
$ ls /mnt/ramdisk/
lost+found

And we have a file system.

Update 1: The initrd layout seems to be distribution specific. On my Debian box, there is no lash; instead there is a busybox executable with, amongst other things, a tftp client built in. This may be a worthy approach: having tftp available as part of the initrd will allow fetching a rootfs to be done more cleanly. Also, there are hooks to put scripts in, and command line options to allow building initrds for NFS root or local root. If only I had targeted Debian instead of RHEL 4 to start.

Update 2: The Red Hat initrd does not have a tftp client in it. I added one in by hand, added all of the libraries it needed (ldd bin/tftp), and kicked off another PXE boot. Network unreachable. Interesting that it is supposed to be able to NFS mount root, but it seems unable to do a tftp fetch.

Back to BProc

The last check-in to the BProc CVS repository on SourceForge happened 16 months ago. I recently checked out the top of tree and found I was unable to build. Looks like what is there is a mix of 2.6.10 and something in the vicinity of 2.6.20 Linux code bases. I am starting again, this time with the code in the tarball. I’ve built this before and know it compiles. Here is my general plan forward:

1. Get a 2.6.9 kernel with the BProc patch applied to boot on a RHEL 4 system.

2. Build the BProc and VMADump kernel modules and load them into the kernel.

3. Build the BPMaster and BPSlave binaries. Make sure BPMaster runs.

4. Build the beoboot code.

This is where it gets tricky. At Penguin we had our own PXE server (Beoserv) that handled provisioning a compute node. Part of the Beoboot package there was creating the root file system and bringing up the slave node binary. So here is a tentative plan instead.

1. Deploy the standard redhat PXE and DHCP servers on my head node. Ensure that the DHCP server only responds to requests from the subnet where the compute node resides. Probably best to unplug from the company network when I do this.

2. Set the PXE server to support the booting of a stripped down RHEL4 system. Really, all I want is to get as far as running init.

3. Replace the init in the PXE image with the beoboot binary. Have it bring up BPSlave and see if it can talk to BPMaster on the head node.

If I can get this far, I will consider it a great success.

Update 1: I built a 2.6.9 Linux kernel with the BProc patch applied: make oldconfig, selected BProc but none of the other options. Upon booting I got a panic when it could not find device mapper. Looks like device-mapper got added in the 2.6.10 kernel. Since I have already built that kernel, I guess I’ll start by trying the tarball kernel module code against the 2.6.9 patch.

Update 2: Um, nope. TASK_ZOMBIE and mmlist_nr are showing up as undefined symbols. mmlist_nr seems to be a count of the number of memory managers out there. I suspect that this is something that changed between 2.6.9 and 2.6.10. Probably some better way to keep the ref counts was introduced. I vaguely remember something about TASK_ZOMBIE.

Update 3: This was bogus and I removed it.
Update 4: Replaced TASK_ZOMBIE with EXIT_ZOMBIE. Commented out the decrement as it seems like it has just been removed.

Update 5: Error accessing rlim in task_struct. This is now in the signal struct:

- unsigned long gap = current->rlim[RLIMIT_STACK].rlim_cur;
+ unsigned long gap = current->signal->rlim[RLIMIT_STACK].rlim_cur;

Update 6: OK, back to the point I found before. The hook for kill_pg_info is now kill_pgrp_info, and the hook for kill_proc_info is now kill_pid_info. This is a change in the patch, so I have to get the module code in line with the new function call parameters. Looks like the header has been changed, but the old function call names are used in kernel/signal.c. Changing, rebuilding, and redeploying the kernel.

Update 7:  Success through building and running bpmaster.    I had to create a config directory, but other than that, nothing was too far out of the ordinary.

Working on the Beowulf Process

I am currently listed as one of the maintainers of the BProc project on Sourceforge. Unfortunately, my current life has left me little enough time to do my job and be a father, so other projects fall by the wayside.

BProc portrays a cluster of computers as a single system from an operating system perspective. A process running anywhere on the cluster shows up in the process tree on the head node. Signals sent on any machine are forwarded to the machine where the process is actually running. A process can voluntarily migrate from one machine to another. All of these techniques take place in the Linux kernel. Maintaining this code requires an understanding of operating system concepts such as signal delivery, page table organization, and dynamic library linking, as well as network programming. I’ve never had more fun coding.

Linux kernel development is done by taking a copy of the code published at kernel.org and applying a file that contains the differences between how it looks at the start and how you want it to look at the end. This file is called a patch. The major Linux distributions all have a version of the Linux kernel that they select as a starting point, and then a series of patches that they apply to deal with issues they care about. For instance, I am running a Red Hat Enterprise Linux 4 machine with a kernel version of 2.6.9-55.0.9. The 2.6.9 is the version that they got from kernel.org. The 55.0.9 indicates the number of major and minor upgrades they have made to that kernel. The number of patches applied when last I looked was in the neighborhood of 200. All of the changes we applied to the Linux kernel were maintained in a single patch. As we maintained succeeding versions of the kernel, we continued to generate newer versions of that patch. In addition to this code, we had a separate, and much larger, portion of code that was compiled into a binary format that could be loaded into the Linux kernel on demand. The majority of the code in the patch was merely hooks into the code that called out to the loadable kernel modules.

Penguin had branched from the SourceForge BProc before I joined. As such, SourceForge had already moved on to the 2.6 series Linux kernel while we were still on the 2.4 series. This was a major difference in the code base, and there was little ground for sharing. When we did finally start moving to the 2.6 code base, we had one marketing requirement that the SourceForge project did not: we needed to interoperate with the Linux kernel shipped by Red Hat for their Enterprise Linux product (RHEL). I spent a long time in research mode trying to make this happen. Two major decisions came out of this. First, PID masquerading had to go. Second, we needed to use binary patching in place of many of the source level patches.

Every process in an operating system has an integer process identifier (PID) that other processes and the kernel can use to access that process. A major mechanism in BProc was the ability to migrate a process from one physical machine to another. PID masquerading is a technique that ensures that the process identifier does not have to change during migration. Instead, each process has two identifiers. The first is the ID as allocated on the head node, and is used when reporting information to the head node, other nodes, or user land functions. The second ID is the PID allocated on the local machine, and is only used inside the local machine’s kernel. When a function like getpid is called, the process identifier returned is the masqueraded PID, not the local PID. PID masquerading has both positive and negative implications. With PID masquerading, a given compute node can actually have two completely separate pools of processes that cannot communicate with each other. Each of the pools of processes can be driven from a different head node. This allows the sharing of compute nodes between head nodes. A given machine can actually act as both a head node and a compute node. This was a requirement in early Beowulf clusters, but was no longer necessary by the time I worked on them. The negative impact of PID masquerading was the amount of code required to support it. Every PID reference in the Linux kernel had to be scrutinized for whether it should be a local or remote PID. If it needed to be translated, a hook was inserted that said “If the module is loaded, and this process is a masqueraded process, return the masqueraded PID, otherwise return the real PID.” This type of logic composed approximately a quarter of the BProc Linux kernel patch. There was no practical way we could inject all of this code without source level patching the kernel.
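
The translation rule itself is tiny; it is the number of call sites that made it expensive. A minimal illustration (the data layout here is invented for the example; the real bookkeeping lived in the BProc module):

```cpp
#include <cassert>

// Illustration only: two IDs per process, as described above.
struct Proc {
    int  local_pid;     // PID allocated by this machine's kernel
    int  masq_pid;      // PID allocated on the head node
    bool masqueraded;   // is this a cluster-visible (migrated) process?
};

// The hook logic: getpid and friends report the masqueraded PID when
// one exists, otherwise the real local PID.
int visible_pid(const Proc& p) {
    return p.masqueraded ? p.masq_pid : p.local_pid;
}
```

Multiply that conditional by every PID reference in the kernel, and you get the quarter-of-the-patch figure mentioned above.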

Binary patching means changing the machine code on a running system. There are two assembly instructions we looked for to see if we could change code: CALL and JMP. Actually, there are two types of jumps, long and short, and we can use either of them. We did analysis of the compiled Linux kernel for places with these instructions near our current set of hooks. The CALL instruction is what maps to a function call in C. In assembly it looks like CALL 0x00000000, where the zeros will be replaced by the linker or the loader with an address in memory where the function resides. Once we know where the call operation takes place, we can replace the value with our own function. This technique is often used with malicious intent in viruses and rootkits, but really is not much different from how a debugger or many security software packages work. During the replacement process, we record the original value of the function, so that we can unload our module and return it to its original flow. The compiler will often use a JMP instruction in the place of a CALL instruction as an optimization called a “tail call.” All this means is that when the called function returns, instead of returning to the location it was called from, it continues up the call stack. I discussed this in the CS GRE problem set post.
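
A toy version of the CALL rewrite, operating on a byte buffer rather than live kernel text (the function names are mine; the encoding is the standard x86 near call, 0xE8 followed by a 32-bit displacement relative to the next instruction):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Read the 32-bit displacement out of a 5-byte CALL at `site`.
int32_t read_rel32(const uint8_t* site) {
    int32_t rel;
    memcpy(&rel, site + 1, sizeof rel);
    return rel;
}

// The target of a near CALL is (address of next instruction) + rel32.
uint64_t call_target(const uint8_t* site, uint64_t site_addr) {
    return site_addr + 5 + (int64_t)read_rel32(site);
}

// Redirect the CALL to new_target, returning the original displacement
// so the hook can be undone when the module unloads.
int32_t patch_call(uint8_t* site, uint64_t site_addr, uint64_t new_target) {
    int32_t old_rel = read_rel32(site);
    int32_t new_rel = (int32_t)(new_target - (site_addr + 5));
    memcpy(site + 1, &new_rel, sizeof new_rel);
    return old_rel;
}
```

The in-kernel version additionally has to worry about write-protected text pages and doing the store atomically, which this sketch ignores.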

One place that we had to hook to make this work was the function and structure that allocate the process identifiers. The function alloc_pidmap gets a PID from a bitmap. The bitmap is just a page of memory treated as an array of bytes. Bit zero of page[0] represents PID 0, bit 1 represents PID 1, and so on. If a given bit is set to 1, there exists a structure in memory that is using that process ID. In the standard configuration, a page in Linux is 4K bytes. 1024*4*8 = 32768, which is one more than the largest number that can be held in a 16 bit signed integer. PIDs have traditionally been 16 bit signed integers in Unix and Linux. We used a couple of tricks to help out here. On the head node, we set all PIDs less than some threshold (we chose 1000) to 1, indicating to the system that it should not allocate those PIDs. On compute nodes, we set all PIDs greater than the threshold to 1. PIDs to be visible across the entire cluster were allocated on the head node. PIDs allocated for local work on the compute node would be guaranteed not to clash with PIDs from the head node.
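
The partitioning trick is easy to show with a user-space mock of the pidmap (this is an illustration with names of my own, not the kernel's actual code):

```cpp
#include <cassert>
#include <vector>

const int PAGE_SIZE = 4096;            // standard Linux page
const int MAX_PID   = PAGE_SIZE * 8;   // 32768 possible PIDs, one bit each
const int THRESHOLD = 1000;            // the split point we chose

// One bit per PID; a set bit means the PID is unavailable.
struct PidMap {
    std::vector<unsigned char> bits;
    PidMap() : bits(PAGE_SIZE, 0) {}
    void reserve(int pid)     { bits[pid / 8] |= 1 << (pid % 8); }
    bool taken(int pid) const { return bits[pid / 8] & (1 << (pid % 8)); }
};

// Head node: pre-mark everything below the threshold, so cluster-visible
// PIDs all come from [THRESHOLD, MAX_PID). Compute node: the reverse, so
// purely local PIDs can never clash with head-node PIDs.
void partition(PidMap& m, bool head_node) {
    for (int p = 0; p < MAX_PID; p++)
        if (head_node ? p < THRESHOLD : p >= THRESHOLD)
            m.reserve(p);
}

// Simplified allocator scan (the real alloc_pidmap does this in words).
int first_free(const PidMap& m) {
    for (int p = 1; p < MAX_PID; p++)  // skip PID 0, as Linux does
        if (!m.taken(p)) return p;
    return -1;
}
```

On a head node this hands out 1000 as the first PID; on a compute node, 1. That disjointness is the whole point.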

Aside:  Recent versions of the Linux kernel have expanded PIDs to be 32 bit signed integers.  At first it was tempting to expand the allowable PIDs, statically partition the PID space amongst the compute nodes, and allow local allocation of PIDs.  We originally pursued this approach, but rejected it for several reasons.  First, the Linux kernel set an arbitrary limit of 4*1024*1024 on the number of PIDs.  We wanted to be able to support clusters of 1024 nodes.  This means that any node on the cluster had only 4*1024 PIDs to allocate.  Since the vast majority of PIDs were handed out on the head node anyway, we had to do some unbalanced scheme where the head node got something in the neighborhood of 16000 PIDs, leaving a very small pool to be handed out on each of the compute nodes.  Additionally, a compute node crash erased all record of the PIDs that had been handed out on that machine.  Replacing a node meant rebuilding the pidmap from the existing process tree, a very error prone and time consuming activity.  Also, many applications still assumed a 16 bit PID, and we did not want to break those applications.

We found that there were several race conditions that hit us if we relied solely on the pidmap structure to control PIDs. Thus we ended up hooking alloc_pidmap, checking for a compute node or head node, and checking that the returned PID was within the appropriate range.  This type of code path is frowned upon in the mainline Linux kernel, but we found no noticeable performance hit in our benchmark applications.

One benefit to this approach was that we could then slowly remove the PID Masquerading code.  We continued to track both the masqueraded and real PIDs, but they were assigned the same value.  Thus we never broke the system as we restructured.