About Adam Young

Once upon a time I was an Army Officer, but that was long ago. Now I work as a Software Engineer. I climb rocks, play saxophone, and spend way too much time in front of a computer.

Heap memory debugging tool in glibc programs.

A couple of people here at work just came up against a bug in memory usage on an embedded platform. (This may be straining the definition of embedded in the size of the application, as it is large, but we’ll go with it.) The problem manifested as a failure in malloc. The debugging tool that allowed them to find it was simply to set the environment variable:

MALLOC_CHECK_=1

This tells glibc to use a debug heap as opposed to the standard heap allocation routines. No recompile required.

Autotools and doxygen

Short one this time:

I needed to integrate doxygen with my current source code. Turns out it is pretty easy.

Use doxygen to generate a config file (OK I copied one from another project)

doxygen -g project-dox.conf.in

Note the .in at the end. autoconf/automake will generate the correct code to convert something ending in .in to the correct file.

Add this config file to configure.ac

AC_CONFIG_FILES([Makefile project-dox.conf])

along with whatever other files get listed there for your project.

Our build system builds in the $project/build directory, but leaves the source back in the original directory. To keep these in sync, I modified my project-dox.conf.in:

INPUT = @srcdir@

Assuming doxygen is installed, of course.

‘The Bug’ at Penguin

My first several months at Penguin (Scyld, actually) were spent chasing down a really nasty bug. Our software ran a modified version of the 2.4 Linux Kernel with a kernel module as well. The problem was that it would periodically freeze the system. This was such a pronounced problem that my boss, Richard, had put a tape outline of a bug on the floor of his cube. Every time they thought they had squashed the bug, it would reemerge. While the system had many issues to address, none were more critical than solving this stability-related issue.

When a system freezes, the primary debugging tool is a message that starts with the word ‘oops’; such messages are therefore called ‘oopses’. This is the Linux equivalent of the Blue Screen of Death. The kernel spits out a message and freezes.

The Linux Kernel has come a long way since 2.4.25, the version Scyld shipped when I started (or thereabouts). Nowadays, when the kernel oopses, it spits out what is called a stack trace, showing which functions had been called at the time of the problem. By tracing down through the function calls, you can usually figure out fairly quickly what the major symptom of the problem is and, from there, work backwards to the root cause. Under the 2.4 kernel, we didn’t get a stack trace. Instead, we got a dump of the stack in its raw form, and had to run it through a post processor (ksymoops) that looked at the data and the layout of the kernel and gave a best guess at the call stack.

There were two problems getting out the gate. We needed to reproduce the problem, and we needed to capture the oops message. Since the bug happened intermittently, it usually took several hours to reproduce. Because we ran our head nodes with a graphical front end running, we didn’t necessarily see even the stack dump on a customer system. Periodically someone would get a glimpse, or send a digital photo of some segment of it.

The easier problem to solve was capturing the oops message. You can modify the options that are given to the kernel such that all console output also would be echoed to the serial port. We would connect another computer by cable to the serial port and run a program called minicom that allowed us to display what was happening on the head node’s console.

The harder problem was reproducing. Early on we knew that the problem was related to connection tracking. On certain systems, when connection tracking was turned on, booting a compute node would oops the master.

For people in the enterprise world, it may come as some surprise that this type of behavior was tolerable in a shipped product. However, the high performance computing world is a little different. Our customers are pushing machines to their limit in an attempt to get the most number crunching done in the least amount of time. Much of the system was designed to limit overhead so that as few CPU cycles as possible would be wasted on administrative tasks.

The Bug was triggered when the system was running our code, the Broadcom custom gigabit Ethernet driver (bcm5700), and IP Masquerading.

Since we knew what situation caused it, it was easy to tell people how to work around it. They did so, grudgingly. The usual result was to either run a slower Ethernet driver or not run IP Masquerading. The slower driver meant slower performance. Not running IP Masquerading meant no firewall around the system. Most people did without the firewall.

I mentioned this problem to a friend of mine, who pointed out that I was running in the Linux Kernel, and that stack space was limited. Each process in the (2.4) Linux Kernel has two pages allocated for both the stack (for Kernel mode, not user mode execution) and the structure that represents the process (struct task_struct). The task_struct is allocated at the beginning of these two pages, and the stack starts at the end and grows toward the beginning. If the stack gets too large, it overwrites the task_struct.

This became our hypothesis. To test it, we took two approaches. I attempted to trigger a debug break during an overflow. I never succeeded in getting this to work. Meanwhile, a co-worker implemented what we called a DMZ. The area immediately after the task_struct was painted with a magic value (0xDEADBEEF). On each context switch, we checked to see if this area had been corrupted. If it was, we knew that the high water mark of the stack was dangerously close to the task_struct. Practically speaking, we knew we had an overflow. This worked and our bug was cornered.
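The DMZ idea can be sketched in user space. This is not the kernel code we wrote, just a minimal C++ model of the same layout: a two-page region with a control block at the low end, a painted guard zone right after it, and a simulated stack that fills downward from the high end. The TaskStruct fields, the guard size of eight words, and all function names are assumptions made for the illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the kernel's struct task_struct.
struct TaskStruct {
    int pid;
    char name[16];
};

constexpr std::uint32_t kMagic = 0xDEADBEEF;               // paint value from the text
constexpr std::size_t kRegionSize = 2 * 4096;              // two 4K pages
constexpr std::size_t kGuardOffset = sizeof(TaskStruct);   // DMZ starts right after the struct
constexpr std::size_t kGuardBytes = 8 * sizeof(std::uint32_t);  // DMZ size (assumed)
constexpr std::size_t kMaxSafeDepth = kRegionSize - kGuardOffset - kGuardBytes;

class ProcessRegion {
public:
    ProcessRegion() { paint_guard(); }

    // Fill the DMZ with the magic value.
    void paint_guard() {
        for (std::size_t off = 0; off < kGuardBytes; off += sizeof kMagic)
            std::memcpy(bytes_.data() + kGuardOffset + off, &kMagic, sizeof kMagic);
    }

    // The check we ran on each context switch: is every guard word still painted?
    bool guard_intact() const {
        for (std::size_t off = 0; off < kGuardBytes; off += sizeof kMagic) {
            std::uint32_t word;
            std::memcpy(&word, bytes_.data() + kGuardOffset + off, sizeof word);
            if (word != kMagic) return false;
        }
        return true;
    }

    // Simulate stack usage: the stack grows down from the end of the region,
    // so `depth` bytes of frames occupy the last `depth` bytes.
    void use_stack(std::size_t depth) {
        if (depth > kRegionSize) depth = kRegionSize;
        std::memset(bytes_.data() + (kRegionSize - depth), 0xAA, depth);
    }

private:
    std::array<unsigned char, kRegionSize> bytes_{};
};
```

Calling use_stack with anything up to kMaxSafeDepth leaves guard_intact() true; one byte more clobbers the guard, which is the signal that the stack has reached the DMZ and is about to run into the control block.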

To avoid the bug, we had to shrink the amount of memory placed on the stack. Static analysis of our code showed that we were allocating a scratch space of 2K on the stack. Since a page is only 4K in length, and there are fewer than two pages per process, this meant that we were throwing away over a quarter of our free space. Changing this to a vmalloc caused the bug to go away. For a while.

Our code was not alone in its guilt of sloppy stack allocations.  Eventually, even with the reduced footprint from our code, the other modules (bcm5700, ipmasq, etc) were enough to cause the stack overflow again.  Fortunately, by now we had the stack overflow detection code available and we were able to identify the problem quickly.  The end solution was to implement a second stack for our code.  Certain deep code paths that used a lot of stack space were modified to use values stored in a vmalloc-ed portion of memory instead of local variables.  The only local variable required for each stack frame was the offset into the vmalloc-ed area.  This was a fairly costly effort, but removed the bug for good.

I hope someone else can learn from our experiences here.

Cool bash trick #1

The trick: Use grep -q and check the return code to see only those that match.

When I would use it: I need to find a symbol in a bunch of shared objects.

Example:

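#!/bin/sh
# Print every shared object under DIR whose symbol table mentions SYMBOL.

if [ $# -ne 2 ]
then
    echo "usage: $0 DIR SYMBOL" >&2
    exit 1
fi

DIR=$1
SYMBOL=$2

for lib in "$DIR"/*.so.*
do
    if [ -r "$lib" ]
    then
        # grep -q prints nothing; its exit status is 0 only on a match
        objdump -t "$lib" | grep -q -e "$SYMBOL"
        if [ $? -eq 0 ]
        then
            echo "$lib"
        fi
    fi
done

To run it: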

./find-symbol.sh /usr/lib64 sendmsg

Validated Text Fields

Whenever I find myself doing Graphical User Interface based code, I find I need to validate text fields that should fit specific formats. Ideally, I would develop a template based field that would validate the string against a regular expression (RE) as part of the constructor. Since Java now has REs and templates as part of the language, and C++ has a great RE facility in Boost, both of these languages should support this. There are several types I’ve come across that would fit this category. Note that my Regex syntax is a little bastardized. I use the [] to indicate selecting one of the values inside and () to indicate a grouping.

Digit: [0-9]. I’ll call this D.

Hex digit: [0-9a-fA-F]. I’ll call this one X since I’ll use it below.

Social security number DDD-DD-DDDD

IP address: D?D?D.D?D?D.D?D?D.D?D?D

MAC address: XX:XX:XX:XX:XX:XX

ZIP code: DDDDD(-DDDD)?

Phone Number. This one gets tricky. Should you allow parentheses for the Area code? If you do, it gets harder to write as a regex. You have to do something like: ((\(DDD\))|(DDD))-DDD-DDDD. If you make the Area code optional, it gets even more complicated.
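For concreteness, here is one way the shorthand above might be spelled out with std::regex from the C++ standard library (rather than Boost). It is only a sketch: the constant names and the matches helper are made up for the example, a MAC address is taken to be six colon-separated hex pairs, and the IP pattern checks the shape only, not that each number is 255 or less.

```cpp
#include <regex>
#include <string>

// D and X expand to the character classes defined in the text.
const std::string D = "[0-9]";
const std::string X = "[0-9a-fA-F]";

// Hypothetical pattern names; each string is the shorthand written out in full.
const std::string kSsn = D + "{3}-" + D + "{2}-" + D + "{4}";
const std::string kZip = D + "{5}(-" + D + "{4})?";
const std::string kMac = X + "{2}(:" + X + "{2}){5}";
// Shape check only: "999.1.1.1" would still match.
const std::string kIpv4Shape = D + "{1,3}(\\." + D + "{1,3}){3}";
// Area code either wrapped in parentheses or bare, then -DDD-DDDD.
const std::string kPhone =
    "(\\(" + D + "{3}\\)|" + D + "{3})-" + D + "{3}-" + D + "{4}";

// True only when the whole of `text` matches `pattern`.
inline bool matches(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}
```

std::regex_match requires the entire string to match, which is what you want for a text field; std::regex_search would accept a valid value buried in other characters.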

Even more complicated is the regex for email addresses.

The interesting design decision here is how to implement this. Basically, you want a type that takes a regex as a template parameter, or something that can be converted into a regex. Here is a simple example in C++:

// The template parameter is a pointer, so each distinct array yields a distinct type.
template <char* re> class MyType {
    static char* mystring;
};

char s1[] = "ABC";
char s2[] = "123";

int main() {
    MyType<s1> s1type;
    MyType<s2> s2type;
    return 0;
}

The interesting thing about this is that the template is really only making two types based on the value of the pointer, not the value of the String field. This is not really any different than using an integer value. You would really want to use a typedef for this. That means that your Regex needs to be a single global instance of your regex class. My Java template Kung-Fu is not so strong; I can’t provide a comparable example for Java.

Since you may be validating a large volume of Data, you don’t want to throw exceptions. Ordinarily, I think exceptions would be correct, but there is some argument to be made that invalid data is part of normal processing. This is an ideal use of a policy that should be selected by the user.

This means you probably want to use a factory to create the container. The factory can then determine whether to return null, return a null object, or throw an exception if the string fails the parsing.
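A rough sketch of how the factory and the two error policies might fit together, assuming C++17 so that std::optional can play the role of "return null". Every name here (Validated, try_create, create, invalid_format, the pattern constants) is invented for the illustration, and it uses std::regex where Boost would work just as well.

```cpp
#include <optional>
#include <regex>
#include <stdexcept>
#include <string>
#include <utility>

// Pattern strings live at namespace scope so they can be template arguments.
constexpr char kZipPattern[] = "[0-9]{5}(-[0-9]{4})?";
constexpr char kSsnPattern[] = "[0-9]{3}-[0-9]{2}-[0-9]{4}";

// Hypothetical exception type for the throwing policy.
struct invalid_format : std::runtime_error {
    using std::runtime_error::runtime_error;
};

template <const char* Pattern>
class Validated {
public:
    // Policy 1: invalid input is normal processing; report it as an empty optional.
    static std::optional<Validated> try_create(const std::string& text) {
        if (!std::regex_match(text, regex())) return std::nullopt;
        return Validated(text);
    }

    // Policy 2: invalid input is exceptional; throw.
    static Validated create(const std::string& text) {
        std::optional<Validated> v = try_create(text);
        if (!v) throw invalid_format("bad format: " + text);
        return *v;
    }

    const std::string& str() const { return value_; }

private:
    // Private constructor: the only way to get an instance is through a factory.
    explicit Validated(std::string text) : value_(std::move(text)) {}

    // One compiled regex per instantiated type, built on first use.
    static const std::regex& regex() {
        static const std::regex re(Pattern);
        return re;
    }

    std::string value_;
};

using ZipCode = Validated<kZipPattern>;
using Ssn = Validated<kSsnPattern>;
```

Because the constructor is private, holding a ZipCode means the string has already passed validation, and the caller picks the policy by choosing try_create or create. The function-local static gives the single regex instance per type mentioned above.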

Regardless of your error handling scheme, you are going to end up with a lot of code like this:

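// Assumes T::factory throws invalid_format, and that invalid_format
// derives from std::exception.
template <class T>
void validateField(const std::string& fieldName, const std::string& str,
                   std::vector<std::pair<std::string, std::string> >& errorCollection) {
    try {
        T t = T::factory(str);
        // ... use t ...
    } catch (const invalid_format& e) {
        errorCollection.push_back(std::make_pair(fieldName, std::string(e.what())));
    }
}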

Usually something like this can be wrapped in a loop which is processed on form validation.

C++ Exceptions

As I get ready to code in C++ again full time, I was wondering about the cost of exceptions. It turns out that it costs you nothing at run time to have exception handling in your code unless you actually throw/catch them.

The compiler creates a table. It puts an entry point in the table for all exception handlers. The only potential cost to your code is that this exception handler may modify the logical flow of your function, causing the need for an additional jmp to get around it. More likely, the code will be put at the end of your function, and the jmp will be from within the exception handling code to the next instruction. For example:

#include <exception>
int a(){
int i = 0;
throw i;
}

int main(){
try{
a();
}catch(int i){

}
return 0;
}

Now the catch block doesn’t do anything here, but it will stop the exception from propagating. The return statement has to be called regardless of whether the catch block is entered. Here’s the result of compiling and then disassembling. Note that the function called ‘a’ above has its name mangled to _Z1av.

0000000000400758 <main>:
400758: 55 push %rbp
400759: 48 89 e5 mov %rsp,%rbp
40075c: 48 83 ec 10 sub $0x10,%rsp
400760: e8 c3 ff ff ff callq 400728 <_Z1av>
400765: eb 26 jmp 40078d <main+0x35>
400767: 48 89 45 f0 mov %rax,0xfffffffffffffff0(%rbp)
40076b: 48 83 fa 01 cmp $0x1,%rdx
40076f: 74 09 je 40077a <main+0x22>
400771: 48 8b 7d f0 mov 0xfffffffffffffff0(%rbp),%rdi
400775: e8 be fe ff ff callq 400638 <_Unwind_Resume@plt>
40077a: 48 8b 7d f0 mov 0xfffffffffffffff0(%rbp),%rdi
40077e: e8 a5 fe ff ff callq 400628 <__cxa_begin_catch@plt>
400783: 8b 00 mov (%rax),%eax
400785: 89 45 fc mov %eax,0xfffffffffffffffc(%rbp)
400788: e8 6b fe ff ff callq 4005f8 <__cxa_end_catch@plt>
40078d: b8 00 00 00 00 mov $0x0,%eax
400792: c9 leaveq
400793: c3 retq

Notice that the calls to begin_catch, unwind_resume, end_catch, etc. are all boilerplate exception handling code. The jmp at address 400765 skips over all of this and goes right to the return code at address 40078d.

What this means is that for code that does not throw an exception, there is no cost in the calling function. The runtime cost of exception handling may be high if an exception is thrown. Thus, exception handling should not be in the default path, merely the exceptional.
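If you want to see the difference on your own machine, a crude timing harness along these lines should do. It is a sketch, not a rigorous benchmark: checked_step, run, and RunResult are names made up for the example, and the numbers will vary with compiler, flags, and platform.

```cpp
#include <chrono>
#include <cstddef>

// Throws only when asked to; otherwise does trivial work.
int checked_step(int i, bool fail) {
    if (fail) throw i;
    return i + 1;
}

struct RunResult {
    std::size_t caught;                // how many exceptions were handled
    std::chrono::nanoseconds elapsed;  // wall time for the whole loop
};

// Run the same try/catch loop with or without an exception on every iteration.
RunResult run(std::size_t iterations, bool fail) {
    std::size_t caught = 0;
    volatile int sink = 0;  // keeps the call from being optimized away
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t n = 0; n < iterations; ++n) {
        try {
            sink = checked_step(static_cast<int>(n), fail);
        } catch (int) {
            ++caught;
        }
    }
    const auto stop = std::chrono::steady_clock::now();
    return {caught,
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)};
}
```

Comparing run(n, false).elapsed with run(n, true).elapsed for the same n gives a feel for how much the throwing path costs relative to the path where the try block is present but never triggered.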

Hello world!

Whenever I contemplated starting a Web Log (yes, that is where Blog comes from) I could never justify putting it on someone else’s site. So, finally, I’ve decided to post it here on a site I administer, with a domain name that means something.

This blog is going to be a mix of history, self-analysis, technological discussions, music, perhaps a touch of politics, and random musings. I’ve been through enough in my 36 years that I feel, just maybe, I have something to say.

A little about me (in no particular order):

I am a software engineer. I’ve worked on a very varied set of software projects in my time as a coder. I hope to use this forum as a method to analyze what I have done, learn from it, and generate new ideas for future development. While there are a million blogs out there that cover software engineering, most come from a very specific direction (Java, PHP, eCommerce) and I hope to get a level above that.

Currently I work for Penguin Computing. I don’t mind saying the name of the company since I am leaving them on good terms in a couple of weeks. I am not leaving because I am unhappy in my job; I actually like it a lot. My reasons for leaving come from my desire to move across country. Penguin is a Linux company that has focused on High Performance Clustering, a very different type of system than enterprise development. My previous work has covered eCommerce, reporting systems, database-driven sites for health-care, and network storage configuration.

I am about to start working for a company in the Cambridge area that most people in the tech world have heard of, but fewer people in other fields. I’ll limit my discussions about things at work to general technical issues. My goal here is to avoid a conflict of interest.

I am a married man. My wife is currently finishing her PhD in biostatistics. Biostat is the mathematical modeling used in public health studies. Since I am a coder, and her work involves programming, I’ve helped her out and learned a thing or two about programming in R, the statistical language she uses.

My wife and I have a one year old son. Aside from the joy every father should feel in his child’s development, I am also fascinated by the opportunity to learn about learning. So much of programming is about developing systems that can handle wider and wider ranges of situations that it is fascinating to see the ultimate software/hardware system in its early development stage. Of course, sleep deprivation may inhibit my ability to really process a lot of this.