Aug 23

Stupid mistake: Forgetting to return a value

No matter how long you’ve been programming, you’re bound to hit a problem at some point which takes you multiple hours or days to fix, and which turns out to be a simple mistake on your part. This post is the first in a new series I’ll be writing on stupid programming mistakes I’ve made in the past (and that I’m not particularly proud of).

A couple of days ago, I hit a problem when testing the LAN Bridger central server, which is hosted on a Linux box. I do most of my development and testing on Windows, though, so as a result the LAN Bridger server runs on both Windows and Linux for ease of debugging. From time to time, and especially toward the end of a release cycle, I typically have to compile and test the server thoroughly on Linux to make sure everything works.

I usually do all of my testing with debug builds (contrary to Ken’s beliefs). Once I’m sure everything runs smoothly and I don’t get any assertions or erratic behavior, I turn to release builds. In this particular case, the server worked flawlessly for debug builds, but exhibited a rather strange behavior for release builds.

LAN Bridger uses a connection manager, which is responsible for timing out connections if they become idle. It basically checks to see when the last packet was received and if it’s more than a certain threshold, the connection is closed. When testing the debug build of the server, the connections were correctly timed out and removed. For release builds, not so much.

My first cause of action was to attach a debugger (gdb) to the server process. Unfortunately, since this was a release build, debugging became much harder. For instance, as shown in the stack trace below it was pretty difficult to identify which thread corresponded to the connection manager, even after showing the stack traces for each thread:

(gdb) info threads
  9 Thread 114696 (LWP 2979) 0x08129acf in nanosleep () at cryptlib.h:1144
  8 Thread 98311 (LWP 2978)  0x08129891 in accept () at cryptlib.h:1144
  7 Thread 81926 (LWP 2977)  0x081299b1 in recvfrom () at cryptlib.h:1144
* 6 Thread 65541 (LWP 2976)  0x08129acf in nanosleep () at cryptlib.h:1144
  5 Thread 49156 (LWP 2975)  __pthread_sigsuspend (set=0xb6f8dcdc) at ../linuxthreads/sysdeps/
                             unix/sysv/linux/pt-sigsuspend.c:56
  4 Thread 32771 (LWP 2974)  __pthread_sigsuspend (set=0xb778dcdc) at ../linuxthreads/sysdeps/
                             unix/sysv/linux/pt-sigsuspend.c:56
  3 Thread 16386 (LWP 2973)  __pthread_sigsuspend (set=0xb7f8dcdc) at ../linuxthreads/sysdeps/
                             unix/sysv/linux/pt-sigsuspend.c:56
  2 Thread 32769 (LWP 2972)  0x082936fa in poll ()
  1 Thread 16384 (LWP 2969)  __pthread_sigsuspend (set=0xbf97834c) at ../linuxthreads/sysdeps/
                             unix/sysv/linux/pt-sigsuspend.c:56

By manually setting a breakpoint in the connection manager’s main loop, I was able to determine that it was thread number 6. The stack trace for that particular thread was very uninformative:

(gdb) bt
#0  0x08129acf in nanosleep () at cryptlib.h:1144
#1  0x00000000 in ?? ()

The first thought that hit me was that I was trashing the stack, but setting a breakpoint in the connection manager thread clearly indicated that it was just a misleading stack trace and that the thread was running as intended.

What puzzled me the most was why connections were removed correctly for debug builds, but not for release builds. The server also ran flawlessly on Windows. The only real difference between the debug and release builds was the -O2 compiler flag. To rule out the possibility that the compiler was making a funky optimization error, I decided to upgrade gcc to the latest version (4.3).

This was my first real mistake – never assume that the compiler is doing something wrong until you’ve thoroughly examined all other possibilities. I guess I got blinded by certain optimization bugs I had read about in the past. Unfortunately, the bug also manifested itself even when the server was compiled with the updated version of gcc.

Having wasted valuable time upgrading the compiler and making both the server and all of its third-party dependencies compile and link, I went back to the root of the problem. By stepping through the connection manager, I noticed that the variable, which held the time the last packet was received, was zero. Interestingly, that particular variable was only being assigned values from a single function, GetTickCount(), which was supposed to return a monotonically increasing number, i.e. it should never be zero.

Here’s where I made the second mistake. Rather than just checking the return value of the GetTickCount() function, I decided to verify that the calculation in GetTickCount() was correct. (I’d recently updated the implementation of the function so I thought it quite possible that I’d made a mistake somewhere.) The result of the calculation, however, was correct.

Finally, I checked the actual return value of GetTickCount(). The value of the eax register was 0. This was impossible given that I’d just verified that the calculation was correct.. unless I’d forgotten to return the result!

My own fault

There were several reasons why this mistake had gone unnoticed. First of all, I had originally coded the function while on Windows and it never got compiled by the Visual Studio compiler since the function was only defined for Linux. Secondly, I had deliberately left out the ‘-Wall‘ compiler flag because I got a ton of warnings from third-party libraries. Unfortunately, this meant I didn’t get a warning when I forgot to return a value from the GetTickCount() function. (Compare that to the Visual Studio compiler which automatically promotes such [rather serious] warnings to errors.)

These are both mistakes on my end, and indeed some that I should’ve caught. However, the point of this story is to highlight just how easily small things can lead you astray. For instance, I didn’t notice that the problem was actually in the GetTickCount() function until much later in the process after taking a major detour.

I can’t help but wonder how quickly I would’ve caught this mistake if I was debugging on Windows. As you might have guessed, I’m much more comfortable with WinDbg and the Visual Studio debugger than I am with gdb, and it wouldn’t have taken me more than a few seconds to notice the incorrect value of the timestamp variable, or the incorrect return value of the GetTickCount() function. A bug like this really shows the importance of knowing your tools — and knowing them well. On the plus side, though, I got a chance to work much closer with gdb than I usually do.

Wrapping Up: Why it worked for debug builds

The GetTickCount() function is used to get a timestamp that can be used in timeout operations. The reason the connection manager worked correctly for debug builds is due to how the compiler repurposes the registers. On x86, return values are stored in the eax register, but since GetTickCount() didn’t return a value, eax was essentially left untouched and would instead contain the result of a previous calculation or the return value from another function.

For the debug builds, when receiving a packet and storing the current time in a timestamp variable, GetTickCount() might return zero due to the way the compiler had purposed the registers. When the connection manager later checked to see if the connection had timed out by subtracting the current value of GetTickCount(), which might be non-zero depending on the current value of the eax register, with the previously stored timestamp, the difference might be larger than the timeout threshold and the connection would be removed.

For the release builds, eax was always zero when GetTickCount() was called, and thus the timeout threshold was never exceeded and no connections were ever removed.

If you’d like to share your own stupid mistake stories, drop me a line at [dreijer at echobit dot net] and I’ll publish them here (with due credit, of course).

No comments, yet.

Add a comment.

Name:
E-mail:
Website:
reCAPTCHA:
 
Comment: