Break on NaN in gdb


Recently I had to debug a case where, somewhere during a numerically intensive computation (solving an ordinary differential equation), a value would become NaN (“not a number”). This happens, for example, when taking the logarithm or the square root of a negative number, dividing 0 by 0, and so on. However, I had no idea where and why NaNs appeared in this particular program.

So here I’ll show how to detect this using gdb, the GNU debugger. Here is the program that we will be debugging:

/* nan.c */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

void f1(double *y) {
  *y -= 2.0;
}

void f2(double *z) {
  *z = sqrt(*z);
  *z += 1.0;
}

int main() {
  double *x = malloc(sizeof(double));
  *x = 5e4;
  int i;
  for (i = 0; i < 1000; i++) {
    f1(x);
    f2(x);
  }
  printf("%f\n", *x);
  free(x);
}

Compile it with

gcc -g -Wall -pedantic -o nan nan.c -lm

and confirm that it produces a NaN:

% ./nan
-nan

Now start gdb and proceed to the point where the storage for our double value has been allocated:

% gdb -q ./nan
Reading symbols from ./nan...
(gdb) start
Temporary breakpoint 1 at 0x4011cd: file nan.c, line 15.

Temporary breakpoint 1, main () at nan.c:15
15    double *x = malloc(sizeof(double));
(gdb) next
16    *x = 5e4;

At this point we can do two useful things.

First, we turn on the program execution log, so that we can go back in time once we encounter a NaN.

(gdb) set record full stop-at-limit off
(gdb) record full

Second, we set a watchpoint that will trigger once *x becomes NaN. But how do we express that?

If we were programming in C, we would use the isnan() function. But isnan() is not actually a C function; it is a preprocessor macro, and by default it is not available in gdb. And even if I make it available (by compiling the program with -g3, which records macro definitions in the debug information), it still doesn’t work, at least on my system:

(gdb) macro expand isnan(1.0)
expands to: __builtin_isnan (1.0)
(gdb) print isnan(1.0)
No symbol "__builtin_isnan" in current context.

Luckily, there are a few workarounds. Perhaps the simplest and the most reliable one is to exploit the fact that NaN is the only floating-point value not equal to itself. Therefore we can set a conditional watchpoint like this:

(gdb) watch *x if *x != *x
Hardware watchpoint 2: *x

Some other options are:

In any case, once we’ve started recording the execution log and set a watchpoint, we are ready to resume the program and wait for the condition to trigger:

(gdb) continue
Continuing.

Hardware watchpoint 2: *x

Old value = -0.19219211672498604
New value = -nan(0x8000000000000)
f2 (z=0x4052a0) at nan.c:11
11    *z += 1.0;

Now we know that the NaN was produced in f2 by the sqrt() call on line 10, just before line 11; the old value shows that sqrt() was given a negative number. If this is not enough to diagnose the bug, we can use the execution log to go back in time. Let’s say we want to find out what the value of *x was before the last f1 call.

(gdb) tbreak f1
Temporary breakpoint 3 at 0x40115e: file nan.c, line 6.
(gdb) reverse-continue
Continuing.

Temporary breakpoint 3, f1 (y=0x4052a0) at nan.c:6
6     *y -= 2.0;
(gdb) print *y
$1 = 1.807807883275014