Break on NaN in gdb

Published on January 11, 2020

Recently I had to debug a case where, somewhere during a numerically intensive computation (solving an ordinary differential equation), a value would become NaN (“not a number”). This happens, for example, when taking the logarithm or the square root of a negative number, or when dividing 0 by 0. However, I had no idea where and why NaNs appeared in this particular program.

So here I’ll show how to detect this using gdb, the GNU debugger. Here is the program that we will be debugging:

/* nan.c */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

void f1(double *y) {
  *y -= 2.0;
}

void f2(double *z) {
  *z = sqrt(*z);
  *z += 1.0;
}

int main() {
  double *x = malloc(sizeof(double));
  *x = 5e4;
  int i;
  for (i = 0; i < 1000; i++) {
    f1(x);
    f2(x);
  }
  printf("%f\n", *x);
  free(x);
}

Compile it with

gcc -g -Wall -pedantic -o nan nan.c -lm

and confirm that it produces a NaN:

% ./nan
-nan

Now, start gdb and proceed to the point where the storage for our double value is allocated:

Reading symbols from ./nan...
(gdb) start
Temporary breakpoint 1 at 0x4011cd: file nan.c, line 15.

Temporary breakpoint 1, main () at nan.c:15
15    double *x = malloc(sizeof(double));
(gdb) next
16    *x = 5e4;

At this point we can do two useful things.

First, we turn on the program execution log, so that we can go back in time once we encounter a NaN.

(gdb) set record full stop-at-limit off
(gdb) record full

Second, we set a watchpoint that will trigger once *x becomes NaN. But how do we express that?

If we were programming in C, we would use isnan(). But isnan() is not a C function; it is a preprocessor macro, and macros are by default not available in gdb. And even if I make it available (by compiling the program with -g3), it still doesn’t work, at least on my system:

(gdb) macro expand isnan(1.0)
expands to: __builtin_isnan (1.0)
(gdb) print isnan(1.0)
No symbol "__builtin_isnan" in current context.

Luckily, there are a few workarounds. Perhaps the simplest and the most reliable one is to exploit the fact that NaN is the only floating-point value not equal to itself. Therefore we can set a conditional watchpoint like this:

(gdb) watch *x if *x != *x
Hardware watchpoint 2: *x

Some other options are:

- Figure out the underlying C function used to implement the isnan macro. On my system, this seems to work:
  ```
  (gdb) p ((int (*)(double))__isnan)(sqrt(-1.0))
  $1 = 1
  (gdb) p ((int (*)(double))__isnan)(sqrt(1.0))
  $2 = 0
  ```
  However, it may differ on your system/compiler/standard library.
- Wrap the standard isnan macro in a C function in your program to make it available inside gdb:
  ```
  int myisnan(double x) {
    return isnan(x);
  }
  ```
  (Naming your wrapper function isnan might work too. In particular, it doesn’t necessarily lead to an infinite recursion because the inner isnan call will be expanded by the preprocessor. However, the C standard explicitly says (section 7.1.3) that the standard macro names are reserved identifiers, and redefining these identifiers results in undefined behavior.)
- You can try to inspect the bit pattern of the floating-point number directly. See e.g. this answer by Paul Pluzhnikov. Note, however, that IEEE 754 NaNs do not have a fixed bit pattern as Paul appears to assume (they may have an arbitrary fraction part), so unless you know exactly what NaN you are expecting, you have to be more careful.

In any case, once we’ve started recording the execution log and set a watchpoint, we are ready to resume the program and wait for the condition to trigger:

(gdb) continue
Continuing.

Hardware watchpoint 2: *x

Old value = -0.19219211672498604
New value = -nan(0x8000000000000)
f2 (z=0x4052a0) at nan.c:11
11    *z += 1.0;

Now we know that the NaN was produced in f2 right before line 11, that is, by the sqrt call on line 10: the old value, about -0.19, was negative. If this is not enough to diagnose the bug, we can use the execution log to go back in time. Let’s say we want to find out what the value of *x was before the last f1 call.

(gdb) tbreak f1
Temporary breakpoint 3 at 0x40115e: file nan.c, line 6.
(gdb) reverse-continue
Continuing.

Temporary breakpoint 3, f1 (y=0x4052a0) at nan.c:6
6     *y -= 2.0;
(gdb) print *y
$1 = 1.807807883275014