Recently I had to debug a case where, somewhere during a numerically-intensive computation (solving an ordinary differential equation), a value would become
NaN (“not a number”). This happens, for example, when taking a logarithm or a square root of a negative number, dividing 0 by 0, stuff like that. However, I had no idea where and why NaNs appeared in this particular program.
So here I’ll show how to detect this using gdb, the GNU debugger. Here is the program that we will be debugging:
Compile it with
gcc -g -Wall -pedantic -o nan nan.c -lm
and confirm that it produces a NaN:
% ./nan
-nan
Now, start gdb and proceed to the point where the storage for our
double value is allocated:
Reading symbols from ./nan...
(gdb) start
Temporary breakpoint 1 at 0x4011cd: file nan.c, line 15.
Temporary breakpoint 1, main () at nan.c:15
15 double *x = malloc(sizeof(double));
(gdb) next
16 *x = 5e4;
At this point we can do two useful things.
First, we turn on the program execution log, so that we can go back in time once we encounter a NaN.
(gdb) set record full stop-at-limit off
(gdb) record full
Second, we set a watchpoint that will trigger once
*x becomes NaN. But how do we express that?
If we were programming in C, we would use the
isnan() function. But
isnan() is not a C function; it is a preprocessor macro, so by default it is not available in gdb. And even if I make it available (by compiling the program with
-g3), it still doesn’t work, at least on my system:
(gdb) macro expand isnan(1.0)
expands to: __builtin_isnan (1.0)
(gdb) print isnan(1.0)
No symbol "__builtin_isnan" in current context.
Luckily, there are a few workarounds. Perhaps the simplest and the most reliable one is to exploit the fact that NaN is the only floating-point value not equal to itself. Therefore we can set a conditional watchpoint like this:
(gdb) watch *x if *x != *x
Hardware watchpoint 2: *x
Some other options are:
Figure out the underlying C function used to implement the
isnan macro. On my system, this seems to work:
(gdb) p ((int (*)(double))__isnan)(sqrt(-1.0))
$1 = 1
(gdb) p ((int (*)(double))__isnan)(sqrt(1.0))
$2 = 0
However, it may differ on your system/compiler/standard library.
Wrap the standard
isnan macro in a C function in your program to make it available inside gdb:
(Naming your wrapper function
isnan might work too. In particular, it doesn’t necessarily lead to an infinite recursion because the inner
isnancall will be expanded by the preprocessor. However, the C standard explicitly says (section 7.1.3) that the standard macro names are reserved identifiers, and redefining these identifiers results in undefined behavior.)
You can try to inspect the bit pattern of the floating-point number directly. See e.g. this answer by Paul Pluzhnikov. Note, however, that IEEE 754 NaNs do not have a fixed bit pattern as Paul appears to assume (they may have an arbitrary fraction part), so unless you know exactly what NaN you are expecting, you have to be more careful.
In any case, once we’ve started recording the execution log and set a watchpoint, we are ready to resume the program and wait for the condition to trigger:
(gdb) continue
Continuing.
Hardware watchpoint 2: *x
Old value = -0.19219211672498604
New value = -nan(0x8000000000000)
f2 (z=0x4052a0) at nan.c:11
11 *z += 1.0;
Now we know that NaN was produced in
f2 right before line 11. If this is not enough to diagnose the bug, we can use the execution log to go back in time. Let’s say we want to find out what the value of
*x was before the last call to f1:
(gdb) tbreak f1
Temporary breakpoint 3 at 0x40115e: file nan.c, line 6.
(gdb) reverse-continue
Continuing.
Temporary breakpoint 3, f1 (y=0x4052a0) at nan.c:6
6 *y -= 2.0;
(gdb) print *y
$1 = 1.807807883275014