Friday 7 October 2011

Gdb debugging...

Last month my brother came to visit and we did what every normal family would do: have a beer and talk comp-sci. When he asked me how we debugged programs I had to sheepishly admit that I use printf... Part of the issue is that OCaml doesn't have good support for ELF debugging annotations (the patch in PR#4888 hopes to address some of that), but it also comes from the fact that I am just not really up to speed on my debugging tools... Printf debugging has nearly no learning curve but it is slow and painful. Tools like gdb, strace, gprof and valgrind come with a steeper learning curve (especially for higher-level languages, because they require you to peer under the abstraction) but are the way to go in the long run.

So this month is going to be no-printf-debugging month, which means that I will modify the source code of a program to debug it only as a last resort.

Meet the culprit

Today I had a quick look at a problem with the compiler itself. The compiler segfaulted while compiling code with very long lists (PR#5368).

The following bash command will generate a file (big_list.ml) that will cause the failure:

cat > big_list.ml <<EOF
let big x =[
  $(yes "true;" | head -n 100000)
 ]
EOF

Looking at backtraces

This smelled like a stack overflow (the stack size is fixed: if you chain too many function calls you blow your stack and might get a segfault). Sure enough, after raising the size of the stack (ulimit -s 50000) the compilation ran fine... So we are probably looking at a stack overflow. Those are usually caused by non-tail-call recursion and are really easy to find:
  • load the binary in gdb with gdb --args ocamlopt.opt big_list.ml
  • run it (run) until it blows up
  • look at the stack (bt): one or several functions should appear over and over.
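
When the backtrace is thousands of frames deep, spotting the repeated function by eye gets tedious. Here is a minimal sketch of a helper that counts function names in a backtrace; top_frames is a name I made up, and it assumes frame lines of the form "#N 0x... in name ()", which is what gdb prints when there is no debug information.

```shell
# top_frames [N]: read a gdb backtrace on stdin and print the N (default 5)
# most frequent function names, each prefixed by its number of occurrences.
# Assumes frame lines shaped like "#12 0x0000... in some_function ()".
top_frames() {
  awk '/^#[0-9]+/ {
         name = $2                       # fallback when the line has no "in"
         for (i = 2; i < NF; i++)
           if ($i == "in") { name = $(i + 1); break }
         print name
       }' \
    | sort | uniq -c | sort -rn | head -n "${1:-5}"
}

# Possible usage (needs gdb and the compiler binary, so it is not run here):
#   gdb --batch -ex run -ex bt --args ocamlopt.opt big_list.ml 2>&1 | top_frames
```

The function that dominates the count is the first suspect for the non-tail recursion.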

Using breakpoints

In my case the stack trace was a bit anticlimactic:
#0  0x000000000058150d in camlIdent__find_same_16167 ()
#1  0x0000000000000000 in ?? ()
The lack of proper backtrace could be due to one of several things:
  • ocamlopt's calling convention for functions is not the same as C's, and this could throw off gdb
  • the OCaml runtime has code to detect stack overflows (./asmrun/signals_asm.c). It works by registering a signal handler for SIGSEGV, examining the faulting address and raising an exception if the fault looks like a stack overflow. This code runs inside a Unix signal handler, which is a very restricted environment in which you are not allowed to do much (e.g. you cannot call malloc); it might be doing something illegal and/or messing up the stack.
We did however get one function name out of it: camlIdent__find_same_16167. The OCaml compiler assigns symbols to functions following this naming convention: caml<module name>__<function name>_<integer>. In this case the function is the find_same function in the Ident (typing/ident.ml) module. Let's have a look at who's calling this function by using breakpoints. So before calling run in gdb we set a breakpoint on the function.

(gdb) break camlIdent__find_same_16167
Breakpoint 1 at 0x5814f0
(gdb) run
Starting program: /opt/ocaml-exp/bin/ocamlopt.opt big_list.ml

Breakpoint 1, 0x00000000005814f0 in camlIdent__find_same_16167 ()
We want to cross this breakpoint enough times to get a nice, fat backtrace.
(gdb) ignore 1 500
Will ignore next 500 crossings of breakpoint 1.
(gdb) continue
Continuing.

Breakpoint 1, 0x00000000005814f0 in camlIdent__find_same_16167 ()
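
The interactive session above can also be replayed non-interactively. This is only a sketch: gdb_deep_bt_commands and the file name deep_bt.gdb are names I made up, and it assumes the breakpoint is the first one set in the session (so gdb numbers it 1). The ignore count is set before run, which skips the same number of crossings as doing it after the first stop, minus that first stop.

```shell
# gdb_deep_bt_commands SYMBOL [SKIP]: print a gdb command file that breaks on
# SYMBOL, ignores the first SKIP (default 500) crossings, then prints the
# backtrace at the next crossing.
gdb_deep_bt_commands() {
  symbol=$1
  skip=${2:-500}
  # "ignore 1" refers to breakpoint number 1, i.e. the one created just above.
  printf 'break %s\nignore 1 %s\nrun\nbt\n' "$symbol" "$skip"
}

# Possible usage (needs gdb and the compiler binary, so it is not run here):
#   gdb_deep_bt_commands camlIdent__find_same_16167 500 > deep_bt.gdb
#   gdb --batch -x deep_bt.gdb --args ocamlopt.opt big_list.ml
```
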
Looking at the backtrace we can now clearly see that camlTypecore__type_construct_206357 appears a lot on the stack and, sure enough, type_construct in typing/typecore.ml is not tail recursive. In our case the easiest solution is probably to change our code generator to output the list in chunks:
let v0=[]

let v1= true::true:: ..... ::v0
let v2= true::true:: ..... ::v1
....
let v = vn
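
As a rough sketch of what such a generator could look like in shell: gen_chunked_list is a hypothetical helper, and the chunk size of 1000 in the usage line is an arbitrary guess rather than a measured limit. Each chunk is a separate top-level binding, so the nesting depth the type checker sees is bounded by the chunk size instead of the total length.

```shell
# gen_chunked_list TOTAL CHUNK: print OCaml source defining v, a list of TOTAL
# "true" elements, built from bindings of at most CHUNK elements each.
gen_chunked_list() {
  total=$1
  chunk=$2
  echo 'let v0 = []'
  i=0
  remaining=$total
  while [ "$remaining" -gt 0 ]; do
    n=$chunk
    if [ "$remaining" -lt "$n" ]; then n=$remaining; fi
    i=$((i + 1))
    # Emit "true :: " n times, then cons the result onto the previous chunk.
    elems=$(awk -v n="$n" 'BEGIN { for (j = 0; j < n; j++) printf "true :: " }')
    printf 'let v%d = %sv%d\n' "$i" "$elems" "$((i - 1))"
    remaining=$((remaining - n))
  done
  echo "let v = v$i"
}

# Possible usage:
#   gen_chunked_list 100000 1000 > big_list.ml
```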

Finding a function's symbol

Last but not least: if you wanted to put a breakpoint in typecore.ml on the function type_argument you'd have to figure out the symbol name:
> nm /opt/ocaml-exp/bin/ocamlopt.opt | grep camlTypecore__type_argument
0000000000526b70 T camlTypecore__type_argument_206355
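
The caml<module name>__<function name>_<integer> convention can be wrapped in a couple of small helpers. This is a sketch only: the three function names are made up, and the splitting helper assumes the module name itself contains no double underscore.

```shell
# Build an extended regex matching the symbol(s) generated for Module.function.
caml_symbol_pattern() {
  printf 'caml%s__%s_[0-9]+$' "$1" "$2"
}

# find_caml_symbol BINARY MODULE FUNCTION: list matching symbols in BINARY.
# nm prints "address type name", hence the leading space in the pattern.
find_caml_symbol() {
  nm "$1" | grep -E " $(caml_symbol_pattern "$2" "$3")"
}

# split_caml_symbol SYMBOL: print "Module function" for a symbol name.
split_caml_symbol() {
  printf '%s\n' "$1" | sed -E 's/^caml(.+)__(.+)_[0-9]+$/\1 \2/'
}

# Possible usage:
#   find_caml_symbol /opt/ocaml-exp/bin/ocamlopt.opt Typecore type_argument
#   split_caml_symbol camlIdent__find_same_16167
```

Anchoring the pattern on the trailing integer avoids also matching other functions whose names merely start with type_argument.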