Spurious segfaults in sv-benchmarks runs #1520

Closed
sim642 opened this issue Jun 21, 2024 · 9 comments

Assignee: sim642
Labels: bug, sv-comp, type-safety
Milestone: SV-COMP 2025

sim642 commented Jun 21, 2024

For a while now, sv-benchmarks runs have been showing a number of spurious SEGMENTATION FAULT results, which land on different tasks on every run. In our latest run, it was 157/31498 tasks (0.5%). These make BenchExec diff tables particularly annoying to look at: one has to filter those statuses out of both columns to focus on actual differences.

In the SV-COMP 2024 results we had zero of them, so something must have happened since. AFAIK, segfaults in OCaml can only be due to misuse of Obj or bad C stubs (which we use for float domains). I opened #1371 at one point but convinced myself that our stubs were actually fine. Maybe they still aren't? But the same stubs were already in place at SV-COMP 2024, so the same issues should've popped up then as well.
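
To illustrate the stub failure mode (a minimal hypothetical sketch in C, not our actual float-domain stubs): the classic bug is holding a value across an allocating call without registering it as a GC root.

/* Hypothetical stub illustrating the classic GC-safety bug class. */
#include <caml/mlvalues.h>
#include <caml/alloc.h>
#include <caml/memory.h>

value broken_pair(value a)          /* BUG: a is never registered as a root */
{
  value b = caml_copy_double(1.0);  /* may run the GC and move a... */
  value p = caml_alloc_small(2, 0); /* ...and this may move (or free) b */
  Field(p, 0) = a;                  /* possibly stores dangling pointers */
  Field(p, 1) = b;                  /* into a live heap block */
  return p;
}

value fixed_pair(value a)           /* correct: register everything that */
{                                   /* must survive an allocation */
  CAMLparam1(a);
  CAMLlocal2(b, p);
  b = caml_copy_double(1.0);
  p = caml_alloc_small(2, 0);
  Field(p, 0) = a;
  Field(p, 1) = b;
  CAMLreturn(p);
}

The nasty part is that the broken version usually works: the corruption only bites when the GC happens to run at exactly the wrong allocation.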

Their spurious nature also makes them extremely difficult to debug. Since no specific task is affected, one cannot just bisect and run that task under gdb; rather, at every bisect step, a full sv-benchmarks run needs to be done. And that only yields the first problematic commit at best: there's still no way to get a full gdb backtrace while running Goblint under BenchExec.

sim642 added the bug, type-safety and sv-comp labels Jun 21, 2024
sim642 added this to the SV-COMP 2025 milestone Jun 21, 2024
michael-schwarz commented

IIRC, stack overflows that happen while executing native code (e.g., in Apron) also produce segfaults, right?
I always assumed that these explain most of those cases.

sim642 commented Jun 21, 2024

That's true, and it can even happen without things like Apron: if the C stack happens to overflow exactly while executing some OCaml runtime routine (implemented in C), that also produces a segfault.

However, I'm still doubtful that that's what's happening here. If a particular program causes a stack overflow (whether in OCaml or C code at the time), it should be consistent and repeatable. We don't invoke any randomness, so Goblint should be deterministic in what it does and how deep each recursion goes. One run shouldn't somehow recurse deeper than a second, identical run.
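
For reference, the scenario being ruled out here (a hypothetical C helper, not code we actually ship): when the stack overflows while executing C code, the runtime generally cannot turn it into a Stack_overflow exception, so the process dies with a plain SIGSEGV.

/* Hypothetical helper: unbounded recursion on the C stack. */
#include <caml/mlvalues.h>

static long deep(long n)
{
  volatile long frame[64];  /* force a real stack frame per call */
  frame[0] = n;
  return n == 0 ? 0 : frame[0] + deep(n - 1);
}

value overflow_stub(value n)
{
  /* with a huge n this hits the guard page inside C code: a hard
     segfault, not a Stack_overflow exception */
  return Val_long(deep(Long_val(n)));
}

But crucially, for a fixed input this either overflows every time or never, which is exactly why it doesn't explain the spuriousness.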

sim642 commented Jun 27, 2024

#1522 doesn't seem to have helped with this, sadly. avsm/ocaml-yaml#75 fixed a segfault in YAML with large values, but that doesn't seem to have been the cause here.

sim642 commented Jun 27, 2024

I just noticed something in my latest ghost witness generation runs (https://goblint.cs.ut.ee/results/188-concurrency-witness-simpl/table-generator.table.html#/table): there are no segfaults in the non-relational-only runs, but there are segfaults in the relational ones. So maybe this is related to Apron somehow.

EDIT: I coincidentally managed to reproduce this under gdb, where bt shows the following:

#0  caml_darken (v=0, p=0x7fffffffb728) at major_gc.c:285
#1  0x00005555577da51d in caml_do_local_roots_nat (f=f@entry=0x5555577dc930 <caml_darken>, bottom_of_stack=<optimized out>, last_retaddr=<optimized out>, gc_regs=<optimized out>, local_roots=<optimized out>)
    at roots_nat.c:514
#2  0x00005555577da6e5 in caml_do_roots (f=0x5555577dc930 <caml_darken>, do_globals=<optimized out>) at roots_nat.c:432
#3  0x00005555577da732 in caml_darken_all_roots_start () at roots_nat.c:357
#4  0x00005555577dd250 in start_cycle () at major_gc.c:407
#5  caml_major_collection_slice (howmuch=howmuch@entry=-1) at major_gc.c:1089
#6  0x00005555577de741 in caml_gc_dispatch () at minor_gc.c:500
#7  0x00005555577de851 in caml_check_urgent_gc (extra_root=<optimized out>, extra_root@entry=1) at minor_gc.c:575
#8  0x00005555577de8e4 in caml_alloc_small_dispatch (wosize=2, flags=flags@entry=1, nallocs=nallocs@entry=1, encoded_alloc_lens=encoded_alloc_lens@entry=0x0) at minor_gc.c:524
#9  0x00005555577e00f9 in caml_alloc_small (wosize=2, tag=tag@entry=255) at alloc.c:68
#10 0x00005555577f4645 in alloc_custom_gen (ops=0x5555587e7c00 <camlidl_apron_custom_var_ptr>, bsz=<optimized out>, mem=0, max_major=1, mem_minor=0, max_minor=1) at custom.c:50
#11 0x000055555777d5d4 in camlidl_environment_ap_environment_vars ()
#12 0x0000555556dd5a0b in camlGoblint_lib__ApronDomain__join_2689 () at src/cdomains/apron/apronDomain.apron.ml:568
[...]

Although this doesn't necessarily mean Apron did something wrong. It just means that when Apron wanted to allocate on the OCaml heap, the GC ran, and something in the OCaml heap had already been corrupted at some earlier point, such that the GC now crashes following an invalid pointer.
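
To spell that out (a hypothetical sketch; dummy_ops and innocent_stub are made-up names, not Apron's actual bindings), this matches the alloc_custom_gen -> caml_check_urgent_gc -> caml_major_collection_slice -> caml_darken chain in the backtrace above:

/* A perfectly correct stub that can still end up in the crash backtrace:
   its allocation merely triggers the GC slice that trips over corruption
   left behind earlier by some other, buggy stub. */
#include <caml/mlvalues.h>
#include <caml/alloc.h>
#include <caml/memory.h>
#include <caml/custom.h>

static struct custom_operations dummy_ops = {
  "example.dummy",
  custom_finalize_default, custom_compare_default, custom_hash_default,
  custom_serialize_default, custom_deserialize_default,
  custom_compare_ext_default, custom_fixed_length_default
};

value innocent_stub(value unit)
{
  CAMLparam1(unit);
  CAMLlocal1(v);
  /* this allocation may start a major GC slice, which darkens all roots
     (caml_darken); if an earlier stub planted a dangling value, the
     segfault happens here, with this innocent stub on the stack */
  v = caml_alloc_custom(&dummy_ops, sizeof(void *), 0, 1);
  CAMLreturn(v);
}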

sim642 commented Jun 27, 2024

After a day of debugging, I think I found camlidl of all things to be the cause: xavierleroy/camlidl#29. Apron uses camlidl to generate its OCaml bindings.

sim642 self-assigned this Jun 27, 2024
michael-schwarz commented

Wow, that is quite the find! I would have suspected many other things before turning to camlidl...

sim642 commented Jun 27, 2024

I found some hand-written instances of the same bug in Apron itself as well: antoinemine/apron#112.

sim642 commented Jun 28, 2024

I've now done an sv-benchmarks run with the camlidl and Apron fixes on top of #1522: all segfaults have disappeared.

sim642 commented Oct 9, 2024

Since this has been fixed in the upstream repositories, but not yet released on opam, I've added opam pins for them for now. There's nothing more for us to do about this issue, other than removing the pins eventually.
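
For anyone wanting to reproduce this setup before the fixed releases land, the pins look roughly like this (a sketch: the exact URLs and branches are assumptions, check Goblint's opam files for the authoritative pins):

opam pin add camlidl git+https://github.com/xavierleroy/camlidl.git
opam pin add apron git+https://github.com/antoinemine/apron.git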

sim642 closed this as completed Oct 9, 2024