Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hang when loading GAP.jl with Julia master #901

Closed
zickgraf opened this issue Jun 17, 2023 · 16 comments
Closed

Hang when loading GAP.jl with Julia master #901

zickgraf opened this issue Jun 17, 2023 · 16 comments

Comments

@zickgraf
Copy link

Since JuliaLang/julia@03c4bc1, loading GAP.jl sometimes hangs.

Steps to reproduce:

  1. Compile Julia master after the above mentioned commit.
  2. Install GAP.jl version 0.9.6 or GAP.jl master.
  3. Execute while true; do ./julia -e "println(\"start\"); using GAP; println(\"end\");"; done in a shell.

Observe that after some time this hangs after "start". Sometimes it hangs on the first try, sometimes it hangs after 50 tries (or so). When reproducing in a REPL, the hang occurs after "Loading the library" is printed but before "and packages ..." is printed.

@benlorenz
Copy link
Member

benlorenz commented Jun 17, 2023

That is probably the same issue that we are seeing with Oscar: oscar-system/Oscar.jl#2476
My guess (hope?) is that we just need a new libjulia_jll for the latest julia master (including that commit) and a rebuild of GAP_jll and GAP_pkg_juliainterface_jll.

@fingolfin
Copy link
Member

@zickgraf just to clarify: are you saying that Julia commit started it (i.e.: the commit before works)? Or are you just giving this as a bound ("since at least that commit things are not working")?

@zickgraf
Copy link
Author

@zickgraf just to clarify: are you saying that Julia commit started it (i.e.: the commit before works)? Or are you just giving this as a bound ("since at least that commit things are not working")?

I bisected the problem to this commit, i.e. the commit before works (with a small amount of uncertainty due to the randomness involved when trying to reproduce the hang). It's also the same commit which @benlorenz suspected in oscar-system/Oscar.jl#2476 (comment).

@fingolfin
Copy link
Member

@benlorenz I am sceptical about this. Which gc.h changes do you have in mind? I just had a look at the last couple of commits made to that file (in fact all changes made to src/*.h files in Julia in the past 2-3 months), and for none is it apparent to me why that would require a GAP_jll rebuild. Indeed, I've tried hard to minimize the places where we directly access Julia structs to the absolute minimum (including adding new APIs to Julia itself to help us do that).

That said, of course I can easily have missed something. It would still be good to know what.

That said, the commit @zickgraf pointed out make some major changes to how some things in the GC work, including how age is handled, and I fear this may have broken the GAP integration in a a deeper way. But I'll need to dig deeper to know. I'll first wait to lear if he bisected it or how else that commit came up.

@zickgraf
Copy link
Author

I am sceptical about this. ...

Ah, now I better understand the question above. So just to reaffirm: Yes, I have bisected the problem and I am 99.9% sure that JuliaLang/julia@03c4bc1 is the first bad commit :-)

@fingolfin
Copy link
Member

Thanks for clarifying, @zickgraf. In that case I am afraid a mere recompile won't solve the problem, so it certainly won't hurt to try it.

@fingolfin
Copy link
Member

I should also point out that there have been lingering GC issues related to GAP.jl for some time, as analyzed by @benlorenz , see oscar-system/Oscar.jl#2336

@benlorenz
Copy link
Member

benlorenz commented Jun 19, 2023

@benlorenz I am sceptical about this. Which gc.h changes do you have in mind? I just had a look at the last couple of commits made to that file (in fact all changes made to src/*.h files in Julia in the past 2-3 months), and for none is it apparent to me why that would require a GAP_jll rebuild. Indeed, I've tried hard to minimize the places where we directly access Julia structs to the absolute minimum (including adding new APIs to Julia itself to help us do that).

I was referring to the same commit that zickgraf mentioned, this changed the members of jl_gc_pagemeta_t and bigval_t, both are used in a bunch of inline functions in that header and in several other structs.
I think this might cause issues even if GAP does not directly use these structs.

In valgrind I see these before the hang:

==25998== Invalid write of size 8
==25998==    at 0x59BA668: ijl_sig_throw (task.c:790)
==25998==  Address 0x12195fd8 is on thread 1's stack
==25998== 
==25998== Invalid write of size 8
==25998==    at 0x59BA66A: ijl_sig_throw (task.c:790)
==25998==  Address 0x12195fd0 is on thread 1's stack
==25998== 
==25998== Invalid write of size 8
==25998==    at 0x59BA66B: ijl_sig_throw (task.c:792)
==25998==  Address 0x12195fc8 is on thread 1's stack
==25998== 
==25998== Invalid read of size 8
==25998==    at 0x4CC9E9A: pthread_getspecific (pthread_getspecific.c:65)
==25998==  Address 0x12195fc8 is on thread 1's stack

Unfortunately without any backtraces.
These appear on that commit, but don't appear on the commit before.

There are also a few Use of uninitialised value of size 8 before that but these happen on the before that commit as well and don't seem to affect the tests.

@fingolfin
Copy link
Member

@benlorenz OK, let's give it a try then :-)

@benlorenz
Copy link
Member

Unfortunately the new binaries did not seem to help, still seeing hangs during the GAP.jl tests (needs a few retries until it happens). Will have another look at the debugger / logs.

@benlorenz
Copy link
Member

With the help of rr I got some ideas what is happening:

#0  ijl_gc_internal_obj_base_ptr (p=0x147600000027)
    at /home/datastore/lorenz/software/julia/jv/julia/src/gc.c:4033
#1  0x000014762c53e9ed in TryMark (p=<optimized out>) at src/julia_gc.c:405
#2  TryMarkRange (start=<optimized out>, start@entry=0x7ffd8e5990d8, end=end@entry=0x7ffd8e5a8560)
    at src/julia_gc.c:550
#3  0x000014762c53ecb6 in GapRootScanner (full=<optimized out>) at src/julia_gc.c:610
#4  0x000014764cca55e6 in _jl_gc_collect (ptls=ptls@entry=0x2258990,
    collection=collection@entry=JL_GC_AUTO)
    at /home/datastore/lorenz/software/julia/jv/julia/src/gc.c:3156
#5  0x000014764cca7f9b in ijl_gc_collect (collection=collection@entry=JL_GC_AUTO)
    at /home/datastore/lorenz/software/julia/jv/julia/src/gc.c:3478
#6  0x000014764cca81ed in maybe_collect (ptls=0x2258990)
    at /home/datastore/lorenz/software/julia/jv/julia/src/gc.c:914
#7  jl_gc_pool_alloc_inner (ptls=0x2258990, pool_offset=<optimized out>, osize=48)
    at /home/datastore/lorenz/software/julia/jv/julia/src/gc.c:1272
#8  0x000014764cca8b36 in jl_gc_pool_alloc_noinline (osize=<optimized out>,
    pool_offset=<optimized out>, ptls=0x2258990)
    at /home/datastore/lorenz/software/julia/jv/julia/src/gc.c:1333
#9  jl_gc_alloc_ (ty=0x147637fdf550 <jl_system_image_data+3867280>, sz=40, ptls=0x2258990)
    at /home/datastore/lorenz/software/julia/jv/julia/src/julia_internal.h:466
#10 jl_gc_alloc (ptls=0x2258990, sz=40, ty=ty@entry=0x147637fdf550 <jl_system_image_data+3867280>)
    at /home/datastore/lorenz/software/julia/jv/julia/src/gc.c:3526
#11 0x000014764cc6c921 in _new_array_ (elsz=<optimized out>, zeroinit=1 '\001',
    isunion=<optimized out>, hasptr=<optimized out>, isunboxed=0 '\000', dims=0x7ffd8e5993c8, ndims=1,
    atype=0x147637fdf550 <jl_system_image_data+3867280>)
    at /home/datastore/lorenz/software/julia/jv/julia/src/array.c:134
#12 _new_array (dims=0x7ffd8e5993c8, ndims=1, atype=0x147637fdf550 <jl_system_image_data+3867280>)
    at /home/datastore/lorenz/software/julia/jv/julia/src/array.c:198
#13 ijl_alloc_array_1d (atype=0x147637fdf550 <jl_system_image_data+3867280>, nr=<optimized out>)
    at /home/datastore/lorenz/software/julia/jv/julia/src/array.c:436
#14 0x00001476376bcba8 in Array () at boot.jl:475
#15 getindex () at array.jl:446

We are in a GC event which then triggered the GapRootScanner which wants to mark gap objects somewhere on the stack? (Not really sure about the details ...)
The julia function jl_gc_internal_obj_base_ptr tries to look up a metadata page for a given pointer but that turns out to be filled with zeros:

(rr) next
4034        if (meta) {
(rr) 
4037            size_t off = (char *)p - page;
(rr) p meta
$24 = (jl_gc_pagemeta_t *) 0x15c5ba90
(rr) p * meta
$25 = {
  pool_n = 0 '\000',
  has_marked = 0 '\000',
  has_young = 0 '\000',
  nold = 0,
  prev_nold = 0,
  nfree = 0,
  osize = 0,
  fl_begin_offset = 0,
  fl_end_offset = 0,
  thread_n = 0,
  data = 0x147600000000 ""
}

I don't really understand where this metadata comes from, it looks a bit weird like this, what code is supposed to generate this?

But I think it didn't cause problems earlier due to what I describe below:

The code was changed only very slightly here:
JuliaLang/julia@03c4bc1#diff-76b4d6ef0f7d9c2a6855e1ef7b0f7c94ddc68fcca2cdcd69e6d2b9007c279fdeL4110-L4123
The key change might be that the ages field disappeared, that field probably had null in there as well (I haven't checked that).

I would guess that previously the control flow skipped the if (meta && meta->ages), but now without the meta->ages check it goes in there and at line 4043 off2 %= osize; triggers an FPE since osize is 0.

4043            off2 %= osize;
(rr) n

Thread 1 received signal SIGFPE, Arithmetic exception.
0x000014764cca9b03 in ijl_gc_internal_obj_base_ptr (p=0x147600000027) at /home/datastore/lorenz/software/julia/jv/julia/src/gc.c:4043

The handling of this exceptions then triggers another signal:

(rr) 
Thread 1 received signal SIGSEGV, Segmentation fault.
0x000014764cc69fb4 in jl_gc_state_set (state=0 '\000', old_state=1 '\001', ptls=0x2258990) at /home/datastore/lorenz/software/julia/jv/julia/src/julia_threads.h:348
348             jl_gc_safepoint_(ptls);
(rr) bt
#0  0x000014764cc69fb4 in jl_gc_state_set (state=0 '\000', old_state=1 '\001', ptls=0x2258990)
    at /home/datastore/lorenz/software/julia/jv/julia/src/julia_threads.h:348
#1  jl_gc_state_set (old_state=1 '\001', state=0 '\000', ptls=0x2258990)
    at /home/datastore/lorenz/software/julia/jv/julia/src/julia_threads.h:341
#2  jl_gc_state_save_and_set (ptls=ptls@entry=0x2258990, state=0 '\000')
    at /home/datastore/lorenz/software/julia/jv/julia/src/julia_threads.h:354
#3  0x000014764cc6a6dc in throw_internal_altstack (
    exception=0x14763bfad390 <jl_system_image_data+70770896>, ct=<optimized out>)
    at /home/datastore/lorenz/software/julia/jv/julia/src/task.c:756
#4  ijl_sig_throw () at /home/datastore/lorenz/software/julia/jv/julia/src/task.c:801

And the handler for this SEGV wants to wait for a safepoint:

(rr) bt
#0  jl_safepoint_wait_gc () at /home/datastore/lorenz/software/julia/jv/julia/src/safepoint.c:171
#1  0x000014764ccb3d6b in jl_set_gc_and_wait ()
    at /home/datastore/lorenz/software/julia/jv/julia/src/julia_internal.h:945
#2  segv_handler (context=0x147640df93c0, info=<optimized out>, sig=<optimized out>)
    at /home/datastore/lorenz/software/julia/jv/julia/src/signals-unix.c:351
#3  segv_handler (sig=<optimized out>, info=<optimized out>, context=0x147640df93c0)
    at /home/datastore/lorenz/software/julia/jv/julia/src/signals-unix.c:338
#4  <signal handler called>
#5  0x000014764cc69fb4 in jl_gc_state_set (state=0 '\000', old_state=1 '\001', ptls=0x2258990)
    at /home/datastore/lorenz/software/julia/jv/julia/src/julia_threads.h:348
#6  jl_gc_state_set (old_state=1 '\001', state=0 '\000', ptls=0x2258990)
    at /home/datastore/lorenz/software/julia/jv/julia/src/julia_threads.h:341
#7  jl_gc_state_save_and_set (ptls=ptls@entry=0x2258990, state=0 '\000')
    at /home/datastore/lorenz/software/julia/jv/julia/src/julia_threads.h:354
#8  0x000014764cc6a6dc in throw_internal_altstack (
    exception=0x14763bfad390 <jl_system_image_data+70770896>, ct=<optimized out>)
    at /home/datastore/lorenz/software/julia/jv/julia/src/task.c:756
#9  ijl_sig_throw () at /home/datastore/lorenz/software/julia/jv/julia/src/task.c:801

which seems to trigger the hang (since we are already inside some GC?).

To confirm my suspicion I did a small change:


diff --git a/src/gc.c b/src/gc.c
index 00b0102f72..da484671df 100644
--- a/src/gc.c
+++ b/src/gc.c
@@ -4031,7 +4031,7 @@ JL_DLLEXPORT jl_value_t *jl_gc_internal_obj_base_ptr(void *p)
 {   
     p = (char *) p - 1;
     jl_gc_pagemeta_t *meta = page_metadata(p);
-    if (meta) {
+    if (meta && meta->osize) {
         char *page = gc_page_data(p);
         // offset within page.
         size_t off = (char *)p - page;

This adds another check whether the metadata table seems valid, since I think osize=0 should not really happen.
With this small change I have the GAP.jl tests running in a loop for several hours now without a hang. Without the change I think it usually happened after about 5-10 runs.

Not really sure whether this would be a reasonable change for julia, I don't understand enough about the details to make a PR for this. I think it would make sense to dig into these metadata tables a bit?

PS: I can provide the rr trace if needed.

@fingolfin
Copy link
Member

@benlorenz fantastic work, thank you!

@fingolfin
Copy link
Member

The code in question triea to determine if meta is valid. So
i think the correct fix would be to add if (osize == 0) return NULL; right before tge line with the crashing modulo.

@benlorenz
Copy link
Member

This should be fixed now, thanks Max!
I restarted the latest nightly tests on master they were successful.

@zickgraf
Copy link
Author

I can confirm that everything is working as expected again. Thanks @benlorenz and @fingolfin for the investigation and the fix!

@fingolfin
Copy link
Member

All credit really should go to @benlorenz !

glad to hear it works now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants