Realm: Non-deterministic crashes of Legion/Realm/Regent examples in the CI #1305
Comments
The completion queue test: https://gitlab.com/StanfordLegion/legion/-/jobs/3701536692 |
Do we have a bullet for |
Here's a new failure for |
I fixed the |
Just another failure with realm_reductions.cc: |
FWIW, that failure mode of the network not being quiescent is not specific to that test. I've seen it on lots of different tests. |
Here's a fun variation of the mutex checker overflow: |
Another one for Not sure if that's a duplicate |
Is this a new failure mode? |
That is a known issue in the Legion master branch that is fixed in the control replication branch and is too hard to backport.
No, it's already fixed and it was not intermittent. |
Have you seen this one before (on latest |
In the latest |
Yes, and master and every other branch I've worked on. PMI does not like something in our docker setup.
Try again. |
Some clues as to the problems with the PMI setup: https://stackoverflow.com/questions/23237026/simple-mpi-program-fail-with-large-number-of-processes Better to describe machines as IP addresses than host names. |
We're setting |
I have a suspicion that those failures are actually related to this commit that I made today: |
I am adding two more failures to make sure they get addressed: |
We need to audit all reported failures here to see if any still remain |
At least 1, 2, and 15 are still occurring since they also show up in the fuzzer tests (see #1745) |
Here's a recent CI failure for 2: https://gitlab.com/StanfordLegion/legion/-/jobs/7919090909 Here's a CI failure for 15: https://gitlab.com/StanfordLegion/legion/-/jobs/7919090855 |
This might be related to 15, but not sure: https://gitlab.com/StanfordLegion/legion/-/jobs/7920864325
|
There's also this, which might be a hang: https://gitlab.com/StanfordLegion/legion/-/jobs/7920864333
|
@elliottslaughter This has probably been discussed already somewhere else, but I recall that you did some "fuzz testing" a relatively short time ago that exposed a number of bugs. Would you be able to describe what sort of fuzz tester it is, or perhaps point to a place that has some context on it? I would be open to discussing integrating the fuzz testing for Realm, either as a standalone tool that we run and maintain ourselves or as something derived from what you have already done. |
#1745 is the fuzzer-specific issue, maybe I'll answer over there since this thread is already quite long? |
We have noticed that some examples crash non-deterministically in the CI, so I plan to use this issue to track them, so that people won't be surprised if they see the same error in their development branch.
Such non-deterministic crashes are difficult to reproduce. We may need to run multiple containers concurrently to increase contention on the machine.
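As a rough illustration of the idea of raising contention, here is a minimal sketch that runs one test command many times with several copies in flight at once and collects the failing runs. It uses plain concurrent processes rather than containers, which is an assumption made to keep the example self-contained; `PLACEHOLDER_CMD`, `run_once`, and `stress` are hypothetical names, not part of the Legion CI scripts.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in; replace with the real test command line,
# e.g. the launcher plus the test binary and its arguments.
PLACEHOLDER_CMD = [sys.executable, "-c", "pass"]


def run_once(cmd, timeout_s):
    """Run cmd once. Return (returncode, tail of combined output).

    A timeout is reported with returncode None so that hangs can be
    told apart from crashes.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_s,
        )
        return proc.returncode, proc.stdout[-2000:].decode(errors="replace")
    except subprocess.TimeoutExpired:
        return None, "timed out (possible hang)"


def stress(cmd, runs=20, parallel=4, timeout_s=300):
    """Run cmd `runs` times with up to `parallel` copies at once.

    Returns a list of (run index, returncode, output tail) for every
    run that did not exit with status 0.
    """
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        results = list(pool.map(lambda _: run_once(cmd, timeout_s), range(runs)))
    return [(i, rc, tail) for i, (rc, tail) in enumerate(results) if rc != 0]
```

Calling `stress(PLACEHOLDER_CMD, runs=100, parallel=8)` with the real command substituted returns only the failing runs, so an empty list means no failure was reproduced. Raising `parallel` above the core count is one way to oversubscribe the machine, though whether that matches the contention seen in CI is untested here.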
Realm::MutexChecker::lock_fail
This one is often seen in GASNetEx CI jobs; we need to keep track of whether it also happens with GASNet.
Here is a job log: https://gitlab.com/StanfordLegion/legion/-/jobs/2775315984
Realm::ChunkedRecycler<T, CHUNK_SIZE>::~ChunkedRecycler()
At least one GASNetEXEvent is allocated but not freed. This one is often seen in GASNetEx CI jobs.
Here is a job log: https://gitlab.com/StanfordLegion/legion/-/jobs/2791336163
Here is another issue to track it: GASNetEXEvent is not recycled correctly in gcc9_cxx11_debug_gasnetex_mpi_regent CI test #1304
FIXED: realm proc_group
Here is a job log: https://gitlab.com/StanfordLegion/legion/-/jobs/2810475096
(Fixed by https://gitlab.com/StanfordLegion/legion/-/merge_requests/784)
Crashed again on Mac OS: https://gitlab.com/StanfordLegion/legion/-/jobs/4518715817
FIXED: crash in examples/separate_compilation.rg
Regent: nondeterministic separate_compilation.rg crash in CI #1478
Unclear whether this is a crash in the code itself or an issue in the launcher (MPI/PMI)
Job log: https://gitlab.com/StanfordLegion/legion/-/jobs/3095304675
realm compqueue with ucx
Here is the job log: https://gitlab.com/StanfordLegion/legion/-/jobs/4241109046
legion spy
https://gitlab.com/StanfordLegion/legion/-/jobs/4318846145
FIXED: attach_file_mini on Mac OS with C++17
https://gitlab.com/StanfordLegion/legion/-/jobs/4433345645
Regent non-deterministic segfaults
Regent: nondeterministic crashes in multi-node CI jobs #1490
Temporarily FIXED by removing the cancel_operation: realm test_profiling poisoned failure
https://gitlab.com/StanfordLegion/legion/-/merge_requests/1077
https://gitlab.com/StanfordLegion/legion/-/jobs/4505374541
https://gitlab.com/StanfordLegion/legion/-/jobs/5715868078
The following assertion code failed
realm deferred_allocs
https://gitlab.com/StanfordLegion/legion/-/jobs/4505374448
realm event_subscribe
Possibly related to the inaccurate usleep in containers
https://gitlab.com/StanfordLegion/legion/-/jobs/4513555518
https://gitlab.com/StanfordLegion/legion/-/jobs/5954499303
realm ctxswitch
https://gitlab.com/StanfordLegion/legion/-/jobs/4521720755
https://gitlab.com/StanfordLegion/legion/-/jobs/6022551375
jupyter notebook timeout
https://gitlab.com/StanfordLegion/legion/-/jobs/4521720751
another realm compqueue with gasnetex
https://gitlab.com/StanfordLegion/legion/-/jobs/5170623224
realm simple_reduce
network still not quiescent
https://gitlab.com/StanfordLegion/legion/-/jobs/6079379775
Unknown barrier-related failure; the active message received seems to be incorrect
https://gitlab.com/StanfordLegion/legion/-/jobs/6413356142
realm event_subscribe
https://gitlab.com/StanfordLegion/legion/-/jobs/6393148865
another failure mode with gasnetex during shutdown
https://gitlab.com/StanfordLegion/legion/-/jobs/6770356161
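One of the entries above suspects inaccurate usleep in containers. A quick way to check how far short sleeps overshoot on a given runner is sketched below. It uses Python's `time.sleep` rather than `usleep` itself, on the assumption that both are subject to the same timer and scheduling delays; the helper names `sleep_overshoot_us` and `summarize` are hypothetical.

```python
import time


def sleep_overshoot_us(requested_us=100, samples=200):
    """Sleep `requested_us` microseconds `samples` times.

    Returns the sorted list of overshoots (actual minus requested),
    in microseconds.
    """
    overshoots = []
    for _ in range(samples):
        start = time.perf_counter_ns()
        time.sleep(requested_us / 1e6)
        elapsed_us = (time.perf_counter_ns() - start) / 1000.0
        overshoots.append(elapsed_us - requested_us)
    return sorted(overshoots)


def summarize(overshoots):
    """Summarize a sorted overshoot list with a few order statistics."""
    n = len(overshoots)
    return {
        "min": overshoots[0],
        "median": overshoots[n // 2],
        "p99": overshoots[min(n - 1, int(n * 0.99))],
        "max": overshoots[-1],
    }
```

Running `summarize(sleep_overshoot_us())` inside the CI container and on a bare-metal host, and comparing the `p99` and `max` values, would give a first indication of whether sleep-based timing assumptions in a test are at risk. It does not by itself confirm that this is the cause of the failures listed here.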