These are known Realm crashes that have been previously reported, but in the past reproducing them has been very tricky. The good news is that the Fuzzer can be used to reproduce these crashes, and it works directly on Sapling, sidestepping the need to mess with CI or containers.
Note that all reproducers are on Sapling.
Failure modes
Here's a sample of the failure modes that I am able to reproduce. Note that these are essentially random: unlike typical fuzzer configurations, I'm not sure there's anything inherent to specific seeds that carries any meaning here. We're just getting (un)lucky in particular runs, which leads to the various errors.
MutexChecker:
[0 - 7f3d47a57c40] 9.358371 {6}{mutex}: over limit on entry into MutexChecker(xpair push,0x7f3cce70ac30) limit=1 actval=1 at stack trace: 10 frames
ChunkedRecycler:
fuzzer: /scratch/eslaught/fuzzer-experiment-6-debug-multi/legion/runtime/realm/gasnetex/gasnetex_internal.cc:1002: Realm::ChunkedRecycler<T, CHUNK_SIZE>::~ChunkedRecycler() [with T = Realm::GASNetEXEvent; unsigned int CHUNK_SIZE = 64]: Assertion `cur_alloc.load() == 0' failed.
Network not quiescent:
[0 - 7fe84eeedc40] 9.147451 {6}{realm}: network still not quiescent after 10 attempts
Instructions
To reproduce:
cd /scratch/eslaught/fuzzer-experiment-6-debug-multi
source experiment/sapling/env.sh
./experiment/sapling/run_all_tests.sh
To rebuild (note you'll have to become me on the machine, or you can make a copy of all the build directories):
srun -n 1 -N 1 -c 40 -p all --exclusive --pty bash --login
cd /scratch/eslaught/fuzzer-experiment-6-debug-multi/legion/build_debug_multi
make clean && make install -j20
cd ../../build_debug_multi
make clean && make -j20
I can also provide from-scratch reproducer instructions if you'd prefer to do this in your own account.
The only way I've found to reproduce is to do an entire run of 10k tests. I suspect this is a very rare race condition, so the only way to trigger it is to slam the machine for an extended period of time until the threads interleave in a very particular way. It probably takes about 15 minutes to get some interesting failures, but it still requires running the full script.
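Since a full run produces a lot of output, a small helper for tallying which of the failure signatures above actually showed up can save some scrolling. This is only a sketch: it assumes each test's output is captured to a `*.log` file under a single directory, which may not match how `run_all_tests.sh` stores its output, and `classify_logs` is a hypothetical name. The search strings are copied from the failure messages listed above.

```shell
#!/usr/bin/env bash
# Count how many log files contain each known failure signature.
# ASSUMPTION: per-test output lives in *.log files under the directory passed
# as $1; adjust the --include glob to however the logs are really named.

classify_logs() {
  local log_dir="$1"
  local name pattern count
  while IFS='|' read -r name pattern; do
    # -r: recurse, -l: list matching files only, -F: fixed-string match
    count=$(grep -rlF --include='*.log' -- "$pattern" "$log_dir" 2>/dev/null \
              | wc -l | tr -d '[:space:]')
    printf '%s: %s\n' "$name" "$count"
  done <<'EOF'
MutexChecker|over limit on entry into MutexChecker
ChunkedRecycler|Assertion `cur_alloc.load() == 0' failed
Network not quiescent|network still not quiescent after
EOF
}

# Example (the path is a placeholder):
#   classify_logs /path/to/test/logs
```

Each output line is a signature name followed by the number of log files that contain it, so a run with no hits for a given failure mode prints a count of 0 for that line.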
The good news is that you can still use REALM_FREEZE_ON_ERROR so it should be possible to get all of these in a debugger.
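A minimal sketch of what that could look like, assuming the environment variable is passed through to the fuzzer processes by the test script (I have not checked that `run_all_tests.sh` does this). `run_frozen` is a hypothetical helper name; it sets the variable only for the command it launches.

```shell
#!/usr/bin/env bash
# Run a command with REALM_FREEZE_ON_ERROR=1 in its environment only,
# leaving the calling shell's environment unchanged.
run_frozen() {
  REALM_FREEZE_ON_ERROR=1 "$@"
}

# Example:
#   run_frozen ./experiment/sapling/run_all_tests.sh
#
# Once a process hits an error and freezes, attach from another shell
# on the same node, substituting the PID of the frozen process:
#   gdb -p <pid>
```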
Fuzzer version: StanfordLegion/fuzzer@3ef4c19
Legion version: afd9161