Fuzzer reproduces low-probability Realm crashes on Sapling #1745

elliottslaughter · 2024-08-26T19:30:25Z

These are known Realm crashes that have been previously reported, but in the past reproducing them has been very tricky. The good news is that the Fuzzer can be used to reproduce these crashes, and it works directly on Sapling, sidestepping the need to mess with CI or containers.

Note that all reproducers are on Sapling.

Failure modes

Here's a sample of the failure modes that I am able to reproduce. Note that these are essentially random, so unlike typical fuzzer configurations I don't think there's anything inherent to specific seeds that carries any meaning here; we're just getting (un)lucky in particular runs, which leads to the various errors.

MutexChecker:

[0 - 7f3d47a57c40]    9.358371 {6}{mutex}: over limit on entry into MutexChecker(xpair push,0x7f3cce70ac30) limit=1 actval=1 at stack trace: 10 frames

ChunkedRecycler:

fuzzer: /scratch/eslaught/fuzzer-experiment-6-debug-multi/legion/runtime/realm/gasnetex/gasnetex_internal.cc:1002: Realm::ChunkedRecycler<T, CHUNK_SIZE>::~ChunkedRecycler() [with T = Realm::GASNetEXEvent; unsigned int CHUNK_SIZE = 64]: Assertion `cur_alloc.load() == 0' failed.

Network not quiescent:

[0 - 7fe84eeedc40]    9.147451 {6}{realm}: network still not quiescent after 10 attempts
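
Since the failures are random, it can help to sort captured output by which of the three signatures above it contains. Here is a minimal sketch, assuming each test's output has been saved to a file; the function name `classify_failure`, the labels it prints, and the log path in the usage line are all hypothetical and not part of the fuzzer.

```shell
# Hypothetical helper (not part of the fuzzer): report which of the failure
# signatures quoted above appears in a captured log file.
classify_failure() {
  log_file="$1"
  # -F matches fixed strings, so the parentheses and dots are taken literally.
  if grep -qF 'over limit on entry into MutexChecker' "$log_file"; then
    echo "mutex_checker"
  elif grep -qF 'cur_alloc.load() == 0' "$log_file"; then
    echo "chunked_recycler"
  elif grep -qF 'network still not quiescent' "$log_file"; then
    echo "network_not_quiescent"
  else
    echo "unknown"
  fi
}
```

Usage would look like `classify_failure some_test.log`, where `some_test.log` stands in for wherever the output actually ends up.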

Instructions

To reproduce:

cd /scratch/eslaught/fuzzer-experiment-6-debug-multi
source experiment/sapling/env.sh
./experiment/sapling/run_all_tests.sh

To rebuild (note that you'll have to run as my user on the machine, or else make your own copy of all the build directories):

srun -n 1 -N 1 -c 40 -p all --exclusive --pty bash --login
cd /scratch/eslaught/fuzzer-experiment-6-debug-multi/legion/build_debug_multi
make clean && make install -j20
cd ../../build_debug_multi
make clean && make -j20

I can also provide from-scratch reproducer instructions if you'd prefer to do this in your own account.

Fuzzer version: StanfordLegion/fuzzer@3ef4c19

Legion version: afd9161


lightsighter · 2024-08-27T09:46:42Z

How long does it take to reproduce some of these?

elliottslaughter · 2024-08-27T17:12:15Z

The only way I've found to reproduce is to do an entire run of 10k tests. I suspect that we're talking about a very rare race condition, so the only way to make it happen is to slam the machine for an extended period of time to force the threads to interleave in a very particular way. In the end it probably takes 15 minutes to get some interesting failures, but it requires the full script run to do so.

The good news is that you can still use REALM_FREEZE_ON_ERROR so it should be possible to get all of these in a debugger.
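
A rough sketch of what that might look like, assuming the variable is enabled by setting it to `1` and that a frozen process reports its host and PID so a debugger can be attached; the wrapper name `run_with_freeze` is hypothetical.

```shell
# Hypothetical wrapper: run any command with REALM_FREEZE_ON_ERROR set,
# so that a failing process pauses instead of exiting (assumed behavior).
run_with_freeze() {
  env REALM_FREEZE_ON_ERROR=1 "$@"
}

# Example (on Sapling, after sourcing experiment/sapling/env.sh):
#   run_with_freeze ./experiment/sapling/run_all_tests.sh
# Then, from another shell on the same node, attach to the reported PID:
#   gdb -p PID
```

The wrapper only scopes the variable to the one command, which avoids leaving it exported in the login shell.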

apryakhin added this to the realm-24.11 milestone Sep 17, 2024

lightsighter mentioned this issue Sep 17, 2024

Realm: Non-deterministic crashes of Legion/Realm/Regent examples in the CI #1305

Open
