Fuzzer reproduces low-probability Realm crashes on Sapling #1745

Open
elliottslaughter opened this issue Aug 26, 2024 · 2 comments

@elliottslaughter
Contributor

These are known, previously reported Realm crashes that have historically been very tricky to reproduce. The good news is that the Fuzzer can reproduce them, and it works directly on Sapling, sidestepping the need to mess with CI or containers.

Note that all reproducers are on Sapling.

Failure modes

Here's a sample of the failure modes that I am able to reproduce. Note that these are essentially random, so unlike in typical fuzzer configurations, I don't think specific seeds carry any inherent meaning here; we're just getting (un)lucky in particular runs, which leads to the various errors.

MutexChecker:

[0 - 7f3d47a57c40]    9.358371 {6}{mutex}: over limit on entry into MutexChecker(xpair push,0x7f3cce70ac30) limit=1 actval=1 at stack trace: 10 frames

ChunkedRecycler:

fuzzer: /scratch/eslaught/fuzzer-experiment-6-debug-multi/legion/runtime/realm/gasnetex/gasnetex_internal.cc:1002: Realm::ChunkedRecycler<T, CHUNK_SIZE>::~ChunkedRecycler() [with T = Realm::GASNetEXEvent; unsigned int CHUNK_SIZE = 64]: Assertion `cur_alloc.load() == 0' failed.

Network not quiescent:

[0 - 7fe84eeedc40]    9.147451 {6}{realm}: network still not quiescent after 10 attempts
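
For triaging a large run, the three signatures above can be counted with plain `grep`. This is a minimal sketch, not part of the fuzzer: the function name `count_failure_modes` and the idea of a single combined log file are assumptions, and the search strings are copied from the messages quoted above.

```shell
# Count how many lines in a log file match each failure signature.
# Assumption: all test output has been collected into one text file ($1).
count_failure_modes() {
  # grep -c prints the number of matching lines (0 if none);
  # -F treats the pattern as a fixed string, not a regex.
  printf 'MutexChecker: %s\n' "$(grep -c -F 'over limit on entry into MutexChecker' "$1")"
  printf 'ChunkedRecycler: %s\n' "$(grep -c -F 'cur_alloc.load() == 0' "$1")"
  printf 'NotQuiescent: %s\n' "$(grep -c -F 'network still not quiescent' "$1")"
}
```

Usage would be something like `count_failure_modes run.log`, where `run.log` is a hypothetical file holding the captured output.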

Instructions

To reproduce:

cd /scratch/eslaught/fuzzer-experiment-6-debug-multi
source experiment/sapling/env.sh
./experiment/sapling/run_all_tests.sh

To rebuild (note that you'll have to become me on the machine, or else make your own copy of all the build directories):

srun -n 1 -N 1 -c 40 -p all --exclusive --pty bash --login
cd /scratch/eslaught/fuzzer-experiment-6-debug-multi/legion/build_debug_multi
make clean && make install -j20
cd ../../build_debug_multi
make clean && make -j20

I can also provide from-scratch reproducer instructions if you'd prefer to do this in your own account.

Fuzzer version: StanfordLegion/fuzzer@3ef4c19

Legion version: afd9161

@lightsighter
Contributor

How long does it take to reproduce some of these?

@elliottslaughter
Contributor Author

The only way I've found to reproduce these is to do an entire run of 10k tests. I suspect that we're talking about a very rare race condition, so the only way to trigger it is to slam the machine for an extended period of time and force the threads to interleave in a very particular way. In practice it probably takes about 15 minutes to get some interesting failures, but it requires the full script run to do so.
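
If a single pass happens to come back clean, the same idea extends to repeating a command until it fails. The helper below is a hypothetical sketch (`repeat_until_failure` is not part of the fuzzer), and it assumes the command being repeated exits non-zero when a crash occurs, which I have not verified for `run_all_tests.sh`.

```shell
# Re-run a command until it exits non-zero, up to a maximum number of attempts.
# Usage: repeat_until_failure MAX_RUNS COMMAND [ARGS...]
# Returns 0 once a failure has been observed, 1 if every attempt succeeded.
repeat_until_failure() {
  max_runs="$1"
  shift
  i=1
  while [ "$i" -le "$max_runs" ]; do
    if ! "$@"; then
      echo "failed on attempt $i"
      return 0
    fi
    i=$((i + 1))
  done
  echo "no failure in $max_runs attempts"
  return 1
}
```

For example, `repeat_until_failure 5 ./experiment/sapling/run_all_tests.sh` would stop at the first pass that reports an error.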

The good news is that REALM_FREEZE_ON_ERROR still works, so it should be possible to catch all of these in a debugger.
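
A rough sketch of that workflow follows. The wrapper name `run_with_freeze` is hypothetical, and it assumes the test script passes its environment through to the fuzzer processes.

```shell
# Run a command with REALM_FREEZE_ON_ERROR=1 set for that command only,
# so a failing process pauses instead of exiting.
run_with_freeze() {
  REALM_FREEZE_ON_ERROR=1 "$@"
}

# Example (not executed here):
#   run_with_freeze ./experiment/sapling/run_all_tests.sh
# Once a process freezes, attach from another shell on the same node:
#   gdb -p <PID>    # <PID> is a placeholder for the frozen process's ID
```

The frozen process should report where it is waiting; on a multi-node run you would need to log in to that node before attaching.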


6 participants
@elliottslaughter @apryakhin @lightsighter and others