Realm: Non-deterministic crashes of Legion/Realm/Regent examples in the CI #1305

Open · eddy16112 opened this issue Aug 3, 2022 · 24 comments
Labels: bug, Realm (Issues pertaining to Realm)

@eddy16112 (Contributor) commented Aug 3, 2022

We have noticed that some examples crash non-deterministically in the CI, so I plan to use this issue to track them, so that people won't be surprised if they see the same error in their development branch.

Such non-deterministic crashes are difficult to reproduce. We may need to run multiple containers concurrently to increase the contention on the machine.

  1. Realm::MutexChecker::lock_fail
    This one is often seen in GASNetEX CIs; we need to keep track of whether it also happens with GASNet.
    Here is a job log: https://gitlab.com/StanfordLegion/legion/-/jobs/2775315984

  2. Realm::ChunkedRecycler<T, CHUNK_SIZE>::~ChunkedRecycler()
    At least one GASNetEXEvent is allocated but not freed. This one is often seen in GASNetEX CIs.
    Here is a job log: https://gitlab.com/StanfordLegion/legion/-/jobs/2791336163
    Here is another issue tracking it: "GASNetEXEvent is not recycled correctly in gcc9_cxx11_debug_gasnetex_mpi_regent CI test" #1304

  3. FIXED: realm proc_group
    Here is a job log: https://gitlab.com/StanfordLegion/legion/-/jobs/2810475096
    (Fixed by https://gitlab.com/StanfordLegion/legion/-/merge_requests/784)
    Crashed again on macOS: https://gitlab.com/StanfordLegion/legion/-/jobs/4518715817

  4. FIXED: crash in examples/separate_compilation.rg ("Regent: nondeterministic separate_compilation.rg crash in CI" #1478)
    It is unclear whether this is a crash in the code itself or an issue in the launcher (MPI/PMI).
    Job log: https://gitlab.com/StanfordLegion/legion/-/jobs/3095304675

  5. realm compqueue with UCX
    Here is the job log: https://gitlab.com/StanfordLegion/legion/-/jobs/4241109046

  6. legion spy
    https://gitlab.com/StanfordLegion/legion/-/jobs/4318846145

  7. FIXED: attach_file_mini on macOS with C++17
    https://gitlab.com/StanfordLegion/legion/-/jobs/4433345645

  8. Regent non-deterministic segfaults ("Regent: nondeterministic crashes in multi-node CI jobs" #1490)

  9. Temporarily FIXED by removing the cancel_operation: realm test_profiling poisoned failure
    https://gitlab.com/StanfordLegion/legion/-/merge_requests/1077
    https://gitlab.com/StanfordLegion/legion/-/jobs/4505374541
    https://gitlab.com/StanfordLegion/legion/-/jobs/5715868078
    The following assertion failed (see the sketch after this list):

    cargs.sleep_useconds = 5000000;   // child task sleeps for 5 seconds
    Event e4 = task_proc.spawn(CHILD_TASK, &cargs, sizeof(cargs), prs);
    sleep(2);                         // give the child time to start running
    int info = 111;
    e4.cancel_operation(&info, sizeof(info));  // cancel while the child should still be sleeping
    bool poisoned = false;
    e4.wait_faultaware(poisoned);
    assert(poisoned);                 // fails if the cancel did not poison e4

  10. realm deferred_allocs
    https://gitlab.com/StanfordLegion/legion/-/jobs/4505374448

  11. realm event_subscribe, possibly related to the inaccurate usleep in containers
    https://gitlab.com/StanfordLegion/legion/-/jobs/4513555518
    https://gitlab.com/StanfordLegion/legion/-/jobs/5954499303

  12. realm ctxswitch
    https://gitlab.com/StanfordLegion/legion/-/jobs/4521720755
    https://gitlab.com/StanfordLegion/legion/-/jobs/6022551375

  13. Jupyter notebook timeout
    https://gitlab.com/StanfordLegion/legion/-/jobs/4521720751

  14. another realm compqueue failure, with GASNetEX
    https://gitlab.com/StanfordLegion/legion/-/jobs/5170623224

  15. realm simple_reduce: network still not quiescent
    https://gitlab.com/StanfordLegion/legion/-/jobs/6079379775

  16. unknown, barrier-related; the active message received seems to be incorrect
    https://gitlab.com/StanfordLegion/legion/-/jobs/6413356142

  17. realm event_subscribe
    https://gitlab.com/StanfordLegion/legion/-/jobs/6393148865

  18. another failure mode with GASNetEX during shutdown
    https://gitlab.com/StanfordLegion/legion/-/jobs/6770356161
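
For item 9 above, here is a minimal sketch of the suspected race, under the assumption that the failure happens when the cancel only arrives after the 5-second child task has already finished (for example because sleep()/usleep() overshoots inside the container): in that case e4 triggers normally, is never poisoned, and the assert fires. A hypothetical, race-tolerant rework of the fragment is shown below; it only narrows the window rather than closing it, and Event::has_triggered() is the one call added beyond those already used in the snippet:

    // hypothetical rework of the test_profiling fragment in item 9
    Event e4 = task_proc.spawn(CHILD_TASK, &cargs, sizeof(cargs), prs);
    sleep(2);
    bool already_done = e4.has_triggered();   // did the child complete before we could cancel?
    int info = 111;
    e4.cancel_operation(&info, sizeof(info));
    bool poisoned = false;
    e4.wait_faultaware(poisoned);
    // require poisoning only if the child had not already finished at cancel time
    assert(poisoned || already_done);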

@lightsighter (Contributor) commented May 11, 2023

eddy16112 changed the title from "Non-deterministic crashes of Legion/Realm/Regent examples in new CI" to "Non-deterministic crashes of Legion/Realm/Regent examples in the CI" on Jun 8, 2023
@elliottslaughter (Contributor):
Do we have a bullet for "{realm}: network still not quiescent after 10 attempts"? Failing here:

https://gitlab.com/StanfordLegion/legion/-/jobs/4481778384

@elliottslaughter (Contributor):
Here's a new failure for MutexChecker: https://gitlab.com/StanfordLegion/legion/-/jobs/4482517781

@lightsighter (Contributor):
I fixed the attach_file_mini test.

@apryakhin (Contributor):
Just another failure with realm_reductions.cc:

@lightsighter (Contributor):
> Just another failure with realm_reductions.cc:

FWIW, that failure mode of the network not being quiescent is not specific to that test. I've seen it on lots of different tests.

@lightsighter (Contributor):
Here's a fun variation of the mutex checker overflow:
https://gitlab.com/StanfordLegion/legion/-/jobs/4802099679

@apryakhin (Contributor):
Another one for gcc9_cxx11_release_gasnetex_ucx_regent:
https://gitlab.com/StanfordLegion/legion/-/jobs/4803853742

Not sure if that's a duplicate

@elliottslaughter (Contributor):
Is this a new failure mode?

https://gitlab.com/StanfordLegion/legion/-/jobs/4810462264

@lightsighter (Contributor):
> Another one for gcc9_cxx11_release_gasnetex_ucx_regent:

That is a known issue in the Legion master branch that is fixed in the control replication branch and is too hard to backport.

> Is this a new failure mode?

No, it's already fixed and it was not intermittent.

@elliottslaughter (Contributor):
Have you seen this one before (on latest control_replication)?

https://gitlab.com/StanfordLegion/legion/-/jobs/4811257857

@elliottslaughter (Contributor):
In the latest control_replication, I'm still seeing:

https://gitlab.com/StanfordLegion/legion/-/jobs/4811257849

@lightsighter (Contributor):
> Have you seen this one before (on latest control_replication)?

Yes, and on master and every other branch I've worked on. PMI does not like something in our Docker setup.

> In the latest control_replication, I'm still seeing:

Try again.

@lightsighter (Contributor):

@elliottslaughter (Contributor):
We're setting mpirun -n 2 from https://gitlab.com/StanfordLegion/legion/-/blob/master/.gitlab-ci.yml?ref_type=heads#L312, which I believe will implicitly refer to localhost. Do you really think that will be an issue with the host name?

@lightsighter (Contributor) commented Aug 17, 2023

I have a suspicion that those failures are actually related to this commit that I made today:
https://gitlab.com/StanfordLegion/legion/-/commit/831103cb94264df3f6c7326cd84d3df0b1425b1f
It handles the race where you launch Legion on multiple nodes, the whole program runs on one node, and that node exits before the other nodes have even finished starting up, so the job scheduler starts trying to tear down the other processes before they are done. Let's see if those kinds of errors go away.

@apryakhin (Contributor):
I am adding two more failures to make sure they get addressed:

@apryakhin (Contributor):
We need to audit all reported failures here to see if any still remain

apryakhin added this to the realm-24.11 milestone on Sep 17, 2024
apryakhin self-assigned this on Sep 17, 2024
apryakhin changed the title from "Non-deterministic crashes of Legion/Realm/Regent examples in the CI" to "[BUG] Non-deterministic crashes of Legion/Realm/Regent examples in the CI" on Sep 17, 2024
@lightsighter (Contributor):
At least items 1, 2, and 15 are still occurring, since they also show up in the fuzzer tests (see #1745).

apryakhin added the bug and Realm (Issues pertaining to Realm) labels on Sep 20, 2024
apryakhin changed the title from "[BUG] Non-deterministic crashes of Legion/Realm/Regent examples in the CI" to "Realm: Non-deterministic crashes of Legion/Realm/Regent examples in the CI" on Sep 20, 2024
@elliottslaughter (Contributor):
Here's a recent CI failure for item 2: https://gitlab.com/StanfordLegion/legion/-/jobs/7919090909

Here's a CI failure for item 15: https://gitlab.com/StanfordLegion/legion/-/jobs/7919090855

@elliottslaughter (Contributor) commented Sep 25, 2024

This might be related to item 15, but I'm not sure: https://gitlab.com/StanfordLegion/legion/-/jobs/7920864325

runtime/realm/gasnetex/gasnetex_internal.cc:3596: bool Realm::GASNetEXInternal::check_for_quiescence(size_t): Assertion `0' failed.

@elliottslaughter (Contributor):
There's also this, which might be a hang: https://gitlab.com/StanfordLegion/legion/-/jobs/7920864333

HELP!  Alarm triggered - likely hang!

@apryakhin (Contributor):
@elliottslaughter This was probably discussed somewhere else already, but I recall you did some "fuzz testing" a relatively short time ago that exposed a number of bugs. Would you be able to describe what sort of fuzz tester it is, or perhaps point to a place that has some context on it? I would be open to discussing integrating fuzz testing for Realm, either as a standalone tool that we run and maintain ourselves or something derived from what you have already done.

@elliottslaughter (Contributor):
#1745 is the fuzzer-specific issue, maybe I'll answer over there since this thread is already quite long?
