Crash in Realm::PosixAIORead::check_completion in resilience test #1753
@elliottslaughter can we get the output of the following log message? I think we need better error handling here. I really wish we would stop asserting / aborting on error handling...

```cpp
if (ret != 0)
{
  const char *message = realm_strerror(errno);
  log_aio.fatal("Failed asynchronous IO read [%d]: %s", errno, message);
  abort();
}
```
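As an aside on the wish to stop aborting: here is a minimal sketch, assuming a caller that can react to the failure, of what propagating the error instead of calling `abort()` might look like (generic C++, not Realm's actual error-handling machinery):

```cpp
#include <cstdio>
#include <cstring>

// Generic sketch, not Realm's machinery: report the failure to the caller
// instead of aborting, so a higher layer can decide whether the I/O error
// is fatal or recoverable.
enum class IoStatus { Ok, ReadFailed };

IoStatus check_read_result(int ret) {
  if (ret != 0) {
    // 'ret' is assumed to carry the error code directly (see the discussion
    // of errno vs. return value below).
    std::fprintf(stderr, "Failed asynchronous IO read [%d]: %s\n",
                 ret, std::strerror(ret));
    return IoStatus::ReadFailed;   // propagate instead of abort()
  }
  return IoStatus::Ok;
}
```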
@muraj You can see it in the linked CI run:
The Realm code is incorrect. AIO does not set `errno`.
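For context, a minimal sketch (generic POSIX AIO, not Realm's actual code) of the completion-check convention being referred to: the request's status comes back as the return value of `aio_error()`, not through `errno`, so logging `errno` as in the snippet above can print an unrelated error.

```cpp
#include <aio.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Minimal sketch of checking a POSIX AIO request for completion.
// aio_error() reports the request's status as its return value:
// 0 on success, EINPROGRESS while still pending, or the error code on
// failure -- it does not communicate the failure through errno.
bool poll_aio_request(struct aiocb *cb) {
  int status = aio_error(cb);
  if (status == EINPROGRESS)
    return false;                       // still in flight
  ssize_t nbytes = aio_return(cb);      // consume the final result exactly once
  if (status != 0)
    std::fprintf(stderr, "aio read failed [%d]: %s\n",
                 status, std::strerror(status));
  else
    std::fprintf(stderr, "aio read completed, %zd bytes\n", nbytes);
  return true;
}
```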
I rebased on top of the newest MR 1440 and applied the following local patch:

```diff
diff --git a/runtime/realm/transfer/lowlevel_dma.cc b/runtime/realm/transfer/lowlevel_dma.cc
index 6d64c79be..79ab8c165 100644
--- a/runtime/realm/transfer/lowlevel_dma.cc
+++ b/runtime/realm/transfer/lowlevel_dma.cc
@@ -345,8 +345,8 @@ namespace Realm {
                 static_cast<void *>(&cb), ret);
       if (ret != 0)
       {
-        const char *message = realm_strerror(errno);
-        log_aio.fatal("Failed asynchronous IO read [%d]: %s", errno, message);
+        const char *message = realm_strerror(ret);
+        log_aio.fatal("Failed asynchronous IO read [%d]: %s", ret, message);
         abort();
       }
     }
```

Now I'm getting:
The fix for the AIO error reporting is here: https://gitlab.com/StanfordLegion/legion/-/merge_requests/1456
For the record this still reproduces on Sapling with the instructions at the top of this issue.
Bad file descriptor is interesting. Is it possible to get an strace output that follows the file descriptor? I am wondering if this is related to the issue that Wei is working on with the IO bgworkers.
Edit: Disregard these logs, they mixed output from both ranks into one file. I updated the
You're also welcome to look at the entire collection of output from other runs on Sapling.

I'm not entirely sure what you're looking for in the logs, but one thing I noticed is that the logs contain either:
Or they contain:
But they never contain both at the same time. I'm not sure if this is just a file-buffering issue (though I have tried to flush things as aggressively as possible) or if this is somehow meaningful.
Thanks, I'll take a closer look later tonight. I'm trying to trace through when this file descriptor is closed and why. I suspect there is a race between this async read operation and a close somewhere that is invalidating the file descriptor. But I wanted to confirm that first with the strace logs.
If I understand the log correctly, I guess the issue is from:
we close fd=15 and then call epoll_ctl on it.
Yeah, that's what I suspected. Now the question is: where is the close coming from, and why does it happen before the async read is complete?
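To make the suspected race concrete, here is a small illustration (hypothetical file name; plain POSIX AIO rather than Realm's actual I/O path): if the descriptor backing an outstanding asynchronous read is closed before the request finishes, the completion check can come back with EBADF, or the request can land on whatever the descriptor number is reused for.

```cpp
#include <aio.h>
#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

int main() {
  char buf[4096];
  int fd = open("/tmp/example.dat", O_RDONLY);   // hypothetical input file
  if (fd < 0)
    return 1;

  struct aiocb cb;
  memset(&cb, 0, sizeof(cb));
  cb.aio_fildes = fd;
  cb.aio_buf = buf;
  cb.aio_nbytes = sizeof(buf);
  cb.aio_offset = 0;
  aio_read(&cb);

  // Buggy ordering (what the strace output suggests): the fd is closed
  // while the read is still in flight, so the completion check can later
  // report EBADF.
  close(fd);

  // Safe ordering would be: wait for completion first, then close:
  //   while (aio_error(&cb) == EINPROGRESS) { /* or aio_suspend(...) */ }
  //   aio_return(&cb);
  //   close(fd);
  int status;
  while ((status = aio_error(&cb)) == EINPROGRESS)
    ;                                            // busy-wait for the demo
  std::fprintf(stderr, "aio status: %d (%s)\n", status, std::strerror(status));
  aio_return(&cb);
  return 0;
}
```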
The fd is created by
@elliottslaughter I tried to grep -r "Bad file descriptor" * under your logs,
I realized that the |
Sorry, the files probably got mixed up because these were multi-node runs and I wasn't properly piping the output of each rank to a separate file. Please disregard the files above and look at the ones below, which are properly split per-rank:
Files: output_0.log, output_1.log, strace_0.log, strace_1.log
Files: output_0.log, output_1.log, strace_0.log, strace_1.log

Once again I note that jobs hit one error or the other, but not both. You can look at output from all of the jobs on Sapling:
If they never show both, it is hard to tell whether the "Bad file descriptor" from strace comes from the Realm AIO. Could you please apply the patch and regenerate the logs, or tell me how to pipe the log of each rank correctly?
I would like to know if the
To be clear, the files in #1753 (comment) were generated with the output correctly piped per rank. I believe that the fact that we see errors in either the output or strace, but never both, is a fundamental property of the problem we are looking at, not a mistake. But I can certainly generate logs with the diff you suggested.
I would like to understand why the error could show up in the strace but not in the output. Where does this error come from?
I put the log files you requested here:
E.g., in
I don't know. But one thing to keep in mind is that
Based on the new logs, it is almost certain that we close the file before the aio call. There could be two causes:
BTW, here is the stack trace of closing the file; I am not sure whether we should blame Legion, Realm, or the application.
Is it even possible for a Legion application to detach an instance too early? The only way to do that is to flip the execution order of tasks, and then the instance would be simply invalid and the program would probably hit a deterministic error in Legion before ever reaching an I/O failure. To me, your stack trace suggests Legion is processing the detach op through the runtime analysis pipeline, so if this is somehow out of order, it would be a dependence analysis failure in Legion.
I enabled the logging that checks the completion of the AIO (https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/transfer/lowlevel_dma.cc#L603). In the first run, the log is:
but in the second run (the failed one), the log becomes:
Apparently, the detach op is issued too early.
I added more logs to check if the
Legion makes sure that the completion events for all copies involving an instance (including file instances) are used as preconditions for deleting the instance. I suspect that the problem is that some of the asynchronous I/O operations are not being correctly reflected in the completion event of the copies.
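A schematic sketch of that invariant, using `std::future` as a stand-in for Realm completion events (the `FileInstance` type and all names here are hypothetical, not the actual Realm/Legion API): teardown must be gated on every copy completion, so an asynchronous read whose completion is not folded into the copy's event lets the close race ahead of it.

```cpp
#include <future>
#include <vector>
#include <unistd.h>

// Hypothetical stand-in for a file-backed instance; not Realm's API.
struct FileInstance {
  int fd = -1;
  void destroy() {
    if (fd >= 0)
      close(fd);   // the close that must not happen while copies are in flight
  }
};

// The invariant: detach/delete only proceeds once every copy completion has
// been observed. If some async read path never reports into
// copy_completions, this gate is ineffective and the fd can be closed under
// an outstanding read.
void detach_when_safe(FileInstance &inst,
                      std::vector<std::future<void>> &copy_completions) {
  for (auto &done : copy_completions)
    done.wait();         // preconditions: all copies involving the instance
  inst.destroy();        // only now is it safe to tear down the instance
}
```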
@elliottslaughter do you think you could create a simple Legion reproducer for the pattern that you are using for Regent checkpointing? I tried to create a reproducer to do:
But I cannot reproduce the out-of-order issue (the detach is issued before the completion of the AIO). Even if I manually slow down the AIO code, I cannot reproduce it. I tried to run your Regent test sequentially, and the out-of-order issue disappeared. Apparently, the issue only happens with parallel runs. However, I can even see the out-of-order logs in the
Realm's logging from the same node will be in the order that it was logged in time, so I think you are actually seeing the two different orderings with your print. Can you log the completion events for all those AIO copies in Realm and then also run with
@lightsighter here is what I found from the logs; they are divided into rank 0 and rank 1.
Here are the failed logs:
I can send you the full logs if you need them.
@elliottslaughter Are you sure you fixed all the bugs trying to issue detaches after exiting a parent task? There are still no checks for that in the master branch so if you're doing it then it would cause a problem like this. |
We delete the resilient runtime (which stores all the data structures) before running the postamble, if that's what you're asking:

I suppose if there were a set of loggers that listed the postamble execution and any potential detaches, that would answer the question definitively. Do you know if we have that?
There are no loggers that check that right now. I'd like to see detailed Legion Spy logging from a failing run, along with the name of the completion event for the bad copy.
This is attempting to replicate a failure I've seen in the resilience CI, e.g. here: https://github.com/StanfordLegion/resilience/actions/runs/10647296358/job/29515103267
Failure mode:
Note the failure mode isn't the same as what I saw in CI, but this is the best I can do for now. It's possible that if we resolved this issue we'd see the other failure mode return afterwards.
Note I'm going to share the build with #1752, so watch out when you rebuild it:
To reproduce:
Reproduction ratio seems to be about 10%.
To rebuild: