
Crash in Realm::PosixAIORead::check_completion in resilience test #1753

Open
elliottslaughter opened this issue Sep 3, 2024 · 29 comments

@elliottslaughter
Contributor

elliottslaughter commented Sep 3, 2024

This is attempting to replicate a failure I've seen in the resilience CI, e.g. here: https://github.com/StanfordLegion/resilience/actions/runs/10647296358/job/29515103267

Failure mode:

#7  0x00007fc576f805be in Realm::PosixAIORead::check_completion (this=0x7fc4f80380b0)
    at /scratch/eslaught/resilience-test-network/legion/runtime/realm/transfer/lowlevel_dma.cc:350
#8  0x00007fc576f8124d in Realm::AsyncFileIOContext::do_work (this=0x556822bed5a0, work_until=...)
    at /scratch/eslaught/resilience-test-network/legion/runtime/realm/transfer/lowlevel_dma.cc:604
#9  0x00007fc57716be8e in Realm::BackgroundWorkManager::Worker::do_work (this=0x7fc56aeb39c0, max_time_in_ns=-1, 
    interrupt_flag=0x0) at /scratch/eslaught/resilience-test-network/legion/runtime/realm/bgwork.cc:599
#10 0x00007fc577169844 in Realm::BackgroundWorkThread::main_loop (this=0x556822be97c0)
    at /scratch/eslaught/resilience-test-network/legion/runtime/realm/bgwork.cc:103
#11 0x00007fc57716d324 in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop> (obj=0x556822be97c0) at /scratch/eslaught/resilience-test-network/legion/runtime/realm/threads.inl:97

Note the failure mode isn't the same as what I saw in CI, but this is the best I can do for now. It's possible that if we resolved this issue we'd see the other failure mode return afterwards.

Note: I'm going to share the build with #1752, so watch out when you rebuild it.

To reproduce:

cd /scratch/eslaught/resilience-test-network
source experiment/sapling/env.sh
cd test_region
salloc -n 1 -N 1 -c 40 -p all --exclusive
for i in $(seq 0 40); do ./test.sh "test_$i" & done

Reproduction ratio seems to be about 10%.

To rebuild:

cd /scratch/eslaught/resilience-test-network
source experiment/sapling/env.sh
srun -n 1 -N 1 -c 40 -p all --exclusive --pty bash --login
./experiment/setup.sh
@muraj

muraj commented Sep 5, 2024

@elliottslaughter can we get the output of the following log message? I think we need better error handling here. I really wish we would stop asserting / aborting on error handling...

        if (ret != 0)
        {
          const char *message = realm_strerror(errno);
          log_aio.fatal("Failed asynchronous IO read [%d]: %s", errno, message);
          abort();
        }

@elliottslaughter
Contributor Author

@muraj You can see it in the linked CI run:

[0 - 7f6b033f4c80]    0.634381 {6}{aio}: Failed asynchronous IO read [0]: Success

The Realm code is incorrect. AIO does not set errno at all, nor is it correct to query errno after a failed return. I think what you really want to see is the value of ret in that API call:

https://linux.die.net/man/3/aio_error
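
For reference, here is a minimal sketch of the aio_error(3) contract (standalone code, not the Realm source; the helper name is made up):

#include <aio.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

// aio_error() reports the status directly: EINPROGRESS while pending, 0 on
// success, or the error code on failure. It does not set errno, so the
// returned value is what should go to strerror()/realm_strerror().
void check_one_aio(struct aiocb *cb) {
  int ret = aio_error(cb);
  if (ret == EINPROGRESS)
    return;                                  // still pending; poll again later
  if (ret != 0) {
    fprintf(stderr, "aio failed: %s\n", strerror(ret));  // use ret, not errno
  } else {
    ssize_t nbytes = aio_return(cb);         // bytes transferred on success
    printf("aio read %zd bytes\n", nbytes);
  }
}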

@elliottslaughter
Contributor Author

I rebased on top of the newest MR 1440 and applied the following local patch:

diff --git a/runtime/realm/transfer/lowlevel_dma.cc b/runtime/realm/transfer/lowlevel_dma.cc
index 6d64c79be..79ab8c165 100644
--- a/runtime/realm/transfer/lowlevel_dma.cc
+++ b/runtime/realm/transfer/lowlevel_dma.cc
@@ -345,8 +345,8 @@ namespace Realm {
                       static_cast<void *>(&cb), ret);
         if (ret != 0)
         {
-          const char *message = realm_strerror(errno);
-          log_aio.fatal("Failed asynchronous IO read [%d]: %s", errno, message);
+          const char *message = realm_strerror(ret);
+          log_aio.fatal("Failed asynchronous IO read [%d]: %s", ret, message);
           abort();
         }
       }

Now I'm getting:

[0 - 7fd113bfac40]    1.316147 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor

@elliottslaughter
Contributor Author

The fix for the AIO error reporting is here: https://gitlab.com/StanfordLegion/legion/-/merge_requests/1456

@elliottslaughter
Contributor Author

For the record, this still reproduces on Sapling with the instructions at the top of this issue.

@muraj

muraj commented Sep 6, 2024

Bad file descriptor is interesting. Is it possible to get strace output that follows the file descriptor? I am wondering if this is related to the issue that Wei is working on with the IO bgworkers.

@elliottslaughter
Contributor Author

elliottslaughter commented Sep 10, 2024

Edit: Disregard these logs, they mixed output from both ranks into one file.

I updated the test.sh script on Sapling to collect logs via strace. Here's an example of a failing run test_1:

You're also welcome to look at the entire collection of output from other runs on Sapling at /scratch/eslaught/resilience-test-network/strace_output/test_*.

I'm not entirely sure what you're looking for in the logs, but one thing I noticed is that the logs contain either:

./test_13/output.log:[0 - 7f3de78fac40]    1.053084 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor

Or they contain:

./test_19/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffd8e07ed3c) = -1 EBADF (Bad file descriptor)
./test_19/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe1cf1af9c) = -1 EBADF (Bad file descriptor)

But they never contain both at the same time. I'm not sure if this is just a file buffering issue (though I have tried to flush things as aggressively as possible) or if this is somehow meaningful.

@muraj

muraj commented Sep 10, 2024

Thanks, I'll take a closer look later tonight. I'm trying to trace through when this file descriptor is closed and why. I suspect there is a race between this async read operation and a close somewhere that is invalidating the file descriptor. But I wanted to confirm that first with the strace logs.

@eddy16112
Contributor

If I understand the log correctly, I guess the issue is from:

shutdown(15, SHUT_RDWR)                 = 0
close(15)                               = 0
epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fff51a2e42c) = -1 EBADF (Bad file descriptor)

We close fd=15 and then call epoll_ctl on it.
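
For what it's worth, this is easy to demonstrate in isolation; a standalone sketch (plain Linux APIs, not Realm code) that produces exactly that strace line:

#include <sys/epoll.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void) {
  int ep = epoll_create1(0);
  int fds[2];
  pipe(fds);                                 // any pollable fd will do

  struct epoll_event ev;
  ev.events = EPOLLIN;
  ev.data.fd = fds[0];
  epoll_ctl(ep, EPOLL_CTL_ADD, fds[0], &ev);

  close(fds[0]);                             // fd closed first...
  if (epoll_ctl(ep, EPOLL_CTL_DEL, fds[0], NULL) == -1)
    printf("EPOLL_CTL_DEL: %s\n", strerror(errno));  // ...so DEL reports EBADF
  return 0;
}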

@muraj

muraj commented Sep 10, 2024

Yeah, that's what I suspected. Now the question is, where is the close coming from, and why is it before the async read is complete?

@eddy16112
Contributor

eddy16112 commented Sep 10, 2024

The fd is created by FileMemory::attempt_register_external_resource https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/transfer/lowlevel_disk.cc#L161 and closed by FileMemory::unregister_external_resource https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/transfer/lowlevel_disk.cc#L217, which are called by applications. I am not sure if it is the application's fault that the file is closed too early.

@eddy16112
Contributor

@elliottslaughter I ran grep -r "Bad file descriptor" * under your logs:

test_0/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe4ecfeefc) = -1 EBADF (Bad file descriptor)
test_0/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffcc2732fec) = -1 EBADF (Bad file descriptor)
test_1/output.log:[0 - 7fafd8123c40]    1.250129 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor
test_10/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffc0fd8cfac) = -1 EBADF (Bad file descriptor)
test_10/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fffbf98589c) = -1 EBADF (Bad file descriptor)
test_11/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffc0d9becec) = -1 EBADF (Bad file descriptor)
test_11/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffd8c136b9c) = -1 EBADF (Bad file descriptor)
test_12/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe7a19ce6c) = -1 EBADF (Bad file descriptor)
test_12/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffdd0adc6fc) = -1 EBADF (Bad file descriptor)
test_13/output.log:[0 - 7f3de78fac40]    1.053084 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor
test_14/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fff0985e1ac) = -1 EBADF (Bad file descriptor)
test_14/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffce4b272dc) = -1 EBADF (Bad file descriptor)
test_15/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffea041c8bc) = -1 EBADF (Bad file descriptor)
test_15/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffd3e62fc4c) = -1 EBADF (Bad file descriptor)
test_16/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffc0c1d849c) = -1 EBADF (Bad file descriptor)
test_17/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe78182f3c) = -1 EBADF (Bad file descriptor)
test_17/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffec178aefc) = -1 EBADF (Bad file descriptor)
test_18/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffc9d4c549c) = -1 EBADF (Bad file descriptor)
test_18/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fff75baa5ac) = -1 EBADF (Bad file descriptor)
test_19/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffd8e07ed3c) = -1 EBADF (Bad file descriptor)
test_19/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe1cf1af9c) = -1 EBADF (Bad file descriptor)
test_2/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffc7b30b29c) = -1 EBADF (Bad file descriptor)
test_2/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffdd8ea2d3c) = -1 EBADF (Bad file descriptor)
test_20/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fff7b7bf0ec) = -1 EBADF (Bad file descriptor)
test_20/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffcddc33eec) = -1 EBADF (Bad file descriptor)
test_21/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffc3f5afd8c) = -1 EBADF (Bad file descriptor)
test_21/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffee46d995c) = -1 EBADF (Bad file descriptor)
test_22/output.log:[0 - 7f67c81b6c40]    0.894134 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor
test_23/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe093a915c) = -1 EBADF (Bad file descriptor)
test_23/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffd3062e8bc) = -1 EBADF (Bad file descriptor)
test_24/output.log:[0 - 7fd3f8b68c40]    1.211296 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor
test_25/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fffc962893c) = -1 EBADF (Bad file descriptor)
test_25/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffcbad72a1c) = -1 EBADF (Bad file descriptor)
test_26/output.log:[0 - 7fac88d92c40]    1.004523 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor
test_27/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe6819513c) = -1 EBADF (Bad file descriptor)
test_27/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe860aa9bc) = -1 EBADF (Bad file descriptor)
test_28/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe73bd5e3c) = -1 EBADF (Bad file descriptor)
test_28/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fffdf562cec) = -1 EBADF (Bad file descriptor)
test_29/output.log:[0 - 7fef361a4c40]    1.049102 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor
test_3/output.log:[0 - 7ff7ba6a9c40]    1.039958 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor
test_30/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fff96dbebbc) = -1 EBADF (Bad file descriptor)
test_30/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffded2ea7ec) = -1 EBADF (Bad file descriptor)
test_31/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffcceb3f77c) = -1 EBADF (Bad file descriptor)
test_31/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffd91a6085c) = -1 EBADF (Bad file descriptor)
test_32/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffc788ec00c) = -1 EBADF (Bad file descriptor)
test_32/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffdc7d994ac) = -1 EBADF (Bad file descriptor)
test_33/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fffe61e9eac) = -1 EBADF (Bad file descriptor)
test_33/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffea0f2d8ac) = -1 EBADF (Bad file descriptor)
test_34/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe5e9ee28c) = -1 EBADF (Bad file descriptor)
test_34/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe3ceee46c) = -1 EBADF (Bad file descriptor)
test_35/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffc4cf3e17c) = -1 EBADF (Bad file descriptor)
test_35/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffd98f3b48c) = -1 EBADF (Bad file descriptor)
test_36/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe5b0cb2fc) = -1 EBADF (Bad file descriptor)
test_36/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffc0b79cdcc) = -1 EBADF (Bad file descriptor)
test_37/output.log:[0 - 7f000c137c40]    0.934232 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor
test_38/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffc13eaa9ac) = -1 EBADF (Bad file descriptor)
test_38/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe4037fdec) = -1 EBADF (Bad file descriptor)
test_39/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffedd5d141c) = -1 EBADF (Bad file descriptor)
test_39/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffc21d662ac) = -1 EBADF (Bad file descriptor)
test_4/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fff51a2e42c) = -1 EBADF (Bad file descriptor)
test_4/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffcaa5f5a2c) = -1 EBADF (Bad file descriptor)
test_40/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffdec07645c) = -1 EBADF (Bad file descriptor)
test_40/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fff43db0afc) = -1 EBADF (Bad file descriptor)
test_5/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffd9334317c) = -1 EBADF (Bad file descriptor)
test_6/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fff4d68014c) = -1 EBADF (Bad file descriptor)
test_6/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffc3d44572c) = -1 EBADF (Bad file descriptor)
test_7/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffeab01d4ec) = -1 EBADF (Bad file descriptor)
test_7/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fff5538572c) = -1 EBADF (Bad file descriptor)
test_8/output.log:[0 - 7f935423bc40]    0.943846 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor
test_9/output.log:[0 - 7fbf58f5ec40]    1.278778 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor

I realized that the Failed asynchronous IO read [9]: Bad file descriptor in output.log does not match the epoll_ctl(11, EPOLL_CTL_DEL, 15, ...) = -1 EBADF (Bad file descriptor) in strace.log. For example, we see test_0/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe4ecfeefc) = -1 EBADF (Bad file descriptor) in test_0, but test_0 actually finished correctly. On the other hand, we see test_1/output.log:[0 - 7fafd8123c40] 1.250129 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor in test_1, but there is no error in its strace.log.

@elliottslaughter
Contributor Author

Sorry, the files probably got mixed up because these were multi-node runs and I wasn't properly piping the output of each rank to a separate file. Please disregard the files above and look at the ones below, which are properly split per-rank:

test_0 hits:

test_0/output_0.log:[0 - 7ffaa42b0c40]    1.399569 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor

Files: output_0.log, output_1.log, strace_0.log, strace_1.log

test_2 hits:

test_2/strace_0.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffd1b81f8bc) = -1 EBADF (Bad file descriptor)
test_2/strace_1.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fff0f34f86c) = -1 EBADF (Bad file descriptor)

Files: output_0.log output_1.log strace_0.log strace_1.log

Once again I note that jobs hit one or the other but not both.

You can look at output from all of the jobs on Sapling: /scratch/eslaught/resilience-test-network/strace_output_per_rank

@eddy16112
Contributor

If they never hit both, it is hard to tell whether the "Bad file descriptor" from strace comes from the Realm AIO.

Could you please apply the patch below and regenerate the logs, or tell me how to pipe the output of each rank correctly so I can do it myself?

diff --git a/runtime/realm/transfer/channel_disk.cc b/runtime/realm/transfer/channel_disk.cc
index 492598c8f..62a83ab03 100644
--- a/runtime/realm/transfer/channel_disk.cc
+++ b/runtime/realm/transfer/channel_disk.cc
@@ -39,6 +39,7 @@ static int fsync(int fd)
 #endif
 
 namespace Realm {
+       extern Logger log_disk;
 
     FileXferDes::FileXferDes(uintptr_t _dma_op, Channel *_channel,
                             NodeID _launch_node, XferDesID _guid,
@@ -66,6 +67,7 @@ namespace Realm {
        
       RegionInstanceImpl *impl = get_runtime()->get_instance_impl(inst);
       file_info = static_cast<FileMemory::OpenFileInfo *>(impl->metadata.mem_specific);
+      log_disk.print("create FileXferDes, fd=%d", file_info->fd);
 
       const int max_nr = 10; // FIXME
       file_reqs = (FileRequest*) calloc(max_nr, sizeof(DiskRequest));
diff --git a/runtime/realm/transfer/lowlevel_disk.cc b/runtime/realm/transfer/lowlevel_disk.cc
index 6f78b42e5..90c2f175b 100644
--- a/runtime/realm/transfer/lowlevel_disk.cc
+++ b/runtime/realm/transfer/lowlevel_disk.cc
@@ -204,6 +204,7 @@ namespace Realm {
           OpenFileInfo *info = new OpenFileInfo;
           info->fd = fd;
           info->offset = res->offset;
+          log_disk.print("attempt_register_external_resource inst=%llx, fd=%d", inst->me.id, info->fd);
 
           inst->metadata.add_mem_specific(info);
           return true;
@@ -219,6 +220,7 @@ namespace Realm {
       OpenFileInfo *info = inst->metadata.find_mem_specific<OpenFileInfo>();
       assert(info != 0);
       int ret = close(info->fd);
+      log_disk.print("unregister_external_resource inst=%llx, fd=%d", inst->me.id, info->fd);
       if(ret == -1) {
         log_disk.warning() << "file failed to close cleanly, disk contents may be corrupted";
       }

I would like to know if the Failed asynchronous IO read error is printed after or before we close the file.

@elliottslaughter
Contributor Author

To be clear, the files in #1753 (comment) were generated while correctly piping output per-rank. I believe that seeing errors in either the output or the strace log (but never both) is a fundamental property of the problem we are looking at, not a mistake.

But I can certainly generate logs with the diff you suggested.

@eddy16112
Contributor

I believe that seeing errors in either the output or the strace log (but never both) is a fundamental property of the problem we are looking at, not a mistake.

I would like to understand why the error could show up in the strace but not in the output. Where does this error come from?

@elliottslaughter
Contributor Author

I put the log files you requested here:

/scratch/eslaught/resilience-test-network/strace_output_per_rank_with_extra_logging

E.g., in test_0/output_0.log you'll see:

[0 - 7f4963860c40]    1.171475 {3}{disk}: attempt_register_external_resource inst=4000000000800000, fd=22
Done!
[0 - 7f4960967c40]    1.288046 {3}{disk}: create FileXferDes, fd=22
[0 - 7f4963860c40]    1.324472 {3}{disk}: unregister_external_resource inst=4000000000800000, fd=22
[0 - 7f4960a6bc40]    1.333938 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor

@elliottslaughter
Contributor Author

I would like to understand why the error could show up in the strace but not in the output. Where does this error come from?

I don't know. But one thing to keep in mind is that strace reports syscalls. When we call aio_read or whatever, that's a function in libc. Maybe it maps directly to a syscall, or maybe it doesn't. Perhaps the aio_read function performs additional internal syscalls and in some scenarios it handles the error, and in other scenarios it doesn't? I'm speculating, but depending on how complex that function is, it's at least possible.

@eddy16112
Contributor

Based on the new logs, it is almost certain that we close the file before the aio call. There could be two causes:

  1. Realm notifies Legion/the application of the completion of the IO before the aio has actually completed.
  2. Legion or the application detaches the file too early.

BTW, here is the stack trace of closing the file; I am not sure whether we should blame Legion, Realm, or the application.

#9  0x00007f8673736c5d in Realm::FileMemory::unregister_external_resource (this=0x55b2d6b9c310, inst=0x7f85f6795820) at /scratch2/wwu/resilience-test-network/legion/runtime/realm/transfer/lowlevel_disk.cc:228
#10 0x00007f8673736ec6 in Realm::FileMemory::release_storage_immediate (this=0x55b2d6b9c310, inst=0x7f85f6795820, poisoned=false, work_until=...) at /scratch2/wwu/resilience-test-network/legion/runtime/realm/transfer/lowlevel_disk.cc:260
#11 0x00007f8673b3ee7e in Realm::MemoryImpl::release_storage_deferrable (this=0x55b2d6b9c310, inst=0x7f85f6795820, precondition=...) at /scratch2/wwu/resilience-test-network/legion/runtime/realm/mem_impl.cc:196
#12 0x00007f8673a9d82e in Realm::RegionInstance::destroy (this=0x7f85f6796e20, wait_on=...) at /scratch2/wwu/resilience-test-network/legion/runtime/realm/inst_impl.cc:521
#13 0x00007f867680a7d1 in Legion::Internal::PhysicalManager::perform_deletion (this=0x7f85f6796c40, source=0, i_lock=0x7f85f4200100) at /scratch2/wwu/resilience-test-network/legion/runtime/legion/legion_instances.cc:3212
#14 0x00007f867680a5b7 in Legion::Internal::PhysicalManager::perform_deletion (this=0x7f85f6796c40, source=0, i_lock=0x0) at /scratch2/wwu/resilience-test-network/legion/runtime/legion/legion_instances.cc:3179
#15 0x00007f8676cb51da in Legion::Internal::MemoryManager::detach_external_instance (this=0x7f85f6796840, manager=0x7f85f6796c40) at /scratch2/wwu/resilience-test-network/legion/runtime/legion/runtime.cc:11096
#16 0x00007f867680ac9d in Legion::Internal::PhysicalManager::detach_external_instance (this=0x7f85f6796c40) at /scratch2/wwu/resilience-test-network/legion/runtime/legion/legion_instances.cc:3310
#17 0x00007f86769ae6e3 in Legion::Internal::ReplDetachOp::detach_external_instance (this=0x7f85fa911e40, manager=0x7f85f6796c40) at /scratch2/wwu/resilience-test-network/legion/runtime/legion/legion_replication.cc:7078
#18 0x00007f867688d954 in Legion::Internal::DetachOp::trigger_complete (this=0x7f85fa911e40, effects=...) at /scratch2/wwu/resilience-test-network/legion/runtime/legion/legion_ops.cc:21175
#19 0x00007f867683e749 in Legion::Internal::Operation::complete_mapping (this=0x7f85fa911e40, wait_on=...) at /scratch2/wwu/resilience-test-network/legion/runtime/legion/legion_ops.cc:2019
#20 0x00007f86766e2465 in Legion::Internal::InnerContext::process_deferred_mapped_queue (this=0x7f85f6789650, precondition=..., previous_fevent=..., performed=0) at /scratch2/wwu/resilience-test-network/legion/runtime/legion/legion_context.cc:8880
#21 0x00007f86766f033c in Legion::Internal::InnerContext::handle_deferred_mapped_queue (args=0x55b2da2d6dc0) at /scratch2/wwu/resilience-test-network/legion/runtime/legion/legion_context.cc:12086
#22 0x00007f8676d07984 in Legion::Internal::Runtime::legion_runtime_task (args=0x55b2da2d6dc0, arglen=36, userdata=0x55b2da2d27e0, userlen=8, p=...) at /scratch2/wwu/resilience-test-network/legion/runtime/legion/runtime.cc:32502
#23 0x00007f8673b7e7f0 in Realm::LocalTaskProcessor::execute_task (this=0x55b2d6abf6e0, func_id=4, task_args=...) at /scratch2/wwu/resilience-test-network/legion/runtime/realm/proc_impl.cc:1176
#24 0x00007f8673bfb5dc in Realm::Task::execute_on_processor (this=0x55b2da2d6c40, p=...) at /scratch2/wwu/resilience-test-network/legion/runtime/realm/tasks.cc:326
#25 0x00007f8673c00956 in Realm::UserThreadTaskScheduler::execute_task (this=0x55b2d6bf6250, task=0x55b2da2d6c40) at /scratch2/wwu/resilience-test-network/legion/runtime/realm/tasks.cc:1687
#26 0x00007f8673bfe6f1 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x55b2d6bf6250) at /scratch2/wwu/resilience-test-network/legion/runtime/realm/tasks.cc:1160
#27 0x00007f8673c06dce in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0x55b2d6bf6250) at /scratch2/wwu/resilience-test-network/legion/runtime/realm/threads.inl:97

@elliottslaughter
Contributor Author

Is it even possible for a Legion application to detach an instance too early? The only way to do that is to flip the execution order of tasks, and then the instance would be simply invalid and the program would probably hit a deterministic error in Legion before ever reaching an I/O failure.

To me, your stack trace suggests Legion is processing the detach op through the runtime analysis pipeline, so if this is somehow out of order, it would be a dependence analysis failure in Legion.

@eddy16112
Contributor

I enabled the logging that checks the completion of aio (https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/transfer/lowlevel_dma.cc#L603).

In the first run, the log is:

[0 - 7f53f418bc40]    2.357283 {3}{disk}: create FileXferDes, fd=22
[0 - 7f53f418bc40]    2.365385 {3}{aio}: aio op completed: op=0x7f53a001cb70
[0 - 7f53f5d38c40]    2.409349 {3}{disk}: unregister_external_resource inst=4000000000800000, fd=22

but in the second run (the failed one), the log becomes:

[0 - 7fe01363ac40]    1.273747 {3}{disk}: create FileXferDes, fd=22
[0 - 7fe01a4acc40]    1.290690 {3}{disk}: unregister_external_resource inst=4000000000800000, fd=22
[0 - 7fe01363ac40]    1.302596 {3}{aio}: aio op completed: op=0x7fdfa401f9d0

Apparently, the detach op is issued too early.

@eddy16112
Contributor

I added more logs to check whether the FileXferDes is mark_completed before the detach. In failing runs, we detach the file before the xd is marked as completed. @lightsighter do you think this is a bug in Legion?

@lightsighter
Contributor

Legion makes sure that the completion events for all copies involving an instance (including file instances) are used as preconditions for deleting the instance. I suspect that the problem is that some of the asynchronous I/O operations are not being correctly reflected in the completion event of the copies.
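
To illustrate that contract at the Realm level, here is a rough sketch (API names and signatures recalled from memory, and the helper function is hypothetical, so treat the details as assumptions): the copy's completion event should gate destruction of the file instance, and therefore the close() performed by unregister_external_resource.

#include <vector>
#include "realm.h"

// Hedged sketch: the event returned by the copy gates destruction of the
// file-backed instance, so the fd is not closed while aio reads are in flight.
Realm::Event copy_then_destroy(Realm::IndexSpace<1> is,
                               Realm::RegionInstance file_inst,
                               Realm::RegionInstance mem_inst,
                               Realm::FieldID fid, size_t field_size)
{
  std::vector<Realm::CopySrcDstField> srcs(1), dsts(1);
  srcs[0].set_field(file_inst, fid, field_size);
  dsts[0].set_field(mem_inst, fid, field_size);

  Realm::Event copy_done = is.copy(srcs, dsts, Realm::ProfilingRequestSet());

  // The destroy (and hence the close() in unregister_external_resource) is
  // deferred until the copy has actually completed.
  file_inst.destroy(copy_done);
  return copy_done;
}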

@eddy16112
Contributor

@elliottslaughter do you think you could create a simple Legion reproducer for the pattern you are using for Regent checkpointing? I tried to create a reproducer that does the following (see the sketch after the list):

attach_external_resource
issue_copy_operation
detach_external_resource
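
Roughly, in the Legion C++ API the pattern I mean looks like this (a sketch only; launcher and enum names are from memory and may not match the current API exactly):

#include <vector>
#include "legion.h"
using namespace Legion;

// Sketch of the reproducer pattern: attach a file, copy out of it, detach.
// The detach is supposed to depend on the copy, so the underlying fd should
// not be closed while the aio read is still in flight.
void attach_copy_detach(Runtime *runtime, Context ctx,
                        LogicalRegion file_lr, LogicalRegion mem_lr,
                        FieldID fid, const char *filename)
{
  AttachLauncher attach(LEGION_EXTERNAL_POSIX_FILE, file_lr, file_lr);
  std::vector<FieldID> fields(1, fid);
  attach.attach_file(filename, fields, LEGION_FILE_READ_ONLY);
  PhysicalRegion file_pr = runtime->attach_external_resource(ctx, attach);

  CopyLauncher copy;
  copy.add_copy_requirements(
      RegionRequirement(file_lr, READ_ONLY, EXCLUSIVE, file_lr),
      RegionRequirement(mem_lr, WRITE_DISCARD, EXCLUSIVE, mem_lr));
  copy.add_src_field(0, fid);
  copy.add_dst_field(0, fid);
  runtime->issue_copy_operation(ctx, copy);

  // Legion is expected to order the detach (and the eventual close of the fd)
  // after the copy completes.
  runtime->detach_external_resource(ctx, file_pr).get_void_result();
}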

But I cannot reproduce the out-of-order issue (the detach being issued before the completion of the aio). Even if I manually slow down the aio code, I cannot reproduce it.

I tried to run your Regent test sequentially, and the out-of-order issue disappeared. Apparently, the issue only happens with parallel runs. However, I can even see the out-of-order logs in output_0.log for some successful runs, so I am not sure whether the "out of order" is just Realm printing the log messages out of order.

@lightsighter
Contributor

However, I can even see the out-of-order logs in output_0.log for some successful runs, so I am not sure whether the "out of order" is just Realm printing the log messages out of order.

Realm's logging will be in the order that it was logged in time from the same node, so I think you are actually seeing the two different orderings with your print. Can you log the completion events for all those aio copies in Realm and then also run with event=1 to see if those events trigger before or after the aio operations are done?

@eddy16112
Contributor

@lightsighter here is what I found from the logs; they are divided into rank 0 and rank 1.
The logs of the successful run:
The file copy operation is 0x7f7cb001f5b0 and its event is 8000801000d00003; we can see the event is triggered before unregistering the external resource.

[1 - 7f7d2d65ac40]    0.435905 {2}{dma}: dma request 0x7f7cb001f5b0 created - plan=0x7f7cb000ed10 before=8000801000f00002 after=8000801000d00003
[1 - 7f7d2d65ac40]    0.435919 {2}{dma}: dma request 0x7f7cb001f5b0 ready - plan=0x7f7cb000ed10 before=8000801000f00002 after=8000801000d00003
[1 - 7f7d2d65ac40]    0.435925 {2}{dma}: dma request 0x7f7cb001f5b0 started - plan=0x7f7cb000ed10 before=8000801000f00002 after=8000801000d00003
...
[1 - 7f7d266a9c40]    0.436674 {2}{dma}: dma request 0x7f7cb001f5b0 completed - plan=0x7f7cb000ed10 before=8000801000f00002 after=8000801000d00003
[1 - 7f7d266a9c40]    0.436679 {1}{event}: event triggered: event=8000801000d00003 by node 1 (poisoned=0)
[1 - 7f7d267adc40]    0.436922 {1}{event}: event subscription: node=0 event=8000801000d00003
[1 - 7f7d267adc40]    0.436930 {1}{event}: event subscription immediate trigger: node=0 event=8000801000d00003 (<= 3)

[0 - 7f4cfb7acc40]    0.413373 {3}{disk}: attempt_register_external_resource inst=4000000000800000, fd=22
[0 - 7f4cf88b3c40]    0.436266 {3}{disk}: create FileXferDes, op=0x7f7cb001f5b0, fd=22, xd=0x7f4c7c003640
[0 - 7f4cf89b7c40]    0.436386 {2}{xd}: bytes_write: 10000000a(0) 44+0 -> 0
[0 - 7f4cf89b7c40]    0.436399 {2}{xd}: completion: xd=10000000a total_bytes=44 pending=44
[0 - 7f4cf89b7c40]    0.436525 {3}{aio}: enqueue_read launched: op=0x7f4cac044030, req=0x7f4c7c003b30, xd=0x7f4c7c003640
...
[0 - 7f4cf88b3c40]    0.436593 {3}{aio}: aio op completed: op=0x7f4cac044030, req=0x7f4c7c003b30, xd=0x7f4c7c003640
[0 - 7f4cf88b3c40]    0.436609 {2}{xd}: bytes_read: 10000000a(0) 0+44 -> 44
[0 - 7f4cf88b3c40]    0.436617 {2}{xd}: bytes_write: 10000000a(0) 0+44 -> 44
[0 - 7f4cf88b3c40]    0.436623 {2}{xd}: completion: xd=10000000a remaining=0
[0 - 7f4cf89b7c40]    0.436634 {3}{xd}: xd=0x7f4c7c003640 mark_completed
[0 - 7f4cfb7acc40]    0.436860 {2}{event}: deferring user event trigger: event=8000002001000002 wait_on=8000801000d00003
[0 - 7f4cf89b7c40]    0.436944 {1}{event}: event update: event=8000801000d00003 poisoned=0:{}
[0 - 7f4cfb7acc40]    0.439114 {3}{disk}: unregister_external_resource inst=4000000000800000, fd=22

Here are the logs of the failed run:
The copy 0x7fb44801f0f0 is not completed, and the event 8000801000c00004 is not triggered before unregistering the external resource.

[1 - 7fb4bbbeec40]    1.073064 {2}{dma}: dma request 0x7fb44801f0f0 created - plan=0x7fb44800e7a0 before=8000801000e00002 after=8000801000c00004
[1 - 7fb4b8c8ec40]    1.073681 {2}{dma}: dma request 0x7fb44801f0f0 ready - plan=0x7fb44800e7a0 before=8000801000e00002 after=8000801000c00004
[1 - 7fb4b8c8ec40]    1.073689 {2}{dma}: dma request 0x7fb44801f0f0 started - plan=0x7fb44800e7a0 before=8000801000e00002 after=8000801000c00004
[1 - 7fb4b8c8ec40]    1.074819 {1}{event}: event subscription: node=0 event=8000801000c00004
[1 - 7fb4b8c8ec40]    1.074827 {1}{event}: event subscription recorded: node=0 event=8000801000c00004 (> 3)
[1 - 7fb4b8c8ec40]    1.074832 {1}{event}: event subscription immediate trigger: node=0 event=8000801000c00004 (<= 3)
[1 - 7fb4bbbeec40]    1.078966 {2}{event}: deferring user event trigger: event=8000801000600008 wait_on=8000801000c00004

[0 - 7f1485c70c40]    0.981099 {3}{disk}: attempt_register_external_resource inst=4000000000800000, fd=22
[0 - 7f147edb6c40]    1.074091 {3}{disk}: create FileXferDes, op=0x7fb44801f0f0, fd=22, xd=0x7f1400019810
[0 - 7f147edb6c40]    1.074235 {2}{xd}: bytes_write: 10000000a(0) 44+0 -> 0
[0 - 7f147edb6c40]    1.074247 {2}{xd}: completion: xd=10000000a total_bytes=44 pending=44
[0 - 7f147ecb2c40]    1.074396 {1}{event}: event update: event=8000801000e00002 poisoned=0:{}
[0 - 7f147edb6c40]    1.074402 {3}{aio}: enqueue_read launched: op=0x7f140001e400, req=0x7f140001de00, xd=0x7f1400019810
[0 - 7f1485c70c40]    1.074671 {2}{event}: deferring user event trigger: event=8000002000800005 wait_on=8000801000c00004
...
[0 - 7f1485c70c40]    1.082443 {3}{disk}: unregister_external_resource inst=4000000000800000, fd=22

I can send you the full logs if you need them.

@lightsighter
Contributor

@elliottslaughter Are you sure you fixed all the bugs where detaches were issued after the parent task had already exited? There are still no checks for that in the master branch, so if you're doing that it would cause a problem like this.

@elliottslaughter
Contributor Author

We delete the resilient runtime (which stores all the data structures) before running the postamble, if that's what you're asking:

https://github.com/StanfordLegion/resilience/blob/1ef5ba534d6ffdfb3b01d81a4304ccf1d7528afa/src/resilience/resilience.cc#L2300-L2302

I suppose if there were a set of loggers that listed the postamble execution and any potential detaches, that would answer the question definitively. Do you know if we have that?

@lightsighter
Contributor

There are no loggers that check that right now. The grandrefactor branch has a check that you aren't using a context after calling legion_task_postamble, but it's hard to backport that to the master branch because it relies on a bunch of other refactoring in that branch.

I'd like to see detailed Legion Spy logging from a failing run along with the name of the completion event for the bad copy.
