cudaErrorIllegalAddress, possibly related to the CachingDeviceAllocator #306
Attached are the full logs from two failed runs:
@makortel do you have any ideas?
Note that this does not happen when using MPS; see #307.
Thanks, I'll take a look (deep dive...). Did I understand correctly that the crash occurs only if you run multiple jobs in parallel? I.e. a single job with 3-4 streams/threads works?
Yes, the crash happens when running 2 jobs, with 4 streams/threads each, on the same GPU. Running a single job works. Looking at the extended logs, everything seems in order, so I am inclined to consider this a CUDA bug...
I can reproduce on
It looks like the two processes must have in total >= 7 threads for the crash to occur. E.g. 4-4, 4-3, 5-2 crash, whereas e.g. 4-2, 5-1 do not seem to crash (ok, I did try only a couple of times). On the other hand, 6-1 seems to work as well (and 6-2 crashes).
I can reproduce this (also under gdb) on a V100 and a T4. It is not clear if it happens on a GTX 1080 or a P100.
By the way, during the E4 Hackathon, an NVIDIA engineer mentioned the new RAPIDS Memory Manager (RMM) for device memory.
Actually, RMM seems like a thin wrapper around the CNMeM library. |
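For context, here is a minimal sketch of what allocating through RMM's pool resource could look like, instead of going through cudaMalloc/cudaFree for every block. The headers, class names and constructor arguments below follow the public RMM documentation as I understand it and are not taken from this repository, so they may differ between RMM releases:

```cpp
// Minimal sketch (not from this repo): stream-ordered allocations from an RMM pool.
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

#include <cuda_runtime.h>

int main() {
  rmm::mr::cuda_memory_resource upstream;  // plain cudaMalloc/cudaFree underneath
  rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool{
      &upstream, 1u << 30};                // pre-allocate a 1 GiB pool up front

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Stream-ordered allocation and deallocation from the pool,
  // analogous to CachingDeviceAllocator::DeviceAllocate/DeviceFree.
  void* d_buf = pool.allocate(256 * sizeof(float), stream);
  // ... launch kernels using d_buf on `stream` ...
  pool.deallocate(d_buf, 256 * sizeof(float), stream);

  cudaStreamDestroy(stream);
  return 0;
}
```

The pool is carved out of a few large upstream allocations, so most allocate/deallocate calls are handled inside the process and do not hit the CUDA driver.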
The same problem is reproducible when running in parallel multiple copies of the
Interesting. Have you tested whether the crash also occurs with CUDA 11? If this crash is considered a future blocker, I'd first try to reduce the (ridiculous) number of CUDA events along the lines of #487.
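To illustrate where those events come from: a caching device allocator typically records an event on the owning stream when a block is freed, and only recycles the block once that event has completed, so every cached block carries its own cudaEvent_t. A rough sketch of that pattern follows; the CachedBlock/cacheBlock/tryReuse names are made up for illustration and this is not the actual allocator code:

```cpp
// Rough sketch of the event-per-freed-block pattern used by caching allocators.
#include <cstddef>
#include <cuda_runtime.h>

struct CachedBlock {
  void* ptr = nullptr;
  std::size_t bytes = 0;
  cudaStream_t stream = nullptr;  // stream the block was last used on
  cudaEvent_t ready = nullptr;    // recorded when the block was freed
};

// On free: keep the block and record an event, so we know when the stream
// has finished the work that may still be using this memory.
void cacheBlock(CachedBlock& block) {
  cudaEventCreateWithFlags(&block.ready, cudaEventDisableTiming);
  cudaEventRecord(block.ready, block.stream);
}

// On allocate: a cached block may only be handed to a *different* stream once
// its event has completed; otherwise the new owner could overwrite memory the
// old stream is still reading, i.e. an illegal access.
bool tryReuse(CachedBlock& block, std::size_t bytes, cudaStream_t newStream) {
  if (block.bytes < bytes)
    return false;
  if (block.stream != newStream && cudaEventQuery(block.ready) != cudaSuccess)
    return false;  // previous work not finished (or an error is pending)
  cudaEventDestroy(block.ready);
  block.ready = nullptr;
  block.stream = newStream;
  return true;
}
```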
With CMSSW it happens also with CUDA 11.0 and 11.1.
So far, using MPS is a viable workaround - and we would likely want to use it anyway to get better performance.
Andreas (from NVIDIA) was able to reproduce the crash with full CMSSW and with pixeltrack-standalone, so he might be able to investigate...
However, reducing the number of CUDA events might be worthwhile on its own.
I just reproduced this on a single process of
This was with CUDA 11.1.
Hopefully fixed by cms-sw#34725.
When running multiple cmsRun applications sharing the same GPU, they have a random chance of crashing during the first event with a message similar to
This seems to happen frequently if the jobs are configured with 3-4 streams each, while it has not been observed if the jobs are configured with 7-8 streams each.
GPU memory itself should not be an issue, as this happens also on a V100 with 32 GB.
Enabling the allocator debug messages (and extending them a bit) gives, for example, before the error:
The line in question is
and the error seems to come genuinely from it; checking cudaGetLastError() right before it reports nothing.
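The check described above can be sketched roughly as follows. The cudaCheck helper is a stand-in (not necessarily the one used in CMSSW), and cudaMalloc stands in for the actual failing call, which is not shown here:

```cpp
// Sketch of the error-localization check described above.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define cudaCheck(call)                                                         \
  do {                                                                          \
    cudaError_t err__ = (call);                                                 \
    if (err__ != cudaSuccess) {                                                 \
      std::fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__, __LINE__, #call, \
                   cudaGetErrorString(err__));                                  \
      std::abort();                                                             \
    }                                                                           \
  } while (0)

void checkedAllocate(void** d_ptr, std::size_t bytes) {
  // 1) Check for any error left behind by earlier asynchronous work;
  //    in the failing runs this reported no error...
  cudaCheck(cudaGetLastError());
  // 2) ...while the next runtime call itself returned cudaErrorIllegalAddress,
  //    suggesting the error is raised there rather than inherited.
  //    cudaMalloc is used here only as a placeholder for the suspect call.
  cudaCheck(cudaMalloc(d_ptr, bytes));
}
```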