cudaErrorIllegalAddress, possibly related to the CachingDeviceAllocator #306
Attached are the full logs from two failed runs:
@makortel do you have any ideas?
Note that this does not happen when using MPS; see #307.
Thanks, I'll take a look (deep dive...). Did I understand correctly that the crash occurs only if you run multiple jobs in parallel? I.e. a single job with 3-4 streams/threads works?
Yes, the crash happens when running 2 jobs, with 4 streams/threads each, on the same GPU. Running a single job works. Looking at the extended logs, everything seems in order, so I am inclined to consider this a CUDA bug...
I can reproduce on
It looks like the two processes must have in total >= 7 threads for the crash to occur. E.g. 4-4, 4-3, 5-2 crash, whereas e.g. 4-2, 5-1 do not seem to crash (ok, I did try only a couple of times). On the other hand, 6-1 seems to work as well (and 6-2 crashes).
I can reproduce this (also under gdb) on a V100 and a T4. It is not clear if it happens on a GTX 1080 or a P100.
By the way, during the E4 Hackathon, an NVIDIA engineer mentioned the new RAPIDS Memory Manager (RMM) for device memory.
Actually, RMM seems like a thin wrapper around the CNMeM library. |
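For context, here is a minimal sketch of what allocating through RMM's pool resource could look like, instead of going through cudaMalloc/cudaFree for every block. The headers, class names and constructor arguments below follow the public RMM documentation as I understand it and are not taken from this repository, so they may differ between RMM releases:

```cpp
// Minimal sketch (not from this repo): stream-ordered allocations from an RMM pool.
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

#include <cuda_runtime.h>

int main() {
  rmm::mr::cuda_memory_resource upstream;  // plain cudaMalloc/cudaFree underneath
  rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool{
      &upstream, 1u << 30};                // pre-allocate a 1 GiB pool up front

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Stream-ordered allocation and deallocation from the pool,
  // analogous to CachingDeviceAllocator::DeviceAllocate/DeviceFree.
  void* d_buf = pool.allocate(256 * sizeof(float), stream);
  // ... launch kernels using d_buf on `stream` ...
  pool.deallocate(d_buf, 256 * sizeof(float), stream);

  cudaStreamDestroy(stream);
  return 0;
}
```

The pool is carved out of a few large upstream allocations, so most allocate/deallocate calls are handled inside the process and do not hit the CUDA driver.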
The same problem is reproducible when running in parallel multiple copies of the
Interesting. Have you tested whether the crash also occurs with CUDA 11? If this crash is considered a future blocker, I'd first try to reduce the (ridiculous) number of CUDA events along the lines of #487.
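To illustrate where those events come from: a caching device allocator typically records an event on the owning stream when a block is freed, and only recycles the block once that event has completed, so every cached block carries its own cudaEvent_t. A rough sketch of that pattern follows; the CachedBlock/cacheBlock/tryReuse names are made up for illustration and this is not the actual allocator code:

```cpp
// Rough sketch of the event-per-freed-block pattern used by caching allocators.
#include <cstddef>
#include <cuda_runtime.h>

struct CachedBlock {
  void* ptr = nullptr;
  std::size_t bytes = 0;
  cudaStream_t stream = nullptr;  // stream the block was last used on
  cudaEvent_t ready = nullptr;    // recorded when the block was freed
};

// On free: keep the block and record an event, so we know when the stream
// has finished the work that may still be using this memory.
void cacheBlock(CachedBlock& block) {
  cudaEventCreateWithFlags(&block.ready, cudaEventDisableTiming);
  cudaEventRecord(block.ready, block.stream);
}

// On allocate: a cached block may only be handed to a *different* stream once
// its event has completed; otherwise the new owner could overwrite memory the
// old stream is still reading, i.e. an illegal access.
bool tryReuse(CachedBlock& block, std::size_t bytes, cudaStream_t newStream) {
  if (block.bytes < bytes)
    return false;
  if (block.stream != newStream && cudaEventQuery(block.ready) != cudaSuccess)
    return false;  // previous work not finished (or an error is pending)
  cudaEventDestroy(block.ready);
  block.ready = nullptr;
  block.stream = newStream;
  return true;
}
```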
With CMSSW it happens also with CUDA 11.0 and 11.1.
So far, using MPS is a viable workaround - and we would likely want to use it anyway to get better performance.
Andreas (from NVIDIA) was able to reproduce the crash with full CMSSW and with pixeltrack-standalone, so he might be able to investigate...
However, reducing the number of CUDA events might be worthwhile on its own.
I just reproduced this on a single process of
This was with CUDA 11.1.
Hopefully fixed by cms-sw#34725.
When running multiple cmsRun applications sharing the same GPU, they have a random chance of crashing during the first event with a message similar to
This seems to happen frequently if the jobs are configured with 3-4 streams each, while it has not been observed if the jobs are configured with 7-8 streams each.
GPU memory itself should not be an issue, as this happens also on a V100 with 32 GB.
Enabling the allocator debug messages (and extending them a bit) gives, for example, before the error:
The line in question is
and the error seems to come genuinely from it; checking cudaGetLastError() right before it reports nothing.
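The check described above can be sketched roughly as follows. The cudaCheck helper is a stand-in (not necessarily the one used in CMSSW), and cudaMalloc stands in for the actual failing call, which is not shown here:

```cpp
// Sketch of the error-localization check described above.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define cudaCheck(call)                                                         \
  do {                                                                          \
    cudaError_t err__ = (call);                                                 \
    if (err__ != cudaSuccess) {                                                 \
      std::fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__, __LINE__, #call, \
                   cudaGetErrorString(err__));                                  \
      std::abort();                                                             \
    }                                                                           \
  } while (0)

void checkedAllocate(void** d_ptr, std::size_t bytes) {
  // 1) Check for any error left behind by earlier asynchronous work;
  //    in the failing runs this reported no error...
  cudaCheck(cudaGetLastError());
  // 2) ...while the next runtime call itself returned cudaErrorIllegalAddress,
  //    suggesting the error is raised there rather than inherited.
  //    cudaMalloc is used here only as a placeholder for the suspect call.
  cudaCheck(cudaMalloc(d_ptr, bytes));
}
```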