
Investigate the use of CUDA managed memory #85

Open
fwyzard opened this issue Jun 20, 2018 · 8 comments
fwyzard commented Jun 20, 2018

Given the small time spent in memory transfer, and the possibility to optimise it via prefetching, it makes sense to investigate using CUDA managed memory.
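To make the idea concrete, a minimal sketch (not from this thread) of the pattern under investigation: allocate with cudaMallocManaged() and use cudaMemPrefetchAsync() to hide the migration cost, assuming device 0 and a Pascal-or-newer GPU.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* data, int n, float factor) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

int main() {
  const int n = 1 << 20;
  float* data;
  cudaMallocManaged(&data, n * sizeof(float));  // one pointer, valid on host and device
  for (int i = 0; i < n; ++i) data[i] = 1.f;

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Prefetch to the GPU so the kernel does not page-fault on first touch.
  cudaMemPrefetchAsync(data, n * sizeof(float), 0 /* device 0 */, stream);
  scale<<<(n + 255) / 256, 256, 0, stream>>>(data, n, 2.f);
  // Prefetch the result back before the CPU reads it.
  cudaMemPrefetchAsync(data, n * sizeof(float), cudaCpuDeviceId, stream);
  cudaStreamSynchronize(stream);

  printf("data[0] = %f\n", data[0]);
  cudaFree(data);
  cudaStreamDestroy(stream);
}
```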

To form a good idea of what it involves, one can go through these 2017 CUDA blog posts:

For further reading:

cmsbot commented Jun 20, 2018

A new Issue was created by @fwyzard Andrea Bocci.

Can you please review it and, if appropriate, sign/assign? Thanks.

cms-bot commands are listed here


fwyzard commented Jun 20, 2018

@makortel

Just to write up one idea that came up in a discussion with @fwyzard and @felicepantaleo.

It seems that the main(?) drawback of unified memory is that making device-to-host prefetches asynchronous on the CPU is a bit complicated (from @fwyzard's [third link](https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/)):

For device-to-host prefetches that are not deferred by the driver, the call doesn’t return until the entire prefetch operation has completed. This is because the CPU’s page tables cannot be updated asynchronously. So to unblock the CPU for device-to-host prefetches, the stream should not be idle when calling cudaMemPrefetchAsync.

(and for that, "deferred" means:

For busy CUDA streams, the call to prefetch is deferred to a separate background thread by the driver because the prefetch operation has to execute in stream order. The background thread performs the prefetch operation when all prior operations in the stream complete. For idle streams, the driver has a choice to either defer the operation or not, but the driver typically does not defer because of the associated overhead.

)

So one option would be a mixed approach: unified memory for transferring data to the GPU (especially for conditions), and explicit memory for transferring data back to the CPU.
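A sketch of the workaround the quoted post implies (assumed usage, not code from this thread): issue the device-to-host prefetch while the stream is still busy, so the driver defers it to its background thread and the call returns immediately instead of blocking until the CPU page tables are updated.

```cpp
#include <cuda_runtime.h>

__global__ void produce(float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = i * 0.5f;
}

void run(float* managed, int n, cudaStream_t stream) {
  // The kernel keeps the stream busy...
  produce<<<(n + 255) / 256, 256, 0, stream>>>(managed, n);
  // ...so this prefetch is deferred by the driver and does not block the CPU.
  cudaMemPrefetchAsync(managed, n * sizeof(float), cudaCpuDeviceId, stream);
  // The CPU is free to do other work here; synchronize only when the
  // result is actually needed.
  cudaStreamSynchronize(stream);
}
```

If the stream were idle at the point of the prefetch call, the driver would typically run it synchronously, stalling the calling thread.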


makortel commented Sep 3, 2018

#157 experiments with unified memory for conditions

fwyzard pushed a commit that referenced this issue Nov 1, 2018
New version of templated code based on "trait structs"
@makortel

@fwyzard re #267 (comment) (I started to write a reply but never finished it; following up here)

Whether to manage the device and host memories separately or to use unified memory is still under discussion. I suppose that in the latter case we would still want a caching allocator to avoid calling cudaMallocManaged() every time.
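A hypothetical sketch of that caching idea: keep freed cudaMallocManaged() blocks in per-size free lists and reuse them, so steady-state processing never calls into the driver. All names are invented for illustration; this is not the project's allocator, and it is deliberately simplified (exact-size bins, no thread safety).

```cpp
#include <cuda_runtime.h>
#include <map>
#include <vector>

class ManagedCachingAllocator {
public:
  void* allocate(size_t bytes) {
    auto& freeList = cache_[bytes];
    if (!freeList.empty()) {       // fast path: reuse a cached block
      void* p = freeList.back();
      freeList.pop_back();
      return p;
    }
    void* p = nullptr;
    cudaMallocManaged(&p, bytes);  // slow path: ask the driver
    sizes_[p] = bytes;
    return p;
  }

  void deallocate(void* p) {       // return the block to the cache
    cache_[sizes_[p]].push_back(p);
  }

  ~ManagedCachingAllocator() {     // release everything at shutdown
    for (auto& [bytes, list] : cache_)
      for (void* p : list) cudaFree(p);
  }

private:
  std::map<size_t, std::vector<void*>> cache_;  // size -> free blocks
  std::map<void*, size_t> sizes_;               // live pointer -> size
};
```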

With a recent enough kernel (4.14, so RHEL 7 with an updated kernel, or RHEL 8) malloc/free should be enough.

You're referring to HMM, right? That essentially makes standard malloc() return a pointer to unified memory, right?

I wonder if malloc()+free() would then become implicitly synchronizing as well. On the other hand, we'd then have jemalloc doing the caching between us and the OS (or so I presume), so maybe in practice the synchronization wouldn't matter more than with a custom caching allocator.


fwyzard commented Feb 19, 2019

From what I understand (see e.g. https://lwn.net/Articles/731259/ ) it is kind of the opposite: any memory area can be mapped from the host to the device; when the CPU later tries to access it, it triggers a page fault and the memory is copied back to the host.

So my guess is that all memory returned by malloc, mmap, jemalloc, etc. can work with Heterogeneous Memory Management, and can be passed on to the GPU.
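If that guess holds, the experiment would look roughly like this sketch (untested speculation matching the discussion, assuming a Linux kernel >= 4.14, a Pascal-or-newer GPU, and an HMM-enabled driver): hand a plain malloc() pointer directly to a kernel and let page faults migrate the data.

```cpp
#include <cuda_runtime.h>
#include <cstdlib>

__global__ void increment(int* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1;
}

int main() {
  const int n = 1024;
  // Ordinary host allocation: no cudaMallocManaged(), no cudaHostRegister().
  int* data = static_cast<int*>(malloc(n * sizeof(int)));
  for (int i = 0; i < n; ++i) data[i] = i;

  // With HMM the device can fault these pages in; without it this launch
  // would fail, which is exactly what the experiment would check.
  increment<<<(n + 255) / 256, 256>>>(data, n);
  cudaDeviceSynchronize();

  free(data);
}
```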

The next step would be to try it in practice... but I haven't been able to set up vinavx2 or another machine with a recent enough kernel, and my laptop has a Maxwell card, while this requires Pascal or newer.

And, as it will likely require CentOS 8 for use in production, it may be something we have to delay for a while.

@makortel

I'd still expect (in the absence of better information) that HMM internally talks to the NVIDIA driver, and that for the "HMM memory" the driver and the device have to do something similar to what is done for cudaMallocManaged()+cudaFree(). Therefore, since cudaMallocManaged()+cudaFree() create an implicit synchronization point, I'd assume "HMM memory" would have similar constraints. (But I'm happy to be proven wrong.)

@makortel

I'm planning to do a full-scale study with the pixeltrack-standalone, tracked in cms-patatrack/pixeltrack-standalone#43.
