
Investigate the use of CUDA managed memory #85

Open
fwyzard opened this issue Jun 20, 2018 · 8 comments
fwyzard commented Jun 20, 2018

Given the small time spent in memory transfer, and the possibility to optimise it via prefetching, it makes sense to investigate using CUDA managed memory.
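To make the idea concrete, a minimal sketch (not from this thread) of the pattern under investigation: allocate with cudaMallocManaged() and use cudaMemPrefetchAsync() to hide the migration cost, assuming device 0 and a Pascal-or-newer GPU.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* data, int n, float factor) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

int main() {
  const int n = 1 << 20;
  float* data;
  cudaMallocManaged(&data, n * sizeof(float));  // one pointer, valid on host and device
  for (int i = 0; i < n; ++i) data[i] = 1.f;

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Prefetch to the GPU so the kernel does not page-fault on first touch.
  cudaMemPrefetchAsync(data, n * sizeof(float), 0 /* device 0 */, stream);
  scale<<<(n + 255) / 256, 256, 0, stream>>>(data, n, 2.f);
  // Prefetch the result back before the CPU reads it.
  cudaMemPrefetchAsync(data, n * sizeof(float), cudaCpuDeviceId, stream);
  cudaStreamSynchronize(stream);

  printf("data[0] = %f\n", data[0]);
  cudaFree(data);
  cudaStreamDestroy(stream);
}
```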

To form a good idea of what it involves, one can go through these 2017 CUDA blog posts:

For further reading:

cmsbot commented Jun 20, 2018

A new Issue was created by @fwyzard Andrea Bocci.

Can you please review it and, if appropriate, sign/assign? Thanks.

cms-bot commands are listed here


fwyzard commented Jun 20, 2018

@makortel

Just to write up one idea that came up in a discussion with @fwyzard and @felicepantaleo.

It seems that the main(?) drawback of unified memory is that making device-to-host prefetches asynchronous on the CPU is a bit complicated (from @fwyzard's [third link](https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/)):

For device-to-host prefetches that are not deferred by the driver, the call doesn’t return until the entire prefetch operation has completed. This is because the CPU’s page tables cannot be updated asynchronously. So to unblock the CPU for device-to-host prefetches, the stream should not be idle when calling cudaMemPrefetchAsync.

(and for that, "deferred" means:

For busy CUDA streams, the call to prefetch is deferred to a separate background thread by the driver because the prefetch operation has to execute in stream order. The background thread performs the prefetch operation when all prior operations in the stream complete. For idle streams, the driver has a choice to either defer the operation or not, but the driver typically does not defer because of the associated overhead.

)

So one option would be a mixed approach: unified memory for transferring data to the GPU (especially for conditions), and explicit memory for transferring data back to the CPU.
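A sketch of the workaround the quoted post implies (assumed usage, not code from this thread): issue the device-to-host prefetch while the stream is still busy, so the driver defers it to its background thread and the call returns immediately instead of blocking until the CPU page tables are updated.

```cpp
#include <cuda_runtime.h>

__global__ void produce(float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = i * 0.5f;
}

void run(float* managed, int n, cudaStream_t stream) {
  // The kernel keeps the stream busy...
  produce<<<(n + 255) / 256, 256, 0, stream>>>(managed, n);
  // ...so this prefetch is deferred by the driver and does not block the CPU.
  cudaMemPrefetchAsync(managed, n * sizeof(float), cudaCpuDeviceId, stream);
  // The CPU is free to do other work here; synchronize only when the
  // result is actually needed.
  cudaStreamSynchronize(stream);
}
```

If the stream were idle at the point of the prefetch call, the driver would typically run it synchronously, stalling the calling thread.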


makortel commented Sep 3, 2018

#157 experiments with unified memory for conditions

fwyzard pushed a commit that referenced this issue Nov 1, 2018
New version of templated code based on "trait structs"
@makortel

@fwyzard re #267 (comment) (I started to write a reply but never finished it; following up here)

Whether to manage the device and host memories separately or to use unified memory is still under discussion. I suppose that in the latter case we would still want a caching allocator to avoid calling cudaMallocManaged() every time.
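A hypothetical sketch of that caching idea: keep freed cudaMallocManaged() blocks in per-size free lists and reuse them, so steady-state processing never calls into the driver. All names are invented for illustration; this is not the project's allocator, and it is deliberately simplified (exact-size bins, no thread safety).

```cpp
#include <cuda_runtime.h>
#include <map>
#include <vector>

class ManagedCachingAllocator {
public:
  void* allocate(size_t bytes) {
    auto& freeList = cache_[bytes];
    if (!freeList.empty()) {       // fast path: reuse a cached block
      void* p = freeList.back();
      freeList.pop_back();
      return p;
    }
    void* p = nullptr;
    cudaMallocManaged(&p, bytes);  // slow path: ask the driver
    sizes_[p] = bytes;
    return p;
  }

  void deallocate(void* p) {       // return the block to the cache
    cache_[sizes_[p]].push_back(p);
  }

  ~ManagedCachingAllocator() {     // release everything at shutdown
    for (auto& [bytes, list] : cache_)
      for (void* p : list) cudaFree(p);
  }

private:
  std::map<size_t, std::vector<void*>> cache_;  // size -> free blocks
  std::map<void*, size_t> sizes_;               // live pointer -> size
};
```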

With a recent enough kernel (4.14, so RHEL 7 with an updated kernel, or RHEL 8) malloc/free should be enough.

You're referring to HMM, right? That essentially makes standard malloc() return a pointer to unified memory, right?

I wonder if malloc()+free() would then become implicitly synchronizing as well. On the other hand, we'd then have jemalloc doing the caching between us and the OS (or so I presume), so maybe in practice the synchronization wouldn't matter more than with a custom caching allocator.


fwyzard commented Feb 19, 2019

From what I understand (see e.g. https://lwn.net/Articles/731259/ ) it is kind of the opposite: any memory area can be mapped from the host to the device; when the CPU later tries to access it, it triggers a page fault and the memory is copied back to the host.

So my guess is that all memory returned by malloc, mmap, jemalloc, etc. can work with Heterogeneous Memory Management, and can be passed on to the GPU.
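If that guess holds, the experiment would look roughly like this sketch (untested speculation matching the discussion, assuming a Linux kernel >= 4.14, a Pascal-or-newer GPU, and an HMM-enabled driver): hand a plain malloc() pointer directly to a kernel and let page faults migrate the data.

```cpp
#include <cuda_runtime.h>
#include <cstdlib>

__global__ void increment(int* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1;
}

int main() {
  const int n = 1024;
  // Ordinary host allocation: no cudaMallocManaged(), no cudaHostRegister().
  int* data = static_cast<int*>(malloc(n * sizeof(int)));
  for (int i = 0; i < n; ++i) data[i] = i;

  // With HMM the device can fault these pages in; without it this launch
  // would fail, which is exactly what the experiment would check.
  increment<<<(n + 255) / 256, 256>>>(data, n);
  cudaDeviceSynchronize();

  free(data);
}
```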

The next step would be to try it in practice... but I haven't been able to set up vinavx2 or another machine with a recent enough kernel, and my laptop has a Maxwell card, while this requires Pascal or newer.

And, as it will likely require CentOS 8 for use in production, it may be something we have to delay for a while.

@makortel

I'd still expect (in the absence of better information) that HMM internally talks to the NVIDIA driver, and that for the "HMM memory" the driver and the device have to do something similar to what is done for cudaMallocManaged()+cudaFree(). Therefore, since cudaMallocManaged()+cudaFree() create an implicit synchronization point, I'd assume "HMM memory" would have similar constraints. (But I'm happy to be proven wrong.)

@makortel

I'm planning to do a full-scale study with the pixeltrack-standalone, tracked in cms-patatrack/pixeltrack-standalone#43.
