
[cudadev][RFC] Prototype (host|device)_unique_ptr API to use lightweight "Context" object instead of CUDA stream #256

Conversation

makortel
Collaborator

This PR builds on top of #224, but because the actual developments conflict between the base commit of #224 and master, the #224 part is rebased as well. The actual developments of this PR are in the last three commits.

The change can be summarized as make_device_unique<T>(stream) changing to make_device_unique<T>(ctx), where ctx can be e.g. the AcquireContext/ProduceContext, or a "lightweight" HostAllocatorContext/DeviceAllocatorContext/Context (the AcquireContext/ProduceContext are convertible to the latter Context objects). (I'm really overusing the term "Context" here, but haven't figured out better wording yet.)

The idea is that

  • HostAllocatorContext provides access to the pinned host memory allocator (and only that)
  • DeviceAllocatorContext provides access to the device memory allocator (and only that)
  • Context provides access to both the pinned host and device memory allocators (via conversions to the two former types), and also to whatever is needed to launch asynchronous kernels or memory transfers (in practice the CUDA stream)

This change would allow e.g.

  • Moving the CachingDeviceAllocator and CachingHostAllocator objects from global variables to be owned (again) by CUDAService (in CMSSW only), which would further enable (again) the caching allocator parameters to be configured at run time
  • Possibly evolving the interplay between the caching allocators and the AcquireContext/ProduceContext for better performance (see discussion in [cudadev] Improve caching allocator performance #218)
  • A pinned host allocation API without explicit use of a CUDA stream (possibly evolving later to an interface not requiring the stream, see the point above). Such an API would be more useful than the current CUDA-stream-based API in ESProducers, which initiate transfers from one (pinned) host memory block to all available devices.

Comment on lines +8 to +30
class HostAllocatorContext {
public:
  explicit HostAllocatorContext(cudaStream_t stream) : stream_(stream) {}

  void *allocate_host(size_t nbytes) const { return cms::cuda::allocate_host(nbytes, stream_); }

  void free_host(void *ptr) const { cms::cuda::free_host(ptr); }

private:
  cudaStream_t stream_;
};

class DeviceAllocatorContext {
public:
  explicit DeviceAllocatorContext(cudaStream_t stream) : stream_(stream) {}

  void *allocate_device(size_t nbytes) const { return cms::cuda::allocate_device(nbytes, stream_); }

  void free_device(void *ptr) const { cms::cuda::free_device(ptr, stream_); }

private:
  cudaStream_t stream_;
};
Collaborator Author

Right now (and possibly forever in cudadev) the HostAllocatorContext and DeviceAllocatorContext look nearly identical, but in the future (in CMSSW) they could hold a pointer to the CachingHostAllocator/CachingDeviceAllocator objects.

…uda::Context objects instead of cudaStream_t
@makortel
Collaborator Author

Made effectively obsolete by cms-sw/cmssw#39428 (although this particular development is not part of the CMSSW PR).

makortel closed this Sep 16, 2022
Labels: cuda, enhancement (New feature or request)