
[cudadev][RFC] Prototype (host|device)_unique_ptr API to use lightweight "Context" object instead of CUDA stream #256

Conversation

makortel
Collaborator

This PR builds on top of #224, but because the actual developments conflict between the base commit of #224 and master, the #224 part is rebased as well. The actual developments of this PR are in the last three commits.

The change can be summarized as make_device_unique<T>(stream) changing to make_device_unique<T>(ctx), where ctx can be e.g. the AcquireContext/ProduceContext, or a "lightweight" HostAllocatorContext/DeviceAllocatorContext/Context (the AcquireContext/ProduceContext are convertible to the latter Context objects). (I'm really overusing the term "Context" here, but haven't figured out better wording yet.)

The idea is that

  • HostAllocatorContext provides access to the pinned host memory allocator (and only that)
  • DeviceAllocatorContext provides access to the device memory allocator (and only that)
  • Context provides access to both the pinned host and device memory allocators (via conversions to the two former types), and also to whatever is needed to launch asynchronous kernels or memory transfers (in practice the CUDA stream)

This change would allow e.g.

  • Moving the CachingDeviceAllocator and CachingHostAllocator objects from global variables to be owned (again) by CUDAService (in CMSSW only), which would further enable (again) the caching allocator parameters to be configured at run time
  • Possibly evolving the interplay between the caching allocators and the AcquireContext/ProduceContext for better performance (see discussion in [cudadev] Improve caching allocator performance #218)
  • A pinned host allocation API without explicit use of a CUDA stream (possibly evolving later to an interface not requiring the stream, see the point above). Such an API would be more useful than the current CUDA-stream-based API in ESProducers, which initiate transfers from one (pinned) host memory block to all available devices.

Comment on lines +8 to +30
class HostAllocatorContext {
public:
  explicit HostAllocatorContext(cudaStream_t stream) : stream_(stream) {}

  void *allocate_host(size_t nbytes) const { return cms::cuda::allocate_host(nbytes, stream_); }

  void free_host(void *ptr) const { cms::cuda::free_host(ptr); }

private:
  cudaStream_t stream_;
};

class DeviceAllocatorContext {
public:
  explicit DeviceAllocatorContext(cudaStream_t stream) : stream_(stream) {}

  void *allocate_device(size_t nbytes) const { return cms::cuda::allocate_device(nbytes, stream_); }

  void free_device(void *ptr) const { cms::cuda::free_device(ptr, stream_); }

private:
  cudaStream_t stream_;
};
Collaborator Author

Right now (and possibly forever in cudadev) the HostAllocatorContext and DeviceAllocatorContext look nearly identical, but in the future (in CMSSW) they could hold a pointer to the CachingHostAllocator/CachingDeviceAllocator objects.

…uda::Context objects instead of cudaStream_t
@makortel
Collaborator Author

Made effectively obsolete by cms-sw/cmssw#39428 (although this particular development is not part of the CMSSW PR).

makortel closed this Sep 16, 2022
Labels: cuda, enhancement (New feature or request)