Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add value initialisation to make_host_unique (and make_device_unique ?) #587

Open
fwyzard opened this issue Nov 30, 2020 · 9 comments
Open

Comments

@fwyzard
Copy link

fwyzard commented Nov 30, 2020

cms::cuda::make_host_unique allocates pinned host memory but leaves it uninitialised.

In some cases it may be useful to initialise the memory to a specific value (or N copies of a value for the array version).
It should be simple to add overload that takes a value by copy and sets the newly allocated memory.

I'm not sure if it makes sense to do it also for make_device_uniqe ?
For a single value it could easily be done via cudaMemsetAsync or cudaMemcpyAsync.
For an array I don't know if there is a CUDA runtime function we can leverage.

@fwyzard
Copy link
Author

fwyzard commented Nov 30, 2020

@makortel what do you think ?

@fwyzard
Copy link
Author

fwyzard commented Nov 30, 2020

@jsalfeld this is something you brought up on Mattermost

@makortel
Copy link

That was intentional because we wanted to allocate device memory as uninitialized, and we(/I?) wanted to enforce it compile time to minimize surprises, which essentially implied similar restrictions on the pinned-host allocations as well.

cms-sw#31721 made also me think we probably could improve the interface.

For pinned host allocations we could just do the value initialization in make_host_unique to be consistent with make_unique. We could also add make_host_unique_for_overwrite (to mimic make_unique_for_overwrite that's coming in C++20) to do default initialization instead, and probably remove the current make_host_unique_uninitialized (is there any real need to avoid calling a default constructor that does non-default initialization of class members?)

The implications for device memory would then be (for consistency)

  • make_device_unique
    • I was thinking of dropping it, but
    • we could have an overload that can use cudaMemsetAsync() or cudaMemcpyAsync()
    • in principle we could allow "arbitrary" construction by running a kernel, but maybe we
  • make_device_unique_for_overwrite: could be added
  • make_device_unique_uninitialized: should probably be kept to allow allocating memory for classes with non-trivial constructors

I think the current requirement of std::is_trivially_constructible should be then changed to std::is_trivially_destructible (in principle we should require the latter already now).

What about std::is_trivially_copy_constructible? We effectively assume that for all types that are copied with cudaMemcpyAsync(). I'm not sure if doing the check in cms::cuda::copyAsync() would be useful because it is so easy to use cudaMemcpyAsync() directly instead. In a sense the only purpose for pinned host memory allocations is to copy data to or from the device, so requiring those to be trivially copy constructible should not restrict too much (in principle). Device memory allocations could still be allowed to have non-trivial copy constructors.

@makortel
Copy link

For an array I don't know if there is a CUDA runtime function we can leverage.

Why cudaMemcpyAsync() would not work?

On the other hand, std::make_unique provides only value initialization for array elements and std::make_unique_for_overwrite default initialization, so I don't think we'd need to support very generic initialization for arrays anyway.

@fwyzard
Copy link
Author

fwyzard commented Nov 30, 2020

I need more time to digest the rest, but I can comment on this:

For an array I don't know if there is a CUDA runtime function we can leverage.

Why cudaMemcpyAsync() would not work?

If we initialise an array of N elements to a single value, it would be more efficient to copy the value to the GPU only once, and use it to set all elements.

To use cudaMemcpyAsync() we would need to either fill an equally large buffer on the host and copy it (which is waste of memory and bandwidth) or call cudaMemcpyAsync() N times (which is a waste of runtime calls and bandwidth).
Unfortunately I can't find any adequate cudaMemcpy variant - but we could implement it by single cudaMemcpyAsync followed by and ad hoc kernel that makes N copies of a single value.

@fwyzard fwyzard changed the title add value nitialisation to make_host_unique (and make_device_unique ?) add value initialisation to make_host_unique (and make_device_unique ?) Dec 1, 2020
@fwyzard
Copy link
Author

fwyzard commented Dec 1, 2020

make_unique_for_overwrite is one of those C++ things that make my head hurt :-(

So

  • make_unique<T>() is equivalent to unique_ptr<T>(new T())
  • make_unique_for_overwrite<T>() is equivalent to unique_ptr<T>(new T)

If I managed to understand correctly the difference between default initialisation and value initialisation/zero initialisation:

  • if T has a user defined, non-defaulted default constructor, it is called (can we say "non trivial" default constructor ?), otherwise
  • make_unique<T>() will initialise T to zero
  • make_unique_for_overwrite<T>() will leave the memory uninitialised

On our side

  • make_host_unique<T>() is equivalent to unique_ptr<T>((T*) malloc(sizeof(T))) but since it checks that T is trivially constructible, it should be equivalent to make_unique_for_overwrite<T>() ?

  • make_host_unique_uninitialized<T>() is equivalent to unique_ptr<T>((T*) malloc(sizeof(T))) without the check that T is trivially constructible; I agree that it's confusing, and since we are not using it we could actually drop it ?

@makortel
Copy link

makortel commented Dec 1, 2020

make_unique_for_overwrite is one of those C++ things that make my head hurt :-(

I don't disagree, the whole initialization business is rather convoluted.

On our side

  • make_host_unique<T>() is equivalent to unique_ptr<T>((T*) malloc(sizeof(T))) but since it checks that T is trivially constructible, it should be equivalent to make_unique_for_overwrite<T>() ?

I agree.

  • make_host_unique_uninitialized<T>() is equivalent to unique_ptr<T>((T*) malloc(sizeof(T))) without the check that T is trivially constructible; I agree that it's confusing, and since we are not using it we could actually drop it ?

I agree. The history of the _uninitialized was to allow allocating objects of Eigen classes that have non-defaulted default constructors to make it clear for the caller that the memory is uninitialized (arguably not the best choice of interface). I don't remember if the _uninitialized was added mainly for device or host (the other one being for completeness). Anyway, neither appear to be used anymore (probably the use case was covered by allocating uint8_t or something and explicitly casting part of that to the desired type).

For pinned host allocations following make_unique and make_unique_for_overwrite is straightforward. But what to do for the device allocations? On one hand I think consistent bad interface is better than inconsistent bad interface, i.e. if we mimic std the behavior should be similar, which would mean make_device_unique to do value initialization, which I believe we don't want to do in most cases. On the other hand writing make_device_for_overwrite (or make_device_uninitialized) all the time would be annoying. But at least the behavior would be clear.

@fwyzard
Copy link
Author

fwyzard commented Dec 2, 2020

Yes, I agree that writing make_device_unique_for_overwrite or make_device_unique_uninitialised all the time would be annoying.

I don't think I have good answers to the rest :-(

@makortel
Copy link

makortel commented Dec 2, 2020

We could also think of ditching the attempt to mimic std::unique_ptr, for device memory its API is anyway wider than what is really usable (like operator*(), operator->(), operator[] can't be used). Then it would be somewhat easier to just say "make_device() does not initialize`.

Or maybe we could rename the creation function to something along allocate_device() (possibly still returning std::unique_ptr with a custom deleter) to make it clear that the function does not initialize the memory. (in this case I'd move the current cms::cuda::allocate_device() function to e.g. cms::cuda::allocator namespace to hide the "void *" interface more).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants