Remove the use of CUDA API wrappers #386

Closed · 20 tasks done
makortel opened this issue Sep 11, 2019 · 21 comments

makortel commented Sep 11, 2019

This issue is to track progress on removing the use of the CUDA API wrappers. The library turned out not to be that useful (see some discussion in #279 (comment)).

We are currently using the following components (extracted with git grep, so the list may be incomplete):

@makortel

@fwyzard About cuda::throw_if_error(): given that we are mostly using cudaCheck(), how about changing cudaCheck() so that a compile-time flag selects whether it calls abort() or throws an exception? (That way CMSSW master would throw an exception, while Patatrack and private forks could still use abort().)
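
A minimal sketch of what such a compile-time switch could look like (the CUDA_CHECK_ABORT macro and the helper names here are hypothetical, not the actual CMSSW implementation):

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>
    #include <stdexcept>
    #include <string>

    // Hypothetical sketch: defining CUDA_CHECK_ABORT keeps the old abort()
    // behaviour at compile time; without it, cudaCheck() throws instead.
    inline void cudaCheckImpl(cudaError_t result, const char* file, int line) {
      if (result == cudaSuccess)
        return;
      std::string msg = std::string(file) + ":" + std::to_string(line) +
                        " CUDA error: " + cudaGetErrorString(result);
    #ifdef CUDA_CHECK_ABORT
      std::fprintf(stderr, "%s\n", msg.c_str());
      std::abort();
    #else
      throw std::runtime_error(msg);
    #endif
    }

    #define cudaCheck(call) cudaCheckImpl((call), __FILE__, __LINE__)

Call sites would stay unchanged either way, e.g. cudaCheck(cudaMalloc(&ptr, size));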

@makortel

I'll start with removing the streams and events.

fwyzard added the task label Sep 14, 2019

fwyzard commented Sep 14, 2019

OK.
I'll have a look at providing an equivalent of cuda::launch().


fwyzard commented Sep 14, 2019

cuda::launch() does not support function objects or lambdas - just plain functions, right?

@makortel

> cuda::launch() does not support function objects or lambdas - just plain functions, right?

I believe it is just a wrapper for the kernel launch <<<...>>>, so the passed function has to be a __global__ function (or whatever a CUDA kernel launch supports).

@makortel

The first part of the streams and events work is done in #389.


fwyzard commented Sep 17, 2019

@makortel I have prepared an alternative for cuda::launch().

Do you prefer to keep the same syntax, e.g.

launch(kernel, {gridDim, blockDim, sharedMem = 0, stream = nullptr}, args...);

or something more like what Cupla uses, e.g.

launch(kernel)(gridDim, blockDim, sharedMem = 0, stream = nullptr)(args...);

?

@makortel

@fwyzard I don't have a clear preference. (I would probably have gone with the bare kernel launch syntax, i.e. <<<...>>>, but I'm really fine with almost any syntax, since its use is not that widespread.)


fwyzard commented Sep 18, 2019

OK, I'll stick to the current syntax then, since it is simpler to implement.

The advantage over the kernel<<<...>>>(...) syntax is that launch(kernel, {...}, ...) can also be used with the host compiler.
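
A minimal sketch of how such a wrapper can stay usable from host-compiled code (illustrative only, not the actual implementation, and using the cudaCheck() helper sketched earlier): it can be built on cudaLaunchKernel(), which is a plain runtime API call and therefore compiles with the host compiler, unlike the <<<...>>> syntax that requires nvcc.

    #include <cuda_runtime.h>
    #include <cstddef>

    struct LaunchParameters {
      dim3 gridDim;
      dim3 blockDim;
      size_t sharedMem = 0;
      cudaStream_t stream = nullptr;
    };

    // The kernel must be a plain __global__ function (no function objects
    // or lambdas), matching the discussion above.
    template <typename... Args>
    void launch(void (*kernel)(Args...), LaunchParameters params, Args... args) {
      // The trailing nullptr keeps the array non-empty for zero-argument kernels.
      void* argPtrs[] = {static_cast<void*>(&args)..., nullptr};
      cudaCheck(cudaLaunchKernel(reinterpret_cast<const void*>(kernel),
                                 params.gridDim, params.blockDim, argPtrs,
                                 params.sharedMem, params.stream));
    }

A call would then look like launch(myKernel, {gridDim, blockDim}, devPtr, n); for a __global__ void myKernel(float*, int).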

fwyzard assigned makortel, fwyzard and waredjeb and unassigned makortel and fwyzard Oct 24, 2019

fwyzard commented Oct 24, 2019

I have asked @waredjeb to look into the cuda::memory operations, replacing the following (a rough before/after sketch follows the list):

  • cuda::memory::device::make_unique() with cudautils::make_device_unique()
  • cuda::memory[::async]::copy() with cudaMemcpy[Async]()
  • cuda::memory[::async]::zero() and ::set() with cudaMemset[Async]()
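
A rough before/after illustration of those replacements, using the cudaCheck() helper discussed in this thread (the buffer names are hypothetical and the wrapper calls in the comments are paraphrased; note that cudaMemcpyAsync needs an explicit direction, e.g. cudaMemcpyHostToDevice or cudaMemcpyDefault, which the wrappers deduced):

    #include <cuda_runtime.h>
    #include <cstddef>

    void copyAndZero(float* d_buf, const float* h_buf, size_t bytes, cudaStream_t stream) {
      // was (roughly): cuda::memory::async::copy(d_buf, h_buf, bytes, stream);
      cudaCheck(cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream));

      // was (roughly): cuda::memory::device::async::zero(d_buf, bytes, stream);
      cudaCheck(cudaMemsetAsync(d_buf, 0, bytes, stream));
    }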

@makortel

> cuda::memory::device::make_unique() with cudautils::make_device_unique()

cudautils::make_device_unique() is not a direct replacement, as it (currently) requires a CUDA stream (although I'm thinking of adding a variant that caches the allocation but is not "tied" to a CUDA stream).

IIRC all calls to cuda::memory::device::make_unique() are in unit tests. I was thinking that maybe the API wrappers could be acceptable there. On the other hand, using the CUDAStreamCache to get the stream for the allocator is not too hard either.
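For illustration, a hypothetical sketch of such a stream-taking make_device_unique() (not the actual cudautils implementation; it reuses the cudaCheck() helper from above):

    #include <cuda_runtime.h>
    #include <cstddef>
    #include <memory>

    template <typename T>
    using device_unique_ptr = std::unique_ptr<T, void (*)(T*)>;

    // Hypothetical: allocate n elements of device memory, associated with a
    // CUDA stream. A real caching allocator would record the stream and
    // reuse the block once the stream's queued work has completed.
    template <typename T>
    device_unique_ptr<T> make_device_unique(size_t n, cudaStream_t stream) {
      T* ptr = nullptr;
      cudaCheck(cudaMalloc(&ptr, n * sizeof(T)));
      (void)stream;  // placeholder; see the comment above
      return device_unique_ptr<T>(ptr, [](T* p) { cudaFree(p); });
    }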


fwyzard commented Oct 24, 2019

Would it work to explicitly pass stream 0 (actually nullptr) to use the default stream?

@makortel

> Would it work to explicitly pass stream 0 (actually nullptr) to use the default stream?

Probably, since all *Async API calls accept that (right?).


fwyzard commented Oct 24, 2019

I think they do.
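
For what it's worth, a tiny example of that pattern (hypothetical buffer name; passing nullptr as the stream argument targets the default stream):

    #include <cuda_runtime.h>
    #include <cstddef>

    void zeroOnDefaultStream(float* d_buf, size_t bytes) {
      // nullptr (stream 0) makes the *Async call run on the default stream
      cudaCheck(cudaMemsetAsync(d_buf, 0, bytes, nullptr));
    }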


fwyzard commented Oct 28, 2019

> @fwyzard About cuda::throw_if_error(): given that we are mostly using cudaCheck(), how about changing cudaCheck() so that a compile-time flag selects whether it calls abort() or throws an exception? (That way CMSSW master would throw an exception, while Patatrack and private forks could still use abort().)

Sorry @makortel, it looks like I never answered you: given that an uncaught exception results in an abort() anyway, can we just make the replacement for good?

@makortel

@fwyzard

> given that an uncaught exception results in an abort() anyway, can we just make the replacement for good?

I didn't follow whether you meant "cudaCheck() should throw an exception" or "cudaCheck() should abort".

(also, exceptions from places where we call CUDA APIs should get caught by the framework)


fwyzard commented Oct 28, 2019

I meant, cudaCheck() should just throw exceptions.

> exceptions from places where we call CUDA APIs should get caught by the framework

Right... but they would still result in a stack trace and in the job ending, wouldn't they?

Edit: see #398.


makortel commented Nov 7, 2019

#404 completes the work for streams and events.


makortel commented Nov 7, 2019

I can take care of cuda::device::current::scoped_override_t<> after #404 gets merged.
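
A plausible shape for such a replacement, sketched here as a simple RAII guard (names and details are illustrative, not necessarily the eventual implementation; cudaCheck() is the helper from earlier in the thread):

    #include <cuda_runtime.h>

    // Set the current CUDA device for the lifetime of the object,
    // restoring the previously current device on scope exit.
    class ScopedSetDevice {
    public:
      explicit ScopedSetDevice(int device) {
        cudaCheck(cudaGetDevice(&previous_));
        cudaCheck(cudaSetDevice(device));
      }
      ~ScopedSetDevice() {
        // Best effort in the destructor: ignore the error code on restore.
        cudaSetDevice(previous_);
      }
    private:
      int previous_;
    };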


waredjeb commented Nov 7, 2019

I can work on the remaining cuda::device calls!


fwyzard commented Nov 27, 2019

The last references to the CUDA API Wrappers were removed via #417.

fwyzard added the fixed label Nov 27, 2019
fwyzard closed this as completed Nov 27, 2019