
Physical CUDA stream management #279

Open

makortel opened this issue Mar 11, 2019 · 10 comments

@makortel

Currently (and in #100) we use the CUDA stream class cuda::stream_t<> from the CUDA API wrappers. This issue is to discuss whether we should continue to do so, switch to the raw cudaStream_t, or do something else.

Pros of cuda::stream_t<>

  • cuda::stream_t<>::enqueue::callback() gives a very easy way to use lambdas as callbacks (see the sketch after this list)

Cons of cuda::stream_t<>

  • It does not assume that the stream's device is the current one
    • We tend to always set the current device properly beforehand
    • Leads to unnecessary calls to cudaGetDevice()/cudaSetDevice()
  • At one point in "Next prototype of the framework integration" (#100) the lack of a default constructor caused some headaches (it could not be used with std::optional), but with the later caching of streams in CUDAService this is no longer an issue (a std::shared_ptr is needed anyway)
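Regarding the callback convenience listed under the pros, here for contrast is a minimal sketch (an illustration, not code from #100) of what the raw runtime API requires for the same thing: cudaStreamAddCallback() takes a plain function pointer plus a void * payload, so a capturing lambda cannot be passed directly.

#include <cstdio>
#include <cuda_runtime.h>

// a raw-API callback must be a free function with exactly this signature;
// any state has to travel through the void * userData argument
void CUDART_CB printDone(cudaStream_t stream, cudaError_t status, void * userData) {
  std::printf("stream work finished: %s\n", static_cast<char const *>(userData));
}

void enqueuePrint(cudaStream_t stream) {
  static char const msg[] = "done";
  cudaStreamAddCallback(stream, printDone, const_cast<char *>(msg), 0);
}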
fwyzard commented Mar 11, 2019

With #100, we cache the CUDA streams with an edm::ReusableObjectHolder<cuda::stream_t<>>, so we work with a shared_ptr<cuda::stream_t<>>.
cuda::stream_t<> contains (by value) a cudaStream_t, which is in turn a pointer (to struct CUstream_st, an opaque type).

We could get rid of one level of pointer indirection by working with an edm::ReusableObjectHolder<struct CUstream_st>, as long as we give the unique_ptr and shared_ptr involved an explicit deleter that calls cudaStreamDestroy().

Something along the lines of

#include <cstring>
#include <iostream>
#include <memory>

// an opaque type, analogous to struct CUstream_st
struct opaque;
typedef opaque * opaque_t;

// custom deleter: instead of freeing the object, just print its contents
void deleter(opaque_t msg) {
  std::cout << (char const *) msg << std::endl;
}

int main(void) {
  char * data = new char[64];
  strncpy(data, "Hello world", 64);

  opaque_t ptr = reinterpret_cast<opaque *>(data);
  {
    // both smart pointers call deleter() when they go out of scope,
    // so "Hello world" is printed twice at the end of this block
    auto shared = std::shared_ptr<opaque>(ptr, &deleter);
    auto unique = std::unique_ptr<opaque, decltype(&deleter)>(ptr, &deleter);
  }
  delete [] data;  // the deleter does not free the buffer, so do it here

  return 0;
}
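Applied to CUDA streams, the same pattern would look roughly like the following sketch (illustrative only, not code from the PR): a cudaStream_t is itself a CUstream_st *, so it can be stored directly in a shared_ptr whose deleter calls cudaStreamDestroy().

#include <cuda_runtime.h>
#include <memory>

std::shared_ptr<CUstream_st> makeStream() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  // the cudaError_t returned by cudaStreamDestroy() is simply ignored
  // when the shared_ptr invokes it as the deleter
  return std::shared_ptr<CUstream_st>(stream, cudaStreamDestroy);
}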

@makortel

Thanks @fwyzard, so it was just a matter of defining custom deleters. That actually makes this approach not work with edm::ReusableObjectHolder<T> out of the box: it uses tbb::concurrent_queue<T *> internally and effectively assumes that a T * is destroyed with delete.

On the other hand, it looks like extending edm::ReusableObjectHolder<T> to support custom deleters (in the same manner as std::unique_ptr) could be relatively straightforward; see the sketch below.
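A minimal sketch of such an extension (the actual edm::ReusableObjectHolder interface is richer; names and details here are illustrative): carry the deleter as an extra template parameter, std::unique_ptr-style, and invoke it instead of delete when draining the internal queue.

#include <memory>
#include <tbb/concurrent_queue.h>

template <typename T, typename Deleter = std::default_delete<T>>
class ReusableObjectHolder {
public:
  explicit ReusableObjectHolder(Deleter deleter = Deleter())
      : deleter_(std::move(deleter)) {}

  ~ReusableObjectHolder() {
    // drain the cache with the custom deleter instead of a plain `delete`
    T * object;
    while (cache_.try_pop(object)) {
      deleter_(object);
    }
  }

private:
  tbb::concurrent_queue<T *> cache_;
  Deleter deleter_;
};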

@makortel

Can we trust that cudaStream_t will stay as a pointer-to-something?

(OK, NVIDIA will likely continue to prefer hiding the internals, so "yes", but I want to write the question out loud anyway.)

fwyzard commented Mar 11, 2019

Some CUDA API functions (e.g. cudaStreamCreateWithFlags) mention NULL explicitly as a possible value for a cudaStream_t, so there is some acknowledgement that it is a pointer type.

On the other hand, the special values 1 and 2 are also used:

#define cudaStreamLegacy                    ((cudaStream_t)0x1)
#define cudaStreamPerThread                 ((cudaStream_t)0x2)

@makortel

We should probably do the same trick with the cached CUDA events.
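The analogous sketch for events (again illustrative, not code from the PR) would rely on cudaEvent_t being a CUevent_st *:

#include <cuda_runtime.h>
#include <memory>

std::shared_ptr<CUevent_st> makeEvent() {
  cudaEvent_t event;
  cudaEventCreate(&event);
  // cudaEventDestroy() runs when the last reference goes away
  return std::shared_ptr<CUevent_st>(event, cudaEventDestroy);
}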

fwyzard commented Mar 14, 2019

Adding support for a custom deleter to ReusableObjectHolder is not that complicated: cms-sw@89fab9b .

On the other hand, changing CUDAProduct, CUDAScopedContext and CUDAContextToken to work directly with cudaStream_t instead of cuda::stream_t<> did not seem to make any difference performance-wise:

reference + custom deleter

Running 10 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1889.3 ±   1.7 ev/s (4000 events)
  1896.5 ±   1.9 ev/s (4000 events)
  1886.1 ±   2.7 ev/s (4000 events)
  1887.6 ±   2.0 ev/s (4000 events)
  1890.8 ±   2.1 ev/s (4000 events)
  1873.3 ±   2.0 ev/s (4000 events)
  1887.5 ±   1.6 ev/s (4000 events)
  1883.3 ±   2.1 ev/s (4000 events)
  1875.4 ±   1.8 ev/s (4000 events)
  1887.4 ±   1.9 ev/s (4000 events)
 --------------------
  1885.7 ±   6.9 ev/s

use cudaStream_t

Running 10 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1882.6 ±   2.2 ev/s (4000 events)
  1894.7 ±   1.9 ev/s (4000 events)
  1879.5 ±   1.3 ev/s (4000 events)
  1893.3 ±   1.5 ev/s (4000 events)
  1879.6 ±   1.6 ev/s (4000 events)
  1885.2 ±   1.2 ev/s (4000 events)
  1883.5 ±   1.6 ev/s (4000 events)
  1869.7 ±   1.4 ev/s (4000 events)
  1884.0 ±   1.7 ev/s (4000 events)
  1882.5 ±   1.2 ev/s (4000 events)
 --------------------
  1883.5 ±   7.1 ev/s

However, I noticed that cuda::stream_t already keeps track of the associated device; so if we keep using it, could we drop the device from CUDAScopedContext, CUDAContextToken.h, etc.?

@makortel

Thanks Andrea for the test.

Do we have any other compelling arguments for going to the "raw" cudaStream_t? If not, I'd stay with what we have (least amount of work), but OTOH I don't have strong feelings in either direction.

However, I noticed that cuda::stream_t already keeps track of the associated device; so if we keep using it, could we drop the device from CUDAScopedContext, CUDAContextToken.h, etc.?

Yes, we could drop the explicit (redundant) device member.
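For illustration, dropping the member could look like the sketch below (the class layout and the device accessor name are assumptions, not the actual code):

// hypothetical sketch: obtain the device from the wrapped stream rather
// than storing it as a separate, redundant member
class CUDAScopedContext {
public:
  explicit CUDAScopedContext(std::shared_ptr<cuda::stream_t<>> stream)
      : stream_(std::move(stream)) {}

  // cuda::stream_t<> knows its own device, so no separate member is needed
  int device() const { return stream_->device_id(); }

private:
  std::shared_ptr<cuda::stream_t<>> stream_;
};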

fwyzard commented Mar 14, 2019

Do we have any other compelling arguments for going to the "raw" cudaStream_t?

No, for me the performance was the only argument, and since the benchmark did not show any impact, we can keep using cuda::stream_t<>.

If not, I'd stay with what we have (least amount of work).

Agreed.
I am not fond of the API wrappers, mostly because every time I use them I need to read their implementation to figure out what they do in terms of the CUDA API.
If you find it more convenient to use them than not, we should keep them.

And possibly update the external - but the author changed the extension of the files from .h to .hpp, which is a pain :-(

@makortel

I am not fond of the API wrappers, mostly because every time I use them I need to read their implementation to figure out what they do in terms of the CUDA API.
If you find it more convenient to use them than not, we should keep them.

Well, I find myself going to read their implementation for the exact details every now and then. I do find some of the abstractions convenient (like event.has_occurred()), but then e.g. cuda::memory::async::copy() does not take a cuda::stream_t<> but a cudaStream_t (reminds me of parts of std taking char * instead of std::string). There are parts of the wrappers that I find even more convenient (like scoped_override_t for temporarily setting the device, or throw_if_error()).

Actually, possibly my best argument for staying with cuda::stream_t<> is its destructor, which sets the current device correctly before calling cudaStreamDestroy(). If we used cudaStream_t directly, we would have to take care of that ourselves. OK, we actually could do that, since we have one edm::ReusableObjectHolder per device

std::vector<edm::ReusableObjectHolder<cuda::stream_t<>>> cache;

so in the destructor of the cache we could loop over the vector and set the current device correctly before destroying each device's streams. Relying on the destructors is more convenient, though.
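If we did switch to a raw cudaStream_t, a custom deleter could reproduce what the cuda::stream_t<> destructor does for us; a sketch (illustrative only):

#include <cuda_runtime.h>

// device-aware deleter: make the stream's device current before calling
// cudaStreamDestroy(), then restore the previously current device
struct StreamDeleter {
  int device;
  void operator()(CUstream_st * stream) const {
    int previous;
    cudaGetDevice(&previous);
    cudaSetDevice(device);
    cudaStreamDestroy(stream);
    cudaSetDevice(previous);
  }
};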

So I have rather mixed feelings about the wrappers, but not strong enough to clearly say one way or the other.

@makortel

A tiny argument against cuda::stream_t<>: since it does not assume that the stream's device is the current one, it has to call cudaGetDevice()/cudaSetDevice() before doing anything. We already take care of setting the current device elsewhere, so those calls are pure overhead (tiny, but still).
