
Physical CUDA stream management #279

Open

makortel opened this issue Mar 11, 2019 · 10 comments

@makortel

Currently (and in #100) we use the CUDA stream class cuda::stream_t<> from the CUDA API wrappers. This issue is to discuss whether we should continue to do so, switch to the raw cudaStream_t, or do something else.

Pros of cuda::stream_t<>

  • cuda::stream_t<>::enqueue::callback() gives a very easy way to use lambdas as callbacks (see the sketch after this list)

Cons of cuda::stream_t<>

  • It does not assume that the stream's device is the current one
    • We tend to always set the current device properly beforehand
    • Leads to unnecessary calls to cudaGetDevice()/cudaSetDevice()
  • At one point in "Next prototype of the framework integration" (#100) the lack of a default constructor caused some headaches (it could not be used with std::optional), but with the later caching of streams in CUDAService this is no longer an issue (a std::shared_ptr is needed anyway)
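Regarding the callback convenience listed under the pros, here for contrast is a minimal sketch (an illustration, not code from #100) of what the raw runtime API requires for the same thing: cudaStreamAddCallback() takes a plain function pointer plus a void * payload, so a capturing lambda cannot be passed directly.

#include <cstdio>
#include <cuda_runtime.h>

// a raw-API callback must be a free function with exactly this signature;
// any state has to travel through the void * userData argument
void CUDART_CB printDone(cudaStream_t stream, cudaError_t status, void * userData) {
  std::printf("stream work finished: %s\n", static_cast<char const *>(userData));
}

void enqueuePrint(cudaStream_t stream) {
  static char const msg[] = "done";
  cudaStreamAddCallback(stream, printDone, const_cast<char *>(msg), 0);
}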
fwyzard commented Mar 11, 2019

With #100, we cache the CUDA streams with an edm::ReusableObjectHolder<cuda::stream_t<>>, so we work with a shared_ptr<cuda::stream_t<>>.
cuda::stream_t<> contains (by value) a cudaStream_t, which is in turn a pointer (to struct CUstream_st, an opaque type).

We could get rid of one level of pointer indirection by working with an edm::ReusableObjectHolder<struct CUstream_st>, as long as we give the unique_ptr and shared_ptr involved an explicit deleter that calls cudaStreamDestroy().

Something along the lines of

#include <cstring>
#include <iostream>
#include <memory>

// an opaque type, analogous to struct CUstream_st
struct opaque;
typedef opaque * opaque_t;

// custom deleter: instead of freeing the object, just print its contents
void deleter(opaque_t msg) {
  std::cout << (char const *) msg << std::endl;
}

int main(void) {
  char * data = new char[64];
  strncpy(data, "Hello world", 64);

  opaque_t ptr = reinterpret_cast<opaque *>(data);
  {
    // both smart pointers call deleter() when they go out of scope,
    // so "Hello world" is printed twice at the end of this block
    auto shared = std::shared_ptr<opaque>(ptr, &deleter);
    auto unique = std::unique_ptr<opaque, decltype(&deleter)>(ptr, &deleter);
  }
  delete [] data;  // the deleter does not free the buffer, so do it here

  return 0;
}
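Applied to CUDA streams, the same pattern would look roughly like the following sketch (illustrative only, not code from the PR): a cudaStream_t is itself a CUstream_st *, so it can be stored directly in a shared_ptr whose deleter calls cudaStreamDestroy().

#include <cuda_runtime.h>
#include <memory>

std::shared_ptr<CUstream_st> makeStream() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  // the cudaError_t returned by cudaStreamDestroy() is simply ignored
  // when the shared_ptr invokes it as the deleter
  return std::shared_ptr<CUstream_st>(stream, cudaStreamDestroy);
}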

@makortel

Thanks @fwyzard, so it was just a matter of defining custom deleters. That actually makes this approach not work with edm::ReusableObjectHolder<T> out of the box: it uses tbb::concurrent_queue<T *> internally and effectively assumes that a T * is destroyed with delete.

On the other hand, it looks like extending edm::ReusableObjectHolder<T> to support custom deleters (in the same manner as std::unique_ptr) could be relatively straightforward; see the sketch below.
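A minimal sketch of such an extension (the actual edm::ReusableObjectHolder interface is richer; names and details here are illustrative): carry the deleter as an extra template parameter, std::unique_ptr-style, and invoke it instead of delete when draining the internal queue.

#include <memory>
#include <tbb/concurrent_queue.h>

template <typename T, typename Deleter = std::default_delete<T>>
class ReusableObjectHolder {
public:
  explicit ReusableObjectHolder(Deleter deleter = Deleter())
      : deleter_(std::move(deleter)) {}

  ~ReusableObjectHolder() {
    // drain the cache with the custom deleter instead of a plain `delete`
    T * object;
    while (cache_.try_pop(object)) {
      deleter_(object);
    }
  }

private:
  tbb::concurrent_queue<T *> cache_;
  Deleter deleter_;
};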

@makortel

Can we trust that cudaStream_t will stay as a pointer-to-something?

(OK, NVIDIA will likely continue to prefer hiding the internals, so "yes", but I want to write the question out loud anyway.)

fwyzard commented Mar 11, 2019

Some CUDA API functions (e.g. cudaStreamCreateWithFlags) mention NULL explicitly as a possible value for a cudaStream_t, so there is some acknowledgement that it is a pointer type.

On the other hand, the special values 1 and 2 are also used:

#define cudaStreamLegacy                    ((cudaStream_t)0x1)
#define cudaStreamPerThread                 ((cudaStream_t)0x2)

@makortel

We should probably do the same trick with the cached CUDA events.
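The analogous sketch for events (again illustrative, not code from the PR) would rely on cudaEvent_t being a CUevent_st *:

#include <cuda_runtime.h>
#include <memory>

std::shared_ptr<CUevent_st> makeEvent() {
  cudaEvent_t event;
  cudaEventCreate(&event);
  // cudaEventDestroy() runs when the last reference goes away
  return std::shared_ptr<CUevent_st>(event, cudaEventDestroy);
}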

fwyzard commented Mar 14, 2019

Adding support for a custom deleter to ReusableObjectHolder is not that complicated: cms-sw@89fab9b .

On the other hand, changing CUDAProduct, CUDAScopedContext and CUDAContextToken to work directly with cudaStream_t instead of cuda::stream_t<> did not seem to make any difference performance-wise:

reference + custom deleter

Running 10 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1889.3 ±   1.7 ev/s (4000 events)
  1896.5 ±   1.9 ev/s (4000 events)
  1886.1 ±   2.7 ev/s (4000 events)
  1887.6 ±   2.0 ev/s (4000 events)
  1890.8 ±   2.1 ev/s (4000 events)
  1873.3 ±   2.0 ev/s (4000 events)
  1887.5 ±   1.6 ev/s (4000 events)
  1883.3 ±   2.1 ev/s (4000 events)
  1875.4 ±   1.8 ev/s (4000 events)
  1887.4 ±   1.9 ev/s (4000 events)
 --------------------
  1885.7 ±   6.9 ev/s

use cudaStream_t

Running 10 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1882.6 ±   2.2 ev/s (4000 events)
  1894.7 ±   1.9 ev/s (4000 events)
  1879.5 ±   1.3 ev/s (4000 events)
  1893.3 ±   1.5 ev/s (4000 events)
  1879.6 ±   1.6 ev/s (4000 events)
  1885.2 ±   1.2 ev/s (4000 events)
  1883.5 ±   1.6 ev/s (4000 events)
  1869.7 ±   1.4 ev/s (4000 events)
  1884.0 ±   1.7 ev/s (4000 events)
  1882.5 ±   1.2 ev/s (4000 events)
 --------------------
  1883.5 ±   7.1 ev/s

However, I noticed that cuda::stream_t already keeps track of the associated device; so if we keep using it, could we drop the device from CUDAScopedContext, CUDAContextToken.h, etc.?

@makortel

Thanks Andrea for the test.

Do we have any other compelling arguments for going to the "raw" cudaStream_t? If not, I'd stay with what we have (least amount of work), but OTOH I don't have strong feelings in either direction.

However, I noticed that cuda::stream_t already keeps track of the associated device; so if we keep using it, could we drop the device from CUDAScopedContext, CUDAContextToken.h, etc.?

Yes, we could drop the explicit (redundant) device member.
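For illustration, dropping the member could look like the sketch below (the class layout and the device accessor name are assumptions, not the actual code):

// hypothetical sketch: obtain the device from the wrapped stream rather
// than storing it as a separate, redundant member
class CUDAScopedContext {
public:
  explicit CUDAScopedContext(std::shared_ptr<cuda::stream_t<>> stream)
      : stream_(std::move(stream)) {}

  // cuda::stream_t<> knows its own device, so no separate member is needed
  int device() const { return stream_->device_id(); }

private:
  std::shared_ptr<cuda::stream_t<>> stream_;
};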

fwyzard commented Mar 14, 2019

Do we have any other compelling arguments for going to the "raw" cudaStream_t?

No, for me the performance was the only argument, and since the benchmark did not show any impact, we can keep using cuda::stream_t<>.

If not, I'd stay with what we have (least amount of work).

Agreed.
I am not fond of the API wrappers, mostly because every time I use them I need to read their implementation to figure out what they do in terms of the CUDA API.
If you find it more convenient to use them than not, we should keep them.

And possibly update the external - but the author changed the extension of the files from .h to .hpp, which is a pain :-(

@makortel

I am not fond of the API wrappers, mostly because every time I use them I need to read their implementation to figure out what they do in terms of the CUDA API.
If you find it more convenient to use them than not, we should keep them.

Well, I find myself going to read their implementation for the exact details every now and then. I do find some of the abstractions convenient (like event.has_occurred()), but then e.g. cuda::memory::async::copy() does not take a cuda::stream_t<> but a cudaStream_t (reminds me of parts of std taking char * instead of std::string). There are parts of the wrappers that I find even more convenient (like scoped_override_t for temporarily setting the device, or throw_if_error()).

Actually, possibly my best argument for staying with cuda::stream_t<> is its destructor, which sets the current device correctly before calling cudaStreamDestroy(). If we used cudaStream_t directly, we would have to take care of that ourselves. OK, we actually could do that, since we have one edm::ReusableObjectHolder per device

std::vector<edm::ReusableObjectHolder<cuda::stream_t<>>> cache;

so in the destructor of the cache we could loop over the vector and set the current device correctly before destroying each device's streams. Relying on the destructors is more convenient, though.
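If we did switch to a raw cudaStream_t, a custom deleter could reproduce what the cuda::stream_t<> destructor does for us; a sketch (illustrative only):

#include <cuda_runtime.h>

// device-aware deleter: make the stream's device current before calling
// cudaStreamDestroy(), then restore the previously current device
struct StreamDeleter {
  int device;
  void operator()(CUstream_st * stream) const {
    int previous;
    cudaGetDevice(&previous);
    cudaSetDevice(device);
    cudaStreamDestroy(stream);
    cudaSetDevice(previous);
  }
};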

So I have rather mixed feelings about the wrappers, but not strong enough to clearly say one way or the other.

@makortel

A tiny argument against cuda::stream_t<>: since it does not assume that the stream's device is the current one, it has to call cudaGetDevice()/cudaSetDevice() before doing anything. We already take care of setting the current device elsewhere, so those calls are pure overhead (tiny, but still).
