
[cudadev] Macro based SoA #211

Closed (wanted to merge 8 commits)

Conversation

ericcano (Contributor):
Converted SiPixelROCsStatusAndMapping to a macro-generated SoA.

This implements the SoA macros discussed in cms-patatrack/cmssw#272.

@makortel makortel added the cuda label Sep 1, 2021
makortel (Collaborator) left a comment:

Thanks, very interesting! Here are my first comments, based on how the SoA is used. I'll need another round to look at the implementation.

```cpp
// TODO: check if cudaStreamDefault is appropriate
auto cablingMapMetadata = cablingMap.soaMetadata();
cablingMapHostBuffer = cms::cuda::make_host_unique<std::byte[]>(cablingMapMetadata.byteSize(), cudaStreamDefault);
```
Collaborator:

Not really in principle, but it probably won't matter in practice. I'm going to rework the EventSetup side this fall to allow the use of these smart pointers more naturally here.

Contributor (author):

OK, so leaving it as-is, thanks.

```diff
   return data.cablingMapDevice;
 }

 const unsigned char* SiPixelROCsStatusAndMappingWrapper::getModToUnpAllAsync(cudaStream_t cudaStream) const {
   const auto& data =
       modToUnp_.dataForCurrentDeviceAsync(cudaStream, [this](ModulesToUnpack& data, cudaStream_t stream) {
-        cudaCheck(cudaMalloc((void**)&data.modToUnpDefault, pixelgpudetails::MAX_SIZE_BYTE_BOOL));
-        cudaCheck(cudaMemcpyAsync(data.modToUnpDefault,
+        data.modToUnpDefault = cms::cuda::make_device_unique<unsigned char[]>(pixelgpudetails::MAX_SIZE_BYTE_BOOL, stream);
```
Collaborator:

Also here the use of stream for memory allocations is not, in principle, appropriate, but it probably doesn't matter in practice as the ES product outlives the event loop and all GPU work.

Contributor (author):

Also leaving it like this, then.

```cpp
void allocate(size_t size, cudaStream_t stream) {
  cablingMapDeviceBuffer = cms::cuda::make_device_unique<std::byte[]>(
      SiPixelROCsStatusAndMapping::computeDataSize(size), stream);
  new (&cablingMapDevice) SiPixelROCsStatusAndMapping(cablingMapDeviceBuffer.get(), size);
```
Collaborator:

Should the destructor of the earlier cablingMapDevice object be called first?

Contributor (author):

Not in this case, as the SoA is just a non-owning view with nothing to destroy, but I agree it's a good practice before an in-place constructor, and in case the class evolves. I will add it.

Resolved review thread (outdated): src/cudadev/Makefile
```diff
- in.read(reinterpret_cast<char*>(&obj), sizeof(SiPixelROCsStatusAndMapping));
+ // We use default alignment
+ auto objBuffer = std::make_unique<std::byte[]>(SiPixelROCsStatusAndMapping::computeDataSize(pixelgpudetails::MAX_SIZE));
+ SiPixelROCsStatusAndMapping obj(objBuffer.get(), pixelgpudetails::MAX_SIZE);
```
Collaborator:

IIUC, the memory needs to be managed outside of the SoA structure. I find that inconvenient (something we already experienced with SimpleVector etc.).

Collaborator:

On further thought, in the memory ownership model we've mainly been discussing there is a distinction between an owner object and a non-owning "view". I realize the SoA here looks a lot like such a non-owning view, in which case leaving the memory management outside of the SoA is completely understandable. But I wonder if it would be possible to craft a generic owner class for this case?

Contributor (author):

I think this is an important place to look. Currently, while running cudadev with the parameters `./cudadev --maxEvents 10000 --numberOfThreads 16 --numberOfStreams 24`, we can see in NSight that CPU threads spend most of their time waiting for the mutex of the caching memory allocator. The GPU and copy engine are not saturated.

A single memory handler per producer allocating a single chunk of memory and cutting it into one or multiple views should reduce the number of allocations by a factor of a few (up to 8 here: https://github.com/cms-patatrack/pixeltrack-standalone/blob/master/src/cudadev/CUDADataFormats/SiPixelDigisCUDA.cc#L8-L15 ). I am planning to look into this now as there is a good performance gain opportunity here.

Eventually, this memory manager could also handle the various copies in host, pinned host, and device memory, and the transfers between them.

Collaborator:

> Currently, while running cudadev with those parameters: ./cudadev --maxEvents 10000 --numberOfThreads 16 --numberOfStreams 24, we can see in NSight that CPU threads spend most of their time waiting for the mutex of the caching memory allocator.

Is the bottleneck really the mutex in the caching allocator? The last time I looked into that (which admittedly was nearly 2 years ago and in CMSSW) the CPU-side bottleneck was the mutex in the CUDA itself. Things have certainly evolved since then, but I'd like to see the evidence.

What GPU are you testing on? In my tests on V100 cudadev saturates around --numberOfThreads 10, so at least there going much beyond will just result in more lock contention. Luckily(?) this setup also reaches 92-98 % GPU utilization as reported by nvidia-smi (whatever that really means).

Contributor (author):

I had a look with NSight. Here is a snapshot:

[screenshot: NSight timeline snapshot]

This is with a T4 (with 2 Intel(R) Xeon(R) Gold 6338 for a total of 64 cores/128 threads). NSight saw 23 threads in total, so the CPU is not used to its maximum. The very bottom blue/purple bars show the usage of the GPU processing and copy engine, respectively. We are clearly not saturating the GPU (~50%, judging by eye).

It's easy to differentiate the CUDA locking from the memory manager locking: CUDA uses a pthread_rwlock_wrlock and NSight shows the CUDA call under the system call (see the red cudaMem... on the 4th thread on the left), while the memory manager locking uses pthread_mutex_lock.

All can be checked with a tooltip stack trace:

[screenshot: NSight tooltip stack trace]

I tried to minimize the locking in the memory manager by deferring potentially expensive calls out of the critical section, but this did not help. One such expensive call can be visualized here:

[screenshot: NSight view of an expensive call inside the critical section]

We can see all other threads waiting for the mutex in this snapshot.

I added deferred calls here: ericcano@40abde0. That's a risk-free change, but it did not help much.

I also moved the cudaEventRecord out of the critical section, but this requires changing to a 2-pass strategy, which actually reduces performance (probably adding to the contention, as each pass requires taking the lock): ericcano@a3fdba0

So, in all, reducing the number of allocations by gathering SoA columns in a single block should improve the performance (looking into this now).

Collaborator:

This is very interesting. Have you tried what happens with a smaller number of CPU threads, say between 5 and 10? I'd expect the T4 to saturate with at most the same number of CPU threads as the V100, so I'd expect 23 threads to suffer heavily from lock contention.

Nevertheless, I agree it is clear that the caching allocator must be improved. Given also the discussion in #216, I opened issue #218 to discuss that further.

Contributor (author):

I moved the relevant code out to another branch: https://github.com/ericcano/pixeltrack-standalone/tree/memoryManagement

@ericcano ericcano changed the title Macro based so a Macro based SoA Sep 2, 2021
… copyable.

The requirement for trivial constructibility prevented creating structures on the host side and then memory-copying them to the device side.
The SoA store divides a user-preallocated buffer into adjacent columns of values. The columns are byte-aligned to a settable alignment, and their length is determined at run time.

The SoA views allow grouping columns from multiple stores into a logical entity, for example when some columns are used on the device only and others are transferred to the host. The two groups live in two different stores but are joined together in a logical view.

Views can also provide access to a subset of a store.
ericcano (Contributor, author) commented Oct 13, 2021:

The latest rewrite of the branch fixes the "one has to know what one is doing" approach, where views were overlaid on one another to provide multiple sets of columns depending on the use pattern and transfer needs.

The new approach uses 3 layers:

  • Buffer: a buffer of std::byte, allocated on host or device
  • Store: a collection of contiguous, aligned columns of data of the same length
  • View: a collection of pointers to columns from one or several stores, adjacent or not

A typical example can be seen in 3e9b8fb#diff-a84742dff2728673b16912f2e5297015c8c9d47ab27a616fc757d450f2d01808, where a store is used on both the host and the device, and the data gets transferred between the two; in addition, extra columns are present on the device side only, located in a second store. Both stores use the same buffer to save memory allocations, and a view merges both together to provide unified access.

Utility functions allow easy computation of store sizes (computeDataSize(size_t length)) and locating store positions in shared buffers (nextByte()).

4 examples were migrated to this scheme (1 condition structure and 3 data structures).

Views do not support scalar (non-column) elements yet, but those are not so common (to be added though).

Support for Eigen columns in views could be added too.

Finally, views could be templated to add cache optimization flavors (non-coherent cache, streamed reads and writes).

@ericcano ericcano changed the title Macro based SoA [cudadev] Macro based SoA Nov 1, 2021
makortel (Collaborator) left a comment:

Got to around SiPixelDigisCUDA.cc.

Resolved review threads: src/cudadev/CUDACore/device_unique_ptr.h, src/cudadev/CUDACore/host_unique_ptr.h
makortel (Collaborator) left a comment:

Got up to TrackingRecHit2DSOAView.h.

makortel (Collaborator) left a comment:

Got to SoAStore.h. I might have easily misunderstood something.

Resolved review threads: src/cudadev/DataFormats/SoACommon.h (3), src/cudadev/DataFormats/SoAStore.h (5)
makortel (Collaborator) left a comment:

Got to the end.

Resolved review threads: src/cudadev/DataFormats/SoAView.h (6)
makortel (Collaborator):
I've been thinking about the names. I find the name Store confusing too, since it does not own the storage and therefore acts more like a View. Throwing out some ideas:

  • SoAStore -> SoAView, SoAView -> SoACombView
    • The SoAView acts like a combination of columns from other "Views"
  • SoAStore -> SoALayout, keep SoAView

ericcano (Contributor, author):
With all the comments reviewed and the todo list recorded in the last section of SoA.md, I will close this PR to preserve the branch and create a new one from the https://github.com/ericcano/pixeltrack-standalone/tree/macroBasedSoA-PR211-followup branch.

ericcano (Contributor, author):

This branch will now be followed up in #287.

@ericcano ericcano closed this Dec 13, 2021