
[cudadev] Macro based SoA #211

Closed (wanted to merge 8 commits)

Conversation

ericcano (Contributor):
Converted SiPixelROCsStatusAndMapping to a macro-generated SoA.

This implements the SoA macros discussed in cms-patatrack/cmssw#272.

@makortel makortel added the cuda label Sep 1, 2021
makortel (Collaborator) left a comment:

Thanks, very interesting! Here are my first comments, based on how the SoA is used. I'll need another round to look at the implementation.

```cpp
// TODO: check if cudaStreamDefault is appropriate
auto cablingMapMetadata = cablingMap.soaMetadata();
cablingMapHostBuffer = cms::cuda::make_host_unique<std::byte[]>(cablingMapMetadata.byteSize(), cudaStreamDefault);
```
Collaborator:

Not really in principle, but it probably won't matter in practice. I'm going to rework the EventSetup side this fall to allow the use of these smart pointers more naturally here.

Contributor (author):

OK, so leaving it as-is, thanks.

```diff
   return data.cablingMapDevice;
 }

 const unsigned char* SiPixelROCsStatusAndMappingWrapper::getModToUnpAllAsync(cudaStream_t cudaStream) const {
   const auto& data =
       modToUnp_.dataForCurrentDeviceAsync(cudaStream, [this](ModulesToUnpack& data, cudaStream_t stream) {
-        cudaCheck(cudaMalloc((void**)&data.modToUnpDefault, pixelgpudetails::MAX_SIZE_BYTE_BOOL));
-        cudaCheck(cudaMemcpyAsync(data.modToUnpDefault,
+        data.modToUnpDefault = cms::cuda::make_device_unique<unsigned char[]>(pixelgpudetails::MAX_SIZE_BYTE_BOOL, stream);
```
Collaborator:

Also here the use of stream for memory allocations is not, in principle, appropriate, but it probably doesn't matter in practice as the ES product outlives the event loop and all GPU work.

Contributor (author):

Also leaving it like this, then.

```cpp
void allocate(size_t size, cudaStream_t stream) {
  cablingMapDeviceBuffer = cms::cuda::make_device_unique<std::byte[]>(
      SiPixelROCsStatusAndMapping::computeDataSize(size), stream);
  new (&cablingMapDevice) SiPixelROCsStatusAndMapping(cablingMapDeviceBuffer.get(), size);
```
Collaborator:

Should the destructor of the earlier cablingMapDevice object be called first?

Contributor (author):

Not in this case, as the SoA is just a non-owning view with nothing to destroy, but I agree it's a good practice before an in-place constructor, and in case the class evolves. I will add it.

Resolved review thread (outdated): src/cudadev/Makefile
```diff
- in.read(reinterpret_cast<char*>(&obj), sizeof(SiPixelROCsStatusAndMapping));
+ // We use default alignment
+ auto objBuffer = std::make_unique<std::byte[]>(SiPixelROCsStatusAndMapping::computeDataSize(pixelgpudetails::MAX_SIZE));
+ SiPixelROCsStatusAndMapping obj(objBuffer.get(), pixelgpudetails::MAX_SIZE);
```
Collaborator:

IIUC, the memory needs to be managed outside of the SoA structure. I find that inconvenient (something we already experienced with SimpleVector etc.).

Collaborator:

On further thought, in the memory ownership model we've mainly been discussing there is a distinction between an owner object and a non-owning "view". I realize the SoA here looks a lot like such a non-owning view, in which case leaving the memory management outside of the SoA is completely understandable. But I wonder if it would be possible to craft a generic owner class for this case?

Contributor (author):

I think this is an important place to look. Currently, while running cudadev with the parameters `./cudadev --maxEvents 10000 --numberOfThreads 16 --numberOfStreams 24`, we can see in NSight that CPU threads spend most of their time waiting for the mutex of the caching memory allocator. The GPU and copy engine are not saturated.

A single memory handler per producer allocating a single chunk of memory and cutting it into one or multiple views should reduce the number of allocations by a factor of a few (up to 8 here: https://github.com/cms-patatrack/pixeltrack-standalone/blob/master/src/cudadev/CUDADataFormats/SiPixelDigisCUDA.cc#L8-L15 ). I am planning to look into this now as there is a good performance gain opportunity here.

Eventually, this memory manager could also handle the various copies in host, pinned host, and device memory, and the transfers between them.

Collaborator:

> Currently, while running cudadev with those parameters: ./cudadev --maxEvents 10000 --numberOfThreads 16 --numberOfStreams 24, we can see in NSight that CPU threads spend most of their time waiting for the mutex of the caching memory allocator.

Is the bottleneck really the mutex in the caching allocator? The last time I looked into that (which admittedly was nearly 2 years ago and in CMSSW) the CPU-side bottleneck was the mutex in the CUDA itself. Things have certainly evolved since then, but I'd like to see the evidence.

What GPU are you testing on? In my tests on V100 cudadev saturates around --numberOfThreads 10, so at least there going much beyond will just result in more lock contention. Luckily(?) this setup also reaches 92-98 % GPU utilization as reported by nvidia-smi (whatever that really means).

Contributor (author):

I had a look with NSight. Here is a snapshot:

[screenshot: NSight timeline snapshot]

This is with a T4 (with 2 Intel(R) Xeon(R) Gold 6338 for a total of 64 cores/128 threads). NSight saw 23 threads in total, so the CPU is not used to its maximum. The very bottom blue/purple bars show the usage of the GPU processing and copy engine, respectively. We are clearly not saturating the GPU (~50%, judging by eye).

It's easy to differentiate the CUDA locking from the memory manager locking: CUDA uses a pthread_rwlock_wrlock and NSight shows the CUDA call under the system call (see the red cudaMem... on the 4th thread on the left), while the memory manager locking uses pthread_mutex_lock.

All can be checked with a tooltip stack trace:

[screenshot: NSight tooltip stack trace]

I tried to minimize the locking in the memory manager by deferring potentially expensive calls out of the critical section, but this did not help. One such expensive call can be visualized here:

[screenshot: NSight view of an expensive call inside the critical section]

We can see all other threads waiting for the mutex in this snapshot.

I added deferred calls here: ericcano@40abde0. That's a risk-free change, but it did not help much.

I also moved the cudaEventRecord out of the critical section, but this requires changing to a 2-pass strategy, which actually reduces performance (probably adding to the contention, as each pass requires taking the lock): ericcano@a3fdba0

So, in all, reducing the number of allocations by gathering SoA columns in a single block should improve the performance (looking into this now).

Collaborator:

This is very interesting. Have you tried what happens with a smaller number of CPU threads, say between 5 and 10? I'd expect the T4 to saturate with at most the same number of CPU threads as the V100, so I'd expect 23 threads to suffer heavily from lock contention.

Nevertheless, I agree it is clear that the caching allocator must be improved. Given also the discussion in #216, I opened issue #218 to discuss that further.

Contributor (author):

I moved the relevant code out to another branch: https://github.com/ericcano/pixeltrack-standalone/tree/memoryManagement

@ericcano ericcano changed the title Macro based so a Macro based SoA Sep 2, 2021
… copyable.

The requirement for trivial constructibility prevented creating structures on the host side and then memory-copying them to the device side.
The SoA store divides a user-preallocated buffer into adjacent columns of values. The columns are byte-aligned to a settable alignment, and their length is determined at run time.

The SoA views allow grouping columns from multiple stores into a logical entity, for example when some columns are used on the device only and others are transferred to the host. The two groups live in two different stores but are joined together in a logical view.

Views can also provide access to a subset of a store.
ericcano (Contributor, author) commented Oct 13, 2021:

The latest rewrite of the branch fixes the "one has to know what one is doing" approach, where views were overlaid on one another to provide multiple sets of columns depending on the use pattern and transfer needs.

The new approach uses 3 layers:

  • Buffer: a buffer of std::byte, allocated on host or device
  • Store: a collection of contiguous, aligned columns of data of the same length
  • View: a collection of pointers to columns from one or several stores, adjacent or not

A typical example can be seen in 3e9b8fb#diff-a84742dff2728673b16912f2e5297015c8c9d47ab27a616fc757d450f2d01808, where a store is used on both the host and the device, and the data gets transferred between the two; in addition, extra columns are present on the device side only, located in a second store. Both stores use the same buffer to save memory allocations, and a view merges both together to provide unified access.

Utility functions allow easy computation of store sizes (computeDataSize(size_t length)) and locating store positions in shared buffers (nextByte()).

4 examples were migrated to this scheme (1 condition structure and 3 data structures).

Views do not support scalar (non-column) elements yet, but those are not so common (to be added though).

Support for Eigen columns in views could be added too.

Finally, views could be templated to add cache optimization flavors (non-coherent cache, streamed reads and writes).

@ericcano ericcano changed the title Macro based SoA [cudadev] Macro based SoA Nov 1, 2021
makortel (Collaborator) left a comment:

Got to around SiPixelDigisCUDA.cc.

Resolved review threads: src/cudadev/CUDACore/device_unique_ptr.h, src/cudadev/CUDACore/host_unique_ptr.h
makortel (Collaborator) left a comment:

Got up to TrackingRecHit2DSOAView.h.

makortel (Collaborator) left a comment:

Got to SoAStore.h. I might have easily misunderstood something.

Resolved review threads: src/cudadev/DataFormats/SoACommon.h (3), src/cudadev/DataFormats/SoAStore.h (5)
makortel (Collaborator) left a comment:

Got to the end.

Resolved review threads: src/cudadev/DataFormats/SoAView.h (6)
makortel (Collaborator):
I've been thinking about the names. I find the name Store confusing too, since it does not own the storage and therefore acts more like a View. Throwing out some ideas:

  • SoAStore -> SoAView, SoAView -> SoACombView
    • The SoAView acts like a combination of columns from other "Views"
  • SoAStore -> SoALayout, keep SoAView

ericcano (Contributor, author):
With all the comments reviewed and the todo list recorded in the last section of SoA.md, I will close this PR to preserve the branch and create a new one from the https://github.com/ericcano/pixeltrack-standalone/tree/macroBasedSoA-PR211-followup branch.

ericcano (Contributor, author):

This branch will now be followed up in #287.

@ericcano ericcano closed this Dec 13, 2021