[cudadev] Macro based SoA #211
Conversation
Thanks, very interesting! Here are my first comments, based on how the SoA is used. I'll need another round to look at the implementation.
    // TODO: check if cudaStreamDefault is appropriate
    auto cablingMapMetadata = cablingMap.soaMetadata();
    cablingMapHostBuffer = cms::cuda::make_host_unique<std::byte[]>(cablingMapMetadata.byteSize(), cudaStreamDefault);
Not really in principle, but it probably won't matter in practice. I'm going to rework the EventSetup side this fall to allow the use of these smart pointers more naturally here.
OK, so leaving it as-is, thanks.
    return data.cablingMapDevice;
    }

    const unsigned char* SiPixelROCsStatusAndMappingWrapper::getModToUnpAllAsync(cudaStream_t cudaStream) const {
      const auto& data =
          modToUnp_.dataForCurrentDeviceAsync(cudaStream, [this](ModulesToUnpack& data, cudaStream_t stream) {
            cudaCheck(cudaMalloc((void**)&data.modToUnpDefault, pixelgpudetails::MAX_SIZE_BYTE_BOOL));
            cudaCheck(cudaMemcpyAsync(data.modToUnpDefault,
            data.modToUnpDefault = cms::cuda::make_device_unique<unsigned char[]>(pixelgpudetails::MAX_SIZE_BYTE_BOOL, stream);
Also here the use of stream for memory allocations is not, in principle, appropriate, but it probably doesn't matter in practice as the ES product outlives the event loop and all GPU work.
Also leaving it like this, then.
    void allocate(size_t size, cudaStream_t stream) {
      cablingMapDeviceBuffer = cms::cuda::make_device_unique<std::byte[]>(
          SiPixelROCsStatusAndMapping::computeDataSize(size), stream);
      new (&cablingMapDevice) SiPixelROCsStatusAndMapping(cablingMapDeviceBuffer.get(), size);
Should the destructor of the earlier cablingMapDevice object be called first?
Not in this case, as the SoA is just a non-owning view with nothing to destroy, but I agree it's good practice before an in-place constructor, in case the class evolves. I will add it.
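For illustration, a minimal sketch of what that would look like in the allocate() quoted above (the explicit destructor call is a no-op today, since the view is non-owning and trivially destructible):

    void allocate(size_t size, cudaStream_t stream) {
      cablingMapDeviceBuffer = cms::cuda::make_device_unique<std::byte[]>(
          SiPixelROCsStatusAndMapping::computeDataSize(size), stream);
      // End the lifetime of the previous view before constructing a new one in place.
      // Currently a no-op, but it stays correct if the class ever acquires resources.
      cablingMapDevice.~SiPixelROCsStatusAndMapping();
      new (&cablingMapDevice) SiPixelROCsStatusAndMapping(cablingMapDeviceBuffer.get(), size);
    }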
    in.read(reinterpret_cast<char*>(&obj), sizeof(SiPixelROCsStatusAndMapping));
    // We use default alignment
    auto objBuffer = std::make_unique<std::byte[]>(SiPixelROCsStatusAndMapping::computeDataSize(pixelgpudetails::MAX_SIZE));
    SiPixelROCsStatusAndMapping obj(objBuffer.get(), pixelgpudetails::MAX_SIZE);
IIUC, the memory needs to be managed outside of the SoA structure. I find that inconvenient (something we already experienced with SimpleVector etc.).
On a further thought, in the memory ownership model we've mainly discussing about there is a distinction between an owner object and a non-owning "view". I realize the SoA here looks a lot like such non-owning view, in which case leaving the memory management outside of the SoA is completely understandable. But I wonder if it would be possible to craft a generic owner class in this case?
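As a rough illustration of what such a generic owner could look like (the SoAOwner name and interface are hypothetical, not part of this PR), it would simply pair the buffer with the non-owning view constructed on top of it:

    #include <cstddef>
    #include <memory>
    #include <utility>

    // Hypothetical sketch: owns the buffer and keeps the non-owning SoA view in sync
    // with it. Buffer could be std::unique_ptr<std::byte[]> or the smart pointer types
    // returned by cms::cuda::make_host_unique / make_device_unique.
    template <typename View, typename Buffer = std::unique_ptr<std::byte[]>>
    class SoAOwner {
    public:
      SoAOwner(Buffer buffer, std::size_t nElements)
          : buffer_(std::move(buffer)), view_(buffer_.get(), nElements) {}

      View& view() { return view_; }
      const View& view() const { return view_; }

    private:
      Buffer buffer_;  // owns the memory
      View view_;      // non-owning view laid out over buffer_ (declared after buffer_ on purpose)
    };

    // Usage sketch, assuming the (pointer, size) constructor seen elsewhere in this PR:
    //   SoAOwner<SiPixelROCsStatusAndMapping> owner(
    //       std::make_unique<std::byte[]>(SiPixelROCsStatusAndMapping::computeDataSize(n)), n);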
I think this is an important place to look. Currently, while running cudadev with these parameters: ./cudadev --maxEvents 10000 --numberOfThreads 16 --numberOfStreams 24, we can see in NSight that CPU threads spend most of their time waiting for the mutex of the caching memory allocator. The GPU and copy engine are not saturated.
A single memory handler per producer, allocating a single chunk of memory and cutting it into one or multiple views, should reduce the number of allocations by a factor of a few (up to 8 here: https://github.com/cms-patatrack/pixeltrack-standalone/blob/master/src/cudadev/CUDADataFormats/SiPixelDigisCUDA.cc#L8-L15). I am planning to look into this now, as there is a good performance gain opportunity here.
Eventually, this memory manager could also handle the various copies in host, pinned host, and device memory, and the transfers.
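A rough sketch of the single-chunk idea (the function and column names below are made up for illustration; this is not the actual interface): one device block plus one pinned host block with the same layout, carved into column pointers, and a single transfer for everything.

    #include <cstddef>
    #include <cstdint>
    // Uses cms::cuda::make_device_unique / make_host_unique and cudaCheck as elsewhere in this code base.

    void allocateAndTransferDigis(std::size_t nDigis, cudaStream_t stream) {
      // Columns ordered by decreasing element size keep natural alignment without padding.
      std::size_t offset = 0;
      const std::size_t offPdigi = offset; offset += nDigis * sizeof(std::uint32_t);
      const std::size_t offXX = offset;    offset += nDigis * sizeof(std::uint16_t);
      const std::size_t offADC = offset;   offset += nDigis * sizeof(std::uint16_t);
      const std::size_t totalBytes = offset;

      // One device block and one pinned host block instead of one allocation per column.
      auto deviceBlock = cms::cuda::make_device_unique<std::byte[]>(totalBytes, stream);
      auto hostBlock = cms::cuda::make_host_unique<std::byte[]>(totalBytes, stream);

      // Column pointers into the single host block.
      auto* pdigi_h = reinterpret_cast<std::uint32_t*>(hostBlock.get() + offPdigi);
      auto* xx_h = reinterpret_cast<std::uint16_t*>(hostBlock.get() + offXX);
      auto* adc_h = reinterpret_cast<std::uint16_t*>(hostBlock.get() + offADC);
      // ... fill the host-side columns ...

      // One transfer for the whole block instead of one per column.
      cudaCheck(cudaMemcpyAsync(deviceBlock.get(), hostBlock.get(), totalBytes, cudaMemcpyHostToDevice, stream));
    }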
> Currently, while running cudadev with these parameters: ./cudadev --maxEvents 10000 --numberOfThreads 16 --numberOfStreams 24, we can see in NSight that CPU threads spend most of their time waiting for the mutex of the caching memory allocator.
Is the bottleneck really the mutex in the caching allocator? The last time I looked into that (which admittedly was nearly 2 years ago and in CMSSW) the CPU-side bottleneck was the mutex in CUDA itself. Things have certainly evolved since then, but I'd like to see the evidence.
What GPU are you testing on? In my tests on a V100, cudadev saturates around --numberOfThreads 10, so at least there, going much beyond that will just result in more lock contention. Luckily(?) this setup also reaches 92-98 % GPU utilization as reported by nvidia-smi (whatever that really means).
I had a look with NSight. Here is a snapshot:
This is with a T4 (with 2 Intel(R) Xeon(R) Gold 6338 CPUs, for a total of 64 cores/128 threads). NSight saw 23 threads in total, so the CPU is not used to the maximum. The very bottom blue/purple bars show the usage of the GPU processing and copy engines, respectively. We are clearly not saturating the GPU (~50%, judging by eye).
It's easy to differentiate the CUDA locking from the memory manager locking: CUDA uses a pthread_rwlock_wrlock, and NSight shows the CUDA call under the system call (see the red cudaMem... on the 4th thread on the left), while the memory manager locking uses pthread_mutex_lock.
All can be checked with a tooltip stack trace:
I tried to minimize the locking in the memory manager by deferring potentially expensive calls out of the critical section, but this did not help. One such expensive call can be visualized here:
We can see all other threads waiting for the mutex in this snapshot.
I added deferred calls here: ericcano@40abde0. That's a risk-free change, but it did not help much.
I also moved the cudaEventRecord out of the critical section, but this requires changing to a two-pass strategy, which actually reduces performance (probably adding to the contention, as each pass requires a lock): ericcano@a3fdba0
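For context, the general shape of the deferred-calls idea is sketched below (simplified; the real caching-allocator bookkeeping is more involved): collect the potentially expensive operations while holding the mutex and run them only after releasing it.

    #include <functional>
    #include <mutex>
    #include <vector>

    std::mutex cacheMutex;  // protects the cached-block bookkeeping

    void returnBlockToCache(/* block descriptor */) {
      std::vector<std::function<void()>> deferred;
      {
        std::lock_guard<std::mutex> lock(cacheMutex);
        // ... update the bookkeeping (cheap) ...
        // Expensive operations (e.g. freeing evicted blocks) are only recorded here.
        deferred.emplace_back([] { /* cudaFree(evictedBlock); */ });
      }
      // Execute the expensive CUDA calls outside the critical section.
      for (auto& action : deferred)
        action();
    }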
So, all in all, reducing the number of allocations by gathering SoA columns into a single block should improve performance (I am looking into this now).
This is very interesting. Have you tried what happens with a smaller number of CPU threads, say between 5 and 10? I'd expect the T4 to saturate with at most the same number of CPU threads as the V100, so I'd expect 23 to suffer heavily from lock contention.
Nevertheless, I agree it is clear that the caching allocator must be improved. Given also the discussion in #216, I opened an issue to discuss that further: #218.
I moved the relevant code out to another branch: https://github.com/ericcano/pixeltrack-standalone/tree/memoryManagement
… copyable. The requirement of being trivially constructible prevented creating structures on the host side and then memory-copying them to the device side.
The SoA store divides a user pre-allocated buffer into adjacent columns of values. The columns are byte-aligned to a settable alignment and their length is determined at run time. The SoA views allow grouping columns from multiple stores into a logical entity, for example when some columns are used on the device only and others are transmitted to the host: the two groups live in two different stores, but are joined together in a logical view. Views can also provide access to a subset of a store.
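To make that concrete, here is an illustration of the size arithmetic such a store implies (illustrative only, not the macro-generated code): each column is padded to the chosen alignment, and the sum gives the buffer size the user has to pre-allocate.

    #include <cstddef>
    #include <cstdint>

    // Pad a byte count up to the store's (settable) alignment.
    constexpr std::size_t alignSize(std::size_t bytes, std::size_t alignment) {
      return (bytes + alignment - 1) / alignment * alignment;
    }

    // Buffer size for, e.g., two uint16_t columns and one uint32_t column of
    // nElements entries each, every column starting on an alignment-byte boundary.
    std::size_t computeDataSize(std::size_t nElements, std::size_t alignment = 128) {
      return 2 * alignSize(nElements * sizeof(std::uint16_t), alignment) +
             alignSize(nElements * sizeof(std::uint32_t), alignment);
    }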
The latest rewrite of the branch fixes the "one has to know what one is doing" approach, where views were overlaid on one another to provide multiple sets of columns depending on the use pattern and transfer needs. The new approach is organized in three layers.
A typical example can be seen in 3e9b8fb#diff-a84742dff2728673b16912f2e5297015c8c9d47ab27a616fc757d450f2d01808, where a store is used on both the host and the device and the data gets transferred between the two; in addition, extra columns are present on the device side only and located in a second store. Both stores use the same buffer to save memory allocations, and a view merges both together to provide unified access. Utility functions allow easy computation of store sizes. Four examples were migrated to this scheme (one condition structure and three data structures). Views do not support scalar (non-column) elements yet, but those are not so common (to be added, though). Support for Eigen columns in views could be added too. Finally, views could be templated to add cache optimization flavors (non-coherent cache, streamed reads and writes).
Eigen columns, although present in the stores, are explicitly not supported in views (enforced by an assertion).
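Schematically, the layout described above could look like the following sketch (TransferableStore, DeviceOnlyStore and CombinedView are placeholder stand-ins, not the actual generated classes): both stores live in one device buffer, only the transferable range is copied, and the view exposes the union of the columns.

    #include <cstddef>
    #include <cstdint>

    // Placeholder stand-ins for two macro-generated stores and a merging view
    // (names and members are invented for the sketch).
    struct TransferableStore {
      static std::size_t computeDataSize(std::size_t n) { return 2 * n * sizeof(std::uint16_t); }
      TransferableStore(std::byte* buf, std::size_t n)
          : xx(reinterpret_cast<std::uint16_t*>(buf)), yy(xx + n) {}
      std::uint16_t *xx, *yy;  // columns filled on the host and copied to the device
    };

    struct DeviceOnlyStore {
      static std::size_t computeDataSize(std::size_t n) { return n * sizeof(float); }
      DeviceOnlyStore(std::byte* buf, std::size_t) : charge(reinterpret_cast<float*>(buf)) {}
      float* charge;  // column that only ever exists on the device
    };

    struct CombinedView {  // unified access to the columns of both stores
      CombinedView(const TransferableStore& t, const DeviceOnlyStore& d)
          : xx(t.xx), yy(t.yy), charge(d.charge) {}
      std::uint16_t *xx, *yy;
      float* charge;
    };

    void buildAndTransfer(std::size_t n, cudaStream_t stream) {
      const std::size_t bytesTransferable = TransferableStore::computeDataSize(n);
      const std::size_t bytesDeviceOnly = DeviceOnlyStore::computeDataSize(n);

      // One device allocation backs both stores; only the first part has a host copy.
      auto deviceBuffer = cms::cuda::make_device_unique<std::byte[]>(bytesTransferable + bytesDeviceOnly, stream);
      TransferableStore transferable_d(deviceBuffer.get(), n);
      DeviceOnlyStore deviceOnly_d(deviceBuffer.get() + bytesTransferable, n);

      // The host side only needs the transferable columns.
      auto hostBuffer = cms::cuda::make_host_unique<std::byte[]>(bytesTransferable, stream);
      TransferableStore transferable_h(hostBuffer.get(), n);
      // ... fill transferable_h ...
      cudaCheck(cudaMemcpyAsync(deviceBuffer.get(), hostBuffer.get(), bytesTransferable, cudaMemcpyHostToDevice, stream));

      // The view merges both stores into one logical entity for the kernels.
      CombinedView view(transferable_d, deviceOnly_d);
    }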
Got to around SiPixelDigisCUDA.cc.
Got up to TrackingRecHit2DSOAView.h.
Got to SoAStore.h. I might have easily misunderstood something.
Got to the end.
I've been thinking about the names. I find the name …
With all the comments reviewed and the to-do list recorded in the last section of SoA.md, I will close this PR to preserve the branch and create a new one from the new https://github.com/ericcano/pixeltrack-standalone/tree/macroBasedSoA-PR211-followup branch.
This branch will now be followed up in #287.
Converted SiPixelROCsStatusAndMapping to a macro-generated SoA.
This implements the SoA macros discussed in cms-patatrack/cmssw#272.