Replace use of CUDA API wrapper unique_ptrs with CUDAUtilities unique_ptrs #396
Conversation
Validation summary
Reference release CMSSW_11_0_0_pre7 at 411b633
Validation plots:
- /RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW
- /RelValZMM_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW
- /RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_design_v3-v1/GEN-SIM-DIGI-RAW
Throughput plots: /EphemeralHLTPhysics1/Run2018D-v1/RAW run=323775 lumi=53
logs and
No changes to physics or timing, as expected. In fact, this PR touches only test files, not any part of the reconstruction. However, we do see some unexpected changes in the realistic TTbar workflow on GPU: Pixel Tracks from PV.
Let's see what happens re-running the validation... In fact, looking at the validation of some recent PRs, the result for "Pixel Tracks from PV" seems to alternate randomly between 4950 and 5017 tracks. Do we have some imperfect reproducibility in the vertexing code?
```diff
@@ -51,15 +51,15 @@ int main(void) {
   float ge[6 * size];

-  auto current_device = cuda::device::current::get();
-  auto d_xl = cuda::memory::device::make_unique<float[]>(current_device, size);
-  auto d_yl = cuda::memory::device::make_unique<float[]>(current_device, size);
+  auto d_xl = cudautils::make_device_unique<float[]>(size, nullptr);
```
Do people think it would make sense to (ab)use `cudaStreamDefault` instead of `nullptr` to specify the default stream? I say "abuse" because `cudaStreamDefault` is meant to specify the default stream creation flags; however, the name and value (0x00) would make it a good candidate...
I'm a bit afraid that the "abuse" would lead to confusion at some point.
I'm thinking (*) of adding an overload on the caching allocator that would not take a stream at all (or would use `nullptr` to signify "no stream", although that choice would make it impossible to use the allocator with the default stream). In that case the memory block would be truly freed in the destructor of the `unique_ptr`, instead of delaying the "true free" until the work using the memory block has finished. My main challenge is the naming of the smart pointers: using `unique_ptr` for both would likely be confusing (in a sense the current `unique_ptr` could be argued to be confusing as well).

(*) e.g. for caching memory allocations in ESProducts, and to reduce the use of CUDA events in the caching allocator
Reading up on the CUDA documentation, there are actually two options for the "default" stream:
- the "legacy default stream": this synchronises with all (not non-blocking) streams on the same device
- the "per-thread default stream": this is per-thread, and does not synchronise with other streams (except for the legacy one)

Passing `0` or `nullptr` will use either of those behaviours depending on the nvcc `--default-stream` option or the `CUDA_API_PER_THREAD_DEFAULT_STREAM` symbol; the default is the "legacy" stream.
Purely from the API point of view, I would use
- `cudautils::make_device_unique<T>(size, nullptr);` for the unspecified default stream
- `cudautils::make_device_unique<T>(size, cudaStreamLegacy);` for the legacy default stream
- `cudautils::make_device_unique<T>(size, cudaStreamPerThread);` for the per-thread default stream
- `cudautils::make_device_unique<T>(size);` for the synchronous behaviour

to keep the possibility of passing `nullptr` for the generic default stream. With that naming scheme, `cudaStreamDefault` makes a lot of sense for the unspecified default stream.
> My main challenge is the naming of the smart pointers: using `unique_ptr` for both would likely be confusing (in a sense the current `unique_ptr` could be argued to be confusing as well).
Then I would suggest `unique_ptr` and `make_device_unique` for the synchronous behaviour, and something like `async_unique_ptr` and `make_device_async_unique`, or `unique_ptr_async` and `make_device_unique_async`, for the ones that use a stream?
Or just stick to `unique_ptr`...
> In fact, looking at the validation of some recent PRs, the result for "Pixel Tracks from PV" seems to alternate randomly between 4950 and 5017 tracks. Do we have some non-perfect reproducibility in the vertexing code?
Not necessarily in the vertexing code. The association to PV is a specific algorithm in the Validation code (that we need to describe in the paper, btw). Matti???

v.
From the second round of validation:
Not bad for a PR that touches only test files...
…_ptrs (#396): Replace cuda::memory::device::make_unique() calls with cudautils::make_device_unique(); replace cuda::memory::host::make_unique() with cudautils::make_host_unique()
PR description:
This PR is part of #386, and replaces the use of the CUDA API wrapper `unique_ptr`s: `cuda::memory::device::make_unique()` and `cuda::memory::host::make_unique()` are replaced with, respectively, `cudautils::make_device_unique()` and `cudautils::make_host_unique()`. For this purpose also `cuda::memory::device::unique_ptr()` and `cuda::memory::host::unique_ptr()` have been replaced with, respectively, `cudautils::device::unique_ptr()` and `cudautils::host::unique_ptr()`.
PR validation:
Unit tests were run, and code formatting was applied.