[RFC] New version of CudaCompat #428
base: CMSSW_11_0_X_Patatrack
Conversation
```
#include <cstdio>

#undef __global__
#define __global__ inline __attribute__((always_inline))
```
this is needed to avoid multiple definitions of the same symbol
An alternative is to also have the C++ definition in its own .cc (not inlined).
For this sample case, wouldn't it be enough to have everything in the .cu file ?
Sorry, of course that does not work...
What we are trying to do for Cupla and Alpaka is to have the whole implementation in something like test/implement/Launch_t.cc, and then let scram build two versions by having

test/Launch_t.cpp:
```
#define CUDA_KERNELS_ON_CPU
#include "implement/Launch_t.cc"
```

test/Launch_t.cu:
```
#include "implement/Launch_t.cc"
```
Yes, but the idea of the test is to have a single file compiled by gcc twice (see the buildfile), to test that we can indeed launch kernels from gcc, and that the same code will instead run on the cpu if CUDA_KERNELS_ON_CPU is defined (in this case as a compiler option).
Of course for cuda we need the additional .cu file to compile the device code.
For symmetry one can claim that cpu kernels should be compiled in their own .cc (as I do in the end in the vertex producer, together with a minimal driver). Still, for the cpu the code must be force-inlined to avoid multiple symbols, so in my opinion (at least with this model)
#define __global__ inline __attribute__((always_inline))
will be required for cpu code (and apparently does not harm cuda code).
This is done in cudaCompat.h. I tried to keep this specific test as self-contained as possible.
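To make the model concrete, a minimal sketch of a single-source kernel under this scheme (the kernel name and the grid-stride pattern are illustrative, assuming a cudaCompat-style layer that provides trivial threadIdx/blockIdx/blockDim/gridDim stand-ins on the cpu):
```cpp
// compiled by gcc with CUDA_KERNELS_ON_CPU: __global__ becomes a
// force-inlined ordinary function, avoiding duplicate symbols when the
// header is included in several translation units
#ifdef CUDA_KERNELS_ON_CPU
#undef __global__
#define __global__ inline __attribute__((always_inline))
#endif

// kernel written once, with a grid-stride loop: on the gpu every thread
// strides over the array; on the cpu the compat layer makes all indices 0
// and all dimensions 1, so the single "thread" covers every element
__global__ void scale(float* x, int n, float a) {
  int first = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = gridDim.x * blockDim.x;
  for (int i = first; i < n; i += stride)
    x[i] *= a;
}
```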
Yes, and that is my previous standard: one .h, one .cc, one .cu.
Here I tried one .cc and one .cu, the latter with ONLY the kernel, no driver code.
I want to test launching from code compiled with gcc (having in mind that both cpu and gpu code shall reside in the same "load units", which I agree is not the case in this test).
I can build two tests (or three) to see what is needed to have both gpu and cpu code compiled, loaded, and then run in the same executable (possibly with the driver code compiled by gcc even in the gpu case).
```
@@ -295,14 +375,16 @@ int main() {
      continue;
    }

-#ifdef __CUDACC__
+#ifndef CUDA_KERNELS_ON_CPU
    cudaCheck(cudaMemcpy(zv, LOC_ONGPU(zv), nv * sizeof(float), cudaMemcpyDeviceToHost));
    cudaCheck(cudaMemcpy(wv, LOC_ONGPU(wv), nv * sizeof(float), cudaMemcpyDeviceToHost));
```
this is a typical case where one does not want to make any memcpy on the cpu....
In general, the semantics can be one where making the copy is necessary also on the cpu (e.g. because the function that launched the kernel does not keep the buffer alive, or because the kernel makes changes to it that should not be reflected in the original buffer), or one where the copy is only required because of the different memory areas (e.g. the original buffer is guaranteed to stay alive, and the kernel does not make any changes to it).
Do you think it would make sense to define a couple of functions like cudautils::copy and cudautils::mirror?
Then, the first could always be a copy (either cudaMemcpy or a plain copy), while the latter could be elided when running on the cpu.
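A sketch of what such a pair could look like (the signatures are hypothetical; only the copy/mirror distinction is taken from the comment above, and cudaCheck is the error-checking helper already used elsewhere in this PR):
```cpp
#include <cstring>
#include <cuda_runtime.h>

namespace cudautils {

  // "copy": always materializes a second buffer, on gpu and cpu alike
  template <typename T>
  void copy(T* dst, T const* src, size_t n, cudaStream_t stream) {
#ifndef CUDA_KERNELS_ON_CPU
    cudaCheck(cudaMemcpyAsync(dst, src, n * sizeof(T), cudaMemcpyDeviceToHost, stream));
#else
    std::memcpy(dst, src, n * sizeof(T));
#endif
  }

  // "mirror": returns a host-accessible view; the copy is elided on the cpu
  // because source and destination live in the same memory area
  template <typename T>
  T const* mirror(T const* src, T* hostBuffer, size_t n, cudaStream_t stream) {
#ifndef CUDA_KERNELS_ON_CPU
    cudaCheck(cudaMemcpyAsync(hostBuffer, src, n * sizeof(T), cudaMemcpyDeviceToHost, stream));
    return hostBuffer;
#else
    return src;
#endif
  }

}  // namespace cudautils
```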
This is indeed a matter of discussion and prototyping.
I was thinking of some "magic" specialization of copy, and indeed your proposal is interesting as it is semantically expressive.
In reality, in CMSSW production this never happens, as:
1) we copy to host memory using specific constructs by Matti
2) we do it explicitly in ad-hoc modules (SoAFromCUDA), coded into the data format itself
2a) what I currently do is that the producer in the cpu WF actually has the name of the SoAFromCUDA producer/converter in the gpu WF, so that the SoAonCPU has the same name in both WFs
I think I understand the rationale (i.e. allocating/deallocating memory without calling the objects' constructors/destructors). IIRC on the GPU side at some point we were checking that the types being allocated had a trivial constructor and destructor. Is that still the case ? Would it make sense to check here as well ?
mostly optimization (see how messy the allocation was before for GPUCells).
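A minimal sketch of the kind of compile-time check suggested in the question above (the helper name and its placement are hypothetical):
```cpp
#include <cstdlib>
#include <type_traits>

// allocate raw storage without running constructors; refuse types for which
// skipping construction/destruction would be incorrect
template <typename T>
T* allocate_uninitialized(std::size_t n) {
  static_assert(std::is_trivially_constructible<T>::value,
                "allocation without constructor calls requires a trivially constructible type");
  static_assert(std::is_trivially_destructible<T>::value,
                "deallocation without destructor calls requires a trivially destructible type");
  return static_cast<T*>(std::malloc(n * sizeof(T)));
}
```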
ported "PixelTriplets" as well.
```
@@ -72,7 +72,7 @@ namespace cudautils {
  inline void launchFinalize(Histo *__restrict__ h,
                             uint8_t *__restrict__ ws
#ifndef __CUDACC__
-                             = cudaStreamDefault
+                             = nullptr
```
I am curious if using cudaStreamDefault was giving problems ?
no, to me nullptr makes more sense for a pointer (even if they are all == 0).
```
#include <functional>

#include <cstdlib>
#include <cuda_runtime.h>
```
from a look at the file, I think #include <cuda_runtime.h> could be removed ?
```
}  // namespace cpu

template <typename T>
typename cpu::impl::make_cpu_unique_selector<T>::non_array make_cpu_unique(cudaStream_t) {
```
trying to better understand this: calling make_cpu_unique would be roughly equivalent to c++20's std::make_unique_default_init, plus it sets the deleter to just call free() instead of calling the destructors ?
> I am curious if using cudaStreamDefault was giving problems ?

no, it was working fine, as I suspect it is just == nullptr.
I was just puzzled reading the source and did not understand what was going on...
All these defaults are there just in case somebody would like to use this code in plain c++.
In heterogeneous code the defaults are not needed, as they are handled in the client code somehow...
(if the client code had to be identical on GPU and CPU)
> On 8 Dec, 2019, at 11:45 AM, Andrea Bocci wrote:
> trying to better understand this: calling make_cpu_unique would be roughly equivalent to c++20's std::make_unique_default_init, plus it sets the deleter to just call free() instead of calling the destructors ?

when the committee eventually agrees on syntax and semantics, yes. (and still, calling free() is not very symmetric...)
v.
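As a rough sketch of that understanding (the deleter and the selector details here are illustrative, not the actual implementation in the PR):
```cpp
#include <cstdlib>
#include <memory>

// deleter that releases the storage without invoking destructors,
// mirroring what the cuda allocators do on the device side
struct cpu_deleter {
  void operator()(void* ptr) const { std::free(ptr); }
};

template <typename T>
using cpu_unique_ptr = std::unique_ptr<T, cpu_deleter>;

// allocate uninitialized storage for a single T; a stream argument would be
// accepted only for symmetry with the cuda version and ignored
template <typename T>
cpu_unique_ptr<T> make_cpu_unique(/* cudaStream_t ignored on the cpu */) {
  return cpu_unique_ptr<T>(static_cast<T*>(std::malloc(sizeof(T))));
}
```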
Validation summary
Reference release CMSSW_11_0_0_pre13 at 91be707
Validation plots:
/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW
/RelValZMM_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW
/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_design_v3-v1/GEN-SIM-DIGI-RAW
Throughput plots:
/EphemeralHLTPhysics1/Run2018D-v1/RAW run=323775 lumi=53
logs and
> On 8 Dec, 2019, at 11:38 AM, Andrea Bocci wrote:
> from a look at the file, I think #include <cuda_runtime.h> could be removed ?

nope
```
/home/innocent/Compat/CMSSW_11_0_0_pre13_Patatrack/src/HeterogeneousCore/CUDAUtilities/interface/cpu_unique_ptr.h:65:92: error: 'cudaStream_t' was not declared in this scope
typename cpu::impl::make_cpu_unique_selector<T>::non_array make_cpu_unique_uninitialized(cudaStream_t) {
```
OK, I think I only need to understand the Launch_t tests now :)
hope that RecoLocalTracker/SiPixelRecHits/test/RecHitSoATest.* in VinInn:CudaCompatReLoadMore will be clearer...
will push it in here at some point

added an example of a "heterogeneous" analyzer using the "new" syntax.
```
@@ -94,10 +94,14 @@ namespace cudautils {
  }  // namespace detail

  // wrappers for cudaLaunchKernel

  inline
```
I will add the inline because it makes sense on its own
```
  void launch(void (*kernel)(), LaunchParameters config) {
#ifdef CUDA_KERNELS_ON_CPU
```
but I really, really do not want to add a dependency on #ifdefs etc. here.
will find a less intrusive solution
force-pushed from 7ec0a22 to e41560b
I was thinking we could try to have three sets of utilities:
- one for CUDA-only stuff, e.g. the allocators, launch, etc.
- one for generic, non-CUDA stuff, e.g. (if it happens) a SoA wrapper, etc.
- one for the compatibility layer for running the same code on CPU and GPU

I admit I did not consider much the dependencies among the different utilities, so this might end up not working.
This is a preview of a new version of my Quick&Dirty "make CUDA Kernel running on CPU". It introduces:
- a CUDA_KERNELS_ON_CPU flag
- cudaUtils::launch, to trivially invoke the kernel in case the above flag is defined
- make_cpu_unique and a corresponding specialization of unique_ptr, to invoke malloc/free for symmetry with cuda (and to avoid calls to constructors and destructors, which anyhow are not called in the cuda case at the time of allocation)
- Traits to use the above

I have ported the Vertex producer: the implementation in gpuVertexFinderImpl.h now does not have ANY compile-time flag related to cpu or gpu. By default it builds for GPU; in gpuVertexFinder.cc the first line is
#define CUDA_KERNELS_ON_CPU
even if cudaCompat still defines it when not compiled by nvcc. I plan to finish the port after the review.

Note that:
- the cpu and gpu code MUST be defined in different compilation units
- the cuda kernel of course requires nvcc or clang

please see a few more comments inline
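For illustration, a minimal sketch of the launch-wrapper idea described above (the LaunchParameters layout and the variadic signature are assumptions, not the PR's actual code):
```cpp
#include <cuda_runtime.h>

// hypothetical launch configuration; the real LaunchParameters may differ
struct LaunchParameters {
  dim3 gridDim;
  dim3 blockDim;
  size_t sharedMem = 0;
  cudaStream_t stream = nullptr;
};

// same call site on cpu and gpu: a plain function call when
// CUDA_KERNELS_ON_CPU is defined, cudaLaunchKernel otherwise
// (shown for kernels taking at least one argument)
template <typename... Args>
void launch(void (*kernel)(Args...), LaunchParameters config, Args... args) {
#ifdef CUDA_KERNELS_ON_CPU
  kernel(args...);  // grid/block configuration is ignored on the cpu
#else
  void* params[] = {&args...};  // cudaLaunchKernel takes pointers to the arguments
  cudaLaunchKernel((const void*)kernel, config.gridDim, config.blockDim,
                   params, config.sharedMem, config.stream);
#endif
}
```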