
[RFC] New version of CudaCompat #428

Open
wants to merge 24 commits into base: CMSSW_11_0_X_Patatrack
Conversation

@VinInn VinInn commented Dec 5, 2019

This is a preview of a new version of my Quick&Dirty "make CUDA kernels run on the CPU" approach:

  1. everything is now driven by a new C++ compile-time flag, CUDA_KERNELS_ON_CPU
  2. I have modified cudautils::launch to trivially invoke the kernel when the above flag is defined
  3. I have introduced a new make_cpu_unique and a corresponding specialization of unique_ptr that invoke malloc/free, for symmetry with CUDA (and to avoid calls to constructors and destructors, which anyhow are not called in the CUDA case at allocation time)
  4. I have modified the Traits to use the above

I have ported the Vertex producer; the implementation in gpuVertexFinderImpl.h now does not have ANY compile-time flag related to CPU or GPU.
The default is GPU.
In gpuVertexFinder.cc the first line is
#define CUDA_KERNELS_ON_CPU
even though cudaCompat still defines it if not compiled by nvcc; I plan to finish the port after the review.

The CPU and GPU code MUST be defined in different compilation units.
The CUDA kernel of course requires nvcc or clang.

Please see a few more comments inline.

#include<cstdio>

#undef __global__
#define __global__ inline __attribute__((always_inline))
Author

This is needed to avoid multiple definitions of the same symbol.

Author

An alternative is to also have the C++ definition in its own .cc (not inlined).


For this sample case, wouldn't it be enough to have everything in the .cu file?

@fwyzard fwyzard Dec 5, 2019

Sorry, of course that does not work...

What we are trying to do for Cupla and Alpaka is to have the whole implementation in something like test/implement/Launch_t.cc, and then let scram build two versions by having

test/Launch_t.cpp

#define CUDA_KERNELS_ON_CPU
#include "implement/Launch_t.cc"

test/Launch_t.cu

#include "implement/Launch_t.cc"

Author

Yes, but the idea of the test is to have a single file compiled by gcc twice (see the BuildFile), to verify that we can indeed launch kernels from gcc, and that the same code will instead run on the CPU if CUDA_KERNELS_ON_CPU is defined (in this case as a compiler option).
Of course, for CUDA we need the additional .cu file to compile the device code.
For symmetry one can argue that CPU kernels should be compiled in their own .cc (as in the end I do in the vertex producer, together with a minimal driver).
Still, for the CPU the code must be force-inlined to avoid multiple symbols.

So, in my opinion (at least with this model),

#define __global__ inline __attribute__((always_inline))

will be required for CPU code (and apparently does not harm CUDA code).
This is done in cudaCompat.h. I tried to keep this specific test as self-contained as possible.

Author

Yes, and that is my previous standard: one .h, one .cc, one .cu.
Here I tried one .cc and one .cu, the latter with ONLY the kernel, no driver code.
I want to test launching from code compiled with gcc (having in mind that both CPU and GPU code shall reside in the same "load units", which I agree is not the case in this test).
I can build two (or three) tests to see what is needed to have both GPU and CPU code compiled, loaded and then run in the same executable (possibly with the driver code compiled by gcc even in the GPU case).

@@ -295,14 +375,16 @@ int main() {
continue;
}

#ifdef __CUDACC__
#ifndef CUDA_KERNELS_ON_CPU
cudaCheck(cudaMemcpy(zv, LOC_ONGPU(zv), nv * sizeof(float), cudaMemcpyDeviceToHost));
cudaCheck(cudaMemcpy(wv, LOC_ONGPU(wv), nv * sizeof(float), cudaMemcpyDeviceToHost));
Author

This is a typical case where one does not want to do any memcpy on the CPU...


In general, the semantic can be one where making the copy is necessary also on cpu (e.g. because the function that launched the kernel does not keep it alive, or because the kernel makes changes to it that should not be reflected in the original buffer) or one where the copy is only required because of the different memory areas (e.g. the original buffer is guaranteed to stay alive, and the kernel does not make any changes to it).

Do you think it would make sense to define a couple of functions like cudautils::copy and cudautils::mirror ?
Then, the first could always be a copy (either cudaMemcpy or a plain copy) while the latter could be elided when running on the cpu.

Author

This is indeed a matter for discussion and prototyping.
I was thinking of some "magic" specialization of copy, and indeed your proposal is interesting, as it is semantically expressive.

In reality, in CMSSW production this never happens, as

  1. we copy to host memory using the specific constructs by Matti
  2. we do it explicitly in ad-hoc modules (SoAFromCUDA), and it is coded into the data format itself
    2a) what I currently do is give the producer in the CPU workflow the name of the SoAFromCUDA producer/converter in the GPU workflow, so that the SoAonCPU has the same name in both workflows

fwyzard commented Dec 5, 2019

3. I have introduced new  `make_cpu_unique` and corresponding specialization of `unique_ptr ` to invoke `malloc/free` for symmetry with cuda (and avoid calls to constructors and destructors that anyhow are not called in the cuda case at the time of allocation)

I think I understand the rationale (i.e. allocating/deallocating memory without calling the objects' constructors/destructors).
Is it only for optimisation purposes, or do we expect it to make a difference in behaviour?

IIRC, on the GPU side at some point we were checking that the types being allocated had a trivial constructor and destructor. Is that still the case? Would it make sense to check here as well?

VinInn commented Dec 5, 2019

Is it only for optimisation purposes, or do we expect it to make a difference in behaviour ?

Mostly optimization (see how messy the allocation was before for GPUCells).
I can expect some issues in behaviour if double initialization messes things up...

VinInn commented Dec 6, 2019

IIRC on the GPU side at some point we were checking that the types being allocated had a trivial constructor and destructor. Is that still the case ? Would it make sense to check here as well ?
We now have two interfaces, and make_cpu is an exact copy of make_device (except for malloc), so yes, the check is done.

VinInn commented Dec 6, 2019

Ported "PixelTriplets" as well.

@@ -72,7 +72,7 @@ namespace cudautils {
inline void launchFinalize(Histo *__restrict__ h,
uint8_t *__restrict__ ws
#ifndef __CUDACC__
= cudaStreamDefault
= nullptr

I am curious: was using cudaStreamDefault giving problems?

Author

No, to me nullptr makes more sense for a pointer (even if they are all == 0).

#include <functional>

#include <cstdlib>
#include <cuda_runtime.h>

From a look at the file, I think #include <cuda_runtime.h> could be removed?

} // namespace cpu

template <typename T>
typename cpu::impl::make_cpu_unique_selector<T>::non_array make_cpu_unique(cudaStream_t) {

Trying to better understand this: calling make_cpu_unique would be roughly equivalent to C++20's std::make_unique_default_init, plus it sets the deleter to just call free() instead of calling the destructors?

VinInn commented Dec 8, 2019 via email

VinInn commented Dec 8, 2019 via email

fwyzard commented Dec 8, 2019

OK, I think I only need to understand the Launch_t tests now :)

fwyzard commented Dec 8, 2019

Validation summary

Reference release CMSSW_11_0_0_pre13 at 91be707
Development branch cms-patatrack/CMSSW_11_0_X_Patatrack at d02f4be
Testing PRs:

Validation plots

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.501
  • tracking validation plots and summary for workflow 10824.502

/RelValZMM_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.501
  • tracking validation plots and summary for workflow 10824.502

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_design_v3-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.501
  • tracking validation plots and summary for workflow 10824.502

Throughput plots

/EphemeralHLTPhysics1/Run2018D-v1/RAW run=323775 lumi=53

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • reference release, workflow 10824.5
  • development release, workflow 10824.5
  • development release, workflow 10824.501
  • development release, workflow 10824.502
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.885502
  • testing release, workflow 10824.5
  • testing release, workflow 10824.501
  • testing release, workflow 10824.502
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 136.885502

/RelValZMM_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • reference release, workflow 10824.5
  • development release, workflow 10824.5
  • development release, workflow 10824.501
  • development release, workflow 10824.502
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.885502
  • testing release, workflow 10824.5
  • testing release, workflow 10824.501
  • testing release, workflow 10824.502
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 136.885502

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_design_v3-v1/GEN-SIM-DIGI-RAW

  • reference release, workflow 10824.5
  • development release, workflow 10824.5
  • development release, workflow 10824.501
  • development release, workflow 10824.502
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.885502
  • testing release, workflow 10824.5
  • testing release, workflow 10824.501
  • testing release, workflow 10824.502
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 136.885502

Logs

The full log is available at https://patatrack.web.cern.ch/patatrack/validation/pulls/8d8d3c765fe092664e44c32187379f4895cbc210/log .

VinInn commented Dec 8, 2019 via email

VinInn commented Dec 8, 2019 via email

VinInn commented Dec 9, 2019

Added an example of a "heterogeneous" analyzer using the "new" syntax.

@@ -94,10 +94,14 @@ namespace cudautils {
} // namespace detail

// wrappers for cudaLaunchKernel

inline

I will add the inline because it makes sense on its own.

void launch(void (*kernel)(), LaunchParameters config) {
#ifdef CUDA_KERNELS_ON_CPU

But I really, really do not want to add a dependency on #ifdefs etc. here.

Author

I will find a less intrusive solution.

fwyzard commented Dec 13, 2019 via email
