
[RFC] New version of CudaCompat #428

Open
wants to merge 24 commits into base: CMSSW_11_0_X_Patatrack
Conversation

@VinInn VinInn commented Dec 5, 2019

This is a preview of a new version of my Quick&Dirty "make CUDA kernels run on the CPU" approach:

  1. everything is now driven by a new C++ compile-time flag, CUDA_KERNELS_ON_CPU
  2. I have modified cudautils::launch to trivially invoke the kernel when the above flag is defined
  3. I have introduced a new make_cpu_unique and a corresponding specialization of unique_ptr that invoke malloc/free, for symmetry with CUDA (and to avoid calls to constructors and destructors, which anyhow are not called in the CUDA case at allocation time)
  4. I have modified the Traits to use the above

I have ported the Vertex producer; the implementation in gpuVertexFinderImpl.h now does not have ANY compile-time flag related to CPU or GPU.
The default is GPU.
In gpuVertexFinder.cc the first line is
#define CUDA_KERNELS_ON_CPU
even though cudaCompat still defines it if not compiled by nvcc; I plan to finish the port after the review.

The CPU and GPU code MUST be defined in different compilation units.
The CUDA kernel of course requires nvcc or clang.

Please see a few more comments inline.

#include<cstdio>

#undef __global__
#define __global__ inline __attribute__((always_inline))
Author

This is needed to avoid multiple definitions of the same symbol.

Author

An alternative is to also have the C++ definition in its own .cc (not inlined).


For this sample case, wouldn't it be enough to have everything in the .cu file?

@fwyzard fwyzard Dec 5, 2019

Sorry, of course that does not work...

What we are trying to do for Cupla and Alpaka is to have the whole implementation in something like test/implement/Launch_t.cc, and then let scram build two versions by having

test/Launch_t.cpp

#define CUDA_KERNELS_ON_CPU
#include "implement/Launch_t.cc"

test/Launch_t.cu

#include "implement/Launch_t.cc"

Author

Yes, but the idea of the test is to have a single file compiled by gcc twice (see the BuildFile), to verify that we can indeed launch kernels from gcc, and that the same code will instead run on the CPU if CUDA_KERNELS_ON_CPU is defined (in this case as a compiler option).
Of course, for CUDA we need the additional .cu file to compile the device code.
For symmetry one can argue that CPU kernels should be compiled in their own .cc (as in the end I do in the vertex producer, together with a minimal driver).
Still, for the CPU the code must be force-inlined to avoid multiple symbols.

So, in my opinion (at least with this model),

#define __global__ inline __attribute__((always_inline))

will be required for CPU code (and apparently does not harm CUDA code).
This is done in cudaCompat.h. I tried to keep this specific test as self-contained as possible.

Author

Yes, and that is my previous standard: one .h, one .cc, one .cu.
Here I tried one .cc and one .cu, the latter with ONLY the kernel, no driver code.
I want to test launching from code compiled with gcc (having in mind that both CPU and GPU code shall reside in the same "load units", which I agree is not the case in this test).
I can build two (or three) tests to see what is needed to have both GPU and CPU code compiled, loaded and then run in the same executable (possibly with the driver code compiled by gcc even in the GPU case).

@@ -295,14 +375,16 @@ int main() {
continue;
}

#ifdef __CUDACC__
#ifndef CUDA_KERNELS_ON_CPU
cudaCheck(cudaMemcpy(zv, LOC_ONGPU(zv), nv * sizeof(float), cudaMemcpyDeviceToHost));
cudaCheck(cudaMemcpy(wv, LOC_ONGPU(wv), nv * sizeof(float), cudaMemcpyDeviceToHost));
Author

This is a typical case where one does not want to do any memcpy on the CPU...


In general, the semantic can be one where making the copy is necessary also on cpu (e.g. because the function that launched the kernel does not keep it alive, or because the kernel makes changes to it that should not be reflected in the original buffer) or one where the copy is only required because of the different memory areas (e.g. the original buffer is guaranteed to stay alive, and the kernel does not make any changes to it).

Do you think it would make sense to define a couple of functions like cudautils::copy and cudautils::mirror ?
Then, the first could always be a copy (either cudaMemcpy or a plain copy) while the latter could be elided when running on the cpu.

Author

This is indeed a matter for discussion and prototyping.
I was thinking of some "magic" specialization of copy, and indeed your proposal is interesting, as it is semantically expressive.

In reality, in CMSSW production this never happens, as

  1. we copy to host memory using the specific constructs by Matti
  2. we do it explicitly in ad-hoc modules (SoAFromCUDA), and it is coded into the data format itself
    2a) what I currently do is give the producer in the CPU workflow the name of the SoAFromCUDA producer/converter in the GPU workflow, so that the SoAonCPU has the same name in both workflows

fwyzard commented Dec 5, 2019

3. I have introduced new  `make_cpu_unique` and corresponding specialization of `unique_ptr ` to invoke `malloc/free` for symmetry with cuda (and avoid calls to constructors and destructors that anyhow are not called in the cuda case at the time of allocation)

I think I understand the rationale (i.e. allocating/deallocating memory without calling the objects' constructors/destructors).
Is it only for optimisation purposes, or do we expect it to make a difference in behaviour?

IIRC, on the GPU side at some point we were checking that the types being allocated had a trivial constructor and destructor. Is that still the case? Would it make sense to check here as well?

VinInn commented Dec 5, 2019

Is it only for optimisation purposes, or do we expect it to make a difference in behaviour ?

Mostly optimization (see how messy the allocation was before for GPUCells).
I can expect some issues in behaviour if double initialization messes things up...

VinInn commented Dec 6, 2019

IIRC on the GPU side at some point we were checking that the types being allocated had a trivial constructor and destructor. Is that still the case ? Would it make sense to check here as well ?
We now have two interfaces, and make_cpu is an exact copy of make_device (except for malloc), so yes, the check is done.

VinInn commented Dec 6, 2019

Ported "PixelTriplets" as well.

@@ -72,7 +72,7 @@ namespace cudautils {
inline void launchFinalize(Histo *__restrict__ h,
uint8_t *__restrict__ ws
#ifndef __CUDACC__
= cudaStreamDefault
= nullptr

I am curious: was using cudaStreamDefault giving problems?

Author

No, to me nullptr makes more sense for a pointer (even if they are all == 0).

#include <functional>

#include <cstdlib>
#include <cuda_runtime.h>

From a look at the file, I think #include <cuda_runtime.h> could be removed?

} // namespace cpu

template <typename T>
typename cpu::impl::make_cpu_unique_selector<T>::non_array make_cpu_unique(cudaStream_t) {

Trying to better understand this: calling make_cpu_unique would be roughly equivalent to C++20's std::make_unique_default_init, plus it sets the deleter to just call free() instead of calling the destructors?

VinInn commented Dec 8, 2019 via email

VinInn commented Dec 8, 2019 via email

fwyzard commented Dec 8, 2019

OK, I think I only need to understand the Launch_t tests now :)

fwyzard commented Dec 8, 2019

Validation summary

Reference release CMSSW_11_0_0_pre13 at 91be707
Development branch cms-patatrack/CMSSW_11_0_X_Patatrack at d02f4be
Testing PRs:

Validation plots

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.501
  • tracking validation plots and summary for workflow 10824.502

/RelValZMM_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.501
  • tracking validation plots and summary for workflow 10824.502

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_design_v3-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.501
  • tracking validation plots and summary for workflow 10824.502

Throughput plots

/EphemeralHLTPhysics1/Run2018D-v1/RAW run=323775 lumi=53

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • reference release, workflow 10824.5
  • development release, workflow 10824.5
  • development release, workflow 10824.501
  • development release, workflow 10824.502
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.885502
  • testing release, workflow 10824.5
  • testing release, workflow 10824.501
  • testing release, workflow 10824.502
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 136.885502

/RelValZMM_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • reference release, workflow 10824.5
  • development release, workflow 10824.5
  • development release, workflow 10824.501
  • development release, workflow 10824.502
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.885502
  • testing release, workflow 10824.5
  • testing release, workflow 10824.501
  • testing release, workflow 10824.502
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 136.885502

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_design_v3-v1/GEN-SIM-DIGI-RAW

  • reference release, workflow 10824.5
  • development release, workflow 10824.5
  • development release, workflow 10824.501
  • development release, workflow 10824.502
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.885502
  • testing release, workflow 10824.5
  • testing release, workflow 10824.501
  • testing release, workflow 10824.502
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 136.885502

Logs

The full log is available at https://patatrack.web.cern.ch/patatrack/validation/pulls/8d8d3c765fe092664e44c32187379f4895cbc210/log .

VinInn commented Dec 8, 2019 via email

VinInn commented Dec 8, 2019 via email

VinInn commented Dec 9, 2019

Added an example of a "heterogeneous" analyzer using the "new" syntax.

@@ -94,10 +94,14 @@ namespace cudautils {
} // namespace detail

// wrappers for cudaLaunchKernel

inline

I will add the inline because it makes sense on its own.

void launch(void (*kernel)(), LaunchParameters config) {
#ifdef CUDA_KERNELS_ON_CPU

But I really, really do not want to add a dependency on #ifdefs etc. here.

Author

I will find a less intrusive solution.

fwyzard commented Dec 13, 2019 via email
