Evolution of the Alpaka "gpu framework" #39428

makortel · 2022-09-16T18:46:30Z

PR description:

This PR proposes an evolution of the "gpu framework" for Alpaka EDModules and ESModules. The current model (introduced in #38855) is similar to the model we have for CUDA EDModules, whereas this development builds on earlier prototypes (in cms-patatrack/pixeltrack-standalone#224, cms-patatrack/pixeltrack-standalone#256, cms-patatrack/pixeltrack-standalone#257 for CUDA, and in cms-patatrack/pixeltrack-standalone#314 for Alpaka). The new model was presented in the HLT GPU developments meeting in https://indico.cern.ch/event/1192252/#4-new-cmssw-accelerator-framew .

The two models can coexist. The EDModules/ESModules of the current model can not access the device data products of the new model. I believe (did not test) the new model EDModules would be able to use the ESProducts of the current model, and if really necessary, they could be made to use the EDProducts of the current model (but I'd prefer to avoid that).

Shortcomings of the CUDA model that this PR addresses are

Cumbersome interface (many reasons are listed in Improve CUDA EDProducers #30266)
- New interfaces resemble the interfaces elsewhere in the framework
ScopedContext* destructor does not check for errors in CUDA API calls in order to avoid exception being thrown from a destructor
- Now these functions are called from normal functions (in a base class)
Situation where a CUDA EDProducer produces at least two device products that are consumed by at least two different consumers can lead to situation where one CUDA stream is shared by independent branches of the DAG (described in more detail in [cudadev][RFC] Next version of CUDA EDModule API cms-patatrack/pixeltrack-standalone#224)
- The "metadata" object holding the Queue is now shared across all device products of an EDProducer
Requires CUDA runtime API calls in the EventSetup data formats, spreading the CUDA dependence wide in CMSSW
- All device API functions are now called in ESProducer, data formats only hold the data. The device runtime dependence is still there though (because the memory buffers have explicit dependence), but limited to the shared libraries of the corresponding backends
Effectively prevents kernel calls in ESProducers to produce some of the data, and an ESProducer to consume device data
- Both are now allowed
In the EventSetup system, the transferred-from-host memory is kept alive for the entire duration of the IOV, whereas the memory could be released as soon as all transfers have finished (if the host-memory product is not consumed for anything else)
- The model here allows an earlier (pinned) host memory release via the use of edm::ESTransientHandle. We'd still need to enable the actual memory optimization (Look into enabling edm::ESTransientHandle memory optimization again #31085)
EDModules check on every event if the ESProduct data is ready
- Now the ESProducts are enforced to be fully ready before consuming modules are run

The device product wrapper, that is currently cms::cuda::Product<T> / cms::alpakatools::Product<TQueue, T> is now made part of the framework as edm::DeviceProduct<T>. Making it a framework type allows some future developments that would be difficult without. Currently the backend/memory space specific metadata are included in the edm::DeviceProduct<T> in a type-erased way, such that only the correct backend code can access the data. This may stay or change in the future.

This PR makes the host-synchronous backend(s) to produce the T directly instead of wrapping it to edm::DeviceProduct<T> (which was left as future improvement in #38855).

The PR explores three possible models for ESProducts (which are indicated in comments in the code, these were also summarized in the slides linked above)

The product class is fully defined inside ALPAKA_ACCELERATOR_NAMESPACE
The product class is defined as a template with Device template argument
The product class is defined as PortableCollection

The PR currently has ~~three~~ two limitations compared to the current models

~~The automatic transfers from device to host are synchronous.~~ The automatic transfer from device to host are now asynchronous (with the aid of Added transformAsync ability to modules #39557)
- ~~This shortcoming will be improved in a future PR after an asynchronous version of Transformer module ability has been developed (as planned in Add Transformer ability to modules #38454)~~
- In the mean time, the blocking synchronization can be worked around with an explicit "Transcriber" module that uses the stream::SynchronizingEDProducer (i.e. ExternalWork module ability) as in Integration of alpaka and macro-based SoA in CMSSW #38855.
The system is limited to one device per backend, because of the EventSetup model introduced here.
- This shortcoming will be improved in the future by making the framework aware of different memory and execution spaces (to some degree).
- For the mean time, I'll craft a temporary solution that would wrap the ESProducts for all devices into one ESProduct (similarly to the current model)
The operations launched in ESProducer are synchronized in a blocking way before the consuming modules are run
- This will be improved in a future PR that will add a functionality similar to ExternalWork to ESProducers (Add ExternalWork-like mechanism for ESProducers #24185)
- In the mean time I'd expect the impact of these synchronizations to be minimal in production
  - The current CUDA system also leads to device-host synchronizations around IOV boundaries because of the memory allocation patterns (as the caching allocator is not used).
  - As far as I've understood, at the HLT the Records that have device ESProducts have IOVs that much longer than a single lumi, which limits the main impact to the beginning of the jobs
  - The transfers are initiated earlier and concurrently

The code has many TODO: comments of which most are meant for further evolution after this PR. I'm planning to address the following in this PR

Temporary mitigation ESProducts to be transferred for "every device" of a backend
Look into replacing most/all #ifdef for backend-specific behavior with if constexpr (or equivalent)
Few files that depend at most on Alpaka and FWCore/Utilities and are specific to CMS are currently placed in HeterogeneousCore/AlpakaCore and HeterogeneousCore/AlpakaInterface. They should be moved out from the former, but the latter seems non-ideal as well. New package?
Finalize the naming convention
- I'm specifically thinking the placement of Device in the names, currently they are along DeviceEvent, EDDeviceGetToken, global::EDProducer (i.e. no Device in name).
  - I wonder if there could be confusion between edm::global::EDProducer and ALPAKA_ACCELERATOR_NAMESPACE::global::EDProducer (where only the global::EDProducer part is visible in code)
  - Alternative could be along DeviceEvent, DeviceEDGetToken, global::DeviceEDProducer
  - Or device::Event, device::EDGetToken, device::global::EDProducer (all still enclosed in ALPAKA_ACCELERATOR_NAMESPACE)
~~Add README.md documenting the interface~~ (descoped from this PR)

PR validation:

Added unit test passes on both CPU and GPU.

cmsbuild · 2022-09-16T18:53:11Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-39428/32137

This PR adds an extra 48KB to repository

cmsbuild · 2022-09-16T18:53:37Z

A new Pull Request was created by @makortel (Matti Kortelainen) for master.

It involves the following packages:

DataFormats/Common (core)
DataFormats/Portable (heterogeneous)
DataFormats/PortableTestObjects (heterogeneous)
HeterogeneousCore/AlpakaCore (heterogeneous)
HeterogeneousCore/AlpakaInterface (heterogeneous)
HeterogeneousCore/AlpakaTest (heterogeneous)

@cmsbuild, @smuzaffar, @Dr15Jones, @makortel, @fwyzard can you please review it and eventually sign? Thanks.
@missirol, @wddgit, @rovere this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.

cms-bot commands are listed here

makortel · 2022-09-16T18:56:11Z

enable gpu

makortel · 2022-09-16T18:56:17Z

@cmsbuild, please test

makortel · 2022-09-16T18:56:55Z

DataFormats/Portable/interface/alpaka/PortableCollection.h

@@ -39,4 +40,20 @@ namespace traits {

 }  // namespace traits

+namespace cms::alpakatools {
+  // TODO: Is this the right place for the specialization? Or should it be in PortableDeviceProduct?


@fwyzard Any thoughts on the placement of this template specialization?

As you comment in HeterogeneousCore/AlpakaCore/interface/alpaka/ProducerBase.h,

a specialization of cms::alpakatools::TransferToHost template [...] should be provided in the same file where T is defined.

So I think it makes sense to have the specialisation for ALPAKA_ACCELERATOR_NAMESPACE::PortableCollection<T> together with the declaration of ALPAKA_ACCELERATOR_NAMESPACE::PortableCollection<T>

I guess the alternative would be to specialize the TransferToHost for PortableDeviceCollection<T, TDev>, which for all practical purposes should be sufficient. That would also avoid the use of ALPAKA_ACCELERATOR_NAMESPACE in the specialization definition. (and I'm still uncertain which one would be better)

makortel · 2022-09-16T19:00:49Z

HeterogeneousCore/AlpakaCore/interface/alpaka/EDMetadata.h

+   */
+
+#ifdef ALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED
+  // All backends with a synchronous queue


@fwyzard I think all other uses of these #ifdefs to check if the backend's device is host, and/or if the queue is blocking, could be easily replaced with e.g. if constexpr and traits, except this one. I think it would be technically possible here as well, but I think it would be more cumbersome as the classes are quite different for the two case.

makortel · 2022-09-16T19:02:46Z

HeterogeneousCore/AlpakaCore/interface/alpaka/ESDeviceProduct.h

@@ -0,0 +1,52 @@
+#ifndef HeterogeneousCore_AlpakaCore_interface_alpaka_ESDeviceProduct_h
+#define HeterogeneousCore_AlpakaCore_interface_alpaka_ESDeviceProduct_h


This file is one of those which should not really be in HeterogeneousCore/AlpakaCore because of the package depending on FWCore/Framework, it depends only on Alpaka itself, but is specific to CMS (so would not really fit in HeterogeneousCore/AlpakaInterface).

makortel · 2022-09-16T19:04:14Z

HeterogeneousCore/AlpakaCore/interface/alpaka/typelookup.h

+// TODO: this place is suboptimal given that the framework's
+// typelookup.h is in FWCore/Utilities (i.e. independent of the
+// framework itself). The ESDeviceProduct should be in the same
+// package.


This file is one of those which should not really be in HeterogeneousCore/AlpakaCore because of the package depending on FWCore/Framework. The code here depends only on Alpaka and FWCore/Utilities, and is specific to CMS (so would not really fit in HeterogeneousCore/AlpakaInterface).

makortel · 2022-09-16T19:09:01Z

HeterogeneousCore/AlpakaCore/src/alpaka/chooseDevice.cc

+    // process.options.accelerators to AlpakaServices via
+    // ProcessAcceleratorAlpaka.
+    edm::Service<ALPAKA_ACCELERATOR_NAMESPACE::AlpakaService> alpakaService;
+    if (not alpakaService->enabled()) {


I decided to duplicate the chooseDevice() from cms::alpakatools::chooseDevice() in order to avoid the templating, add easily the dependence on the corresponding AlpakaService (following the discussion in #39319 (comment)).

makortel · 2022-09-16T19:11:20Z

HeterogeneousCore/AlpakaInterface/interface/TransferToHost.h

+#ifndef HeterogeneousCore_AlpakaInterface_interface_TransferToHost_h
+#define HeterogeneousCore_AlpakaInterface_interface_TransferToHost_h
+
+// TODO: better package?


This file is one of those that should not be in a package depending on FWCore/Framework (but in one that fulfills the DataFormat constraints), but is specific to CMS so would not really fit in HeterogeneousCore/AlpakaInterface.

HeterogeneousCore/AlpakaCore/interface/alpaka/ESProducer.h

cmsbuild · 2022-09-16T23:00:10Z

-1

Failed Tests: HeaderConsistency UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2aec6d/27615/summary.html
COMMIT: d702fe1
CMSSW: CMSSW_12_6_X_2022-09-16-1100/el8_amd64_gcc10
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/39428/27615/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found errors in the following unit tests:

---> test test-das-selected-lumis had ERRORS

GPU Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 0 differences found in the comparisons
Reco comparison had 3 failed jobs
DQMHistoTests: Total files compared: 4
DQMHistoTests: Total histograms compared: 19876
DQMHistoTests: Total failures: 9
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 19867
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
Checked 12 log files, 9 edm output root files, 4 DQM output files
TriggerResults: no differences found

Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 5 differences found in the comparisons
DQMHistoTests: Total files compared: 51
DQMHistoTests: Total histograms compared: 3618332
DQMHistoTests: Total failures: 8
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3618302
DQMHistoTests: Total skipped: 22
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 50 files compared)
Checked 212 log files, 49 edm output root files, 51 DQM output files
TriggerResults: no differences found

cmsbuild · 2022-11-22T19:17:00Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-39428/33125

This PR adds an extra 12KB to repository

cmsbuild · 2022-11-22T19:17:26Z

Pull request #39428 was updated. @cmsbuild, @smuzaffar, @Dr15Jones, @makortel, @fwyzard can you please check and sign again.

makortel · 2022-11-22T19:21:05Z

@cmsbuild, please test

Mostly for the record, the diff of the PR is the same as before.

makortel · 2022-11-22T19:21:28Z

+heterogeneous

(taking the liberty since there are no code changes since @fwyzard signing)

makortel · 2022-11-22T19:21:34Z

+core

cmsbuild · 2022-11-22T19:21:50Z

This pull request is fully signed and it will be integrated in one of the next master IBs after it passes the integration tests. This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

cmsbuild · 2022-11-22T23:41:16Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2aec6d/29196/summary.html
COMMIT: edb7cda
CMSSW: CMSSW_12_6_X_2022-11-22-1100/el8_amd64_gcc10
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/39428/29196/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 8 differences found in the comparisons
DQMHistoTests: Total files compared: 48
DQMHistoTests: Total histograms compared: 3417239
DQMHistoTests: Total failures: 6
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3417211
DQMHistoTests: Total skipped: 22
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
Checked 206 log files, 48 edm output root files, 48 DQM output files
TriggerResults: no differences found

GPU Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 9 differences found in the comparisons
DQMHistoTests: Total files compared: 4
DQMHistoTests: Total histograms compared: 19876
DQMHistoTests: Total failures: 605
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 19271
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
Checked 12 log files, 9 edm output root files, 4 DQM output files
TriggerResults: no differences found

rappoccio · 2022-11-22T23:54:04Z

+1

cmsbuild added this to the CMSSW_12_6_X milestone Sep 16, 2022

cmsbuild added code-checks-pending core-pending heterogeneous-pending orp-pending pending-signatures tests-pending labels Sep 16, 2022

cmsbuild added code-checks-approved and removed code-checks-pending labels Sep 16, 2022

makortel mentioned this pull request Sep 16, 2022

Add ExternalWork-like mechanism for ESProducers #24185

Open

cmsbuild added tests-started and removed tests-pending labels Sep 16, 2022

makortel commented Sep 16, 2022

View reviewed changes

HeterogeneousCore/AlpakaCore/interface/alpaka/ESProducer.h Show resolved Hide resolved

cmsbuild added tests-rejected and removed tests-started tests-rejected labels Sep 16, 2022

cmsbuild added code-checks-pending core-pending heterogeneous-pending pending-signatures tests-pending labels Nov 22, 2022

cmsbuild added code-checks-approved and removed code-checks-pending labels Nov 22, 2022

cmsbuild added tests-started and removed tests-pending labels Nov 22, 2022

cmsbuild added core-approved fully-signed heterogeneous-approved and removed core-pending pending-signatures heterogeneous-pending labels Nov 22, 2022

cmsbuild added tests-approved and removed tests-started labels Nov 22, 2022

cmsbuild added orp-approved and removed orp-pending labels Nov 22, 2022

cmsbuild merged commit 08fcc8e into cms-sw:master Nov 22, 2022

makortel mentioned this pull request Nov 24, 2022

Follow-ups to "new Alpaka GPU framework" #40140

Open

20 tasks

makortel deleted the alpakaFramework branch January 13, 2023 15:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Evolution of the Alpaka "gpu framework" #39428

Evolution of the Alpaka "gpu framework" #39428

makortel commented Sep 16, 2022 •

edited

Loading

cmsbuild commented Sep 16, 2022

cmsbuild commented Sep 16, 2022

makortel commented Sep 16, 2022

makortel commented Sep 16, 2022

makortel Sep 16, 2022

fwyzard Nov 8, 2022

makortel Nov 9, 2022

makortel Sep 16, 2022

makortel Sep 16, 2022

makortel Sep 16, 2022

makortel Sep 16, 2022

makortel Sep 16, 2022

cmsbuild commented Sep 16, 2022

cmsbuild commented Nov 22, 2022

cmsbuild commented Nov 22, 2022

makortel commented Nov 22, 2022

makortel commented Nov 22, 2022

makortel commented Nov 22, 2022

cmsbuild commented Nov 22, 2022

cmsbuild commented Nov 22, 2022

rappoccio commented Nov 22, 2022

		@@ -0,0 +1,52 @@
		#ifndef HeterogeneousCore_AlpakaCore_interface_alpaka_ESDeviceProduct_h
		#define HeterogeneousCore_AlpakaCore_interface_alpaka_ESDeviceProduct_h

Evolution of the Alpaka "gpu framework" #39428

Evolution of the Alpaka "gpu framework" #39428

Conversation

makortel commented Sep 16, 2022 • edited Loading

PR description:

PR validation:

cmsbuild commented Sep 16, 2022

cmsbuild commented Sep 16, 2022

makortel commented Sep 16, 2022

makortel commented Sep 16, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cmsbuild commented Sep 16, 2022

Unit Tests

GPU Comparison Summary

Comparison Summary

cmsbuild commented Nov 22, 2022

cmsbuild commented Nov 22, 2022

makortel commented Nov 22, 2022

makortel commented Nov 22, 2022

makortel commented Nov 22, 2022

cmsbuild commented Nov 22, 2022

cmsbuild commented Nov 22, 2022

Comparison Summary

GPU Comparison Summary

rappoccio commented Nov 22, 2022

makortel commented Sep 16, 2022 •

edited

Loading