Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Evolution of the Alpaka "gpu framework" #39428

Merged
merged 3 commits into from
Nov 22, 2022

Conversation

makortel
Copy link
Contributor

@makortel makortel commented Sep 16, 2022

PR description:

This PR proposes an evolution of the "gpu framework" for Alpaka EDModules and ESModules. The current model (introduced in #38855) is similar to the model we have for CUDA EDModules, whereas this development builds on earlier prototypes (in cms-patatrack/pixeltrack-standalone#224, cms-patatrack/pixeltrack-standalone#256, cms-patatrack/pixeltrack-standalone#257 for CUDA, and in cms-patatrack/pixeltrack-standalone#314 for Alpaka). The new model was presented in the HLT GPU developments meeting in https://indico.cern.ch/event/1192252/#4-new-cmssw-accelerator-framew .

The two models can coexist. The EDModules/ESModules of the current model can not access the device data products of the new model. I believe (did not test) the new model EDModules would be able to use the ESProducts of the current model, and if really necessary, they could be made to use the EDProducts of the current model (but I'd prefer to avoid that).

Shortcomings of the CUDA model that this PR addresses are

  • Cumbersome interface (many reasons are listed in Improve CUDA EDProducers #30266)
    • New interfaces resemble the interfaces elsewhere in the framework
  • ScopedContext* destructor does not check for errors in CUDA API calls in order to avoid exception being thrown from a destructor
    • Now these functions are called from normal functions (in a base class)
  • Situation where a CUDA EDProducer produces at least two device products that are consumed by at least two different consumers can lead to situation where one CUDA stream is shared by independent branches of the DAG (described in more detail in [cudadev][RFC] Next version of CUDA EDModule API cms-patatrack/pixeltrack-standalone#224)
    • The "metadata" object holding the Queue is now shared across all device products of an EDProducer
  • Requires CUDA runtime API calls in the EventSetup data formats, spreading the CUDA dependence wide in CMSSW
    • All device API functions are now called in ESProducer, data formats only hold the data. The device runtime dependence is still there though (because the memory buffers have explicit dependence), but limited to the shared libraries of the corresponding backends
  • Effectively prevents kernel calls in ESProducers to produce some of the data, and an ESProducer to consume device data
    • Both are now allowed
  • In the EventSetup system, the transferred-from-host memory is kept alive for the entire duration of the IOV, whereas the memory could be released as soon as all transfers have finished (if the host-memory product is not consumed for anything else)
  • EDModules check on every event if the ESProduct data is ready
    • Now the ESProducts are enforced to be fully ready before consuming modules are run

The device product wrapper, that is currently cms::cuda::Product<T> / cms::alpakatools::Product<TQueue, T> is now made part of the framework as edm::DeviceProduct<T>. Making it a framework type allows some future developments that would be difficult without. Currently the backend/memory space specific metadata are included in the edm::DeviceProduct<T> in a type-erased way, such that only the correct backend code can access the data. This may stay or change in the future.

This PR makes the host-synchronous backend(s) to produce the T directly instead of wrapping it to edm::DeviceProduct<T> (which was left as future improvement in #38855).

The PR explores three possible models for ESProducts (which are indicated in comments in the code, these were also summarized in the slides linked above)

  1. The product class is fully defined inside ALPAKA_ACCELERATOR_NAMESPACE
  2. The product class is defined as a template with Device template argument
  3. The product class is defined as PortableCollection

The PR currently has three two limitations compared to the current models

  1. The automatic transfers from device to host are synchronous. The automatic transfer from device to host are now asynchronous (with the aid of Added transformAsync ability to modules #39557)
  2. The system is limited to one device per backend, because of the EventSetup model introduced here.
    • This shortcoming will be improved in the future by making the framework aware of different memory and execution spaces (to some degree).
    • For the mean time, I'll craft a temporary solution that would wrap the ESProducts for all devices into one ESProduct (similarly to the current model)
  3. The operations launched in ESProducer are synchronized in a blocking way before the consuming modules are run
    • This will be improved in a future PR that will add a functionality similar to ExternalWork to ESProducers (Add ExternalWork-like mechanism for ESProducers #24185)
    • In the mean time I'd expect the impact of these synchronizations to be minimal in production
      • The current CUDA system also leads to device-host synchronizations around IOV boundaries because of the memory allocation patterns (as the caching allocator is not used).
      • As far as I've understood, at the HLT the Records that have device ESProducts have IOVs that much longer than a single lumi, which limits the main impact to the beginning of the jobs
      • The transfers are initiated earlier and concurrently

The code has many TODO: comments of which most are meant for further evolution after this PR. I'm planning to address the following in this PR

  • Temporary mitigation ESProducts to be transferred for "every device" of a backend
  • Look into replacing most/all #ifdef for backend-specific behavior with if constexpr (or equivalent)
  • Few files that depend at most on Alpaka and FWCore/Utilities and are specific to CMS are currently placed in HeterogeneousCore/AlpakaCore and HeterogeneousCore/AlpakaInterface. They should be moved out from the former, but the latter seems non-ideal as well. New package?
  • Finalize the naming convention
    • I'm specifically thinking the placement of Device in the names, currently they are along DeviceEvent, EDDeviceGetToken, global::EDProducer (i.e. no Device in name).
      • I wonder if there could be confusion between edm::global::EDProducer and ALPAKA_ACCELERATOR_NAMESPACE::global::EDProducer (where only the global::EDProducer part is visible in code)
      • Alternative could be along DeviceEvent, DeviceEDGetToken, global::DeviceEDProducer
      • Or device::Event, device::EDGetToken, device::global::EDProducer (all still enclosed in ALPAKA_ACCELERATOR_NAMESPACE)
  • Add README.md documenting the interface (descoped from this PR)

PR validation:

Added unit test passes on both CPU and GPU.

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-39428/32137

  • This PR adds an extra 48KB to repository

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @makortel (Matti Kortelainen) for master.

It involves the following packages:

  • DataFormats/Common (core)
  • DataFormats/Portable (heterogeneous)
  • DataFormats/PortableTestObjects (heterogeneous)
  • HeterogeneousCore/AlpakaCore (heterogeneous)
  • HeterogeneousCore/AlpakaInterface (heterogeneous)
  • HeterogeneousCore/AlpakaTest (heterogeneous)

@cmsbuild, @smuzaffar, @Dr15Jones, @makortel, @fwyzard can you please review it and eventually sign? Thanks.
@missirol, @wddgit, @rovere this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.

cms-bot commands are listed here

@makortel
Copy link
Contributor Author

enable gpu

@makortel
Copy link
Contributor Author

@cmsbuild, please test

@@ -39,4 +40,20 @@ namespace traits {

} // namespace traits

namespace cms::alpakatools {
// TODO: Is this the right place for the specialization? Or should it be in PortableDeviceProduct?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fwyzard Any thoughts on the placement of this template specialization?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you comment in HeterogeneousCore/AlpakaCore/interface/alpaka/ProducerBase.h,

a specialization of cms::alpakatools::TransferToHost template [...] should be provided in the same file where T is defined.

So I think it makes sense to have the specialisation for ALPAKA_ACCELERATOR_NAMESPACE::PortableCollection<T> together with the declaration of ALPAKA_ACCELERATOR_NAMESPACE::PortableCollection<T>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the alternative would be to specialize the TransferToHost for PortableDeviceCollection<T, TDev>, which for all practical purposes should be sufficient. That would also avoid the use of ALPAKA_ACCELERATOR_NAMESPACE in the specialization definition. (and I'm still uncertain which one would be better)

*/

#ifdef ALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED
// All backends with a synchronous queue
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fwyzard I think all other uses of these #ifdefs to check if the backend's device is host, and/or if the queue is blocking, could be easily replaced with e.g. if constexpr and traits, except this one. I think it would be technically possible here as well, but I think it would be more cumbersome as the classes are quite different for the two case.

@@ -0,0 +1,52 @@
#ifndef HeterogeneousCore_AlpakaCore_interface_alpaka_ESDeviceProduct_h
#define HeterogeneousCore_AlpakaCore_interface_alpaka_ESDeviceProduct_h
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is one of those which should not really be in HeterogeneousCore/AlpakaCore because of the package depending on FWCore/Framework, it depends only on Alpaka itself, but is specific to CMS (so would not really fit in HeterogeneousCore/AlpakaInterface).

// TODO: this place is suboptimal given that the framework's
// typelookup.h is in FWCore/Utilities (i.e. independent of the
// framework itself). The ESDeviceProduct should be in the same
// package.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is one of those which should not really be in HeterogeneousCore/AlpakaCore because of the package depending on FWCore/Framework. The code here depends only on Alpaka and FWCore/Utilities, and is specific to CMS (so would not really fit in HeterogeneousCore/AlpakaInterface).

// process.options.accelerators to AlpakaServices via
// ProcessAcceleratorAlpaka.
edm::Service<ALPAKA_ACCELERATOR_NAMESPACE::AlpakaService> alpakaService;
if (not alpakaService->enabled()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to duplicate the chooseDevice() from cms::alpakatools::chooseDevice() in order to avoid the templating, add easily the dependence on the corresponding AlpakaService (following the discussion in #39319 (comment)).

#ifndef HeterogeneousCore_AlpakaInterface_interface_TransferToHost_h
#define HeterogeneousCore_AlpakaInterface_interface_TransferToHost_h

// TODO: better package?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is one of those that should not be in a package depending on FWCore/Framework (but in one that fulfills the DataFormat constraints), but is specific to CMS so would not really fit in HeterogeneousCore/AlpakaInterface.

@cmsbuild
Copy link
Contributor

-1

Failed Tests: HeaderConsistency UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2aec6d/27615/summary.html
COMMIT: d702fe1
CMSSW: CMSSW_12_6_X_2022-09-16-1100/el8_amd64_gcc10
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/39428/27615/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found errors in the following unit tests:

---> test test-das-selected-lumis had ERRORS

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • Reco comparison had 3 failed jobs
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19876
  • DQMHistoTests: Total failures: 9
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 19867
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 5 differences found in the comparisons
  • DQMHistoTests: Total files compared: 51
  • DQMHistoTests: Total histograms compared: 3618332
  • DQMHistoTests: Total failures: 8
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3618302
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 50 files compared)
  • Checked 212 log files, 49 edm output root files, 51 DQM output files
  • TriggerResults: no differences found

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-39428/33125

  • This PR adds an extra 12KB to repository

@cmsbuild
Copy link
Contributor

Pull request #39428 was updated. @cmsbuild, @smuzaffar, @Dr15Jones, @makortel, @fwyzard can you please check and sign again.

@makortel
Copy link
Contributor Author

@cmsbuild, please test

Mostly for the record, the diff of the PR is the same as before.

@makortel
Copy link
Contributor Author

+heterogeneous

(taking the liberty since there are no code changes since @fwyzard signing)

@makortel
Copy link
Contributor Author

+core

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs after it passes the integration tests. This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2aec6d/29196/summary.html
COMMIT: edb7cda
CMSSW: CMSSW_12_6_X_2022-11-22-1100/el8_amd64_gcc10
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/39428/29196/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3417239
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3417211
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 206 log files, 48 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 9 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19876
  • DQMHistoTests: Total failures: 605
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 19271
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

@rappoccio
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit 08fcc8e into cms-sw:master Nov 22, 2022
@makortel makortel deleted the alpakaFramework branch January 13, 2023 15:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants