

Extend onnxruntime gpu interface to producers using onnxruntime #39402

Open
wants to merge 7 commits into
base: master
from

Conversation

davidlange6
Contributor

Extends #36963 by adding a backend parameter to models, used by

cms::Ort::getSessionOptions(iConfig.getParameter<std::string>("onnx_backend"));

Current options are:
cpu -> Use the CPU backend
cuda -> Use the CUDA backend
default -> Use the best available backend
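A minimal sketch of how the `onnx_backend` string could be mapped to a backend choice, including resolving `default` to CUDA only when a GPU is present. `OnnxBackend`, `parseOnnxBackend` and `resolveBackend` are hypothetical names, not the code in this PR, and throwing when `cuda` is requested without a GPU is an assumption.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical backend enum; the values mirror the three "onnx_backend"
// options listed above and are not a confirmed CMSSW type.
enum class OnnxBackend { cpu, cuda, best };

// Map the configuration string to a backend, rejecting unknown values.
OnnxBackend parseOnnxBackend(const std::string& name) {
  if (name == "cpu") return OnnxBackend::cpu;
  if (name == "cuda") return OnnxBackend::cuda;
  if (name == "default") return OnnxBackend::best;
  throw std::invalid_argument("unknown onnx_backend: " + name);
}

// Resolve "best available": CUDA when a GPU is present, otherwise CPU.
// Assumption: an explicit "cuda" request without a GPU is an error.
OnnxBackend resolveBackend(OnnxBackend requested, bool gpuAvailable) {
  switch (requested) {
    case OnnxBackend::best:
      return gpuAvailable ? OnnxBackend::cuda : OnnxBackend::cpu;
    case OnnxBackend::cuda:
      if (!gpuAvailable) throw std::runtime_error("onnx_backend=cuda but no GPU is available");
      return OnnxBackend::cuda;
    case OnnxBackend::cpu:
      break;
  }
  return OnnxBackend::cpu;
}
```

In a producer, the resolved value would then presumably decide whether a CUDA execution provider is added to the session options.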

The model used in BoostedJetONNXJetTagsProducer crashes on GPU when the full optimization is enabled, so I reduced the optimization level whenever a GPU is used (following recipes found online). The error one gets looks like this:

Base::CudnnHandle(), &alpha, Base::s_.z_tensor, Base::s_.z_data, &alpha, Base::s_.y_tensor, Base::s_.y_data); 
2022-08-26 13:26:51.964709271 [E:onnxruntime:, sequential_executor.cc:346 Execute] Non-zero status code returned while running FusedConv node. Name:'Conv_98_Add_99_Relu_100'
 Status Message: CUDNN error executing cudnnAddTensor(Base::CudnnHandle(), &alpha, Base::s_.z_tensor, Base::s_.z_data, &alpha, Base::s_.y_tensor, Base::s_.y_data)
----- Begin Fatal Exception 26-Aug-2022 14:26:51 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 0
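One way to express "reduce the optimization only on GPU" is to pick the graph optimization level from the backend. onnxruntime exposes these levels through its `GraphOptimizationLevel` enum (`ORT_DISABLE_ALL`, `ORT_ENABLE_BASIC`, `ORT_ENABLE_EXTENDED`, `ORT_ENABLE_ALL`); the sketch mirrors them locally so it compiles on its own. The level this PR actually uses on GPU is not stated here, so capping at the basic level is an assumption, and `chooseOptLevel` is a hypothetical name.

```cpp
// Local stand-in for onnxruntime's GraphOptimizationLevel, so this sketch
// builds without onnxruntime headers. In real code, use the ORT enum and pass
// the result to Ort::SessionOptions::SetGraphOptimizationLevel().
enum class GraphOptLevel { DisableAll, EnableBasic, EnableExtended, EnableAll };

// Keep full optimization on CPU, but cap it on GPU, where a fused node
// (FusedConv) failed with the cuDNN error above.
// Assumption: "basic" is a low enough cap; the PR's actual choice may differ.
GraphOptLevel chooseOptLevel(bool useGpu) {
  return useGpu ? GraphOptLevel::EnableBasic : GraphOptLevel::EnableAll;
}
```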

So far I see no significant performance gain or loss, at least on lxplus-gpu. BoostedJetONNXJetTagsProducer.cc, for one, could be improved to send more than one jet to onnxruntime at a time.
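Sending several jets per call amounts to stacking their inputs along a batch dimension. The helper below is a hypothetical sketch, assuming each jet contributes a fixed-length feature vector and that the model accepts a dynamic first (batch) dimension; `packJetBatch` and `batchShape` are illustrative names, not functions in BoostedJetONNXJetTagsProducer.cc.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Pack per-jet feature vectors into one row-major buffer of shape
// {nJets, nFeatures}, so one inference call covers every jet in the event.
std::vector<float> packJetBatch(const std::vector<std::vector<float>>& jets, std::size_t nFeatures) {
  std::vector<float> batch;
  batch.reserve(jets.size() * nFeatures);
  for (const auto& jet : jets) {
    if (jet.size() != nFeatures) throw std::invalid_argument("jet has the wrong number of features");
    batch.insert(batch.end(), jet.begin(), jet.end());
  }
  return batch;
}

// Shape to declare for the batched input tensor (batch dimension first);
// onnxruntime describes tensor shapes with 64-bit integers.
std::vector<std::int64_t> batchShape(std::size_t nJets, std::size_t nFeatures) {
  return {static_cast<std::int64_t>(nJets), static_cast<std::int64_t>(nFeatures)};
}
```

The outputs would then come back as one row per jet, in the same order the jets were packed.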

@cmsbuild
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-39402/32108

  • This PR adds an extra 28KB to the repository

Code check has found code style and quality issues which could be resolved by applying the following patch(es)

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-39402/32110

  • This PR adds an extra 28KB to the repository

@cmsbuild
Contributor

cmsbuild commented Sep 15, 2022

A new Pull Request was created by @davidlange6 (David Lange) for master.

It involves the following packages:

  • PhysicsTools/ONNXRuntime (reconstruction)
  • RecoBTag/ONNXRuntime (reconstruction)
  • RecoParticleFlow/PFProducer (reconstruction)

@cmsbuild, @mandrenguyen, @clacaputo can you please review it and eventually sign? Thanks.
@AlexDeMoor, @mmarionncern, @JyothsnaKomaragiri, @AnnikaStein, @riga, @emilbols, @lgray, @missirol, @hatakeyamak, @andrzejnovak, @demuller, @seemasharmafnal this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release managers for this.

cms-bot commands are listed here

@davidlange6
Contributor Author

enable gpu

@davidlange6
Contributor Author

please test

@mandrenguyen
Contributor

assign heterogenous

@mandrenguyen
Contributor

assign heterogeneous
(helps if you can spell)

@cmsbuild
Contributor

New categories assigned: heterogeneous

@fwyzard, @makortel you have been requested to review this pull request/issue and eventually sign. Thanks

@kpedro88
Contributor

We also encountered this ONNX issue in SONIC tests. I think it's microsoft/onnxruntime#12321. There's a fix merged, but not in a release yet.

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-edc989/27572/summary.html
COMMIT: 9433eda
CMSSW: CMSSW_12_6_X_2022-09-15-1100/el8_amd64_gcc10
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/39402/27572/install.sh to create a dev area with all the needed externals and cmssw changes.

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • Reco comparison had 3 failed jobs
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19876
  • DQMHistoTests: Total failures: 8
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 19868
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB (3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: found differences in 1 / 3 workflows

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 5 differences found in the comparisons
  • DQMHistoTests: Total files compared: 51
  • DQMHistoTests: Total histograms compared: 3618326
  • DQMHistoTests: Total failures: 8
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3618296
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB (50 files compared)
  • Checked 212 log files, 49 edm output root files, 51 DQM output files
  • TriggerResults: no differences found

@cmsbuild cmsbuild modified the milestones: CMSSW_13_3_X, CMSSW_14_0_X Nov 6, 2023
@cmsbuild
Contributor

cmsbuild commented Feb 6, 2024

Milestone for this pull request has been moved to CMSSW_14_1_X. Please open a backport if it should also go into CMSSW_14_0_X.

@cmsbuild
Contributor

cmsbuild commented Feb 6, 2024

Pull request #39402 was updated. @wpmccormack, @fwyzard, @valsdav, @makortel can you please check and sign again.

@cmsbuild cmsbuild modified the milestones: CMSSW_14_0_X, CMSSW_14_1_X Feb 6, 2024
@cmsbuild
Contributor

cmsbuild commented Feb 6, 2024

-1

Failed Tests: Build HeaderConsistency ClangBuild
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-edc989/37239/summary.html
COMMIT: 54d730f
CMSSW: CMSSW_14_0_X_2024-02-05-2300/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/39402/37239/install.sh to create a dev area with all the needed externals and cmssw changes.

Build

I found compilation error when building:

>> Compiling edm plugin /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_14_0_X_2024-02-05-2300/src/RecoParticleFlow/PFProducer/plugins/PFCandidateChecker.cc
>> Compiling edm plugin /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_14_0_X_2024-02-05-2300/src/RecoParticleFlow/PFProducer/plugins/PFConcretePFCandidateProducer.cc
>> Compiling edm plugin /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_14_0_X_2024-02-05-2300/src/RecoParticleFlow/PFProducer/plugins/PFEGammaProducer.cc
>> Compiling edm plugin /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_14_0_X_2024-02-05-2300/src/RecoParticleFlow/PFProducer/plugins/PFElectronTranslator.cc
In file included from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_14_0_X_2024-02-05-2300/src/RecoParticleFlow/PFProducer/plugins/MLPFProducer.cc:8:
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_14_0_X_2024-02-05-2300/src/PhysicsTools/ONNXRuntime/interface/ONNXSessionOptions.h:4:10: fatal error: HeterogeneousCore/CUDAServices/interface/CUDAService.h: No such file or directory
    4 | #include "HeterogeneousCore/CUDAServices/interface/CUDAService.h"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
>> Compiling edm plugin /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_14_0_X_2024-02-05-2300/src/RecoParticleFlow/PFProducer/plugins/PFLinker.cc
>> Compiling edm plugin /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_14_0_X_2024-02-05-2300/src/RecoParticleFlow/PFProducer/plugins/PFProducer.cc


Clang Build

I found compilation error while trying to compile with clang. Command used:

USER_CUDA_FLAGS='--expt-relaxed-constexpr' USER_CXXFLAGS='-Wno-register -fsyntax-only' scram build -k -j 32 COMPILER='llvm compile'

>> Creating project symlinks
>> Entering Package PhysicsTools/ONNXRuntime
>> Entering Package RecoBTag/ONNXRuntime
>> Entering Package RecoParticleFlow/PFProducer
>> Compile sequence completed for CMSSW CMSSW_14_0_X_2024-02-05-2300
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 1
+ eval scram build outputlog '&&' '(python3' /data/cmsbld/jenkins/workspace/ib-run-pr-tests/cms-bot/buildLogAnalyzer.py --logDir /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_14_0_X_2024-02-05-2300/tmp/el8_amd64_gcc12/cache/log/src '||' 'true)'
++ scram build outputlog
>> Entering Package PhysicsTools/ONNXRuntime
Entering library rule at PhysicsTools/ONNXRuntime
>> Compiling  /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_14_0_X_2024-02-05-2300/src/PhysicsTools/ONNXRuntime/src/ONNXRuntime.cc


@valsdav
Contributor

valsdav commented Feb 6, 2024

@davidlange6 if you don't mind, I can take a look and try to make the interface uniform with the TensorFlow options one. Maybe we can close this PR and restart?

@mandrenguyen
Contributor

-reconstruction
Close this one?

@cmsbuild
Contributor

Milestone for this pull request has been moved to CMSSW_14_2_X. Please open a backport if it should also go into CMSSW_14_1_X.

@cmsbuild cmsbuild modified the milestones: CMSSW_14_1_X, CMSSW_14_2_X Aug 27, 2024
@fwyzard
Contributor

fwyzard commented Sep 2, 2024

@davidlange6 @valsdav should this PR be resurrected, or closed ?

@cmsbuild cmsbuild modified the milestones: CMSSW_14_1_X, CMSSW_14_2_X Sep 2, 2024
@valsdav
Contributor

valsdav commented Sep 4, 2024

If @davidlange6 agrees, I would prefer to close this PR and open one with similar changes to make the ONNX backend selection the same as the one we have now in the TensorFlow code. I can take care of that 👍


9 participants