
[TF] Added build with GPU support but default is to build without GPU #7617

Merged
7 commits merged on Feb 23, 2022

Conversation

smuzaffar
Contributor

No description provided.

@smuzaffar
Contributor Author

please test

@cmsbuild
Contributor

A new Pull Request was created by @smuzaffar (Malik Shahzad Muzaffar) for branch IB/CMSSW_12_3_X/master.

@smuzaffar, @iarspider can you please review it and eventually sign? Thanks.
@perrotta, @dpiparo, @qliphy you are the release managers for this.
cms-bot commands are listed here

@cmsbuild
Contributor

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-049d30/22341/summary.html
COMMIT: 52fd78c
CMSSW: CMSSW_12_3_X_2022-02-10-1100/slc7_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7617/22341/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found a compilation error when building:

Target //tensorflow/tools/pip_package:build_pip_package failed to build
INFO: Elapsed time: 66.035s, Critical Path: 2.67s
INFO: 54 processes: 50 internal, 4 local.
FAILED: Build did NOT complete successfully
FAILED: Build did NOT complete successfully
error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.HeV0DH (%build)


RPM build errors:
Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.HeV0DH (%build)



@smuzaffar
Contributor Author

please test

@smuzaffar
Contributor Author

please test for slc7_ppc64le_gcc11

@smuzaffar
Contributor Author

please test for slc7_aarch64_gcc11

@cmsbuild
Contributor

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-049d30/22354/summary.html
COMMIT: 52fd78c
CMSSW: CMSSW_12_3_X_2022-02-10-2300/slc7_aarch64_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7617/22354/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found a compilation error when building:

File "/data/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/pkgtools/scheduler.py", line 267, in doSerial
result = commandSpec[0](*commandSpec[1:])
File "./pkgtools/cmsBuild", line 3651, in installPackage
File "./pkgtools/cmsBuild", line 3399, in installRpm
RpmInstallFailed: Failed to install package cudnn. Reason:
error: Failed dependencies:
	libm.so.6(GLIBC_2.27)(64bit) is needed by external+cudnn+8.2.2.26-5a09bc859d16df5e6a023381eff0b19e-1-1.aarch64

* The action "build-external+tensorflow-sources+2.6.0-e16c1637b92da7a7da348b55d10b8992" was not completed successfully because The following dependencies could not complete:
install-external+cudnn+8.2.2.26-5a09bc859d16df5e6a023381eff0b19e
* The action "build-external+tensorflow+2.6.0-d7d45dfb8a5d2a6b123ee55227ad554e" was not completed successfully because The following dependencies could not complete:


@cmsbuild
Contributor

Pull request #7617 was updated.

@smuzaffar
Contributor Author

please test for slc7_aarch64_gcc11

@smuzaffar
Contributor Author

please test for slc7_ppc64le_gcc11

@smuzaffar
Contributor Author

please test

@cmsbuild
Contributor

Pull request #7617 was updated.

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-049d30/22567/summary.html
COMMIT: a907ab6
CMSSW: CMSSW_12_3_X_2022-02-21-2300/slc7_amd64_gcc10
Additional Tests: GPU,THREADING,PROFILING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7617/22567/install.sh to create a dev area with all the needed externals and cmssw changes.

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 14 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19811
  • DQMHistoTests: Total failures: 2437
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 17374
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB (3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3965143
  • DQMHistoTests: Total failures: 7
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3965113
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.004 KiB (48 files compared)
  • DQMHistoSizes: changed (312.0): -0.004 KiB, MessageLogger/Warnings
  • Checked 204 log files, 45 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@smuzaffar
Contributor Author

smuzaffar commented Feb 22, 2022

@tvami, note that this change allows us to build TF with GPU support, but I have disabled it by default. The TF GPU build currently breaks GPU-based tests (see #7617 (comment)).
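
For reference, whether an installed TensorFlow binary was built with CUDA support can be checked at runtime with the standard TF API; a minimal sketch (plain Python, nothing specific to CMSSW):

import tensorflow as tf

# True only if the TF binary itself was compiled with CUDA support,
# independent of whether a GPU is actually present on the node.
print("Built with CUDA:", tf.test.is_built_with_cuda())

# GPU devices TF can actually see at runtime.
print("GPUs visible:", tf.config.list_physical_devices("GPU"))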

@smuzaffar
Contributor Author

With the TF GPU build we get errors like:

2022-02-21 12:56:42.686281: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-21 12:56:42.689865: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-21 12:56:42.693618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30005 MB memory:  -> device: 0, name: Tesla V100S-PCIE-32GB, pci bus id: 0000:00:07.0, compute capability: 7.0
2022-02-21 12:56:42.963412: I tensorflow/stream_executor/cuda/cuda_driver.cc:732] failed to allocate 29.30G (31463047168 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-02-21 12:56:42.967774: I tensorflow/stream_executor/cuda/cuda_driver.cc:732] failed to allocate 26.37G (28316741632 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-02-21 12:56:42.971858: I tensorflow/stream_executor/cuda/cuda_driver.cc:732] failed to allocate 23.73G (25485066240 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
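
For reference, a common mitigation in standalone TensorFlow for this kind of pre-allocation failure is to enable on-demand GPU memory growth, so that TF does not try to reserve nearly the whole device up front. A minimal sketch, assuming a TF 2.6 Python environment (not something this PR itself changes, and whether it would avoid the cuDNN failure seen here is not established by these logs):

import tensorflow as tf

# Ask TF to allocate GPU memory on demand instead of pre-allocating the device.
# Must run before TF creates any GPU context.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)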

@tvami

tvami commented Feb 22, 2022

Hi @smuzaffar, ok, thanks. I think this is a good first step. We could integrate this and ask the GPU experts to look into the problem. Maybe we could open a cmssw GitHub issue about it? What do you think?

@tvami

tvami commented Feb 22, 2022

Or maybe I could tag @cms-sw/heterogeneous-l2 here

@makortel
Contributor

Looking at the log
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-049d30/22532/runTheMatrixGPU-results/11634.506_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano/step2_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano.log
it is not clear to me whether the root error is

2022-02-21 12:56:39.046596: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

or

2022-02-21 12:56:42.963412: I tensorflow/stream_executor/cuda/cuda_driver.cc:732] failed to allocate 29.30G (31463047168 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
...
2022-02-21 12:56:43.092337: I tensorflow/stream_executor/cuda/cuda_driver.cc:732] failed to allocate 1.01G (1080340992 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

or

2022-02-21 12:56:44.784921: E tensorflow/stream_executor/cuda/cuda_dnn.cc:374] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

The job finally fails with "Failed to get convolution algorithm", which is probably related to the cuDNN initialization failure.

Whether or not that failure is connected to the memory allocation failures is not clear to me (there is no information in the log about, e.g., whether any allocation actually succeeds). The implication of the first error (warning?) is also unclear.

Would ML folks have any better insight? @vlimant @gkasieczka @riga @yongbinfeng
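
For reference, one way to probe whether the cuDNN initialization failure is tied to the up-front memory allocation would be to force on-demand allocation and run a single convolution. A hedged sketch (TF_FORCE_GPU_ALLOW_GROWTH is a standard TensorFlow environment variable; the toy tensors are purely illustrative):

import os
# Must be set before TensorFlow initializes the GPU.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

import tensorflow as tf

# A tiny conv2d is enough to trigger cuDNN initialization; if this already fails
# with "Failed to get convolution algorithm", the problem is independent of the
# DeepTauId graph itself.
x = tf.random.normal([1, 32, 32, 3])
k = tf.random.normal([3, 3, 3, 8])
y = tf.nn.conv2d(x, k, strides=1, padding="SAME")
print(y.shape)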

@fwyzard
Contributor

fwyzard commented Feb 23, 2022

According to this StackOverflow answer, this

2022-02-21 12:56:39.046596: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

is just a warning, and it should not be the cause of the failure.

@tvami

tvami commented Feb 23, 2022

I guess the real reason for the workflow to fail is the initialization issue that Matti mentioned:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-049d30/22532/runTheMatrixGPU-results/11634.506_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano/step2_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano.log

----- Begin Fatal Exception 21-Feb-2022 12:56:45 UTC-----------------------
An exception of category 'InvalidRun' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing module: class=DeepTauId label='hltHpsPFTauDeepTauProducer'
Exception Message:
error while running session: Unknown: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node inner_egamma_conv_1/convolution}}]]
	 [[inner_all_dropout_4/Identity/_7]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node inner_egamma_conv_1/convolution}}]]

@fwyzard
Contributor

fwyzard commented Feb 23, 2022

How can I reproduce the failing tests?

I tried on lxplus-gpu:

/cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7617/22532/install.sh
cd CMSSW_12_3_X_2022-02-20-2300/src
cmsenv
runTheMatrix.py -w gpu -l 11634.506

and it worked fine:

Preparing to run 11634.506 TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano
...
11634.506_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Wed Feb 23 01:40:53 2022-date Wed Feb 23 01:30:54 2022; exit: 0 0 0 0
1 1 1 1 tests passed, 0 0 0 0 failed

@tvami

tvami commented Feb 23, 2022

@fwyzard I think that's expected, because the current version of this PR switches the TF GPU support off by default.
I think this commit was the one that changed the default:
62b6e91
and indeed its diff does seem to show that the left-hand side has %define enable_cuda 1 in the tensorflow-requires.file.

@fwyzard
Contributor

fwyzard commented Feb 23, 2022 via email

@smuzaffar
Contributor Author

@fwyzard, let me merge this PR and I will open a new PR with GPU enabled, which you can use for testing.

@smuzaffar
Contributor Author

@fwyzard, TensorFlow with GPU support is now available via #7648. You can use the /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7648/22659/install.sh area to test it.

One thing with TF and GPU is that when a GPU is available, TF will use it. I guess that is the reason many tests failed for ppc64le (see #7617 (comment)), as our PowerPC nodes have GPUs.
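
For completeness, if CPU-only behaviour is wanted on nodes that do have GPUs, TensorFlow can be told to ignore them; a minimal sketch using standard TF/CUDA mechanisms (not something added by this PR or #7648):

import os
# Option 1: hide the GPUs from CUDA entirely (must be set before TF/CUDA init).
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import tensorflow as tf

# Option 2: leave CUDA visibility alone but make TF itself ignore all GPU devices.
tf.config.set_visible_devices([], "GPU")

print(tf.config.get_visible_devices("GPU"))  # expected: []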
