[TF] Added build with GPU support but default is to build without GPU #7617
please test |
A new Pull Request was created by @smuzaffar (Malik Shahzad Muzaffar) for branch IB/CMSSW_12_3_X/master. @smuzaffar, @iarspider can you please review it and eventually sign? Thanks. |
-1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-049d30/22341/summary.html
External Build
I found compilation error when building:
Target //tensorflow/tools/pip_package:build_pip_package failed to build
INFO: Elapsed time: 66.035s, Critical Path: 2.67s
INFO: 54 processes: 50 internal, 4 local.
FAILED: Build did NOT complete successfully
FAILED: Build did NOT complete successfully
error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.HeV0DH (%build)
RPM build errors:
Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.HeV0DH (%build) |
please test |
please test for slc7_ppc64le_gcc11 |
please test for slc7_aarch64_gcc11 |
-1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-049d30/22354/summary.html
External Build
I found compilation error when building:
File "/data/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/pkgtools/scheduler.py", line 267, in doSerial
result = commandSpec[0](*commandSpec[1:])
File "./pkgtools/cmsBuild", line 3651, in installPackage
File "./pkgtools/cmsBuild", line 3399, in installRpm
RpmInstallFailed: Failed to install package cudnn. Reason:
error: Failed dependencies:
libm.so.6(GLIBC_2.27)(64bit) is needed by external+cudnn+8.2.2.26-5a09bc859d16df5e6a023381eff0b19e-1-1.aarch64
* The action "build-external+tensorflow-sources+2.6.0-e16c1637b92da7a7da348b55d10b8992" was not completed successfully because The following dependencies could not complete:
install-external+cudnn+8.2.2.26-5a09bc859d16df5e6a023381eff0b19e
* The action "build-external+tensorflow+2.6.0-d7d45dfb8a5d2a6b123ee55227ad554e" was not completed successfully because The following dependencies could not complete: |
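The aarch64 failure in the log above comes down to a glibc version requirement: the cudnn RPM needs `libm.so.6(GLIBC_2.27)`, and the build host does not provide it. As a rough illustration only, the check below compares the glibc version reported by Python's `platform.libc_ver()` against that requirement. The helper names `parse_version` and `glibc_satisfies` are hypothetical and are not part of pkgtools or cmsBuild; the only value taken from the log is the 2.27 requirement.

```python
import platform

# Taken from the "libm.so.6(GLIBC_2.27)" dependency line in the log above.
REQUIRED_GLIBC = (2, 27)


def parse_version(text):
    """Turn a dotted version string such as '2.17' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_satisfies(found, required=REQUIRED_GLIBC):
    """Return True if the found glibc version is at least the required one.

    An empty or unparsable version string yields an empty tuple, which
    compares as "too old", so the check fails safe.
    """
    return parse_version(found) >= required


if __name__ == "__main__":
    # platform.libc_ver() returns ('', '') when it cannot detect the C library.
    lib, version = platform.libc_ver()
    status = "OK" if glibc_satisfies(version) else "too old or unknown"
    print(f"C library: {lib or 'unknown'} {version or '?'} -> {status}")
```

Running this on the target build host before attempting the install would show whether the cudnn dependency can be met there at all. This is only a sketch of the version comparison, not how RPM resolves dependencies.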
Pull request #7617 was updated. |
please test for slc7_aarch64_gcc11 |
please test for slc7_ppc64le_gcc11 |
please test |
Pull request #7617 was updated. |
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-049d30/22567/summary.html
GPU Comparison Summary:
Comparison Summary:
|
@tvami, note that this change allows us to build TF with GPU support, but I have disabled it by default. The TF GPU build currently breaks GPU-based tests (see #7617 (comment)). |
with TF GPU we get errors like
|
Hi @smuzaffar, OK, thanks. I think this is a good first step. We could integrate this and ask the GPU experts to look into this problem. Maybe we could open a cmssw GitHub issue about it? What do you think? |
Or maybe I could tag @cms-sw/heterogeneous-l2 here |
Looking at the log
or
or
The job finally fails with Whether or not that failure is connected to the memory allocation failures is not clear to me (there is no information in the log e.g. if any allocation actually succeeds). The implication of the first error (warning?) is also unclear. Would ML folks have any better insight? @vlimant @gkasieczka @riga @yongbinfeng |
According to this StackOverflow answer, this
is just a warning, and it should not be the cause of the failure. |
I guess the real reason for the wf to fail is the initialization issue that Matti mentioned:
|
How can I reproduce the failing tests? I tried on
/cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7617/22532/install.sh
cd CMSSW_12_3_X_2022-02-20-2300/src
cmsenv
runTheMatrix.py -w gpu -l 11634.506
and it worked fine:
|
I thought the area I used for the test was before that change.
Otherwise, how can I reproduce the error?
|
@fwyzard, let me merge this PR and I will open a new PR with GPU enabled which you can use for testing. |
@fwyzard, TensorFlow with GPU support is now available via #7648. You can use the /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7648/22659/install.sh area to test it. One thing to note with TF and GPUs: when a GPU is available, TF will use it. I guess that is the reason many tests failed for |
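Since a GPU-enabled TF picks up any visible GPU automatically, one generic way to compare against CPU-only behaviour is to hide the devices from the process through the CUDA `CUDA_VISIBLE_DEVICES` environment variable. The sketch below is an assumption-laden illustration, not something proposed in this thread: `cpu_only_env` is a hypothetical helper, and hiding the GPUs this way hides them from every CUDA user in the job, not only from TF, so it would not suit workflows that need the GPU for other modules.

```python
import os
import subprocess
import sys


def cpu_only_env(base_env=None):
    """Return a copy of the environment in which no CUDA device is visible.

    An empty CUDA_VISIBLE_DEVICES hides all GPUs from the CUDA runtime, so a
    GPU-enabled library started with this environment falls back to the CPU.
    The caller's environment is not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env


if __name__ == "__main__":
    # Stand-in child process that only echoes the variable it received; a real
    # test would launch the actual job with env=cpu_only_env() instead.
    child = "import os; print(repr(os.environ['CUDA_VISIBLE_DEVICES']))"
    subprocess.run([sys.executable, "-c", child], env=cpu_only_env(), check=True)
```

Running the same test once with the normal environment and once with `cpu_only_env()` would help separate failures caused by TF using the GPU from failures that occur regardless.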