Tools for CUDA modules #28537

Merged
merged 2 commits into from
Jan 20, 2020

makortel
Contributor

@makortel makortel commented Dec 3, 2019

PR description:

This PR imports the tools for CUDA modules (or "CUDA framework" or "heterogeneous framework") from Patatrack. There is documentation in HeterogeneousCore/CUDACore/README.md and examples in HeterogeneousCore/CUDATest (that get run along unit tests). Here are also two talks with higher-level description

PR validation:

Unit tests run on both GPU and non-GPU machines.

@cmsbuild
Contributor

cmsbuild commented Dec 3, 2019

The code-checks are being triggered in jenkins.

@cmsbuild
Contributor

cmsbuild commented Dec 3, 2019

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-28537/13010

  • This PR adds an extra 108KB to repository

@smuzaffar
Contributor

enable GPU

@cmsbuild
Contributor

cmsbuild commented Dec 3, 2019

A new Pull Request was created by @makortel (Matti Kortelainen) for master.

It involves the following packages:

CUDADataFormats/Common
FWCore/Concurrency
HeterogeneousCore/CUDACore
HeterogeneousCore/CUDAServices
HeterogeneousCore/CUDATest
HeterogeneousCore/CUDAUtilities

The following packages do not have a category, yet:

CUDADataFormats/Common
HeterogeneousCore/CUDACore
HeterogeneousCore/CUDATest
HeterogeneousCore/CUDAUtilities
Please create a PR for https://github.com/cms-sw/cms-bot/blob/master/categories_map.py to assign category

@cmsbuild, @smuzaffar, @Dr15Jones can you please review it and eventually sign? Thanks.
@wddgit this is something you requested to watch as well.
@davidlange6, @slava77, @fabiocos you are the release manager for this.

cms-bot commands are listed here

@makortel
Contributor Author

makortel commented Dec 3, 2019

@cmsbuild, please test

@cmsbuild
Contributor

cmsbuild commented Dec 3, 2019

@makortel
Contributor Author

makortel commented Dec 3, 2019

@smuzaffar With "enable GPU", will the tests be run only on GPU-equipped machines? Is it possible to run the tests on machines without a GPU as well?

@makortel
Contributor Author

makortel commented Dec 3, 2019

@Dr15Jones @fwyzard Please review.

@smuzaffar
Contributor

@makortel, it will run on both GPU and non-GPU machines.

@makortel
Contributor Author

makortel commented Dec 3, 2019

@smuzaffar Thanks. And how does the system distinguish which tests to run on GPU machines and which on non-GPU machines? (pardon my ignorance)

@cmsbuild
Contributor

cmsbuild commented Dec 4, 2019

-1

Tested at: 3854fbb

You can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-fc491b/3777/summary.html

I found the following errors while testing this PR

Failed tests: UnitTests RelVals

  • Unit Tests:

I found errors in the following unit tests:

---> test testHeterogeneousCoreCUDACoreStreamEvent had ERRORS
---> test testHeterogeneousCoreCUDACore had ERRORS
---> test EcalADCToGeV_update_test had ERRORS
---> test EcalTPGCrystalStatus_O2O_test had ERRORS
---> test EcalTPGPhysicsConst_O2O_test had ERRORS
---> test EcalTPGWeightIdMap_O2O_test had ERRORS
---> test EcalTPGFineGrainEBGroup_O2O_test had ERRORS
---> test EcalTPGBadStripStatus_O2O_test had ERRORS
---> test EcalTPGLinearizationConst_O2O_test had ERRORS
---> test EcalTPGLutGroup_O2O_test had ERRORS
---> test EcalDCS_O2O_test had ERRORS
---> test EcalTPGTowerStatus_O2O_test had ERRORS
---> test EcalTPGFineGrainTowerEE_O2O_test had ERRORS
---> test EcalDAQ_O2O_test had ERRORS
---> test EcalTPGPedestals_O2O_test had ERRORS
---> test test_hcal_digi had ERRORS
---> test test_hcal_reco had ERRORS
---> test EcalTPGSpike_O2O_test had ERRORS
---> test EcalTPGFineGrainEBIdMap_O2O_test had ERRORS
---> test EcalTPGFineGrainStripEE_O2O_test had ERRORS
---> test EcalTPGWeightGroup_O2O_test had ERRORS
---> test EcalTPGLutIdMap_O2O_test had ERRORS
---> test EcalTPGSlidingWindow_O2O_test had ERRORS
---> test SiStripDCS_O2O_test had ERRORS
---> test EcalLaser_O2O_test had ERRORS
---> test SiStripDAQ_O2O_test had ERRORS
---> test RunInfoStart_O2O_test had ERRORS

  • RelVals:

@cmsbuild
Contributor

cmsbuild commented Dec 4, 2019

Comparison not run due to runTheMatrix errors (RelVals and Igprof tests were also skipped)

@fwyzard
Contributor

fwyzard commented Jan 18, 2020

Based on the validation at cms-patatrack#429.

@cmsbuild
Contributor

+1
Tested at: f250584
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-fc491b/4282/summary.html
CMSSW: CMSSW_11_1_X_2020-01-17-2300
SCRAM_ARCH: slc7_amd64_gcc820

@cmsbuild
Contributor

Comparison not run due to Build errors/Fireworks only changes/No short matrix requested (RelVals and Igprof tests were also skipped)

@smuzaffar
Contributor

GPU tests were also run. Can you check whether they actually ran on a GPU? #28537 (comment)

@fwyzard
Contributor

fwyzard commented Jan 18, 2020

GPU tests were also run. Can you check whether they actually ran on a GPU? #28537 (comment)

The matrix workflows did run, but since they are a clone of the cpu ones for the moment, I cannot check much.

The unit test did run on a machine with a GPU:

Detected 1 CUDA Capable device(s)
Device 0 memory total 34089730048 free 33742323712
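
The check described above can be automated by scanning a unit-test log for the two lines quoted here. The following is a minimal sketch, not part of the PR: the function names `parse_cuda_device_log` and `ran_on_gpu` are hypothetical, and it assumes the log wording stays exactly as in the excerpt above.

```python
import re

# Patterns copied from the two log lines quoted above; that other releases
# print exactly the same wording is an assumption.
DETECTED_RE = re.compile(r"Detected (\d+) CUDA Capable device\(s\)")
MEMORY_RE = re.compile(r"Device (\d+) memory total (\d+) free (\d+)")


def parse_cuda_device_log(text):
    """Return (device_count, {device_index: (total_bytes, free_bytes)})."""
    detected = DETECTED_RE.search(text)
    count = int(detected.group(1)) if detected else 0
    memory = {
        int(m.group(1)): (int(m.group(2)), int(m.group(3)))
        for m in MEMORY_RE.finditer(text)
    }
    return count, memory


def ran_on_gpu(text):
    """True if the log reports at least one CUDA device."""
    count, _ = parse_cuda_device_log(text)
    return count > 0


# The log excerpt from the comment above, reused as sample input.
SAMPLE_LOG = """\
Detected 1 CUDA Capable device(s)
Device 0 memory total 34089730048 free 33742323712
"""
```

Calling `ran_on_gpu(SAMPLE_LOG)` on the excerpt reports that a device was seen; a log without the "Detected" line is treated as a CPU-only run.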

@makortel
Contributor Author

+core

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @davidlange6, @silviodonato, @fabiocos (and backports should be raised in the release meeting by the corresponding L2)

@silviodonato
Contributor

+1
@smuzaffar @makortel how can we validate these new tools in the future? Are you going to run automatic tests on GPU along with ppc64le and aarch64 on IBs from now on?
@makortel @fwyzard if you expect 10824.501 (PixelOnlyCPU) and 10824.502 (PixelOnlyGPU) to give exactly the same results, you might add a test checking this (as far as I understand, at the moment we only check that the two workflows run without error)

@cmsbuild cmsbuild merged commit 8711c52 into cms-sw:master Jan 20, 2020
@silviodonato
Contributor

Hi @makortel and all, it seems this PR breaks ARM compatibility in several unit tests:
CUDADataFormats/Common
HeterogeneousCore/CUDACore
HeterogeneousCore/CUDAServices
HeterogeneousCore/CUDATest
HeterogeneousCore/CUDAUtilities

https://cmssdt.cern.ch/SDT/cgi-bin/showBuildLogs.py/slc7_aarch64_gcc820/www/mon/11.1-mon-11/CMSSW_11_1_X_2020-01-20-1100

I think we can discuss this issue tomorrow during the ORP or Core Software meeting.

@fwyzard
Contributor

fwyzard commented Jan 20, 2020

The problem might be that on ARM we have an older version of CUDA that does not support GCC 8.
Do we have any GCC 7 builds ?

@smuzaffar
Contributor

No, we do not have GCC7 ARM builds

@makortel
Contributor Author

@silviodonato

@smuzaffar @makortel how can we validate these new tools in the future? Are you going to run automatic tests on GPU along with ppc64le and aarch64 on IBs from now on?

To my knowledge, we are currently running a specific GPU subset of tests on a GPU machine for amd64. I am not sure about aarch64, where we have CUDA; on ppc64le this will not happen until the CUDA runtime is included. I hope @smuzaffar @fwyzard can correct/elaborate.

@makortel @fwyzard if you expect 10824.501 (PixelOnlyCPU) and 10824.502 (PixelOnlyGPU) to give exactly the same results, you might add a test checking this (as far as I understand, at the moment we only check that the two workflows run without error)

Right, this is something to think about.
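
One possible shape for such a check is sketched below. It is an illustration only, not something proposed in this thread: it assumes the results of each workflow have already been extracted into a dictionary mapping a quantity name to a list of values, and the helper name `find_mismatches` and the quantity names are hypothetical placeholders.

```python
def find_mismatches(cpu_results, gpu_results):
    """Return the sorted names whose values differ between the two runs.

    A name present in only one of the two dictionaries also counts as a
    mismatch. Values are compared with ==, i.e. exact equality, because the
    suggestion above is about workflows expected to agree exactly.
    """
    mismatches = []
    for name in sorted(set(cpu_results) | set(gpu_results)):
        if name not in cpu_results or name not in gpu_results:
            mismatches.append(name)
        elif cpu_results[name] != gpu_results[name]:
            mismatches.append(name)
    return mismatches


# Placeholder inputs: the names and numbers are invented for illustration
# and are not taken from the 10824.501 / 10824.502 workflows.
cpu = {"n_tracks": [12, 7], "n_vertices": [3, 2]}
gpu_same = {"n_tracks": [12, 7], "n_vertices": [3, 2]}
gpu_diff = {"n_tracks": [12, 8], "extra": [1]}
```

A test would then fail whenever `find_mismatches(cpu, gpu)` is non-empty. If bitwise agreement turns out not to be expected, the `!=` comparison is the single place where a tolerance would be introduced.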

@makortel
Contributor Author

The problem might be that on ARM we have an older version of CUDA that does not support GCC 8.

I agree this is probably the root cause

>> Compiling  /home/cmsbld/jenkins_b/workspace/build-any-ib/w/tmp/BUILDROOT/d8ffa4e08dcd74d5b6ce350965a96084/opt/cmssw/slc7_aarch64_gcc820/cms/cmssw-patch/CMSSW_11_1_X_2020-01-20-1100/src/HeterogeneousCore/CUDATest/plugins/TestCUDAAnalyzerGPUKernel.cu 
/home/cmsbld/jenkins_b/workspace/build-any-ib/w/slc7_aarch64_gcc820/external/cuda/10.0.326-bcolbf/bin/nvcc -dc -DGNU_GCC -D_GNU_SOURCE -DCMSSW_GIT_HASH='CMSSW_11_1_X_2020-01-20-1100' -DPROJECT_NAME='CMSSW' -DPROJECT_VERSION='CMSSW_11_1_X_2020-01-20-1100' -I/home/cmsbld/jenkins_b/workspace/build-any-ib/w/tmp/BUILDROOT/d8ffa4e08dcd74d5b6ce350965a96084/opt/cmssw/slc7_aarch64_gcc820/cms/cmssw-patch/CMSSW_11_1_X_2020-01-20-1100/src -I/home/cmsbld/jenkins_b/workspace/build-any-ib/w/tmp/BUILDROOT/d8ffa4e08dcd74d5b6ce350965a96084/opt/cmssw/slc7_aarch64_gcc820/cms/cmssw-patch/CMSSW_11_1_X_2020-01-20-1100/poison -I/home/cmsbld/jenkins_b/workspace/build-any-ib/w/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_1_X_2020-01-19-0000/src -I/home/cmsbld/jenkins_b/workspace/build-any-ib/w/slc7_aarch64_gcc820/external/cuda/10.0.326-bcolbf/include -gencode arch=compute_72,code=sm_72 -O3 -std=c++14 --expt-relaxed-constexpr --expt-extended-lambda --generate-line-info --source-in-ptx --cudart=shared --compiler-options '-O2 -pthread -pipe -Werror=main -Werror=pointer-arith -Werror=overlength-strings -Wno-vla -Werror=overflow -ftree-vectorize -Wstrict-overflow -Werror=array-bounds -Werror=format-contains-nul -Werror=type-limits -fvisibility-inlines-hidden -fno-math-errno --param vect-max-version-for-alias-checks=50 -Xassembler --compress-debug-sections -fsigned-char -fsigned-bitfields -felide-constructors -fmessage-length=0 -Wall -Wno-non-template-friend -Wno-long-long -Wreturn-type -Wunused -Wparentheses -Wno-deprecated -Werror=return-type -Werror=missing-braces -Werror=unused-value -Werror=address -Werror=format -Werror=sign-compare -Werror=write-strings -Werror=delete-non-virtual-dtor -Werror=strict-aliasing -Werror=narrowing -Werror=unused-but-set-variable -Werror=reorder -Werror=unused-variable -Werror=conversion-null -Werror=return-local-addr -Wnon-virtual-dtor -Werror=switch -fdiagnostics-show-option -Wno-unused-local-typedefs -Wno-attributes -Wno-psabi -DBOOST_DISABLE_ASSERTS -std=c++14  
-fPIC   '  /home/cmsbld/jenkins_b/workspace/build-any-ib/w/tmp/BUILDROOT/d8ffa4e08dcd74d5b6ce350965a96084/opt/cmssw/slc7_aarch64_gcc820/cms/cmssw-patch/CMSSW_11_1_X_2020-01-20-1100/src/HeterogeneousCore/CUDATest/plugins/TestCUDAAnalyzerGPUKernel.cu -o tmp/slc7_aarch64_gcc820/src/HeterogeneousCore/CUDATest/plugins/HeterogeneousCoreCUDATestAuto/TestCUDAAnalyzerGPUKernel.cu.o
In file included from /home/cmsbld/jenkins_b/workspace/build-any-ib/w/slc7_aarch64_gcc820/external/cuda/10.0.326-bcolbf/include/cuda_runtime.h:83,
                 from <command-line>:
  /home/cmsbld/jenkins_b/workspace/build-any-ib/w/slc7_aarch64_gcc820/external/cuda/10.0.326-bcolbf/include/crt/host_config.h:129:2: error: #error -- unsupported GNU version! gcc versions later than 7 are not supported!
  #error -- unsupported GNU version! gcc versions later than 7 are not supported!
  ^~~~~

and the rest (TBB and ROOT header include failures) are somehow knock-on effects.

Assuming we're going to stick with GCC 8 on ARM, how feasible would it be to update CUDA? Or could we consider dropping cuda-gcc-support from ARM until CUDA gets updated?
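
The mismatch in the log above can be restated as a small check. The sketch below is illustrative only: the helper names are hypothetical, the limit of GCC 7 for CUDA 10.0 is taken from the `host_config.h` error message quoted above, and the parsing assumes architecture strings end in `_gcc` followed by three single digits, as in `slc7_aarch64_gcc820`.

```python
import re

# From the error message above: the CUDA 10.0.326 headers reject any GCC
# "later than 7".
CUDA_10_0_MAX_GCC_MAJOR = 7


def gcc_major_from_scram_arch(scram_arch):
    """Extract the GCC major version from a string like 'slc7_aarch64_gcc820'.

    Assumes a '_gccXYZ' suffix with single-digit major, minor and patch
    numbers, which matches the strings in this thread but not necessarily
    every architecture name.
    """
    match = re.search(r"_gcc(\d)(\d)(\d)$", scram_arch)
    if match is None:
        raise ValueError("unrecognised architecture string: " + scram_arch)
    return int(match.group(1))


def nvcc_accepts_gcc(max_supported_gcc_major, gcc_major):
    """Mirror the header guard: reject host compilers newer than the limit."""
    return gcc_major <= max_supported_gcc_major


arm_gcc_major = gcc_major_from_scram_arch("slc7_aarch64_gcc820")
arm_build_ok = nvcc_accepts_gcc(CUDA_10_0_MAX_GCC_MAJOR, arm_gcc_major)
```

Here `arm_build_ok` is false, matching the failure in the ARM IB: the fix is either a CUDA whose limit is at least 8 or a host compiler no newer than GCC 7.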

@smuzaffar
Contributor

@silviodonato, yes, the core issue is that we have an old version of CUDA for ARM which does not support GCC 8. This is a known issue.

I am working on improving the test suite so that we can run PR tests for all possible architectures. As we do not have GPU resources for ARM and Power, for now GPU tests can only be run on AMD64.

@makortel makortel deleted the cudaFramework branch January 22, 2020 16:00
fwyzard pushed a commit to cms-patatrack/cmssw that referenced this pull request Mar 5, 2020
fwyzard pushed a commit to cms-patatrack/cmssw that referenced this pull request Mar 6, 2020

8 participants