
GPU Tests - pixel crashes on GPU Hilton #34659

Closed · tsusa opened this issue Jul 27, 2021 · 28 comments

@tsusa
Contributor

tsusa commented Jul 27, 2021

Following a crash mentioned in [1], slide 5, we ran

cmsDriver.py step3 --conditions auto:run3_hlt -s RAW2DIGI:RawToDigi_pixelOnly,RECO:reconstruction_pixelTrackingOnly,DQM:@pixelTrackingOnlyDQM --process reRECO --data --era Run3 --eventcontent RECO,MINIAOD,DQM --hltProcess reHLT --procModifiers pixelNtupletFit,gpu --scenario pp --datatier RECO,MINIAOD,DQMIO --filein file:/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/CRUZET_2021_data/run_343762.root

It crashes at [2] (&inputDataWrapped is != 0 at that point) with an error

terminate called after throwing an instance of 'std::runtime_error'
what():
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_11_3_3-slc7_amd64_gcc900/build/CMSSW_11_3_3-build/tmp/BUILDROOT/402e2a5eeeb9630ea9f5469bb50cc947/opt/cmssw/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_3/src/HeterogeneousCore/CUDACore/src/ScopedContext.cc, line 86:
cudaCheck(cudaStreamAddCallback(stream, cudaScopedContextCallback, new CallbackData{waitingTaskHolder_, device}, 0));
cudaErrorIllegalAddress: an illegal memory access was encountered

@makortel, could you please have a look?

[1] https://indico.cern.ch/event/1062405/contributions/4468133/attachments/2288269/3889700/HLTReport_CRuZeT_RunOrganization_27.07.2021_Zarucki.pdf
[2] https://github.com/cms-sw/cmssw/blob/master/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitFromCUDA.cc#L73

@czangela

@cmsbuild
Contributor

A new Issue was created by @tsusa Tatjana Susa.

@Dr15Jones, @perrotta, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor

assign heterogeneous, reconstruction, hlt

@cmsbuild
Contributor

New categories assigned: heterogeneous,hlt,reconstruction

@slava77, @fwyzard, @Martin-Grunewald, @perrotta, @makortel, @jpata you have been requested to review this Pull request/Issue and eventually sign. Thanks

@slava77
Contributor

slava77 commented Jul 27, 2021

@czangela @VinInn

it looks like #34503 was not enough

@makortel
Contributor

makortel commented Jul 27, 2021

@tsusa

Following a crash mentioned in [1], slide 5

Is this the CMSONS-13106? (I'm not authorized to view that ticket)

Is the error fully reproducible? (I guess so but want to make sure)

The

terminate called after throwing an instance of 'std::runtime_error'

tells that there are likely two exceptions in flight (which leads to a call to terminate(); a fuller log would tell more), and the printed what() is from the second exception, while the first exception would be the more interesting one. A way to get a handle on that would be along the lines of

$ gdb cmsRun
> catch throw
> run step3_...py
# wait for the catchpoint to be hit
> bt # to show the stack trace of the exception; hoping that there are no other silently swallowed exceptions being thrown before

On the other hand, given the CUDA error message "cudaErrorIllegalAddress: an illegal memory access was encountered", I don't have high hopes for the first exception telling us much. If it is a CUDA error, the error message is likely the same (i.e. some kernel accessed invalid device memory).
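
For what it's worth, this asynchronous-and-sticky behaviour of CUDA errors can be seen with a standalone sketch (illustrative only, not code from this workflow): the illegal access inside the kernel is only reported at a later synchronizing call, and every runtime call after that returns the same cudaErrorIllegalAddress.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void badWrite(int* p) {
  p[threadIdx.x] = 42;  // deliberate illegal access: p is nullptr below
}

int main() {
  badWrite<<<1, 32>>>(nullptr);
  // The launch itself usually still reports success ...
  std::printf("after launch: %s\n", cudaGetErrorString(cudaGetLastError()));
  // ... the illegal access only surfaces at a synchronizing call ...
  std::printf("after sync:   %s\n", cudaGetErrorString(cudaDeviceSynchronize()));
  // ... and from then on essentially every runtime call on this context keeps
  // returning cudaErrorIllegalAddress, so the place that reports the error
  // need not be anywhere near where the bad kernel actually ran.
  void* dummy = nullptr;
  std::printf("later call:   %s\n", cudaGetErrorString(cudaMalloc(&dummy, 4)));
  return 0;
}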

Given the statement of "data without pixel" on the slides, I'd concur with @slava77 (#34659 (comment)) that some protection is missing. I wonder what the printout

auto const& inputData = ctx.get(inputDataWrapped);
nHits_ = inputData.nHits();
LogDebug("SiPixelRecHitFromCUDA") << "converting " << nHits_ << " Hits";

would say.

@slava77
Contributor

slava77 commented Jul 27, 2021

based on https://cmsoms.cern.ch/cms/runs/report?cms_run=343762&cms_run_sequence=GLOBAL-RUN
the pixel FEDs are not included

@makortel
Contributor

The

terminate called after throwing an instance of 'std::runtime_error'

tells that there are likely two exceptions in flight (which leads to a call to terminate(); a fuller log would tell more), and the printed what() is from the second exception, while the first exception would be the more interesting one.

Just to add explicitly, I will improve this part of cms::cuda::ScopedContext during the fall (#30266) so that we could avoid the second exception in such cases.
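
As a plain C++ illustration of the mechanism (no CUDA involved, all names made up): when a second exception escapes while the stack is already unwinding for a first one, the runtime calls std::terminate(), which produces exactly this kind of "terminate called after throwing an instance of ..." message, and, as noted above, the what() that gets printed is the later exception's.

#include <stdexcept>

struct Guard {
  // Destructor that throws, standing in for cleanup code that fails while
  // the first exception is already propagating.
  ~Guard() noexcept(false) { throw std::runtime_error("second exception"); }
};

int main() {
  try {
    Guard g;
    throw std::runtime_error("first exception");
  } catch (...) {
    // Never reached: while unwinding for the first exception, ~Guard() throws
    // again, so the program terminates with
    //   terminate called after throwing an instance of 'std::runtime_error'
  }
}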

@gennai
Contributor

gennai commented Jul 28, 2021

Hi @makortel, this is the log from the Hilton machine when we first experienced the crash.

hlt_run343762_pid209249.log

@tsusa
Contributor Author

tsusa commented Jul 28, 2021


@makortel, here is the printout from running under gdb with catch throw:

Using host libthread_db library "/lib64/libthread_db.so.1".
[Detaching after fork from child process 21195]
[Detaching after fork from child process 21257]
[Detaching after fork from child process 21259]
[Detaching after fork from child process 21262]
[Detaching after fork from child process 21264]
[Detaching after fork from child process 21278]
[Detaching after fork from child process 21280]
[Detaching after fork from child process 21312]
[New Thread 0x7fffc7b00700 (LWP 21320)]
[Detaching after fork from child process 21324]
[New Thread 0x7fffc1024700 (LWP 21333)]
[New Thread 0x7fffc0823700 (LWP 21334)]
[New Thread 0x7fffaf3ff700 (LWP 21338)]
[New Thread 0x7fff9ffff700 (LWP 21339)]
[New Thread 0x7fff9f7fe700 (LWP 21343)]
[New Thread 0x7fff9effd700 (LWP 21344)]
[New Thread 0x7fff93fff700 (LWP 21348)]
[New Thread 0x7fff91bfe700 (LWP 21352)]
[New Thread 0x7fff913fd700 (LWP 21353)]
[New Thread 0x7fff90bfc700 (LWP 21357)]
[New Thread 0x7fff7bdff700 (LWP 21358)]
[New Thread 0x7fff7b5fe700 (LWP 21362)]
[New Thread 0x7fff6ffff700 (LWP 21366)]
[New Thread 0x7fff6f7fe700 (LWP 21368)]
[New Thread 0x7fff6effd700 (LWP 21372)]
[New Thread 0x7fff59fff700 (LWP 21373)]
[New Thread 0x7fff597fe700 (LWP 21377)]
28-Jul-2021 12:26:52 CEST Initiating request to open file file:/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/CRUZET_2021_data/run_343762.root
28-Jul-2021 12:26:53 CEST Successfully opened file file:/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/CRUZET_2021_data/run_343762.root
Begin processing the 1st record. Run 343762, Event 6791, LumiSection 1 on stream 0 at 28-Jul-2021 12:27:02.669 CEST
[New Thread 0x7fff09fff700 (LWP 22346)]
[New Thread 0x7fff091ff700 (LWP 22347)]
[Switching to Thread 0x7fff09fff700 (LWP 22346)]

Thread 20 "cmsRun" hit Catchpoint 1 (exception thrown), 0x00007ffff59de32e in __cxxabiv1::__cxa_throw (obj=0x7fff09213080, tinfo=0x7ffff7854398 ,
dest=0x7ffff78153c0 cms::Exception::~Exception()) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:78
78 ../../../../libstdc++-v3/libsupc++/eh_throw.cc: No such file or directory.
(gdb) bt
#0 0x00007ffff59de32e in __cxxabiv1::__cxa_throw (obj=0x7fff09213080, tinfo=0x7ffff7854398 , dest=0x7ffff78153c0 cms::Exception::~Exception())
at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:78
#1 0x00007fffc000651d in (anonymous namespace)::cudaScopedContextCallback(CUstream_st*, cudaError, void*) [clone .cold] ()
from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_3/lib/slc7_amd64_gcc900/libHeterogeneousCoreCUDACore.so
#2 0x00007fffdb74607e in ?? () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_3/external/slc7_amd64_gcc900/lib/libcudart.so.11.0
#3 0x00007fffda03a990 in ?? () from /lib64/libcuda.so.1
#4 0x00007fffda0cebd6 in ?? () from /lib64/libcuda.so.1
#5 0x00007ffff5406ea5 in start_thread () from /lib64/libpthread.so.0
#6 0x00007ffff512f9fd in clone () from /lib64/libc.so.6
(gdb)

@tsusa
Contributor Author

tsusa commented Jul 28, 2021

Two additional tests were done:

  • step3 was repeated with the last 12_0_X IB (CMSSW_12_0_X_2021-07-27-2300) -> works fine. [1] correctly prints that the number of hits is 0.

[1] https://github.com/cms-sw/cmssw/blob/master/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitFromCUDA.cc#L77

@makortel
Contributor

This observation makes me wonder if 11_3_3 could be missing some pixel development that is in 12_0_X that could make a difference?

The log and the gdb look a bit different than I expected, so I'm taking a look (I'm also able to reproduce).

@makortel
Contributor

I added a cudaDeviceSynchronize() to the beginning of SiPixelRecHitFromCUDA::produce() (and set an early breakpoint in gdb to prevent the construction of cms::cuda::ScopedContextAcquire, so that its destructor would not terminate the program), and already that cudaDeviceSynchronize() reports cudaErrorIllegalAddress: an illegal memory access was encountered.

@makortel
Contributor

Running with CUDA_LAUNCH_BLOCKING=1 shows

....../CMSSW_11_3_3/src/HeterogeneousCore/CUDAUtilities/src/CachingDeviceAllocator.h, line 617:
cudaCheck(error = cudaEventRecord(search_key.ready_event, search_key.associated_stream));
cudaErrorIllegalAddress: an illegal memory access was encountered

Thread 1 (Thread 0x7fb9c7dfd540 (LWP 26108)):
#3  0x00007fb9bf945922 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_3/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007fb9c975b387 in raise () from /lib64/libc.so.6
#6  0x00007fb9c975ca78 in abort () from /lib64/libc.so.6
#7  0x00007fb9ca0c5683 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#8  0x00007fb9ca0d10a6 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#9  0x00007fb9ca0d01a9 in __cxa_call_terminate (ue_header=ue_header@entry=0x7fb946c038a0) at ../../../../libstdc++-v3/libsupc++/eh_call.cc:54
#10 0x00007fb9ca0d0ad4 in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=6, exception_class=5138137972254386944, ue_header=0x7fb946c038a0, context=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_personality.cc:677
#11 0x00007fb9c9d1f7b3 in _Unwind_RaiseException_Phase2 (exc=0x7fb946c038a0, context=0x7ffd1f7e8a60, frames_p=0x7ffd1f7e8968) at ../../../libgcc/unwind.inc:64
#12 0x00007fb9c9d20016 in _Unwind_Resume (exc=0x7fb946c038a0) at ../../../libgcc/unwind.inc:241
#13 0x00007fb9c0074d2d in notcub::CachingDeviceAllocator::DeviceFree(int, void*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_3/lib/slc7_amd64_gcc900/libHeterogeneousCoreCUDAUtilities.so
#14 0x00007fb9c0072365 in cms::cuda::free_device(int, void*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_3/lib/slc7_amd64_gcc900/libHeterogeneousCoreCUDAUtilities.so
#15 0x00007fb95d84762f in CAHitNtupletGeneratorKernels<cms::cudacompat::GPUTraits>::~CAHitNtupletGeneratorKernels() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_3/lib/slc7_amd64_gcc900/pluginRecoPixelVertexingPixelTripletsPlugins.so
#16 0x00007fb95d8251ff in CAHitNtupletGeneratorOnGPU::makeTuplesAsync(TrackingRecHit2DHeterogeneous<cms::cudacompat::GPUTraits> const&, float, CUstream_st*) const [clone .cold] () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_3/lib/slc7_amd64_gcc900/pluginRecoPixelVertexingPixelTripletsPlugins.so
#17 0x00007fb95d834b3e in CAHitNtupletCUDA::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_3/lib/slc7_amd64_gcc900/pluginRecoPixelVertexingPixelTripletsPlugins.so

Current Modules:

Module: CAHitNtupletCUDA:pixelTracksCUDA (crashed)
Module: none

This is awfully similar to cms-patatrack#306 (and cms-patatrack/pixeltrack-standalone#188), but here already with a single process and a single EDM stream.

@makortel
Contributor

CAHitNtupletGeneratorOnGPU::makeTuplesAsync() does not have a check for zero hits. Adding one (similar to the one in makeTuples()) right after constructing tracks allowed the job to complete processing the event.

In makeTuples(), maybe it would make sense to move the check

if (0 == hits_d.nHits())
  return tracks;

earlier, right after
PixelTrackHeterogeneous tracks(std::make_unique<pixelTrack::TrackSoA>());

?
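
For illustration, a minimal sketch of the kind of early return being proposed; the type and the function signature below are simplified stand-ins, not the real CMSSW interfaces:

#include <memory>
#include <cuda_runtime.h>

struct TrackSoA {};                                         // stand-in for pixelTrack::TrackSoA
using PixelTrackHeterogeneous = std::unique_ptr<TrackSoA>;  // simplified stand-in

PixelTrackHeterogeneous makeTuplesSketch(unsigned int nHits, cudaStream_t stream) {
  PixelTrackHeterogeneous tracks = std::make_unique<TrackSoA>();
  if (0 == nHits) {
    // Return the empty product right after constructing it, before any device
    // buffers are allocated or kernels (e.g. classifyTuples) are launched;
    // this mirrors the check suggested above for makeTuples() and makeTuplesAsync().
    return tracks;
  }
  // ... allocate device buffers and launch the ntuplet-building kernels here ...
  return tracks;
}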

@makortel
Contributor

makortel commented Jul 28, 2021

I still don't understand why any cudaCheck(cudaDeviceSynchronize()) added before return in


does not catch the error. But adding it into the destructor of CAHitNtupletGeneratorKernelsGPU
~CAHitNtupletGeneratorKernels() = default;

does.
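
For reference, a stripped-down sketch of that destructor-side check (the class name and the check helper are illustrative, not the real CMSSW ones):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative stand-in for cudaCheck: report and abort on any CUDA error.
inline void checkOrAbort(cudaError_t err, char const* where) {
  if (err != cudaSuccess) {
    std::fprintf(stderr, "%s: %s\n", where, cudaGetErrorString(err));
    std::abort();
  }
}

struct KernelsDebug {  // plays the role of CAHitNtupletGeneratorKernelsGPU here
  ~KernelsDebug() {
    // Instead of "= default": synchronize in the destructor while debugging,
    // so that a pending asynchronous error (e.g. from classifyTuples) is
    // reported at this point rather than in some later, unrelated CUDA call.
    checkOrAbort(cudaDeviceSynchronize(), "~KernelsDebug");
  }
};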

Just by poking around a bit I found out that without the zero hits protection it is specifically

kernels.classifyTuples(hits_d, soa, stream);

that causes the "illegal memory access" error.

@makortel
Contributor

makortel commented Jul 28, 2021

Just by poking around a bit I found out that without the zero hits protection it is specifically

kernels.classifyTuples(hits_d, soa, stream);

that causes the "illegal memory access" error.

To be even more specific, the following piece appears to be causing the error (when not including the zero-hit check):

if (params_.doSharedHitCut_) {
  // remove duplicates (tracks that share a hit)
  numberOfBlocks = (HitToTuple::capacity() + blockSize - 1) / blockSize;
  kernel_sharedHitCleaner<<<numberOfBlocks, blockSize, 0, cudaStream>>>(
      hh.view(), tuples_d, tracks_d, quality_d, params_.minHitsForSharingCut_, device_hitToTuple_.get());
  cudaCheck(cudaGetLastError());
}

This piece of code was changed in 12_0_X in #33371.
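
Purely as a generic illustration of the launch pattern in that snippet, and not a claim about what exactly goes wrong inside kernel_sharedHitCleaner (that is what #33371 and the fix below address): the grid is sized from a fixed capacity with the usual ceil division, and the kernel body needs some guard so that threads do nothing for an empty event.

#include <cuda_runtime.h>

__global__ void cleanerSketch(int const* hitToTuple, int capacity, int nHits) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  // Guards: stay within the fixed capacity, and bail out entirely for an
  // empty event so nothing indexes into structures that were never filled.
  if (idx >= capacity || nHits == 0)
    return;
  // ... duplicate-removal logic would go here ...
}

void launchCleanerSketch(int const* d_hitToTuple, int capacity, int nHits, cudaStream_t stream) {
  constexpr int blockSize = 128;
  // Same ceil-division grid sizing as in the quoted snippet:
  // e.g. capacity = 1000, blockSize = 128  ->  (1000 + 127) / 128 = 8 blocks.
  int numberOfBlocks = (capacity + blockSize - 1) / blockSize;
  cleanerSketch<<<numberOfBlocks, blockSize, 0, stream>>>(d_hitToTuple, capacity, nHits);
}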

@mmusich
Contributor

mmusich commented Jul 29, 2021

Thanks @makortel for the detailed analysis; see the fix at #34684.

@perrotta
Contributor

perrotta commented Aug 3, 2021

Can anybody confirm that #34684 fixed the issue also on the Hilton?
If so, this GitHub issue can be closed (and we can concentrate on the HCAL ones reported in #34197).

@mzarucki
Contributor

mzarucki commented Aug 3, 2021

Dear all,

I have just summarised our (TSG FOG) investigations into the different Pixel, HCAL and ECAL crashes that we were seeing, related to the missing protections for when a subsystem is not included in the run: #34197

Concerning the Pixel PR #34684, as indicated (and reported in today's daily run meeting), we have tested it by pointing Hilton to a local CMSSW_11_3_3 installation, with the PR included on top. We did not see any crashes when using the PixelOnly menu (e-log) while running over data from run 343762, which excludes ECAL, HCAL and Pixel.

Best regards,
Mateusz on behalf of TSG FOG

@perrotta
Contributor

Can this issue be closed now, then?

@fwyzard
Contributor

fwyzard commented Aug 10, 2021

+heterogeneous

@mzarucki
Contributor

Dear all,

From the FOG side, I would like to report that we have tested the full GPU menu in CMSSW_11_3_4 in run 344449, with Pixel, ECAL and HCAL out of the run, and we saw no issues (as reported in this e-log and in today's Daily Run meeting just now). This confirms that the updated protections are working well.

Best regards,
Mateusz on behalf of FOG

@mmusich
Contributor

mmusich commented Aug 11, 2021

@cms-sw/hlt-l2 @cms-sw/reconstruction-l2 please sign

@Martin-Grunewald
Contributor

What is there to sign? This is not a PR. ??

@mmusich
Contributor

mmusich commented Aug 11, 2021

That's an issue: when it's resolved, it needs to be signed in order to be closed.

@Martin-Grunewald
Contributor

+1
ah ok!

@slava77
Contributor

slava77 commented Aug 11, 2021

+reconstruction

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.

@qliphy qliphy closed this as completed Aug 12, 2021