Ecal and Hcal Local GPU Reco crashes on missing detector #34197

Sam-Harper · 2021-06-21T12:44:10Z

A bug was exposed last MWGR in that both HCAL and ECAL local reconstruction on a GPU do not have protections when the respective detector is out.

This is explicitly in the HBHERecHitProducerGPU

cmsRun: /build/cmsbld/jenkins/workspace/auto-builds/CMSSW_11_3_1-slc7_amd64_gcc900/build/CMSSW_11_3_1-build/tmp/BUILDROOT/ebcff69f73237d0d563ea1e7c33f2d83/opt/cmssw/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/src/RecoLocalCalo/HcalRecProducers/src/MahiGPU.cu:1064: void hcal::reconstruction::entryPoint(const hcal::reconstruction::InputDataGPU&, hcal::reconstruction::OutputDataGPU&, const hcal::reconstruction::ConditionsProducts&, hcal::reconstruction::ScratchDataGPU&, const hcal::reconstruction::ConfigParameters&, cudaStream_t): Assertion `startingSample == 0 || startingSample == 2' failed.

and a similar crash was observed in ECAL

To reproduce simply run over any run with the appropriate detector missing.

cmsbuild · 2021-06-21T12:44:31Z

A new Issue was created by @Sam-Harper Sam Harper.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel · 2021-06-21T12:59:26Z

assign reconstruction, heterogeneous

cmsbuild · 2021-06-21T12:59:46Z

New categories assigned: heterogeneous,reconstruction

@slava77,@fwyzard,@perrotta,@makortel,@jpata you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel · 2021-06-21T12:59:55Z

FYI @cms-sw/ecal-dpg-l2 @cms-sw/hcal-dpg-l2 @vkhristenko

fwyzard · 2021-06-21T13:33:38Z

An other corner case that needs to be tested and possibly addressed is the unpacking and local reconstruction of calibration data.

It may also be useful to check if data taken in non-standard conditions (different zero suppression, etc.) is properly handled.

thomreis · 2021-06-22T13:54:59Z

Is there a recipe to reproduce this? Running at P5 with a specific HLT configuration? @amassiro FYI even though I guess you know already.

fwyzard · 2021-06-23T16:59:56Z

@mzarucki does FOG have some samples to reproduce the issues ?

mzarucki · 2021-06-24T07:39:50Z

Hi all,

@smorovic put the file from the run 342053 with ECAL in, HCAL out (e-log) that was crashing in NFS /nfshome0/smorovic/gpu-test

The HLT menu used was /cdaq/cosmic/commissioning2021/MWGR4/Cosmics_GPUTest1_V2/V2 which requires the upgraded ConfDB (briefly described here).

Cheers,
Mateusz

PS. The test we did with HCAL in and ECAL out was run 342175 (e-log)

fwyzard · 2021-06-24T08:10:45Z

Hi Mateusz, thanks... however I think if we want the DPGs to investigate the problem, we need some more conventional instructions. 1. can you provide the instructions (CMSSW release, etc.) and a full fump of the menu that can be run out of the box ? 2. do we have files from the ECAL-only and HCAL-only runs ? if we do, could you list them explicitly? if we don't, vcan you list explicitly some other files that should be used ? 3. can you test the instructions (release, config file, input files) to make sure the crash is reproducible on the online GPU machines (without hikton/hltd) ? Of course, "you" as in "anybody from FOG" :-) This should make it easier for the DPGs to reproduce the problems, and thus fix them. .A

mzarucki · 2021-06-25T12:49:15Z

Hi Andrea, all,

I came up with the simplest set of instructions to reproduce the errors, by copying the HLT menu file and raw input files from @smorovic into my directory on NFS (/nfshome0/mzarucki/GPUTests).

As a first step, one would need to be logged into one of the GPU machines with a working area set up - all instructions are included in the GPU development Twiki: https://twiki.cern.ch/twiki/bin/viewauth/CMS/TriggerDevelopmentWithGPUs

Concerning the working area, it could be either on NFS (/nfshome0/) or on the /data/user/$USER directory of the machine, as described here.

The release to set up is CMSSW_11_3_1_patch1. You can copy the HLT GPU menu configuration file from: /nfshome0/mzarucki/GPUTests/Cosmics_GPUTest1_V2-V2.py. On top of the standard configuration, there is a block of code which sets up the DAQ source, including the local input file:

process.EvFDaqDirector = cms.Service(
    "EvFDaqDirector",
    runNumber=cms.untracked.uint32(342053),
    baseDir=cms.untracked.string("tmp"),
    buBaseDir=cms.untracked.string(
        "/nfshome0/mzarucki/CMSSW_11_3_1_patch1/src"
    ),
    useFileBroker=cms.untracked.bool(False),
    fileBrokerKeepAlive=cms.untracked.bool(True),
    fileBrokerPort=cms.untracked.string("8080"),
    fileBrokerUseLocalLock=cms.untracked.bool(True),
    fuLockPollInterval=cms.untracked.uint32(2000),
    requireTransfersPSet=cms.untracked.bool(False),
    selectedTransferMode=cms.untracked.string(""),
    mergingPset=cms.untracked.string(""),
    outputAdler32Recheck=cms.untracked.bool(False),
)

process.source.fileNames = cms.untracked.vstring("/nfshome0/mzarucki/GPUTests/ECALin_HCALout/run342053/run342053_ls0013_index000002.raw")
#process.source.fileNames = cms.untracked.vstring("/nfshome0/mzarucki/GPUTests/HCALin_ECALout/run342110/run342110_ls0023_index000000.raw")
process.source.fileListMode = True

where one would have to update the buBaseDir to your CMSSW directory.

To recreate the HCAL crash from run 342053 with ECAL in and HCAL out (e-log), one would use the file that is already set up (/nfshome0/mzarucki/GPUTests/ECALin_HCALout/run342053/run342053_ls0013_index000002.raw).

To recreate the ECAL crash from run 342175 with HCAL in and ECAL out (e-log), we have another raw file locally available from run 342110 with the same setup (/nfshome0/mzarucki/GPUTests/HCALin_ECALout/run342110/run342110_ls0023_index000000.raw). In the EvFDaqDirector one would have to update the runNumber, as well as the process.source.fileNames of course.

@fwyzard, @Sam-Harper please comment if there is anything else to add.

Cheers,
Mateusz

PS. @Sam-Harper has updated the hltGetConfiguration command for dumping GPU menus, however, this is not (yet) working on the GPU machines. Therefore, a local copy is the simplest alternative. The menus that were ran are also temporarily saved in /var/log/hltd/pid/ on the machines that were used in the DAQ configuration (fu-c2a02-37-[01-04]).

thomreis · 2021-06-30T12:31:43Z

Thanks @mzarucki for the detailed recipe. I think this should be enough to get going.
However, @fwyzard is the 11_3_1_patch1 release installed on the GPU machines? I find only 11_3_0_patch1.

fwyzard · 2021-06-30T12:39:55Z

@thomreis all offline releases are actually available from /cvmfs; I've added instruction for using them to the twiki: https://twiki.cern.ch/twiki/bin/view/CMS/TriggerDevelopmentWithGPUs#Setting_up_a_working_area_fo_AN2 .

thomreis · 2021-06-30T12:52:57Z

@fwyzard thanks I got it to work. I must have had a cached version of the twiki without those instructions.
I can now reproduce the crash.

fwyzard · 2021-06-30T13:58:00Z

no, I just added them one hour ago :-)

thomreis · 2021-06-30T16:14:32Z

I have made PRs to master and CMSSW_11_3_X with a fix for the ECAL out crash.

Sam-Harper · 2021-06-30T16:51:32Z

thanks Thomas!

Sam-Harper · 2021-06-30T16:54:12Z

@cms-sw/hcal-dpg-l2 is there a timeline the HCAL DPG to have a look at this? Thanks!

mseidel42 · 2021-06-30T18:07:03Z

@mariadalfonso could you have a look, please?

slava77 · 2021-07-22T16:09:08Z

+reconstruction

ECAL: Fix crash in ECAL local reco on GPU if ECAL is not in the run #34292 and Fix crash in ECAL local reco on GPU if ECAL is not in the run - 113X #34293
HCAL: HBHEGPU: protection when detector is out #34429 (master) and HBHEGPU: protection when detector is out #34428 (113X)

mzarucki · 2021-08-04T08:08:34Z

Hi @fwyzard,

Yes, our previous tests have already shown this. I have re-done them and confirm that we see no crashes when running only one set of path types in 11_3_3 + fixes. We saw this already for ECAL in the pure 11_3_3 release, and we see this for Pixel (11_3_3 + Pixel PR #34684) and HCAL (11_3_3 + HCAL PR #34750).

Cheers,
Mateusz

thomreis · 2021-08-04T08:19:06Z

I could reproduce the crash with the configuration of @mzarucki. It happens always with 2 or more threads.

I have extended a bit the error message just before the crash and it seems that there is a wrong number of channels passed in the digis.

%MSG-e EcalUncalibRecHitProducerGPU:   EcalUncalibRecHitProducerGPU:hltEcalUncalibRecHitGPU 04-Aug-2021 10:10:48 CEST  Run: 343762 Event: 2109
Max number of channels exceeded in barrel or endcap. Number of barrel channels: 2294272000 with maxNumberHitsEB=61200, number of endcap channels: 32689 with maxNumberHitsEE=14648
%MSG
----- Begin Fatal Exception 04-Aug-2021 10:10:54 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 343762 lumi: 1 event: 2109 stream: 1
   [1] Running path 'HLT_ECALHT1_GPU_v1'
   [2] Prefetching for module CaloTowersCreator/'hltTowerMakerEcal'
   [3] Prefetching for module EcalRecHitProducer/'hltEcalRecHit@cuda'
   [4] Prefetching for module EcalUncalibRecHitConvertGPU2CPUFormat/'hltEcalUncalibRecHit@cuda'
   [5] Prefetching for module EcalCPUUncalibRecHitProducer/'hltEcalUncalibRecHitSoA'
   [6] Calling method for module EcalUncalibRecHitProducerGPU/'hltEcalUncalibRecHitGPU'
Exception Message:
A std::exception was thrown.

/data/user/treis/ecal_local_reco/CMSSW_11_3_3/src/RecoLocalCalo/EcalRecProducers/plugins/EcalUncalibRecHitMultiFitAlgoGPU.cu, line 118:
cudaCheck(cudaGetLastError());
cudaErrorInvalidConfiguration: invalid configuration argument
----- End Fatal Exception -------------------------------------------------

In principle the number of channels should be zero.
The crash is not always the same after the error message. I also got this one already:

terminate called after throwing an instance of 'std::runtime_error'
  what():  
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_11_3_3-slc7_amd64_gcc900/build/CMSSW_11_3_3-build/tmp/BUILDROOT/402e2a5eeeb9630ea9f5469bb50cc947/opt/cmssw/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_3/src/HeterogeneousCore/CUDACore/src/ScopedContext.cc, line 86:
cudaCheck(cudaStreamAddCallback(stream, cudaScopedContextCallback, new CallbackData{waitingTaskHolder_, device}, 0));
cudaErrorIllegalAddress: an illegal memory access was encountered

fwyzard · 2021-08-04T08:28:13Z

The crash is not always the same after the error message.

The problem is that - if the error happens inside a kernel, running asynchronously on the GPU - it till be reported by the first CUDA runtime call after it, in any thread.

This can happen in the memory allocator or in the framework support calls, because they are quite frequent and kind of wrap all other CUDA modules, even if the error is not there.

thomreis · 2021-08-04T08:31:33Z

While the crash happens in the EcalUncalibRecHitProducerGPU the reason for it seems to be in the EcalRawToDigiGPU module already where the false number of channels is stored in the digis.

fwyzard · 2021-08-04T08:42:26Z

looking at the code of EventFilter/EcalRawToDigi/plugins/EcalRawToDigiGPU.cc:

void EcalRawToDigiGPU::acquire(edm::Event const& event,
                               edm::EventSetup const& setup,
                               edm::WaitingTaskWithArenaHolder holder) {
...
  // unpack if at least one FED has data
  if (counter > 0) {
    ecal::raw::entryPoint(
        inputCPU, inputGPU, outputGPU_, scratchGPU, outputCPU_, conditions, ctx.stream(), counter, currentCummOffset);
  }
}

void EcalRawToDigiGPU::produce(edm::Event& event, edm::EventSetup const& setup) {
  cms::cuda::ScopedContextProduce ctx{cudaState_};

  // get the number of channels
  outputGPU_.digisEB.size = outputCPU_.nchannels[0];
  outputGPU_.digisEE.size = outputCPU_.nchannels[1];

  ctx.emplace(event, digisEBToken_, std::move(outputGPU_.digisEB));
  ctx.emplace(event, digisEEToken_, std::move(outputGPU_.digisEE));

  // reset ptrs that are carried as members
  outputCPU_.nchannels.reset();
}

my guess is that by not calling ecal::raw::entryPoint(...), outputCPU_.nchannels is not properly set, and outputGPU_.digisEB.size gets a random value (same for EE).

fwyzard · 2021-08-04T08:43:26Z

So, it might be enough to add

  outputCPU_.nchannels[0] = 0;
  outputCPU_.nchannels[1] = 0;

right before

  // unpack if at least one FED has data
  if (counter > 0) {
    ecal::raw::entryPoint(
        inputCPU, inputGPU, outputGPU_, scratchGPU, outputCPU_, conditions, ctx.stream(), counter, currentCummOffset);
  }

fwyzard · 2021-08-04T08:43:59Z

or there may be other fields that should be initialised properly instead of skipping the call altogether - I don't know by heart

thomreis · 2021-08-04T08:49:02Z

Yes that is my guess as well. Checking that now.

thomreis · 2021-08-04T08:51:19Z

Confirmed.

fwyzard · 2021-08-04T08:52:28Z

I can confirm it as well, with this patch

diff --git a/EventFilter/EcalRawToDigi/plugins/EcalRawToDigiGPU.cc b/EventFilter/EcalRawToDigi/plugins/EcalRawToDigiGPU.cc
index 4dcb1bd0e26e..36fdaeb4cfe9 100644
--- a/EventFilter/EcalRawToDigi/plugins/EcalRawToDigiGPU.cc
+++ b/EventFilter/EcalRawToDigi/plugins/EcalRawToDigiGPU.cc
@@ -134,6 +134,10 @@ void EcalRawToDigiGPU::acquire(edm::Event const& event,
     ++counter;
   }
 
+  // reset the number of channels
+  outputCPU_.nchannels[0] = 0;
+  outputCPU_.nchannels[1] = 0;
+
   // unpack if at least one FED has data
   if (counter > 0) {
     ecal::raw::entryPoint(

the ECAL plus Pixel job runs to completion:

TrigReport ---------- Path   Summary ------------
TrigReport  Trig Bit#   Executed     Passed     Failed      Error Name
...
TrigReport     1   38      21765      21765          0          0 Status_OnGPU
TrigReport     1   39      21765          0      21765          0 HLT_L1RandomTrigType_v1
TrigReport     1   40      21765      21765          0          0 HLT_L1PhysicsTrigType_v1
TrigReport     1   41      21765      21765          0          0 HLT_Pixel_v1
TrigReport     1   42      21765          0      21765          0 HLT_ECALHT1_v1
TrigReport     1   43      21765          0      21765          0 HLT_ECALHT1_NoGPU_v1
TrigReport     1   44      21765          0      21765          0 HLT_ECALHT1_GPU_v1

thomreis · 2021-08-04T08:56:50Z

So

  // output cpu
  outputCPU_ = {cms::cuda::make_host_unique<uint32_t[]>(2, ctx.stream())};

in https://github.com/cms-sw/cmssw/blob/master/EventFilter/EcalRawToDigi/plugins/EcalRawToDigiGPU.cc#L108 does not initialise the object?

fwyzard · 2021-08-04T09:00:05Z

I think it just allocates the memory, but doesn't perform any initialisation.

thomreis · 2021-08-04T09:02:14Z

OK. I'll prepare a PR.
Thanks for the help.

fwyzard · 2021-08-04T09:03:06Z

By the way, looking again at the code, it's a bit of a waste to deallocate and reallocate 2 integers at every event...
It should be few ms faster to allocate them once (in the constructor, but only if CUDA is available) and simply resetting the values to 0 at the end of produce.

But let's keep this separate from the fix itself.

thomreis · 2021-08-04T13:41:33Z

PRs with the fix for the ECAL crash:

mzarucki · 2021-08-04T21:27:06Z

Hi all,

Just wanted to confirm with you that running our Hilton GPU tests with the full GPU menu [1] over run 343762 with all three PRs (Pixel #34684, HCAL #34750 and ECAL #34768) on top of CMSSW_11_3_3 we do not see any more crashes (as documented in this e-log).

Thank you for the quick reaction.

Best regards,
Mateusz on behalf of FOG

[1] /cdaq/cosmic/commissioning2021/CRUZET/Cosmics_GPU/V2

perrotta · 2021-08-10T06:41:43Z

Can this issue get closed, then?

fwyzard · 2021-08-10T07:29:02Z

+heterogeneous

cmsbuild · 2021-08-10T07:29:25Z

This issue is fully signed and ready to be closed.

mzarucki · 2021-08-10T07:43:03Z

Dear all,

From the FOG side, I would like to report that we have tested the full GPU menu in CMSSW_11_3_4 in run 344449 with ECAL, HCAL and Pixel out of the run and we saw no issues (as reported in this e-log and today's Daily Run meeting just now). This confirms that the updated protections as working well.

Best regards,
Mateusz on behalf of FOG

cmsbuild added the pending-assignment label Jun 21, 2021

cmsbuild added heterogeneous-pending pending-signatures reconstruction-pending and removed pending-assignment labels Jun 21, 2021

This was referenced Jun 30, 2021

Fix crash in ECAL local reco on GPU if ECAL is not in the run #34292

Merged

Fix crash in ECAL local reco on GPU if ECAL is not in the run - 113X #34293

Merged

This was referenced Jul 9, 2021

HBHEGPU: protection when detector is out #34428

Merged

HBHEGPU: protection when detector is out #34429

Merged

czangela mentioned this issue Jul 15, 2021

Pixel Local GPU reco crashes on missing detector input (no Pixel FED) #34496

Closed

cmsbuild removed the reconstruction-pending label Jul 22, 2021

This was referenced Aug 4, 2021

Fix crash of ECAL GPU reco when ECAL is out of the run. #34765

Merged

Fix crash of ECAL GPU reco when ECAL is out of the run - 12_0_X #34766

Merged

Fix crash of ECAL GPU reco when ECAL is out of the run - 11_3_X #34768

Merged

This was referenced Aug 5, 2021

Protect HCAL GPU-related modules against empty events [12.0.x] #34749

Merged

Protect HCAL GPU-related modules against empty events [11.3.x] #34750

Merged

cmsbuild added fully-signed heterogeneous-approved and removed pending-signatures heterogeneous-pending labels Aug 10, 2021

qliphy closed this as completed Aug 10, 2021

Ecal and Hcal Local GPU Reco crashes on missing detector #34197

Ecal and Hcal Local GPU Reco crashes on missing detector #34197

Comments

Sam-Harper commented Jun 21, 2021 • edited Loading

cmsbuild commented Jun 21, 2021

makortel commented Jun 21, 2021

cmsbuild commented Jun 21, 2021

makortel commented Jun 21, 2021

fwyzard commented Jun 21, 2021 • edited Loading

thomreis commented Jun 22, 2021

fwyzard commented Jun 23, 2021

mzarucki commented Jun 24, 2021 • edited Loading

fwyzard commented Jun 24, 2021 via email

mzarucki commented Jun 25, 2021 • edited Loading

thomreis commented Jun 30, 2021

fwyzard commented Jun 30, 2021

thomreis commented Jun 30, 2021

fwyzard commented Jun 30, 2021

thomreis commented Jun 30, 2021

Sam-Harper commented Jun 30, 2021

Sam-Harper commented Jun 30, 2021

mseidel42 commented Jun 30, 2021

slava77 commented Jul 22, 2021

mzarucki commented Aug 4, 2021 • edited Loading

thomreis commented Aug 4, 2021

fwyzard commented Aug 4, 2021

thomreis commented Aug 4, 2021

fwyzard commented Aug 4, 2021

fwyzard commented Aug 4, 2021

fwyzard commented Aug 4, 2021

thomreis commented Aug 4, 2021

thomreis commented Aug 4, 2021

fwyzard commented Aug 4, 2021

thomreis commented Aug 4, 2021

fwyzard commented Aug 4, 2021

thomreis commented Aug 4, 2021

fwyzard commented Aug 4, 2021

thomreis commented Aug 4, 2021

mzarucki commented Aug 4, 2021 • edited Loading

perrotta commented Aug 10, 2021

fwyzard commented Aug 10, 2021

cmsbuild commented Aug 10, 2021

mzarucki commented Aug 10, 2021

Sam-Harper commented Jun 21, 2021 •

edited

Loading

fwyzard commented Jun 21, 2021 •

edited

Loading

mzarucki commented Jun 24, 2021 •

edited

Loading

mzarucki commented Jun 25, 2021 •

edited

Loading

mzarucki commented Aug 4, 2021 •

edited

Loading

mzarucki commented Aug 4, 2021 •

edited

Loading