Ecal and Hcal Local GPU Reco crashes on missing detector #34197
Comments
A new Issue was created by @Sam-Harper Sam Harper. @Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign reconstruction, heterogeneous |
FYI @cms-sw/ecal-dpg-l2 @cms-sw/hcal-dpg-l2 @vkhristenko |
Another corner case that needs to be tested and possibly addressed is the unpacking and local reconstruction of calibration data. It may also be useful to check whether data taken in non-standard conditions (different zero suppression, etc.) are properly handled. |
Is there a recipe to reproduce this? Running at P5 with a specific HLT configuration? @amassiro FYI even though I guess you know already. |
@mzarucki does FOG have some samples to reproduce the issues ? |
Hi all, @smorovic put the file from the run 342053 with ECAL in, HCAL out (e-log) that was crashing in NFS The HLT menu used was Cheers, PS. The test we did with HCAL in and ECAL out was run 342175 (e-log) |
Hi Mateusz,
thanks... however I think if we want the DPGs to investigate the problem,
we need some more conventional instructions.
1. can you provide the instructions (CMSSW release, etc.) and a full dump
of the menu that can be run out of the box ?
2. do we have files from the ECAL-only and HCAL-only runs ? if we do, could
you list them explicitly? if we don't, can you list explicitly some other
files that should be used ?
3. can you test the instructions (release, config file, input files) to
make sure the crash is reproducible on the online GPU machines (without
hilton/hltd) ?
Of course, "you" as in "anybody from FOG" :-)
This should make it easier for the DPGs to reproduce the problems, and thus
fix them.
.A
|
Hi Andrea, all, I came up with the simplest set of instructions to reproduce the errors, by copying the HLT menu file and raw input files from @smorovic into my directory on NFS ( As a first step, one would need to be logged into one of the GPU machines with a working area set up - all instructions are included in the GPU development Twiki: https://twiki.cern.ch/twiki/bin/viewauth/CMS/TriggerDevelopmentWithGPUs Concerning the working area, it could be either on NFS ( The release to set up is
where one would have to update the To recreate the HCAL crash from run 342053 with ECAL in and HCAL out (e-log), one would use the file that is already set up ( To recreate the ECAL crash from run 342175 with HCAL in and ECAL out (e-log), we have another raw file locally available from run 342110 with the same setup ( @fwyzard, @Sam-Harper please comment if there is anything else to add. Cheers, PS. @Sam-Harper has updated the |
@thomreis all offline releases are actually available from |
@fwyzard thanks I got it to work. I must have had a cached version of the twiki without those instructions. |
no, I just added them one hour ago :-) |
I have made PRs to |
thanks Thomas! |
@cms-sw/hcal-dpg-l2 is there a timeline for the HCAL DPG to have a look at this? Thanks! |
@mariadalfonso could you have a look, please? |
Hi @fwyzard, Yes, our previous tests have already shown this. I have re-done them and confirm that we see no crashes when running only one set of path types in 11_3_3 + fixes. We saw this already for ECAL in the pure 11_3_3 release, and we see this for Pixel (11_3_3 + Pixel PR #34684) and HCAL (11_3_3 + HCAL PR #34750). Cheers, |
I could reproduce the crash with the configuration of @mzarucki. It always happens with 2 or more threads. I slightly extended the error message printed just before the crash, and it seems that a wrong number of channels is passed in the digis.
In principle the number of channels should be zero.
|
The problem is that, if the error happens inside a kernel running asynchronously on the GPU, it will be reported by the first CUDA runtime call after it, in any thread. This can happen in the memory allocator or in the framework support calls, because they are quite frequent and kind of wrap all other CUDA modules, even if the error is not there. |
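To illustrate why the crash can surface far from its cause, here is a toy model in plain C++. It makes no CUDA calls, and `ToyRuntime`, `launch`, `allocate` and `whoReportsTheError` are hypothetical names invented for this sketch. It only assumes the behaviour described above: work is queued asynchronously, and a failure is reported by whichever runtime call comes next.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <utility>

// Toy model only: no CUDA calls are made and all names are hypothetical.
enum class Status { Success, IllegalAddress };

class ToyRuntime {
public:
  // "Launch" work asynchronously: it is only queued here, so the caller
  // cannot observe a failure at launch time.
  void launch(std::function<Status()> kernel) { pending_.push(std::move(kernel)); }

  // Any later runtime call (here a toy allocation) reports the first error
  // recorded so far, regardless of which module made the call.
  Status allocate(const std::string& caller) {
    drain();
    if (error_ != Status::Success) {
      lastReporter_ = caller;
    }
    return error_;
  }

  const std::string& lastReporter() const { return lastReporter_; }

private:
  // Models the queued work having completed by the time of the next call.
  void drain() {
    while (!pending_.empty()) {
      Status s = pending_.front()();
      pending_.pop();
      if (error_ == Status::Success) {
        error_ = s;  // the first error is kept
      }
    }
  }

  std::queue<std::function<Status()>> pending_;
  Status error_ = Status::Success;
  std::string lastReporter_;
};

// Returns the name of the caller that ends up reporting the failure.
std::string whoReportsTheError() {
  ToyRuntime rt;
  rt.launch([] { return Status::IllegalAddress; });  // faulty "kernel"
  Status s = rt.allocate("unrelated allocator call");
  assert(s == Status::IllegalAddress);
  return rt.lastReporter();
}
```

In this model the faulty work is launched by one piece of code, while the error is attributed to the allocation that happens to run next, which matches the misleading stack traces discussed here.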
While the crash happens in the |
looking at the code of
void EcalRawToDigiGPU::acquire(edm::Event const& event,
                               edm::EventSetup const& setup,
                               edm::WaitingTaskWithArenaHolder holder) {
  ...
  // unpack if at least one FED has data
  if (counter > 0) {
    ecal::raw::entryPoint(
        inputCPU, inputGPU, outputGPU_, scratchGPU, outputCPU_, conditions, ctx.stream(), counter, currentCummOffset);
  }
}

void EcalRawToDigiGPU::produce(edm::Event& event, edm::EventSetup const& setup) {
  cms::cuda::ScopedContextProduce ctx{cudaState_};

  // get the number of channels
  outputGPU_.digisEB.size = outputCPU_.nchannels[0];
  outputGPU_.digisEE.size = outputCPU_.nchannels[1];

  ctx.emplace(event, digisEBToken_, std::move(outputGPU_.digisEB));
  ctx.emplace(event, digisEEToken_, std::move(outputGPU_.digisEE));

  // reset ptrs that are carried as members
  outputCPU_.nchannels.reset();
}
my guess is that by not calling |
So, it might be enough to add
  outputCPU_.nchannels[0] = 0;
  outputCPU_.nchannels[1] = 0;
right before
|
or there may be other fields that should be initialised properly instead of skipping the call altogether - I don't know by heart |
Yes that is my guess as well. Checking that now. |
Confirmed. |
I can confirm it as well, with this patch
diff --git a/EventFilter/EcalRawToDigi/plugins/EcalRawToDigiGPU.cc b/EventFilter/EcalRawToDigi/plugins/EcalRawToDigiGPU.cc
index 4dcb1bd0e26e..36fdaeb4cfe9 100644
--- a/EventFilter/EcalRawToDigi/plugins/EcalRawToDigiGPU.cc
+++ b/EventFilter/EcalRawToDigi/plugins/EcalRawToDigiGPU.cc
@@ -134,6 +134,10 @@ void EcalRawToDigiGPU::acquire(edm::Event const& event,
     ++counter;
   }
 
+  // reset the number of channels
+  outputCPU_.nchannels[0] = 0;
+  outputCPU_.nchannels[1] = 0;
+
   // unpack if at least one FED has data
   if (counter > 0) {
     ecal::raw::entryPoint(
the ECAL plus Pixel job runs to completion:
|
So
in https://github.com/cms-sw/cmssw/blob/master/EventFilter/EcalRawToDigi/plugins/EcalRawToDigiGPU.cc#L108 does not initialise the object? |
I think it just allocates the memory, but doesn't perform any initialisation. |
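A minimal sketch of the failure mode and of the fix, in standalone C++ rather than the CMSSW code. `ReusingAllocator`, `unpack` and `processEvent` are hypothetical stand-ins, and the channel counts are arbitrary. The sketch assumes an allocator that hands back recycled storage without clearing it, so skipping the unpacking step leaves the previous event's counts in place.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a caching allocator: it returns the same
// storage on every request and never clears it.
struct ReusingAllocator {
  std::uint32_t storage[2] = {0, 0};
  std::uint32_t* allocate() { return storage; }
};

// Hypothetical stand-in for the unpacker: it writes the channel counts.
void unpack(std::uint32_t* nchannels, std::uint32_t eb, std::uint32_t ee) {
  nchannels[0] = eb;
  nchannels[1] = ee;
}

// Processes one event and returns the EB channel count that would be
// published downstream.
std::uint32_t processEvent(ReusingAllocator& alloc, int fedsWithData, bool resetFirst) {
  std::uint32_t* nchannels = alloc.allocate();  // allocated, not initialised

  if (resetFirst) {
    // the fix: start every event from zero channels
    nchannels[0] = 0;
    nchannels[1] = 0;
  }

  // unpack only if at least one FED has data
  if (fedsWithData > 0) {
    unpack(nchannels, 100, 50);
  }
  return nchannels[0];
}
```

Calling `processEvent(alloc, 3, false)` followed by `processEvent(alloc, 0, false)` publishes the stale count from the first event for the second, empty one, while passing `resetFirst = true` for the empty event publishes zero.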
OK. I'll prepare a PR. |
By the way, looking again at the code, it's a bit of a waste to deallocate and reallocate 2 integers at every event... But let's keep this separate from the fix itself. |
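One possible shape for that follow-up, sketched in standalone C++: keep the two counters alive for the lifetime of the object and only reset them per event. `ChannelCounter` and its methods are hypothetical. A real module would presumably still need pinned host memory for the asynchronous copy, which this sketch does not model.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical producer-like class, not the CMSSW module.
class ChannelCounter {
public:
  // Called once per event: the storage already exists, so each event
  // only pays for resetting the two counters.
  void beginEvent() { nchannels_.fill(0); }

  // Records the counts found while unpacking the current event.
  void record(std::uint32_t eb, std::uint32_t ee) {
    nchannels_[0] = eb;
    nchannels_[1] = ee;
  }

  std::uint32_t eb() const { return nchannels_[0]; }
  std::uint32_t ee() const { return nchannels_[1]; }

private:
  // Allocated once with the object instead of once per event.
  std::array<std::uint32_t, 2> nchannels_{};
};
```

With this layout the reset in `beginEvent()` takes over the role of the explicit zeroing added in the fix, so an event with no data still reports zero channels.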
PRs with the fix for the ECAL crash: |
Hi all, Just wanted to confirm with you that running our Hilton GPU tests with the full GPU menu [1] over run 343762 with all three PRs (Pixel #34684, HCAL #34750 and ECAL #34768) on top of CMSSW_11_3_3 we do not see any more crashes (as documented in this e-log). Thank you for the quick reaction. Best regards, [1] /cdaq/cosmic/commissioning2021/CRUZET/Cosmics_GPU/V2 |
Can this issue get closed, then? |
+heterogeneous |
This issue is fully signed and ready to be closed. |
Dear all, From the FOG side, I would like to report that we have tested the full GPU menu in CMSSW_11_3_4 in run 344449 with ECAL, HCAL and Pixel out of the run and we saw no issues (as reported in this e-log and today's Daily Run meeting just now). This confirms that the updated protections are working well. Best regards, |
A bug was exposed last MWGR in that both HCAL and ECAL local reconstruction on a GPU do not have protections when the respective detector is out.
This is explicitly in the HBHERecHitProducerGPU
and a similar crash was observed in ECAL
To reproduce simply run over any run with the appropriate detector missing.