Segmentation fault in the quadruplets workflow on CPU #564
Comments
@VinInn could you have a look?
with pre7 cannot reproduce on patatrack02
and takes forever
with pre8 doesn't crash either...
do not understand: the file is served from Italy
This seems frequently reproducible with pre10:
results in
The full report is attached: crash.log.
this is suspicious (TLS, thread-local storage)
need to verify if it shows up consistently...
confirmed (it happens only when running from a local file; reading over xrootd does not crash)
minimal setup recompiled with -g
eventually it crashed
the arrays are not initialized on CPU?
cudaCompat issue
Fixed. This is the patch with all the asserts in place.
OK, I admit I'm confused: do we ever use a non-trivial blockIdx on the CPU?
I think yes (clustering?), when the blockIdx has a specific meaning (such as detectors). Why it is crashing now: no clue. Maybe streams and threads are not one-to-one.
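For context, here is a minimal sketch (illustrative names only, not the actual CMSSW clusterizer code) of the pattern being described: the block index carries a meaning of its own, one block per detector module, so the "kernel" is only correct if the driver sets up gridDim and blockIdx to match.

```cpp
#include <cstdint>

namespace sketch {
  struct dim3 {
    uint32_t x, y, z;
  };
  // stand-ins for the mutable, per-thread cudaCompat variables
  dim3 blockIdx = {0, 0, 0};
  dim3 gridDim = {1, 1, 1};

  constexpr uint32_t maxNumModules = 2000;  // illustrative value

  void clusterizeOneModule(uint32_t module) { /* per-module work */ }

  // "kernel": the block index *is* the module index
  void clusterizerKernel() {
    uint32_t module = blockIdx.x;  // specific meaning: one detector module per block
    if (module < maxNumModules)
      clusterizeOneModule(module);
  }

  // on the CPU the driver has to emulate the grid by hand
  void driver() {
    gridDim = {maxNumModules, 1, 1};
    for (uint32_t b = 0; b < gridDim.x; ++b) {
      blockIdx = {b, 0, 0};
      clusterizerKernel();
    }
    // restore the defaults, i.e. what resetGrid() does
    blockIdx = {0, 0, 0};
    gridDim = {1, 1, 1};
  }
}  // namespace sketch
```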
here you see
you are right: on the CPU we DO NOT run the Patatrack clusterizer...
They are not guaranteed to be, no. We also had rare cases where TBB, after a while, "retires" a worker thread and spawns a new one; if the thread_local blockIdx and gridDim are only set by the one-time resetGrid() initializer, such a new thread is left with the zero-initialized copies (gridDim = {0, 0, 0} instead of {1, 1, 1}).
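As an aside, a minimal sketch of that hazard, using std::thread instead of TBB for simplicity: the one-time static initializer (the InitGrid trick in cudaCompat.cc) only sets the thread_local copies on the thread that runs it, so a freshly spawned worker thread is left with the zero-initialized values, i.e. gridDim.x == 0 instead of 1.

```cpp
#include <cstdint>
#include <cstdio>
#include <thread>

struct dim3 {
  uint32_t x, y, z;
};

// per-thread copies, zero-initialized on every new thread
thread_local dim3 blockIdx;
thread_local dim3 gridDim;

void resetGrid() {
  blockIdx = {0, 0, 0};
  gridDim = {1, 1, 1};
}

// one-time static initializer, as in cudaCompat.cc: runs on a single thread only
struct InitGrid {
  InitGrid() { resetGrid(); }
};
const InitGrid initGrid;

int main() {
  std::printf("main thread:   gridDim.x = %u\n", gridDim.x);  // prints 1
  std::thread worker([] {
    // this thread never ran resetGrid(): its copy is still zero-initialized
    std::printf("worker thread: gridDim.x = %u\n", gridDim.x);  // prints 0
  });
  worker.join();
  return 0;
}
```

With gridDim.x == 0, any code that sizes its work by the grid misbehaves, which would fit the "arrays are not initialized" symptom above.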
ok, most probably we need to ...
With these changes

diff --git a/HeterogeneousCore/CUDAUtilities/interface/cudaCompat.h b/HeterogeneousCore/CUDAUtilities/interface/cudaCompat.h
index f9b4b2f8a4c1..e8aa4cdc1b06 100644
--- a/HeterogeneousCore/CUDAUtilities/interface/cudaCompat.h
+++ b/HeterogeneousCore/CUDAUtilities/interface/cudaCompat.h
@@ -21,11 +21,11 @@ namespace cms {
uint32_t x, y, z;
};
#endif
+
const dim3 threadIdx = {0, 0, 0};
+ const dim3 blockIdx = {0, 0, 0};
const dim3 blockDim = {1, 1, 1};
-
- extern thread_local dim3 blockIdx;
- extern thread_local dim3 gridDim;
+ const dim3 gridDim = {1, 1, 1};
template <typename T1, typename T2>
T1 atomicCAS(T1* address, T1 compare, T2 val) {
@@ -78,10 +78,7 @@ namespace cms {
return *x;
}
- inline void resetGrid() {
- blockIdx = {0, 0, 0};
- gridDim = {1, 1, 1};
- }
+ inline void resetGrid() {}
} // namespace cudacompat
} // namespace cms
diff --git a/HeterogeneousCore/CUDAUtilities/src/cudaCompat.cc b/HeterogeneousCore/CUDAUtilities/src/cudaCompat.cc
index 7b8efda8e381..0b94c8f1d4b8 100644
--- a/HeterogeneousCore/CUDAUtilities/src/cudaCompat.cc
+++ b/HeterogeneousCore/CUDAUtilities/src/cudaCompat.cc
@@ -1,12 +1,5 @@
#include "HeterogeneousCore/CUDAUtilities/interface/cudaCompat.h"
-namespace cms {
- namespace cudacompat {
- thread_local dim3 blockIdx;
- thread_local dim3 gridDim;
- } // namespace cudacompat
-} // namespace cms
-
namespace {
struct InitGrid {
InitGrid() { cms::cudacompat::resetGrid(); }
diff --git a/RecoPixelVertexing/PixelVertexFinding/plugins/gpuVertexFinderImpl.h b/RecoPixelVertexing/PixelVertexFinding/plugins/gpuVertexFinderImpl.h
index 0da24cef219e..987b0af91dbd 100644
--- a/RecoPixelVertexing/PixelVertexFinding/plugins/gpuVertexFinderImpl.h
+++ b/RecoPixelVertexing/PixelVertexFinding/plugins/gpuVertexFinderImpl.h
@@ -157,8 +157,8 @@ namespace gpuVertexFinder {
// std::cout << "found " << (*ws_d).nvIntermediate << " vertices " << std::endl;
fitVertices(soa, ws_d.get(), 50.);
// one block per vertex!
- blockIdx.x = 0;
- gridDim.x = 1;
+ assert(blockIdx.x == 0);
+ assert(gridDim.x == 1);
splitVertices(soa, ws_d.get(), 9.f);
resetGrid();
fitVertices(soa, ws_d.get(), 5000.);

everything runs fine, except for the clusterizer test, which sets

  gridDim.x = MaxNumModules;  // not needed in the kernel for this specific case

So... would it be OK to simply make blockIdx and gridDim const, as in the patch above?
Otherwise, would it make sense to wrap all CPU "kernel" calls in something like a launch() wrapper that sets them up per call?
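For illustration, a rough sketch of what such a wrapper could look like if blockIdx and gridDim were kept as mutable per-thread variables; the name launch and the signature are hypothetical, not the actual CMSSW API.

```cpp
#include <cstdint>

struct dim3 {
  uint32_t x, y, z;
};

// stand-ins for the per-thread cudaCompat variables
thread_local dim3 blockIdx = {0, 0, 0};
thread_local dim3 gridDim = {1, 1, 1};

// hypothetical helper: emulate the grid by looping over blocks around each
// "kernel" call, then restore the defaults (what resetGrid() used to do)
template <typename Kernel, typename... Args>
void launch(Kernel kernel, dim3 grid, Args&&... args) {
  gridDim = grid;
  for (uint32_t bz = 0; bz < grid.z; ++bz)
    for (uint32_t by = 0; by < grid.y; ++by)
      for (uint32_t bx = 0; bx < grid.x; ++bx) {
        blockIdx = {bx, by, bz};
        kernel(args...);  // no forwarding: the kernel may be invoked many times
      }
  blockIdx = {0, 0, 0};
  gridDim = {1, 1, 1};
}

// usage sketch: launch(splitVertices, {1, 1, 1}, soa, ws_d.get(), 9.f);
```

The patch above takes the opposite route: fix the grid to a single block, make the variables const, and push any remaining loop over "blocks" into the kernels themselves.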
At some point I did wrap every kernel in a modified version of "your" launch and modified all the drivers (.cu and .cc); many of them were identical at that point.
As far as the clusterizer is concerned, as I said, the kernel must be modified to be independent of the grid size.
Yes; I think that was #428?
About the timing: I'd still rather do it after the integration upstream, now that it's finally getting close to happening.
I take your word for it; I just can't find where the grid size is passed to the CPU kernel(s).
This code depends on the grid size; instead it should loop, the same way we loop over threadIdx. So on the CPU (in the test) we are forced to loop in the driver...
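A sketch of that rewrite, assuming the const values from the patch above (blockIdx = {0, 0, 0}, gridDim = {1, 1, 1}): the kernel loops over the block range itself, the same way it already loops over threadIdx, so it no longer cares how the driver sets up the grid.

```cpp
#include <cstdint>

struct dim3 {
  uint32_t x, y, z;
};

// const values, as proposed in the patch above
constexpr dim3 blockIdx = {0, 0, 0};
constexpr dim3 gridDim = {1, 1, 1};

constexpr uint32_t maxNumModules = 2000;  // illustrative value

void clusterizeOneModule(uint32_t module) { /* per-module work */ }

// before: uint32_t module = blockIdx.x;   (only correct if gridDim.x == maxNumModules)
// after:  a grid-stride style loop, independent of the grid size
void clusterizerKernel() {
  for (uint32_t module = blockIdx.x; module < maxNumModules; module += gridDim.x) {
    clusterizeOneModule(module);
  }
}
```

On the GPU the same loop runs with the real blockIdx/gridDim and reduces to the original one-module-per-block behaviour when the grid is large enough.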
OK, I see, so the kernel relies on the grid size.
Do you think we can change the kernels to run with a fixed one-block grid, looping inside the kernel instead? If the kernels can be changed to work with a fixed grid, we can make blockIdx and gridDim const, as in the patch above.
Otherwise (if the kernels cannot be made to use a fixed grid), we would need to wrap all CPU "kernel" calls in a launch()-style helper.
@VinInn, which of these two, or what other option, would you prefer? @makortel, since this also touches on the more "core" part of the compatibility layer, do you have any opinions?
Second question: should we do this now, or only after the integration upstream?
@tsusa @mmusich do you have any opinions about this last point?
I think the best option is to adopt the modification you (@fwyzard) propose. One possibility is to integrate your changes and "inhibit" the test.
The clusterizer was never integrated in the CPU workflows, as it requires the Raw2Digi to be ported first (and that was supposed to happen after the integration, to benefit from a coherent integration of the legacy and SoA code).
If the decision is to change the clusterizer kernel now, I can work on that this week (provided all other changes to LocalReco have already been integrated: I do not want to run into merging issues).
Looking at the open PRs (https://github.com/cms-patatrack/cmssw/pulls/) I don't think there should be conflicts, but I'm not 100% sure. So, if I understood your comments correctly, I think we could:
1. make blockIdx and gridDim const, as in the patch above;
2. change the clusterizer kernel to be independent of the grid size.
The first should happen before the integration (I can do it tomorrow or Wednesday).
@fwyzard
Thanks to a long DAQ meeting, the first part should be done by #586.
Clusterizer fixed in #588.
#586 looks good to me
In recent Patatrack releases (both CMSSW_11_2_0_pre7 with the current CMSSW_11_2_X_Patatrack branch, and CMSSW_11_2_0_pre8 with the current master branch) I see frequent crashes in the 11634.501 workflow, that is, Patatrack pixel quadruplets running on the CPU.
This may actually have been there for a while, and have been revealed by the update to the workflow (before, we were actually testing the legacy pixel tracks with the new fits).
I've observed this using the relvals and global tags from CMSSW_11_2_0_pre3 and CMSSW_11_2_0_pre7, so it's likely not dependent on the input data.
To reproduce it with pre7:
To reproduce it with pre8 (only the Patatrack branch and global tag are different):
The input file is available under /gpu_data/store/... on the online machines, under /data/store/... on vocms006, and otherwise over xrootd.