When running over datasets (in particular those with large files) on CRAB using the central nanoAOD-tools modules, jobs often fail because they hit their wall-clock time limit. The main offender seems to be jetmetUncertainties.py, as it does a lot of looping over the jets and the uncertainty sources in each event.
Would it be possible to optimise this (perhaps using vectorisation and tools from numpy/scipy) so the module can run faster and reduce the frequency at which the jobs fail? This is starting to become a bottleneck for our analysis because of the length of time it takes to fully run over a dataset successfully.
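To illustrate the kind of vectorisation meant here, below is a minimal sketch using numpy broadcasting to vary the jet pt for all uncertainty sources at once, instead of looping over jets and sources in Python. The function name `shifted_jet_pts`, the array layout, and the example numbers are assumptions made for illustration only; this is not the code in jetmetUncertainties.py.

```python
import numpy as np

def shifted_jet_pts(jet_pt, rel_unc):
    """Return up/down varied pts for all jets and all sources at once.

    jet_pt:  shape (n_jets,), nominal jet pt values for one event
    rel_unc: shape (n_sources, n_jets), relative uncertainty per source and jet
    (both layouts are hypothetical, chosen for this sketch)
    """
    jet_pt = np.asarray(jet_pt, dtype=float)
    rel_unc = np.asarray(rel_unc, dtype=float)
    # Broadcasting expands jet_pt along the source axis,
    # so no explicit Python loop over jets or sources is needed.
    up = jet_pt * (1.0 + rel_unc)
    down = jet_pt * (1.0 - rel_unc)
    return up, down

# Made-up example: three jets, two uncertainty sources.
jet_pt = np.array([50.0, 30.0, 20.0])
rel_unc = np.array([[0.02, 0.03, 0.05],
                    [0.01, 0.01, 0.02]])
up, down = shifted_jet_pts(jet_pt, rel_unc)
```

Both returned arrays have shape (n_sources, n_jets). Whether this helps in practice depends on how the uncertainties themselves are looked up, which this sketch does not cover.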
For bamboo (an RDataFrame-based analysis framework) I wanted to have the option to calculate the jet variations on-demand (and skip the postprocessing step), so I made a C++ implementation of jetmetUncertainties (calling python code from RDataFrame is a recent feature, and quite slow).
The fat jet variations are not there yet, but AK4PFchs jets and the Type-1 MET correction match with NanoAOD-tools within numerical precision for the tested configurations.
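For anyone who wants to repeat this kind of cross-check, here is a rough sketch of comparing one varied branch from two implementations. The helper name `agree_within_precision` and the tolerances are assumptions for illustration, not the procedure actually used for the comparison above.

```python
import numpy as np

def agree_within_precision(reference, candidate, rtol=1e-6, atol=1e-9):
    """Compare two arrays of the same varied quantity (e.g. jet pt).

    Returns (agrees, max_abs_diff). The tolerances are hypothetical
    defaults; choose values that suit the precision of the stored branches.
    """
    reference = np.asarray(reference, dtype=float)
    candidate = np.asarray(candidate, dtype=float)
    if reference.shape != candidate.shape:
        # A different number of entries is already a mismatch.
        return False, float("inf")
    if reference.size == 0:
        return True, 0.0
    max_abs_diff = float(np.max(np.abs(reference - candidate)))
    agrees = bool(np.allclose(reference, candidate, rtol=rtol, atol=atol))
    return agrees, max_abs_diff

# Made-up values standing in for the same branch from two implementations.
ok, diff = agree_within_precision([50.0, 30.0], [50.0, 30.0 + 1e-9])
```

Reporting the largest absolute difference alongside the pass/fail flag makes it easier to tell rounding noise from a real disagreement.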
The implementation code is in this file, and the build config here (some files are copied from CMSSW, otherwise the only dependencies are a recent ROOT and Boost, so it should be relatively straightforward to include - at least locally for comparing the speed).