
List of files


This page lists all of the different files found in the STEAM rates estimation tool, with a short explanation of their purpose. For most of them, more detailed comments are available inside the files.

Prod directory

The Prod directory is where you can create and submit condor jobs which apply the HLT trigger you want to study to data or MC files. This is where you do "Step 1" of the rates estimation workflow.

run_steamflow_cfg.py

Short script meant to be used in conjunction with the list of input files list_cff.py and the hlt_config.py file, which you have to produce yourself (see the overview of the workflow). run_steamflow_cfg.py will run over all files in list_cff.py, keeping only the information on whether the HLT triggers fired or not. Optionally, it can also switch the L1 prescale column used if you're running over data. You can edit the running options (maximum number of events, whether to switch the L1 column) at the beginning of the file. However, do not change the output file name, since that name is hardcoded elsewhere.
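
As an illustration, the editable block at the top of run_steamflow_cfg.py might look something like the sketch below; the option names here are placeholders, so check the file itself before editing.

    # Illustrative sketch only; the real option names and values are defined at the top of run_steamflow_cfg.py.
    nEvents = 100             # hypothetical option: maximum number of events to process (-1 = all)
    switchL1PsColumn = False  # hypothetical option: switch the L1 prescale column (data only)

    # Later in the file such an option would typically be wired into the CMSSW process like this:
    # process.maxEvents = cms.untracked.PSet(input = cms.untracked.int32(nEvents))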

list_cff.py

List of input files for run_steamflow_cfg.py. For data, you need to edit it manually; for MC, it can be created automatically from the datasets (recommended, unless you're doing a short test).

cmsCondorData.py

Script to create condor jobs for running over data. It takes as mandatory arguments the name of the configuration file used to run your trigger jobs (recommended: run_steamflow_cfg.py), the top of your CMSSW release (bla/CMSSW_X_X_X/src) and the output directory (needs to already exist). Options: number of data files per job (recommended: 1, which is also the default), job flavour (determines how long the job will run), path to your grid proxy (only necessary if files aren't available at CERN):

./cmsCondorData.py run_steamflow_cfg.py bla/CMSSW_X_X_X/src <remoteDir> -p <proxyPath> -n <nPerJob> -q <jobFlavour>

The option -n 1 is recommended because of potential issues in Step 2 of the rates calculation, when the JSON is used to count the number of lumi sections (LS) processed. If n > 1, it can (rarely) happen that two data files belonging to different datasets (e.g. HLTPhysics1 and HLTPhysics2) but covering the same LS range are grouped together, in which case the LS counter will mistakenly skip the LS coming from the second dataset. The rates normalization would then be wrong.
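
For concreteness, a typical data submission (with placeholder paths, and no proxy because the files are assumed to be available at CERN) might look like:

./cmsCondorData.py run_steamflow_cfg.py /afs/cern.ch/work/u/username/CMSSW_X_X_X/src /eos/user/u/username/rates_step1 -n 1 -q workday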

The cmsCondorData.py script will create a certain number of jobs, according to the -n option used and the total number of files in list_cff.py. Each job has its own directory, with an automatically generated configuration file run_cfg.py and the condor job itself sub_0.sh. run_cfg.py is a hybrid of run_steamflow_cfg.py and hlt_config.py and takes only n input files (as specified by your option). The error messages, terminal output, and log from the condor job will appear in the job's dedicated directory.

The cmsCondorData.py script also creates the master file for all the condor jobs, condor_cluster.sub. condor_cluster.sub specifies which condor jobs to run, the proxy to use (if specified), the names of the output, log and error files and the job flavour. You can find information about condor syntax here.
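
For reference, a submit file of this kind follows standard HTCondor syntax; a minimal sketch (not the exact file produced by the script) might look like:

    # Generic sketch of an HTCondor submit file of this kind; the file actually generated
    # by cmsCondorData.py may organize and name things differently.
    executable    = Job_0/sub_0.sh
    output        = Job_0/job_0.out
    error         = Job_0/job_0.err
    log           = Job_0/job_0.log
    # the next line is only present if a proxy path was given
    x509userproxy = /path/to/proxy
    +JobFlavour   = "workday"
    queue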

cmsCondorMC.py

Similar to cmsCondorData.py, except it creates jobs to run over MC rather than data. The main difference is that with MC the recommended procedure is to use the option to run over many datasets at once, and generate list_cff.py automatically. The job creation is thus a little more convoluted than for data, with directories created for each dataset, and procedures to make sure that output files are sent to the directory corresponding to the correct dataset. But otherwise the basic structure is the same.

The recommended way to run cmsCondorMC.py is with a proxy, which allows list_cff.py to be generated automatically. It's also possible to run without a proxy, in which case you'll need to manually edit list_cff.py with input files available locally at CERN. You can also edit the name of the hardcoded dataset at the beginning of cmsCondorMC.py (this isn't strictly necessary but allows you to keep track of what you're doing). This feature is still a little clunky and not recommended.

getMenu.sh

Shows an (obsolete) example of the hltGetConfiguration command.

Rates directory

In this directory you'll find the scripts necessary to get HLT rates from ROOT files containing information about which trigger paths fired in which events. This is "Step 2" of the rates estimation workflow, and you can run on ROOT files you created yourself in "Step 1", on files produced by somebody else, or even on data ROOT files produced centrally if you want to study an old HLT menu that has already been deployed online.

config_*.py files

These are just wrapper scripts which call other scripts in the same directory. They are designed so that you can easily customize the way you want to run trigger counting jobs, or the way you want to merge the outputs of these jobs. It's always recommended to use these "config" files rather than trying to use the scripts they call individually.

aux.py

A file which stores a bunch of disparate auxiliary functions used by many of the scripts described below. When you're looking at a script and don't recognize a function, check if it's imported from aux.py. If it is, you'll find more documentation by looking up the function name in aux.py.

triggerCountsFromTriggerResults.py

This is the most important script in the directory: it counts the number of times each trigger path fired and, if you use the "somemaps" or "allmaps" options, it also counts dataset/group/stream rates, as well as trigger-dataset and dataset-dataset overlap rates. It produces multiple csv tables and one root file (for the overlap plots) as outputs. It runs over one ROOT file at a time:

python triggerCountsFromTriggerResults.py -i <inputfile> -j <json/dataset> -s <finalstring> -f <filetype> -m <maps> -M <maxEvents>

Note: some of the "options" are in fact mandatory (the code could be revised to turn them into arguments). A filled-in example invocation is given after the list below.

  • -i <inputfile> (mandatory): Write here the one ROOT file over which you want to run. The ROOT file needs to have information about which trigger paths fired in each event.

  • -j <json/dataset> (mandatory): If running over data: write here the text file with the LS range you need in json format. If running over MC: write here the name of the MC dataset.

  • -s <finalstring> (mandatory): Write a string here to give a unique tag to the names of the many output files which will be produced; e.g. if you write "test", your outputs will be called output.path.physics.test.csv, histos.test.root, etc.

  • -f <filetype> (optional): <filetype>="custom" is the default, use it if you're running over STEAM ROOT files produced in "Step 1" of the rates estimation workflow. <filetype>="L1Accept" if you're running over L1Accept files and studying scouting triggers. <filetype>="RAW" if you're running over data files with HLT information that were NOT produced in Step 1, and you're not particularly interested in scouting triggers. This is useful when you want to study an HLT menu that has already been deployed online.

  • -m <maps> (optional): "nomaps" (default option, use none of the maps found in Menu_HLT.py), "somemaps" (use the maps allowing estimation of the dataset/group/stream rates) or "allmaps" (use all maps used in "somemaps" and also maps describing dataset merging).

  • -M <maxEvents> (optional): set a maximum number of events to be processed (useful for tests).
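
For example, a run over a single Step-1 data file, with the dataset/group/stream maps enabled and capped at 100000 events (file names are placeholders), could look like:

python triggerCountsFromTriggerResults.py -i hltbits_0.root -j lumi_sections.json -s test -f custom -m somemaps -M 100000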

The general structure of the code is as follows: it begins by retrieving the options used and initializing many variables. Then there is a loop over events which counts how many times each trigger fires (some of the initialization is done in the first event of the loop). If statements check whether the maps from Menu_HLT.py are used and, if they are, extra counters are also incremented (counters for datasets/groups/streams and overlaps). Much care is taken to avoid double counting. At the end of the script, the output files (csv and root) are created and filled with the counts computed in the event loop.
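
As an illustration (a rough sketch with invented names, not the actual implementation), the counting logic amounts to something like:

    # Rough sketch of the counting logic; all names here are hypothetical.
    def count_triggers(events, dataset_of_path=None):
        """events: iterable where each entry is the list of HLT paths accepted in one event.
        dataset_of_path: optional map used in the "somemaps"/"allmaps" modes."""
        path_counts, dataset_counts = {}, {}
        for fired_paths in events:
            for path in fired_paths:
                path_counts[path] = path_counts.get(path, 0) + 1
            if dataset_of_path is not None:
                # a set avoids double counting a dataset when several of its paths fire in the same event
                for ds in {dataset_of_path[p] for p in fired_paths if p in dataset_of_path}:
                    dataset_counts[ds] = dataset_counts.get(ds, 0) + 1
        return path_counts, dataset_counts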

More detailed comments are available inside the code itself.

Menu_HLT.py

This file contains the maps which are used in the advanced "maps" options of triggerCountsFromTriggerResults.py. You need to keep it updated as the HLT menu changes, see the instructions here.
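
Conceptually, the maps are Python dictionaries relating trigger paths, datasets, groups and streams. A toy sketch with invented names (the real structure in Menu_HLT.py may differ):

    # Toy sketch only; check Menu_HLT.py for the real map names and structure.
    pathToDatasets = {
        "HLT_IsoMu24_v15":           ["SingleMuon"],
        "HLT_Ele32_WPTight_Gsf_v16": ["EGamma"],
    }
    datasetToStream = {
        "SingleMuon": "PhysicsMuons",
        "EGamma":     "PhysicsEGamma",
    }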

condorScriptForRatesData.py

This is the script that automatically generates condor jobs for trigger counts. The script first runs make_ratesFilesInputData.py to create a list of input root files to run over. Then it creates a "basic" job for each of those input files. Each "basic" job first executes triggerCountsFromTriggerResults.py to get the trigger counts and then handleFileTransfer.py to transfer the output files with the counts to an organized output folder. The "basic" jobs are grouped together into higher-level jobs, according to the number of files per job specified by the user. The higher-level jobs are the ones submitted to condor.

A master file for submitting the condor jobs is created (its structure is similar to that used in Prod/cmsCondorData.py, see also the condor documentation). There's also a "sub_total.jobb" script created, which first removes any old job output which may exist and then submits the freshly created condor jobs.

condorScriptForRatesData.py takes many options, some of which are mandatory (a filled-in example is given after the list below):

python condorScriptForRatesData.py -j <json> -e <CMSSWrel> -i <infilesDir> -f <filetype> -n <nPerJob> -q <jobFlavour> -m <maps>

  • -j <json> (mandatory): full path to the luminosity json file

  • -e <CMSSWrel> (mandatory): directory where the top of a CMSSW release is located (write /full/path/to/CMSSW_X_X_X/src)

  • -i <infilesDir> (optional): directory where the input root files are located. By default, the code will take whatever is already in the existing filesInputData.py file. If you specify a new directory, a new file is created, but the old one is copied so you have an opportunity to recover it.

  • -f <filetype> (optional): only a few strings are accepted: "custom" (default option) or "RAW" or "L1Accept". See the documentation for the trigger counts script.

  • -n <nPerJob> (optional): number of files processed per condor job (default=5).

  • -q <flavour> (optional): job flavour, determines how long a condor job will run. Default is workday. Again, always check the condor documentation when you have doubts.

  • -m <maps> (optional): must be either "nomaps" (default), "somemaps" or "allmaps". Determines if the maps stored in Menu_HLT.py are used, see the documentation for the trigger counts script.
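
For example, a data run with the default file type, five files per job and the dataset/group/stream maps enabled (paths are placeholders) might be launched like this:

python condorScriptForRatesData.py -j /path/to/lumi_sections.json -e /afs/cern.ch/work/u/username/CMSSW_X_X_X/src -i /eos/user/u/username/rates_step1 -f custom -n 5 -q workday -m somemaps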

condorScriptForRatesMC.py

Very similar to condorScriptForRatesData.py, except this script generates condor jobs for MC rather than data. The main difference is that MC files are organised by dataset, and the script makes sure to organise the condor jobs by dataset too.

make_ratesFilesInputData.py

Short script to automatically generate a list of input files from data (called filesInputData.py), starting from a user-provided directory. The script will look for files ending in ".root" up to 10 layers of subdirectories deep inside the input directory provided.

To run it: python make_ratesFilesInputData.py -i /path/to/input/dir.

Be careful: the script makes a few assumptions about the structure of the directory where it's located. I haven't seen it happen, but it could crash ungracefully if given a wrong directory.

make_ratesFilesInputMC.py

Same as make_ratesFilesInputData.py, but for MC. Again, it's a little more complicated because of the organisation by dataset.

To run it: python make_ratesFilesInputMC.py -i /path/to/input/dir.

Give as an argument the directory whose subdirectories are named after all the MC datasets you want to consider.

The code first copies ../MCDatasets/map_MCdatasets_xs.py in order to have a local, updated version of the list of MC datasets to consider. Then it looks for root files in the input directory provided, going one level deeper (up to 10 levels) if none are found. When a root file is found, the code determines the name of the corresponding dataset directory by looking at the last subdirectory considered. It then checks whether this directory is present in map_MCdatasets_xs.py. If so, the code updates the file filesInputMC.py with the name of the directory as well as all the ROOT files corresponding to it.
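
A rough sketch of that search logic, with hypothetical names (the real script may differ in its details):

    import os

    # Toy sketch of the recursive search described above; names are hypothetical.
    def find_dataset_files(top_dir, known_datasets, max_depth=10):
        files_per_dataset = {}
        for root, dirs, files in os.walk(top_dir):
            depth = root[len(top_dir):].count(os.sep)
            if depth > max_depth:
                dirs[:] = []                          # don't descend any further
                continue
            root_files = [f for f in files if f.endswith(".root")]
            if not root_files:
                continue
            dataset = os.path.basename(root)          # "last subdirectory considered"
            if dataset in known_datasets:             # i.e. listed in map_MCdatasets_xs.py
                files_per_dataset.setdefault(dataset, []).extend(
                    os.path.join(root, f) for f in root_files)
        return files_per_dataset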

handleFileTransfer.py

Script that transfers output csv and root files from one trigger counting job to the correct output directories. This allows the output of a job to be well organized. The magic here happens with the mergeNames dictionary, imported from the aux.py file. The dictionary explains to which subdirectory each output file should go.

This script is not meant to be used in a standalone way; it's better to let the jobs handle it. If you want to add a new kind of output file, you can always extend the mergeNames dictionary in the aux.py file.
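
For illustration, mergeNames is essentially a map from output-file name patterns to target subdirectories; a toy sketch (the real keys and directory names in aux.py will differ):

    # Toy sketch only; see the real mergeNames dictionary in aux.py.
    mergeNames = {
        "output.path.physics": "PhysicsPaths",   # per-path csv files
        "output.dataset":      "Datasets",       # per-dataset csv files
        "histos":              "RootFiles",      # overlap histograms
    }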

mergeOutputs.py

Script that, for data, merges and scales the output csv and root files from all trigger counting jobs. Since each job produces many files with different names (a "global" file, a file for individual HLT physics paths, a file for physics datasets, etc.), the merge results in several output files, one for each category.

For MC, the script also works, but you need to specify the dataset for which you wish to perform the merge, so you would need to run the script once for each dataset. A better approach is to use the script prepareMergeOutputsMC.py (see here), which automatically handles all datasets and calls mergeOutputs.py once for each dataset.

To run the script: python mergeOutputs.py -w <dataMC> -l <lumiin> -t <lumitarget> -p <hltps> -d <dir> -m <maps> -f (a filled-in example is given after the option list below)

where:

  • -w <dataMC> (optional): write here "data" if running on data or the name of the MC dataset considered, if running on MC. Default = "data".

  • -l <lumiin> (optional): write the value corresponding to the average instant lumi in your data json. No need to specify if MC.

  • -t <lumitarget> (mandatory): write the value corresponding to the target instant lumi for which you wish to calculate your rates.

  • -p <hltps> (optional): prescale of the HLT_physics trigger if running on data. No need to specify for MC.

  • -d <dir> (optional): directory where the outputs of the trigger counting jobs are located. For MC this needs to include the name of the dataset. Default = "Results/Data/Raw", which only works for data and only if the outputs are in the expected location.

  • -m <maps> (optional): must be either "nomaps" (default), "somemaps" or "allmaps". Determines if the maps stored in Menu_HLT.py are used, see the documentation for the trigger counts script.

  • -f (optional): if used, this option merges the root files which are used to produce trigger-dataset and dataset-dataset correlation figures. By default root files are NOT merged.
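
For example, a data merge scaled from an average instantaneous luminosity of 1.0e34 to a target of 2.0e34, with an HLT_Physics prescale of 600 and root-file merging enabled (all numbers are placeholders), would look like:

python mergeOutputs.py -w data -l 1.0e34 -t 2.0e34 -p 600 -d Results/Data/Raw -m somemaps -f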

The script first creates a list of files to be merged in each category ("global", "physics path", "dataset", "root", etc.). The "global" and "root" categories are handled in standalone lists, while the other categories are put in a dictionary called masterDic.

The global files are merged first, in order to allow the scale factor from counts to rates to be calculated.
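
For data, the scale factor has roughly the following form (a sketch of the usual normalization, assuming a luminosity-section length of 23.31 s; check mergeOutputs.py for the exact expression used):

    # Sketch of the counts-to-rates conversion for data; see mergeOutputs.py for the exact formula.
    LS_LENGTH = 23.31  # seconds per luminosity section

    def scale_factor(n_ls, hlt_physics_prescale, lumi_in, lumi_target):
        # rate = counts * scale_factor
        return hlt_physics_prescale * (lumi_target / lumi_in) / (n_ls * LS_LENGTH)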

A list of keys for the masterDic is created while making sure the streams come before the datasets, because we want to sort datasets by stream and for that we need to sort the streams first. Then, for each of the masterDic categories, dictionaries for the counts are created, so that the total count for each HLT path, each dataset, and each stream can be computed separately. There are also a "groups" and a "type" dictionary meant to be used only for the individual HLT path files.

Once the total counts have been computed, the keys of the counts dictionaries (path names, dataset names, etc., according to the output category) are sorted so that the counts appear in order from highest to lowest (datasets are first sorted by stream, then from highest to lowest within each stream). Then we write the merged output files, with the counts scaled to become rates. For some files, we keep a copy of the raw counts so that we can estimate the statistical uncertainty.

"Root" files are merged last, and only if you specifically said so in the options. The datasets in the dataset-dataset and trigger-dataset overlap histograms are sorted like before (sorted by streams, then decreasing rate within each stream). The triggers are sorted by decreasing rate. The old, unsorted histograms are deleted and replaced by the new, sorted histograms.

prepareMergeOutputsMC.py

A simple script that just calls mergeOutputs.py once for each MC dataset name. Only useful for MC, as indicated by the name.

To run the script: python prepareMergeOutputsMC.py -t <lumitarget> -d <dir> -m <maps> -f. The options are exactly the same as for mergeOutputs.py, except for the directory -d <dir>: here you should enter the directory where all the MC dataset names show up as subdirectories. The code will do ls <dir>, look at the output (which should be a list of MC dataset names), and for each entry run mergeOutputs.py with the correct options assigned.
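
Conceptually, it amounts to something like the sketch below (hypothetical; the other mergeOutputs.py options are omitted for brevity):

    import os
    import subprocess

    # Toy sketch of prepareMergeOutputsMC.py; see the script itself for the options it actually forwards.
    mc_dir = "/path/to/Results/MC"                    # the -d <dir> argument
    for dataset in sorted(os.listdir(mc_dir)):        # each subdirectory is one MC dataset
        subprocess.call(["python", "mergeOutputs.py",
                         "-w", dataset,
                         "-d", os.path.join(mc_dir, dataset),
                         "-t", "2.0e34"])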