TTree performance improvements advice needed #1106
-
I am trying to write a small Python macro that should "demonstrate" that uproot is (almost) as fast as C++. How can one make this code run even "faster"? Can one somehow set the "title" of the created hist.Hist? In a second step later ...

```python
import uproot
import numpy as np
import hist

def process():
    # open the ROOT file and load the hk TTree
    with uproot.open({'data/my_tree.root': 'hk'}) as t:
        # test that we can read some available arrays
        tq_real = t.arrays(['TQReal.nhits', 'TQReal.pc2pe', 'TQReal.t0'])
        tq_real_hits = t.arrays(['TQReal.hits.cable', 'TQReal.hits.T', 'TQReal.hits.Q'])
        n_high_hit = np.sum(tq_real_hits['TQReal.hits.Q'] > 1.0)
        print(f'Number of high Q hits: {n_high_hit}')
        # create and fill some histograms
        h_goodness = hist.Hist(hist.axis.Regular(100, 0.0, 1.0, label='goodness_time_fit (RecoBonsai)'))
        h_goodness.fill(t['RecoBonsai.goodness_time_fit'].array())
        h_trigger = hist.Hist(hist.axis.Regular(10, 0.0, 10.0, label='trigger_id (Header)'))
        h_trigger.fill(t['Header.trigger_id'].array())
        h_goodness_leaf = None
        if 'RecoLEAF.goodness' in t:  # RecoLEAF is a new class, not available everywhere
            h_goodness_leaf = hist.Hist(hist.axis.Regular(100, 0.0, 1.0, label='goodness (RecoLEAF)'))
            h_goodness_leaf.fill(t['RecoLEAF.goodness'].array())
    # write histograms to a ROOT file
    # note: when a new ROOT TH[123]D is created from the
    # corresponding hist.Hist below, its "statistics"
    # is computed from its bin content (in bare ROOT,
    # the "statistics" is computed at filling time)
    with uproot.recreate('data/output_analysis_tree_python_v1.root') as f_out:
        f_out['goodness'] = h_goodness
        f_out['trigger'] = h_trigger
        if h_goodness_leaf is not None:
            f_out['goodness_leaf'] = h_goodness_leaf

if __name__ == '__main__':
    print('Processing')
    process()
    print('Done')
```
-
If these arrays are not ragged (i.e. `t[branchname].typename` is a C++ value type, such as `int32_t` or `double`, possibly fixed-size arrays like `float[10]`, but not variable-length arrays like `float[]`), then you can use NumPy with `library='np'` instead of the default Awkward Array, and that would have greater or equal speed.

Be careful to avoid reading the same data multiple times. I don't see any examples of that here, but you can either consolidate all of your array-reading into one uproot.TTree.arrays call or use the `array_cache` argument in both uproot.TTree.arrays and uproot.TBranch.array. (It doesn't do anything magical; it just checks a MutableMapping (dict) that you maintain, first.)

If your data are heavily compressed with LZMA, there might be some benefit from setting a `decompression_executor`.

If you gather all of the array-fetching into one call, it would be easier to convert this into a process that calls uproot.TTree.iterate to read a subset of entries in each iteration step. That would allow you to scale to files that are too big to load into memory all at once (while still being a single-threaded process). The amount of data read in each step is controlled by `step_size`.

Beyond that, you want to consider parallel processing. Because of Python's GIL, that almost always means multiprocessing, which, in turn, means not sending too much data between the processes, because it has to be serialized (it can't just be referenced, as with threads). Your histograms are probably small enough. Setting this up manually is annoying, so consider uproot.dask to distribute the work on a Dask cluster. (dask-histogram is a Daskified version of boost-histogram/hist.) How you set up or access the Dask cluster depends on where you are; in tutorials like this one, we launched a few Dask workers on one computer to show the principle. The Coffea team has extensive experience with Dask, and you might want to consider using Coffea to get some features built in.

The issues involved in horizontal scaling apply to C++ and Python equally, and Python may have an advantage there because of its simplicity: much of the difficulty of horizontal scaling comes from complexity, and simplicity in the workload and interfaces lets you concentrate on the problems of horizontal scaling. For vertical scaling, however, Python can only approach the speed of C++ in specific contexts, when it "gets out of the way."
Uproot takes advantage of a fact that is true of some but not all ROOT datasets: purely numerical and ragged-numerical (only one level deep) data are stored in the ROOT file as arrays, so we can just cast those arrays instead of iterating over them. (The equivalent in C++ ROOT is called Bulk I/O.) Uproot can only approach ROOT's read speed to the extent that this is true: if the individual TBaskets are large (tens or hundreds of kB at least) and the data types are right, reading is dominated by physical disk speed and decompression, which are the same for Python and C++. In the worst case, you have lots of small files or lots of small TBaskets in them, and Python/Uproot will spend most of its time deserializing metadata. (This worst case is bad for C++, too, but it's proportionally worse for Python.)

If you want to go even further afield, you could consider UnROOT.jl to load the data in Julia. Right now, that means also doing the analysis in Julia, but we're working on ways (AwkwardArray.jl) to connect the two environments.
-
What about the "title" of the created hist.Hist?
-
Many, many thanks for your help and all the additional info (I'll remember what you wrote about scaling to the "iterative" / "parallel-processing" approach). Here's my final version for the time being:

```python
import uproot
import numpy as np
import hist

def process():
    # open the ROOT file and load the hk TTree
    with uproot.open({'data/my_tree.root': 'hk'}) as t:
        # test that we can read some available branches
        # note: if branches hold C++ fundamental data types or fixed-size
        # arrays of them, one can use the library='np' option (NumPy),
        # which may improve speed
        tq_real = t.arrays(['TQReal.nhits', 'TQReal.pc2pe', 'TQReal.t0'], library='np')
        # note: if any branches with ragged/variable-length arrays are retrieved,
        # one must use the default library='ak' option (Awkward Array)
        tq_real_hits = t.arrays(['TQReal.hits.cable', 'TQReal.hits.T', 'TQReal.hits.Q'])
        n_high_hit = np.sum(tq_real_hits['TQReal.hits.Q'] > 1.0)
        print(f'Number of high Q hits: {n_high_hit}')
        # create and fill some histograms
        h_goodness = hist.Hist(hist.axis.Regular(100, 0.0, 1.0, label='goodness_time_fit (RecoBonsai)'))
        h_goodness.title = 'bs goodness distribution'
        h_goodness.fill(t['RecoBonsai.goodness_time_fit'].array(library='np'))
        h_trigger = hist.Hist(hist.axis.Regular(10, 0.0, 10.0, label='trigger_id (Header)'))
        h_trigger.title = 'trigger distribution'
        h_trigger.fill(t['Header.trigger_id'].array(library='np'))
        h_goodness_leaf = None
        if 'RecoLEAF.goodness' in t:  # RecoLEAF is a new class, not available everywhere
            h_goodness_leaf = hist.Hist(hist.axis.Regular(100, 0.0, 1.0, label='goodness (RecoLEAF)'))
            h_goodness_leaf.title = 'leaf goodness distribution'
            h_goodness_leaf.fill(t['RecoLEAF.goodness'].array(library='np'))
    # write histograms to a ROOT file
    # note: when a new ROOT TH[123]D is created from the
    # corresponding hist.Hist below, its "statistics"
    # is computed from its bin content (in bare ROOT,
    # the "statistics" is computed at filling time)
    with uproot.recreate('data/output_analysis_tree_python_v1.root') as f_out:
        f_out['goodness'] = h_goodness
        f_out['trigger'] = h_trigger
        if h_goodness_leaf is not None:
            f_out['goodness_leaf'] = h_goodness_leaf

if __name__ == '__main__':
    print('Processing')
    process()
    print('Done')
```