TTree performance improvements advice needed #1106
-
I am trying to write a small Python macro that should "demonstrate" that uproot is (almost) as fast as C++. How can one make this code run even "faster"? Can one somehow set the "title" of the created hist.Hist? In a second step later ...

```python
import uproot
import numpy as np
import hist

def process():
    # open the ROOT file and load the hk TTree
    with uproot.open({'data/my_tree.root': 'hk'}) as t:
        # test that we can read some available arrays
        tq_real = t.arrays(['TQReal.nhits', 'TQReal.pc2pe', 'TQReal.t0'])
        tq_real_hits = t.arrays(['TQReal.hits.cable', 'TQReal.hits.T', 'TQReal.hits.Q'])
        n_high_hit = np.sum(tq_real_hits['TQReal.hits.Q'] > 1.0)
        print(f'Number of high Q hits: {n_high_hit}')
        # create and fill some histograms
        h_goodness = hist.Hist(hist.axis.Regular(100, 0.0, 1.0, label='goodness_time_fit (RecoBonsai)'))
        h_goodness.fill(t['RecoBonsai.goodness_time_fit'].array())
        h_trigger = hist.Hist(hist.axis.Regular(10, 0.0, 10.0, label='trigger_id (Header)'))
        h_trigger.fill(t['Header.trigger_id'].array())
        h_goodness_leaf = None
        if 'RecoLEAF.goodness' in t:  # RecoLEAF is a new class, not available everywhere
            h_goodness_leaf = hist.Hist(hist.axis.Regular(100, 0.0, 1.0, label='goodness (RecoLEAF)'))
            h_goodness_leaf.fill(t['RecoLEAF.goodness'].array())
    # write histograms to a ROOT file
    # note: when a new ROOT TH[123]D is created from the
    # corresponding hist.Hist below, its "statistics"
    # is computed from its bin content (in bare ROOT,
    # the "statistics" is computed at filling time)
    with uproot.recreate('data/output_analysis_tree_python_v1.root') as f_out:
        f_out['goodness'] = h_goodness
        f_out['trigger'] = h_trigger
        if h_goodness_leaf is not None:
            f_out['goodness_leaf'] = h_goodness_leaf

if __name__ == '__main__':
    print('Processing')
    process()
    print('Done')
```
-
If these arrays are not ragged (i.e. `t[branchname].typename` is a C++ value type, such as `int32_t` or `double`, possibly fixed-size arrays like `float[10]`, but not variable-length arrays like `float[]`), then you can use NumPy with `library='np'` instead of the default Awkward Array, and that would have greater or equal speed.

Be careful to avoid reading the same data multiple times. I don't see any examples of that here, but you can either consolidate all of your array-reading into one uproot.TTree.arrays call or use the `array_cache` argument in both uproot.TTree.arrays and uproot.TBranch.array. (It doesn't do anything magical; it just checks a MutableMapping (dict) that you maintain, first.)

If your data are heavily compressed with LZMA, there might be some benefit from setting a `decompression_executor`.

If you gather all of the array-fetching into one call, it would be easier to convert this into a process that calls uproot.TTree.iterate to read a subset of entries in each iteration step. That would allow you to scale to files that are too big to load into memory all at once (while still being a single-threaded process). The amount of data read in each step is controlled by `step_size`.

Beyond that, you want to consider parallel processing. Because of Python's GIL, that almost always means multiprocessing, which, in turn, means not sending too much data between the processes, because it has to be serialized (it can't just be referenced, as with threads). Your histograms are probably small enough. Setting this up manually is annoying, so consider uproot.dask to distribute the work on a Dask cluster. (dask-histogram is a Daskified version of boost-histogram/hist.) How you set up or access the Dask cluster depends on where you are; in tutorials like this one, we launched a few Dask workers on one computer to show the principle. The Coffea team has extensive experience with Dask, and you might want to consider using Coffea to get some features built in.

The issues involved in horizontal scaling apply to C++ and Python equally, and Python may have an advantage there because of its simplicity: much of the difficulty of horizontal scaling comes from complexity, and simplicity in the workload and interfaces lets you concentrate on the problems of horizontal scaling. For vertical scaling, however, Python can only approach the speed of C++ in specific contexts, when it "gets out of the way."
Uproot takes advantage of a fact that is true of some but not all ROOT datasets: purely numerical and ragged-numerical (only one level deep) data are stored in the ROOT file as arrays, so we can just cast those arrays instead of iterating over them. (The equivalent in C++ ROOT is called Bulk I/O.) Uproot can only approach ROOT's read speed to the extent that this is true: if the individual TBaskets are large (tens or hundreds of kB at least) and the data types are right, reading is dominated by physical disk speed and decompression, which are the same for Python and C++. In the worst case, you have lots of small files or lots of small TBaskets in them, and Python/Uproot will spend most of its time deserializing metadata. (This worst case is bad for C++, too, but it's proportionally worse for Python.)

If you want to go even further afield, you could consider UnROOT.jl to load the data in Julia. Right now, that means also doing the analysis in Julia, but we're working on ways (AwkwardArray.jl) to connect the two environments.
-
What about the "title" of the created hist.Hist?
-
Many, many thanks for your help and all the additional info (I'll remember what you wrote about scaling to the "iterative" / "parallel-processing" approach). Here's my final version for the time being:

```python
import uproot
import numpy as np
import hist

def process():
    # open the ROOT file and load the hk TTree
    with uproot.open({'data/my_tree.root': 'hk'}) as t:
        # test that we can read some available branches
        # note: if branches hold C++ fundamental data types or fixed-size
        # arrays of them, one can use the library='np' option (NumPy),
        # which may improve speed
        tq_real = t.arrays(['TQReal.nhits', 'TQReal.pc2pe', 'TQReal.t0'], library='np')
        # note: if any branches with ragged/variable-length arrays are retrieved,
        # one must use the default library='ak' option (Awkward Array)
        tq_real_hits = t.arrays(['TQReal.hits.cable', 'TQReal.hits.T', 'TQReal.hits.Q'])
        n_high_hit = np.sum(tq_real_hits['TQReal.hits.Q'] > 1.0)
        print(f'Number of high Q hits: {n_high_hit}')
        # create and fill some histograms
        h_goodness = hist.Hist(hist.axis.Regular(100, 0.0, 1.0, label='goodness_time_fit (RecoBonsai)'))
        h_goodness.title = 'bs goodness distribution'
        h_goodness.fill(t['RecoBonsai.goodness_time_fit'].array(library='np'))
        h_trigger = hist.Hist(hist.axis.Regular(10, 0.0, 10.0, label='trigger_id (Header)'))
        h_trigger.title = 'trigger distribution'
        h_trigger.fill(t['Header.trigger_id'].array(library='np'))
        h_goodness_leaf = None
        if 'RecoLEAF.goodness' in t:  # RecoLEAF is a new class, not available everywhere
            h_goodness_leaf = hist.Hist(hist.axis.Regular(100, 0.0, 1.0, label='goodness (RecoLEAF)'))
            h_goodness_leaf.title = 'leaf goodness distribution'
            h_goodness_leaf.fill(t['RecoLEAF.goodness'].array(library='np'))
    # write histograms to a ROOT file
    # note: when a new ROOT TH[123]D is created from the
    # corresponding hist.Hist below, its "statistics"
    # is computed from its bin content (in bare ROOT,
    # the "statistics" is computed at filling time)
    with uproot.recreate('data/output_analysis_tree_python_v1.root') as f_out:
        f_out['goodness'] = h_goodness
        f_out['trigger'] = h_trigger
        if h_goodness_leaf is not None:
            f_out['goodness_leaf'] = h_goodness_leaf

if __name__ == '__main__':
    print('Processing')
    process()
    print('Done')
```