readxml.parse slow on HistFitter workspace #1687

gollumben · 2021-11-09T10:40:28Z

gollumben
Nov 9, 2021

Dear pyhf developers,
thank you very much for providing this awesome tool and the wonderful documentation and tutorials!

I have started using pyhf on some HistFitter workspaces (~100), which I am converting to JSON. They each have 35 regions / bins and ~150 systematics with up and down variations. Converting the XML files to JSON takes roughly 30 hours. Is there any way we could speed this up?

Looking at the source code, it seems as if each file containing a histogram is opened one by one. Do you think there would be a way to aggregate the files and histogram names first and then read them out in one go? While this might require a slight refactoring of readxml, it might increase the speed substantially.

Looking forward to your opinion and potentially further speed-up suggestions,
Ben

Answered by kratsg

Nov 10, 2021

lukasheinrich · 2021-11-09T11:16:39Z

lukasheinrich
Nov 9, 2021
Maintainer

Hi!

Is the 30 hours for a single workspace or for all 100 workspaces? We do use a cache (__FILECACHE__) to avoid re-opening files. Maybe profiling the code with snakeviz can give some insight.
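
A minimal sketch of how such a profile could be recorded, using only the standard library. The function slow_conversion is a hypothetical stand-in for the real conversion call, and the output file name is an arbitrary choice:

```python
import cProfile
import pstats
import tempfile
from pathlib import Path


def slow_conversion(n=200):
    """Hypothetical stand-in for the real conversion call."""
    total = 0
    for _ in range(n):
        total += sum(j * j for j in range(1000))
    return total


def profile_to_file(func, prof_path):
    """Run func under cProfile and write the collected stats to prof_path."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func()
    finally:
        profiler.disable()
    profiler.dump_stats(str(prof_path))
    return result


prof_path = Path(tempfile.gettempdir()) / "example.prof"
result = profile_to_file(slow_conversion, prof_path)

# Print the ten entries with the largest own time (tottime).
stats = pstats.Stats(str(prof_path))
stats.sort_stats("tottime").print_stats(10)
```

The resulting file can then be opened in the browser with `snakeviz example.prof`.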

7 replies

alexander-held Nov 10, 2021
Maintainer

For debugging purposes it may also be interesting to know how this scales with the number of regions. You should be able to remove regions from the top-level xml, and checking runtime with a single region might be a bit easier to handle. What is the memory usage like? I would be surprised if that would be the problem here, but could you be running out of memory with too many open files, and that somehow slows things down significantly?

lukasheinrich Nov 10, 2021
Maintainer

Do you have some numbers as to how many modifiers/s and samples/s you get (this should be shown by the command line tool)? Have you tried simplifying the workspace (one channel, one sample) for debugging?

gollumben Nov 10, 2021
Author

I have been constantly monitoring the memory consumption and that did not seem to be an issue, so far. I have nevertheless reduced the running to one region for the moment.

@lukasheinrich there are 9 samples and 272 modifiers and from the progress bar (which I only just turned on, i.e. it was not slowing things down in the 30h case), it's taking ~5.3 s to process one HistoSys modifier.

alexander-held Nov 10, 2021
Maintainer

Are you running this locally or on some shared filesystem where maybe file access could be a bottleneck?

gollumben Nov 10, 2021
Author

This is running on DESY's afs. File access could be the bottleneck, but I would be surprised if that slowed the conversion down by so much.

gollumben · 2021-11-10T14:10:02Z

gollumben
Nov 10, 2021
Author

Hi again,
Running over one region, I made a snakeviz profiling analysis. Please find it attached (example.prof). The total running time was ~2.25 h. It seems like one of the uproot reads was the bottleneck (cursor.py:332(bytestring) has the longest tottime).

To compare the performance, I wrote a little script that compares uproot with PyROOT when opening the files of interest and outputting an integral (or a list of bins for uproot). I think this should roughly be doing what pyhf does. The script also outputs the runtime for both cases:

python quickAccessTest.py
...
It took 0 seconds to parse with pyRoot
It took 8 seconds to parse with upRoot

I have also attached the profiling of these two. I am not sure whether the comparison of this to pyhf is fair, but I thought this might be useful to compare.

Any further help would be highly appreciated!

Cheers,
Ben

snakevizTest.zip

4 replies

kratsg Nov 10, 2021
Maintainer

This smells like a filesystem issue unfortunately. I've seen this in the past with other slow file mounts. How large are the input XML and ROOT files? Can you give me access to them so I can try running them on a different filesystem to confirm?

gollumben Nov 10, 2021
Author

Hi @kratsg,
thank you for your input! Please find the xml file here /eos/user/b/bbrueers/tW0LBoosted/pyhf/monotop_twmetComb0L1LBoosted_allCRs_normDStoDR_unblind_sigTheo_envelope__pmoder_sig_a250_DM10_H900_tb1_st0p7/NormalMeasurement.xml. The root files are in the same directory. They are only shared with you.
What puzzles me about a filesystem issue is that it seems to work fine with the small worker script I attached previously (in the zip), which should be doing the same thing.
Thanks again,
Ben

kratsg Nov 10, 2021
Maintainer

Ok, this is interesting. ~80% of your modifiers will load in about 1 second from the ROOT file, but a handful (primarily HistoSys) are taking 10 seconds each and that's where your slowdown is coming from... Let me look at this more.

gollumben Nov 10, 2021
Author

Thanks @kratsg! :)

kratsg · 2021-11-10T16:37:52Z

kratsg
Nov 10, 2021
Maintainer

Ok, the problem is partially in uproot and in pyhf. In uproot code (thanks @jpivarski !) there is a damerau_levenshtein function being called when we hit a missing key that takes a long time because the number of keys in this file is very large (https://github.com/scikit-hep/uproot4/blob/85f219a36e76dffc18da4756227a7beb760657a0/src/uproot/_util.py#L810-L858).

In pyhf, when a histogram cannot be retrieved by its bare name and the full path has to be tried, the first lookup raises a (slow) DeserializationError, which pyhf catches as an expected exception. We need to change the way we check if a key exists in the file. This is a bug.
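
To illustrate why a failed lookup can cost so much more than a successful one, here is a rough sketch. FuzzyMissMapping is a hypothetical stand-in, not uproot's implementation: on a miss it scans every key for near matches with difflib (instead of Damerau-Levenshtein) to build an error message. The key names are made up.

```python
import difflib
import time


class FuzzyMissMapping:
    """Hypothetical mapping whose failed lookups search all keys for near matches."""

    def __init__(self, data):
        self._data = data

    def __getitem__(self, key):
        try:
            return self._data[key]
        except KeyError:
            # The expensive part: compare the missing key against every key.
            close = difflib.get_close_matches(key, list(self._data), n=3)
            raise KeyError(f"{key!r} not found; close matches: {close}") from None


def timed_lookup(mapping, key):
    """Return (found, seconds) for a single lookup."""
    start = time.perf_counter()
    try:
        mapping[key]
        found = True
    except KeyError:
        found = False
    return found, time.perf_counter() - start


data = {f"region/hist_{i:05d}_syst": i for i in range(2000)}
f = FuzzyMissMapping(data)

hit_found, hit_seconds = timed_lookup(f, "region/hist_00042_syst")
miss_found, miss_seconds = timed_lookup(f, "hist_00042_syst")
print(f"hit: {hit_seconds:.6f} s, miss: {miss_seconds:.6f} s")
```

In this sketch the cost of a miss grows with the number of keys, while a hit stays a plain dictionary lookup, so a conversion that routinely triggers misses pays that cost once per histogram.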

6 replies

kratsg Nov 10, 2021
Maintainer

With a quick fix, I get a nice speedup:

$ time pyhf xml2json monotop_twmetComb0L1LBoosted_allCRs_normDStoDR_unblind_sigTheo_envelope__pmoder_sig_a250_DM10_H900_tb1_st0p7/NormalMeasurement.xml --basedir monotop_twmetComb0L1LBoosted_allCRs_normDStoDR_unblind_sigTheo_envelope__pmoder_sig_a250_DM10_H900_tb1_st0p7 --hide-progress --output-file test.json

real	0m12.798s
user	0m13.814s
sys	0m0.722s

gollumben Nov 10, 2021
Author

Awesome, thank you so much @kratsg! Would you mind sharing your fix? Then I can run on all of my files in no time!

kratsg Nov 10, 2021
Maintainer

Here are the changes made on master of pyhf, but I should have a PR soon.

diff --git a/src/pyhf/readxml.py b/src/pyhf/readxml.py
index c078ce3e..2e6dec51 100644
--- a/src/pyhf/readxml.py
+++ b/src/pyhf/readxml.py
@@ -59,19 +59,21 @@ def import_root_histogram(rootdir, filename, path, name, filecache=None):
     fullpath = str(Path(rootdir).joinpath(filename))
     if fullpath not in filecache:
         f = uproot.open(fullpath)
-        filecache[fullpath] = f
+        keys = set(f.keys(cycle=False))
+        filecache[fullpath] = (f, keys)
     else:
-        f = filecache[fullpath]
-    try:
+        f, keys = filecache[fullpath]
+
+    fullname = "/".join([path, name])
+
+    if name in keys:
         hist = f[name]
-    except (KeyError, uproot.deserialization.DeserializationError):
-        fullname = "/".join([path, name])
-        try:
-            hist = f[fullname]
-        except KeyError:
-            raise KeyError(
-                f'Both {name} and {fullname} were tried and not found in {fullpath}'
-            )
+    elif fullname in keys:
+        hist = f[fullname]
+    else:
+        raise KeyError(
+            f'Both {name} and {fullname} were tried and not found in {fullpath}'
+        )
     return hist.to_numpy()[0].tolist(), extract_error(hist)
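
The lookup pattern in the diff can be tried out without ROOT files. The following is only a sketch under assumptions: open_file is a hypothetical callable standing in for uproot.open, the "file" is a plain dict, and the key set is built with set(f.keys()) rather than f.keys(cycle=False).

```python
def import_histogram(open_file, fullpath, path, name, filecache):
    """Look up a histogram by bare name, then by path/name, without using exceptions.

    open_file is a hypothetical stand-in for uproot.open and must return a
    mapping from key to histogram.
    """
    if fullpath not in filecache:
        f = open_file(fullpath)
        # Cache the key set next to the file so membership tests stay cheap.
        filecache[fullpath] = (f, set(f.keys()))
    f, keys = filecache[fullpath]

    fullname = "/".join([path, name])
    if name in keys:
        return f[name]
    if fullname in keys:
        return f[fullname]
    raise KeyError(
        f"Both {name} and {fullname} were tried and not found in {fullpath}"
    )


# In-memory stand-in for a ROOT file; names and contents are made up.
fake_files = {"data.root": {"nominal": [1.0, 2.0], "SR/syst_up": [1.1, 2.2]}}
open_calls = []


def fake_open(fullpath):
    """Record each open so the caching behaviour can be checked."""
    open_calls.append(fullpath)
    return fake_files[fullpath]


cache = {}
nominal = import_histogram(fake_open, "data.root", "SR", "nominal", cache)
syst_up = import_histogram(fake_open, "data.root", "SR", "syst_up", cache)
```

Both lookups share a single open call, and a name that is only reachable via its full path is resolved by a set membership test instead of a caught exception.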

gollumben Nov 10, 2021
Author

Wonderful, thank you! :)

matthewfeickert Nov 10, 2021
Maintainer

For posterity, this was fixed in PR #1691.
