
Sum results of multiple ToyCalculator.distributions() #829

Closed
lawrenceleejr opened this issue Apr 21, 2020 · 13 comments
Labels
help wanted (Extra attention is needed / contributions welcome), question (Further information is requested)

Comments

@lawrenceleejr

Question

Hey guys! I've been using the new toycalc branch for some tests of toy throwing in an analysis with small yields. With these analyses, we often find that we need on the order of millions of toys to get a reasonable result, which turns out to be a real pain. We've found HistFactory to crash with that many toys, so we're trying this out with pyhf. So far, we've found that it's much better able to handle these huge numbers of toys.

The issue is that the fastest machine I could find can still only throw toys at ~100/s peak, so jobs like the simple example here (*) still take a really long time. ToyCalculator.distributions() returns two distribution objects, and I'm wondering if it's possible to calculate these distributions separately in separate jobs and then join them later before calculating a p-value. I couldn't find any way of summing these, but maybe I missed something obvious.

Figuring out a way to do this would really open up some possibilities for us to blast a huge number of toys to a cluster and would be super useful.

-L

(*) https://github.com/scikit-hep/pyhf/blob/toycalc/src/pyhf/infer/calculators.py#L387

Relevant Issues and Pull Requests

#790

@kratsg
Contributor

kratsg commented Apr 21, 2020

I couldn't find any way of summing these

I'm not fully understanding. What do you mean by "summing"? They're the individual qmu values for the signal-like versus background-like distributions.

I guess your fundamental question is: can we parallelize toys? We have an open issue describing this in #807. The idea would be to allow a backend to do the parallelization (e.g. multi-core). This should generally help when you need 1M+ toys.
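For what it's worth, the seed bookkeeping for splitting toys across jobs can be sketched with numpy's SeedSequence, which spawns statistically independent child streams for each worker. This is a stand-in for the per-job toy loop, not pyhf code: the chi-square draws are a placeholder for real sampled test statistics.

```python
import numpy as np

def throw_toys(seed_seq, n_toys):
    """Stand-in for one worker's toy loop: returns n_toys sampled
    test-statistic values drawn from its own independent RNG stream."""
    rng = np.random.default_rng(seed_seq)
    # placeholder for the real toy-by-toy test-statistic computation
    return rng.chisquare(df=1, size=n_toys)

n_workers, n_toys_total = 4, 100_000
# spawn() guarantees the per-worker streams are independent
children = np.random.SeedSequence(42).spawn(n_workers)
# in practice each chunk would run as its own batch job
chunks = [throw_toys(s, n_toys_total // n_workers) for s in children]
all_samples = np.concatenate(chunks)
print(all_samples.shape)  # (100000,)
```

The key point is that the combined ensemble is statistically equivalent to one serial run, because each worker's stream is independent by construction.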

@kratsg added the help wanted and question labels Apr 21, 2020
@lukasheinrich
Contributor

lukasheinrich commented Apr 21, 2020 via email

@lawrenceleejr
Author

Thanks both -- yes indeed, I'm interested in not just parallelizing but also being able to persistify the results so that jobs can be distributed across machines and time. So yes, parallelizing, but not just a matter of starting new threads.

But I can definitely try out the Jax backend -- I'm currently using the numpy backend.

@lukasheinrich
Contributor

lukasheinrich commented Apr 21, 2020 via email

@lawrenceleejr
Author

Ah thanks for the offer. You can find my super simple script I've been playing around with in this gist:

https://gist.github.com/lawrenceleejr/51244c639bcb400dbe56ab986456aa30

@lukasheinrich
Contributor

Hi @lawrenceleejr,

for these simple models JAX doesn't seem to buy you anything, but I put together a small script that shows how to save out the sampled data and re-merge it:

https://gist.github.com/lukasheinrich/87ecc8a1fd3181befd357008859da28f

This brings you back to the result of .distributions() and should scale easily to however many toys fit into your memory (billions?).
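The save-and-merge idea can be sketched with plain numpy (the real script lives in the gist above; file names, the normal draws standing in for sampled test statistics, and the helper names here are all illustrative). Because an empirical distribution is defined purely by its samples, merging per-job results is just array concatenation:

```python
import os
import tempfile
import numpy as np

workdir = tempfile.mkdtemp()

def save_job(job_id, sb_samples, b_samples):
    """Hypothetical per-job step: each batch job saves the sampled test
    statistics underlying the two objects from ToyCalculator.distributions()."""
    np.savez(os.path.join(workdir, f"toys_job_{job_id}.npz"), sb=sb_samples, b=b_samples)

# stand-in for three independent batch jobs, each with its own seed
for job_id in range(3):
    rng = np.random.default_rng(job_id)
    save_job(job_id, rng.normal(0.5, 1.0, 1000), rng.normal(3.0, 1.0, 1000))

def load_job(job_id, key):
    return np.load(os.path.join(workdir, f"toys_job_{job_id}.npz"))[key]

# merge step: concatenating the per-job sample arrays is all that is needed
sb_all = np.concatenate([load_job(i, "sb") for i in range(3)])
b_all = np.concatenate([load_job(i, "b") for i in range(3)])
print(sb_all.shape, b_all.shape)  # (3000,) (3000,)
```

The merged arrays can then be handed back to the empirical-distribution machinery exactly as if they had come from one big serial run.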

@lukasheinrich
Contributor

lukasheinrich commented Apr 21, 2020

PS: once you have high-stats distribution objects, we can work on getting you a CLs result.

@lawrenceleejr
Author

Oh that's beautiful, thanks! OK, I see -- I was hoping it would basically be something like this. Glad the objects can just be concatenated. I'll start setting something up for a batch system.

Thanks all!

@lukasheinrich lukasheinrich reopened this Apr 21, 2020
@matthewfeickert
Member

I'll start setting something up for a batch system.

Cool to hear. Sorry for the late follow-up on my part, but I'm glad to hear that you're using this pre-release feature. Please keep us updated on how things are going.

@lawrenceleejr
Author

No worries @matthewfeickert!

Actually, while we're here, is it possible for someone to outline how to go from these distributions to observed and expected (± variations) CLs values? I think I see how to get the observed value, but it's not clear to me how to get the rest using this calculator. Does an example live somewhere?

Thanks!

@lukasheinrich
Contributor

@lawrenceleejr it's basically this function

https://github.com/scikit-hep/pyhf/blob/toycalc/src/pyhf/infer/utils.py#L147

but we can make the API a bit more modular so you can come in with pre-made distributions

@lawrenceleejr
Author

I see -- so we can just grab those lines in our higher-level scripts for now. If you're interested, we have a few people who want to learn these tools and could maybe help prep a PR for modularizing things. No promises, but are you open to that, or would you rather keep that kind of thing in house?

Thanks so much -- this is all super helpful!

And in fact given all that info, it'd be fine for me if you want to close the issue.

@lukasheinrich
Contributor

lukasheinrich commented Apr 22, 2020

We're always open to external PRs. In fact, we're very happy to see them. I'll close this issue then; feel free to reopen.

One easy refactoring is to have the bottom part be its own function in pyhf.infer.utils:

from pyhf import get_backend


def cls_from_distributions(
    teststat,
    sig_plus_bkg_distribution,
    b_only_distribution,
    return_tail_probs=False,
    return_expected=False,
    return_expected_set=False,
):
    tensorlib, _ = get_backend()

    CLsb = sig_plus_bkg_distribution.pvalue(teststat)
    CLb = b_only_distribution.pvalue(teststat)
    CLs = CLsb / CLb
    CLsb, CLb, CLs = (
        tensorlib.reshape(CLsb, (1,)),
        tensorlib.reshape(CLb, (1,)),
        tensorlib.reshape(CLs, (1,)),
    )

    _returns = [CLs]
    if return_tail_probs:
        _returns.append([CLsb, CLb])
    if return_expected_set:
        CLs_exp = []
        for n_sigma in [2, 1, 0, -1, -2]:
            CLs = sig_plus_bkg_distribution.pvalue(
                n_sigma
            ) / b_only_distribution.pvalue(n_sigma)
            CLs_exp.append(tensorlib.reshape(CLs, (1,)))
        CLs_exp = tensorlib.astensor(CLs_exp)
        if return_expected:
            _returns.append(CLs_exp[2])
        _returns.append(CLs_exp)
    elif return_expected:
        n_sigma = 0
        CLs = sig_plus_bkg_distribution.pvalue(n_sigma) / b_only_distribution.pvalue(
            n_sigma
        )
        _returns.append(tensorlib.reshape(CLs, (1,)))
    # Enforce a consistent return type of the observed CLs
    return tuple(_returns) if len(_returns) > 1 else _returns[0]
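As a quick sanity check of the CLs arithmetic, here is a minimal stand-in distribution class (mimicking an empirical test-statistic distribution whose p-value is the fraction of toys at or above the observed value). `ToyDistribution`, the Gaussian toy ensembles, and the numerical values are all illustrative, not pyhf API:

```python
import numpy as np

class ToyDistribution:
    """Minimal stand-in for an empirical test-statistic distribution:
    pvalue(x) is the fraction of toys with test statistic >= x."""
    def __init__(self, samples):
        self.samples = np.asarray(samples)

    def pvalue(self, value):
        return float(np.mean(self.samples >= value))

rng = np.random.default_rng(0)
# illustrative toy ensembles: b-only toys sit at larger test-statistic
# values here, so the observed test statistic looks signal-like
sig_plus_bkg = ToyDistribution(rng.normal(0.5, 1.0, 100_000))
b_only = ToyDistribution(rng.normal(3.0, 1.0, 100_000))

teststat = 2.0
CLsb = sig_plus_bkg.pvalue(teststat)  # p-value of the s+b hypothesis
CLb = b_only.pvalue(teststat)         # p-value of the b-only hypothesis
CLs = CLsb / CLb                      # the CLs ratio
print(CLsb, CLb, CLs)
```

Concatenating merged per-job samples into such an object and running it through the function above should then give observed and expected CLs values in one place.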
