
Sum results of multiple ToyCalculator.distributions() #829

Closed
lawrenceleejr opened this issue Apr 21, 2020 · 13 comments
Labels
help wanted (Extra attention is needed / contributions welcome), question (Further information is requested)

Comments

@lawrenceleejr

Question

Hey guys! I've been using the new toycalc branch for some tests of toy throwing in an analysis with small yields. With these analyses, we often find that we need on the order of millions of toys to get a reasonable result, which turns out to be a real pain. We've found HistFactory to crash with that many toys, so we're trying this out with pyhf. So far, we've found that it's much better able to handle these huge numbers of toys.

The issue is that the fastest machine I could find can still only throw toys at ~100/s peak, so jobs like the simple example here (*) still take a really long time. ToyCalculator.distributions() returns two distribution objects, and I'm wondering if it's possible to calculate these distributions separately in separate jobs and then join them later before calculating a p-value. I couldn't find any way of summing these, but maybe I missed something obvious.

Figuring out a way to do this would really open up some possibilities for us to blast a huge number of toys to a cluster and would be super useful.

-L

(*) https://github.com/scikit-hep/pyhf/blob/toycalc/src/pyhf/infer/calculators.py#L387

Relevant Issues and Pull Requests

#790

@kratsg
Contributor

kratsg commented Apr 21, 2020

I couldn't find any way of summing these

I'm not fully understanding. What do you mean by "summing"? They're the individual qmu values for the signal-like versus background-like distributions.

I guess your fundamental question is: can we parallelize toys? We have an open issue describing this in #807. The idea would be to allow a backend to do the parallelization (e.g. multi-core). This should generally help when you need 1M+ toys.
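For what it's worth, the seed bookkeeping for splitting toys across jobs can be sketched with numpy's SeedSequence, which spawns statistically independent child streams for each worker. This is a stand-in for the per-job toy loop, not pyhf code: the chi-square draws are a placeholder for real sampled test statistics.

```python
import numpy as np

def throw_toys(seed_seq, n_toys):
    """Stand-in for one worker's toy loop: returns n_toys sampled
    test-statistic values drawn from its own independent RNG stream."""
    rng = np.random.default_rng(seed_seq)
    # placeholder for the real toy-by-toy test-statistic computation
    return rng.chisquare(df=1, size=n_toys)

n_workers, n_toys_total = 4, 100_000
# spawn() guarantees the per-worker streams are independent
children = np.random.SeedSequence(42).spawn(n_workers)
# in practice each chunk would run as its own batch job
chunks = [throw_toys(s, n_toys_total // n_workers) for s in children]
all_samples = np.concatenate(chunks)
print(all_samples.shape)  # (100000,)
```

The key point is that the combined ensemble is statistically equivalent to one serial run, because each worker's stream is independent by construction.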

@kratsg added the help wanted and question labels Apr 21, 2020
@lukasheinrich
Contributor

lukasheinrich commented Apr 21, 2020 via email

@lawrenceleejr
Author

Thanks both -- yes indeed, I'm interested in not just parallelizing but also being able to persistify the results so that jobs can be distributed across machines and time. So yes, parallelizing, but not just a matter of starting new threads.

But I can definitely try out the Jax backend -- I'm currently using the numpy backend.

@lukasheinrich
Contributor

lukasheinrich commented Apr 21, 2020 via email

@lawrenceleejr
Author

Ah thanks for the offer. You can find my super simple script I've been playing around with in this gist:

https://gist.github.com/lawrenceleejr/51244c639bcb400dbe56ab986456aa30

@lukasheinrich
Contributor

Hi @lawrenceleejr,

for these simple models JAX doesn't seem to buy you anything, but I put together a small script that shows how to save out the sampled data and re-merge it:

https://gist.github.com/lukasheinrich/87ecc8a1fd3181befd357008859da28f

This brings you back to the result of .distributions() and should scale easily to however many toys fit into your memory (billions?).
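The save-and-merge idea can be sketched with plain numpy (the real script lives in the gist above; file names, the normal draws standing in for sampled test statistics, and the helper names here are all illustrative). Because an empirical distribution is defined purely by its samples, merging per-job results is just array concatenation:

```python
import os
import tempfile
import numpy as np

workdir = tempfile.mkdtemp()

def save_job(job_id, sb_samples, b_samples):
    """Hypothetical per-job step: each batch job saves the sampled test
    statistics underlying the two objects from ToyCalculator.distributions()."""
    np.savez(os.path.join(workdir, f"toys_job_{job_id}.npz"), sb=sb_samples, b=b_samples)

# stand-in for three independent batch jobs, each with its own seed
for job_id in range(3):
    rng = np.random.default_rng(job_id)
    save_job(job_id, rng.normal(0.5, 1.0, 1000), rng.normal(3.0, 1.0, 1000))

def load_job(job_id, key):
    return np.load(os.path.join(workdir, f"toys_job_{job_id}.npz"))[key]

# merge step: concatenating the per-job sample arrays is all that is needed
sb_all = np.concatenate([load_job(i, "sb") for i in range(3)])
b_all = np.concatenate([load_job(i, "b") for i in range(3)])
print(sb_all.shape, b_all.shape)  # (3000,) (3000,)
```

The merged arrays can then be handed back to the empirical-distribution machinery exactly as if they had come from one big serial run.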

@lukasheinrich
Contributor

lukasheinrich commented Apr 21, 2020

PS: once you have high-stats distribution objects, we can work on getting you a CLs result.

@lawrenceleejr
Author

Oh that's beautiful, thanks! OK, I see -- I was hoping it would basically be something like this. Glad the objects can just be concatenated. I'll start setting something up for a batch system.

Thanks all!

@lukasheinrich lukasheinrich reopened this Apr 21, 2020
@matthewfeickert
Member

I'll start setting something up for a batch system.

Cool to hear. Sorry for the late follow-up on my part, but I'm glad to hear that you're using this pre-release feature. Please keep us updated on how things are going.

@lawrenceleejr
Author

No worries @matthewfeickert!

Actually, while we're here, is it possible for someone to outline how to go from these distributions to observed and expected (± variations) CLs values? I think I see how to get the observed value, but it's not clear to me how to get the rest using this calculator. Does an example live somewhere?

Thanks!

@lukasheinrich
Contributor

@lawrenceleejr it's basically this function

https://github.com/scikit-hep/pyhf/blob/toycalc/src/pyhf/infer/utils.py#L147

but we can make the API a bit more modular so you can come in with pre-made distributions

@lawrenceleejr
Author

I see -- so we can just grab those lines in our higher-level scripts for now. If you're interested, we have a few people who want to learn these tools and could maybe help prep a PR for modularizing things. No promises, but are you open to that, or would you rather keep that kind of thing in house?

Thanks so much -- this is all super helpful!

And in fact given all that info, it'd be fine for me if you want to close the issue.

@lukasheinrich
Contributor

lukasheinrich commented Apr 22, 2020

We're always open to external PRs. In fact, we're very happy to see them. I'll close this issue then; feel free to reopen.

One easy refactoring is to have the bottom part be its own function in pyhf.infer.utils:

from pyhf import get_backend


def cls_from_distributions(
    teststat,
    sig_plus_bkg_distribution,
    b_only_distribution,
    return_tail_probs=False,
    return_expected=False,
    return_expected_set=False,
):
    tensorlib, _ = get_backend()

    CLsb = sig_plus_bkg_distribution.pvalue(teststat)
    CLb = b_only_distribution.pvalue(teststat)
    CLs = CLsb / CLb
    CLsb, CLb, CLs = (
        tensorlib.reshape(CLsb, (1,)),
        tensorlib.reshape(CLb, (1,)),
        tensorlib.reshape(CLs, (1,)),
    )

    _returns = [CLs]
    if return_tail_probs:
        _returns.append([CLsb, CLb])
    if return_expected_set:
        CLs_exp = []
        for n_sigma in [2, 1, 0, -1, -2]:
            CLs = sig_plus_bkg_distribution.pvalue(
                n_sigma
            ) / b_only_distribution.pvalue(n_sigma)
            CLs_exp.append(tensorlib.reshape(CLs, (1,)))
        CLs_exp = tensorlib.astensor(CLs_exp)
        if return_expected:
            _returns.append(CLs_exp[2])
        _returns.append(CLs_exp)
    elif return_expected:
        n_sigma = 0
        CLs = sig_plus_bkg_distribution.pvalue(n_sigma) / b_only_distribution.pvalue(
            n_sigma
        )
        _returns.append(tensorlib.reshape(CLs, (1,)))
    # Enforce a consistent return type of the observed CLs
    return tuple(_returns) if len(_returns) > 1 else _returns[0]
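As a quick sanity check of the CLs arithmetic, here is a minimal stand-in distribution class (mimicking an empirical test-statistic distribution whose p-value is the fraction of toys at or above the observed value). `ToyDistribution`, the Gaussian toy ensembles, and the numerical values are all illustrative, not pyhf API:

```python
import numpy as np

class ToyDistribution:
    """Minimal stand-in for an empirical test-statistic distribution:
    pvalue(x) is the fraction of toys with test statistic >= x."""
    def __init__(self, samples):
        self.samples = np.asarray(samples)

    def pvalue(self, value):
        return float(np.mean(self.samples >= value))

rng = np.random.default_rng(0)
# illustrative toy ensembles: b-only toys sit at larger test-statistic
# values here, so the observed test statistic looks signal-like
sig_plus_bkg = ToyDistribution(rng.normal(0.5, 1.0, 100_000))
b_only = ToyDistribution(rng.normal(3.0, 1.0, 100_000))

teststat = 2.0
CLsb = sig_plus_bkg.pvalue(teststat)  # p-value of the s+b hypothesis
CLb = b_only.pvalue(teststat)         # p-value of the b-only hypothesis
CLs = CLsb / CLb                      # the CLs ratio
print(CLsb, CLb, CLs)
```

Concatenating merged per-job samples into such an object and running it through the function above should then give observed and expected CLs values in one place.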
