
Additional documentation needed on difference between calculator type test statistic return type #1792

Open
matthewfeickert opened this issue Mar 1, 2022 · 13 comments
Labels: bug (Something isn't working), docs (Documentation related)


@matthewfeickert (Member) commented Mar 1, 2022

Summary

There is a large discrepancy in the value of the discovery test statistic depending on whether it is generated with an asymptotics-based or a toy-based calculator.


OS / Environment

$ cat /etc/os-release 
NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.4 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

Steps to Reproduce

# issue_1792.py
import pyhf


# q0 computed directly via the public test statistic API
def discovery_from_test_stat(model, data):
    init_pars = model.config.suggested_init()
    par_bounds = model.config.suggested_bounds()
    fixed_params = model.config.suggested_fixed()
    return pyhf.infer.test_statistics.q0(
        0.0,
        data,
        model,
        init_pars,
        par_bounds,
        fixed_params,
    )


# q0 via a calculator built with infer.utils.create_calculator
def discovery_from_calculator(model, data, calc_type="asymptotics", **kwargs):
    calculator = pyhf.infer.utils.create_calculator(
        calc_type, data, model, test_stat="q0", **kwargs
    )
    return calculator.teststatistic(0.0)


# q0 via infer.calculators.AsymptoticCalculator constructed directly
def discovery_from_asymptotic_calculator(model, data):
    calculator = pyhf.infer.calculators.AsymptoticCalculator(
        data, model, test_stat="q0"
    )
    return calculator.teststatistic(0.0)


if __name__ == "__main__":
    model = pyhf.simplemodels.uncorrelated_background([25], [2500], [2.5])
    data = [2600] + model.config.auxdata

    discovery_test_stat = discovery_from_test_stat(model, data)
    print(f"Discovery test stat from infer.test_statistics.q0: {discovery_test_stat}")

    discovery_test_stat = discovery_from_calculator(model, data, calc_type="toybased")
    print(
        f"Discovery test stat from toy based infer.utils.create_calculator: {discovery_test_stat}"
    )

    discovery_test_stat = discovery_from_calculator(model, data)
    print(
        f"Discovery test stat from asymptotics infer.utils.create_calculator: {discovery_test_stat}"
    )

    discovery_test_stat = discovery_from_asymptotic_calculator(model, data)
    print(
        f"Discovery test stat from infer.calculators.AsymptoticCalculator: {discovery_test_stat}"
    )


Expected Results

The discovery test statistic would be the same regardless of calculator type.

Actual Results

$ python issue_1792.py 
Discovery test stat from infer.test_statistics.q0: 3.9377335959507036
Discovery test stat from toy based infer.utils.create_calculator: 3.9377335959507036
Discovery test stat from asymptotics infer.utils.create_calculator: 1.485845489706193
Discovery test stat from infer.calculators.AsymptoticCalculator: 1.485845489706193

pyhf Version

9fd99be886349a90e927672e950cc233fad0916c on master

Code of Conduct

  • I agree to follow the Code of Conduct
matthewfeickert added the bug label Mar 1, 2022
@alexander-held (Member) commented:

I don't understand how the first two results are exactly the same, but if this is a toy vs. non-toy issue with a 4 sigma effect (which, looking at the setup, would roughly make sense), are you sure you have enough toys to evaluate this (several 10k)?

@matthewfeickert (Member, Author) commented Mar 1, 2022

are you sure you have enough toys to evaluate this (several 10k)?

Here I'm using the calculator API in a strange way, as only 1 experiment is being evaluated, so there really isn't any pseudoexperiment generation happening. I should give a better example later. For reference, the toy-based calculator's teststatistic just evaluates the test statistic function on the observed data:

teststat_func = utils.get_test_stat(self.test_stat)
teststat = teststat_func(
    poi_test,
    self.data,
    self.pdf,
    self.init_pars,
    self.par_bounds,
    self.fixed_params,
)
return teststat
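
For completeness, a minimal sketch (mine, not from the thread) of what running actual pseudoexperiments with the toy-based calculator could look like; the ntoys value is illustrative:

import pyhf

model = pyhf.simplemodels.uncorrelated_background([25], [2500], [2.5])
data = [2600] + model.config.auxdata

# create_calculator forwards ntoys through to the toy-based calculator
calc = pyhf.infer.utils.create_calculator(
    "toybased", data, model, test_stat="q0", ntoys=10_000
)
q0_obs = calc.teststatistic(0.0)
# distributions() samples the test statistic under both hypotheses
sig_plus_bkg_dist, bkg_only_dist = calc.distributions(0.0)
CLsb, CLb, CLs = calc.pvalues(q0_obs, sig_plus_bkg_dist, bkg_only_dist)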

@alexander-held (Member) commented:

I missed the sqrt required to go from q0 to the significance in the previous comment; this should of course be a 2 sigma effect: sqrt(2500) = 50, so a 100 event excess over background-only with negligible background uncertainty will be a 2 sigma effect. That agrees perfectly with the first two numbers.
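
A quick numerical check of that arithmetic (my addition, using only the numbers quoted above):

import math

# 100 event excess over a background of 2500, with sqrt(2500) = 50
print(100 / math.sqrt(2500))          # 2.0, i.e. a ~2 sigma effect
# sqrt(q0) approximates the discovery significance
print(math.sqrt(3.9377335959507036))  # ~1.98, consistent with ~2 sigma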

@alexander-held (Member) commented:

The last two numbers scale with the number of signal events, so these are something else.

@matthewfeickert (Member, Author) commented Mar 1, 2022

The last two numbers scale with the number of signal events, so these are something else.

Yeah. The fact that the two calculators move in opposite directions as the signal increases tells us there's a problem.

@alexander-held (Member) commented:

At the moment this doesn't run any toys as far as I can tell, so I'm assuming this is a question about the calculators, and not about asymptotic vs toy agreement in general?

@matthewfeickert (Member, Author) commented:

At the moment this doesn't run any toys as far as I can tell, so I'm assuming this is a question about the calculators, and not about asymptotic vs toy agreement in general?

Yes, I didn't phrase the original text clearly as I was dumping things in for myself to clarify later.

@matthewfeickert (Member, Author) commented Mar 1, 2022

Okay, @kratsg has pointed out that the behavior of the calculator APIs is (known to be) not consistent across the asymptotic and toy-based calculators, as the asymptotic calculator is returning p-values (i.e. cumulative distributions of test statistics)

teststat = calc.teststatistic(poi_test)
sig_plus_bkg_distribution, bkg_only_distribution = calc.distributions(poi_test)
tb, _ = get_backend()
CLsb_obs, CLb_obs, CLs_obs = tuple(
    tb.astensor(pvalue)
    for pvalue in calc.pvalues(
        teststat, sig_plus_bkg_distribution, bkg_only_distribution
    )
)

while the toy-based calculator is returning q test statistics. At the very least this needs to be made much clearer in the docs.

@kratsg (Contributor) commented Mar 1, 2022

APIs is (known to be) not consistent across the asymptotic and toy-based calculators, as the asymptotic calculator is returning p-values

Well, it's consistent here; it's that calc.teststatistic(test_poi) has a different meaning, which is translated by the respective calculator's calc.pvalues(teststat, ...) call, if that makes sense. The API is the same, but there are a few hoops/hurdles/threading to make it work.
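
To illustrate that threading (a sketch under my reading of the thread, not verified output): the same sequence of calls yields directly comparable p-values from either calculator, even though the intermediate teststatistic values live in different spaces.

import pyhf

model = pyhf.simplemodels.uncorrelated_background([25], [2500], [2.5])
data = [2600] + model.config.auxdata

for calc_type in ("asymptotics", "toybased"):
    calc = pyhf.infer.utils.create_calculator(calc_type, data, model, test_stat="q0")
    # teststatistic() returns a value in the calculator's own space
    teststat = calc.teststatistic(0.0)
    sig_plus_bkg_dist, bkg_only_dist = calc.distributions(0.0)
    # pvalues() interprets teststat in that same space
    CLsb, CLb, CLs = calc.pvalues(teststat, sig_plus_bkg_dist, bkg_only_dist)
    print(calc_type, CLsb, CLb, CLs)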

@matthewfeickert (Member, Author) commented:

Well, it's consistent here; it's that calc.teststatistic(test_poi) has a different meaning, which is translated by the respective calculator's calc.pvalues(teststat, ...) call, if that makes sense. The API is the same, but there are a few hoops/hurdles/threading to make it work.

While all very true, a user should rightly complain that the public API is too confusing as is (given the current documentation).

matthewfeickert added the docs label Mar 1, 2022
@matthewfeickert (Member, Author) commented:

My main complaint at the moment is that while we make it clear that the asymptotic test stats are in -μ̂/σ space

class AsymptoticTestStatDistribution:
    r"""
    The distribution the test statistic in the asymptotic case.

    Note: These distributions are in :math:`-\hat{\mu}/\sigma` space.

we don't make this clear again in the calculator teststatistic API for the asymptotics calculator

def teststatistic(self, poi_test):
    r"""
    Compute the test statistic for the observed data under the studied model.
or in the toy calculator (that it is different)

def teststatistic(self, poi_test):
    """
    Compute the test statistic for the observed data under the studied model.

Also we could mention in EmpiricalDistribution that the test stats are distributed in different spaces as well

class EmpiricalDistribution:
    """
    The empirical distribution of the test statistic.

    Unlike :py:class:`~pyhf.infer.calculators.AsymptoticTestStatDistribution` where the
    distribution for the test statistic is normally distributed, the
    :math:`p`-values etc are computed from the sampled distribution.

So this really is a documentation issue, but a pretty big one in my mind.

I forgot this, and if I've written some of this code and can still make this mistake when tired, then I think a user definitely will.
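
For concreteness, the kind of note that could be added (suggested wording only, not current pyhf source):

def teststatistic(self, poi_test):
    r"""
    Compute the test statistic for the observed data under the studied model.

    .. note::

        For the asymptotics-based calculator the return value is in the
        :math:`-\hat{\mu}/\sigma` space of
        :py:class:`~pyhf.infer.calculators.AsymptoticTestStatDistribution`,
        while for the toy-based calculator it is the :math:`q` value computed
        by e.g. :py:func:`~pyhf.infer.test_statistics.q0`.
    """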

@kratsg (Contributor) commented Mar 1, 2022

My main complaint at the moment is that while we make it clear that the asymptotic test stats are in -μ̂/σ space

this part I think is what confuses me, so if you have a better grasp, it would be great to clarify this for the upcoming release.

@matthewfeickert (Member, Author) commented Mar 1, 2022

This all came up as I was taking a stab at Issue #1712 and trying to figure out how to have things work for either calculator type.

matthewfeickert changed the title "Discrepancy in test statistics from cacluators" to "Additional documentation needed on different between calculator type test statistic return type" Mar 1, 2022
matthewfeickert changed the title "Additional documentation needed on different between calculator type test statistic return type" to "Additional documentation needed on difference between calculator type test statistic return type" Mar 1, 2022