-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
result of Distribution::ChiSquared.cumulative_function outside [0..1] range #113
Comments
ok, so a couple of questions:
I took the liberty to sightly modify your inputs and this is what I got: # data setup
observations_1 = [388, 51692, 119, 45633]
expectations_1 = [269.8969662278191, 51810.10303377218, 237.10303377218088, 45514.89696622782]
chi_score_1, df_1 = *StatisticalTest::ChiSquaredTest.chi_statistic(expectations_1, observations_1)
# outputs
# irb(main):025:0> chi_score_1
# => 111.0839904758887
# irb(main):026:0> df_1
# => 3
# Goodness of fit with 95 % confidence
StatisticalTest::ChiSquaredTest.goodness_of_fit(0.05, expectations_1, observations_1)
# => {:probability=>1.0000001945374604, :p_value=>-1.945374603629091e-07, :alpha=>0.05, :null=>false, :alternative=>true, :confidence_level=>0.95} now, the problem still persist in that probability is bigger than 1.0 😛 so I'm going to look into that. Here's the same modifications made for the second example: irb(main):028:0> observations_2 = [388, 51692, 119, 45633, 271, 40040]
=> [388, 51692, 119, 45633, 271, 40040]
irb(main):029:0> expectations_2 = [293.3065012342283, 51786.69349876577, 257.66818441759625, 45494.3318155824, 227.02531434817544, 40083.974685651825]
=> [293.3065012342283, 51786.69349876577, 257.66818441759625, 45494.3318155824, 227.02531434817544, 40083.974685651825]
irb(main):030:0> chi_score_2, df_2 = *StatisticalTest::ChiSquaredTest.chi_statistic(expectations_2, observations_2)
=> [114.36002831520479, 5]
irb(main):031:0> StatisticalTest::ChiSquaredTest.goodness_of_fit(0.05, expectations_2, observations_2)
=> {:probability=>1.0000000000064857, :p_value=>-6.4857008652552395e-12, :alpha=>0.05, :null=>false, :alternative=>true, :confidence_level=>0.95} I'll also going to do a read on the python implementation. I'm not familiar with that package, so probably it works the same but with a different API. |
here's a variant that gives me infinity in the probability with the degrees of freedom you are using: irb(main):032:0> observed_1_variant = [388,51692]
=> [388, 51692]
irb(main):033:0> expected_1_variant = [269.8969662278191, 51810.10303377218]
=> [269.8969662278191, 51810.10303377218]
irb(main):034:0> StatisticalTest::ChiSquaredTest.goodness_of_fit(0.05, expected_1_variant, observed_1_variant)
=> {:probability=>Infinity, :p_value=>-Infinity, :alpha=>0.05, :null=>false, :alternative=>true, :confidence_level=>0.95} from what I recall (but correct me if I'm wrong!), we need a minimum of 3 or more elements to perform a Chi squared test, so this result makes sense to me: I need more data to perform a proper test 😛 |
here's an R version: > expected
[1] 269.897 51810.103 237.103 45514.897
> observed
[1] 388 51692 119 45633
> chisq.test(as.table(observed, expected))
Chi-squared test for given probabilities
data: as.table(observed, expected)
X-squared = 96566, df = 3, p-value < 2.2e-16
> chisq.test(as.table(expected, observed))
Chi-squared test for given probabilities
data: as.table(expected, observed)
X-squared = 96625, df = 3, p-value < 2.2e-16 |
As for questions 1 and 2: The nested array As for question 3, I use the following function (in a local test class) to calculate the chi score iterating over all cells of the contingency table:
I guess the implementation is not optimal, but I wasn't successful in reproducing this behaviour reliably with |
got it. I'll do some reading and exploration then and be right back to you. |
Consider a chi square distribution table. If you look at the second row (df=2), it predicts a value of 0.1 if the chi_squared value is 4.61. Using your gem
which rounds to 0.10. so it appears your function and the table agree. But if we try the first row (df = 1), for a chi_squared value of 2.71. it also predicts a value of 0.1, but your function does
i.e. it does not agree with the table. |
@obromios hi! interesting check. It's been a while since I try to stab a this problem, so it's going to be a bit slow for me to reply with a possible fix, if there's one at all. Thanks for looking at it and I'll try my best to allocate time to fix this! |
I work with the following two test contingency tables (
observed1
,observed2
) with these resulting variables:When calculated with Python, I get the following results when extracting the
pvalue
(replaceobserved
withobserved1
orobserved2
):When using
Distribution::ChiSquared.cumulative_function
from theruby-statistics
package, I get the following results (replacedf
withdf1
ordf2
,chi_score
likewise):While I cannot confirm the Python module yields the correct values, it seems more plausible, considering that the p-value should be in the range [0..1]?
The text was updated successfully, but these errors were encountered: