Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

result of Distribution::ChiSquared.cumulative_function outside [0..1] range #113

Open
oliver-czulo opened this issue Oct 30, 2023 · 7 comments

Comments

@oliver-czulo
Copy link
Contributor

oliver-czulo commented Oct 30, 2023

I work with the following two test contingency tables (observed1, observed2) with these resulting variables:

observed1 = [[388,51692],[119,45633]]
expected1 = [[269.8969662278191, 51810.10303377218], [237.10303377218088, 45514.89696622782]]
chi_score1 = 111.0839904758887
df1 = 1

observed2 = [[388,51692],[119,45633],[271,40040]]
expected2 = [[293.3065012342283, 51786.69349876577], [257.66818441759625, 45494.3318155824], [227.02531434817544, 40083.974685651825]]
chi_score2 = 114.36002831520479
df2 = 2

When calculated with Python, I get the following results when extracting the pvalue (replace observed with observed1 or observed2):

import scipy.stats as stats
# include above variables here
X2 = stats.chi2_contingency(observed, correction=False)[0]

# observed1 => pvalue = 5.671618200219206e-26
# observed2 => pvalue = 1.469045936431957e-25

When using Distribution::ChiSquared.cumulative_function from the ruby-statistics package, I get the following results (replace dfwith df1or df2, chi_score likewise):

probability = 1.0 - Statistics::Distribution::ChiSquared.new(df).cumulative_function(chi_score)
p_value = 1.0 - probability

# observed1 => p_value = Infinity
# observed2 => p_value = 1.0000333200206515

While I cannot confirm the Python module yields the correct values, it seems more plausible, considering that the p-value should be in the range [0..1]?

@oliver-czulo oliver-czulo changed the title result of Distribution::ChiSquared.cumulative_function differs from calculation in Python result of Distribution::ChiSquared.cumulative_function outside [0..1] range Oct 30, 2023
@estebanz01
Copy link
Owner

ok, so a couple of questions:

  • why degrees of freedom is less than the number of (observed elements - 1) ? for observed1 shouldn't be 3 and for observed2 be 5 ?
  • why the separation of elements in multiple arrays? e.g. for [388,51692] and [119,45633] you want to perform a Chi squared test on both arrays in a separated manner or do you want to compare both so only one chi squared test is performed?
  • how did you calculate the chi_score variable?

I took the liberty to sightly modify your inputs and this is what I got:

# data setup
observations_1 = [388, 51692, 119, 45633]
expectations_1 = [269.8969662278191, 51810.10303377218, 237.10303377218088, 45514.89696622782]
chi_score_1, df_1 = *StatisticalTest::ChiSquaredTest.chi_statistic(expectations_1, observations_1)

# outputs
# irb(main):025:0> chi_score_1
# => 111.0839904758887
# irb(main):026:0> df_1
# => 3

#  Goodness of fit with 95 % confidence
StatisticalTest::ChiSquaredTest.goodness_of_fit(0.05, expectations_1, observations_1)
# => {:probability=>1.0000001945374604, :p_value=>-1.945374603629091e-07, :alpha=>0.05, :null=>false, :alternative=>true, :confidence_level=>0.95}

now, the problem still persist in that probability is bigger than 1.0 😛 so I'm going to look into that.

Here's the same modifications made for the second example:

irb(main):028:0> observations_2 = [388, 51692, 119, 45633, 271, 40040]
=> [388, 51692, 119, 45633, 271, 40040]
irb(main):029:0> expectations_2 = [293.3065012342283, 51786.69349876577, 257.66818441759625, 45494.3318155824, 227.02531434817544, 40083.974685651825]
=> [293.3065012342283, 51786.69349876577, 257.66818441759625, 45494.3318155824, 227.02531434817544, 40083.974685651825]
irb(main):030:0> chi_score_2, df_2 = *StatisticalTest::ChiSquaredTest.chi_statistic(expectations_2, observations_2)
=> [114.36002831520479, 5]
irb(main):031:0> StatisticalTest::ChiSquaredTest.goodness_of_fit(0.05, expectations_2, observations_2)
=> {:probability=>1.0000000000064857, :p_value=>-6.4857008652552395e-12, :alpha=>0.05, :null=>false, :alternative=>true, :confidence_level=>0.95}

I'll also going to do a read on the python implementation. I'm not familiar with that package, so probably it works the same but with a different API.

@estebanz01
Copy link
Owner

estebanz01 commented Oct 31, 2023

here's a variant that gives me infinity in the probability with the degrees of freedom you are using:

irb(main):032:0> observed_1_variant = [388,51692]
=> [388, 51692]
irb(main):033:0> expected_1_variant = [269.8969662278191, 51810.10303377218]
=> [269.8969662278191, 51810.10303377218]
irb(main):034:0> StatisticalTest::ChiSquaredTest.goodness_of_fit(0.05, expected_1_variant, observed_1_variant)
=> {:probability=>Infinity, :p_value=>-Infinity, :alpha=>0.05, :null=>false, :alternative=>true, :confidence_level=>0.95}

from what I recall (but correct me if I'm wrong!), we need a minimum of 3 or more elements to perform a Chi squared test, so this result makes sense to me: I need more data to perform a proper test 😛

@estebanz01
Copy link
Owner

here's an R version:

> expected
[1]   269.897 51810.103   237.103 45514.897
> observed
[1]   388 51692   119 45633
> chisq.test(as.table(observed, expected))

	Chi-squared test for given probabilities

data:  as.table(observed, expected)
X-squared = 96566, df = 3, p-value < 2.2e-16

> chisq.test(as.table(expected, observed))

	Chi-squared test for given probabilities

data:  as.table(expected, observed)
X-squared = 96625, df = 3, p-value < 2.2e-16

@oliver-czulo
Copy link
Contributor Author

As for questions 1 and 2: The nested array observed1, for instance, represents a 2*2 contingency table, each sub-array stands for a row. Degree of freedom is then calculated as #rows-1 * #columns-1 (see also this statology page).

As for question 3, I use the following function (in a local test class) to calculate the chi score iterating over all cells of the contingency table:

def self.calculate_chi_score(observed, expected)
    sum = 0.0
    observed.size.times { |i|
        observed[i].size.times { |j|
            sum += ((observed[i])[j] - (expected[i])[j])**2 / (expected[i])[j]
         }
    }
    sum
end

I guess the implementation is not optimal, but I wasn't successful in reproducing this behaviour reliably with map or similar functions.

@estebanz01
Copy link
Owner

got it. I'll do some reading and exploration then and be right back to you.

@obromios
Copy link

obromios commented Nov 6, 2024

Consider a chi square distribution table. If you look at the second row (df=2), it predicts a value of 0.1 if the chi_squared value is 4.61. Using your gem

>> 1-Statistics::Distribution::ChiSquared.new(2).cumulative_function(4.61)
=> 0.09972206384119364

which rounds to 0.10. so it appears your function and the table agree. But if we try the first row (df = 1), for a chi_squared value of 2.71. it also predicts a value of 0.1, but your function does

>> 1-Statistics::Distribution::ChiSquared.new(1).cumulative_function(2.71)
=> -Infinity

i.e. it does not agree with the table.
As far as I can see, if you use df of 1, then any chi_squared_value gives -Infinity. It appears to me that the function is not returning the correct value when the number of degrees of freedom is 1.

@estebanz01
Copy link
Owner

@obromios hi! interesting check. It's been a while since I try to stab a this problem, so it's going to be a bit slow for me to reply with a possible fix, if there's one at all.

Thanks for looking at it and I'll try my best to allocate time to fix this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants