2.0.0 Codename Random (#12)
* statistical-test: Chi-Squared Goodness of fit. (#11)
* statistical-test: Paired T-tests. (#14)
* Wilcoxon Rank-sum test / Mann-Whitney U test (#15)
* Generate gaussian random numbers (#16)
* distribution: normal: Generate random gaussian samples.
* distribution: Weibull: Random sample that follows Weibull distribution.

* distribution: Student's T: Generate random samples for T distribution.

NOTE: The random sampling process is generating random values that
resemble a uniform random sample. Not sure why.

* version 2.0.0! 🎉
estebanz01 authored Jan 18, 2018
1 parent 4a9effa commit 33d75d8
Showing 18 changed files with 619 additions and 2 deletions.
2 changes: 2 additions & 0 deletions .travis.yml
@@ -3,4 +3,6 @@ language: ruby
rvm:
- 2.2
- 2.3.1
- 2.4.0
- 2.5.0
before_install: gem install bundler
6 changes: 6 additions & 0 deletions README.md
@@ -4,6 +4,12 @@

A basic ruby gem that implements some statistical methods, functions and concepts to be used in any ruby environment without depending on any mathematical software like `R`, `Matlab`, `Octave` or similar.

Unit tests run under the following Ruby versions:
* Ruby 2.2.
* Ruby 2.3.1.
* Ruby 2.4.0.
* Ruby 2.5.0.

We got the inspiration from the folks at [JStat](https://github.com/jstat/jstat) and some interesting lectures about [Keystroke dynamics](http://www.biometric-solutions.com/keystroke-dynamics.html).

Some logic and algorithms are extractions or adaptations from other authors, which are referenced in the comments.
2 changes: 1 addition & 1 deletion lib/math.rb
@@ -45,7 +45,7 @@ def self.simpson_rule(a, b, n, &block)

def self.lower_incomplete_gamma_function(s, x)
# The greater the iterations, the better. That's why we are iterating 10_000 * x times
-    self.simpson_rule(0, x, (10_000 * x).round) do |t|
+    self.simpson_rule(0, x, (10_000 * x.round).round) do |t|
(t ** (s - 1)) * Math.exp(-t)
end
end
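As a quick sanity check on this fix: for s = 1 the lower incomplete gamma function reduces to 1 - e^(-x), which is easy to compare against the Simpson's rule approximation. A minimal sketch, assuming the gem is loaded through a `statistics` entry point (an assumption, not shown in this diff):

require 'statistics' # assumption: the gem's entry point, which loads these Math extensions

x = 2
approximation = Math.lower_incomplete_gamma_function(1, x) # Simpson's rule over [0, x]
closed_form   = 1 - Math.exp(-x)                           # exact value of the integral for s = 1

puts (approximation - closed_form).abs # expected to be a very small error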
1 change: 1 addition & 0 deletions lib/statistics/distribution/beta.rb
@@ -28,6 +28,7 @@ def mode
end

def mean
return if alpha + beta == 0
alpha / (alpha + beta)
end
end
40 changes: 40 additions & 0 deletions lib/statistics/distribution/normal.rb
@@ -25,6 +25,46 @@ def density_function(value)

(left_up/(left_down) * right)
end

## Marsaglia polar method implementation for random gaussian (normal) number generation.
# References:
# https://en.wikipedia.org/wiki/Marsaglia_polar_method
# https://math.stackexchange.com/questions/69245/transform-uniform-distribution-to-normal-distribution-using-lindeberg-l%C3%A9vy-clt
# https://www.projectrhea.org/rhea/index.php/The_principles_for_how_to_generate_random_samples_from_a_Gaussian_distribution

def random(elements: 1, seed: Random.new_seed)
results = []

# Setup seed
srand(seed)

# Number of random numbers to be generated.
elements.times do
x, y, r = 0.0, 0.0, 0.0

# Find an (x, y) point inside the unit circle (x^2 + y^2 < 1).
loop do
x = 2.0 * rand - 1.0
y = 2.0 * rand - 1.0

r = (x ** 2) + (y ** 2)

break unless r >= 1.0 || r == 0
end

# Project the random point to the required random distance
r = Math.sqrt(-2.0 * Math.log(r) / r)

# Transform the random distance to a gaussian value and append it to the results array
results << mean + x * r * standard_deviation
end

if elements == 1
results.first
else
results
end
end
end

class StandardNormal < Normal
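A minimal usage sketch for the new sampler, with constructor arguments mean and standard deviation as used in the specs further down (the `require` is an assumption about the gem's entry point):

require 'statistics' # assumption: the gem's entry point

normal = Statistics::Distribution::Normal.new(5.0, 2.0) # mean 5.0, standard deviation 2.0

normal.random                         # => a single Float drawn via the Marsaglia polar method
normal.random(elements: 5, seed: 100) # => an Array of 5 draws; the same seed reproduces the sample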
32 changes: 32 additions & 0 deletions lib/statistics/distribution/t_student.rb
@@ -45,6 +45,38 @@ def variance
degrees_of_freedom/(degrees_of_freedom - 2.0)
end
end

# Quantile function extracted from http://www.jennessent.com/arcview/idf.htm
# TODO: Make this a true Student's t sample.
def random(elements: 1, seed: Random.new_seed)
warn 'This is alpha-version code. The generated sample is similar to a uniform distribution.'
srand(seed)

v = degrees_of_freedom
results = []

# Because the quantile function of a Student's t distribution is defined over (-Infinity, y),
# we set up a small threshold in order to properly compute the integral
threshold = 10_000.0e-12

elements.times do
y = rand
results << Math.simpson_rule(threshold, y, 10_000) do |t|
up = Math.gamma((v+1)/2.0)
down = Math.sqrt(Math::PI * v) * Math.gamma(v/2.0)
right = (1 + ((y ** 2)/v.to_f)) ** ((v+1)/2.0)
left = up/down.to_f

left * right
end
end

if elements == 1
results.first
else
results
end
end
end
end
end
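A call sketch only, given the author's alpha-version warning; note that the block passed to Math.simpson_rule never uses its `t` parameter (the integrand is written in terms of `y`), which may be related to the uniform-looking output mentioned in the commit message. The `require` is an assumption:

require 'statistics' # assumption: the gem's entry point

t_dist = Statistics::Distribution::TStudent.new(3) # 3 degrees of freedom

t_dist.random(elements: 5, seed: 100) # => Array of 5 values; emits the alpha-version warning first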
20 changes: 20 additions & 0 deletions lib/statistics/distribution/weibull.rb
@@ -41,6 +41,26 @@ def variance

(scale ** 2) * (left - right)
end

# Using the inverse CDF, also called the quantile function, we can generate
# a random sample that follows a Weibull distribution.
#
# Formula extracted from http://www.stat.yale.edu/Courses/1997-98/101/chigf.htm
def random(elements: 1, seed: Random.new_seed)
results = []

srand(seed)

elements.times do
results << ((-1/scale) * Math.log(1 - rand)) ** (1/shape)
end

if elements == 1
results.first
else
results
end
end
end
end
end
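A call sketch for the Weibull sampler. The constructor argument order (shape, scale) is an assumption on my part; it is not visible in this diff:

require 'statistics' # assumption: the gem's entry point

# Assumption: arguments are (shape, scale); check the initializer in weibull.rb.
weibull = Statistics::Distribution::Weibull.new(1.5, 2.0)

weibull.random                           # => a single draw via the inverse CDF (quantile) transform
weibull.random(elements: 10, seed: 2018) # => an Array of 10 reproducible draws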
42 changes: 42 additions & 0 deletions lib/statistics/statistical_test/chi_squared_test.rb
@@ -0,0 +1,42 @@
module Statistics
module StatisticalTest
class ChiSquaredTest
def self.chi_statistic(expected, observed)
# If expected is a number, we assume that all expected observations
# have the same probability of occurring, hence we expect to see the same
# expected count for each observed value
statistic = if expected.is_a? Numeric
observed.reduce(0) do |memo, observed_value|
up = (observed_value - expected) ** 2
memo += (up/expected.to_f)
end
else
expected.each_with_index.reduce(0) do |memo, (expected_value, index)|
up = (observed[index] - expected_value) ** 2
memo += (up/expected_value.to_f)
end
end

[statistic, observed.size - 1]
end

def self.goodness_of_fit(alpha, expected, observed)
chi_score, df = *self.chi_statistic(expected, observed) # Splat array result

return if chi_score.nil? || df.nil?

probability = Distribution::ChiSquared.new(df).cumulative_function(chi_score)
p_value = 1 - probability

# According to https://stats.stackexchange.com/questions/29158/do-you-reject-the-null-hypothesis-when-p-alpha-or-p-leq-alpha
# We can assume that if p_value <= alpha, we can safely reject the null hypothesis, ie. accept the alternative hypothesis.
{ probability: probability,
p_value: p_value,
alpha: alpha,
null: alpha < p_value,
alternative: p_value <= alpha,
confidence_level: 1 - alpha }
end
end
end
end
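A worked sketch of the new test using the scalar `expected` branch: 120 rolls of a die, so each face is expected 120/6 = 20 times. The `require` is an assumption about the gem's entry point:

require 'statistics' # assumption: the gem's entry point

observed = [22, 17, 20, 26, 21, 14] # counts per face over 120 rolls
expected = 20                       # Numeric expected => same expected count for every face

result = Statistics::StatisticalTest::ChiSquaredTest.goodness_of_fit(0.05, expected, observed)

result[:p_value]     # => probability of a chi-squared statistic at least this large under the null
result[:null]        # => true when we cannot reject "the die is fair" at alpha = 0.05
result[:alternative] # => true only when p_value <= alpha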
22 changes: 22 additions & 0 deletions lib/statistics/statistical_test/t_test.rb
@@ -41,6 +41,28 @@ def self.perform(alpha, tails, *args)
alternative: p_value <= alpha,
confidence_level: 1 - alpha }
end

def self.paired_test(alpha, tails, left_group, right_group)
# Handy snippet grabbed from https://stackoverflow.com/questions/2682411/ruby-sum-corresponding-members-of-two-or-more-arrays
differences = [left_group, right_group].transpose.map { |value| value.reduce(:-) }

degrees_of_freedom = differences.size - 1
down = differences.standard_deviation/Math.sqrt(differences.size)

t_score = (differences.mean - 0)/down.to_f

probability = Distribution::TStudent.new(degrees_of_freedom).cumulative_function(t_score)

p_value = 1 - probability
p_value *= 2 if tails == :two_tail

{ probability: probability,
p_value: p_value,
alpha: alpha,
null: alpha < p_value,
alternative: p_value <= alpha,
confidence_level: 1 - alpha }
end
end
end
end
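A call sketch for the paired test, e.g. the same subjects measured before and after a treatment; `:two_tail` is the only tails symbol visible in this diff, and the rest of the signature mirrors `perform` above. The `require` is an assumption:

require 'statistics' # assumption: the gem's entry point

before_treatment = [200, 210, 190, 205, 198]
after_treatment  = [195, 200, 188, 201, 194]

result = Statistics::StatisticalTest::TTest.paired_test(0.05, :two_tail, before_treatment, after_treatment)

result[:p_value] # => two-tailed p-value for "the mean paired difference is zero"
result[:null]    # => true when that difference is not significant at alpha = 0.05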
95 changes: 95 additions & 0 deletions lib/statistics/statistical_test/wilcoxon_rank_sum_test.rb
@@ -0,0 +1,95 @@
module Statistics
module StatisticalTest
class WilcoxonRankSumTest
def rank(elements)
ranked_elements = {}

elements.sort.each_with_index do |element, index|
if ranked_elements.fetch(element, false)
# This allows us to resolve ties easily when performing the rank summation per group
ranked_elements[element][:counter] += 1
ranked_elements[element][:rank] += (index + 1)
else
ranked_elements[element] = { counter: 1, rank: (index + 1) }
end
end

# ranked_elements = { x => { counter: 1, rank: y } }
ranked_elements
end

# Steps to perform the calculation are based on http://www.mit.edu/~6.s085/notes/lecture5.pdf
def perform(alpha, tails, group_one, group_two)
# Size for each group
n1, n2 = group_one.size, group_two.size

# Rank all data
total_ranks = rank(group_one + group_two)

# sum rankings per group
r1 = ranked_sum_for(total_ranks, group_one)
r2 = ranked_sum_for(total_ranks, group_two)

# calculate U statistic
u1 = (n1 * (n1 + 1)/2.0) - r1
u2 = (n2 * (n2 + 1)/2.0 ) - r2

u_statistic = [u1.abs, u2.abs].min

median_u = (n1 * n2)/2.0

ties = total_ranks.values.select { |element| element[:counter] > 1 }

std_u = if ties.size > 0
corrected_sigma(ties, n1, n2)
else
Math.sqrt((n1 * n2 * (n1 + n2 + 1))/12.0)
end

z = (u_statistic - median_u)/std_u

# Most literature is not very specific about which normal distribution to use.
# We ran multiple tests with Normal(median_u, std_u) and Normal(0, 1) and found
# the latter to be more aligned with the expected results.
probability = Distribution::StandardNormal.new.cumulative_function(z.abs)
p_value = 1 - probability
p_value *= 2 if tails == :two_tail

{ probability: probability,
u: u_statistic,
z: z,
p_value: p_value,
alpha: alpha,
null: alpha < p_value,
alternative: p_value <= alpha,
confidence_level: 1 - alpha }
end

# Formula extracted from http://www.statstutor.ac.uk/resources/uploaded/mannwhitney.pdf
private def corrected_sigma(ties, total_group_one, total_group_two)
n = total_group_one + total_group_two

rank_sum = ties.reduce(0) do |memo, t|
memo += ((t[:counter] ** 3) - t[:counter])/12.0
end

left = (total_group_one * total_group_two)/(n * (n - 1)).to_f
right = (((n ** 3) - n)/12.0) - rank_sum

Math.sqrt(left * right)
end

private def ranked_sum_for(total, group)
# sum rankings per group
group.reduce(0) do |memo, element|
rank_of_element = total[element][:rank] / total[element][:counter].to_f
memo += rank_of_element
end
end
end

# Both tests are the same. To support either name, we just alias the class
# with the implementation.
MannWhitneyU = WilcoxonRankSumTest
end
end
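Note that `perform` is an instance method here, unlike the class-level tests above, and `MannWhitneyU` is simply an alias for the same class. A call sketch (the `require` is an assumption):

require 'statistics' # assumption: the gem's entry point

group_one = [12, 15, 14, 10, 18, 11]
group_two = [16, 19, 21, 17, 20, 22]

test   = Statistics::StatisticalTest::WilcoxonRankSumTest.new
result = test.perform(0.05, :two_tail, group_one, group_two)

result[:u]    # => the U statistic (the smaller of |U1| and |U2|)
result[:z]    # => the normal-approximation score used for the p-value
result[:null] # => true when we cannot reject "both groups come from the same distribution"

# MannWhitneyU is the same class, so this is equivalent:
Statistics::StatisticalTest::MannWhitneyU.new.perform(0.05, :two_tail, group_one, group_two)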
2 changes: 1 addition & 1 deletion lib/statistics/version.rb
@@ -1,3 +1,3 @@
module Statistics
VERSION = "1.0.2"
VERSION = "2.0.0"
end
9 changes: 9 additions & 0 deletions spec/statistics/distribution/beta_spec.rb
@@ -55,10 +55,19 @@
end

describe '#mean' do
it 'returns nil if alpha and beta are zero' do
alpha = 0
beta = 0

expect(described_class.new(alpha, beta).mean).to be_nil
end

it 'calculates the expected mean for the beta distribution' do
alpha = rand(-5..5)
beta = rand(-5..5)

alpha = 1 if alpha + beta == 0 # To avoid NaN results.

expect(described_class.new(alpha, beta).mean).to eq alpha.to_f/(alpha.to_f + beta.to_f)
end
end
47 changes: 47 additions & 0 deletions spec/statistics/distribution/normal_spec.rb
@@ -24,6 +24,53 @@
end
end
end

# To test random generation, we are going to use the Goodness-of-fit test
# to validate if a sample fits a normal distribution.
describe '#random' do
it 'returns a pseudo random number that belongs to a normal distribution' do
# Normal sample generated from R with mean 5, std 2.0 and seed 100
alpha = 0.01
normal_sample = [3.995615, 5.263062, 4.842166, 6.773570, 5.233943]
random_sample = described_class.new(5.0, 2.0).random(elements: 5, seed: 100)

test = Statistics::StatisticalTest::ChiSquaredTest.goodness_of_fit(alpha, normal_sample, random_sample)

# Null hypothesis: Both samples belong to the same distribution (normal in this case)
# Alternative hypothesis: Each sample is generated from a different distribution.

expect(test[:null]).to be true
expect(test[:alternative]).to be false
end

it 'does not generate a random sample that follows a uniform distribution' do
# Uniform sample elements generated in R with seed 100
uniform_sample = [0.30776611, 0.25767250, 0.55232243, 0.05638315, 0.46854928]
random_sample = described_class.new(5.0, 2.0).random(elements: 5, seed: 100)

test = Statistics::StatisticalTest::ChiSquaredTest.goodness_of_fit(0.01, uniform_sample, random_sample)

expect(test[:null]).to be false
expect(test[:alternative]).to be true
end

it 'generates the specified number of random elements and stores them in an array' do
elements = rand(2..5)
sample = described_class.new(5.0, 2.0).random(elements: elements)

expect(sample).to be_a Array
expect(sample.size).to eq elements
end

it 'returns a single random number when only one element is required' do
normal = described_class.new(5.0, 2.0)
sample_1 = normal.random # 1 element by default
sample_2 = normal.random(elements: 1)

expect(sample_1).to be_a Numeric
expect(sample_2).to be_a Numeric
end
end
end

describe Statistics::Distribution::StandardNormal do