2.0.0 Codename Random (#12)
* statistical-test: Chi-Squared Goodness of fit. (#11)
* statistical-test: Paired T-tests. (#14)
* Wilcoxon Rank-sum test / Mann-Whitney U test (#15)
* Generate gaussian random numbers (#16)
* distribution: normal: Generate random gaussian samples.
* distribution: Weibull: Random sample that follows Weibull distribution.

* distribution: Student's T: Generate random samples for T distribution.

NOTE: The random sampling process is generating random values that
resemble a uniform random sample. Not sure why.

* version 2.0.0! 🎉
estebanz01 authored Jan 18, 2018
1 parent 4a9effa commit 33d75d8
Showing 18 changed files with 619 additions and 2 deletions.
2 changes: 2 additions & 0 deletions .travis.yml
@@ -3,4 +3,6 @@ language: ruby
rvm:
- 2.2
- 2.3.1
- 2.4.0
- 2.5.0
before_install: gem install bundler
6 changes: 6 additions & 0 deletions README.md
@@ -4,6 +4,12 @@

A basic ruby gem that implements some statistical methods, functions and concepts to be used in any ruby environment without depending on any mathematical software like `R`, `Matlab`, `Octave` or similar.

Unit tests run under the following Ruby versions:
* Ruby 2.2.
* Ruby 2.3.1.
* Ruby 2.4.0.
* Ruby 2.5.0.

We got the inspiration from the folks at [JStat](https://github.com/jstat/jstat) and some interesting lectures about [Keystroke dynamics](http://www.biometric-solutions.com/keystroke-dynamics.html).

Some logic and algorithms are extractions or adaptations from other authors, which are referenced in the comments.
2 changes: 1 addition & 1 deletion lib/math.rb
@@ -45,7 +45,7 @@ def self.simpson_rule(a, b, n, &block)

def self.lower_incomplete_gamma_function(s, x)
# The greater the iterations, the better. That's why we are iterating 10_000 * x times
-    self.simpson_rule(0, x, (10_000 * x).round) do |t|
+    self.simpson_rule(0, x, (10_000 * x.round).round) do |t|
(t ** (s - 1)) * Math.exp(-t)
end
end
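As a quick sanity check on this fix: for s = 1 the lower incomplete gamma function reduces to 1 - e^(-x), which is easy to compare against the Simpson's rule approximation. A minimal sketch, assuming the gem is loaded through a `statistics` entry point (an assumption, not shown in this diff):

require 'statistics' # assumption: the gem's entry point, which loads these Math extensions

x = 2
approximation = Math.lower_incomplete_gamma_function(1, x) # Simpson's rule over [0, x]
closed_form   = 1 - Math.exp(-x)                           # exact value of the integral for s = 1

puts (approximation - closed_form).abs # expected to be a very small error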
1 change: 1 addition & 0 deletions lib/statistics/distribution/beta.rb
@@ -28,6 +28,7 @@ def mode
end

def mean
return if alpha + beta == 0
alpha / (alpha + beta)
end
end
40 changes: 40 additions & 0 deletions lib/statistics/distribution/normal.rb
@@ -25,6 +25,46 @@ def density_function(value)

(left_up/(left_down) * right)
end

## Marsaglia polar method implementation for random gaussian (normal) number generation.
# References:
# https://en.wikipedia.org/wiki/Marsaglia_polar_method
# https://math.stackexchange.com/questions/69245/transform-uniform-distribution-to-normal-distribution-using-lindeberg-l%C3%A9vy-clt
# https://www.projectrhea.org/rhea/index.php/The_principles_for_how_to_generate_random_samples_from_a_Gaussian_distribution

def random(elements: 1, seed: Random.new_seed)
results = []

# Setup seed
srand(seed)

# Number of random numbers to be generated.
elements.times do
x, y, r = 0.0, 0.0, 0.0

# Find an (x, y) point inside the unit circle (x^2 + y^2 < 1).
loop do
x = 2.0 * rand - 1.0
y = 2.0 * rand - 1.0

r = (x ** 2) + (y ** 2)

break unless r >= 1.0 || r == 0
end

# Project the random point to the required random distance
r = Math.sqrt(-2.0 * Math.log(r) / r)

# Transform the random distance to a gaussian value and append it to the results array
results << mean + x * r * standard_deviation
end

if elements == 1
results.first
else
results
end
end
end

class StandardNormal < Normal
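A minimal usage sketch for the new sampler, with constructor arguments mean and standard deviation as used in the specs further down (the `require` is an assumption about the gem's entry point):

require 'statistics' # assumption: the gem's entry point

normal = Statistics::Distribution::Normal.new(5.0, 2.0) # mean 5.0, standard deviation 2.0

normal.random                         # => a single Float drawn via the Marsaglia polar method
normal.random(elements: 5, seed: 100) # => an Array of 5 draws; the same seed reproduces the sample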
32 changes: 32 additions & 0 deletions lib/statistics/distribution/t_student.rb
@@ -45,6 +45,38 @@ def variance
degrees_of_freedom/(degrees_of_freedom - 2.0)
end
end

# Quantile function extracted from http://www.jennessent.com/arcview/idf.htm
# TODO: Make this a true Student's t sample.
def random(elements: 1, seed: Random.new_seed)
warn 'This is alpha-version code. The generated sample is similar to a uniform distribution.'
srand(seed)

v = degrees_of_freedom
results = []

# Because the quantile function of a Student's t distribution is defined over (-Infinity, y),
# we set up a small threshold in order to properly compute the integral
threshold = 10_000.0e-12

elements.times do
y = rand
results << Math.simpson_rule(threshold, y, 10_000) do |t|
up = Math.gamma((v+1)/2.0)
down = Math.sqrt(Math::PI * v) * Math.gamma(v/2.0)
right = (1 + ((y ** 2)/v.to_f)) ** ((v+1)/2.0)
left = up/down.to_f

left * right
end
end

if elements == 1
results.first
else
results
end
end
end
end
end
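A call sketch only, given the author's alpha-version warning; note that the block passed to Math.simpson_rule never uses its `t` parameter (the integrand is written in terms of `y`), which may be related to the uniform-looking output mentioned in the commit message. The `require` is an assumption:

require 'statistics' # assumption: the gem's entry point

t_dist = Statistics::Distribution::TStudent.new(3) # 3 degrees of freedom

t_dist.random(elements: 5, seed: 100) # => Array of 5 values; emits the alpha-version warning first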
20 changes: 20 additions & 0 deletions lib/statistics/distribution/weibull.rb
@@ -41,6 +41,26 @@ def variance

(scale ** 2) * (left - right)
end

# Using the inverse CDF, also called the quantile function, we can generate
# a random sample that follows a Weibull distribution.
#
# Formula extracted from http://www.stat.yale.edu/Courses/1997-98/101/chigf.htm
def random(elements: 1, seed: Random.new_seed)
results = []

srand(seed)

elements.times do
results << ((-1/scale) * Math.log(1 - rand)) ** (1/shape)
end

if elements == 1
results.first
else
results
end
end
end
end
end
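A call sketch for the Weibull sampler. The constructor argument order (shape, scale) is an assumption on my part; it is not visible in this diff:

require 'statistics' # assumption: the gem's entry point

# Assumption: arguments are (shape, scale); check the initializer in weibull.rb.
weibull = Statistics::Distribution::Weibull.new(1.5, 2.0)

weibull.random                           # => a single draw via the inverse CDF (quantile) transform
weibull.random(elements: 10, seed: 2018) # => an Array of 10 reproducible draws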
42 changes: 42 additions & 0 deletions lib/statistics/statistical_test/chi_squared_test.rb
@@ -0,0 +1,42 @@
module Statistics
module StatisticalTest
class ChiSquaredTest
def self.chi_statistic(expected, observed)
# If expected is a number, we assume that all expected observations
# have the same probability of occurring, hence we expect to see the same
# expected count for each observed value
statistic = if expected.is_a? Numeric
observed.reduce(0) do |memo, observed_value|
up = (observed_value - expected) ** 2
memo += (up/expected.to_f)
end
else
expected.each_with_index.reduce(0) do |memo, (expected_value, index)|
up = (observed[index] - expected_value) ** 2
memo += (up/expected_value.to_f)
end
end

[statistic, observed.size - 1]
end

def self.goodness_of_fit(alpha, expected, observed)
chi_score, df = *self.chi_statistic(expected, observed) # Splat array result

return if chi_score.nil? || df.nil?

probability = Distribution::ChiSquared.new(df).cumulative_function(chi_score)
p_value = 1 - probability

# According to https://stats.stackexchange.com/questions/29158/do-you-reject-the-null-hypothesis-when-p-alpha-or-p-leq-alpha
# We can assume that if p_value <= alpha, we can safely reject the null hypothesis, ie. accept the alternative hypothesis.
{ probability: probability,
p_value: p_value,
alpha: alpha,
null: alpha < p_value,
alternative: p_value <= alpha,
confidence_level: 1 - alpha }
end
end
end
end
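A worked sketch of the new test using the scalar `expected` branch: 120 rolls of a die, so each face is expected 120/6 = 20 times. The `require` is an assumption about the gem's entry point:

require 'statistics' # assumption: the gem's entry point

observed = [22, 17, 20, 26, 21, 14] # counts per face over 120 rolls
expected = 20                       # Numeric expected => same expected count for every face

result = Statistics::StatisticalTest::ChiSquaredTest.goodness_of_fit(0.05, expected, observed)

result[:p_value]     # => probability of a chi-squared statistic at least this large under the null
result[:null]        # => true when we cannot reject "the die is fair" at alpha = 0.05
result[:alternative] # => true only when p_value <= alpha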
22 changes: 22 additions & 0 deletions lib/statistics/statistical_test/t_test.rb
@@ -41,6 +41,28 @@ def self.perform(alpha, tails, *args)
alternative: p_value <= alpha,
confidence_level: 1 - alpha }
end

def self.paired_test(alpha, tails, left_group, right_group)
# Handy snippet grabbed from https://stackoverflow.com/questions/2682411/ruby-sum-corresponding-members-of-two-or-more-arrays
differences = [left_group, right_group].transpose.map { |value| value.reduce(:-) }

degrees_of_freedom = differences.size - 1
down = differences.standard_deviation/Math.sqrt(differences.size)

t_score = (differences.mean - 0)/down.to_f

probability = Distribution::TStudent.new(degrees_of_freedom).cumulative_function(t_score)

p_value = 1 - probability
p_value *= 2 if tails == :two_tail

{ probability: probability,
p_value: p_value,
alpha: alpha,
null: alpha < p_value,
alternative: p_value <= alpha,
confidence_level: 1 - alpha }
end
end
end
end
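A call sketch for the paired test, e.g. the same subjects measured before and after a treatment; `:two_tail` is the only tails symbol visible in this diff, and the rest of the signature mirrors `perform` above. The `require` is an assumption:

require 'statistics' # assumption: the gem's entry point

before_treatment = [200, 210, 190, 205, 198]
after_treatment  = [195, 200, 188, 201, 194]

result = Statistics::StatisticalTest::TTest.paired_test(0.05, :two_tail, before_treatment, after_treatment)

result[:p_value] # => two-tailed p-value for "the mean paired difference is zero"
result[:null]    # => true when that difference is not significant at alpha = 0.05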
95 changes: 95 additions & 0 deletions lib/statistics/statistical_test/wilcoxon_rank_sum_test.rb
@@ -0,0 +1,95 @@
module Statistics
module StatisticalTest
class WilcoxonRankSumTest
def rank(elements)
ranked_elements = {}

elements.sort.each_with_index do |element, index|
if ranked_elements.fetch(element, false)
# This allows us to resolve ties easily when performing the rank summation per group
ranked_elements[element][:counter] += 1
ranked_elements[element][:rank] += (index + 1)
else
ranked_elements[element] = { counter: 1, rank: (index + 1) }
end
end

# ranked_elements = { x => { counter: 1, rank: y } }
ranked_elements
end

# Steps to perform the calculation are based on http://www.mit.edu/~6.s085/notes/lecture5.pdf
def perform(alpha, tails, group_one, group_two)
# Size for each group
n1, n2 = group_one.size, group_two.size

# Rank all data
total_ranks = rank(group_one + group_two)

# sum rankings per group
r1 = ranked_sum_for(total_ranks, group_one)
r2 = ranked_sum_for(total_ranks, group_two)

# calculate U statistic
u1 = (n1 * (n1 + 1)/2.0) - r1
u2 = (n2 * (n2 + 1)/2.0 ) - r2

u_statistic = [u1.abs, u2.abs].min

median_u = (n1 * n2)/2.0

ties = total_ranks.values.select { |element| element[:counter] > 1 }

std_u = if ties.size > 0
corrected_sigma(ties, n1, n2)
else
Math.sqrt((n1 * n2 * (n1 + n2 + 1))/12.0)
end

z = (u_statistic - median_u)/std_u

# Most literature is not very specific about which normal distribution to use.
# We ran multiple tests with Normal(median_u, std_u) and Normal(0, 1) and found
# the latter to be more aligned with the expected results.
probability = Distribution::StandardNormal.new.cumulative_function(z.abs)
p_value = 1 - probability
p_value *= 2 if tails == :two_tail

{ probability: probability,
u: u_statistic,
z: z,
p_value: p_value,
alpha: alpha,
null: alpha < p_value,
alternative: p_value <= alpha,
confidence_level: 1 - alpha }
end

# Formula extracted from http://www.statstutor.ac.uk/resources/uploaded/mannwhitney.pdf
private def corrected_sigma(ties, total_group_one, total_group_two)
n = total_group_one + total_group_two

rank_sum = ties.reduce(0) do |memo, t|
memo += ((t[:counter] ** 3) - t[:counter])/12.0
end

left = (total_group_one * total_group_two)/(n * (n - 1)).to_f
right = (((n ** 3) - n)/12.0) - rank_sum

Math.sqrt(left * right)
end

private def ranked_sum_for(total, group)
# sum rankings per group
group.reduce(0) do |memo, element|
rank_of_element = total[element][:rank] / total[element][:counter].to_f
memo += rank_of_element
end
end
end

# Both tests are the same. To support either name, we just alias the class
# with the implementation.
MannWhitneyU = WilcoxonRankSumTest
end
end
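Note that `perform` is an instance method here, unlike the class-level tests above, and `MannWhitneyU` is simply an alias for the same class. A call sketch (the `require` is an assumption):

require 'statistics' # assumption: the gem's entry point

group_one = [12, 15, 14, 10, 18, 11]
group_two = [16, 19, 21, 17, 20, 22]

test   = Statistics::StatisticalTest::WilcoxonRankSumTest.new
result = test.perform(0.05, :two_tail, group_one, group_two)

result[:u]    # => the U statistic (the smaller of |U1| and |U2|)
result[:z]    # => the normal-approximation score used for the p-value
result[:null] # => true when we cannot reject "both groups come from the same distribution"

# MannWhitneyU is the same class, so this is equivalent:
Statistics::StatisticalTest::MannWhitneyU.new.perform(0.05, :two_tail, group_one, group_two)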
2 changes: 1 addition & 1 deletion lib/statistics/version.rb
@@ -1,3 +1,3 @@
module Statistics
VERSION = "1.0.2"
VERSION = "2.0.0"
end
9 changes: 9 additions & 0 deletions spec/statistics/distribution/beta_spec.rb
@@ -55,10 +55,19 @@
end

describe '#mean' do
it 'returns nil if alpha and beta are zero' do
alpha = 0
beta = 0

expect(described_class.new(alpha, beta).mean).to be_nil
end

it 'calculates the expected mean for the beta distribution' do
alpha = rand(-5..5)
beta = rand(-5..5)

alpha = 1 if alpha + beta == 0 # To avoid NaN results.

expect(described_class.new(alpha, beta).mean).to eq alpha.to_f/(alpha.to_f + beta.to_f)
end
end
47 changes: 47 additions & 0 deletions spec/statistics/distribution/normal_spec.rb
@@ -24,6 +24,53 @@
end
end
end

# To test random generation, we are going to use the Goodness-of-fit test
# to validate if a sample fits a normal distribution.
describe '#random' do
it 'returns a pseudo random number that belongs to a normal distribution' do
# Normal sample generated from R with mean 5, std 2.0 and seed 100
alpha = 0.01
normal_sample = [3.995615, 5.263062, 4.842166, 6.773570, 5.233943]
random_sample = described_class.new(5.0, 2.0).random(elements: 5, seed: 100)

test = Statistics::StatisticalTest::ChiSquaredTest.goodness_of_fit(alpha, normal_sample, random_sample)

# Null hypothesis: Both samples belong to the same distribution (normal in this case)
# Alternative hypothesis: Each sample is generated from a different distribution.

expect(test[:null]).to be true
expect(test[:alternative]).to be false
end

it 'does not generate a random sample that follows a uniform distribution' do
# Uniform sample elements generated in R with seed 100
uniform_sample = [0.30776611, 0.25767250, 0.55232243, 0.05638315, 0.46854928]
random_sample = described_class.new(5.0, 2.0).random(elements: 5, seed: 100)

test = Statistics::StatisticalTest::ChiSquaredTest.goodness_of_fit(0.01, uniform_sample, random_sample)

expect(test[:null]).to be false
expect(test[:alternative]).to be true
end

it 'generates the specified number of random elements and stores them in an array' do
elements = rand(2..5)
sample = described_class.new(5.0, 2.0).random(elements: elements)

expect(sample).to be_a Array
expect(sample.size).to eq elements
end

it 'returns a single random number when only one element is required' do
normal = described_class.new(5.0, 2.0)
sample_1 = normal.random # 1 element by default
sample_2 = normal.random(elements: 1)

expect(sample_1).to be_a Numeric
expect(sample_2).to be_a Numeric
end
end
end

describe Statistics::Distribution::StandardNormal do