What's New
For a full listing of all changes, see our release notes directly in the repository.
Previously there were two ways to estimate statistics from a sample set: the Statistics class provided static extension methods to evaluate a single statistic from an enumerable, and DescriptiveStatistics computed a whole set of standard statistics at once. This was unsatisfactory because it was not very efficient: DescriptiveStatistics actually required more than one pass internally (mostly because of the median), and neither approach leveraged the fact that the sample set may already be sorted.
To fix the first issue, we've marked DescriptiveStatistics.Median as obsolete and will remove it in v3. Until then, the median computation is delayed until it is requested for the first time. In the normal case where Median is not used, DescriptiveStatistics therefore now requires only a single pass.
We attacked the second issue by introducing three new classes that compute a single statistic directly from the best-fitting sample data format:
- ArrayStatistics: operates on arrays which are not assumed to be sorted (though it does no harm if they are).
- SortedArrayStatistics: operates on arrays which must be sorted in ascending order.
- StreamingStatistics: operates on a stream in a single pass, without keeping the full data in memory at any time. It can thus be used to stream over data larger than system memory.
ArrayStatistics implements Minimum, Maximum, Mean, Variance, StandardDeviation, PopulationVariance and PopulationStandardDeviation. In addition, it implements order statistics functions that reorder the data array (partial sorting) and therefore carry the Inplace suffix to indicate the side effect: OrderStatistic, Median, Percentile, LowerQuartile, UpperQuartile, InterquartileRange, FiveNumberSummary, Quantile, QuantileCustom. They get slightly faster when called repeatedly on the same array, and up to about 5 such calls are usually still faster than a full sort.
Example: We want to compute the IQR of {3,1,2,4}.
using MathNet.Numerics.Statistics;
var data = new double[] { 3.0, 1.0, 2.0, 4.0 };
var iqr = ArrayStatistics.InterquartileRangeInplace(data); // iqr ≈ 2.1667; data is reordered in place
This is equivalent to executing IQR(c(3,1,2,4), type=8) in R. Note that we always default to approximately median-unbiased quantiles, hence R type 8. If you need compatibility with another implementation, you can use QuantileCustom, which accepts either a QuantileDefinition enum (we support all 9 R types, SAS 1-5, Excel, Nist, Hydrology, etc.) or a 4-parameter definition as in Mathematica.
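As a minimal sketch of switching definitions, the snippet below computes the lower quartile of the same sample once with the default definition and once with R type 7 (the default of R's quantile function). It assumes the Inplace overloads QuantileInplace(data, tau) and QuantileCustomInplace(data, tau, definition) on ArrayStatistics; the class name and the cloning of the array are only there for illustration.

```csharp
using System;
using MathNet.Numerics.Statistics;

class QuantileCustomExample
{
    static void Main()
    {
        var data = new double[] { 3.0, 1.0, 2.0, 4.0 };

        // Inplace methods reorder their argument, so each call gets its own copy here.
        // Default definition (R type 8), as in the IQR example.
        double q1Default = ArrayStatistics.QuantileInplace((double[])data.Clone(), 0.25);

        // Same quantile under R type 7, selected through the QuantileDefinition enum.
        double q1R7 = ArrayStatistics.QuantileCustomInplace(
            (double[])data.Clone(), 0.25, QuantileDefinition.R7);

        Console.WriteLine("R8: {0:F4}, R7: {1:F4}", q1Default, q1R7);
    }
}
```

Worked by hand from the standard definitions, the lower quartile of this sample is about 1.4167 under R type 8 and 1.75 under R type 7, so the two results should differ noticeably even on such a small sample.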
SortedArrayStatistics expects data to be sorted in ascending order and implements Minimum, Maximum, OrderStatistic, Median, Percentile, LowerQuartile, UpperQuartile, InterquartileRange, FiveNumberSummary, Quantile, QuantileCustom. It leverages the ordering for very fast (constant time) order statistics. Since there is no need to reorder the data, this class, unlike ArrayStatistics, never modifies the provided array. It does not re-implement any operations that cannot leverage the ordering, such as Mean or Variance; use the implementations from ArrayStatistics for those.
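A short sketch of the intended usage pattern: sort the array once, then read as many order statistics from it as needed, and fall back to ArrayStatistics for statistics that do not benefit from the ordering. The class name is hypothetical and the sample is the one from the earlier example.

```csharp
using System;
using MathNet.Numerics.Statistics;

class SortedArrayExample
{
    static void Main()
    {
        var data = new double[] { 3.0, 1.0, 2.0, 4.0 };

        // Sort once, in ascending order, as SortedArrayStatistics requires.
        Array.Sort(data);

        // Order statistics are read from the sorted array; the array is not modified.
        double median = SortedArrayStatistics.Median(data);
        double q1 = SortedArrayStatistics.LowerQuartile(data);
        double q3 = SortedArrayStatistics.UpperQuartile(data);
        double iqr = SortedArrayStatistics.InterquartileRange(data);

        // Mean cannot use the ordering, so it comes from ArrayStatistics.
        double mean = ArrayStatistics.Mean(data);

        Console.WriteLine("median={0}, q1={1}, q3={2}, iqr={3}, mean={4}",
            median, q1, q3, iqr, mean);
    }
}
```

If only one or two order statistics are needed from unsorted data, the Inplace methods on ArrayStatistics are likely the cheaper choice, since they avoid the full sort.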
StreamingStatistics estimates statistics in a single pass without storing the samples and implements Minimum, Maximum, Mean, Variance, StandardDeviation, PopulationVariance, PopulationStandardDeviation. It does not implement any order statistics, since they require sorting and are thus not computable in a single pass without keeping the data in memory.
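The following sketch illustrates the streaming case with a lazily generated sequence that is never materialized as an array. The Samples generator is a hypothetical stand-in for a real data source such as a file reader; it assumes StreamingStatistics.Mean and StreamingStatistics.Variance accept an IEnumerable<double>.

```csharp
using System;
using System.Collections.Generic;
using MathNet.Numerics.Statistics;

class StreamingExample
{
    // Hypothetical data source: yields values one at a time, never building an array.
    static IEnumerable<double> Samples(int count)
    {
        for (int i = 1; i <= count; i++)
        {
            yield return i % 10;
        }
    }

    static void Main()
    {
        // Each call enumerates the sequence once from start to end.
        double mean = StreamingStatistics.Mean(Samples(1000000));
        double variance = StreamingStatistics.Variance(Samples(1000000));

        Console.WriteLine("mean={0}, variance={1}", mean, variance);
    }
}
```

Note that each statistic here triggers its own pass over the source. For a source that can only be read once, that would have to be taken into account.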
The Statistics class has been updated to leverage these new implementations internally, and implements all of the statistics mentioned above as extension methods on enumerables. In addition, it now provides an empirical InverseCDF function, and for some statistics it provides alternative implementations with a Func suffix that, instead of taking a parameter, return a lambda/function that can be evaluated multiple times efficiently (e.g. for plotting): Quantile, QuantileCustom, Percentile, InverseCDF, OrderStatistic.
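As a rough illustration of the difference between the plain extension methods and the Func variants, the sketch below evaluates a single median directly and then builds a quantile function once and evaluates it at several points. It assumes QuantileFunc is available as an extension method on IEnumerable<double> returning a Func<double, double>; the class name and the evaluation grid are made up for the example.

```csharp
using System;
using MathNet.Numerics.Statistics;

class QuantileFuncExample
{
    static void Main()
    {
        var data = new double[] { 3.0, 1.0, 2.0, 4.0 };

        // Extension method on the enumerable: one statistic, one call.
        double median = data.Median();
        Console.WriteLine("median={0}", median);

        // Func variant: build the quantile function once, then evaluate it repeatedly.
        Func<double, double> quantile = data.QuantileFunc();
        for (int i = 0; i <= 4; i++)
        {
            double tau = i / 4.0;
            Console.WriteLine("tau={0}: {1}", tau, quantile(tau));
        }
    }
}
```

The Func form is a natural fit for plotting, where the same sample is queried at many points.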