Update DescriptiveStatistics.md #1077

151 changes: 72 additions & 79 deletions docs/DescriptiveStatistics.md
[hide]
#I "../../out/lib/net40"
#r "MathNet.Numerics.dll"
#r "MathNet.Numerics.FSharp.dll"
open System.Numerics
open MathNet.Numerics
open MathNet.Numerics.Statistics

Descriptive Statistics
======================

We need to reference Math.NET Numerics and open the statistics namespace:
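
The snippet that followed is collapsed in this diff view; a minimal sketch, mirroring the hidden setup block at the top of the file:

[lang=fsharp]
open MathNet.Numerics
open MathNet.Numerics.Statistics
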
Univariate Statistical Analysis
-------------------------------

The primary class for statistical analysis is `Statistics`, which provides common
descriptive statistics as static extension methods on `IEnumerable<double>` sequences.
However, various statistics can be computed much more efficiently if the data source
has known properties or structure, which is why the following classes provide specialized
static implementations:

* **ArrayStatistics** provides routines optimized for single-dimensional arrays. Some
of these routines end with the `Inplace` suffix, indicating that they reorder the
input array slightly towards being sorted during execution, without fully sorting
it, which could be expensive.
* **SortedArrayStatistics** provides routines optimized for an array sorted in ascending order.
Order statistics are especially efficient this way, some even with constant time complexity.
* **StreamingStatistics** processes large amounts of data without keeping them in memory.
This is useful if data larger than local memory is streamed directly from a disk or network.

Another alternative, in case you need to gather a whole set of statistical characteristics
in one pass, is provided by the `DescriptiveStatistics` class:
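
The usage sample that followed is collapsed in this diff view. A minimal sketch of the one-pass usage, assuming the `Mean`, `StandardDeviation` and `Maximum` properties (check the `DescriptiveStatistics` API reference for the full set):

[lang=fsharp]
let samples = [| 1.0; 2.0; 3.0; 4.0; 5.0 |]
let stats = DescriptiveStatistics(samples)
stats.Mean                // all characteristics below were gathered in a single pass
stats.StandardDeviation
stats.Maximum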

Minimum & Maximum
-----------------

The minimum and maximum values of a sample set can be evaluated with the `Minimum` and `Maximum`
functions of all four classes: `Statistics`, `ArrayStatistics`, `SortedArrayStatistics`
and `StreamingStatistics`. The versions in `SortedArrayStatistics` are the fastest, with constant
time complexity, but they expect the array to be sorted in ascending order.

Both min and max are directly affected by outliers and are therefore not robust statistics.
For a more robust alternative, consider using Quantiles instead.

[lang=csharp]
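// Illustrative sketch only; the original sample is collapsed in this diff view.
// Minimum and Maximum are the extension methods described above, usable on any IEnumerable<double>.
double[] data = { 3.1, -0.5, 7.2, 1.8 };
double min = data.Minimum();   // -0.5
double max = data.Maximum();   // 7.2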
Mean
----

Here, the mean refers to the *arithmetic mean* or *average* of the provided samples. In statistics, the sample mean is
a measure of central tendency and estimates the expected value of the distribution.
The mean is affected by outliers, so if you need a more robust estimate, consider using the median instead.

`Statistics.Mean(data)`
`StreamingStatistics.Mean(stream)`
`ArrayStatistics.Mean(data)`

$$\overline{x} = \frac{1}{N}\sum_{i=1}^N x_i$$

[lang=fsharp]
let whiteNoise = Generate.Normal(1000, mean=10.0, standardDeviation=2.0)
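// Continuation sketch; the remaining sample lines are collapsed in this diff view.
Statistics.Mean whiteNoise
// expected to be close to 10.0, the mean requested from Generate.Normal above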

Variance and Standard Deviation
-------------------------------

Variance $\sigma^2$ and the Standard Deviation $\sigma$ are measures of how far the samples are spread out.

If the whole population is available, the functions with the Population-prefix will evaluate the respective
measures with an $N$ normalizer for a population of size $N$.

`Statistics.PopulationVariance(population)`
`Statistics.PopulationStandardDeviation(population)`

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^N (x_i - \mu)^2$$

On the other hand, if only a sample of the full population is available, the functions without the Population-prefix
will estimate unbiased population measures by applying Bessel's correction with an $N-1$ normalizer to a sample
set of size $N$.

`Statistics.Variance(samples)`
`Statistics.StandardDeviation(samples)`

$$s^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i - \overline{x})^2$$

[lang=fsharp]
Statistics.Variance whiteNoise
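// Continuation sketch; the remaining sample lines are collapsed in this diff view.
Statistics.StandardDeviation whiteNoise
// expected to be close to 2.0, the standard deviation requested from Generate.Normal above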
#### Combined Routines

Since mean and variance are often needed together, there are routines
that evaluate both in a single pass:

`Statistics.MeanVariance(samples)`
`ArrayStatistics.MeanVariance(samples)`
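
The sample for the combined routines is collapsed in this diff view; a minimal sketch, assuming `MeanVariance` returns a (mean, variance) tuple:

[lang=fsharp]
let mean, variance = Statistics.MeanVariance whiteNoise
// one pass over the data instead of two separate passes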

Covariance
----------

The sample covariance is an estimation of the Covariance, a measure of how much two random
variables change together. Similar to the variance above, there are two versions, in order to
apply Bessel's correction for bias in the case of sample data.

`Statistics.Covariance(samples1, samples2)`

$$q = \frac{1}{N-1}\sum_{i=1}^N (x_i - \overline{x})(y_i - \overline{y})$$

`Statistics.PopulationCovariance(population1, population2)`

$$q = \frac{1}{N}\sum_{i=1}^N (x_i - \mu_x)(y_i - \mu_y)$$

[lang=fsharp]
Statistics.Covariance(whiteNoise, whiteNoise)

Order Statistics
----------------

#### Order Statistic

The k-th order statistic of a sample set is the k-th smallest value. Note that,
as an exception to most of Math.NET Numerics, the order k is one-based, meaning
the smallest value is the order statistic of order 1 (there is no order 0).

`Statistics.OrderStatistic(data, order)`
`SortedArrayStatistics.OrderStatistic(data, order)`

If the samples are sorted in ascending order, this is trivial and can be evaluated in
constant time, which is what the `SortedArrayStatistics` implementation does.

If you have the samples in an array that is not (guaranteed to be) sorted, but it is
fine if the array gets sorted incrementally over multiple calls, you can also use the
following in-place implementation. Unless you need to compute it for more than a handful
of orders, this is usually faster than fully sorting the array.

`ArrayStatistics.OrderStatisticInplace(data, order)`

For convenience there is also an option that returns a function `Func<int, double>`,
mapping from order to the resulting order statistic. Internally it sorts a copy of the
provided data and then, on each invocation, uses efficient sorted algorithms:

`Statistics.OrderStatisticFunc(data)`

Such Inplace and Func variants are a common pattern throughout the Statistics class
as well as the rest of the library.

[lang=fsharp]
Statistics.OrderStatistic(whiteNoise, 1)
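// Continuation sketch; the remaining lines are collapsed in this diff view.
// The Func variant described above returns a Func<int, float> from order to order statistic.
let orderStat = Statistics.OrderStatisticFunc whiteNoise
orderStat.Invoke 2   // second-smallest sample value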

#### Median

The median is the value lying in the middle of
the sorted set of samples and thus separating the higher half of the data from the lower half.
The median is only unique if the sample size is odd. This implementation internally
uses the default quantile definition, which is equivalent to mode 8 in R and is approximately
median-unbiased regardless of the sample distribution. If you need another convention, use
`QuantileCustom` instead; see below for details.

[lang=fsharp]
Statistics.Median whiteNoise
// [fsi:val it : float = 10.11872428]
Statistics.Median wave
// [fsi:val it : float = -2.452600839e-16]

#### Quartiles and the 5-Number Summary

Quartiles group the data, sorted in ascending order, into four equal groups, where each
group represents a quarter of the data. The lower quartile is estimated by the 0.25-quantile,
the upper quartile by the 0.75-quantile, and the second quartile
estimates the median as discussed above.

[lang=fsharp]
Statistics.UpperQuartile whiteNoise
// [fsi:val it : float = 11.33213732]

Using that data, we can provide a useful set of indicators usually named the 5-number summary,
which consists of the minimum value, the lower quartile, the median, the upper quartile, and
the maximum value. All these values can be visualized in the popular box plot diagrams.

`Statistics.FiveNumberSummary(data)`

[lang=fsharp]
Statistics.FiveNumberSummary wave
// [fsi:val it : float [] = [|-0.5; -0.3584185509; -2.452600839e-16; 0.3584185509; 0.5|] ]

The difference between the upper and the lower quartile is called the inter-quartile range (IQR)
and is a robust indicator of spread. In box plots, the IQR is the total height of the box.

`Statistics.InterquartileRange(data)`
`SortedArrayStatistics.InterquartileRange(data)`
`ArrayStatistics.InterquartileRangeInplace(data)`

Just like the median, quartiles use the default R8 quantile definition internally.

[lang=fsharp]
Statistics.InterquartileRange whiteNoise
// [fsi:val it : float = 2.686199498]

#### Percentiles

Percentiles further extend the concept by grouping the sorted values into 100
equal groups and then looking at the 101 places (0,1,..,100) between and around them.

Notable percentiles and the values they represent:

* 0: minimum value
* 25: first quartile
* 50: median
* 75: upper quartile
* 100: maximum value

`Statistics.Percentile(data, p)`
`Statistics.PercentileFunc(data)`
`SortedArrayStatistics.Percentile(data, p)`
`ArrayStatistics.PercentileInplace(data, p)`

Just like the median, percentiles use the default R8 quantile definition internally.

[lang=fsharp]
Statistics.Percentile(whiteNoise, 5)
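// Continuation sketch; the remaining lines are collapsed in this diff view.
// PercentileFunc is assumed to return a Func<int, float>, following the Func pattern described earlier.
let percentileOf = Statistics.PercentileFunc whiteNoise
percentileOf.Invoke 50   // the 50th percentile equals the median
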
#### Quantiles

Instead of grouping into 4 or 100 boxes, quantiles generalize the concept to an infinite number
of boxes, and thus to arbitrary real numbers $\tau$ between 0.0 and 1.0, where 0.0 represents the
minimum value, 0.5 the median, and 1.0 the maximum value. Quantiles are closely related to
the inverse cumulative distribution function of the sample distribution.

`Statistics.Quantile(data, tau)`
`Statistics.QuantileFunc(data)`
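
The sample that followed is collapsed in this diff view; a minimal sketch, reusing the `whiteNoise` data from the earlier examples:

[lang=fsharp]
Statistics.Quantile(whiteNoise, 0.95)
// the value below which roughly 95% of the samples fall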

#### Quantile Conventions and Compatibility

Remember that all these descriptive statistics *estimate* rather than *compute*
statistical indicators of the value distribution. In the case of quantiles,
there is usually not a single number between the two groups specified by $\tau$.
There are multiple ways to deal with this: the R project supports 9 modes,
while Mathematica and SciPy each have their own way to parametrize the behavior.

The `QuantileCustom` functions support all 9 modes from the R-project, including the one
used by Microsoft Excel, as well as the 4-parameter variant of Mathematica:

`Statistics.QuantileCustom(data, tau, definition)`
`Statistics.QuantileCustomFunc(data, definition)`
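
The sample that followed is collapsed in this diff view. A minimal sketch; the `QuantileDefinition.Excel` member name is an assumption, so check the enumeration for the exact names:

[lang=fsharp]
Statistics.QuantileCustom(whiteNoise, 0.95, QuantileDefinition.Excel)
// same tau as above, but evaluated with the Excel-compatible definition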

Rank Statistics
---------------

#### Ranks

Rank statistics are the counterpart to order statistics. The `Ranks` function evaluates the rank
of each sample and returns them as an array of doubles. If one of the values appears multiple
times, the return type is double instead of int in order to deal with the resulting ties. Similar to
`QuantileDefinition`, the `RankDefinition` enumeration controls how ties should be handled:

* **Average**, Default: Replace ties with their mean (causing non-integer ranks).
* **Min**, Sports: Replace ties with their minimum, as typical in sports ranking.
* **Max**: Replace ties with their maximum.
* **First**: Permutation with increasing values at each index of ties.
* **EmpiricalCDF**

`Statistics.Ranks(data, definition)`
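
The sample that followed is collapsed in this diff view; a minimal sketch of a tie being resolved with the default `Average` definition from the list above:

[lang=fsharp]
Statistics.Ranks([| 2.0; 1.0; 2.0; 3.0 |], RankDefinition.Average)
// the two 2.0 values share the mean rank 2.5: [| 2.5; 1.0; 2.5; 4.0 |]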

#### Quantile Rank

This is the counterpart of the `Quantile` function: it estimates the $\tau$ of the provided
$\tau$-quantile value $x$ from the provided samples. The $\tau$-quantile is the data value where
the cumulative distribution function crosses $\tau$.

`Statistics.QuantileRank(data, x, definition)`
`Statistics.QuantileRankFunc(data, definition)`
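
The sample that followed is collapsed in this diff view. A minimal sketch; passing a `RankDefinition` value as the `definition` argument is an assumption:

[lang=fsharp]
Statistics.QuantileRank(whiteNoise, 10.0, RankDefinition.Average)
// estimated tau of the value 10.0, roughly 0.5 since 10.0 is the mean of this sample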

Histograms
----------

A histogram can be computed using the [Histogram][hist] class. Its constructor takes
the samples enumerable, the number of buckets to create, and, optionally, the range
(minimum, maximum) of the sample data if available.

[hist]: https://numerics.mathdotnet.com/api/MathNet.Numerics.Statistics/Histogram.htm
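
The sample that followed is collapsed in this diff view. A minimal sketch of the constructor described above (samples, number of buckets); the `BucketCount` property and the bucket indexer are assumptions, see the linked API page:

[lang=fsharp]
let histogram = Histogram(whiteNoise, 10)
histogram.BucketCount      // 10
histogram.[0].Count        // number of samples falling into the first bucket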

Correlation
-----------

The `Correlation` class supports computing Pearson's product-moment and Spearman's ranked
correlation coefficient, as well as their correlation matrix for a set of vectors.

Code Sample: Computing the correlation coefficient of 1000 samples of $f(x) = 2x$ and $g(x) = x^2$:

[lang=csharp]
double[] dataF = Generate.LinearSpacedMap(1000, 0, 100, x => 2*x);
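// Continuation sketch; the remaining lines are collapsed in this diff view.
double[] dataG = Generate.LinearSpacedMap(1000, 0, 100, x => x * x);
double correlation = Correlation.Pearson(dataF, dataG);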