diff --git a/docs/DescriptiveStatistics.md b/docs/DescriptiveStatistics.md index da9dab424..b548a5d2b 100644 --- a/docs/DescriptiveStatistics.md +++ b/docs/DescriptiveStatistics.md @@ -1,11 +1,3 @@ - [hide] - #I "../../out/lib/net40" - #r "MathNet.Numerics.dll" - #r "MathNet.Numerics.FSharp.dll" - open System.Numerics - open MathNet.Numerics - open MathNet.Numerics.Statistics - Descriptive Statistics ====================== @@ -20,20 +12,20 @@ We need to reference Math.NET Numerics and open the statistics namespace: Univariate Statistical Analysis ------------------------------- -The primary class for statistical analysis is `Statistics` which provides common -descriptive statics as static extension methods to `IEnumerable` sequences. +The primary class for statistical analysis is `Statistics`, which provides common +descriptive statistics as static extension methods to `IEnumerable` sequences. However, various statistics can be computed much more efficiently if the data source -has known properties or structure, that's why the following classes provide specialized +has known properties or structure, which is why the following classes provide specialized static implementations: * **ArrayStatistics** provides routines optimized for single-dimensional arrays. Some of these routines end with the `Inplace` suffix, indicating that they reorder the input array slightly towards being sorted during execution - without fully sorting them, which could be expensive. -* **SortedArrayStatistics** provides routines optimized for an array sorting ascendingly. - Especially order-statistics are very efficient this way, some even with constant time complexity. +* **SortedArrayStatistics** provides routines optimized for an array sorted in ascending order. + Order statistics are especially efficient this way, some even with constant time complexity. * **StreamingStatistics** processes large amounts of data without keeping them in memory. 
- Useful if data larger than local memory is streamed directly from a disk or network. + This is useful if the data is larger than local memory and is streamed directly from a disk or network. Another alternative, in case you need to gather a whole set of statistical characteristics in one pass, is provided by the `DescriptiveStatistics` class: @@ -59,10 +51,10 @@ Minimum & Maximum The minimum and maximum values of a sample set can be evaluated with the `Minimum` and `Maximum` functions of all four classes: `Statistics`, `ArrayStatistics`, `SortedArrayStatistics` -and `StreamingStatistics`. The one in `SortedArrayStatistics` is the fastest with constant -time complexity, but expects the array to be sorted ascendingly. +and `StreamingStatistics`. The implementations in `SortedArrayStatistics` are the fastest, with constant +time complexity, but they expect the array to be sorted in ascending order. -Both min and max are directly affected by outliers and are therefore no robust statistics at all. +Both the min and max are directly affected by outliers and are therefore not robust statistics. For a more robust alternative, consider using Quantiles instead. [lang=csharp] @@ -74,16 +66,15 @@ For a more robust alternative, consider using Quantiles instead. Mean ---- -The *arithmetic mean* or *average* of the provided samples. In statistics, the sample mean is -a measure of the central tendency and estimates the expected value of the distribution. -The mean is affected by outliers, so if you need a more robust estimate consider to use the Median instead. +Here, the "mean" refers to the *arithmetic mean* or *average* of the provided samples. In statistics, the sample mean is +a measure of the central tendency and estimates the expected value of the distribution. +The mean is affected by outliers, so if you need a more robust estimate, consider using the median instead. 
`Statistics.Mean(data)` `StreamingStatistics.Mean(stream)` `ArrayStatistics.Mean(data)` -$$$ -\overline{x} = \frac{1}{N}\sum_{i=1}^N x_i +$$\overline{x} = \frac{1}{N}\sum_{i=1}^N x_i$$ [lang=fsharp] let whiteNoise = Generate.Normal(1000, mean=10.0, standardDeviation=2.0) @@ -100,24 +91,22 @@ Variance and Standard Deviation Variance $\sigma^2$ and the Standard Deviation $\sigma$ are measures of how far the samples are spread out. -If the whole population is available, the functions with the Population-prefix - will evaluate the respective measures with an $N$ normalizer for a population of size $N$. +If the whole population is available, the functions with the Population-prefix will evaluate the respective +measures with an $N$ normalizer for a population of size $N$. `Statistics.PopulationVariance(population)` `Statistics.PopulationStandardDeviation(population)` -$$$ -\sigma^2 = \frac{1}{N}\sum_{i=1}^N (x_i - \mu)^2 +$$\sigma^2 = \frac{1}{N}\sum_{i=1}^N (x_i - \mu)^2$$ -On the other hand, if only a sample of the full population is available, the functions -without the Population-prefix will estimate unbiased population measures by applying -Bessel's correction with an $N-1$ normalizer to a sample set of size $N$. +On the other hand, if only a sample of the full population is available, the functions without the Population-prefix +will estimate unbiased population measures by applying Bessel's correction with an $N-1$ normalizer to a sample +set of size $N$. 
`Statistics.Variance(samples)` `Statistics.StandardDeviation(samples)` -$$$ -s^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i - \overline{x})^2 +$$s^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i - \overline{x})^2$$ [lang=fsharp] Statistics.Variance whiteNoise @@ -131,7 +120,7 @@ s^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i - \overline{x})^2 #### Combined Routines Since mean and variance are often needed together, there are routines -that evaluate both in a single pass: +that evaluate both statistics in a single pass: `Statistics.MeanVariance(samples)` `ArrayStatistics.MeanVariance(samples)` @@ -145,18 +134,16 @@ Covariance ---------- The sample covariance is an estimation of the Covariance, a measure of how much two random -variables change together. Similarly to the variance above, there are two versions in order to -apply Bessel's correction to bias in case of sample data. +variables change together. As with the variance above, there are two versions, so that +Bessel's correction for bias can be applied in the case of sample data. `Statistics.Covariance(samples1, samples2)` -$$$ -q = \frac{1}{N-1}\sum_{i=1}^N (x_i - \overline{x})(y_i - \overline{y}) +$$q = \frac{1}{N-1}\sum_{i=1}^N (x_i - \overline{x})(y_i - \overline{y})$$ `Statistics.PopulationCovariance(population1, population2)` -$$$ -q = \frac{1}{N}\sum_{i=1}^N (x_i - \mu_x)(y_i - \mu_y) +$$q = \frac{1}{N}\sum_{i=1}^N (x_i - \mu_x)(y_i - \mu_y)$$ [lang=fsharp] Statistics.Covariance(whiteNoise, whiteNoise) @@ -170,30 +157,30 @@ Order Statistics #### Order Statistic The k-th order statistic of a sample set is the k-th smallest value. Note that, -as an exception to most of Math.NET Numerics, the order k is one-based, meaning +as an exception to most of Math.NET Numerics, the order k is one-based, meaning the smallest value is the order statistic of order 1 (there is no order 0). 
`Statistics.OrderStatistic(data, order)` `SortedArrayStatistics.OrderStatistic(data, order)` -If the samples are sorted ascendingly, this is trivial and can be evaluated in constant time, -which is what the `SortedArrayStatistics` implementation does. +If the samples are sorted in ascending order, this is trivial and can be evaluated in +constant time, which is what the `SortedArrayStatistics` implementation does. -If you have the samples in an array which is not (guaranteed to be) sorted, -but if it is fine if the array does incrementally get sorted over multiple calls, -you can also use the following in-place implementation. It is usually faster -than fully sorting the array, unless you need to compute it for more than a handful orders. +If you have the samples in an array that is not (guaranteed to be) sorted, and it is +acceptable for the array to get sorted incrementally over multiple calls, then you +can also use the following in-place implementation. Unless you need to compute +it for more than a handful of orders, it is usually faster than fully sorting the array. `ArrayStatistics.OrderStatisticInplace(data, order)` -For convenience there's also an option that returns a function `Func`, +For convenience there is also an option that returns a function, `Func`, mapping from order to the resulting order statistic. Internally it sorts a copy of the -provided data and then on each invocation uses efficient sorted algorithms: +provided data and then, on each invocation, uses efficient sorted algorithms: `Statistics.OrderStatisticFunc(data)` Such Inplace and Func variants are a common pattern throughout the Statistics class -and also the rest of the library. +as well as the rest of the library. [lang=fsharp] Statistics.OrderStatistic(whiteNoise, 1) @@ -222,7 +209,7 @@ the sorted set of samples and thus separating the higher half of the data from t The median is only unique if the sample size is odd. 
This implementation internally uses the default quantile definition, which is equivalent to mode 8 in R and is approximately median-unbiased regardless of the sample distribution. If you need another convention, use -`QuantileCustom` instead, see below for details. +`QuantileCustom` instead; see below for details. [lang=fsharp] Statistics.Median whiteNoise @@ -230,7 +217,7 @@ median-unbiased regardless of the sample distribution. If you need another conve Statistics.Median wave // [fsi:val it : float = -2.452600839e-16] -#### Quartiles and the 5-number summary +#### Quartiles and the 5-Number Summary Quartiles group the ascendingly sorted data into four equal groups, where each group represents a quarter of the data. The lower quartile is estimated by @@ -251,8 +238,8 @@ estimates the median as discussed above. Statistics.UpperQuartile whiteNoise // [fsi:val it : float = 11.33213732] -Using that data we can provide a useful set of indicators usually named 5-number summary, -which consists of the minimum value, the lower quartile, the median, the upper quartile and +Using that data, we can provide a useful set of indicators usually called the 5-number summary, +which consists of the minimum value, the lower quartile, the median, the upper quartile, and the maximum value. All these values can be visualized in the popular box plot diagrams. `Statistics.FiveNumberSummary(data)` @@ -265,14 +252,14 @@ the maximum value. All these values can be visualized in the popular box plot di Statistics.FiveNumberSummary wave // [fsi:val it : float [] = [|-0.5; -0.3584185509; -2.452600839e-16; 0.3584185509; 0.5|] ] -The difference between the upper and the lower quartile is called inter-quartile range (IQR) -and is a robust indicator of spread. In box plots the IQR is the total height of the box. +The difference between the upper and the lower quartile is called the inter-quartile range (IQR) +and is a robust indicator of spread. 
In box plots, the IQR is the total height of the box. `Statistics.InterquartileRange(data)` `SortedArrayStatistics.InterquartileRange(data)` `ArrayStatistics.InterquartileRangeInplace(data)` -Just like median, quartiles use the default R8 quantile definition internally. +Just like the median, quartiles use the default R8 quantile definition internally. [lang=fsharp] Statistics.InterquartileRange whiteNoise @@ -280,17 +267,23 @@ Just like median, quartiles use the default R8 quantile definition internally. #### Percentiles -Percentiles extend the concept further by grouping the sorted values into 100 -equal groups and looking at the 101 places (0,1,..,100) between and around them. -The 0-percentile represents the minimum value, 25 the first quartile, 50 the median, -75 the upper quartile and 100 the maximum value. +Percentiles further extend the concept by grouping the sorted values into 100 +equal groups and then looking at the 101 places (0,1,..,100) between and around them. + +Some percentiles correspond to statistics discussed above: + +* 0: minimum value +* 25: first quartile +* 50: median +* 75: upper quartile +* 100: maximum value `Statistics.Percentile(data, p)` `Statistics.PercentileFunc(data)` `SortedArrayStatistics.Percentile(data, p)` `ArrayStatistics.PercentileInplace(data, p)` -Just like median, percentiles use the default R8 quantile definition internally. +Just like the median, percentiles use the default R8 quantile definition internally. [lang=fsharp] Statistics.Percentile(whiteNoise, 5) @@ -301,9 +294,9 @@ Just like median, percentiles use the default R8 quantile definition internally. #### Quantiles Instead of grouping into 4 or 100 boxes, quantiles generalize the concept to an infinite number -of boxes and thus to arbitrary real numbers $\tau$ between 0.0 and 1.0, where 0.0 represents the -minimum value, 0.5 the median and 1.0 the maximum value. Quantiles are closely related to -the inverse cumulative distribution function of the sample distribution. +of boxes. 
This extends the index to arbitrary real numbers $\tau$ between +0.0 and 1.0, where 0.0 represents the minimum value, 0.5 the median, and 1.0 the maximum value. +Quantiles are closely related to the inverse cumulative distribution function of the sample distribution. `Statistics.Quantile(data, tau)` `Statistics.QuantileFunc(data)` @@ -316,14 +309,14 @@ the inverse cumulative distribution function of the sample distribution. #### Quantile Conventions and Compatibility -Remember that all these descriptive statistics do not *compute* but merely *estimate* -statistical indicators of the value distribution. In the case of quantiles, -there is usually not a single number between the two groups specified by $\tau$. -There are multiple ways to deal with this: the R project supports 9 modes and Mathematica -and SciPy have their own way to parametrize the behavior. +Remember that all these descriptive statistics *estimate* rather than *compute* +statistical indicators of the value distribution. In the case of quantiles, +there usually is not a single number between the two groups specified by $\tau$. +There are multiple ways to deal with this: the R project supports 9 modes, +while Mathematica and SciPy have their own ways to parametrize the behavior. -The `QuantileCustom` functions support all 9 modes from the R-project, which includes the one -used by Microsoft Excel, and also the 4-parameter variant of Mathematica: +The `QuantileCustom` functions support all 9 modes from the R-project, including the one +used by Microsoft Excel, and also the 4-parameter variant of Mathematica: `Statistics.QuantileCustom(data, tau, definition)` `Statistics.QuantileCustomFunc(data, definition)` @@ -356,14 +349,14 @@ Rank Statistics #### Ranks Rank statistics are the counterpart to order statistics. The `Ranks` function evaluates the rank -of each sample and returns them as an array of doubles. 
The return type is double instead of int -in order to deal with ties, if one of the values appears multiple times. -Similar to `QuantileDefinition`, the `RankDefinition` enumeration controls how ties should be handled: +of each sample and returns them as an array of doubles. The return type is double instead of int in +order to deal with ties, which occur if one of the values appears multiple times. Similar to +`QuantileDefinition`, the `RankDefinition` enumeration controls how ties should be handled: * **Average**, Default: Replace ties with their mean (causing non-integer ranks). * **Min**, Sports: Replace ties with their minimum, as typical in sports ranking. * **Max**: Replace ties with their maximum. -* **First**: Permutation with increasing values at each index of ties. +* **First**: Permutation containing increasing values at each index of ties. * **EmpiricalCDF** `Statistics.Ranks(data, definition)` @@ -380,9 +373,9 @@ Similar to `QuantileDefinition`, the `RankDefinition` enumeration controls how t #### Quantile Rank -Counterpart of the `Quantile` function, estimates $\tau$ of the provided $\tau$-quantile value -$x$ from the provided samples. The $\tau$-quantile is the data value where the cumulative distribution -function crosses $\tau$. +The quantile rank is the counterpart of the `Quantile` function: it estimates the $\tau$ of the provided +$\tau$-quantile value $x$ from the provided samples. The $\tau$-quantile is the data value where +the cumulative distribution function crosses $\tau$. `Statistics.QuantileRank(data, x, definition)` `Statistics.QuantileRankFunc(data, definition)` @@ -422,7 +415,7 @@ Histograms ---------- A histogram can be computed using the [Histogram][hist] class. Its constructor takes -the samples enumerable, the number of buckets to create, plus optionally the range +the samples enumerable, the number of buckets to create, and, optionally, the range (minimum, maximum) of the sample data if available. 
[hist]: https://numerics.mathdotnet.com/api/MathNet.Numerics.Statistics/Histogram.htm @@ -438,7 +431,7 @@ Correlation The `Correlation` class supports computing Pearson's product-momentum and Spearman's ranked correlation coefficient, as well as their correlation matrix for a set of vectors. -Code Sample: Computing the correlation coefficient of 1000 samples of f(x) = 2x and g(x) = x^2: +Code Sample: Computing the correlation coefficient of 1000 samples of $f(x) = 2x$ and $g(x) = x^2$: [lang=csharp] double[] dataF = Generate.LinearSpacedMap(1000, 0, 100, x => 2*x);