diff --git a/2_mathematical_spaces/spaces.qmd b/2_mathematical_spaces/spaces.qmd index 95b5a05..d955dd3 100644 --- a/2_mathematical_spaces/spaces.qmd +++ b/2_mathematical_spaces/spaces.qmd @@ -36,7 +36,7 @@ sophisticated sets but also sets that are equipped with additional structure. These combinations of sets and structure are also known as _spaces_. In this chapter we will discuss the basic conceptual features of general spaces -before reviewing some of prototypical spaces that are particularly common in +before reviewing some of the prototypical spaces that are particularly common in practical applications. This presentation will include not only the properties of sets with an arbitrary number of elements but also a survey of some of the most fundamental structures that we can endow onto those sets. @@ -104,7 +104,7 @@ In many circumstances we will need to distinguish between variables that refer to _arbitrary_ elements and variables that refer to _particular_ but unspecified elements. Following the computer science canon I will refer to these as **unbound variables** and **bound variables**, respectively. To distinguish -between the two I will decorate bound variables with a tilde; in words $x$ +between the two I will decorate bound variables with a tilde; in other words $x$ denotes any element of the space $X$ while $\tilde{x}$ denotes a fixed but unspecified element. @@ -154,7 +154,7 @@ single element, and a **full set** consisting of the entire set (@fig-subsets). Most subsets, however, contain an intermediate number of elements. One of the key features of uncountable spaces is that most subsets also contain an uncountable number of elements. To visually represent subsets containing an -uncountable number of elements I will used filled shapes to contrast against +uncountable number of elements I will use filled shapes to contrast against individual points. ::: {#fig-subsets layout="[ [-5, 45, 45, -5], [-5, 45, 45, -5]]"} @@ -787,7 +787,7 @@ most subsets will be neither open nor closed. Unlike open balls these metric-derived open subsets are _closed_ under unions and intersections. If $\mathsf{x}_{1}$ and $\mathsf{x}_{2}$ are both open -subsets then $\mathsf{x}_{1} \cup \mathsf{x}_{2}$ will also an open subset. In +subsets then $\mathsf{x}_{1} \cup \mathsf{x}_{2}$ will also be an open subset. In fact the union of _any_ number of open subsets will be open. Likewise if $\mathsf{x}_{1}$ and $\mathsf{x}_{2}$ are both open subsets then $\mathsf{x}_{1} \cap \mathsf{x}_{2}$ will also be an open subset. Indeed the @@ -964,7 +964,7 @@ figures/structures/general_topology/convergence/convergence){ width=90% #fig-general-convergence} A subtle benefit of this topological definition of convergence is that because -it doesn't require a metric is also doesn't require us to define the positive +it doesn't require a metric, it also doesn't require us to define the positive real numbers. This can be helpful for avoiding circular logic in more technical mathematical analyses. @@ -1021,10 +1021,10 @@ algebra, or metric just builds on top of that foundation. Mathematically it is much easier to work with structures that are _compatible_ with each other. For example if we want to equip a set with both a topology and a metric then the resulting space will be particularly well-behaved if -the we use a metric topology. At the same time ambient structure can also +we use a metric topology. At the same time ambient structure can also distinguish certain compatible subsets. 
-For example is a set is equipped with an ordering then we can define **interval**
+For example if a set is equipped with an ordering then we can define **interval**
subsets that contain all elements above and below two boundary elements. An
**open interval** excludes both boundary elements,
$$
@@ -1045,7 +1045,7 @@ intervals that contain only one boundary,
\end{align*}
Note that these notions of "open" and "closed" subsets are in general distinct
from the open and closed subsets defined by a topology. Only when an ordering
-is compatible with a topology will these the open and closed intervals also be
+is compatible with a topology will these open and closed intervals also be
topologically open and closed.

As we saw in [Section 1.2.4.1](@sec:open-balls) a metric distinguishes subsets
@@ -1086,8 +1086,8 @@ $\mathsf{x}_{1} \subset X$ is smaller than a subset
$\mathsf{x}_{2} \subset X$ if $\mathsf{x}_{1} \subset \mathsf{x}_{2}$, and
larger if $\mathsf{x}_{2} \subset \mathsf{x}_{1}$. Two subsets that only
partially overlap are incomparable, and hence fall into the same place in the
sequential
-ordering. For any set the empty set will always the smallest subset and the
-full set will always the largest.
+ordering. For any set the empty set will always be the smallest subset and the
+full set will always be the largest.

The union and intersection operations introduce algebraic structure, known as
a **Boolean algebraic structure**, to the power set. They are both commutative,
@@ -1311,6 +1311,10 @@ figures/real_line_grid/real_line_grid){width=50% #fig-real-line-grid}

## Extended Real Lines

+
+
One limitation of a real line is that it contains points that _approach_
either negative or positive infinity, but not points that represent those
limits directly. An **extended real line** resolves this by introducing two new
@@ -1567,7 +1571,7 @@ Using this notation we can define an inverse function a bit more compactly as
$$
\text{Id} = f^{-1} \circ f.
$$
-In words the composition of a bijective function with its inverse function is
+In other words the composition of a bijective function with its inverse function is
the identity function.

## Relating Structures
@@ -1852,7 +1856,7 @@ Pushforward and pullback functions allow us to _lift_ a transformation between
sets into a transformation between spaces. For structure that can be pushed
forward along the function $f : X \rightarrow Y$ any input space
$(X, \mathfrak{x})$ automatically defines a compatible output space
-$(Y, f_{*}(\mathfrak{x}))$. Similarly for structure that can be pulled back
+$(Y, f_{*}(\mathfrak{x}))$. Similarly for a structure that can be pulled back
against $f$ any output space $(Y, \mathfrak{y})$ automatically defines a
compatible input space $(X, f^{*}(\mathfrak{y}))$.
@@ -1923,7 +1927,7 @@ latter not (@fig-monoticity).

Monotonically increasing functions preserve orderings so that larger inputs
always imply larger outputs. The function (a) $f_{1} : x \mapsto x^{3}$ is
-monotonic but the function (b) $f_{1} : x \mapsto -x^{3}$ is not.
+monotonic but the function (b) $f_{2} : x \mapsto -x^{2}$ is not.
:::

#### Algebra-Preserving Relations
diff --git a/3_product_spaces/product_spaces.qmd b/3_product_spaces/product_spaces.qmd
index f30f8f9..91d68af 100644
--- a/3_product_spaces/product_spaces.qmd
+++ b/3_product_spaces/product_spaces.qmd
@@ -181,7 +181,7 @@ $X_{2}$. Each element of the product set is uniquely specified by one element
of $X_{1}$ and one element of $X_{2}$.

Consequently every variable taking values in the
-product set $x \in X_{1} \times X_{2}$ is compromised of an ordered pair of
+product set $x \in X_{1} \times X_{2}$ is comprised of an ordered pair of
variables from each component space,
$$
x = (x_{1}, x_{2}),
$$
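To make the ordered-pair structure concrete, here is a minimal Python sketch that enumerates a small product set element by element; the component sets are hypothetical, chosen purely for illustration.

```python
import itertools

# Hypothetical component sets, chosen purely for illustration.
X1 = ["a", "b", "c"]
X2 = [0, 1]

# Every element of the product set X1 x X2 is an ordered pair (x1, x2),
# uniquely specified by one element from each component set.
product_set = list(itertools.product(X1, X2))

print(product_set)
# [('a', 0), ('a', 1), ('b', 0), ('b', 1), ('c', 0), ('c', 1)]
print(len(product_set) == len(X1) * len(X2))  # True
```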
@@ -282,7 +282,7 @@ X_{1} \times \ldots \times X_{i} \times \ldots \times X_{I}
=
\times_{i = 1}^{I} X_{i}
$$
-where every product variable $x \in \times_{i = 1}^{I} X_{i}$ is compromised of
+where every product variable $x \in \times_{i = 1}^{I} X_{i}$ is comprised of
an ordered collection of component variables
$$
x = ( x_{1}, \ldots, x_{i}, \ldots, x_{I})
$$
@@ -609,7 +609,7 @@ component full sets we will always be able to construct the product empty set
and product full set from this procedure.

Moreover because the component open subsets are finite intersections these
-productsubsets will be as well. For example given any finite collection of open
+product subsets will be as well. For example given any finite collection of open
component subsets
$$
\{ \mathsf{x}_{1, i}, \ldots, \mathsf{x}_{j, i}, \ldots \mathsf{x}_{J, i} \}
$$
@@ -870,7 +870,7 @@ $$
( 1, \ldots, i_{1} - 1, i_{1} + 1, \ldots, i_{J} - 1, i_{J} + 1, \ldots, J )
$$
define yet another product set. Replicating this second product set once for
-each element of first product set defines a collection of **cross sections sets**
+each element of the first product set defines a collection of **cross section sets**
(@fig-conditioning),
\begin{align*}
\times_{i' = 1}^{I} X_{i'} \mid (x_{i_{1}}, \ldots, x_{i_{J}})
diff --git a/4_probability_on_general_spaces/probability_on_general_spaces.qmd b/4_probability_on_general_spaces/probability_on_general_spaces.qmd
index 47f893e..ba997b9 100644
--- a/4_probability_on_general_spaces/probability_on_general_spaces.qmd
+++ b/4_probability_on_general_spaces/probability_on_general_spaces.qmd
@@ -205,7 +205,7 @@ sufficiently useful for practical application or if we need to
consider countably additive measures, let alone measures that might
be additive over even larger collections of subsets.

-For example a common problem that arises is practice is reconstructing
+For example a common problem that arises in practice is reconstructing
the measure allocated to a general subset from the measures
allocated to particularly nice subsets that are easier to work with.
If we could always decompose a generic subset into the disjoint union of a
@@ -217,7 +217,7 @@ Potentially some subsets might be decomposable only into an
uncountably infinite number of subsets in which case we would need
even stronger notions of additivity!

-Fortunately for us we don't have to go to that last extreme. In turns
+Fortunately for us we don't have to go to that last extreme. It turns
out that on most spaces that we'll encounter in practice, and typical
notions of "nice" subsets, countable additivity is sufficient for
reconstructing the measure allocated to more general subsets.
@@ -228,9 +228,9 @@ allocations to **rectangular** subsets (@fig-disk-decomposition).

In general a non-rectangular subset, in this case a disk, can be
crudely approximated by a single rectangular subset. The disk can be
approximated more precisely as the disjoint union of many different
-rectangular subsets, but that will never exact reconstruct the disk.
+rectangular subsets, but that will never exactly reconstruct the disk.
Only when we incorporate a countably infinite number of rectangular
-subsets can be reconstruct the disk without any error.
+subsets can we reconstruct the disk without any error.
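To see this refinement numerically, here is a minimal Python sketch that bounds the measure of the unit disk from below with finite unions of disjoint grid squares; the particular grid resolutions are hypothetical choices for illustration.

```python
import math

def disk_area_lower_bound(n):
    """Total area of the disjoint axis-aligned squares in an n-by-n
    grid over [-1, 1]^2 that lie entirely inside the unit disk."""
    h = 2.0 / n  # side length of each grid square
    area = 0.0
    for i in range(n):
        for j in range(n):
            # The corner of the square farthest from the origin; if it
            # lies inside the unit circle then the whole square does.
            x = max(abs(-1.0 + i * h), abs(-1.0 + (i + 1) * h))
            y = max(abs(-1.0 + j * h), abs(-1.0 + (j + 1) * h))
            if x * x + y * y <= 1.0:
                area += h * h
    return area

# Any finite union of squares undershoots the true area, pi; refining
# the grid shrinks the error but never eliminates it entirely.
for n in [10, 100, 1000]:
    a = disk_area_lower_bound(n)
    print(f"n = {n:4d}: area = {a:.4f}, error = {math.pi - a:.4f}")
```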
![On a two-dimensional real plane $\mathbb{R}^{2}$ a non-rectangular
disk can be approximated, but not exactly reconstructed, by the finite
@@ -339,7 +339,7 @@ Similarly the elements of a $\sigma$-algebra are known as **measurable
subsets** while any subsets in the power set but not in the
$\sigma$-algebra are referred to as **non-measurable** subsets.

-When non-measurable subsets are misbehaving subsets they reveals the
+When non-measurable subsets are misbehaving subsets they reveal the
subtle, and often counterintuitive, pathologies inherent to that
space. By working with $\sigma$-algebras directly we can avoid these
awkward pathologies entirely.
@@ -415,7 +415,7 @@ behaviors that we have to avoid at all!
I will refer to any measurable space $(X, 2^{X})$ compatible with a
discrete topology as **discrete measurable spaces**.

-On the the other hand the Borel $\sigma$-algebra derived from the
+On the other hand the Borel $\sigma$-algebra derived from the
topology that defines the real line filters out all of the
non-constructive subsets and their undesired behaviors while keeping
all of the interval subsets and the subsets that we can derive from them.
@@ -973,7 +973,7 @@ figures/interval_partitions/interval_partitions){
width=90% #fig-equal-length-intervals}

The easiest way to accomplish this uniformity is to allocate to each
-interval a measure directly equal to the its length,
+interval a measure directly equal to its length,
$$
\lambda( \, [x_{1}, x_{2}] \, )
=
L( \, [x_{1}, x_{2}] \, )
diff --git a/5_expectation_values/expectation_values.qmd b/5_expectation_values/expectation_values.qmd
index de1d65f..3375244 100644
--- a/5_expectation_values/expectation_values.qmd
+++ b/5_expectation_values/expectation_values.qmd
@@ -42,7 +42,7 @@ from calculus. This measure-informed integration operation summarizes
the interaction between a measure and a given function, allowing us
to use one to learn about the other.

-We will being our exploration of measure-informed integration with a
+We will begin our exploration of measure-informed integration with a
heuristic construction on finite measure spaces before considering
a more formal, but also more abstract, construction that applies
to any measure space. Next we'll investigate how the specification of
@@ -56,7 +56,7 @@ exceptional measures whose integrals can be computed algorithmically.
# Integration on Finite Measure Spaces {#sec:finite_integration}

To start our discussion of measure-informed integration as simply as
-possible let's begin by considering a finite measure space compromised
+possible let's begin by considering a finite measure space comprised
of the finite set
$$
X = \{ \Box, \clubsuit, \diamondsuit, \heartsuit, \spadesuit \},
$$
@@ -409,9 +409,9 @@ corresponding simple function decomposition,
In general a non-negative, measurable function can be represented by
more than one simple function decomposition. Fortunately the
measure-informed integral derived from any of them will always be the
-same. Consequently there's no worry for ambiguous of otherwise
+same. Consequently there's no worry about ambiguous or otherwise
inconsistent answers, and measure-informed integrals for non-negative,
-measurable function are completely well-behaved.
+measurable functions are completely well-behaved.

This procedure for defining measure-informed integrals through simple
function representations is known as **Lebesgue integration** in the
@@ -722,7 +722,7 @@ this form.
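To make this relationship concrete, here is a minimal Python sketch showing that integrating an indicator function recovers the measure allocated to the corresponding subset; the suit set follows the finite example above, while the particular atomic allocations are hypothetical.

```python
# A finite measure space: the suit set from above, with hypothetical
# atomic allocations mu({x}) chosen purely for illustration.
mu = {"box": 0.5, "club": 1.0, "diamond": 2.0, "heart": 0.25, "spade": 1.25}

def integrate(f, mu):
    """Measure-informed integral on a finite space,
    I_mu[f] = sum_x f(x) * mu({x})."""
    return sum(f(x) * m for x, m in mu.items())

# The indicator function of the subset {club, spade}.
subset = {"club", "spade"}
indicator = lambda x: 1.0 if x in subset else 0.0

# Integrating the indicator function recovers the measure of the subset.
print(integrate(indicator, mu))  # 2.25
print(mu["club"] + mu["spade"])  # 2.25
```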
Because $L(X, \mathcal{X}, \mu)$ contains all of the indicator
functions this functional relationship between $L(X, \mathcal{X}, \mu)$
and $\mathbb{R}$ determines the allocations to every measurable subset,
-and hence full determines the measure $\mu$. At the same time
+and hence fully determines the measure $\mu$. At the same time
$L(X, \mathcal{X}, \mu)$ also contains many integrands that are not
indicator functions, and hence quite a bit of redundant information
about $\mu$.
@@ -875,7 +875,7 @@ which is not, in general, equal to $1$. In other words scaling a
probability distribution results not in another probability
distribution but rather a generic measure.

-If we want transform one probability distribution into another then we
+If we want to transform one probability distribution into another then we
need to correct for the modified normalization, defining
\begin{align*}
\mathbb{E}_{g \ast \pi} [ f ]
@@ -1039,8 +1039,8 @@ important in practice.

### The Mean

-If an embedding function is an integrand than we can evaluate its
-measure-informed integral, $\mathbb{I}_{\mu}[\iota]$. The ultimately
+If an embedding function is an integrand then we can evaluate its
+measure-informed integral, $\mathbb{I}_{\mu}[\iota]$. The ultimate
utility of this measure-informed integral, however, depends on what
information about the ambient measure it extracts.
@@ -1272,7 +1272,7 @@ working with spaces like circles, spheres, tori, and more. Many
analyses on these spaces have been undermined by attempts to summarize
measures with moments that don't actually exist!

-All of this said we still to take care with the necessary conditions
+All of this said we still have to take care with the necessary conditions
when working with more familiar spaces as well. For example in
[Section 5.2.2](@sec:practical_lebesgue) we'll learn that the identity
function from a real line into itself is not integrable with respect to
@@ -1354,7 +1354,7 @@ skewed towards smaller or larger values.

![](figures/histograms/varying_behaviors/multimodal/multimodal){#fig-hist-multimodal}

-Histogram are extremely effective at communicating the basic features of
+Histograms are extremely effective at communicating the basic features of
a measure. The measure in (a) is diffuse but decaying, allocating more
measure at smaller points than larger points. Conversely the measure
in (b) concentrates around a single point while the measure in (c)
@@ -1409,7 +1409,7 @@ M :\; & X & &\rightarrow& \; &[0, \mu(X)]&
\\
& x & &\mapsto& & M_{\mu}(x) = \mu(\mathsf{I}_{x}) = \mathbb{I}_{\mu}[I_{\mathsf{I}_{x}}] &.
\end{alignat*}
-According this mapping is known as a
+This mapping is known as a
**cumulative distribution function** (@fig-cdf-basics).

![A cumulative distribution function quantifies how measure is allocated
@@ -1451,7 +1451,7 @@ are any gaps in the allocation, intermediate intervals with zero
allocated measure, then the cumulative distribution function will
flatten out completely (@fig-cdf-gap).

-::: {#fig-hist-examples layout="[-5, 30, 30, 30, -5]"}
+::: {#fig-cdf-examples layout="[-5, 30, 30, 30, -5]"}
![](figures/cdfs/cdf_behaviors/unimodal/unimodal){#fig-cdf-unimodal}

![](figures/cdfs/cdf_behaviors/narrow_unimodal/narrow_unimodal){#fig-cdf-narrow-unimodal}
@@ -1461,7 +1461,7 @@ flatten out completely (@fig-cdf-gap).

A careful survey of a cumulative distribution function can communicate
a wealth of information about the ambient measure.
(a) Here the ambient measure is unimodal with the cumulative distribution function -appreciably increasingly only one we reach the central neighborhood +appreciably increasing only once we reach the central neighborhood where the measure allocation is concentrated. (b) A narrower concentration results in a steeper cumulative distribution function. (c) A cumulative distribution function flattens if there are any gaps @@ -1607,7 +1607,7 @@ accumulated measure below $m$, $$ x_{m-} = \underset{x \in X}{\mathrm{argmax}} M(x) < m, $$ -and bounded above by the point $x_{+}$ that achieves the smallest +and bounded above by the point $x_{m+}$ that achieves the smallest accumulated measure above $m$ (@fig-quantile-inverse-problems), $$ x_{m+} = \underset{x \in X}{\mathrm{argmin}} M(x) > m. @@ -1715,7 +1715,7 @@ $$ $$ The integral of any real-valued function $f: X \rightarrow \mathbb{R}$ -with respect to counting measure is given by over summing all of the +with respect to counting measure is given by summing over all of the output values, \begin{align*} \mathbb{I}_{\chi}[f] @@ -1934,7 +1934,7 @@ When a real-valued function has a well-defined Riemann integral then we can apply the tools of calculus to evaluate Lebesgue integrals. The exceptional Riemann integrals that can be evaluated analytically allow us to compute the corresponding Lebesgue integrals exactly. More -generally we can use to numerical integration techniques to approximate +generally we can use numerical integration techniques to approximate the Riemann integrals, and hence approximately evaluate Lebesgue integrals. @@ -1959,7 +1959,7 @@ the sign of the Riemann integral. In order to properly relate Lebesgue integrals to Riemann integrals we have to fix the _orientation_ of the intervals. -Similarly the mean of a Lebesgue measure would by given by the integral +Similarly the mean of a Lebesgue measure would be given by the integral of the identity function, \begin{align*} \mathbb{I}_{\lambda}[\iota] diff --git a/6_density_functions/density_functions.qmd b/6_density_functions/density_functions.qmd index 25cda19..660fa2a 100644 --- a/6_density_functions/density_functions.qmd +++ b/6_density_functions/density_functions.qmd @@ -322,7 +322,7 @@ in a crude measurable partition might also be infinite. If we break up those subsets into finer and finer pieces, however, then the infinite allocations might spread out into finite allocations. When we can construct a fine enough measurable partition such that _all_ of the -subset allocations are finite we will always able to avoid infinity +subset allocations are finite we will always be able to avoid infinity entirely by working with small enough subsets. Moreover if that fine enough measurable partition is also countable then we will always be able to aggregate those smaller-subset allocations into any general @@ -404,7 +404,7 @@ behaviors. For example on a discrete space every measure with non-zero atomic allocations is absolutely continuous with respect to every other -measure with non-zero atomic allocations. Moreover Lebesgue meaures +measure with non-zero atomic allocations. Moreover Lebesgue measures defined with respect to different metrics are always absolutely continuous with each other. @@ -683,7 +683,7 @@ Admittedly I'm being a bit mathematically sloppy here because Radon-Nikodym derivatives are defined only up to $\nu$-null subsets; technically this mapping doesn't yield a single function but rather a collection of all functions that are equal $\nu$-almost everywhere. 
-In order to achieve a unique outptu function we need to introduce
+In order to achieve a unique output function we need to introduce
additional constraints, such as continuity or even smoothness.
This sloppy notation, however, does allow us to investigate many of
the useful properties of the operation.
@@ -861,7 +861,7 @@ their limitations. In particular probability density functions are
defined only relative to the given reference measure. If the reference
measure is at all ambiguous then a density function will not completely
determine a probability distribution! At the same time if the reference
-measure every changes then probability density functions will also have
+measure ever changes then the probability density functions will also have
to change if we want them to represent the same probability
distributions.
@@ -951,7 +951,7 @@ expectation value to an integral informed by the counting measure gives
&=
\mathbb{I}_{\chi} [ \pi \cdot f ],
\end{align*}

-where $\pi$ in the last term denotes a function maps each element of
+where $\pi$ in the last term denotes a function that maps each element of
$X$ to its atomic allocation,
\begin{alignat*}{6}
\pi :\; & X & &\rightarrow& \; & [0, \infty] &
@@ -1097,7 +1097,7 @@ $$
On the other hand we can evaluate the cumulative distribution function
at each boundary and subtract,
$$
-\mathrm{Poisson}( \, (n_{1}, n_{2}] \, ; \lambda) f
+\mathrm{Poisson}( \, (n_{1}, n_{2}] \, ; \lambda)
=
\Pi_{\mathrm{Poisson}}(n_{2}) - \Pi_{\mathrm{Poisson}}(n_{1}).
$$
@@ -1141,7 +1141,7 @@ accommodated but the atomic subsets are the most practically relevant.

Given a particular $D$-dimensional real space, that is a particular
rigid real space or particular parameterization of a flexible real
-space, and a compatible a compatible probability distribution $\pi$ we
+space, and a compatible probability distribution $\pi$ we
can define a Lebesgue probability density function
$$
\frac{ \mathrm{d} \pi \hphantom{ {}^{D} } }{ \mathrm{d} \lambda^{D} } :
@@ -1359,7 +1359,7 @@ $$

Integrating a probability density function by eye is not always
straightforward. In particular bounds between probability densities do
-not always imply bounds between interval probabilities. (a) Here
+not always imply bounds between interval probabilities. (a) Here the
largest probability density in the first interval, $p_{1}$, is
smaller than the smallest probability density in the second interval, $p_{2}$.
Because the two intervals are the same length the probability allocated
@@ -1685,7 +1685,7 @@ $$
$$
that define expectation values are particularly nice, at least as far
as integrals go. That isn't to say that the integrals are easy to evaluate
-but rather that they many of them actually admit closed-form solutions,
+but rather that many of them actually admit closed-form solutions,
which is pretty miraculous when it comes to integrals. For those
twisted individuals who fancy a good integral calculation, myself
included, I've included those calculations in the
@@ -1787,7 +1787,7 @@ probability distributions,
\end{align*}
where
$$
-\mathrm{erf} (x)
+\mathrm{erf} (x)
=
\frac{2}{\sqrt{\pi}} \int_{0}^{ x } \mathrm{d} t \, \exp \left( -t^{2} \right)
$$
is known as the **error function**.

Conveniently the error function, if not
the normal cumulative distribution functions themselves, is available
in most programming
-languages. This allows us directly compute interval probabilities by
+languages. This allows us to directly compute interval probabilities by
subtracting cumulative probabilities (@fig-normal-interval-prob),
$$
\text{normal}( \, (x_{1}, x_{2} ] \, ; \mu, \sigma )

![A normal probability density function
width=90% #fig-normal-interval-prob}
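As a quick demonstration, here is a minimal Python sketch that computes a normal interval probability by subtracting cumulative probabilities, with the cumulative distribution function built from the standard-library error function; the interval endpoints and parameters are hypothetical choices for illustration.

```python
import math

def normal_cdf(x, mu, sigma):
    """Normal cumulative distribution function via the error function:
    Phi(x; mu, sigma) = 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def normal_interval_prob(x1, x2, mu, sigma):
    """Probability allocated to the interval (x1, x2]."""
    return normal_cdf(x2, mu, sigma) - normal_cdf(x1, mu, sigma)

# Hypothetical configuration: mu = 0, sigma = 1, interval (-1, 1].
print(normal_interval_prob(-1.0, 1.0, 0.0, 1.0))  # ~0.6827
```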
@@ -1816,7 +1816,7 @@
# Other Useful Probability Density Functions

-Most of applications of probability theory that we will tackle in this
+Most applications of probability theory that we will tackle in this
book will use probability distributions that are absolutely continuous
with respect to a counting measure or a Lebesgue measure and implemented
with appropriate probability density functions. There are a few
@@ -1878,7 +1878,7 @@ As they become narrower and narrower normal probability density
functions start to behave like a hypothetical singular density function.
(a) In the limit $\sigma \rightarrow 0$ the normal probability density
functions centered at $\mu = x'$ converge to an infinitely narrow spike
-at $x'$. (b) At the same the expectation values of all expectands $f$
+at $x'$. (b) At the same time the expectation values of all expectands $f$
converge to the point evaluations $f(x')$.
:::
@@ -2041,12 +2041,12 @@ is often worth the added subtlety.

Real spaces adequately model many phenomena that arise in practical
applications, but by no means all of them. In some cases we will need
to consider continuous spaces that look like real spaces _locally_ but
-exhibit different shapes _globally_ (@fig-circle). These include for
+exhibit different shapes _globally_ (@fig-circle-line). These include for
example spheres, tori, and even more foreign spaces. Mathematically
these spaces, along with real spaces, are collectively known as
**manifolds**.

-::: {#fig-circle layout="[ [-20, 60, -20], [-20, 60, -20] ]"}
+::: {#fig-circle-line layout="[ [-20, 60, -20], [-20, 60, -20] ]"}
![](figures/circle_vs_line/far/far){#fig-circle-far}

![](figures/circle_vs_line/close/close){#fig-circle-close}
@@ -2112,6 +2112,10 @@ are not implemented with classic Riemann integration but rather a more
general **manifold integration** that is not implemented in the same
way.

+
+
That said sometimes there are workarounds. For example removing a point
$x' \in \mathbb{S}^{1}$ from the circle defines a new space
$\mathbb{S}^{1} \setminus x'$. Circular probability distributions,