\documentclass{article}
\title{Measure and Probability Theory}
\input{common.tex}
\newcommand{\carath}{Carath\'{e}odory}
\begin{document}
\maketitle
\section{About}
This document is part of a series of mathematical notes available at \url{https://gwthomas.github.io/math4ml}.
You are free to distribute it as you wish.
Please report any mistakes to \url{[email protected]}.
Measure theory is concerned with the problem of assigning a mathematically consistent notion of size to sets.
We care about measure theory because of its use in the modern, rigorous formulation of probability given by Kolmogorov.
\section{Collections of sets}
We would like to assign measures to various subsets of $\R^n$ characterizing their size.
Ideally our measure $\mu$ would satisfy
\begin{enumerate}[(i)]
\item For any countable collection of disjoint sets $E_1, E_2, \dots \subseteq \R^n$,
\[\mu\bigg(\bigcup_i E_i\bigg) = \sum_i \mu(E_i)\]
\item If two sets $E, F \subseteq \R^n$ are such that $E$ can be transformed into $F$ by rigid transformations, then $\mu(E) = \mu(F)$.
\item The measure of the unit cube is 1.
\end{enumerate}
The first property, called \term{countable additivity}, just means that if you partition a set into countably many parts, the sum of the parts' measures equals the original set's measure.
The requirement that additivity hold for countable collections (as opposed to just finite collections) is important for proving limit theorems.
Unfortunately, one can show that these three properties are incompatible if we allow arbitrary subsets of $\R^n$.
The solution in measure theory is to restrict ourselves to some ``reasonable'' collection of subsets.
Recall that the \term{powerset} of a set $\Omega$ is the set of all subsets of $\Omega$, i.e.
\[\calP(\Omega) = \{S : S \subseteq \Omega\}\]
Note that in particular $\varnothing, \Omega \in \calP(\Omega)$ for any set $\Omega$.
In the remainder we will consider collections of subsets of $\Omega$; in other words, these collections are subsets of $\calP(\Omega)$.
We will make certain requirements of these collections so that we have some structure to work with.
In particular, we choose the collections so that the properties above hold, not for arbitrary subsets of $\Omega$, but for any sets in the collection.
\subsection{Algebras and $\sigma$-algebras}
Let $\Omega$ be a non-empty set.
Then $\calA \subseteq \calP(\Omega)$ is an \term{algebra} on $\Omega$ if
\begin{enumerate}[(i)]
\item $\calA$ is non-empty.
\item If $E \in \calA$, then $E\comp = \Omega \setminus E \in \calA$.
\item If $E_1, \dots, E_n \in \calA$, then $\bigcup_{i=1}^n E_i \in \calA$.
\end{enumerate}
The second property states that $\calA$ is \term{closed under complements}.
Using de Morgan's laws, properties (ii) and (iii) together imply that $\calA$ is closed under finite intersections as well, since
\[\bigcap_{i=1}^n E_i = \bigg(\bigcup_{i=1}^n E_i\comp\bigg)\comp\]
Then we must have $\varnothing \in \calA$; since $\calA$ is non-empty there exists some $E \in \calA$, so $E\comp \in \calA$, and hence $\varnothing = E \cap E\comp \in \calA$.
In light of the desirability of countable additivity, we would like the collection of subsets we consider to be closed under unions of countably many sets, not just finitely many.
Thus we need to strengthen condition (iii), and we arrive at the following definition: a \term{$\sigma$-algebra} is an algebra that is closed under countable unions.
It follows by the same reasoning as above that a $\sigma$-algebra is also closed under countable intersections.
Note that $\{\varnothing, \Omega\}$ and $\calP(\Omega)$ are $\sigma$-algebras for any $\Omega$, and moreover they are respectively the smallest and largest possible $\sigma$-algebras.
If $\calC \subseteq \calP(\Omega)$ is any collection of subsets of $\Omega$, there exists a unique smallest $\sigma$-algebra containing $\calC$; this is called the \term{$\sigma$-algebra generated by $\calC$} and denoted $\sigma(\calC)$.
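For a small concrete example, take $\Omega = \{1,2,3\}$ and $\calC = \{\{1\}\}$.
Any $\sigma$-algebra containing $\{1\}$ must also contain its complement $\{2,3\}$, as well as $\varnothing$ and $\Omega$, and the resulting collection
\[\sigma(\calC) = \{\varnothing, \{1\}, \{2,3\}, \Omega\}\]
is already closed under complements and countable unions, so nothing more is needed.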
%The $\sigma$-algebra generated by the set of all open sets in $\Omega$ is called the \term{Borel $\sigma$-algebra on $\Omega$} and denoted $\calB(\Omega)$.
%Its members, called the \term{Borel sets} of $\Omega$, are all the open sets, closed sets, and countable unions and intersections of these.
\section{Measures}
Let $\Omega$ be a non-empty set and $\calM \subseteq \calP(\Omega)$ a $\sigma$-algebra.
The pair $(\Omega, \calM)$ is called a \term{measurable space}, and the elements of $\calM$ are its \term{measurable sets}.
A \term{measure} on $(\Omega, \calM)$ is a function $\mu : \calM \to [0,\infty]$ such that
\begin{enumerate}[(i)]
\item $\mu(\varnothing) = 0$
\item For any countable collection of disjoint sets $\{E_i\} \subseteq \calM$,
\[\mu\bigg(\bigdotcup_i E_i\bigg) = \sum_i \mu(E_i)\]
\end{enumerate}
The triple $(\Omega, \calM, \mu)$ is called a \term{measure space}.
The simplest nontrivial example of a measure is the \term{counting measure}, given by
\[E \mapsto \begin{cases}
|E| & \text{$E$ finite} \\
\infty & \text{otherwise}
\end{cases}\]
We say that $\mu$ is \term{finite} if $\mu(\Omega) < \infty$, and it is \term{$\sigma$-finite} if $\Omega$ can be written as the union of countably many measurable sets of finite measure.
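For example, the counting measure on the natural numbers $\{1, 2, 3, \dots\}$ is $\sigma$-finite but not finite: the whole space has infinite measure, yet it is the countable union of the singletons $\{n\}$, each of which has measure 1.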
A set $E \in \calM$ such that $\mu(E) = 0$ is called a \term{$\mu$-null set}, or usually just a \term{null set} if the measure is clear from context.
A property is said to hold $\mu$-\term{almost everywhere} (often abbreviated a.e.) if the set of points for which it does not hold is $\mu$-null (again, one would usually just write \term{almost everywhere} unless there was ambiguity).\footnote{
In order for this notion to be interesting we needed to first introduce a measure with nontrivial null sets, so we wait to give an example in the Lebesgue measure section.
}
%If every subset of every $\mu$-null set is measurable (i.e. also an element of $\calM$), then we say that $\mu$ is \term{complete}.
We now give some basic properties of measures.
\begin{proposition}
If $E, F \in \calM$ and $E \subseteq F$, then $\mu(E) \leq \mu(F)$.
\end{proposition}
This property is called \term{monotonicity}.
\begin{proof}
Suppose $E, F \in \calM$ and $E \subseteq F$. Then
\[\mu(F) = \mu(E \dotcup (F \setminus E)) = \mu(E) + \mu(F \setminus E) \geq \mu(E)\]
as claimed.
\end{proof}
Note that monotonicity implies
\[0 = \mu(\varnothing) \leq \mu(E) \leq \mu(\Omega)\]
for every $E \in \calM$, since $\varnothing \subseteq E \subseteq \Omega$.
\begin{proposition}
For any countable collection of sets $\{E_i\} \subseteq \calM$ (disjoint or not),
\[\mu\bigg(\bigcup_i E_i\bigg) \leq \sum_i \mu(E_i)\]
\end{proposition}
This property is called \term{sub-additivity}.
\begin{proof}
Define $F_1 = E_1$ and $F_i = E_i \setminus (\bigcup_{j < i} E_j)$ for $i > 1$, noting that $\bigcup_{j \leq i} F_j = \bigcup_{j \leq i} E_j$ for all $i$ and the $F_i$ are disjoint.
Then
\[\mu\bigg(\bigcup_i E_i\bigg) = \mu\bigg(\bigcup_i F_i\bigg) = \sum_i \mu(F_i) \leq \sum_i \mu(E_i)\]
where the last inequality follows by monotonicity since $F_i \subseteq E_i$ for all $i$.
\end{proof}
\section{Lebesgue measure}
\term{Lebesgue measure} is the measure that corresponds to our intuitive notion of physical size.
For example, the Lebesgue measure of a measurable subset of $\R$ gives a number interpretable as the set's length.
Lebesgue measure can also be defined in higher dimensions (we omit this generalization), where it represents the area, volume, etc. of the set.
The strategy for constructing Lebesgue measure is to define it first on intervals, which have an obvious measure (their length), and then extend that definition to more complicated sets.
Let $\calI$ be the set of all intervals (open, closed, or semi-open) on $\R$.
Define $\ell : \calI \to [0,\infty]$ by
\[\ell([a,b]) = b - a\]
with the same definition when $[a,b]$ is replaced by $(a,b)$, $[a,b)$, or $(a,b]$.
For infinite intervals, use the ``obvious'' convention that $\infty-a = \infty$ and $b-(-\infty) = \infty$.
The key tool in constructing Lebesgue measure is the \term{Lebesgue outer measure} $\lambda^* : \calP(\R) \to [0,\infty]$, which is given by
\[\lambda^*(E) = \inf\left\{\sum_{k=1}^\infty \ell(I_k) : I_k \in \calI, E \subseteq \bigcup_{k=1}^\infty I_k\right\}\]
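As a quick illustration of this definition, consider $E = \Q \cap [0,1]$.
Enumerating its elements as $q_1, q_2, \dots$ and covering each $q_k$ by an interval $I_k$ of length $\epsilon/2^k$ gives
\[\lambda^*(E) \leq \sum_{k=1}^\infty \ell(I_k) = \sum_{k=1}^\infty \frac{\epsilon}{2^k} = \epsilon\]
for every $\epsilon > 0$, so $\lambda^*(E) = 0$ even though $E$ is dense in $[0,1]$.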
A set $E \subseteq \R$ is said to be \term{Lebesgue measurable} if for every $A \subseteq \R$,
\[\lambda^*(A) = \lambda^*(A \cap E) + \lambda^*(A \cap E\comp)\]
This condition is known as the \carath{} criterion.
It turns out that the set of Lebesgue measurable sets is very large and contains pretty much any reasonable set that one would encounter in practice.
However, it is possible\footnote{
assuming the axiom of choice
} to construct pathological subsets of $\R$ that are not Lebesgue measurable.
The Lebesgue outer measure and Lebesgue measurable sets have a number of nice properties:
\begin{enumerate}[(i)]
\item The set of Lebesgue measurable sets, denoted $\calL$, is a $\sigma$-algebra.
\item $\lambda^*|_\calL$ is a measure on $\calL$.
\item $\lambda^*|_\calI = \ell$, so the measure agrees with our initial notion of interval length.
\end{enumerate}
Defining the Lebesgue measure $\lambda = \lambda^*|_\calL$, we have a measure space $(\R, \calL, \lambda)$.\footnote{
One can show that $\lambda$ is the unique measure on $(\R, \calL)$ that extends $\ell$.
It turns out that uniqueness stems from the fact that $\ell$ is $\sigma$-finite.
}
\subsection{Sets of measure zero}
Consider the following intriguing property of Lebesgue measure.
\begin{proposition}
If $E \subseteq \R$ is countable, then $\lambda(E) = 0$.
\end{proposition}
\begin{proof}
First note that for any $x \in \R$, we have
\[\lambda(\{x\}) = \lambda([x,x]) = x - x = 0\]
Now suppose $E \subseteq \R$ is countable.
Then we can write $E = \bigdotcup_i \{x_i\}$, whence it follows that
\[\lambda(E) = \sum_i \lambda(\{x_i\}) = \sum_i 0 = 0\]
by the countable additivity of measures.
\end{proof}
In particular, it may be surprising that $\lambda(\Q) = 0$.
It turns out that it's also possible to construct uncountable subsets of $\R$ that have Lebesgue measure zero, e.g. the Cantor set \cite{folland}.
Now for the promised example of \textit{almost everywhere}: the absolute value function $x \mapsto |x|$ is differentiable almost everywhere, since it fails to be differentiable only at $x = 0$, and $\lambda(\{0\}) = 0$.
Note that Lebesgue measure is in some sense the ``default'' measure on $\R$, in that if no measure is specified (as in the previous sentence), the author is generally speaking in reference to Lebesgue measure.
\section{Lebesgue integration}
In this section we consider the problem of defining the integral of functions on an abstract measure space $(\Omega, \calM, \mu)$.
Just as not all sets are measurable, not all functions are measurable.
A function $f : \Omega \to \R$ is \term{measurable} if
\[\{\omega \in \Omega : f(\omega) \leq x\} \in \calM \tab \forall x \in \R\]
We follow the common approach of defining the Lebesgue integral for increasingly complicated functions in terms of integrals of simpler functions.
The simplest functions to integrate are the \term{indicator functions}; if $E \in \calM$, then its indicator function is
\[1_E(\omega) = \begin{cases}
1 & \omega \in E \\
0 & \omega \not\in E
\end{cases}\]
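Note that $1_E$ is measurable precisely because $E \in \calM$: the set $\{\omega \in \Omega : 1_E(\omega) \leq x\}$ equals $\varnothing$ if $x < 0$, $E\comp$ if $0 \leq x < 1$, and $\Omega$ if $x \geq 1$, all of which belong to $\calM$.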
The integral of an indicator function is defined as
\[\int_\Omega 1_E\dd{\mu} = \mu(E)\]
From indicator functions we can construct \term{non-negative simple functions}, which are finite linear combinations of indicator functions:
\[\phi = \sum_{i=1}^n \alpha_i1_{E_i}\]
where $\alpha_i \geq 0$ for all $i$.
Here the integral is defined to be
\[\int_\Omega \phi\dd{\mu} = \sum_{i=1}^n \alpha_i \int_\Omega 1_{E_i}\dd{\mu} = \sum_{i=1}^n \alpha_i \mu(E_i)\]
and we use the convention that $0 \cdot \infty = 0$.
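For instance, taking $(\Omega, \calM, \mu) = (\R, \calL, \lambda)$ and $\phi = 2 \cdot 1_{[0,1]} + 5 \cdot 1_{[3,6]}$, we get
\[\int_\R \phi\dd{\lambda} = 2\lambda([0,1]) + 5\lambda([3,6]) = 2 \cdot 1 + 5 \cdot 3 = 17\]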
Then we can define the integral of an arbitrary non-negative measurable function $f$ as follows:
\[\int_\Omega f\dd{\mu} = \sup\left\{\int_\Omega \phi\dd{\mu} : 0 \leq \phi \leq f, \text{$\phi$ simple}\right\}\]
Finally, we can extend the definition to arbitrary measurable functions by using the decomposition
\[f = f^+ - f^-\]
where
\begin{align*}
f^+(x) &= \max(f(x), 0) \\
f^-(x) &= \max(-f(x), 0)
\end{align*}
If at least one of $\int_\Omega f^+\dd{\mu}$ and $\int_\Omega f^-\dd{\mu}$ is finite, we define
\[\int_\Omega f\dd{\mu} = \int_\Omega f^+\dd{\mu} - \int_\Omega f^-\dd{\mu}\]
Furthermore if $\int_\Omega |f|\dd{\mu} < \infty$, we say that $f$ is \term{Lebesgue integrable}.
Note that this is a slightly stronger condition than what is required for the previous definition; clearly $|f| = f^+ + f^-$, so $f$ is Lebesgue integrable iff the integrals of both $f^+$ and $f^-$ are finite.\footnote{
Why is the integral defined even for some functions that are not ``integrable''?
I'm not sure, and would love to know if anyone has more info.
But all the sources I've consulted agree on these definitions.
}
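To make the decomposition concrete, let $\mu$ be the counting measure on $\Omega = \{1,2,3\}$ (with $\calM = \calP(\Omega)$) and let $f(1) = 2$, $f(2) = -1$, $f(3) = 3$.
Then $f^+ = 2 \cdot 1_{\{1\}} + 3 \cdot 1_{\{3\}}$ and $f^- = 1_{\{2\}}$, so
\[\int_\Omega f\dd{\mu} = \int_\Omega f^+\dd{\mu} - \int_\Omega f^-\dd{\mu} = 5 - 1 = 4\]
and $\int_\Omega |f|\dd{\mu} = 5 + 1 = 6 < \infty$, so $f$ is Lebesgue integrable.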
\subsection{The Lebesgue integral on $\R$}
We now consider the special case where $\Omega = \R$ and $\mu = \lambda$.
In addition to being a very important special case of the general theory above, this scenario has a geometric interpretation that helps us better understand Lebesgue integration.
\subsection{Comparison with the Riemann integral}
In a nutshell, the Lebesgue integral is in many ways superior to the Riemann integral.
First, any function that is Riemann integrable on a bounded interval is also Lebesgue integrable, and the values of the integrals agree.
But there exist functions that are Lebesgue integrable but not Riemann integrable.
For example, consider the rational indicator $1_\Q$ on $[0,1]$.
We know that for the Lebesgue integral,
\[\int_{[0,1]} 1_\Q\dd{\lambda} = \lambda(\Q \cap [0,1]) = 0\]
However it is easy to check that $1_\Q$ is not Riemann integrable: every non-trivial interval will contain at least one rational number and at least one irrational number, so no matter how the partition is chosen, the lower Darboux sum will be zero and the upper Darboux sum will be one.
But this is a rather contrived example.
Of more practical importance is the existence of stronger convergence theorems, such as the monotone convergence theorem and dominated convergence theorem.
Another advantage of the Lebesgue integral, which admittedly is less important for our purposes, is that integration can be defined on spaces other than Euclidean space.
The Riemann integral relies heavily on properties of the real line.
\section{Probability}
Suppose we have some sort of randomized experiment (e.g. a coin toss, die roll) that has a fixed set of possible outcomes.
This set is called the \term{sample space} and denoted $\Omega$.
We would like to define probabilities for some \term{events}, which are subsets of $\Omega$.
The set of events is denoted $\calF$ and is required to be a $\sigma$-algebra.
Then we can define a \term{probability measure} $\pm : \calF \to [0,1]$ which must satisfy $\pr{\Omega} = 1$ in addition to the axioms for general measures.
The triple $(\Omega, \calF, \pm)$ is called a \term{probability space}.\footnote{
Note that a probability space is simply a measure space in which the measure of the whole space equals 1.
}
If $\pr{A} = 1$, we say that $A$ occurs \term{almost surely} (often abbreviated a.s.).\footnote{
This is a probabilist's version of the measure-theoretic term \textit{almost everywhere}.
}
Conversely if $\pr{A} = 0$, we say that $A$ occurs \term{almost never}.
From these axioms, a number of useful rules can be derived.
\begin{proposition}
If $A$ is an event, then $\pr{A\comp} = 1 - \pr{A}$.
\end{proposition}
\begin{proof}
Using the countable additivity of $\pm$, we have
\[\pr{A} + \pr{A\comp} = \pr{A \dotcup A\comp} = \pr{\Omega} = 1\]
which proves the result.
\end{proof}
\begin{proposition}
Let $A$ be an event. Then
\begin{enumerate}[(i)]
\item If $B$ is an event and $B \subseteq A$, then $\pr{B} \leq \pr{A}$.
\item $0 = \pr{\varnothing} \leq \pr{A} \leq \pr{\Omega} = 1$
\end{enumerate}
\end{proposition}
\begin{proof}
(i) follows immediately from the monotonicity of measures.
For (ii): the middle inequality follows from (i) since $\varnothing \subseteq A \subseteq \Omega$.
We also have $\pr{\varnothing} = 0$ by applying the previous proposition with $A = \Omega$.
\end{proof}
\begin{proposition}
If $A$ and $B$ are events, then $\pr{A \cup B} = \pr{A} + \pr{B} - \pr{A \cap B}$.
\end{proposition}
\begin{proof}
The key is to break the events up into their various overlapping and non-overlapping parts.
\begin{align*}
\pr{A \cup B} &= \pr{(A \cap B) \dotcup (A \setminus B) \dotcup (B \setminus A)} \\
&= \pr{A \cap B} + \pr{A \setminus B} + \pr{B \setminus A} \\
&= \pr{A \cap B} + \pr{A} - \pr{A \cap B} + \pr{B} - \pr{A \cap B} \\
&= \pr{A} + \pr{B} - \pr{A \cap B}
\end{align*}
\end{proof}
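For a quick sanity check, roll a fair die and let $A$ be the event that the result is even and $B$ the event that the result is at least 4.
Then $\pr{A} = \pr{B} = \nicefrac{1}{2}$ and $\pr{A \cap B} = \nicefrac{1}{3}$, so
\[\pr{A \cup B} = \frac{1}{2} + \frac{1}{2} - \frac{1}{3} = \frac{2}{3}\]
which agrees with counting the outcomes in $A \cup B = \{2, 4, 5, 6\}$ directly.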
\begin{proposition}
If $\{A_i\} \subseteq \calF$ is a countable set of events, disjoint or not, then
\[\prbigg{\bigcup_i A_i} \leq \sum_i \pr{A_i}\]
\end{proposition}
This inequality is sometimes referred to as \term{Boole's inequality} or the \term{union bound}.
\begin{proof}
Follows immediately from the sub-additivity of measures.
\end{proof}
\subsection{Random variables}
Intuitively, a \term{random variable} is some uncertain quantity with an associated probability distribution over the values it can assume.
Formally, a random variable on a probability space $(\Omega, \calF, \pm)$ is a measurable function $X: \Omega \to \R$.\footnote{
More generally, the codomain can be any measurable space, but $\R$ is the most common case by far and sufficient for our purposes.
}
We denote the range of $X$ by $X(\Omega) = \{X(\omega) : \omega \in \Omega\}$.
To give a concrete example (taken from \cite{pitman}), suppose $X$ is the number of heads in two tosses of a fair coin.
The sample space is
\[\Omega = \{hh, tt, ht, th\}\]
and $X$ is determined completely by the outcome $\omega$, i.e. $X = X(\omega)$.
For example, the event $X = 1$ is the set of outcomes $\{ht, th\}$.
It is common to talk about the values of a random variable without directly referencing its sample space.
The two are related by the following definition: the event that the value of $X$ lies in some set $S \subseteq \R$ is
\[X \in S = X\inv(S) = \{\omega \in \Omega : X(\omega) \in S\}\]
Here the $X\inv$ notation means the preimage of $S$ under $X$, not the inverse of $X$.
Note that special cases of this definition include $X$ being equal to, less than, or greater than some specified value.
For example
\[\pr{X = x} = \pr{X\inv(\{x\})} = \pr{\{\omega \in \Omega : X(\omega) = x\}}\]
\subsubsection{The cumulative distribution function}
The \term{cumulative distribution function} (c.d.f.) gives the probability that a random variable is at most a certain value:
\[F(x) = \pr{X \leq x}\]
The c.d.f. can be used to give the probability that a variable lies within a certain range:
\[\pr{a < X \leq b} = F(b) - F(a)\]
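To see why, note that for $a < b$ the event $X \leq b$ decomposes as
\[\{X \leq b\} = \{X \leq a\} \dotcup \{a < X \leq b\}\]
so additivity gives $F(b) = F(a) + \pr{a < X \leq b}$, and rearranging yields the identity.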
\subsubsection{Discrete random variables}
A \term{discrete random variable} is a random variable that has a countable range and assumes each value in this range with positive probability.
Discrete random variables are completely specified by their \term{probability mass function} (p.m.f.) $p : X(\Omega) \to [0,1]$ which satisfies
\[\sum_x p(x) = 1\]
For a discrete $X$, the probability of a particular value is given exactly by its p.m.f.:
\[\pr{X = x} = p(x)\]
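Returning to the two-coin-toss example above, the p.m.f. of $X$ is given by $p(0) = \nicefrac{1}{4}$, $p(1) = \nicefrac{1}{2}$, and $p(2) = \nicefrac{1}{4}$, which indeed sums to 1.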
In fact, any nonnegative function that sums to one over a countable domain induces a discrete probability space.
\begin{proposition}
Suppose $\Omega$ is a non-empty countable set and $p : \Omega \to [0,1]$ is such that $\sum_{\omega \in \Omega} p(\omega) = 1$.
Let $\calF = \calP(\Omega)$ and
\[\pr{A} = \sum_{\omega \in A} p(\omega)\]
for any event $A \in \calF$.
Then
\begin{enumerate}[(i)]
\item $(\Omega, \calF, \pm)$ is a probability space.
\item If $S \subset \R$ with $|S| = |\Omega|$, then any bijection $X : \Omega \to S$ is a random variable on this space with probability mass function $p \circ X\inv$.
\end{enumerate}
\end{proposition}
\begin{proof}
$\calF$ is clearly a $\sigma$-algebra since it contains every subset of $\Omega$ and thus is closed under all complements and unions.
Thus all that must be shown is that $\pm$ is a probability measure.
We have $\pr{\Omega} = \sum_{\omega \in \Omega} p(\omega) = 1$ immediately by assumption.
To show countable additivity, we see that if $\{A_i\} \subseteq \calF$ are disjoint, then
\[\prbigg{\bigdotcup_i A_i} = \sum_{\omega \in \bigdotcup_i A_i} p(\omega) = \sum_i \sum_{\omega \in A_i} p(\omega) = \sum_i \pr{A_i}\]
which proves (i).
To show (ii), suppose $S \subset \R$ with $|S| = |\Omega|$ and let $X : \Omega \to S$ be a bijection.
It is clear that $X$ is measurable, again because $\calF$ contains every subset of $\Omega$.
We also have for any $x \in S$,
\[\pr{X = x} = \pr{X\inv(\{x\})} = \pr{\{X\inv(x)\}} = p(X\inv(x)) = (p \circ X\inv)(x)\]
so $p \circ X\inv$ is the probability mass function of $X$.
\end{proof}
\subsubsection{Continuous random variables}
A \term{continuous random variable} is a random variable that has an uncountable range and assumes each value in this range with probability zero.
Most of the continuous random variables that one would encounter in practice are \term{absolutely continuous random variables}\footnote{
Random variables that are continuous but not absolutely continuous are called \term{singular random variables}.
We will not discuss them, assuming rather that all continuous random variables admit a density function.
}, which means that there exists a function $p : \R \to [0,\infty)$ that satisfies
\[F(x) = \int_{-\infty}^x p(z)\dd{z}\]
The function $p$ is called a \term{probability density function} (abbreviated p.d.f.) and must satisfy
\[\int_{-\infty}^\infty p(x)\dd{x} = 1\]
The values of this function are not themselves probabilities, since they could exceed 1.
However, they do have a couple of reasonable interpretations.
One is as relative probabilities; even though the probability of each particular value being picked is technically zero, some points are still in a sense more likely than others.
One can also think of the density as determining the probability that the variable will lie in a small range about a given value.
Recall that for small $\epsilon$,
\[\pr{x-\nicefrac{\epsilon}{2} \leq X \leq x+\nicefrac{\epsilon}{2}} = \int_{x-\nicefrac{\epsilon}{2}}^{x+\nicefrac{\epsilon}{2}} p(z)\dd{z} \approx \epsilon p(x)\]
using a midpoint approximation to the integral.
Here are some useful identities that follow from the definitions above:
\begin{align*}
\pr{a \leq X \leq b} &= \int_a^b p(x)\dd{x} \\
p(x) &= F'(x)
\end{align*}
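As a concrete example, consider a random variable $X$ distributed uniformly on $[0, \nicefrac{1}{2}]$, i.e. with density $p(x) = 2$ for $x \in [0, \nicefrac{1}{2}]$ and $p(x) = 0$ otherwise.
Then
\[F(x) = \int_{-\infty}^x p(z)\dd{z} = \begin{cases}
0 & x < 0 \\
2x & 0 \leq x \leq \nicefrac{1}{2} \\
1 & x > \nicefrac{1}{2}
\end{cases}\]
so $\int_{-\infty}^\infty p(x)\dd{x} = 1$ and $F'(x) = p(x)$ wherever $F$ is differentiable, even though the density exceeds 1 on its support.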
\subsubsection{Other kinds of random variables}
There are random variables that are neither discrete nor continuous.
For example, consider a random variable $X$ determined as follows:
flip a fair coin; if it comes up heads, $X = 0$, and otherwise $X$ is drawn uniformly at random from $[1,2]$.
Such a random variable can take on uncountably many values, but only finitely many of these with positive probability.
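For concreteness, its c.d.f. is
\[F(x) = \begin{cases}
0 & x < 0 \\
\frac{1}{2} & 0 \leq x < 1 \\
\frac{1}{2} + \frac{x-1}{2} & 1 \leq x \leq 2 \\
1 & x > 2
\end{cases}\]
which has a jump at $0$ (so $X$ is not continuous) but increases continuously over $[1,2]$ (so $X$ is not discrete either).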
We will not discuss such random variables.
\bibliography{measure-probability}
\addcontentsline{toc}{section}{References}
\bibliographystyle{ieeetr}
\nocite{*}
\end{document}