Commit

Update documentation
stratisMarkou committed Dec 25, 2024
1 parent 7936a41 commit d8fc21e
Showing 9 changed files with 274 additions and 17 deletions.
98 changes: 97 additions & 1 deletion _sources/book/topology/002-topological-spaces.md
Original file line number Diff line number Diff line change
@@ -32,7 +32,7 @@ We refer to the topology associated with a given metric as the induced topology.

:::{prf:definition} Induced topology
:label: topology:def-induced-topology
Let $(X, d)$ be a metric space.
Let $(X, d)$ be a {prf:ref}`metric space<topology:def-metric-space>`.
Then, the topology induced by $d$ is the set of all open sets in $X$ with respect to the metric $d.$
:::

@@ -45,3 +45,99 @@ Let $f: X \to Y$ be a function between topological spaces.
Then, $f$ is continuous if for every open set $U \subseteq Y,$ the pre-image $f^{-1}(U)$ is an open set in $X.$
:::


:::{prf:lemma} Composition preserves continuity
:label: topology:lemma-composition-preserves-continuity
If $f: X \to Y$ and $g: Y \to Z$ are {prf:ref}`continuous functions<topology:def-continuous-function-topology>` between {prf:ref}`topological spaces<topology:def-topological-space>`, then the composition $g \circ f: X \to Z$ is continuous.
:::
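The pre-image definition of continuity, and the composition lemma above, can be checked mechanically on finite examples. Below is a minimal illustrative sketch; the finite-space representation (topologies as sets of frozensets, maps as dicts) and the helper names are our own, not part of the text.

```python
def preimage(f, U, X):
    """Pre-image f^{-1}(U) of U under the map f, as a subset of X."""
    return frozenset(x for x in X if f[x] in U)

def is_continuous(f, X, TX, Y, TY):
    """f: X -> Y is continuous iff every open U in TY pulls back to an open set of TX."""
    return all(preimage(f, U, X) in TX for U in TY)

# Three small topological spaces; each topology contains the empty set and the whole space.
X = frozenset({1, 2}); TX = {frozenset(), frozenset({1}), X}          # Sierpinski-like space
Y = frozenset({'a', 'b'}); TY = {frozenset(), frozenset({'a'}), Y}
Z = frozenset({0}); TZ = {frozenset(), Z}                              # one-point space

f = {1: 'a', 2: 'b'}   # continuous: f^{-1}({'a'}) = {1} is open in X
g = {'a': 0, 'b': 0}   # continuous: any map into a one-point space

assert is_continuous(f, X, TX, Y, TY)
assert is_continuous(g, Y, TY, Z, TZ)
assert not is_continuous({1: 'b', 2: 'a'}, X, TX, Y, TY)  # pre-image of {'a'} is {2}, not open

# The composition g o f is again continuous, as the lemma asserts.
gf = {x: g[f[x]] for x in X}
assert is_continuous(gf, X, TX, Z, TZ)
```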


In topology, we are interested in studying the properties of spaces that are preserved under continuous deformations.
From this perspective, two spaces are considered essentially the same whenever there is a continuous bijection between them whose inverse is also continuous.
This is captured by the notion of homeomorphism.

:::{prf:definition} Homeomorphism
:label: topology:def-homeomorphism
A function $f: X \to Y$ between {prf:ref}`topological spaces<topology:def-topological-space>` is a homeomorphism if it is bijective, {prf:ref}`continuous<topology:def-continuous-function-topology>`, and its inverse $f^{-1}$ is also continuous.
Equivalently, $f$ is a homeomorphism if $f$ is a bijection and $U \subseteq X$ is {prf:ref}`open<topology:def-topological-space>` if and only if $f(U) \subseteq Y$ is open.
We say two spaces are homeomorphic if there exists a homeomorphism between them.
:::
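The open-set characterisation above also lends itself to a direct finite-space check. The following sketch (again using our own illustrative representation, not the text's) verifies both directions of the correspondence:

```python
def image(f, U):
    """Image f(U) of a set U under the map f."""
    return frozenset(f[x] for x in U)

def is_homeomorphism(f, X, TX, Y, TY):
    """f must be a bijection X -> Y, with U open in X iff f(U) open in Y."""
    if set(f) != set(X) or set(f.values()) != set(Y) or len(set(f.values())) != len(X):
        return False  # not a bijection between X and Y
    finv = {y: x for x, y in f.items()}
    return (all(image(f, U) in TY for U in TX)        # open sets map to open sets
            and all(image(finv, V) in TX for V in TY))  # and conversely

X = frozenset({1, 2}); TX = {frozenset(), frozenset({1}), X}
Y = frozenset({'a', 'b'}); TY = {frozenset(), frozenset({'a'}), Y}

assert is_homeomorphism({1: 'a', 2: 'b'}, X, TX, Y, TY)      # a relabelling: homeomorphism
assert not is_homeomorphism({1: 'b', 2: 'a'}, X, TX, Y, TY)  # bijective, but {1} maps to {'b'}, not open
```

The second assertion shows that bijectivity alone is not enough: the map must also match up the topologies.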

:::{prf:lemma} Homeomorphism is an equivalence relation
:label: topology:lemma-homeomorphism-equivalence-relation
{prf:ref}`Homeomorphism<topology:def-homeomorphism>` is an equivalence relation between topological spaces.
:::

:::{dropdown} Proof: Homeomorphism is an equivalence relation
__Reflexivity:__
The identity map $I_X: X \to X$ is a homeomorphism, because it is bijective, continuous, and its inverse is itself.
Therefore $X \equiv X.$

__Symmetry:__
If $f: X \to Y$ is a homeomorphism, then $f^{-1}: Y \to X$ is also a homeomorphism.
Therefore $X \equiv Y$ implies $Y \equiv X.$

__Transitivity:__
If $f: X \to Y$ and $g: Y \to Z$ are homeomorphisms, then $g \circ f: X \to Z$ is a homeomorphism.
Therefore $X \equiv Y$ and $Y \equiv Z$ implies $X \equiv Z.$
:::

In general, to show that two spaces are homeomorphic we exhibit a homeomorphism between them.
However, showing that two spaces are _not_ homeomorphic is more difficult, since there is no simple recipe for doing so.
Instead, we resort to certain topological properties that are preserved under homeomorphisms.
Whenever two spaces have different such properties, we can conclude that they are not homeomorphic.
Two such properties are connectedness and compactness.
In the remainder of this chapter we give definitions and results building up to these properties.


## Sequences

We now turn to re-defining concepts from metric spaces in terms of topological spaces, starting with sequences.
First we re-define the following shorthand for open sets.

:::{prf:definition} Open neighbourhood
:label: topology:def-open-neighbourhood-topology
An open neighbourhood of a point $x \in X$ in a {prf:ref}`topological space<topology:def-topological-space>` $(X, \mathcal{U})$ is an open set $U \in \mathcal{U}$ such that $x \in U.$
:::


In topological spaces, convergent sequences are defined directly in terms of open neighbourhoods, rather than using open balls.

:::{prf:definition} Convergent sequence
:label: topology:def-convergent-sequence-topology
A sequence $(x_n)$ in a topological space $X$ converges to $x \in X,$ written $x_n \to x,$ if for every open neighbourhood $U$ of $x,$ there exists $N \in \mathbb{N}$ such that $x_n \in U$ for all $n > N.$
:::


We now turn to uniqueness of limits.
In general, in a topological space limits need not be unique.
For example, given a set $X$ with the indiscrete topology $\mathcal{U} = \{\emptyset, X\},$ every sequence converges to every point.
However, further assumptions on the topology can result in unique limits.
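For an eventually constant sequence, the convergence definition above reduces to a finite check: $x_n \to x$ if and only if every open neighbourhood of $x$ contains the eventual value. A small sketch (the set-based representation is our own illustration) makes the non-uniqueness of limits under the topology $\{\emptyset, X\}$ concrete:

```python
def limits(eventual_value, X, topology):
    """Points x to which the eventually constant sequence (..., a, a, a, ...) converges:
    x_n -> x iff every open neighbourhood of x contains the eventual value a."""
    return {x for x in X
            if all(eventual_value in U for U in topology if x in U)}

X = frozenset({1, 2, 3})
trivial = {frozenset(), X}  # only opens: empty set and X itself
discrete = {frozenset(s) for s in [(), (1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]}

# With only {emptyset, X} open, the sequence 1, 1, 1, ... converges to every point.
assert limits(1, X, trivial) == {1, 2, 3}
# In the discrete topology it converges only to 1: the limit is unique.
assert limits(1, X, discrete) == {1}
```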

:::{prf:definition} Hausdorff space
:label: topology:def-hausdorff-space
A topological space $(X, \mathcal{U})$ is Hausdorff if for every pair of distinct points $x_1, x_2 \in X,$ there exist open neighbourhoods $U_1, U_2$ of $x_1, x_2$ respectively such that $U_1 \cap U_2 = \emptyset.$
:::
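The Hausdorff condition is likewise directly checkable on finite examples; a sketch under the same illustrative representation as before:

```python
from itertools import combinations

def is_hausdorff(X, topology):
    """Every pair of distinct points must have disjoint open neighbourhoods."""
    return all(
        any(x1 in U1 and x2 in U2 and not (U1 & U2)
            for U1 in topology for U2 in topology)
        for x1, x2 in combinations(X, 2)
    )

X = frozenset({1, 2, 3})
trivial = {frozenset(), X}
discrete = {frozenset(s) for s in [(), (1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]}

assert not is_hausdorff(X, trivial)  # the only nonempty open set is X itself
assert is_hausdorff(X, discrete)     # singletons separate any two points
```

The non-Hausdorff example here is exactly the topology for which limits failed to be unique above, matching the lemma that follows.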

:::{margin}
Earlier, we proved that {prf:ref}`limits in metric spaces are unique<topology:lemma-limits-in-metric-spaces-are-unique>`.
The property we used in that proof was that, in a {prf:ref}`metric space <topology:def-metric-space>`, open balls centered around distinct points are disjoint if their radii are small enough.
This was the Hausdorff property in disguise.
Metric spaces are always {prf:ref}`Hausdorff<topology:def-hausdorff-space>`, and therefore have unique limits.
:::

:::{prf:lemma} Limits are unique in Hausdorff spaces
:label: topology:lemma-limits-unique-hausdorff
If $X$ is {prf:ref}`Hausdorff<topology:def-hausdorff-space>` and $(x_n)$ is a sequence in $X$ such that $x_n \to x$ and $x_n \to x',$ then $x = x'.$
:::

:::{dropdown} Proof: Limits are unique in Hausdorff spaces
Let $(x_n)$ be a sequence in $X$ such that $x_n \to x$ and $x_n \to x'.$
Suppose $x \neq x'.$
Since $X$ is Hausdorff, there exist open neighbourhoods $U, U'$ of $x, x'$ respectively such that $U \cap U' = \emptyset.$
Since $x_n \to x,$ there exists $N \in \mathbb{N}$ such that $x_n \in U$ for all $n > N.$
Similarly, since $x_n \to x',$ there exists $N' \in \mathbb{N}$ such that $x_n \in U'$ for all $n > N'.$
Then, for all $n > \max(N, N'),$ we have $x_n \in U \cap U' = \emptyset,$ which is a contradiction.
Therefore, $x = x'.$
:::
6 changes: 3 additions & 3 deletions book/papers/ais/ais.html
@@ -591,7 +591,7 @@ <h2>Importance sampling<a class="headerlink" href="#importance-sampling" title="
It is reasonable to expect that the more dissimilar <span class="math notranslate nohighlight">\(q\)</span> and <span class="math notranslate nohighlight">\(p\)</span> are, the larger the variance will be.
In particular, we can show that the variance of the importance weights can be lower bounded by a quantity that scales exponentially with the KL divergence.</p>
<div class="proof lemma admonition" id="lemma-0">
<p class="admonition-title"><span class="caption-number">Lemma 15 </span> (Lower bound to importance weight variance)</p>
<p class="admonition-title"><span class="caption-number">Lemma 18 </span> (Lower bound to importance weight variance)</p>
<section class="lemma-content" id="proof-content">
<p>Given distributions <span class="math notranslate nohighlight">\(p\)</span> and <span class="math notranslate nohighlight">\(q\)</span>, it holds that</p>
<div class="math notranslate nohighlight">
@@ -649,7 +649,7 @@ <h2>Importance-weighted MCMC<a class="headerlink" href="#importance-weighted-mcm
So we could, in principle, use MCMC within an importance-weighted estimator to reduce its variance.
The following algorithm is based on this intuition.</p>
<div class="proof definition admonition" id="definition-1">
<p class="admonition-title"><span class="caption-number">Definition 83 </span> (Importance weighted MCMC algorithm)</p>
<p class="admonition-title"><span class="caption-number">Definition 87 </span> (Importance weighted MCMC algorithm)</p>
<section class="definition-content" id="proof-content">
<p>Given a proposal density <span class="math notranslate nohighlight">\(q\)</span>, a target density <span class="math notranslate nohighlight">\(p\)</span> and a sequence of transition kernels <span class="math notranslate nohighlight">\(T_1(x, x'), \dots, T_K(x, x')\)</span> such that <span class="math notranslate nohighlight">\(T_k\)</span> leaves <span class="math notranslate nohighlight">\(p\)</span> invariant.
Sampling <span class="math notranslate nohighlight">\(x_0 \sim q(x)\)</span> followed by</p>
@@ -737,7 +737,7 @@ <h2>Annealed Importance Sampling<a class="headerlink" href="#id2" title="Link to
<p>These distributions interpolate between <span class="math notranslate nohighlight">\(q\)</span> and <span class="math notranslate nohighlight">\(p\)</span> as we vary <span class="math notranslate nohighlight">\(\beta\)</span>.
AIS then proceeds in a similar way to the importance weighted MCMC algorithm we highlighted above, except that it requires that each <span class="math notranslate nohighlight">\(T_k\)</span> leaves <span class="math notranslate nohighlight">\(\pi_k\)</span>, instead of <span class="math notranslate nohighlight">\(p\)</span>, invariant.</p>
<div class="proof definition admonition" id="definition-2">
<p class="admonition-title"><span class="caption-number">Definition 84 </span> (Annealed Importance Sampling)</p>
<p class="admonition-title"><span class="caption-number">Definition 88 </span> (Annealed Importance Sampling)</p>
<section class="definition-content" id="proof-content">
<p>Given a target density <span class="math notranslate nohighlight">\(p\)</span>, a proposal density <span class="math notranslate nohighlight">\(q\)</span> and a sequence <span class="math notranslate nohighlight">\(0 = \beta_0 \leq \dots \leq \beta_K = 1\)</span>, define</p>
<div class="math notranslate nohighlight">
10 changes: 5 additions & 5 deletions book/papers/num-sde/num-sde.html
@@ -467,7 +467,7 @@ <h2>Why stochastic differential equations<a class="headerlink" href="#why-stocha
<h2>The Wiener process<a class="headerlink" href="#the-wiener-process" title="Link to this heading">#</a></h2>
<p>In order to define the stochastic component of the transition rule of a stochastic system, we must define an appropriate noise model. The Wiener process is a stochastic process that is commonly used for this purpose.</p>
<div class="proof definition admonition" id="definition-0">
<p class="admonition-title"><span class="caption-number">Definition 88 </span> (Wiener process)</p>
<p class="admonition-title"><span class="caption-number">Definition 92 </span> (Wiener process)</p>
<section class="definition-content" id="proof-content">
<p>A standard Wiener process over [0, T] is a random variable <span class="math notranslate nohighlight">\(W(t)\)</span> that depends continuously on <span class="math notranslate nohighlight">\(t \in [0, T]\)</span> and satisfies:</p>
<ol class="arabic simple">
@@ -650,7 +650,7 @@ <h2>Evaluating a stochastic integral<a class="headerlink" href="#evaluating-a-st
<h2>Euler-Maruyama method<a class="headerlink" href="#euler-maruyama-method" title="Link to this heading">#</a></h2>
<p>The Euler-Maruyama method is the analogue of the Euler method for deterministic differential equations, applied to the stochastic case.</p>
<div class="proof definition admonition" id="definition-1">
<p class="admonition-title"><span class="caption-number">Definition 89 </span> (Euler-Maruyama method)</p>
<p class="admonition-title"><span class="caption-number">Definition 93 </span> (Euler-Maruyama method)</p>
<section class="definition-content" id="proof-content">
<p>Given a scalar SDE with drift and diffusion functions <span class="math notranslate nohighlight">\(f\)</span> and <span class="math notranslate nohighlight">\(g\)</span></p>
<div class="math notranslate nohighlight">
@@ -770,7 +770,7 @@ <h2>Euler-Maruyama method<a class="headerlink" href="#euler-maruyama-method" tit
<h2>Strong and weak convergence<a class="headerlink" href="#strong-and-weak-convergence" title="Link to this heading">#</a></h2>
<p>Since the choice of the number of bins <span class="math notranslate nohighlight">\(N\)</span> of the discretisation affects the accuracy of our method, we are interested in how quickly the approximation converges to the exact solution as a function of <span class="math notranslate nohighlight">\(N\)</span>. To do so, we must first define <em>what convergence means</em> in the stochastic case, which leads us to two distinct notions of convergence, the strong sense and the weak sense.</p>
<div class="proof definition admonition" id="definition-2">
<p class="admonition-title"><span class="caption-number">Definition 90 </span> (Strong convergence)</p>
<p class="admonition-title"><span class="caption-number">Definition 94 </span> (Strong convergence)</p>
<section class="definition-content" id="proof-content">
<p>A method for approximating a stochastic process <span class="math notranslate nohighlight">\(X(t)\)</span> is said to have strong order of convergence <span class="math notranslate nohighlight">\(\gamma\)</span> if there exists a constant such that</p>
<div class="math notranslate nohighlight">
</section>
</div><p>Strong convergence refers to the rate of convergence of the approximation <span class="math notranslate nohighlight">\(X_n\)</span> to the exact solution <span class="math notranslate nohighlight">\(X(\tau_n)\)</span> as <span class="math notranslate nohighlight">\(\Delta t \to 0\)</span>, in expectation. A weaker condition for convergence is rate at which the expected value of the approximation converges to the true expected value, as <span class="math notranslate nohighlight">\(\Delta t \to 0\)</span>, as given below.</p>
<div class="proof definition admonition" id="definition-3">
<p class="admonition-title"><span class="caption-number">Definition 91 </span> (Weak convergence)</p>
<p class="admonition-title"><span class="caption-number">Definition 95 </span> (Weak convergence)</p>
<section class="definition-content" id="proof-content">
<p>A method for approximating a stochastic process <span class="math notranslate nohighlight">\(X(t)\)</span> is said to have weak order of convergence <span class="math notranslate nohighlight">\(\gamma\)</span> if there exists a constant such that</p>
<div class="math notranslate nohighlight">
<h2>Milstein’s higher order method<a class="headerlink" href="#milstein-s-higher-order-method" title="Link to this heading">#</a></h2>
<p>Just as higher order methods for ODEs exist for obtaining refined estimates of the solution, so do methods for SDEs, such as Milstein’s higher order method.</p>
<div class="proof definition admonition" id="definition-4">
<p class="admonition-title"><span class="caption-number">Definition 92 </span> (Milstein’s method)</p>
<p class="admonition-title"><span class="caption-number">Definition 96 </span> (Milstein’s method)</p>
<section class="definition-content" id="proof-content">
<p>Given a scalar SDE with drift and diffusion functions <span class="math notranslate nohighlight">\(f\)</span> and <span class="math notranslate nohighlight">\(g\)</span></p>
<div class="math notranslate nohighlight">
6 changes: 3 additions & 3 deletions book/papers/rff/rff.html
@@ -514,7 +514,7 @@ <h2>The RFF approximation<a class="headerlink" href="#the-rff-approximation" tit
<p>This is also an unbiased estimate of the kernel; however, its variance is lower than in the <span class="math notranslate nohighlight">\(M = 1\)</span> case, since the variance of the average of <span class="math notranslate nohighlight">\(M\)</span> i.i.d. random variables is lower than the variance of any single one of them.
We therefore arrive at the following algorithm for estimating <span class="math notranslate nohighlight">\(k\)</span>.</p>
<div class="proof definition admonition" id="definition-1">
<p class="admonition-title"><span class="caption-number">Definition 85 </span> (Random Fourier Features)</p>
<p class="admonition-title"><span class="caption-number">Definition 89 </span> (Random Fourier Features)</p>
<section class="definition-content" id="proof-content">
<p>Given a translation invariant kernel <span class="math notranslate nohighlight">\(k\)</span> that is the Fourier transform of a probability measure <span class="math notranslate nohighlight">\(p\)</span>, we have the unbiased real-valued estimator</p>
<div class="math notranslate nohighlight">
<h3>Rates of convergence<a class="headerlink" href="#rates-of-convergence" title="Link to this heading">#</a></h3>
<p>Now there remains the question of how large the error of the RFF estimator is. In other words, how closely does RFF estimate the exact kernel <span class="math notranslate nohighlight">\(k\)</span>? Since <span class="math notranslate nohighlight">\(-\sqrt{2} \leq z_{\omega, \phi} \leq \sqrt{2}\)</span>, we can use Hoeffding’s inequality<span id="id2">[<a class="reference internal" href="#id10" title="Geoffrey Grimmett, David Stirzaker. Probability and random processes. Oxford University Press, 2020.">Grimmett, 2020</a>]</span> to obtain the following high-probability bound on the absolute error on our estimate of <span class="math notranslate nohighlight">\(k\)</span>.</p>
<div class="proof lemma admonition" id="lemma-2">
<p class="admonition-title"><span class="caption-number">Lemma 16 </span> (Hoeffding for RFF)</p>
<p class="admonition-title"><span class="caption-number">Lemma 19 </span> (Hoeffding for RFF)</p>
<section class="lemma-content" id="proof-content">
<p>The RFF estimator of <span class="math notranslate nohighlight">\(k\)</span>, using <span class="math notranslate nohighlight">\(M\)</span> pairs of <span class="math notranslate nohighlight">\(\omega, \phi\)</span>, obeys</p>
<div class="math notranslate nohighlight">
Note that this is a statement about the closeness of <span class="math notranslate nohighlight">\(z^\top(x)z(y)\)</span> and <span class="math notranslate nohighlight">\(k(x, y)\)</span> for any two input pairs, rather than the closeness of these functions over the whole input space.
In fact, it is possible<span id="id3">[<a class="reference internal" href="#id11" title="Ali Rahimi, Benjamin Recht, and others. Random features for large-scale kernel machines. In NIPS. 2007.">Rahimi <em>et al.</em>, 2007</a>]</span> to make a stronger statement about the uniform convergence of the estimator.</p>
<div class="proof lemma admonition" id="lemma-3">
<p class="admonition-title"><span class="caption-number">Lemma 17 </span> (Uniform convergence of RFF)</p>
<p class="admonition-title"><span class="caption-number">Lemma 20 </span> (Uniform convergence of RFF)</p>
<section class="lemma-content" id="proof-content">
<p>Let <span class="math notranslate nohighlight">\(\mathcal{M}\)</span> be a compact subset of <span class="math notranslate nohighlight">\(\mathbb{R}^D\)</span>. Then the RFF estimator of <span class="math notranslate nohighlight">\(k\)</span>, using <span class="math notranslate nohighlight">\(M\)</span> pairs of <span class="math notranslate nohighlight">\(\omega, \phi\)</span> converges uniformly to <span class="math notranslate nohighlight">\(k\)</span> according to</p>
<div class="math notranslate nohighlight">