Commit

Update documentation
stratisMarkou committed Dec 25, 2024
1 parent 7936a41 commit d8fc21e
Showing 9 changed files with 274 additions and 17 deletions.
98 changes: 97 additions & 1 deletion _sources/book/topology/002-topological-spaces.md
Original file line number Diff line number Diff line change
@@ -32,7 +32,7 @@ We refer to the topology associated with a given metric as the induced topology.

:::{prf:definition} Induced topology
:label: topology:def-induced-topology
Let $(X, d)$ be a metric space.
Let $(X, d)$ be a {prf:ref}`metric space<topology:def-metric-space>`.
Then, the topology induced by $d$ is the set of all open sets in $X$ with respect to the metric $d.$
:::

@@ -45,3 +45,99 @@ Let $f: X \to Y$ be a function between topological spaces.
Then, $f$ is continuous if for every open set $U \subseteq Y,$ the pre-image $f^{-1}(U)$ is an open set in $X.$
:::


:::{prf:lemma} Composition preserves continuity
:label: topology:lemma-composition-preserves-continuity
If $f: X \to Y$ and $g: Y \to Z$ are {prf:ref}`continuous functions<topology:def-continuous-function-topology>` between {prf:ref}`topological spaces<topology:def-topological-space>`, then the composition $g \circ f: X \to Z$ is continuous.
:::
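The pre-image definition of continuity, and the composition lemma above, can be checked mechanically on finite examples. Below is a minimal illustrative sketch; the finite-space representation (topologies as sets of frozensets, maps as dicts) and the helper names are our own, not part of the text.

```python
def preimage(f, U, X):
    """Pre-image f^{-1}(U) of U under the map f, as a subset of X."""
    return frozenset(x for x in X if f[x] in U)

def is_continuous(f, X, TX, Y, TY):
    """f: X -> Y is continuous iff every open U in TY pulls back to an open set of TX."""
    return all(preimage(f, U, X) in TX for U in TY)

# Three small topological spaces; each topology contains the empty set and the whole space.
X = frozenset({1, 2}); TX = {frozenset(), frozenset({1}), X}          # Sierpinski-like space
Y = frozenset({'a', 'b'}); TY = {frozenset(), frozenset({'a'}), Y}
Z = frozenset({0}); TZ = {frozenset(), Z}                              # one-point space

f = {1: 'a', 2: 'b'}   # continuous: f^{-1}({'a'}) = {1} is open in X
g = {'a': 0, 'b': 0}   # continuous: any map into a one-point space

assert is_continuous(f, X, TX, Y, TY)
assert is_continuous(g, Y, TY, Z, TZ)
assert not is_continuous({1: 'b', 2: 'a'}, X, TX, Y, TY)  # pre-image of {'a'} is {2}, not open

# The composition g o f is again continuous, as the lemma asserts.
gf = {x: g[f[x]] for x in X}
assert is_continuous(gf, X, TX, Z, TZ)
```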


In topology, we are interested in studying the properties of spaces that are preserved under continuous deformations.
From this perspective, two spaces are considered essentially the same whenever there is a continuous bijection between them whose inverse is also continuous.
This is captured by the notion of homeomorphism.

:::{prf:definition} Homeomorphism
:label: topology:def-homeomorphism
A function $f: X \to Y$ between {prf:ref}`topological spaces<topology:def-topological-space>` is a homeomorphism if it is bijective, {prf:ref}`continuous<topology:def-continuous-function-topology>`, and its inverse $f^{-1}$ is also continuous.
Equivalently, $f$ is a homeomorphism if $f$ is a bijection and $U \subseteq X$ is {prf:ref}`open<topology:def-topological-space>` if and only if $f(U) \subseteq Y$ is open.
We say two spaces are homeomorphic if there exists a homeomorphism between them.
:::
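The open-set characterisation above also lends itself to a direct finite-space check. The following sketch (again using our own illustrative representation, not the text's) verifies both directions of the correspondence:

```python
def image(f, U):
    """Image f(U) of a set U under the map f."""
    return frozenset(f[x] for x in U)

def is_homeomorphism(f, X, TX, Y, TY):
    """f must be a bijection X -> Y, with U open in X iff f(U) open in Y."""
    if set(f) != set(X) or set(f.values()) != set(Y) or len(set(f.values())) != len(X):
        return False  # not a bijection between X and Y
    finv = {y: x for x, y in f.items()}
    return (all(image(f, U) in TY for U in TX)        # open sets map to open sets
            and all(image(finv, V) in TX for V in TY))  # and conversely

X = frozenset({1, 2}); TX = {frozenset(), frozenset({1}), X}
Y = frozenset({'a', 'b'}); TY = {frozenset(), frozenset({'a'}), Y}

assert is_homeomorphism({1: 'a', 2: 'b'}, X, TX, Y, TY)      # a relabelling: homeomorphism
assert not is_homeomorphism({1: 'b', 2: 'a'}, X, TX, Y, TY)  # bijective, but {1} maps to {'b'}, not open
```

The second assertion shows that bijectivity alone is not enough: the map must also match up the topologies.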

:::{prf:lemma} Homeomorphism is an equivalence relation
:label: topology:lemma-homeomorphism-equivalence-relation
{prf:ref}`Homeomorphism<topology:def-homeomorphism>` is an equivalence relation between topological spaces.
:::

:::{dropdown} Proof: Homeomorphism is an equivalence relation
__Reflexivity:__
The identity map $I_X: X \to X$ is a homeomorphism, because it is bijective, continuous, and its inverse is itself.
Therefore $X \equiv X.$

__Symmetry:__
If $f: X \to Y$ is a homeomorphism, then $f^{-1}: Y \to X$ is also a homeomorphism.
Therefore $X \equiv Y$ implies $Y \equiv X.$

__Transitivity:__
If $f: X \to Y$ and $g: Y \to Z$ are homeomorphisms, then $g \circ f: X \to Z$ is a homeomorphism.
Therefore $X \equiv Y$ and $Y \equiv Z$ implies $X \equiv Z.$
:::

In general, to show that two spaces are homeomorphic we exhibit a homeomorphism between them.
However, showing that two spaces are _not_ homeomorphic is more difficult, since there is no simple recipe for doing so.
Instead, we resort to certain topological properties that are preserved under homeomorphisms.
Whenever two spaces have different such properties, we can conclude that they are not homeomorphic.
Two such properties are connectedness and compactness.
In the remainder of this chapter we give definitions and results building up to these properties.


## Sequences

We now turn to re-defining concepts from metric spaces in terms of topological spaces, starting with sequences.
First we re-define the following shorthand for open sets.

:::{prf:definition} Open neighbourhood
:label: topology:def-open-neighbourhood-topology
An open neighbourhood of a point $x \in X$ in a {prf:ref}`topological space<topology:def-topological-space>` $(X, \mathcal{U})$ is an open set $U \in \mathcal{U}$ such that $x \in U.$
:::


In topological spaces, convergent sequences are defined directly in terms of open neighbourhoods, rather than using open balls.

:::{prf:definition} Convergent sequence
:label: topology:def-convergent-sequence-topology
A sequence $(x_n)$ in a topological space $X$ converges to $x \in X,$ written $x_n \to x,$ if for every open neighbourhood $U$ of $x,$ there exists $N \in \mathbb{N}$ such that $x_n \in U$ for all $n > N.$
:::


We now turn to uniqueness of limits.
In general, in a topological space limits need not be unique.
For example, given a set $X$ with the indiscrete topology $\mathcal{U} = \{\emptyset, X\},$ every sequence converges to every point.
However, further assumptions on the topology can result in unique limits.
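For an eventually constant sequence, the convergence definition above reduces to a finite check: $x_n \to x$ if and only if every open neighbourhood of $x$ contains the eventual value. A small sketch (the set-based representation is our own illustration) makes the non-uniqueness of limits under the topology $\{\emptyset, X\}$ concrete:

```python
def limits(eventual_value, X, topology):
    """Points x to which the eventually constant sequence (..., a, a, a, ...) converges:
    x_n -> x iff every open neighbourhood of x contains the eventual value a."""
    return {x for x in X
            if all(eventual_value in U for U in topology if x in U)}

X = frozenset({1, 2, 3})
trivial = {frozenset(), X}  # only opens: empty set and X itself
discrete = {frozenset(s) for s in [(), (1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]}

# With only {emptyset, X} open, the sequence 1, 1, 1, ... converges to every point.
assert limits(1, X, trivial) == {1, 2, 3}
# In the discrete topology it converges only to 1: the limit is unique.
assert limits(1, X, discrete) == {1}
```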

:::{prf:definition} Hausdorff space
:label: topology:def-hausdorff-space
A topological space $(X, \mathcal{U})$ is Hausdorff if for every pair of distinct points $x_1, x_2 \in X,$ there exist open neighbourhoods $U_1, U_2$ of $x_1, x_2$ respectively such that $U_1 \cap U_2 = \emptyset.$
:::
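The Hausdorff condition is likewise directly checkable on finite examples; a sketch under the same illustrative representation as before:

```python
from itertools import combinations

def is_hausdorff(X, topology):
    """Every pair of distinct points must have disjoint open neighbourhoods."""
    return all(
        any(x1 in U1 and x2 in U2 and not (U1 & U2)
            for U1 in topology for U2 in topology)
        for x1, x2 in combinations(X, 2)
    )

X = frozenset({1, 2, 3})
trivial = {frozenset(), X}
discrete = {frozenset(s) for s in [(), (1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]}

assert not is_hausdorff(X, trivial)  # the only nonempty open set is X itself
assert is_hausdorff(X, discrete)     # singletons separate any two points
```

The non-Hausdorff example here is exactly the topology for which limits failed to be unique above, matching the lemma that follows.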

:::{margin}
Earlier, we proved that {prf:ref}`limits in metric spaces are unique<topology:lemma-limits-in-metric-spaces-are-unique>`.
The property we used in that proof was that, in a {prf:ref}`metric space <topology:def-metric-space>`, open balls centered around distinct points are disjoint if their radii are small enough.
This was the Hausdorff property in disguise.
Metric spaces are always {prf:ref}`Hausdorff<topology:def-hausdorff-space>`, and therefore have unique limits.
:::

:::{prf:lemma} Limits are unique in Hausdorff spaces
:label: topology:lemma-limits-unique-hausdorff
If $X$ is {prf:ref}`Hausdorff<topology:def-hausdorff-space>` and $(x_n)$ is a sequence in $X$ such that $x_n \to x$ and $x_n \to x',$ then $x = x'.$
:::

:::{dropdown} Proof: Limits are unique in Hausdorff spaces
Let $(x_n)$ be a sequence in $X$ such that $x_n \to x$ and $x_n \to x'.$
Suppose $x \neq x'.$
Since $X$ is Hausdorff, there exist open neighbourhoods $U, U'$ of $x, x'$ respectively such that $U \cap U' = \emptyset.$
Since $x_n \to x,$ there exists $N \in \mathbb{N}$ such that $x_n \in U$ for all $n > N.$
Similarly, since $x_n \to x',$ there exists $N' \in \mathbb{N}$ such that $x_n \in U'$ for all $n > N'.$
Then, for all $n > \max(N, N'),$ we have $x_n \in U \cap U' = \emptyset,$ which is a contradiction.
Therefore, $x = x'.$
:::
6 changes: 3 additions & 3 deletions book/papers/ais/ais.html
@@ -591,7 +591,7 @@ <h2>Importance sampling<a class="headerlink" href="#importance-sampling" title="
It is reasonable to expect that the more dissimilar <span class="math notranslate nohighlight">\(q\)</span> and <span class="math notranslate nohighlight">\(p\)</span> are, the larger the variance will be.
In particular, we can show that the variance of the importance weights can be lower bounded by a quantity that scales exponentially with the KL divergence.</p>
<div class="proof lemma admonition" id="lemma-0">
<p class="admonition-title"><span class="caption-number">Lemma 15 </span> (Lower bound to importance weight variance)</p>
<p class="admonition-title"><span class="caption-number">Lemma 18 </span> (Lower bound to importance weight variance)</p>
<section class="lemma-content" id="proof-content">
<p>Given distributions <span class="math notranslate nohighlight">\(p\)</span> and <span class="math notranslate nohighlight">\(q\)</span>, it holds that</p>
<div class="math notranslate nohighlight">
@@ -649,7 +649,7 @@ <h2>Importance-weighted MCMC<a class="headerlink" href="#importance-weighted-mcm
So we could, in principle, use MCMC within an importance-weighted estimator to reduce its variance.
The following algorithm is based on this intuition.</p>
<div class="proof definition admonition" id="definition-1">
<p class="admonition-title"><span class="caption-number">Definition 83 </span> (Importance weighted MCMC algorithm)</p>
<p class="admonition-title"><span class="caption-number">Definition 87 </span> (Importance weighted MCMC algorithm)</p>
<section class="definition-content" id="proof-content">
<p>Given a proposal density <span class="math notranslate nohighlight">\(q\)</span>, a target density <span class="math notranslate nohighlight">\(p\)</span> and a sequence of transition kernels <span class="math notranslate nohighlight">\(T_1(x, x'), \dots, T_K(x, x')\)</span> such that <span class="math notranslate nohighlight">\(T_k\)</span> leaves <span class="math notranslate nohighlight">\(p\)</span> invariant.
Sampling <span class="math notranslate nohighlight">\(x_0 \sim q(x)\)</span> followed by</p>
@@ -737,7 +737,7 @@ <h2>Annealed Importance Sampling<a class="headerlink" href="#id2" title="Link to
<p>These distributions interpolate between <span class="math notranslate nohighlight">\(q\)</span> and <span class="math notranslate nohighlight">\(p\)</span> as we vary <span class="math notranslate nohighlight">\(\beta\)</span>.
AIS then proceeds in a similar way to the importance weighted MCMC algorithm we highlighted above, except that it requires that each <span class="math notranslate nohighlight">\(T_k\)</span> leaves <span class="math notranslate nohighlight">\(\pi_k\)</span>, instead of <span class="math notranslate nohighlight">\(p\)</span>, invariant.</p>
<div class="proof definition admonition" id="definition-2">
<p class="admonition-title"><span class="caption-number">Definition 84 </span> (Annealed Importance Sampling)</p>
<p class="admonition-title"><span class="caption-number">Definition 88 </span> (Annealed Importance Sampling)</p>
<section class="definition-content" id="proof-content">
<p>Given a target density <span class="math notranslate nohighlight">\(p\)</span>, a proposal density <span class="math notranslate nohighlight">\(q\)</span> and a sequence <span class="math notranslate nohighlight">\(0 = \beta_0 \leq \dots \leq \beta_K = 1\)</span>, define</p>
<div class="math notranslate nohighlight">
10 changes: 5 additions & 5 deletions book/papers/num-sde/num-sde.html
@@ -467,7 +467,7 @@ <h2>Why stochastic differential equations<a class="headerlink" href="#why-stocha
<h2>The Wiener process<a class="headerlink" href="#the-wiener-process" title="Link to this heading">#</a></h2>
<p>In order to define the stochastic component of the transition rule of a stochastic system, we must define an appropriate noise model. The Wiener process is a stochastic process that is commonly used for this purpose.</p>
<div class="proof definition admonition" id="definition-0">
<p class="admonition-title"><span class="caption-number">Definition 88 </span> (Wiener process)</p>
<p class="admonition-title"><span class="caption-number">Definition 92 </span> (Wiener process)</p>
<section class="definition-content" id="proof-content">
<p>A standard Wiener process over [0, T] is a random variable <span class="math notranslate nohighlight">\(W(t)\)</span> that depends continuously on <span class="math notranslate nohighlight">\(t \in [0, T]\)</span> and satisfies:</p>
<ol class="arabic simple">
@@ -650,7 +650,7 @@ <h2>Evaluating a stochastic integral<a class="headerlink" href="#evaluating-a-st
<h2>Euler-Maruyama method<a class="headerlink" href="#euler-maruyama-method" title="Link to this heading">#</a></h2>
<p>The Euler-Maruyama method is the analogue of the Euler method for deterministic differential equations, applied to the stochastic case.</p>
<div class="proof definition admonition" id="definition-1">
<p class="admonition-title"><span class="caption-number">Definition 89 </span> (Euler-Maruyama method)</p>
<p class="admonition-title"><span class="caption-number">Definition 93 </span> (Euler-Maruyama method)</p>
<section class="definition-content" id="proof-content">
<p>Given a scalar SDE with drift and diffusion functions <span class="math notranslate nohighlight">\(f\)</span> and <span class="math notranslate nohighlight">\(g\)</span></p>
<div class="math notranslate nohighlight">
@@ -770,7 +770,7 @@ <h2>Euler-Maruyama method<a class="headerlink" href="#euler-maruyama-method" tit
<h2>Strong and weak convergence<a class="headerlink" href="#strong-and-weak-convergence" title="Link to this heading">#</a></h2>
<p>Since the choice of the number of bins <span class="math notranslate nohighlight">\(N\)</span> of the discretisation affects the accuracy of our method, we are interested in how quickly the approximation converges to the exact solution as a function of <span class="math notranslate nohighlight">\(N\)</span>. To do so, we must first define <em>what convergence means</em> in the stochastic case, which leads us to two distinct notions of convergence, the strong sense and the weak sense.</p>
<div class="proof definition admonition" id="definition-2">
<p class="admonition-title"><span class="caption-number">Definition 90 </span> (Strong convergence)</p>
<p class="admonition-title"><span class="caption-number">Definition 94 </span> (Strong convergence)</p>
<section class="definition-content" id="proof-content">
<p>A method for approximating a stochastic process <span class="math notranslate nohighlight">\(X(t)\)</span> is said to have strong order of convergence <span class="math notranslate nohighlight">\(\gamma\)</span> if there exists a constant such that</p>
<div class="math notranslate nohighlight">
</section>
</div><p>Strong convergence refers to the rate of convergence of the approximation <span class="math notranslate nohighlight">\(X_n\)</span> to the exact solution <span class="math notranslate nohighlight">\(X(\tau_n)\)</span> as <span class="math notranslate nohighlight">\(\Delta t \to 0\)</span>, in expectation. A weaker condition for convergence is rate at which the expected value of the approximation converges to the true expected value, as <span class="math notranslate nohighlight">\(\Delta t \to 0\)</span>, as given below.</p>
<div class="proof definition admonition" id="definition-3">
<p class="admonition-title"><span class="caption-number">Definition 91 </span> (Weak convergence)</p>
<p class="admonition-title"><span class="caption-number">Definition 95 </span> (Weak convergence)</p>
<section class="definition-content" id="proof-content">
<p>A method for approximating a stochastic process <span class="math notranslate nohighlight">\(X(t)\)</span> is said to have weak order of convergence <span class="math notranslate nohighlight">\(\gamma\)</span> if there exists a constant such that</p>
<div class="math notranslate nohighlight">
<h2>Milstein’s higher order method<a class="headerlink" href="#milstein-s-higher-order-method" title="Link to this heading">#</a></h2>
<p>Just as higher order methods for ODEs exist for obtaining refined estimates of the solution, so do methods for SDEs, such as Milstein’s higher order method.</p>
<div class="proof definition admonition" id="definition-4">
<p class="admonition-title"><span class="caption-number">Definition 92 </span> (Milstein’s method)</p>
<p class="admonition-title"><span class="caption-number">Definition 96 </span> (Milstein’s method)</p>
<section class="definition-content" id="proof-content">
<p>Given a scalar SDE with drift and diffusion functions <span class="math notranslate nohighlight">\(f\)</span> and <span class="math notranslate nohighlight">\(g\)</span></p>
<div class="math notranslate nohighlight">
6 changes: 3 additions & 3 deletions book/papers/rff/rff.html
@@ -514,7 +514,7 @@ <h2>The RFF approximation<a class="headerlink" href="#the-rff-approximation" tit
<p>This is also an unbiased estimate of the kernel; however, its variance is lower than in the <span class="math notranslate nohighlight">\(M = 1\)</span> case, since the variance of the average of <span class="math notranslate nohighlight">\(M\)</span> i.i.d. random variables is lower than the variance of any single one of them.
We therefore arrive at the following algorithm for estimating <span class="math notranslate nohighlight">\(k\)</span>.</p>
<div class="proof definition admonition" id="definition-1">
<p class="admonition-title"><span class="caption-number">Definition 85 </span> (Random Fourier Features)</p>
<p class="admonition-title"><span class="caption-number">Definition 89 </span> (Random Fourier Features)</p>
<section class="definition-content" id="proof-content">
<p>Given a translation invariant kernel <span class="math notranslate nohighlight">\(k\)</span> that is the Fourier transform of a probability measure <span class="math notranslate nohighlight">\(p\)</span>, we have the unbiased real-valued estimator</p>
<div class="math notranslate nohighlight">
<h3>Rates of convergence<a class="headerlink" href="#rates-of-convergence" title="Link to this heading">#</a></h3>
<p>Now there remains the question of how large the error of the RFF estimator is. In other words, how closely does RFF estimate the exact kernel <span class="math notranslate nohighlight">\(k\)</span>? Since <span class="math notranslate nohighlight">\(-\sqrt{2} \leq z_{\omega, \phi} \leq \sqrt{2}\)</span>, we can use Hoeffding’s inequality<span id="id2">[<a class="reference internal" href="#id10" title="Geoffrey Grimmett, David Stirzaker. Probability and random processes. Oxford University Press, 2020.">Grimmett, 2020</a>]</span> to obtain the following high-probability bound on the absolute error on our estimate of <span class="math notranslate nohighlight">\(k\)</span>.</p>
<div class="proof lemma admonition" id="lemma-2">
<p class="admonition-title"><span class="caption-number">Lemma 16 </span> (Hoeffding for RFF)</p>
<p class="admonition-title"><span class="caption-number">Lemma 19 </span> (Hoeffding for RFF)</p>
<section class="lemma-content" id="proof-content">
<p>The RFF estimator of <span class="math notranslate nohighlight">\(k\)</span>, using <span class="math notranslate nohighlight">\(M\)</span> pairs of <span class="math notranslate nohighlight">\(\omega, \phi\)</span>, obeys</p>
<div class="math notranslate nohighlight">
Note that this is a statement about the closeness of <span class="math notranslate nohighlight">\(z^\top(x)z(y)\)</span> and <span class="math notranslate nohighlight">\(k(x, y)\)</span> for any two input pairs, rather than the closeness of these functions over the whole input space.
In fact, it is possible<span id="id3">[<a class="reference internal" href="#id11" title="Ali Rahimi, Benjamin Recht, and others. Random features for large-scale kernel machines. In NIPS. 2007.">Rahimi <em>et al.</em>, 2007</a>]</span> to make a stronger statement about the uniform convergence of the estimator.</p>
<div class="proof lemma admonition" id="lemma-3">
<p class="admonition-title"><span class="caption-number">Lemma 17 </span> (Uniform convergence of RFF)</p>
<p class="admonition-title"><span class="caption-number">Lemma 20 </span> (Uniform convergence of RFF)</p>
<section class="lemma-content" id="proof-content">
<p>Let <span class="math notranslate nohighlight">\(\mathcal{M}\)</span> be a compact subset of <span class="math notranslate nohighlight">\(\mathbb{R}^D\)</span>. Then the RFF estimator of <span class="math notranslate nohighlight">\(k\)</span>, using <span class="math notranslate nohighlight">\(M\)</span> pairs of <span class="math notranslate nohighlight">\(\omega, \phi\)</span> converges uniformly to <span class="math notranslate nohighlight">\(k\)</span> according to</p>
<div class="math notranslate nohighlight">