
Commit

Update documentation
stratisMarkou committed Dec 24, 2024
1 parent 5b00b06 commit d02b3a8
Showing 9 changed files with 178 additions and 11 deletions.
74 changes: 74 additions & 0 deletions _sources/book/topology/001-metric-spaces.md
@@ -392,4 +392,78 @@ We will show that there exists $n \in \mathbb{N}$ such that $B_{1 / n}(x) \subseteq X \setminus C.$
Suppose that this is not the case.
Then, for each $n \in \mathbb{N},$ we have $B_{1 / n}(x) \cap C \neq \emptyset,$ so we can choose $x_n \in B_{1 / n}(x) \cap C.$
Then, $x_n \to x,$ so $x$ is a limit point of $C$ and since $C$ contains all its limit points, it must also contain $x,$ which contradicts the assumption that $x \in X \setminus C.$
:::


Now we turn to the main result of the first part of the course.
This result shows that what determines whether a function is continuous is not the metric itself, but rather the collection of sets that are open under the metric.
In particular, even if two metrics are different, if they define the same open sets, then exactly the same functions are continuous under both of them.
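
For instance (a standard example, added here for illustration), the usual metric and the bounded metric on $\mathbb{R},$

$$
d(x, y) = |x - y|, \qquad d'(x, y) = \min\{|x - y|, 1\},
$$

are different metrics, but for every radius $r \leq 1$ they produce exactly the same open balls, so they define the same open sets and hence the same continuous functions.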

:::{prf:theorem} Characterisation of continuity
:label: topology:theorem-characterisation-of-continuity
Let $(X, d_X)$ and $(Y, d_Y)$ be {prf:ref}`metric spaces<topology:def-metric-space>` and $f: X \to Y$ be a function.
Then, the following are equivalent:

1. $f$ is {prf:ref}`continuous<topology:def-continuous-function>`,
2. $f(x_n) \to f(x)$ in $Y$ whenever $x_n \to x$ in $X,$
3. For every {prf:ref}`open set<topology:def-open-and-closed-subsets>` $U \subseteq Y,$ the preimage $f^{-1}(U)$ is open in $X,$
4. For every {prf:ref}`closed set<topology:def-open-and-closed-subsets>` $C \subseteq Y,$ the preimage $f^{-1}(C)$ is closed in $X,$
5. For every $x \in X$ and $\epsilon > 0,$ there exists $\delta > 0$ such that $f(B_\delta(x)) \subseteq B_\epsilon(f(x)).$
:::
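
As a concrete check of criterion 5 (an illustration added here, not part of the original text), consider $f : \mathbb{R} \to \mathbb{R}$ with $f(x) = 2x$ and the usual metric on both sides.
Given $x \in \mathbb{R}$ and $\epsilon > 0,$ choosing $\delta = \epsilon / 2$ gives

$$
f(B_\delta(x)) = (2x - \epsilon, 2x + \epsilon) = B_\epsilon(f(x)),
$$

so criterion 5 holds at every point, and the theorem then guarantees that the remaining characterisations hold as well.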

:::{dropdown} Proof: Characterisation of continuity
We break the proof down into a series of implications.

($1 \iff 2$) This is the definition of {prf:ref}`continuity<topology:def-continuous-function>`.

($2 \Rightarrow 3$) Suppose $f(x_n) \to f(x)$ in $Y$ whenever $x_n \to x$ in $X.$
Suppose $U \subseteq Y$ is open but $f^{-1}(U)$ is not.
Then, there exists $x \in f^{-1}(U)$ such that for every $n \in \mathbb{N}$ there exists $x_n \in B_{1 / n}(x)$ such that $x_n \notin f^{-1}(U).$
This implies that $f(x_n) \notin U$ for all $n \in \mathbb{N}.$
However, $x_n \to x$ in $X,$ so $f(x_n) \to f(x)$ in $Y.$
This is a contradiction: since $U$ is open and $f(x) \in U,$ any sequence that converges to $f(x)$ must {prf:ref}`eventually be<topology:lemma-convergence-implies-sequence-eventually-in-open-neighbourhood>` in $U$.

($3 \Rightarrow 4$) Suppose that for every open set $U \subseteq Y,$ the preimage $f^{-1}(U)$ is open in $X.$
Then, for every closed set $C \subseteq Y,$ the set $Y \setminus C$ is open, so its preimage $f^{-1}(Y \setminus C)$ is open, and therefore $f^{-1}(C) = X \setminus f^{-1}(Y \setminus C)$ is closed in $X.$

($4 \Rightarrow 5$) Suppose that for every closed set $C \subseteq Y,$ the preimage $f^{-1}(C)$ is closed in $X.$
Let $x \in X$ and $\epsilon > 0.$
Since $Y \setminus B_\epsilon(f(x))$ is closed, $f^{-1}(Y \setminus B_\epsilon(f(x))) = X \setminus f^{-1}(B_\epsilon(f(x)))$ is closed.
Therefore, the set $f^{-1}(B_\epsilon(f(x)))$ is open, and since it contains $x,$ there exists $\delta > 0$ such that $B_\delta(x) \subseteq f^{-1}(B_\epsilon(f(x))),$ which implies that $f(B_\delta(x)) \subseteq B_\epsilon(f(x)).$

($5 \Rightarrow 2$) Suppose that for every $x \in X$ and $\epsilon > 0,$ there exists $\delta > 0$ such that $f(B_\delta(x)) \subseteq B_\epsilon(f(x)).$
Suppose $(x_n)$ is a sequence in $X$ such that $x_n \to x.$
Let $\epsilon > 0.$
Then, there exists $\delta > 0$ such that $f(B_\delta(x)) \subseteq B_\epsilon(f(x)).$
Since $x_n \to x,$ there exists $N \in \mathbb{N}$ such that $x_n \in B_\delta(x)$ for all $n > N,$ so $f(x_n) \in f(B_\delta(x)) \subseteq B_\epsilon(f(x))$ for all $n > N,$ which implies that $f(x_n) \to f(x).$
:::
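
As a quick illustration of criterion 3 (an example added here, not part of the original proof), take $f : \mathbb{R} \to \mathbb{R}$ with $f(x) = x^2$ and the usual metric.
The preimage of the open interval $(1, 4)$ is

$$
f^{-1}\big((1, 4)\big) = (-2, -1) \cup (1, 2),
$$

which is open, as criterion 3 requires.
By contrast, the discontinuous step function $g$ with $g(x) = 0$ for $x < 0$ and $g(x) = 1$ for $x \geq 0$ has $g^{-1}\big(\big(\tfrac{1}{2}, \tfrac{3}{2}\big)\big) = [0, \infty),$ which is not open.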


We conclude with three properties of open sets that we will use to define topologies in the next section.

:::{prf:lemma} Properties of open sets
:label: topology:lemma-properties-of-open-sets
Let $(X, d_X)$ be a {prf:ref}`metric space<topology:def-metric-space>`.
Then

1. The empty set $\emptyset$ and $X$ are open,
2. If $\{U_i\}_{i \in I}$ is a collection of open sets, then $\bigcup_{i \in I} U_i$ is open,
3. If $U_1, \ldots, U_N$ are open sets, then $\bigcap_{n = 1}^N U_n$ is open.
:::

:::{dropdown} Proof: Properties of open sets
__Property 1:__
The empty set is open vacuously.
The whole space $X$ is open because for every $x \in X$ and any $r > 0,$ we have $B_r(x) \subseteq X.$

__Property 2:__
Let $\{U_i\}_{i \in I}$ be a collection of open sets.
Suppose $x \in \bigcup_{i \in I} U_i.$
Then, there exists $i \in I$ such that $x \in U_i,$ so there exists $r > 0$ such that $B_r(x) \subseteq U_i \subseteq \bigcup_{i \in I} U_i,$ so $\bigcup_{i \in I} U_i$ is open.

__Property 3:__
Let $U_1, \ldots, U_N$ be open sets.
Suppose $x \in \bigcap_{n = 1}^N U_n.$
Then, $x \in U_n$ for all $n = 1, \ldots, N,$ so there exists $r_n > 0$ such that $B_{r_n}(x) \subseteq U_n$ for all $n = 1, \ldots, N.$
Taking $r = \min\{r_1, \ldots, r_N\},$ we have that $B_r(x) \subseteq U_n$ for all $n = 1, \ldots, N,$ so $\bigcap_{n = 1}^N U_n$ is open.
:::
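
A standard remark, added here for emphasis: the finiteness in property 3 cannot be dropped. For instance, in $\mathbb{R}$ with the usual metric,

$$
\bigcap_{n = 1}^{\infty} \left( -\tfrac{1}{n}, \tfrac{1}{n} \right) = \{0\},
$$

which is not open, even though each interval $\left( -\tfrac{1}{n}, \tfrac{1}{n} \right)$ is.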
2 changes: 1 addition & 1 deletion book/papers/ais/ais.html
@@ -590,7 +590,7 @@ <h2>Importance sampling<a class="headerlink" href="#importance-sampling" title="
It is reasonable to expect that the more dissimilar <span class="math notranslate nohighlight">\(q\)</span> and <span class="math notranslate nohighlight">\(p\)</span> are, the larger the variance will be.
In particular, we can show that the variance of the importance weights can be lower bounded by a quantity that scales exponentially with the KL divergence.</p>
<div class="proof lemma admonition" id="lemma-0">
<p class="admonition-title"><span class="caption-number">Lemma 14 </span> (Lower bound to importance weight variance)</p>
<p class="admonition-title"><span class="caption-number">Lemma 15 </span> (Lower bound to importance weight variance)</p>
<section class="lemma-content" id="proof-content">
<p>Given distributions <span class="math notranslate nohighlight">\(p\)</span> and <span class="math notranslate nohighlight">\(q\)</span>, it holds that</p>
<div class="math notranslate nohighlight">
2 changes: 1 addition & 1 deletion book/papers/num-sde/num-sde.html
@@ -838,7 +838,7 @@ <h2>Stochastic chain rule<a class="headerlink" href="#stochastic-chain-rule" tit
\end{align}\]</div>
<p>For an autonomous SDE however the chain rule takes a different form, which under the Ito definition is as follows.</p>
<div class="proof theorem admonition" id="theorem-5">
<p class="admonition-title"><span class="caption-number">Theorem 95 </span> (Ito’s result for one dimension)</p>
<p class="admonition-title"><span class="caption-number">Theorem 96 </span> (Ito’s result for one dimension)</p>
<section class="theorem-content" id="proof-content">
<p>Let <span class="math notranslate nohighlight">\(X_t\)</span> be an Ito process given by</p>
<div class="math notranslate nohighlight">
6 changes: 3 additions & 3 deletions book/papers/rff/rff.html
@@ -468,7 +468,7 @@ <h1>Random Fourier features<a class="headerlink" href="#random-fourier-features"
<h2>The RFF approximation<a class="headerlink" href="#the-rff-approximation" title="Link to this heading">#</a></h2>
<p>The starting point for deriving RFF is Bochner’s theorem, which relates stationary kernels with probability distributions over frequencies via the Fourier transform.</p>
<div class="proof theorem admonition" id="theorem-0">
<p class="admonition-title"><span class="caption-number">Theorem 90 </span> (Bochner’s theorem)</p>
<p class="admonition-title"><span class="caption-number">Theorem 91 </span> (Bochner’s theorem)</p>
<section class="theorem-content" id="proof-content">
<p>A continuous function of the form <span class="math notranslate nohighlight">\(k(x, y) = k(x - y)\)</span> is positive definite if and only if <span class="math notranslate nohighlight">\(k(\delta)\)</span> is the Fourier transform of a non-negative measure.</p>
</section>
@@ -535,7 +535,7 @@ <h3>RFF and Bayesian regression<a class="headerlink" href="#rff-and-bayesian-reg
<h3>Rates of convergence<a class="headerlink" href="#rates-of-convergence" title="Link to this heading">#</a></h3>
<p>Now there remains the question of how large the error of the RFF estimator is. In other words, how closely does RFF estimate the exact kernel <span class="math notranslate nohighlight">\(k\)</span>? Since <span class="math notranslate nohighlight">\(-\sqrt{2} \leq z_{\omega, \phi} \leq \sqrt{2}\)</span>, we can use Hoeffding’s inequality<span id="id2">[<a class="reference internal" href="#id10" title="David Grimmett, Geoffrey Stirzaker. Probability and random processes. Oxford university press, 2020.">Grimmett, 2020</a>]</span> to obtain the following high-probability bound on the absolute error on our estimate of <span class="math notranslate nohighlight">\(k\)</span>.</p>
<div class="proof lemma admonition" id="lemma-2">
<p class="admonition-title"><span class="caption-number">Lemma 15 </span> (Hoeffding for RFF)</p>
<p class="admonition-title"><span class="caption-number">Lemma 16 </span> (Hoeffding for RFF)</p>
<section class="lemma-content" id="proof-content">
<p>The RFF estimator of <span class="math notranslate nohighlight">\(k\)</span>, using <span class="math notranslate nohighlight">\(M\)</span> pairs of <span class="math notranslate nohighlight">\(\omega, \phi\)</span>, obeys</p>
<div class="math notranslate nohighlight">
@@ -547,7 +547,7 @@ <h3>Rates of convergence<a class="headerlink" href="#rates-of-convergence" title
Note that this is a statement about the closeness of <span class="math notranslate nohighlight">\(z^\top(x)z(y)\)</span> and <span class="math notranslate nohighlight">\(k(x, y)\)</span> for any two input pairs, rather than the closeness of these functions over the whole input space.
In fact, it is possible<span id="id3">[<a class="reference internal" href="#id11" title="Ali Rahimi, Benjamin Recht, and others. Random features for large-scale kernel machines. In NIPS. 2007.">Rahimi <em>et al.</em>, 2007</a>]</span> to make a stronger statement about the uniform convergence of the estimator.</p>
<div class="proof lemma admonition" id="lemma-3">
<p class="admonition-title"><span class="caption-number">Lemma 16 </span> (Uniform convergence of RFF)</p>
<p class="admonition-title"><span class="caption-number">Lemma 17 </span> (Uniform convergence of RFF)</p>
<section class="lemma-content" id="proof-content">
<p>Let <span class="math notranslate nohighlight">\(\mathcal{M}\)</span> be a compact subset of <span class="math notranslate nohighlight">\(\mathbb{R}^D\)</span>. Then the RFF estimator of <span class="math notranslate nohighlight">\(k\)</span>, using <span class="math notranslate nohighlight">\(M\)</span> pairs of <span class="math notranslate nohighlight">\(\omega, \phi\)</span> converges uniformly to <span class="math notranslate nohighlight">\(k\)</span> according to</p>
<div class="math notranslate nohighlight">
4 changes: 2 additions & 2 deletions book/papers/score-matching/score-matching.html
@@ -483,7 +483,7 @@ <h2>The score matching trick<a class="headerlink" href="#the-score-matching-tric
So we might expect that in this case the model distribution <span class="math notranslate nohighlight">\(p_\theta(x)\)</span> and <span class="math notranslate nohighlight">\(p_d(x)\)</span> will also be equal.
This intuition is formalised by the following result.</p>
<div class="proof theorem admonition" id="theorem-1">
<p class="admonition-title"><span class="caption-number">Theorem 91 </span> (Matching scores <span class="math notranslate nohighlight">\(\iff\)</span> matching distributions)</p>
<p class="admonition-title"><span class="caption-number">Theorem 92 </span> (Matching scores <span class="math notranslate nohighlight">\(\iff\)</span> matching distributions)</p>
<section class="theorem-content" id="proof-content">
<p>Suppose that the probability density function of <span class="math notranslate nohighlight">\(x\)</span> satisfies <span class="math notranslate nohighlight">\(p_d(x) = p_\theta(x)\)</span> for some <span class="math notranslate nohighlight">\(\theta^*\)</span> and also that if <span class="math notranslate nohighlight">\(\theta^* \neq \theta\)</span> then <span class="math notranslate nohighlight">\(p_\theta(x) \neq p_d(x)\)</span>.
Suppose also that <span class="math notranslate nohighlight">\(p_\theta(x) &gt; 0\)</span>. Then</p>
@@ -536,7 +536,7 @@ <h2>The score matching trick<a class="headerlink" href="#the-score-matching-tric
Therefore this term must be considered in the optimisation of <span class="math notranslate nohighlight">\(J(\theta)\)</span> but is also not directly computable.
By using integration by parts, one can show<span id="id2">[<a class="reference internal" href="#id14" title="Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 2005.">Hyvärinen and Dayan, 2005</a>]</span> that this term can be rewritten in a way such that <span class="math notranslate nohighlight">\(J(\theta)\)</span> can be estimated empirically.</p>
<div class="proof theorem admonition" id="theorem-2">
<p class="admonition-title"><span class="caption-number">Theorem 92 </span> (Equivalent form of <span class="math notranslate nohighlight">\(J\)</span>)</p>
<p class="admonition-title"><span class="caption-number">Theorem 93 </span> (Equivalent form of <span class="math notranslate nohighlight">\(J\)</span>)</p>
<section class="theorem-content" id="proof-content">
<p>Let <span class="math notranslate nohighlight">\(\psi_\theta(x)\)</span> be a score function which is differentiable with respect to <span class="math notranslate nohighlight">\(x.\)</span>
Then, under some weak regularity conditions on <span class="math notranslate nohighlight">\(\psi_\theta(x),\)</span> the score-matching function <span class="math notranslate nohighlight">\(J\)</span> can be written as</p>
4 changes: 2 additions & 2 deletions book/papers/svgd/svgd.html
@@ -483,7 +483,7 @@ <h3>Invertible transformations<a class="headerlink" href="#invertible-transforma
<h3>Direction of steepest descent<a class="headerlink" href="#direction-of-steepest-descent" title="Link to this heading">#</a></h3>
<p>Let us use the subscript notation <span class="math notranslate nohighlight">\(q_{[T]}\)</span> to denote the distribution obtained by passing <span class="math notranslate nohighlight">\(q\)</span> through <span class="math notranslate nohighlight">\(T\)</span>. Then we are interested in picking a <span class="math notranslate nohighlight">\(T\)</span> which minimises <span class="math notranslate nohighlight">\(\text{KL}(q_{[T]} || p)\)</span>. First, we compute the derivative of the KL w.r.t. <span class="math notranslate nohighlight">\(\epsilon\)</span>, which we obtain in closed form.</p>
<div class="proof theorem admonition" id="theorem-0">
<p class="admonition-title"><span class="caption-number">Theorem 93 </span> (Proof: Gradient of KL is the KSD)</p>
<p class="admonition-title"><span class="caption-number">Theorem 94 </span> (Proof: Gradient of KL is the KSD)</p>
<section class="theorem-content" id="proof-content">
<p>Let <span class="math notranslate nohighlight">\(x \sim q(x)\)</span>, and <span class="math notranslate nohighlight">\(T(x) = x + \epsilon \phi(x)\)</span>, where <span class="math notranslate nohighlight">\(\phi\)</span> is a smooth function. Then</p>
<div class="math notranslate nohighlight">
@@ -551,7 +551,7 @@ <h3>Direction of steepest descent<a class="headerlink" href="#direction-of-steep
\end{align}\]</div>
<p>If we now constrain <span class="math notranslate nohighlight">\(\phi \in \mathcal{H}_D\)</span> and <span class="math notranslate nohighlight">\(|| \phi ||_{\mathcal{H}_D} \leq 1\)</span> we obtain<span id="id2">[<a class="reference internal" href="#id13" title="Qiang Liu, Jason Lee, and Michael Jordan. A kernelized stein discrepancy for goodness-of-fit tests. In International conference on machine learning, 276–284. PMLR, 2016.">Liu <em>et al.</em>, 2016</a>]</span> the following analytic expression for the direction of steepest descent.</p>
<div class="proof theorem admonition" id="theorem-1">
<p class="admonition-title"><span class="caption-number">Theorem 94 </span> (Direction of steepest descent)</p>
<p class="admonition-title"><span class="caption-number">Theorem 95 </span> (Direction of steepest descent)</p>
<section class="theorem-content" id="proof-content">
<p>The function <span class="math notranslate nohighlight">\(\phi^* \in \mathcal{H}_D, || \phi^* ||_{\mathcal{H}_D} \leq 1\)</span> which maximises the rate of decrease KL-divergence is</p>
<div class="math notranslate nohighlight">