
Commit

Update documentation
stratisMarkou committed Dec 24, 2024
1 parent 5b00b06 commit d02b3a8
Showing 9 changed files with 178 additions and 11 deletions.
74 changes: 74 additions & 0 deletions _sources/book/topology/001-metric-spaces.md
@@ -392,4 +392,78 @@ We will show that there exists $n \in \mathbb{N}$ such that $B_{1 / n}(x) \subseteq X \setminus C.$
Suppose that this is not the case.
Then, for each $n \in \mathbb{N},$ we have $B_{1 / n}(x) \cap C \neq \emptyset,$ so we can choose $x_n \in B_{1 / n}(x) \cap C.$
Then, $x_n \to x,$ so $x$ is a limit point of $C$ and since $C$ contains all its limit points, it must also contain $x,$ which contradicts the assumption that $x \in X \setminus C.$
:::


Now we turn to the main result of the first part of the course.
This result shows that what determines whether a function is continuous is not the metric itself, but rather the collection of sets that are open under the metric.
In particular, even if two metrics are different, if they define the same open sets, then exactly the same functions are continuous under both of them.
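
For instance (a standard example, added here for illustration), the usual metric and the bounded metric on $\mathbb{R},$

$$
d(x, y) = |x - y|, \qquad d'(x, y) = \min\{|x - y|, 1\},
$$

are different metrics, but for every radius $r \leq 1$ they produce exactly the same open balls, so they define the same open sets and hence the same continuous functions.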

:::{prf:theorem} Characterisation of continuity
:label: topology:theorem-characterisation-of-continuity
Let $(X, d_X)$ and $(Y, d_Y)$ be {prf:ref}`metric spaces<topology:def-metric-space>` and $f: X \to Y$ be a function.
Then, the following are equivalent:

1. $f$ is {prf:ref}`continuous<topology:def-continuous-function>`,
2. $f(x_n) \to f(x)$ in $Y$ whenever $x_n \to x$ in $X,$
3. For every {prf:ref}`open set<topology:def-open-and-closed-subsets>` $U \subseteq Y,$ the preimage $f^{-1}(U)$ is open in $X,$
4. For every {prf:ref}`closed set<topology:def-open-and-closed-subsets>` $C \subseteq Y,$ the preimage $f^{-1}(C)$ is closed in $X,$
5. For every $x \in X$ and $\epsilon > 0,$ there exists $\delta > 0$ such that $f(B_\delta(x)) \subseteq B_\epsilon(f(x)).$
:::
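
As a concrete check of criterion 5 (an illustration added here, not part of the original text), consider $f : \mathbb{R} \to \mathbb{R}$ with $f(x) = 2x$ and the usual metric on both sides.
Given $x \in \mathbb{R}$ and $\epsilon > 0,$ choosing $\delta = \epsilon / 2$ gives

$$
f(B_\delta(x)) = (2x - \epsilon, 2x + \epsilon) = B_\epsilon(f(x)),
$$

so criterion 5 holds at every point, and the theorem then guarantees that the remaining characterisations hold as well.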

:::{dropdown} Proof: Characterisation of continuity
We break the proof down into a series of implications.

($1 \iff 2$) This is the definition of {prf:ref}`continuity<topology:def-continuous-function>`.

($2 \Rightarrow 3$) Suppose $f(x_n) \to f(x)$ in $Y$ whenever $x_n \to x$ in $X.$
Suppose $U \subseteq Y$ is open but $f^{-1}(U)$ is not.
Then, there exists $x \in f^{-1}(U)$ such that for every $n \in \mathbb{N}$ there exists $x_n \in B_{1 / n}(x)$ such that $x_n \notin f^{-1}(U).$
This implies that $f(x_n) \notin U$ for all $n \in \mathbb{N}.$
However, $x_n \to x$ in $X,$ so $f(x_n) \to f(x)$ in $Y.$
This is a contradiction: since $U$ is open and $f(x) \in U,$ any sequence that converges to $f(x)$ must {prf:ref}`eventually be<topology:lemma-convergence-implies-sequence-eventually-in-open-neighbourhood>` in $U$.

($3 \Rightarrow 4$) Suppose that for every open set $U \subseteq Y,$ the preimage $f^{-1}(U)$ is open in $X.$
Then, for every closed set $C \subseteq Y,$ the set $Y \setminus C$ is open, so its preimage $f^{-1}(Y \setminus C)$ is open, and therefore $f^{-1}(C) = X \setminus f^{-1}(Y \setminus C)$ is closed in $X.$

($4 \Rightarrow 5$) Suppose that for every closed set $C \subseteq Y,$ the preimage $f^{-1}(C)$ is closed in $X.$
Let $x \in X$ and $\epsilon > 0.$
Since $Y \setminus B_\epsilon(f(x))$ is closed, $f^{-1}(Y \setminus B_\epsilon(f(x))) = X \setminus f^{-1}(B_\epsilon(f(x)))$ is closed.
Therefore, the set $f^{-1}(B_\epsilon(f(x)))$ is open, and since it contains $x,$ there exists $\delta > 0$ such that $B_\delta(x) \subseteq f^{-1}(B_\epsilon(f(x))),$ which implies that $f(B_\delta(x)) \subseteq B_\epsilon(f(x)).$

($5 \Rightarrow 2$) Suppose that for every $x \in X$ and $\epsilon > 0,$ there exists $\delta > 0$ such that $f(B_\delta(x)) \subseteq B_\epsilon(f(x)).$
Suppose $(x_n)$ is a sequence in $X$ such that $x_n \to x.$
Let $\epsilon > 0.$
Then, there exists $\delta > 0$ such that $f(B_\delta(x)) \subseteq B_\epsilon(f(x)).$
Since $x_n \to x,$ there exists $N \in \mathbb{N}$ such that $x_n \in B_\delta(x)$ for all $n > N,$ so $f(x_n) \in f(B_\delta(x)) \subseteq B_\epsilon(f(x))$ for all $n > N,$ which implies that $f(x_n) \to f(x).$
:::
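
As a quick illustration of criterion 3 (an example added here, not part of the original proof), take $f : \mathbb{R} \to \mathbb{R}$ with $f(x) = x^2$ and the usual metric.
The preimage of the open interval $(1, 4)$ is

$$
f^{-1}\big((1, 4)\big) = (-2, -1) \cup (1, 2),
$$

which is open, as criterion 3 requires.
By contrast, the discontinuous step function $g$ with $g(x) = 0$ for $x < 0$ and $g(x) = 1$ for $x \geq 0$ has $g^{-1}\big(\big(\tfrac{1}{2}, \tfrac{3}{2}\big)\big) = [0, \infty),$ which is not open.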


We conclude with three properties of open sets that we will use to define topologies in the next section.

:::{prf:lemma} Properties of open sets
:label: topology:lemma-properties-of-open-sets
Let $(X, d_X)$ be a {prf:ref}`metric space<topology:def-metric-space>`.
Then

1. The empty set $\emptyset$ and $X$ are open,
2. If $\{U_i\}_{i \in I}$ is a collection of open sets, then $\bigcup_{i \in I} U_i$ is open,
3. If $U_1, \ldots, U_N$ are open sets, then $\bigcap_{n = 1}^N U_n$ is open.
:::

:::{dropdown} Proof: Properties of open sets
__Property 1:__
The empty set is open vacuously.
The whole space $X$ is open because for every $x \in X$ and any $r > 0,$ we have $B_r(x) \subseteq X.$

__Property 2:__
Let $\{U_i\}_{i \in I}$ be a collection of open sets.
Suppose $x \in \bigcup_{i \in I} U_i.$
Then, there exists $i \in I$ such that $x \in U_i,$ so there exists $r > 0$ such that $B_r(x) \subseteq U_i \subseteq \bigcup_{i \in I} U_i,$ so $\bigcup_{i \in I} U_i$ is open.

__Property 3:__
Let $U_1, \ldots, U_N$ be open sets.
Suppose $x \in \bigcap_{n = 1}^N U_n.$
Then, $x \in U_n$ for all $n = 1, \ldots, N,$ so there exists $r_n > 0$ such that $B_{r_n}(x) \subseteq U_n$ for all $n = 1, \ldots, N.$
Taking $r = \min\{r_1, \ldots, r_N\},$ we have that $B_r(x) \subseteq U_n$ for all $n = 1, \ldots, N,$ so $\bigcap_{n = 1}^N U_n$ is open.
:::
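
A standard remark, added here for emphasis: the finiteness in property 3 cannot be dropped. For instance, in $\mathbb{R}$ with the usual metric,

$$
\bigcap_{n = 1}^{\infty} \left( -\tfrac{1}{n}, \tfrac{1}{n} \right) = \{0\},
$$

which is not open, even though each interval $\left( -\tfrac{1}{n}, \tfrac{1}{n} \right)$ is.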
2 changes: 1 addition & 1 deletion book/papers/ais/ais.html
@@ -590,7 +590,7 @@ <h2>Importance sampling<a class="headerlink" href="#importance-sampling" title="
It is reasonable to expect that the more dissimilar <span class="math notranslate nohighlight">\(q\)</span> and <span class="math notranslate nohighlight">\(p\)</span> are, the larger the variance will be.
In particular, we can show that the variance of the importance weights can be lower bounded by a quantity that scales exponentially with the KL divergence.</p>
<div class="proof lemma admonition" id="lemma-0">
<p class="admonition-title"><span class="caption-number">Lemma 14 </span> (Lower bound to importance weight variance)</p>
<p class="admonition-title"><span class="caption-number">Lemma 15 </span> (Lower bound to importance weight variance)</p>
<section class="lemma-content" id="proof-content">
<p>Given distributions <span class="math notranslate nohighlight">\(p\)</span> and <span class="math notranslate nohighlight">\(q\)</span>, it holds that</p>
<div class="math notranslate nohighlight">
2 changes: 1 addition & 1 deletion book/papers/num-sde/num-sde.html
@@ -838,7 +838,7 @@ <h2>Stochastic chain rule<a class="headerlink" href="#stochastic-chain-rule" tit
\end{align}\]</div>
<p>For an autonomous SDE however the chain rule takes a different form, which under the Ito definition is as follows.</p>
<div class="proof theorem admonition" id="theorem-5">
<p class="admonition-title"><span class="caption-number">Theorem 95 </span> (Ito’s result for one dimension)</p>
<p class="admonition-title"><span class="caption-number">Theorem 96 </span> (Ito’s result for one dimension)</p>
<section class="theorem-content" id="proof-content">
<p>Let <span class="math notranslate nohighlight">\(X_t\)</span> be an Ito process given by</p>
<div class="math notranslate nohighlight">
6 changes: 3 additions & 3 deletions book/papers/rff/rff.html
@@ -468,7 +468,7 @@ <h1>Random Fourier features<a class="headerlink" href="#random-fourier-features"
<h2>The RFF approximation<a class="headerlink" href="#the-rff-approximation" title="Link to this heading">#</a></h2>
<p>The starting point for deriving RFF is Bochner’s theorem, which relates stationary kernels with probability distributions over frequencies via the Fourier transform.</p>
<div class="proof theorem admonition" id="theorem-0">
<p class="admonition-title"><span class="caption-number">Theorem 90 </span> (Bochner’s theorem)</p>
<p class="admonition-title"><span class="caption-number">Theorem 91 </span> (Bochner’s theorem)</p>
<section class="theorem-content" id="proof-content">
<p>A continuous function of the form <span class="math notranslate nohighlight">\(k(x, y) = k(x - y)\)</span> is positive definite if and only if <span class="math notranslate nohighlight">\(k(\delta)\)</span> is the Fourier transform of a non-negative measure.</p>
</section>
@@ -535,7 +535,7 @@ <h3>RFF and Bayesian regression<a class="headerlink" href="#rff-and-bayesian-reg
<h3>Rates of convergence<a class="headerlink" href="#rates-of-convergence" title="Link to this heading">#</a></h3>
<p>Now there remains the question of how large the error of the RFF estimator is. In other words, how closely does RFF estimate the exact kernel <span class="math notranslate nohighlight">\(k\)</span>? Since <span class="math notranslate nohighlight">\(-\sqrt{2} \leq z_{\omega, \phi} \leq \sqrt{2}\)</span>, we can use Hoeffding’s inequality<span id="id2">[<a class="reference internal" href="#id10" title="David Grimmett, Geoffrey Stirzaker. Probability and random processes. Oxford university press, 2020.">Grimmett, 2020</a>]</span> to obtain the following high-probability bound on the absolute error on our estimate of <span class="math notranslate nohighlight">\(k\)</span>.</p>
<div class="proof lemma admonition" id="lemma-2">
<p class="admonition-title"><span class="caption-number">Lemma 15 </span> (Hoeffding for RFF)</p>
<p class="admonition-title"><span class="caption-number">Lemma 16 </span> (Hoeffding for RFF)</p>
<section class="lemma-content" id="proof-content">
<p>The RFF estimator of <span class="math notranslate nohighlight">\(k\)</span>, using <span class="math notranslate nohighlight">\(M\)</span> pairs of <span class="math notranslate nohighlight">\(\omega, \phi\)</span>, obeys</p>
<div class="math notranslate nohighlight">
@@ -547,7 +547,7 @@ <h3>Rates of convergence<a class="headerlink" href="#rates-of-convergence" title
Note that this is a statement about the closeness of <span class="math notranslate nohighlight">\(z^\top(x)z(y)\)</span> and <span class="math notranslate nohighlight">\(k(x, y)\)</span> for any two input pairs, rather than the closeness of these functions over the whole input space.
In fact, it is possible<span id="id3">[<a class="reference internal" href="#id11" title="Ali Rahimi, Benjamin Recht, and others. Random features for large-scale kernel machines. In NIPS. 2007.">Rahimi <em>et al.</em>, 2007</a>]</span> to make a stronger statement about the uniform convergence of the estimator.</p>
<div class="proof lemma admonition" id="lemma-3">
<p class="admonition-title"><span class="caption-number">Lemma 16 </span> (Uniform convergence of RFF)</p>
<p class="admonition-title"><span class="caption-number">Lemma 17 </span> (Uniform convergence of RFF)</p>
<section class="lemma-content" id="proof-content">
<p>Let <span class="math notranslate nohighlight">\(\mathcal{M}\)</span> be a compact subset of <span class="math notranslate nohighlight">\(\mathbb{R}^D\)</span>. Then the RFF estimator of <span class="math notranslate nohighlight">\(k\)</span>, using <span class="math notranslate nohighlight">\(M\)</span> pairs of <span class="math notranslate nohighlight">\(\omega, \phi\)</span> converges uniformly to <span class="math notranslate nohighlight">\(k\)</span> according to</p>
<div class="math notranslate nohighlight">
4 changes: 2 additions & 2 deletions book/papers/score-matching/score-matching.html
@@ -483,7 +483,7 @@ <h2>The score matching trick<a class="headerlink" href="#the-score-matching-tric
So we might expect that in this case the model distribution <span class="math notranslate nohighlight">\(p_\theta(x)\)</span> and <span class="math notranslate nohighlight">\(p_d(x)\)</span> will also be equal.
This intuition is formalised by the following result.</p>
<div class="proof theorem admonition" id="theorem-1">
<p class="admonition-title"><span class="caption-number">Theorem 91 </span> (Matching scores <span class="math notranslate nohighlight">\(\iff\)</span> matching distributions)</p>
<p class="admonition-title"><span class="caption-number">Theorem 92 </span> (Matching scores <span class="math notranslate nohighlight">\(\iff\)</span> matching distributions)</p>
<section class="theorem-content" id="proof-content">
<p>Suppose that the probability density function of <span class="math notranslate nohighlight">\(x\)</span> satisfies <span class="math notranslate nohighlight">\(p_d(x) = p_\theta(x)\)</span> for some <span class="math notranslate nohighlight">\(\theta^*\)</span> and also that if <span class="math notranslate nohighlight">\(\theta^* \neq \theta\)</span> then <span class="math notranslate nohighlight">\(p_\theta(x) \neq p_d(x)\)</span>.
Suppose also that <span class="math notranslate nohighlight">\(p_\theta(x) &gt; 0\)</span>. Then</p>
@@ -536,7 +536,7 @@ <h2>The score matching trick<a class="headerlink" href="#the-score-matching-tric
Therefore this term must be considered in the optimisation of <span class="math notranslate nohighlight">\(J(\theta)\)</span> but is also not directly computable.
By using integration by parts, one can show<span id="id2">[<a class="reference internal" href="#id14" title="Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 2005.">Hyvärinen and Dayan, 2005</a>]</span> that this term can be rewritten in a way such that <span class="math notranslate nohighlight">\(J(\theta)\)</span> can be estimated empirically.</p>
<div class="proof theorem admonition" id="theorem-2">
<p class="admonition-title"><span class="caption-number">Theorem 92 </span> (Equivalent form of <span class="math notranslate nohighlight">\(J\)</span>)</p>
<p class="admonition-title"><span class="caption-number">Theorem 93 </span> (Equivalent form of <span class="math notranslate nohighlight">\(J\)</span>)</p>
<section class="theorem-content" id="proof-content">
<p>Let <span class="math notranslate nohighlight">\(\psi_\theta(x)\)</span> be a score function which is differentiable with respect to <span class="math notranslate nohighlight">\(x.\)</span>
Then, under some weak regularity conditions on <span class="math notranslate nohighlight">\(\psi_\theta(x),\)</span> the score-matching function <span class="math notranslate nohighlight">\(J\)</span> can be written as</p>
4 changes: 2 additions & 2 deletions book/papers/svgd/svgd.html
@@ -483,7 +483,7 @@ <h3>Invertible transformations<a class="headerlink" href="#invertible-transforma
<h3>Direction of steepest descent<a class="headerlink" href="#direction-of-steepest-descent" title="Link to this heading">#</a></h3>
<p>Let us use the subscript notation <span class="math notranslate nohighlight">\(q_{[T]}\)</span> to denote the distribution obtained by passing <span class="math notranslate nohighlight">\(q\)</span> through <span class="math notranslate nohighlight">\(T\)</span>. Then we are interested in picking a <span class="math notranslate nohighlight">\(T\)</span> which minimises <span class="math notranslate nohighlight">\(\text{KL}(q_{[T]} || p)\)</span>. First, we compute the derivative of the KL w.r.t. <span class="math notranslate nohighlight">\(\epsilon\)</span>, which we obtain in closed form.</p>
<div class="proof theorem admonition" id="theorem-0">
<p class="admonition-title"><span class="caption-number">Theorem 93 </span> (Proof: Gradient of KL is the KSD)</p>
<p class="admonition-title"><span class="caption-number">Theorem 94 </span> (Proof: Gradient of KL is the KSD)</p>
<section class="theorem-content" id="proof-content">
<p>Let <span class="math notranslate nohighlight">\(x \sim q(x)\)</span>, and <span class="math notranslate nohighlight">\(T(x) = x + \epsilon \phi(x)\)</span>, where <span class="math notranslate nohighlight">\(\phi\)</span> is a smooth function. Then</p>
<div class="math notranslate nohighlight">
@@ -551,7 +551,7 @@ <h3>Direction of steepest descent<a class="headerlink" href="#direction-of-steep
\end{align}\]</div>
<p>If we now constrain <span class="math notranslate nohighlight">\(\phi \in \mathcal{H}_D\)</span> and <span class="math notranslate nohighlight">\(|| \phi ||_{\mathcal{H}_D} \leq 1\)</span> we obtain<span id="id2">[<a class="reference internal" href="#id13" title="Qiang Liu, Jason Lee, and Michael Jordan. A kernelized stein discrepancy for goodness-of-fit tests. In International conference on machine learning, 276–284. PMLR, 2016.">Liu <em>et al.</em>, 2016</a>]</span> the following analytic expression for the direction of steepest descent.</p>
<div class="proof theorem admonition" id="theorem-1">
<p class="admonition-title"><span class="caption-number">Theorem 94 </span> (Direction of steepest descent)</p>
<p class="admonition-title"><span class="caption-number">Theorem 95 </span> (Direction of steepest descent)</p>
<section class="theorem-content" id="proof-content">
<p>The function <span class="math notranslate nohighlight">\(\phi^* \in \mathcal{H}_D, || \phi^* ||_{\mathcal{H}_D} \leq 1\)</span> which maximises the rate of decrease KL-divergence is</p>
<div class="math notranslate nohighlight">