diff --git a/_images/4c4c0f29cc22ddd0c694e31c7f01d15dc89190797a214d20199c3eef11d3e7c8.svg b/_images/4c4c0f29cc22ddd0c694e31c7f01d15dc89190797a214d20199c3eef11d3e7c8.svg new file mode 100644 index 00000000..2646cb25 --- /dev/null +++ b/_images/4c4c0f29cc22ddd0c694e31c7f01d15dc89190797a214d20199c3eef11d3e7c8.svg @@ -0,0 +1,8004 @@ + + + + + + + + 2024-12-17T15:45:32.764353 + image/svg+xml + + + Matplotlib v3.8.4, https://matplotlib.org \ No newline at end of file diff --git a/_images/5b555565b39c4a5f1f1d65d51458d0ce8dfec8a60ba62cc51cb301d287ecd9e5.svg b/_images/5b555565b39c4a5f1f1d65d51458d0ce8dfec8a60ba62cc51cb301d287ecd9e5.svg deleted file mode 100644 index 60876639..00000000 --- a/_images/5b555565b39c4a5f1f1d65d51458d0ce8dfec8a60ba62cc51cb301d287ecd9e5.svg +++ /dev/null @@ -1,10232 +0,0 @@ - - - - - - - - 2024-05-25T08:15:57.399962 - image/svg+xml - - - Matplotlib v3.8.4, https://matplotlib.org \ No newline at end of file diff --git a/_sources/book/mira/000-exercises.md b/_sources/book/mira/000-exercises.md index 17800ba8..e01912a4 100644 --- a/_sources/book/mira/000-exercises.md +++ b/_sources/book/mira/000-exercises.md @@ -7,7 +7,7 @@ Follow This page gives solutions to the exercises from the book Measure, Integration and Real Analysis by Sheldon Axler. -We have been working through the book and exercises with Adrian Goldwaser, Bruno Mlodozeniec and Shreyas Padhy, and these solutions are a joint effort. +We have been working through the book and exercises with Adrian Goldwaser and Shreyas Padhy, and these solutions are a joint effort. Please [email me](mailto:stratismar@gmail.com) if you find any errors in these solutions or have any other comments. ## Chapter 1.A @@ -947,74 +947,6 @@ Putting these results together arrive at the required conclusion. 
:::: -::::{admonition} Excercise 2.A.8 -:class: tip -Prove that if $A \subseteq \mathbb{R}$ and $t > 0,$ then $|A| = |A \cap (-t, t)| + |A \cap (\mathbb{R} \setminus (-t, t))|.$ - -:::{dropdown} Solution -First, by the {prf:ref}`countable subadditivity of the outer measure` we have - -$$|A| \leq |A \cap (-t, t)| + |A \cap (\mathbb{R} \setminus (-t, t))|$$ - -for all $t > 0.$ -To prove the inequality the other way, suppose $I_1, I_2, \dots$ is a sequence of open intervals whose union contains $A.$ -Then, we have - -$$\begin{align} -\sum_{n = 1}^\infty \ell(I_n) &= \sum_{n = 1}^\infty \ell(I_n \cap (-t, t)) + \ell(I_n \cap (-\infty, t)) + \ell(I_n \cap (\mathbb{R} \setminus (t, \infty))) \\ -&\geq |A \cap (-t, t)| + |A \cap (\mathbb{R} \setminus (-t, t))| -\end{align}$$ - -where we have used the fact that the sequence of sets - -$$I_1 \cap (-\infty, t], I_1 \cap [t, \infty), I_2 \cap (-\infty, t], \dots$$ - -has a union that contains $A \cap (\mathbb{R} \setminus (-t, t)),$ and the outer measures of these sets are equal to - -$$\ell(I_1 \cap (-\infty, t)), \ell(I_1 \cap (t, \infty)), \ell(I_2 \cap (-\infty, t)), \dots,$$ - -completing the proof. -::: -:::: - - -::::{admonition} Exercise 2.A.9 -:class: tip -Prove that $|A| = \lim_{t \to \infty} |A \cap (-t, t)|$ for all $A \subseteq \mathbb{R}.$ - -:::{dropdown} Solution -First, by the {prf:ref}`countable subadditivity of the outer measure` we have - -$$\begin{align} -|A| &= \left|\bigcup_{n = 1}^\infty (A \cap (n-1, n)) \cup (A \cap (-n, -n + 1)) \cup \{-n-1, n-1\} \right| \\ -&\leq \sum_{n = 1}^\infty \left|(A \cap (n-1, n)) \cup (A \cap (-n, -n+1)) \cup \{-n-1, n-1\} \right| \\ -&= \lim_{N \to \infty} \sum_{n = 1}^N \left|(A \cap (n-1, n)) \cup (A \cap (-n, -n+1)) \cup \{-n-1, n-1\} \right| \\ -&= \lim_{N \to \infty} \left|A \cap (-N, N)\right|. 
-\end{align}$$ - -Note that since $\left|A \cap (-t, t)\right|$ is non-decreasing in $t \in \mathbb{Z}^+,$ the limit above is unchanged even if $N \not \in \mathbb{Z}^+.$ -In addition, we have $|A| \geq |A \cap (-t, t)|$ for all $t \in \mathbb{R},$ and putting these two inequalities together, we conclude that - -$$|A| = \lim_{t \to \infty} |A \cap (-t, t)|.$$ -::: -:::: - - -::::{admonition} Exercise 2.A.10 -:class: tip -Prove that $|[0, 1] \setminus \mathbb{Q}| = 1.$ - -:::{dropdown} Solution -First, by the {prf:ref}`countable subadditivity of the outer measure` we have - -$$\begin{equation} -|[0, 1] \setminus \mathbb{Q}| \geq |[0, 1]| - |\mathbb{Q}| = |[0, 1]| = 1, -\end{equation}$$ - -where we have used the fact that {prf:ref}`countable sets have measure zero`. -Using the fact that the {prf:ref}`outer measure preserves order` we have $|[0, 1] \setminus \mathbb{Q}| \leq |[0, 1]| = 1,$ concluding the proof. -::: -:::: ## Chapter 2.C @@ -1286,6 +1218,7 @@ $$\lim_{n \to \infty} \mu(E_n) = \lim_{n \to \infty} \sum_{k = n}^{\infty} \frac ::::{admonition} Exercise 2.C.11 :class: tip + Suppose $(X, S, \mu)$ is a measure space and $C, D, E \in S$ are such that $$\mu(C \cap D) < \infty, \mu(C \cap E) < \infty, \mu(D \cap E) < \infty.$$ @@ -1293,6 +1226,7 @@ $$\mu(C \cap D) < \infty, \mu(C \cap E) < \infty, \mu(D \cap E) < \infty.$$ Find and prove a formula for $\mu(C \cup D \cup E)$ in terms of $\mu(C),$ $\mu(D),$ $\mu(E),$ $\mu(C \cap D),$ $\mu(C \cap E),$ $\mu(D \cap E),$ and $\mu(C \cap D \cap E).$ :::{dropdown} Solution + Suppose $(X, S, \mu)$ is a measure space and $C, D, E \in S$ are such that $$\mu(C \cap D) < \infty, \mu(C \cap E) < \infty, \mu(D \cap E) < \infty.$$ @@ -1306,19 +1240,24 @@ $$\begin{align} &= \mu(C) + \mu(D) + \mu(E) - \mu(C \cap D) - \mu(C \cap E) - \mu(D \cap E) + \\ &~~~~+ \mu(C \cap D \cap E). 
\end{align}$$ + ::: :::: + ::::{admonition} Exercise 2.C.12 :class: tip -Suppose $X$ is a set and $S$ is the $\sigma$-algebra of all subsets $E$ of $X$ such that $E$ is countable or $X \setminus E$ is countable. + +Suppose $X$ is a set and $S$ is the $\sigma$-algebra of all subsets $E$ of $X$ such that $E$ is countatble or $X \setminus E$ is countable. Give a complete description of the set of all measures $\mu$ on $(X, S).$ :::{dropdown} Solution -Suppose $X$ is a set and $S$ is the $\sigma$-algebra of all subsets $E$ of $X$ such that $E$ is countable or $X \setminus E$ is countable. + +Suppose $X$ is a set and $S$ is the $\sigma$-algebra of all subsets $E$ of $X$ such that $E$ is countatble or $X \setminus E$ is countable. Then, a measure $\mu$ on $(X, S)$ is completely determined by the value of $\mu(\{x\})$ for each $x \in X,$ along with $\mu(X).$ + ::: :::: @@ -2065,487 +2004,4 @@ $$(\mu \times \nu)(E) = \lim_{n \to \infty} (\mu \times \nu)(E_n).$$ Putting these results together, we have $\omega(E) = (\mu \times \nu)(E),$ so $\mathcal{M}$ is a monotone class. Since $\mathcal{M}$ is a monotone class that contains $\mathcal{A},$ it contains $S \otimes T$ so $\mathcal{M} = S \otimes T,$ and the two measures $\omega$ and $\mu \times \nu$ agree on all of $S \otimes T.$ ::: -:::: - - -## Chapter 6.A - -::::{admonition} Exercise 6.A.1 -:class: tip -Verify that each of the following examples of sets $V$ and functions $d: V \times V \to \mathbb{R}$ are indeed metric spaces. - -1. Suppose $V$ is a nonempty set. -Define $d$ as - -$$\begin{equation} -d(f, g) = \begin{cases} -1 &\text{ if } f = g \\ -0 &\text{ if } f \neq g -\end{cases} -\end{equation}$$ - -2. Let $V = \mathbb{R}.$ -Define $d$ as - -$$\begin{equation} -d(x, y) = |x - y|. -\end{equation}$$ - -3. Let $V = \mathbb{R}.$ -For $n \in \mathbb{Z}^+,$ define $d$ as - -$$\begin{equation} -d((x_1, \dots, x_n), (y_1, \dots, y_n)) = \max\{\|x_1 - y_1|, \dots, |x_n - y_n|\}. -\end{equation}$$ - -4. 
Let $V = C([0, 1]),$ the set of continuous real-valued functions on $[0, 1].$ -Define $d$ as - -$$\begin{equation} -d(f, g) = \sup\{|f(t) - g(t)|: t \in [0, 1]|\}. -\end{equation}$$ - -5. Let $V$ be the set of sequences $(a_1, a_2, \dots)$ with $a_k \in \mathbb{R}$ and $\sum_{n=1}^\infty |a_k| < \infty.$ -Define $d$ as - -$$\begin{equation} -d((a_1, a_2, \dots), (b_1, b_2, \dots)) = \sum_{n=1}^\infty |a_k - b_k|. -\end{equation}$$ - - -:::{dropdown} Solution -For all the definitions above, $d(f, g) \geq 0$ and equality with zero holds only if $f, g.$ -Further, $d(f, g) = d(g, f).$ -It remains to show the triangle inequality - -$$\begin{equation} -d(f, h) \leq d(f, g) + d(g, h) \text{ for all } f, g, h \in V -\end{equation}$$ - -holds for each example. - -__Example 1:__ -Suppose $f, g, h \in \mathbb{V}.$ -If $f = h,$ then the triangle inequality holds since - -$$\begin{equation} -d(f, h) = 0 \leq d(f, g) + d(g, h), -\end{equation}$$ - -holds. -If $f \neq h,$ then at least one of $f \neq g$ and $g \neq h$ must hold, so the triangle inequality holds because - -$$\begin{equation} -d(f, h) = 1 \leq d(f, g) + d(g, h). -\end{equation}$$ - -__Example 2:__ -Suppose $f, g, h \in \mathbb{R}.$ -Then the triangle inequality holds because - -$$\begin{equation} -d(f, h) = |f - h| = |(f - g) + (g - h)| \leq |f - g| + |g - h| \leq d(f, g) + d(g, h). 
-\end{equation}$$ - -__Example 3:__ -Suppose $n \in \mathbb{Z}^+$ and $(x_1, \dots, x_n), (y_1, \dots, y_n), (z_1, \dots, z_n) \in \mathbb{R}^n.$ -Then the triangle inequality holds because - -$$\begin{align} -&~~~~d((x_1, \dots, x_n), (z_1, \dots, z_n)) = \\ -&= \max\{|x_1 - z_1|, \dots, |x_n - z_n|\} \\ -&= \max\{|(x_1 - y_1) + (y_1 - z_1)|, \dots, |(x_n - y_n) + (y_n - z_n)|\} \\ -&\leq \max\{|x_1 - y_1| + |y_1 - z_1|, \dots, |x_n - y_n| + |y_n - z_n|\} \\ -&\leq \max\{|x_1 - y_1|, \dots, |x_n - y_n|\} + \max\{|y_1 - z_1|, \dots, |y_n - z_n|\} \\ -&\leq d((x_1, \dots, x_n), (y_1, \dots, y_n)) + d((y_1, \dots, y_n), (z_1, \dots, z_n)). -\end{align}$$ - -__Example 4:__ -Suppose $f, g, h \in C([0, 1]).$ -Then, the triangle inequality holds because - -$$\begin{align} -d(f, h) &= \sup\{|f(t) - h(t)|: t \in [0, 1]|\} \\ -&= \sup\{|(f(t) - g(t)) + (g(t) - h(t))|: t \in [0, 1]|\} \\ -&\leq \sup\{|f(t) - g(t)| + |g(t) - h(t)|: t \in [0, 1]|\} \\ -&\leq \sup\{|f(t) - g(t)|: t \in [0, 1]|\} + \sup\{|g(t) - h(t)|: t \in [0, 1]|\} \\ -&= d(f, g) + d(g, h). -\end{align}$$ - -__Example 5:__ -Suppose $(f_1, f_2, \dots), (g_1, g_2, \dots), (h_1, h_2, \dots) \in V,$ where $V$ is the set of sequences $(a_1, a_2, \dots)$ of real numbers such that $\sum_{n=1}^\infty |a_k| < \infty.$ -Then, the triangle inequality holds because - -$$\begin{align} -&~~~~d((f_1, f_2, \dots), (h_1, h_2, \dots)) = \\ -&= \sum_{n=1}^\infty |f_k - h_k| \\ -&= \sum_{n=1}^\infty |(f_k - g_k) + (g_k - h_k)| \\ -&\leq \sum_{n=1}^\infty |f_k - g_k| + |g_k - h_k| \\ -&= d((f_1, f_2, \dots), (g_1, g_2, \dots)) + d((g_1, g_2, \dots), (h_1, h_2, \dots)). -\end{align}$$ - -::: -:::: - - -::::{admonition} Exercise 6.A.2 -:class: tip -Prove that every finite subset of a metric space is closed. 
- -:::{dropdown} Solution -Suppose $(V, d)$ is a metric space and let $A$ be a finite subset of $V.$ -Let $x \in A'.$ -Since $A$ is finite, the set $\{d(x, y) \in \mathbb{R}^+: y \in A\}$ has a minimum element, and this minimum element is non-zero, say equal to some positive $m \in \mathbb{R}$ -Therefore, any open ball centered on $x$ with radius less than or equal to $m$ is contained in $A'.$ -Since $x$ was arbitrary, $A'$ is open, so $A$ is closed. -::: -:::: - -::::{admonition} Exercise 6.A.3 -:class: tip -Prove that every closed ball in a metric space is closed. - -:::{dropdown} Solution -Suppose $(V, d)$ is a metric space. -Let $r \in \mathbb{R}^+$ and $f \in V.$ -Let $g \in \overline{B}(f, r)'.$ -Then, $d(f, g) > r$ so the open ball $B(g, d(f, g) - r)$ does not intersect the closed ball $\overline{B}(f, r).$ -[Otherwise there would exist a common element $h \in B(g, d(f, g) - r) \cap \overline{B}(f, r)$ which leads to a contradiction via the triangle inequality since $d(f, g) \leq d(f, h) + d(h, g) < r.$] -Since the two balls do not intersect, $B(g, r - d(f, g)) \subseteq \overline{B}(f, r)',$ which means that $\overline{B}(f, r)'$ is an open set, so $\overline{B}(f, r)$ is a closed set. -::: -:::: - -::::{admonition} Exercise 6.A.4 -:class: tip - -Suppose $V$ is a metric space. - -1. Prove that the union of each collection of open subsets of $V$ is an open subset of $V.$ -2. Prove that the intersection of each finite collection of open subsets of $V$ is an open subset of $V.$ - -:::{dropdown} Solution -In the following parts, $V$ is a metric space and $\mathcal{A}$ is a collection of subsets of $V.$ - -__Part 1:__ -Let $S = \cup_{A \in \mathcal{A}} A.$ -If $s \in S,$ there exists some $A \in \mathcal{A}$ such that $s \in A.$ -Since every $A \in \mathcal{A}$ is open, there exists an open ball centered on $s$ that is contained in $A.$ -This open ball is also contained in $S,$ so $S$ is open. 
- -__Part 2:__ -Now suppose that, in addition, $\mathcal{A} = \{A_1, \dots A_N\}$ is finite. -If $s \in S,$ then $s \in A_n$ for $n = 1, \dots, N.$ -Since each $A$ is open, for each $A_n \in \mathcal{A},$ there exists an open ball centered on $s$ with radius $r_n,$ which is contained in $A_n.$ -Now, letting $r = \min\{r_1, \dots, r_N\}$ we see that the open ball centered on $s$ with radius $r$ is contained in each $A_n$ and therefore it is also contained in their intersection, i.e. it is contained in $S.$ -Therefore there exists an open ball centered on $s \in S,$ which implies that $S$ is open. -::: -:::: - - -::::{admonition} Exercise 6.A.5 -:class: tip - -Suppose $V$ is a metric space. - -1. Prove that the intersection of each collection of closed subsets of $V$ is an open subset of $V.$ -2. Prove that the union of each finite collection of open subsets of $V$ is an open subset of $V.$ - -:::{dropdown} Solution -The complement of an intersection of a collection of sets is equal to the union of the complements of the sets in the collection. -Similarly, the complement of a union of a collection of sets is equal to the intersection of the complements of the sets in the collection. -Therefore, applying the result of the previous exercise we arrive at the two required results. -::: -:::: - - -::::{admonition} Exercise 6.A.6 -:class: tip - -1. Prove that if $V$ is a metric space, $f \in V,$ and $r > 0,$ then $\overline{B(f, r)} \subseteq \overline{B}(f, r).$ -2. 
Give an example of a metric space $V,$ $f \in V,$ and $r > 0$ such that $\overline{B(f, r)} \neq \overline{B}(f, r).$ - -:::{dropdown} Solution - -__Part 1:__ -First, note that $B(f, r) \subseteq \overline{B}(f, r).$ -Second $\overline{B}(f, r)$ is a closed set, and the closure of a set is the intersection of all closed sets that contain it, so $\overline{B(f, r)}$ is contained in any closed set that contains $B(f, r),$ so $\overline{B(f, r)} \subseteq \overline{B}(f, r).$ - -__Part 2:__ -Let $V = \mathbb{Z}$ with the metric $d: \mathbb{Z} \times \mathbb{Z} \to \mathbb{R}^+$ defined as $d(f, g) = |f - g|.$ -Also let $f = 0$ and $r = 1.$ -Then $B(f, r) = \{0\}$ and so $\overline{B(f, r)} = \{0\}.$ -However, $\overline{B}(f, r) = \{-1, 0, 1\}$ so $\overline{B(f, r)} \neq \overline{B}(f, r)$ as required. -::: -:::: - -::::{admonition} Exercise 6.A.7 -:class: tip -Show that each sequence in a metric space has at most one limit. - -:::{dropdown} Solution -Let $(V, d)$ be a metric space. -Suppose $f_1, f_2, \dots \in V$ is a sequence in $V.$ -If $a, b \in V$ are limits of $f_1, f_2, \dots,$ then - -$$\begin{equation} -\lim_{n \to \infty} d(f_n, a) = 0 \text{ and } \lim_{n \to \infty} d(f_n, b) = 0. -\end{equation}$$ - -From the triangle inequality, $d(a, b) \leq d(a, f_n) + d(f_n, b),$ and taking limits of both sides, we conclude that $d(a, b) = 0,$ which implies that $a = b.$ -Therefore the sequence can have at most one limit in $V.$ -::: -:::: - - -::::{admonition} Exercise 6.A.8 -:class: tip -Prove that each open subset of a metric space $V$ is the union of some sequence of closed subsets of $V.$ - -:::{dropdown} Solution -Let $V$ be a metric space, let $U$ be an open subset of $V$ and define - -$$\begin{equation} -U_r = \{f \in V: d(f, g) \geq r \text{ for all } g \in U'\}. 
-\end{equation}$$ - -Therefore $U_r$ is the set of elements in $U$ that are at least a distance $r$ away from $V.$ -Note also that $U_r \subseteq U.$ -Now, note that for fixed $g \in U',$ the set $\{f \in V: d(f, g) \geq r\}$ is closed, because it is the complement of the open set $\{f \in V: d(f, g) < r\}.$ -The intersection of a collection of closed sets is closed, so $U_r$ is closed. - -Now, note that for each $n \in \mathbb{Z}^+,$ we have $U_{\frac{1}{n}} \subseteq U,$ from which it follows that $\bigcup_{n = 1}^\infty U_{\frac{1}{n}} \subseteq U.$ -Conversely, if $x \in U,$ since $U$ is open there exists a ball of radius $\frac{1}{n}$ for some $n \in \mathbb{Z}^+$ contained in $U,$ so $x \in U_{\frac{1}{n}},$ from which it follows that $U \subseteq \bigcup_{n = 1}^\infty U_{\frac{1}{n}}.$ -Therefore $U$ is the union of a sequence of closed sets in $V,$ as required. - -::: -:::: - - -::::{admonition} Exercise 6.A.10 -:class: tip -Prove or give a counterexample: -If $V$ is a metric space and $U, W$ are subserts of $V,$ then $\overline{U} \cup \overline{W} = \overline{U \cup W}.$ - -:::{dropdown} Solution -If $v \in \overline{U},$ then there exists a sequence of elements in $U$ whose limit is $v.$ -Therefore there exists a sequence of elements in $U \cup W$ whose limit is $v,$ so $v \in \overline{U \cup W}.$ -Similarly, if $v \in \overline{W},$ it follows that $v \in \overline{U \cup W}.$ -We conclude that $\overline{U} \cup \overline{W} \subseteq \overline{U \cup W}.$ - -If $v \in \overline{U \cup W},$ then $v$ must be the limit of a sequence of elements in $U \cup W.$ -This sequence must have a infinite subsequence of elements in at least one of $U$ or $W$ with $v$ as its limit, so $v \in \overline{U} \cup \overline{W}.$ -We conclude that $\overline{U \cup W} \subseteq \overline{U} \cup \overline{W},$ which completes the proof. 
-::: -:::: - - -::::{admonition} Exercise 6.A.11 -:class: tip -Prove or give a counterexample: -If $V$ is a metric space and $U, W$ are subsets of $V,$ then $\overline{U} \cap \overline{W} = \overline{U \cap W}.$ - -:::{dropdown} Solution -The equation does not hold. -As a counterexample, consider $\mathbb{R}$ with the metric - -$$d(f, g) = |f - g|,$$ - -and let $U = (-1, 0), W = (0, 1).$ -Then we have $\overline{U} = [-1, 0]$ and $\overline{W} = [0, 1].$ -Therefore, we have $\overline{U} \cap \overline{W} = \{0\},$ but $U \cap W = \emptyset$ so $\overline{U \cap W} = \emptyset.$ -::: -:::: - - -::::{admonition} Exercise 6.A.12 -:class: tip -Suppose $(U, d_U), (V, d_V)$ and $(W, d_W)$ are metric spaces. -Suppose also that $T: U \to V$ and $S: V \to W$ are continuous functions. - -1. Using the definition of continuity, show that $S \circ T: U \to W$ is continuous. -2. Using the equivalence of continuity with the property that the limit of a function is equal to the function of its limit, show that $S \circ T: U \to W$ is continuous. -3. Using the equivalence of continuity with the property that the inverse image of an open set under a function is open, show that $S \circ T: U \to W$ is continuous. - -:::{dropdown} Solution -__Part 1:__ -Let $v \in V$ and $\epsilon > 0.$ -Since $S$ is continuous, there exists $\delta_S > 0$ such that $d_W(S(v), S(v')) < \epsilon$ for all $v' \in V$ such that $d_V(v, v') < \delta_S.$ -Let $u \in U.$ -Since $T$ is continuous, there exists $\delta_T > 0$ such that $d_V(T(u), T(u')) < \delta_S$ for all $u' \in U$ such that $d_U(u, u') < \delta_T.$ -Letting $v = T(u)$ and putting together these facts, we see that $d_W(S \circ T(u), S \circ T(u')) < \epsilon$ for all $u' \in U$ such that $d_U(u, u') < \delta_T,$ which shows that $S \circ T$ is continuous. 
- -__Part 2:__ -Suppose $u_1, u_2, \dots$ is a sequence in $U$ with limit $u \in U.$ -Since $S$ and $T$ are both continuous - -$$\begin{align} -S \circ T(u) &= S(T(u)) \\ -&= S\left(T\left(\lim_{k \to \infty} u_k\right)\right) \\ -&= S\left(\lim_{k \to \infty} T(u_k)\right) \\ -&= \lim_{k \to \infty} S\left(T(u_k)\right) \\ -&= \lim_{k \to \infty} S \circ T (u_k) -\end{align}$$ - -Therefore $S \circ T$ is continuous. - -__Part 3:__ -Suppose $G$ is an open subset in $W.$ -Since $S$ is continuous, $S^{-1}(G)$ is open in $V$ and since $T$ is continuous, $T^{-1}(S^{-1}(G)) = (S \circ T)^{-1}(G)$ is open in $U,$ so $S \circ T$ is continuous. -::: -:::: - -::::{admonition} Exercise 6.A.14 -:class: tip -Suppose a Cauchy sequence in a metric space has a convergent subsequence. -Prove that the Cauchy sequence converges. - -:::{dropdown} Solution -Let $(V, d)$ be a metric space and $v_1, v_2, \dots$ be a Cauchy sequence in $V.$ -Suppose that $v_{k_1}, v_{k_2}, \dots$ is a subsequence which converges to $v \in V.$ -Let $\epsilon > 0.$ -Since the sequence $v_1, v_2, \dots$ is Cauchy, there exists $K$ such that for all $k, k' \geq K$ we have $d(v_k, v_{k'}) < \frac{\epsilon}{2}.$ -In addition, by our earlier assumption, there exists $N$ such that for all $n \geq N$ we have $d(v_{k_n}, v) < \frac{\epsilon}{2}.$ -Then, letting $L$ be the maximum of $K$ and $k_N$ we see that for any $l \geq L$ it holds that $d(v_l, v) \leq d(v_l, v_k) + d(v_k, v) < \epsilon,$ concluding the proof. -::: -:::: - - -::::{admonition} Exercise 6.A.16 -:class: tip -Suppose $(U, d)$ is a metric space. -Let $W$ denote the set of all Cauchy sequences of elements of $U.$ - -1. For $(f_1, f_2, \dots)$ and $(g_1, g_2, \dots)$ in $W,$ define $(f_1, f_2, \dots) \equiv (g_1, g_2, \dots)$ to mean that $\lim_{k \to \infty} d(f_k, g_k) = 0.$ -Show that $\equiv$ is an equivalence relation on $W.$ - -2. Let $V$ denote the set of equivalence classses of elements of $W$ under the equivalence relation above. 
-For $(f_1, f_2, \dots) \in W,$ let $(f_1, f_2, \dots)\hat{~}$ denote the equivalence class of $(f_1, f_2, \dots).$ -Show that the following definition of $d_V: V \times V \to [0, \infty)$ makes sense and that $d_V$ is a metric on $V$ - -$$\begin{equation} -d_V((f_1, f_2, \dots)\hat{~}, (g_1, g_2, \dots)\hat{~}) = \lim_{k \to \infty} d(f_k, g_k). -\end{equation}$$ - -3. Show that $(V, d_V)$ is a complete metric space. - -4. Show that the map from $U$ to $V$ that takes $f \in U$ to $(f, f, f, \dots)\hat{~}$ preserves distances, meaning that for all $f, g \in U,$ we have - -$$d(f, g) = d_V((f, f, f, \dots)\hat{~}, (g, g, g, \dots)\hat{~})$$ - -5. Explain why (4) shows that every metric space is a subset of some complete metric space. - - -:::{dropdown} Solution -__Part 1:__ -First, we have that $d(f_k, f_k) = 0$ so $(f_1, f_2, \dots) \equiv (f_1, f_2, \dots).$ -Second, if $(f_1, f_2, \dots) \equiv (g_1, g_2, \dots),$ we have - -$$\lim_{k \to \infty} d(f_k, g_k) = 0 \implies \lim_{k \to \infty} d(g_k, f_k) = 0,$$ - -so it follows that $(g_1, g_2, \dots) \equiv (f_1, f_2, \dots).$ -Third, if $(f_1, f_2, \dots) \equiv (g_1, g_2, \dots),$ and $(g_1, g_2, \dots) \equiv (h_1, h_2, \dots),$ we have that - -$$\lim_{k \to \infty} d(f_k, h_k) \leq \lim_{k \to \infty} (d(f_k, g_k) + d(g_k, h_k)) = 0,$$ - -which means that $(f_1, f_2, \dots) \equiv (h_1, h_2, \dots).$ -Therefore $\equiv$ is an equivalence relation. 
- -__Part 2:__ -First, for $d_V$ to be well-defined, it should not matter which representative elements of $(f_1, f_2, \dots)\hat{~}$ and $(g_1, g_2, \dots)\hat{~}$ we pick in the right hand side of the equation that defines $d_V.$ -In particular, suppose $(\tilde{f}_1, \tilde{f}_2, \dots) \in (f_1, f_2, \dots)\hat{~}.$ -Then - -$$\begin{equation} -\lim_{k \to \infty} d(\tilde{f}_k, g_k) \leq \lim_{k \to \infty} (d(f_k, g_k) + d(f_k, \tilde{f}_k)) = \lim_{k \to \infty} d(f_k, g_k), -\end{equation}$$ - -so it does not matter which representative element of $(f_1, f_2, \dots)\hat{~}$ we pick when defining $d_V.$ -The same holds for $(g_1, g_2, \dots)\hat{~},$ so $d_V$ is well-defined. - -Second, we show that $d_V$ is a metric. -By the definition of $d_V,$ we have that - -$$d_V((f_1, f_2, \dots)\hat{~}, (g_1, g_2, \dots)\hat{~}) \geq 0,$$ - -with equality holding if and only if $\lim_{k \to \infty} d(f_k, g_k) = 0,$ which in turn holds if and only if $(f_1, f_2, \dots) \equiv (g_1, g_2, \dots),$ which is equivalent to $(f_1, f_2, \dots)\hat{~} = (g_1, g_2, \dots)\hat{~}.$ -In addition, we have that - -$$\begin{equation} -d_V((f_1, f_2, \dots)\hat{~}, (g_1, g_2, \dots)\hat{~}) = d_V((g_1, g_2, \dots)\hat{~}, (f_1, f_2, \dots)\hat{~}) -\end{equation}$$ - -because $d(f_k, g_k) = d(g_k, f_k).$ -Lastly, if $(h_1, h_2, \dots) \in W,$ we have that - -$$\begin{align} -&~~~~d_V((f_1, f_2, \dots)\hat{~}, (h_1, h_2, \dots)\hat{~}) =\\ -&= \lim_{k \to \infty} d(f_k, h_k) \\ -&\leq\lim_{k \to \infty} (d(f_k, g_k) + d(g_k, h_k)) \\ -&= d_V((f_1, f_2, \dots)\hat{~}, (g_1, g_2, \dots)\hat{~}) + d_V((g_1, g_2, \dots)\hat{~}, (h_1, h_2, \dots)\hat{~}), -\end{align}$$ - -so $d_V$ satisfies the triangle inequality, completing the proof that it is a metric. 
- -__Part 3:__ -Suppose that $(v_1, v_2, \dots)$ is a Cauchy sequence in $V.$ -We will show that there exists an element $w \in V$ and $K \in \mathbb{Z}^+$ such that for all $k \geq K$ we have $\lim_{i \to \infty} d((v_k)_i, w_i) = 0.$ -Since each sequence $v_k$ is itself Cauchy, there exists $N_k \in \mathbb{Z}^+$ such that $N_k \geq k$ and for all $i, j \geq N_k,$ we have $d(v_i, v_j) < \frac{1}{k}.$ -Define the terms of the sequence $w$ to be $w_k = (v_k)_{N_k}.$ -We now show that $w$ is in fact the limit of $(v_1, v_2, \dots).$ - -Let $\epsilon > 0.$ -Since $(v_1, v_2, \dots)$ is Cauchy, there exists $N_{\epsilon}^{(1)} \in \mathbb{Z}^+$ such that for all $i, j \geq N_{\epsilon}^{(1)},$ we have - -$$\lim_{k \to \infty} d((v_i)_k, (v_j)_k) < \frac{\epsilon}{3}.$$ - -In addition, for fixed $k \in \mathbb{Z}^+$ since each sequence $v_k$ is Cauchy, there exists $N_{k, \epsilon}^{(2)} \in \mathbb{Z}^+$ such that for all $i, j \geq N_{k, \epsilon}^{(2)},$ we have - -$$\lim_{k \to \infty} d((v_k)_i, (v_k)_j) < \frac{\epsilon}{3}.$$ - -Now, we have that - -$$\begin{align} -d((v_k)_i, w_i) &\leq d((v_k)_i, (v_k)_j) + d((v_k)_j, w_i) \\ -&= d((v_k)_i, (v_k)_j) + d((v_k)_j, (v_i)_{N_i}) \\ -&\leq d((v_k)_i, (v_k)_j) + d((v_k)_j, (v_i)_j) + d((v_i)_j, (v_i)_{N_i}) -\end{align}$$ - -where in the first line we have applied the triangle inequality, in the second line we have substituted the definition $w_i = (v_i)_{N_i}$ and in the third line we have again used the triangle inequality again. 
-Now, letting $i, k \geq N_{\epsilon}^{(1)}$ means that the limit of the second term in the inequality, as $j \to \infty$, is smaller than $\epsilon / 3.$ -In addition, letting $i, j \geq \max(N_{\epsilon}^{(1)}, N_{k, \epsilon}^{(2)})$ means that the first term in the inequality is smaller than $\epsilon / 3.$ -Lastly, by the definition of $N_i,$ letting $j \geq \max(N_{\epsilon}^{(1)}, N_{k, \epsilon}^{(2)}, N_i)$ means that the last term in the inequality is smaller than $\frac{1}{i}.$ -We therefore obtain - -$$\begin{align} -d((v_k)_i, w_i) &\leq \lim_{j \to \infty} d((v_k)_i, (v_k)_j) + \lim_{j \to \infty} d((v_k)_j, (v_i)_j) + \lim_{j \to \infty} d((v_i)_j, (v_i)_{N_i}) \\ -&< \frac{2\epsilon}{3} + \frac{1}{i} -\end{align}$$ - -so for all $i > \frac{3}{\epsilon}$ we have $d((v_k)_i, w_i) < \epsilon.$ -This means that $d_V(v_k\hat{~}, w\hat{~}) < \epsilon$ for all $k \geq N_{\epsilon}^{(1)}$ which means that $w\hat{~}$ is the limit of the sequence $(v_1\hat{~}, v_2\hat{~}, \dots),$ so $V$ is a complete metric space. - -__Part 4:__ -By part (2), defining $f_k = f$ and $g_k = g$ for all $k \in \mathbb{Z}^+,$ we have that - -$$d_V((f, f, f, \dots)\hat{~}, (g, g, g, \dots)\hat{~}) = \lim_{k \to \infty} d(f_k, g_k) = d(f, g),$$ - -as required. - -__Part 5:__ -We can add elements to the set $U$ to ensure that the resulting set is complete. -Specifically, we add to $U$ the set - -$S = \left\{w \in W | w \neq (f, f, f, \dots)\hat{~} \text{ for any } f \in U \right\},$ - -to obtain the larger set $\hat{U} = U \cup S.$ -Now, for $u \in \hat{U},$ we define $\hat{u}$ to be equal to $(u, u, \dots)\hat{~}$ if $u \in U$ and equal to $u$ otherwise. 
-Then, define the function $d_\hat{U}: \hat{U} \times \hat{U} \to \mathbb{R}$ as - -$$d_\hat{U}(u, v) = d_V(\hat{u}, \hat{v}).$$ - -This function is a metric over $\hat{U}.$ -Further, if $(u_1, u_2, \dots)$ is a Cauchy sequence of elements of $U,$ then $(\hat{u_1}, \hat{u_2}, \dots)$ is a Cauchy sequence of equivalence classes in $V.$ -Since $V$ is complete, the sequence $(\hat{u_1}, \hat{u_2}, \dots)$ has a limit, denoted $\hat{u},$ in $V.$ -Therefore the sequence $(u_1, u_2, \dots)$ converges to $\hat{u}$ in the metric space $(\hat{U}, d_{hat{U}}).$ -::: -:::: +:::: \ No newline at end of file diff --git a/_sources/book/mira/002-measures.md b/_sources/book/mira/002-measures.md index c87fbc3f..ff5ae0bf 100644 --- a/_sources/book/mira/002-measures.md +++ b/_sources/book/mira/002-measures.md @@ -63,7 +63,6 @@ The outer measure has a number of good properties. First, the outer measure of countable subsets of $\mathbb{R}$ is zero. :::{prf:theorem} Countable sets have outer measure zero -:label: mira:thm:countable-sets-have-measure-zero Every countable subset of $\mathbb{R}$ has outer measure $0.$ @@ -158,7 +157,7 @@ Another useful property of the outer measure is countable subadditivity. This property will also turn out to be true of more general measures which we will define later. 
:::{prf:theorem} Outer measure is countably subadditive -:label: mira:thm:countable-subadditivity-of-outer-measure +:label: mira:thm:outer-measure-is-countably-subadditive Suppose $A_1, A_2, \ldots$ are subsets of $\mathbb{R}.$ Then @@ -455,7 +454,7 @@ $$\left|\bigcup_{n=1}^\infty S_n\right| = \sum_{n=1}^\infty |S_n|.$$ :::{dropdown} Proof: Outer measure is additive if sets are contained by disjoint open intervals -First, by the {prf:ref}`subadditivity of the outer measure`, we have +First, by the {prf:ref}`subadditivity of the outer measure`, we have $$\left|\bigcup_{n=1}^\infty S_n\right| \leq \sum_{n=1}^\infty |S_n|.$$ @@ -484,9 +483,6 @@ This result highlights that if there is a sequence of sets on which the outer me ## Measurable spaces and functions -A natural question is whether the {prf:ref}`nonadditivity of the outer measure ` is due to a flaw in our definition. -However, this next result shows that any notion of length that satisfies certain intuitive properties cannot be additive. - :::{prf:theorem} Nonexistence of extension of length to all subsets of $\mathbb{R}$ :label: mira:thm:nonexistence-length @@ -566,40 +562,35 @@ reaching a contradiction, because $|V| > 0,$ so the above inequality cannot hold ### Sigma algebras -{prf:ref}`mira:thm:nonexistence-length` shows that there does not exist a notion of length that satisfies all four intuitive properties that we would like length to satisfy. -We therefore need to relax at least one of these properties to proceed. -We cannot relax (b), because we want all open intervals to have the length one would expect. -We also cannot give up (c), because we want length to be additive. -Finally, we cannot give up (d) either, because we want length to be translation invariant. -Therefore, our only option is to relax (a). 
-This is where $\sigma$-algebras come in: instead of using all subsets of $\mathbb{R}$ as the domain of $\mu,$ we will restrict the domain to be a certain collection of subsets, which circumvents the issue of non-measurable sets. - :::{prf:definition} $\sigma$-algebra -:label: mira:def:sigma-algebra -Suppose $X$ is a set and $\mathcal{S}$ is a set of subsets of $X.$ -Then $\mathcal{S}$ is called a $\sigma$-algebra on $X$ if it satisfies: +Suppose $X$ is a set and $S$ is a set of subsets of $X.$ +Then $S$ is called a $\sigma$-algebra on $X$ if it satisfies: -- $\emptyset \in \mathcal{S},$ -- if $E \in \mathcal{S},$ then $X \setminus E \in \mathcal{S},$ -- if $E_1, E_2, \ldots$ is a sequnece of elements of $\mathcal{S},$ then $\bigcup_{k=1}^\infty E_k \in S.$ +- $\emptyset \in S,$ +- if $E \in S,$ then $X \setminus E \in S,$ +- if $E_1, E_2, \ldots$ is a sequnece of elements of $S,$ then $\bigcup_{k=1}^\infty E_k \in S.$ ::: -From this definition, a number of basic properties of $\sigma$ algebras follow immediately. 
+ :::{prf:theorem} Other properties of $\sigma$-algebras -Suppose $\mathcal{S}$ is a $\sigma$-algebra on a set $X.$ + +Suppose $S$ is a $\sigma$-algebra on a set $X.$ Then -(a) $X \in \mathcal{S},$ +(a) $X \in S,$ + +(b) if $D, E \in S,$ then $D \cup E \in S, D \cap E \in S$ and $D \setminus E \in S,$ -(b) if $D, E \in \mathcal{S},$ then $D \cup E \in \mathcal{S}, D \cap E \in \mathcal{S}$ and $D \setminus E \in \mathcal{S},$ +(c) if $E_1, E_2, \ldots$ is a sequence of elements of $S,$ then $\cap_{k = 1}^\infty E_k \in S.$ -(c) if $E_1, E_2, \ldots$ is a sequence of elements of $\mathcal{S},$ then $\cap_{k = 1}^\infty E_k \in \mathcal{S}.$ ::: + :::{dropdown} Proof: Other properties of $~\sigma$-algebras + Because $\emptyset \in S$ and $X = X \setminus \emptyset,$ we have $X \in S.$ Suppose $D, E \in S.$ Then $D \cup E \in S$ because this is the union of the sequence $D, E, \emptyset, \emptyset, \ldots \in S.$ @@ -611,57 +602,58 @@ and also $D \setminus (X \setminus E) = D \cap E \in S.$ Lastly, if $E_1, E_2, \ldots$ is a sequence of elements of $S,$ then $$\bigcap_{k=1}^\infty E_k = X \setminus \bigcup_{k=1}^\infty (X \setminus E_k) \in S.$$ + ::: -We will later define measures, which will be functions that take values on $\sigma$-algebras over sets rather than entire powersets. -The fact that $\sigma$-algebras are closed under complements, as well as countable unions and intersections will allow us to prove useful theorems about limits of measures. -For this, we first define measurable spaces and measurable sets. :::{prf:definition} Measurable space, measurable set + A measurable space is an ordered pair $(X, S),$ where $X$ is a set and $S$ is a $\sigma$-algebra on $X.$ An element of $S$ is called a $S$-measurable set, or simply a measurable set if $S$ is clear from the context. 
+ ::: -One very useful theorem for proving results about $\sigma$-algebras is that given a set $X$ and a set of subsets of $X,$ say $A,$ the intersection of all $\sigma$-algebras on $X$ that contain $A$ is also a sigma algebra. + :::{prf:theorem} Smallest $\sigma$-algebra containing a collection of subsets + Suppose $X$ is a set and $A$ is a set of subsets of $X.$ Then, the intersection of all $\sigma$-algebras on $X$ that contain $A$ is a $\sigma$-algebra on $X.$ + ::: :::{dropdown} Proof: Smallest $~\sigma$-algebra containing a collection of subsets + There is at least one $\sigma$-algebra on $X$ that contains $A,$ because the power set of $X$ is a $\sigma$-algebra on $X$ that contains $A.$ Let $S$ be the intersection of all $\sigma$-algebras on $X$ that contain $A.$ First, $\emptyset \in S,$ because $\emptyset$ is in every $\sigma$-algebra on $X.$ - Second, if $E \in S,$ then $E$ is in every $\sigma$-algebra on $X$ that contains $A,$ so $X \setminus E$ is in every $\sigma$-algebra on $X$ that contains $A,$ so $X \setminus E \in S.$ - Third, if $E_1, E_2, \ldots$ is a sequence of elements of $S,$ then $E_1, E_2, \ldots$ is a sequence of elements of every $\sigma$-algebra on $X$ that contains $A,$ so $\bigcup_{k=1}^\infty E_k$ is in every $\sigma$-algebra on $X$ that contains $A,$ so $\bigcup_{k=1}^\infty E_k \in S.$ + ::: -The intersection of all $\sigma$-algebras containing a collection of sets $A$ is sometimes also referred to as the $\sigma$-algebra generated by $A.$ -We now come to the definition of an important $\sigma$-algebra, the Borel $\sigma$-algebra over $\mathbb{R}.$ -This is the $\sigma$-algebra generated by the open subsets of $\mathbb{R}.$ + :::{prf:definition} Borel set + The smallest $\sigma$-algebra on $\mathbb{R}$ that contains all the open subsets of $\mathbb{R}$ is called the collection of Borel subsets on $\mathbb{R}.$ An element of this $\sigma$-algebra is called a Borel set. 
+ ::: -Before moving to measurable functions, we will define inverse images of functions. :::{prf:definition} Inverse image + If $f: X \in Y$ is a function and $A \subseteq Y,$ then the inverse image of $A$ under $f$ is defined by $$f^{-1}(A) = \{ x \in X : f(x) \in A \}.$$ + ::: -We now prove certain useful properties that inverse images have -This will allow us to prove some very useful results about measurable functions. -:::{prf:theorem} Algebra of inverse images +:::{def:theorem} Algebra of inverse images Suppose $f: X \to Y$ is a function. Then @@ -750,42 +742,43 @@ Thus $(g \circ f)^{-1}(A) = f^{-1}(g^{-1}(A)).$ ### Measurable functions -Now we introduce measurable functions. -As the name suggests, measurable functions are functions such that the inverse images of Borel sets under these functions are measurable sets. + :::{prf:definition} Measurable function + Suppose $(X, S)$ is a measurable space. A function $f: X \to \mathbb{R}$ is called $S$-measurable if $$f^{-1}(B) \in S$$ for all Borel sets $B \subseteq \mathbb{R}.$ + ::: -One important kind of function is the characteristic function, a function which is equal to one on a given set and zero everywhere else. + + :::{prf:defintion} Characteristic function + Suppose $E$ is a subset of a set $X.$ The characteristic function of $E$ is the function $\chi_E: X \to \mathbb{R}$ defined by $$\chi_E(x) = \begin{cases} 1 & \text{if } x \in E \\ 0 & \text{if } x \notin E. \end{cases}$$ + ::: -## Conditions for measurable functions -Now we show one very useful result on measurable functions. -Naively showing a function to be measurable would be tricky because we would have to check that the inverse image of any Borel set under the function is measurable. -Since there are lots of Borel sets, this approach would not be possible. -The following result gives a sufficient condition for a function to be measurable, which is far easier to work with. 
:::{prf:theorem} Condition for measurable function :label: mira-thm-condition-measurable + Suppose $(X, S)$ is a measurable space and $f: X \to \mathbb{R}$ is a function such that $$f^{-1}((a, \infty)) \in S$$ for all $a \in \mathbb{R}.$ Then $f$ is $S$-measurable. + ::: :::{dropdown} Proof: Condition for measurable function @@ -812,24 +805,31 @@ __Every open interval is in $T$:__ By the hypothesis in the theorem statement, it follows that $f^{-1}((a, \infty)) \in S$ for all $a \in \mathbb{R},$ so $(a, \infty) \in T$ for all $a \in \mathbb{R}.$ Since $T$ is a $\sigma$-algebra on $\mathbb{R},$ it is closed under complementation and intersection so $(-\infty, b] \in T$ for all $b \in \mathbb{R},$ and $(a, b) \in T$ for all $a, b \in \mathbb{R}.$ Therefore $T$ contains all the open intervals of $\mathbb{R},$ so $T$ contains all the Borel subsets of $\mathbb{R}.$ + ::: -## Properties of measurable functions -In the special case that $X$ is a subset of $\mathbb{R}$ and $S$ is the set of Borel subsets of $\mathbb{R},$ we use the term Borel measurable to refer to $S$-measurable functions. + +In the special case that $X$ is a subset of the reals and $S$ is the set of Borel subsets of $\mathbb{R},$ we use the term Borel measurable to refer to $S$-measurable functions. :::{prf:definition} Borel measurable function + Suppose $X \subseteq \mathbb{R}.$ A function $f: X \to \mathbb{R}$ is called Borel measurable if $f^{-1}(B)$ is a Borel set for every Borel set $B \subseteq \mathbb{R}.$ + ::: -Now we prove a number a few results on sufficient conditions for Borel measurable functions. + + :::{prf:theorem} Every continuous function is Borel measurable + Every continuous real-valued function defined on a Borel subset of $\mathbb{R}$ is a Borel measurable function. + ::: :::{dropdown} Proof: Every continuous function is Borel measurable + Suppose that $X \subseteq \mathbb{R}$ is a Borel set and $f: X \to \mathbb{R}$ is a Borel measurable function. 
Suppose $a \in \mathbb{R}.$ If $x \in X$ such that $f(x) > a,$ then by the continuity of $f,$ there exists $\delta_x > 0$ such that $f(y) > a$ for all $y \in (x - \delta_x, x + \delta_x).$ @@ -841,21 +841,29 @@ f^{-1}((a, \infty)) = \left( \bigcup_{x \in f^{-1}((a, \infty))} (x - \delta_x, The above union is a union of open sets, which is therefore also open, so its intersection with $X$ is a Bore set. By our earlier {prf:ref}`condition for measurable functions`, $f$ is Borel measurable. + ::: :::{prf:defintion} Increasing function; strictly increasing function + Suppose $X \subseteq \mathbb{R}$ and $f: X \to \mathbb{R}$ is a function. Then $f$ is called increasing if $f(x) \leq f(y)$ for all $x, y \in X$ with $x < y.$ If $f(x) < f(y)$ for all $x, y \in X$ with $x < y,$ then $f$ is called strictly increasing. + ::: + + :::{prf:theorem} Every increasing function is Borel measurable + Every increasing function defined on a Borel subset of $\mathbb{R}$ is a Borel measurable function. + ::: :::{dropdown} Proof: Every increasing function is Borel measurable + Suppose that $X \subseteq \mathbb{R}$ is a Borel set and $f: X \to \mathbb{R}$ is an increasing function. Suppose $a \in \mathbb{R}.$ Let $b = \inf f^{-1}((a, \infty)).$ @@ -866,8 +874,13 @@ $$f^{-1}((a, \infty)) = (b, \infty) \cap X \text{ or } f^{-1}((a, \infty)) = [b, holds. Since $X$ is a Borel set, and both $(b, \infty)$ and $[b, \infty)$ are Borel sets, it follows that $f^{-1}((a, \infty))$ is a Borel set. By our earlier {prf:ref}`condition for measurable functions`, $f$ is Borel measurable. + ::: + + + + :::{prf:theorem} Composition of measurable functions Suppose $(X, S)$ is a measurable space and $f: X \to \mathbb{R}$ is a measurable function. @@ -877,25 +890,31 @@ Then $g \circ f: X \to \mathbb{R}$ is a measurable function. ::: :::{dropdown} Proof: Composition of measurable functions + Suppose $(X, S)$ is a measurable space and $f: X \to \mathbb{R}$ is a measurable function. 
Suppose that $g$ is a real-valued Borel measurable function defined on a subset of $\mathbb{R}$ that includes the range of $f.$ Let $B \subseteq \mathbb{R}$ be a Borel set. Because $g$ is a Borel measurable function, and $B$ is a Borel set, $g^{-1}(B)$ is also a Borel set. Because $f$ is a measurable function, and $g^{-1}(B)$ is a Borel set, $f^{-1}(g^{-1}(B))$ is in $S,$ so $g \circ f$ is Borel measurable. + ::: -Measurable functions also satisfy intuitive algebraic properties that are very useful for proving measurability. + + :::{prf:theorem} Algebraic operations with measurable functions + Suppose $(X, S)$ is a measurable space and $f, g: X \to \mathbb{R}$ are $S$-measurable functions. Then (a) $f + g, f - g, f g$ are $S$-measurable functions, (b) if $g(x) \neq 0$ for all $x \in X,$ then $f / g$ is a $S$-measurable function. + ::: :::{dropdown} Proof: Algebraic operations with measurable functions + Suppose $(X, S)$ is a measurable space and $f, g: X \to \mathbb{R}$ are $S$-measurable functions. __Part (a):__ @@ -924,12 +943,14 @@ Suppose $g(x) \neq 0$ for all $x \in X.$ Note that the function $r: \mathbb{R} \setminus \{0\} \to \mathbb{R}$ defined by $r(x) = 1/x$ is a Borel measurable function, because it is continuous on its domain. Then $1/g = r \circ g$ is a composition of Borel measurable functions, so it is a Borel measurable function. Lastly, $f/g$ is a product of two Borel measurable functions, $f$ and $1 / g,$ so it is a Borel measurable function. + ::: -We now prove a very useful result, namely that pointwise limits of measurable functions are measurable. -This is a highly desirable property, and perhaps somewhat surprising that it holds: recall that the pointwise limit of Riemann integrable functions on some interval is not closed under taking pointwise limits. 
+ + :::{prf:theorem} Pointwise limit of $S$-measurable functions is $S$-measurable + Suppose $(X, S)$ is a measurable space and $f_1, f_2, \ldots$ are $S$-measurable functions from $X$ to $\mathbb{R}.$ Suppose $\lim_{k \to \infty} f_k(x)$ exists for each $x \in X.$ Define $f: X \to \mathbb{R}$ by @@ -937,9 +958,14 @@ Define $f: X \to \mathbb{R}$ by $$f(x) = \lim_{k \to \infty} f_k(x).$$ Then $f$ is a $S$-measurable function. + ::: + + + :::{dropdown} Proof: Pointwise limit of $~S$-measurable functions is $~S$-measurable + Suppose $(X, S)$ is a measurable space and $f_1, f_2, \ldots$ are $S$-measurable functions from $X$ to $\mathbb{R}.$ Suppose $\lim_{k \to \infty} f_k(x)$ exists for each $x \in X.$ Define $f: X \to \mathbb{R}$ by @@ -960,28 +986,31 @@ Then there exists $j \in \mathbb{Z}^+$ and $m \in \mathbb{Z}^+$ such that $f_k(x Taking the limit as $k \to \infty,$ we have $f(x) \geq a + 1/j > a,$ so $x \in f^{-1}((a, \infty)).$ We conclude that $f^{-1}((a, \infty))$ is a Borel set and by our earlier {prf:ref}`condition for measurable functions`, $f$ is a Borel measurable function. + ::: -Sometimes, we may need to consider functions which take values in $[-\infty, \infty].$ -We therefore extend the notion of Vorel sets to subsets of $[-\infty, \infty]$ in the following way. -:::{prf:definition} Borel subsets of $[-\infty, \infty]$ + +:::{prf:definition} Borel subsets of $[-infty, \infty]$ + A subset of $[-\infty, \infty]$ is called a Borel subset if its intersection with $\mathbb{R}$ is a Borel set. + ::: -With the above definition in place, we can also extend the definition of measurable functions. -:::{prf:definition} Measurable function on $[-\infty, \infty]$ -:label: mira:def:measurable-function-infinity + +:::{prf:theorem} Measurable function on $[-\infty, \infty]$ + Suppose $(X, \mathcal{S})$ is a measurable space. 
A function $f: X \to [-\infty, \infty]$ is $\mathcal{S}$-measurable if $f^{-1}(B) \in \mathcal{S}$ for every Borel set $B \subseteq [-\infty, \infty].$ + ::: -The following result is the counterpart of our earlier {prf:ref}`sufficient condition for measurability`, but with infinities included in the range of the function. + :::{prf:theorem} Sufficient condition for measurable function :label: mira-thm-sufficient-condition-measurable-with-infinity @@ -1020,8 +1049,7 @@ f^{-1}(B) &= f^{-1}((B \cap \mathbb{R}) \cup (B \cap \{\infty\}) \cup (B \cap \{ so $f$ is $\mathcal{S}$-measurable. ::: -Concluding this section, we show that pointwise infimuma and pointwise supremuma of measurable functions are measurable. -Note that this result would not have made sense before modifying the {prf:ref}`definition of measurability to include infinity`, because the supremum and infimum can be $\infty$ and $- \infty.$ + :::{prf:theorem} Infimum and supremum of a sequence of measurable functions is measurable Suppose $(X, \mathcal{S})$ is a measurable space and $f_1, f_2, \ldots$ is a sequence of $\mathcal{S}$-measurable functions from $X$ to $[-\infty, \infty].$ @@ -1219,178 +1247,3 @@ $$\begin{align} &= \mu(D) + \mu(E) - \mu(D \cap E). \end{align}$$ ::: - - -## Lebesgue measure -```{margin} -Despite the name "outer measure", the outer measure is not in fact a measure, because it is not additive. -Restricting its domain to the set of all subsets of $\mathbb{R}$ gives the Lebesgue measure, which is in fact a measure. -``` -Now we move to define the Lebesgue measure which is central in measure theory. -In short the Lebesgue measure is the modified notion of length we have been building up towards. -Specifically, it will be the outer measure restricted (from the set of all subsets of $\mathbb{R}$) to the Borel sets of $\mathbb{R}.$ -The main result in this section will be proving that the outer measure, when restricted to the Borel sets of $\mathbb{R},$ is in fact a measure. 
- -### Additivity of outer measure on Borel sets -The main task for showing that the outer measure is a measure, when restricted to Borel sets, is to show that the outer measure is additive on the borel $\sigma$-algebra. -We break up this proof into intermediate results, including some results that are useful in subsequent chapters. -First, we show that the outer measure is additive if one of the sets is open. - -:::{prf:theorem} Additivity of outer measure if one of the sets is open -:label: mira:thm:additivity-of-outer-measure-if-one-set-is-open -Suppose $A$ and $G$ are disjoint subsets of $\mathbb{R}$ and $G$ is open. -Then - -$$|A \cup G| = |A| + |G|.$$ -::: - -```{dropdown} Proof: additivity of outer measure if one of the sets is open -First, we can assume that $|G| < \infty,$ because otherwise both sides of the equation above are equal to $\infty.$ - -The {prf:ref}`subadditivity of the outer measure` implies that $|A \cup G| \leq |A| + |G|.$ -It therefore remains to show the inequality in the opposite direction. - -Consider the special case where $G = (a, b),$ for some $a, b \in \mathbb{R}$ with $a < b.$ -Further, we can assume that $a, b \neq A,$ because changing a set by at most two points does not change its outer measure. -Let $I_1, I_2, \dots$ be a sequence of open intervals whose union contains $A \cup G.$ -For each $n \in \mathbb{Z}^+,$ let - -$$J_n = I_n \cap (-\infty, a), K_n = I_n \cap (a, b), L_n = I_n \cap (b, \infty).$$ - -From this definition, we have that - -$$\ell(I_n) = \ell(J_n) + \ell(K_n) + \ell(L_n).$$ - -Note that $J_1, L_1, J_2, L_2, \dots$ is a sequence of open intervals whose union contains $A,$ and $K_1, K_2, \dots$ is a sequence of open intervals whose union contains $G.$ -Thus - -$$\begin{align} -\sum_{n = 1}^\infty \ell(I_n) &= \sum_{n = 1}^\infty (\ell(J_n) + \ell(K_n)) + \sum_{n = 1}^\infty \ell(L_n) \\ -&\geq |A| + |G|. 
-\end{align}$$ - -This inequality implies that $|A \cup G| \geq |A| + |G|$ in the special case that $G$ is an open interval. -Using induction on $m$ we conclude that if $m \in \mathbb{Z}^+$ and $G$ is a union of $m$ disjoint open intervals that are all disjoint from $A,$ then $|A \cup G| = |A| + |G|.$ -Now, suppose that $G$ is an arbitrary open subset of $\mathbb{R}$ that is disjoint from $A.$ -Then $G = \cup_{n=1}^\infty I_n$ for some sequence of disjoint open intervals $I_1, I_2, \dots,$ each of which is disjoint from $A.$ -For each $m \in \mathbb{Z}^+$ we have - -$$\begin{align} -|A \cup G| &\geq \left| A \cup \left(\bigcup_{n=1}^\infty I_n \right) \right| \\ -&\geq |A| + \sum_{n=1}^\infty \ell(I_n) -\end{align}$$ - -which in turn implies that - -$$\begin{align} -|A \cup G| &\geq |A| + \sum_{n=1}^\infty \ell(I_n) \\ -&\geq |A| + |G| -\end{align}$$ - -completing the proof that $|A \cup G| = |A| + |G|$ for the case of a general open set $G.$ -``` - -Then we show that the outer measure is additive if one of the sets is closed. - -```{prf:theorem} Additivity of outer measure if one of the sets is closed -Suppose $A$ and $F$ are disjoint subsets of $\mathbb{R}$ and $F$ is closed. -Then - -$$|A \cup F| = |A| + |F|.$$ -``` - -```{dropdown} Proof: additivity of outer measure if one of the sets is closed -Suppose $I_1, I_2, \dots$ is a sequence of open intervals whose union contains $A \cup F.$ -Let $G = \cup_{n = 1}^\infty I_n.$ -Then $G$ is an open set which contains $A \cup F.$ -Now, note that $G \setminus F = G \cap (\mathbb{R} \setminus F)$ is an intersection of two open sets and is therefore an open set. 
-Applying our previous result showing that the {prf:ref}`outer measure is additive if one of the sets is open` we have that - -$$|G| = |F| + |G \setminus F|.$$ - -Using the fact that $A \subseteq G \setminus F,$ and the above equation, we have that - -$$|G| \geq |F| + |A|,$$ - -which in turn implies that - -$$|F \cup A| \geq |F| + |A|,$$ - -from which we conclude that $|F \cup A| = |F| + |A|.$ -``` - -Now we turn to a very useful result, which says that any Borel set can be approximated by a closed subset arbitrarily well. - -```{margin} -Note that this result would not hold if we replaced closed sets by open sets. -For example, if $B = [0, 1] \setminus \mathbb{Q},$ then the only open subset of $B$ is the empty set, and thus $B$ cannot be approximated arbitrarily well by open subsets. -``` -```{prf:theorem} Approximation of Borel sets from below by closed sets -Suppose $B \subseteq \mathbb{R}$ is a Borel set. -Then, for every $\epsilon > 0,$ there exists a closed set $F \subseteq B$ such that $|B \setminus F| < \epsilon.$ -``` - -```{dropdown} Proof: approximation of Borel sets from below by closed sets -Consider the set - -$$\mathcal{L} = \{D \subseteq \mathbb{R}: \text{ for every } \epsilon > 0, \text{ there exists a closed set } F \subseteq D \text{ such that } |D \setminus F| < \epsilon \}.$$ - -This is the set of all subsets of $\mathbb{R}$ which can be approximated below with closed sets. -Our approach to proving the result will be to show that $\mathcal{L}$ is a $\sigma$-algebra. -Then, noting that $\mathcal{L}$ contains all closed subsets of $\mathbb{R},$ by taking complements we conclude that it must contain all open subsets of $\mathbb{R},$ so it also contains every Borel subset of $\mathbb{R},$ which will complete the proof. - -To show that $\mathcal{L}$ is a $\sigma$-algebra, we will first show that it is closed under countable intersections. 
-Suppose $D_1, D_2, \cdot \subseteq \mathcal{L}.$ -Let $\epsilon > 0.$ -For each $k \in \mathbb{Z}^+,$ there exists a closed set $F_k$ such that - -$$F_k \subseteq D_k \text{ and } |D_k \setminus F_k| < \frac{\epsilon}{2^k}.$$ - -Thus $\cap_{k=1}^\infty F_k$ is a closed set and - -$$\begin{align} -\left(\bigcap_{n = 1}^\infty D_k\right) \setminus \left(\bigcap_{n = 1}^\infty F_k\right) \subseteq \bigcap_{n = 1}^\infty (D_k \setminus F_k) -\end{align}$$ - -from which it follows that - -$$\begin{align} -\left|\left(\bigcap^\infty_{n = 1} D_k\right) \setminus \left(\bigcap^\infty_{n = 1} F_k\right) \right| < \epsilon. -\end{align}$$ - -Thus $\cap_{n=1}^\infty D_k \in \mathcal{L},$ proving that $\mathcal{L}$ is closed under countable intersections. -Now we turn to proving that $\mathcal{L}$ is closed under complementation. -Suppose $D \in \mathcal{L}$ and $\epsilon > 0.$ -We will first consider the case $|D| < \infty.$ -Let $F \subseteq D$ be a closed set such that $|D \setminus F| < \frac{\epsilon}{2}.$ -The definition of outer measure implies that there exists an open set $G$ such that $D \subseteq G$ and $|G| < |D| + \frac{\epsilon}{2}.$ -Therefore we have - -$$\begin{equation} -(\mathbb{R} \setminus D) \setminus (\mathbb{R} \setminus G) \subseteq G \setminus F. -\end{equation}$$ - -Now, the set $G \setminus F$ is open, and using the fact that $|F| > |D| - |D \setminus F|,$ we have - -$$\begin{equation} -|G \setminus F| = |G| - |F| < \left(|D| + \frac{\epsilon}{2}\right) - \left(|D| - \frac{\epsilon}{2}\right) = \epsilon, -\end{equation}$$ - -from which we conclude - -$$\begin{equation} -\left|(\mathbb{R} \setminus D)\right| \setminus (\mathbb{R} \setminus G) \leq \epsilon. 
-\end{equation}$$ - -Therefore $\mathbb{R} \setminus D$ in the case $|D| < \infty.$ -Now, for general $D,$ let $\epsilon > 0$ and define the sets $D_k = D \cap [-k, k]$ for $k \in \mathbb{Z}^+.$ -The previous case implies that $\mathbb{R} \setminus D_k \in \mathcal{L}.$ -Using the fact that $\mathcal{L}$ is closed under intersections and that - -$$\begin{equation} -\mathbb{R} \setminus D = \mathbb{R} \setminus \left(\bigcup_{k=1}^\infty D_k\right) = \bigcap_{n = 1}^\infty (\mathbb{R} \setminus D_k), -\end{equation}$$ - -we conclude that $\mathbb{R} \setminus D \in \mathcal{L}.$ -Thus $\mathcal{L}$ is a $\sigma$-algebra, which concludes the proof. -``` diff --git a/_sources/book/papers/num-sde/num-sde.ipynb b/_sources/book/papers/num-sde/num-sde.ipynb index 997220d5..4e483641 100644 --- a/_sources/book/papers/num-sde/num-sde.ipynb +++ b/_sources/book/papers/num-sde/num-sde.ipynb @@ -20,7 +20,7 @@ "This paper is an accessible introduction to SDEs, which is centered around ten scripts.\n", "Below are reproductions of these scripts (excluding two on linear stability) and some supplementary notes.\n", "\n", - "## Why stochastic differential equations\n", + "## Why Stochastic differential equations\n", "\n", "We are often interested in modelling a system whose state takes values in a continuous range, and over a continuous time domain.\n", "Whereas ordinary differential equations (ODEs) describe variables which change according to a deterministic rule, SDEs describe variables whose change is governed partly by a deterministic component and partly by a stochastic component.\n", diff --git a/_sources/book/papers/swin/swin.ipynb b/_sources/book/papers/swin/swin.ipynb index b7a90501..46dbdc91 100644 --- a/_sources/book/papers/swin/swin.ipynb +++ b/_sources/book/papers/swin/swin.ipynb @@ -13,15 +13,15 @@ "Follow\n", "\n", "\n", - "[Transformers](../transformers/transformers.ipynb) are an extremely flexible deep architecture which has greatly impacted a range of machine 
learning applications.\n", - "Arguably, the impact of the transformer is due to its high modelling capacity and the fact that it can easily be applied to different data modalities with minimal implementation changes.\n", - "The distinguishing feature which gives the transformer these advantages is its attention layer.\n", - "Attention allows the transformer block to update the features of its input tokens in an adaptive way that depends on the features themselves, making the overall architecture extremely flexible.\n", - "Further, after the data have been converted into tokens, attention can be straightforwardly applied and many of the details of the data can be abstracted away, which means that the transformer can be easily applied to a range of modalities such as text, images, graphs and many more.\n", + "[Transformers](../transformers/transformers.ipynb) are an extremely flexible deep architecture which has impacted a range of machine learning applications.\n", + "Arguably, the impact of the transformer is due to a combination of its expressivity and its flexibility: it can easily be applied to different data modalities with minimal implementation changes.\n", + "The feature that distinguishes the transformer over other architectures is its attention layer.\n", + "Attention allows the transformer block to update the features of its input tokens in a way that depends on the features themselves, making the overall architecture highly expressive.\n", + "Further, given pre-tokenised data, attention can be straightforwardly applied and many of the details of the data can be abstracted away, which means that the transformer can be easily applied to a range of modalities such as text, images, graphs and many more.\n", "\n", "However, one important limitation of atention is that its computation and memory costs scale quadratically with the number of tokens.\n", "This makes standard transformers difficult to scale to inputs with many tokens such as, for example, long 
sentences or large images.\n", - "The shifted window transformer (Swin) is an architecture which helps mitigate the issues of computational and memory complexity.\n", + "The shifted window transformer (Swin) {cite}`liu2021swin` is an architecture which helps mitigate the issues of computational and memory complexity.\n", "Swin was originally formulated to tackle image data, on which we focus here, but note that the main idea behind Swin is also applicable to other data modalities such as text or, more generally, any kind of gridded data.\n", "\n", "The innovation of the Swin transformer is to apply attention in a local way, such that only tokens which are near each other (in some appropriate sense of closeness) are allowed to attend to one another.\n", @@ -35,7 +35,13 @@ "source": [ "## Windowed self-attention\n", "The main bottleneck in scaling transformers to large numbers of tokens is the self-attention operation.\n", - "Given $N$ tokens, simply building an $N \\times N$ attention matrix requires $\\mathcal{O}(N^2)$ compute and memory, which quickly gets very expensive for large images.\n", + "Given $N$ tokens, simply building an $N \\times N$ attention matrix requires $\\mathcal{O}(N^2)$ compute and memory, which gets very expensive very quickly for large images.\n", + "\n", + "\n", + ":::{margin}\n", + "Note that we are using the words _chunks_ and _windows_ to distinguish from the word _patches_.\n", + "A window in a Swin transformer may contain several patches, each of which may contain several input tokens, e.g. pixels.\n", + ":::\n", "The idea behind the Swin transformer is to modify the self-attention operation, by breaking up an input image into smaller chunks, or windows, and having each token attend to all other tokens within its window, and to no other tokens outside it, as illustrated in {numref}`swin:vit_vs_swin`." 
] }, @@ -72,9 +78,14 @@ "metadata": {}, "source": [ "## Shifted windows\n", - "In order to allow information to propagate across windows, Swin uses shifted windows.\n", - "Specifically, each windowed transformer block is followed by another transformer block with shifted windows, where the amount of shift is set to be half the window size, as illustrated by {numref}`swin:shifting`.\n", - "Since the original and shifted windows overlap, information can propagate across them.\n", + ":::{margin}\n", + "Note that applying a transformer block with shifted attention windows is equivalent to shifting the image first and applying the transformer block on the original windows.\n", + ":::\n", + "In order to allow information to propagate across windows, Swin uses window shifting.\n", + "Specifically, each windowed transformer block is followed by a window shifting operation.\n", + "The next windowed transformer block is applied on the shifted windows and is followed by a reverse shift operation, which brings the windows back to their original positions.\n", + "This is illustrated by {numref}`swin:shifting`.\n", + "Since the original and shifted windows overlap, information can propagate across tokens from different windows.\n", "\n", "```{figure} ./swin_two_windows.png\n", "---\n", @@ -85,10 +96,10 @@ "Transformer blocks with standard attention windows (first) are interleaved with blocks using shifted windows (second).\n", "This enables information to propagate across windows.\n", "```\n", - "The question then becomes how to handle windows which are cut short across the border of the image.\n", - "Swin handles this by grouping these edge windows together by using cyclic shifts.\n", - "First, the input image is shifted vertically and horizontally by half the window size.\n", - "This aligns \n", + "\n", + "The only remaining question is how to handle windows that are cut short at the image boundaries when applying a shifting operation.\n", + "Swin handles this by applying 
cyclic boundary conditions, which group together the window parts that are cut short at the boundaries.\n", + "This is illustrated by {numref}`swin:cyclic`.\n", "\n", "```{figure} ./cyclic_shifts.png\n", "---\n", @@ -96,158 +107,33 @@ "name: swin:cyclic\n", "---\n", "Illustration of a shifted window block in Swin.\n", - "Colours illustrate the groups of tokens which attend to one another.\n", - "The first shift here is two pixels (half the window size of four pixels) downwards and rightwards.\n", - "The windowed transformer block is applied in the shifted space modifying the values of the tokens (indicated by heavier coloured entries).\n", - "The second, reverse, shift here is two pixels up\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Patch merging" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Attention vs convolutions" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - ":::{dropdown} An aside: reverse of patch extraction as transpose operation\n", - "\n", - "__Invertibility.__\n", - "Under certain conditions patch extraction can be undone, leveraging the fact that the reverse operation is actually its transpose.\n", - "However, there are exaples of patch extraction which cannot be reversed.\n", - "For example, if we use large strides and small patch sizes, or if we use dilation (also known as à trous; this is the `rate` argument in `tf.image.extract_patches`), then some of the pixels of the input image may be skipped over, and not affect the output at all.\n", - "This means the patch extraction operation cannot be undone.\n", - "From now on, we will assume that the patch extraction is performed using a combination of patch size, striding and dilation such that every pixel in the input affects the output, and no pixels are skipped over.\n", - "\n", - "__Matrix view.__\n", - "Suppose the input image $x$ is an array of shape $(H, W, C),$ where $H$ is its height, $W$ 
is its width and $C$ is the number of channels.\n", - "We can view patch extraction as a linear operation $P$ which maps the image $x$ to another image $y = P(x),$ of shape $(I, J, K),$ where the dimensions denote the horizontal patch index, vertical patch index and patch dimension respectively.\n", - "Note that by the definition of patch extraction, each pixel in the output is affected by exactly one pixel in the input (though each pixel in the input may affect more than one pixel in the output).\n", - "For now, let us assume that each pixel in the input affects exactly one pixel in the output, i.e. each $x_{hwc}$ affects exactly one $y_{ijk}.$\n", - "Let's define $\\texttt{flat}$ to be the operation that takes a multi-dimensional array and flattens it into a single-dimensional array.\n", - "Since $\\texttt{flat}$ is an invertible linear operation, and patch extraction is also an invertible linear operation, then their composition is also an invertible linear opeartion $P'.$\n", - "Let $\\mathcal{M}(P')$ be the matrix corresponding to $P',$ so that\n", - "\n", - "$$\\texttt{flat}(y) = \\mathcal{M}(P') \\texttt{flat}(x).$$\n", - "\n", - "Now note that since each element of $\\texttt{flat}(x)$ affects precisely one element of $\\texttt{flat}(y),$ the matrix $\\mathcal{M}(P')$ is actually a permutation matrix, i.e. 
it contains exactly one $1$ in each row and in each column, and all other entries are $0.$\n", - "Now, the inverse of a permutation matrix is its transpose, that is\n", - "\n", - "$$\\texttt{flat}(x) = \\mathcal{M}(P')^\\top \\texttt{flat}(y).$$\n", - "\n", - "So the inverse of the patch embedding operation is actually its transpose operation, and that's because the patch embedding is actually a permutation!\n", - "\n", - "__General case.__\n", - "But what happens in the more general case where some $x_{hwc}$ affect more than one $y_{ijk}$?\n", - "The equation $\\texttt{flat}(y) = \\mathcal{M}(P') \\texttt{flat}(x)$ still holds, and the matrix $\\mathcal{M}(P')$ stil contains a single $1$ per row.\n", - "However, $\\mathcal{M}(P')$ may now contain multiple $1$ entries in each column.\n", - "Therefore, it is no longer a permutation matrix, and $\\texttt{flat}(x) \\neq \\mathcal{M}(P')^\\top \\texttt{flat}(y).$\n", - "In particular, when we multiply $\\texttt{flat}(y)$ by $\\mathcal{M}(P')^\\top,$ each entry of $\\texttt{flat}(x)$ which affects $n$ entries in $\\texttt{flat}(y)$ will be counted $n$ times.\n", - "To illustrate this problem, suppose consider the following vectors and matrices (which do not correspond to patch extraction, and are just an illustration of the problem) and the following matrix-vector multiplication\n", - "\n", - "$$\n", - "u = \\begin{bmatrix}\n", - "u_1 \\\\\n", - "u_2\n", - "\\end{bmatrix}, M = \\begin{bmatrix}\n", - "1 & 0 \\\\\n", - "0 & 1 \\\\\n", - "0 & 1\n", - "\\end{bmatrix} \\implies v = Mu = \\begin{bmatrix}\n", - "u_1 \\\\\n", - "u_2 \\\\\n", - "u_2\n", - "\\end{bmatrix}.\n", - "$$\n", - "\n", - "If we multiply $v = Mu$ by $M^\\top$ we obtain\n", - "\n", - "$$\n", - "M^\\top v = M^\\top Mu = \\begin{bmatrix}\n", - "1 & 0 & 0\\\\\n", - "0 & 1 & 1\n", - "\\end{bmatrix} \\begin{bmatrix}\n", - "u_1 \\\\\n", - "u_2 \\\\\n", - "u_2\n", - "\\end{bmatrix} = \\begin{bmatrix}\n", - "u_1 \\\\\n", - "2u_2\n", - "\\end{bmatrix},\n", - 
"$$\n", - "\n", - "i.e. we have double-counted $u_2.$\n", - "If instead we divide each row of $M^\\top$ by its sum before multiplying we obtain\n", - "\n", - "$$\n", - "M^\\top v = \\begin{bmatrix}\n", - "1 & 0 & 0\\\\\n", - "0 & \\frac{1}{2} & \\frac{1}{2}\n", - "\\end{bmatrix} \\begin{bmatrix}\n", - "u_1 \\\\\n", - "u_2 \\\\\n", - "u_2\n", - "\\end{bmatrix} = \\begin{bmatrix}\n", - "u_1 \\\\\n", - "u_2\n", - "\\end{bmatrix},\n", - "$$\n", - "\n", - "which is the desired result.\n", - "Therefore, all we have to do is divide each row of $M^\\top$ by its sum, or alternatively divide each entry of $M^\\top v$ by the sum of each row of $M^\\top.$\n", - "\n", - "__Implementation.__\n", - "We can perform all of the above in a few lines in Tensorflow, or a similar autodiff framework, as shown above.\n", - "First, the command\n", - "\n", - "```\n", - "tf.gradients(y, x, grad_ys=y)[0]\n", - "```\n", - "\n", - "computes the gradients of the scalar `sum(y)` with respect to `x` and multiplies these gradients together with the corresponding entries in `grad_ys`, as performed in revrse mode differentiation.\n", - "This is equivalent to the multiplication $\\mathcal{M}(P')^\\top \\texttt{flat}(y)$ except the tensors are not actually flattened, but retain their original shapes.\n", - "Then, the command\n", - "\n", - "```\n", - "tf.gradients(y, x)[0]\n", - "```\n", - "\n", - "computes the derivative of `sum(y)` with respect to `x`.\n", - "If an entry in `x` affects $n$ entries in `y`, then the gradient of `sum(y)` with respect to that entry of `x` will be $n,$ so the division\n", - "\n", - "```\n", - "tf.gradients(y, x, grad_ys=y)[0] / tf.gradients(y, x)[0]\n", + "The colours correspond to groups of tokens which attend to one another within the shifted block.\n", + "Faint colours denote pixel values before and after the application of the transformer block itself.\n", + "Given an input image (first) we apply a shift of two pixels (half the window size of four pixels) downwards and 
rightwards, which groups together tokens along the boundaries of the image (second).\n", + "We then apply the transformer block to the shifted image (third) followed by the reverse shifting operation, which brings back the tokens in their original positions (right).\n", + "This procedure, i.e. shifting the image and applying a transformer block with regular attention windows, is equivalent to applying a transformer block with shifted attention windows.\n", "```\n", "\n", - "gives the result we were after.\n", - ":::" + "With these ideas in mind, we are ready to implement a Swin transformer!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Here is an illustration of the application of a shifted multi-head self-attention layer.\n", - "\n" + "## Implementation\n", + "\n", + "The implementation of a Swin transformer is relatively simple, and requires only a minimal set of changes over a regular transformer.\n", + "In fact, we can reuse most of the code from the [introduction to transformers example](../transformers/transformers.ipynb) (see dropdown below)." ] }, { "cell_type": "code", - "execution_count": 185, + "execution_count": 2, "metadata": { "tags": [ - "remove-cell" + "remove-output", + "hide-input" ] }, "outputs": [], @@ -260,27 +146,8 @@ "tfk = tf.keras\n", "\n", "# Type for random seed\n", - "Seed = [tf.Tensor, tf.Tensor]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Most of the implementation for a Swin transformer is identical to that of a standard transformer, so we will reuse most of the code from the [introduction to transformers example](../transformers/transformers.ipynb)." 
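As a framework-free sketch of the shift-then-window idea, the snippet below cyclically shifts a small 4x4 grid of token indices and splits it into 2x2 windows. The helper names `roll2d` and `extract_windows_2d`, the grid size and the use of plain Python lists are assumptions made purely for illustration; the notebook's own implementation operates on TensorFlow tensors.

```python
from typing import List

Grid = List[List[int]]


def roll2d(grid: Grid, shift: int) -> Grid:
    """Cyclically shift a 2D grid by `shift` rows down and `shift` columns right."""
    height, width = len(grid), len(grid[0])
    return [
        [grid[(r - shift) % height][(c - shift) % width] for c in range(width)]
        for r in range(height)
    ]


def extract_windows_2d(grid: Grid, size: int) -> List[List[int]]:
    """Split a grid into non-overlapping size x size windows, each flattened."""
    height, width = len(grid), len(grid[0])
    return [
        [grid[r + a][c + b] for a in range(size) for b in range(size)]
        for r in range(0, height, size)
        for c in range(0, width, size)
    ]


# A 4x4 "image" whose entries are the token indices 0..15 (hypothetical data).
image = [[4 * r + c for c in range(4)] for r in range(4)]
window_size = 2

# Regular windows: tokens only attend within their own 2x2 block.
regular = extract_windows_2d(image, window_size)

# Shifted windows: roll by half the window size, then window as usual.
shifted = extract_windows_2d(roll2d(image, window_size // 2), window_size)

# Rolling back by the same amount restores the original token positions.
restored = roll2d(roll2d(image, window_size // 2), -(window_size // 2))

print(regular)
print(shifted)
```

In this toy case the first shifted window groups the four corner tokens, which is the boundary wrap-around described in the figure caption, while the last shifted window groups the four central tokens that belonged to four different regular windows.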
- ] - }, - { - "cell_type": "code", - "execution_count": 186, - "metadata": { - "tags": [ - "remove-output", - "hide-input" - ] - }, - "outputs": [], - "source": [ + "Seed = [tf.Tensor, tf.Tensor]\n", + "\n", "class SelfAttention(tfk.Model):\n", "\n", " def __init__(\n", @@ -513,8 +380,7 @@ " )\n", "\n", " def call(self, x: tf.Tensor) -> tf.Tensor:\n", - " \"\"\"\n", - " Tokenise the image `x`, applying a strided convolution.\n", + " \"\"\"Tokenise the image `x`, applying a strided convolution.\n", " This is equivalent to splitting the image into patches,\n", " and then linearly projecting each one of these using a\n", " shared linear projection.\n", @@ -558,8 +424,7 @@ " )\n", "\n", " def call(self, x: tf.Tensor) -> tf.Tensor:\n", - " \"\"\"\n", - " Add position embeddings to input tensor.\n", + " \"\"\"Add position embeddings to input tensor.\n", "\n", " Arguments:\n", " x: input tensor of shape (B, H, W, D)\n", @@ -570,39 +435,27 @@ " return x + self.embeddings[None, :, :, :]" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Window extraction\n", + "\n", + "The only two Swin-specific pieces we need are methods for extracting and combining windows from a given image, and a method for shifting the image.\n", + "To extract attention windows of size `(S, S)` from a batch of images of shape `(B, H, W, D)`, we can reshape the input batch to `(B, H//S, S, W//S, S, D)`.\n", + "Indexing `[n, h, a, w, b, d]` in the resulting array corresponds to indexing the `[a, b]` entry within the window indexed by `[h, w]` within the image indexed by `[n]`.\n", + "We can transpose the resulting array to `(B, H//S, W//S, S, S, D)` and reshape it into `(B*(H//S)*(W//S), S**2, D)`, which folds all the window indices into the batch index and flattens each window into a sequence of `S**2` tokens.\n", + "Applying a regular all-to-all transformer block to the resulting array is equivalent to windowed attention where all entries in a window attend to one another, but there is no attention across windows, over 
which we parallelise together with the batch dimension." + ] + }, { "cell_type": "code", - "execution_count": 187, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ - "def shift_horizontally_and_vertically(x: tf.Tensor, shift: int) -> tf.Tensor:\n", - " \"\"\"\n", - " Shift windows in the input tensor `x` by shift along its width and height.\n", - " For example, using shift == 1 (and neglecting the B and D dimensions),\n", - " the H and W dimensions would change as follows:\n", - " \n", - " Original Shifted\n", - " ----------------- -----------------\n", - " | x x x o | | * + + + |\n", - " | x x x o | | o x x x |\n", - " | x x x o | | o x x x |\n", - " | + + + * | | o x x x |\n", - " ----------------- -----------------\n", - " \n", - " Arguments:\n", - " x: input tensor of shape (B, H, W, D)\n", - " shift: amount of shift to apply\n", - "\n", - " Returns:\n", - " output tensor of shape (B, H, W, D)\n", - " \"\"\"\n", - " return tf.roll(tf.roll(x, shift, axis=1), shift, axis=2)\n", - "\n", - "\n", - "def extract_windows(x: tf.Tensor) -> tf.Tensor:\n", - " \"\"\"\n", - " Extract non-overlapping windows from input tensor `x`.\n", + "def extract_windows(x: tf.Tensor, window_size: int) -> tf.Tensor:\n", + " \"\"\"Extract non-overlapping windows from input tensor `x`.\n", "\n", " Arguments:\n", " x: input tensor of shape\n", @@ -614,16 +467,20 @@ " H = tf.shape(x)[1]\n", " W = tf.shape(x)[2]\n", " D = tf.shape(x)[3]\n", + " S = window_size\n", "\n", - " x = tf.reshape(x, [B, H//2, 2, W//2, 2, D])\n", - " x = tf.transpose(x, [0, 1, 3, 2, 4, 5]) # (B, H//2, W//2, 2, 2, D)\n", - " x = tf.reshape(x, [B*H//2*W//2, 4, D])\n", - " return (B, H, W, D), x\n", + " x = tf.reshape(x, [B, H//S, S, W//S, S, D])\n", + " x = tf.transpose(x, [0, 1, 3, 2, 4, 5]) # (B, H//S, W//S, S, S, D)\n", + " x = tf.reshape(x, [B*(H//S)*(W//S), S**2, D])\n", + " return x\n", "\n", "\n", - "def combine_windows(x: tf.Tensor, original_shape: tf.Tensor) -> tf.Tensor:\n", - " \"\"\"\n", - 
" Combine windows extracted from input tensor `x`.\n", + "def combine_windows(\n", + " x: tf.Tensor,\n", + " original_shape: tf.Tensor,\n", + " window_size: int,\n", + " ) -> tf.Tensor:\n", + " \"\"\"Combine windows extracted from input tensor `x`.\n", "\n", " Arguments:\n", " x: input tensor of shape (B*(H//2)*(W//2), 4, D)\n", @@ -632,66 +489,74 @@ " Returns:\n", " output tensor of shape (B, H, W, D)\n", " \"\"\"\n", - " B, H, W, D = original_shape\n", - " x = tf.reshape(x, [B, H//2, W//2, 2, 2, D])\n", - " x = tf.transpose(x, [0, 1, 3, 2, 4, 5]) # (B, H//2, 2, W//2, 2, D)\n", + " B = original_shape[0]\n", + " H = original_shape[1]\n", + " W = original_shape[2]\n", + " D = original_shape[3]\n", + " S = window_size\n", + "\n", + " x = tf.reshape(x, [B, H//S, W//S, S, S, D])\n", + " x = tf.transpose(x, [0, 1, 3, 2, 4, 5]) # (B, H//S, S, W//S, S, D)\n", " x = tf.reshape(x, [B, H, W, D])\n", " return x" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Window shifting\n", + "\n", + "The only remaining part is the shifting operation itself.\n", + "We can straightforwardly achieve this by applying cyclic boundary conditions using `tf.roll` across the image dimensions." 
+ ] + }, { "cell_type": "code", - "execution_count": 206, + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ - "class PatchMergingLayer(tfk.Model):\n", - "\n", - " def __init__(\n", - " self,\n", - " seed: Seed,\n", - " num_out_features: int,\n", - " name: str = \"patch_merging\",\n", - " **kwargs,\n", - " ):\n", - " super().__init__(name=name, **kwargs)\n", - "\n", - " self.linear = tfk.layers.Dense(\n", - " num_out_features,\n", - " activation=None,\n", - " use_bias=False,\n", - " kernel_initializer=tf.initializers.GlorotNormal(seed=int(seed[0])),\n", - " )\n", - "\n", - " self.num_out_features = num_out_features\n", + "def shift_horizontally_and_vertically(x: tf.Tensor, shift: int) -> tf.Tensor:\n", + " \"\"\"Shift windows in the input tensor `x` by shift along its width and\n", + " height. For example, using shift == 1 (and fixing an index for the B and D\n", + " dimensions), the corresponding image would change as follows:\n", " \n", - " def call(self, x: tf.Tensor) -> tf.Tensor:\n", - " \"\"\"\n", - " Apply patch merging to input tensor `x`.\n", + " Original Shifted\n", + " ----------------- -----------------\n", + " | x x x o | | * + + + |\n", + " | x x x o | | o x x x |\n", + " | x x x o | | o x x x |\n", + " | + + + * | | o x x x |\n", + " ----------------- -----------------\n", + " \n", + " Arguments:\n", + " x: input tensor of shape (B, H, W, D)\n", + " shift: amount of shift to apply\n", "\n", - " Arguments:\n", - " x: input tensor of shape (B, H, W, D)\n", + " Returns:\n", + " output tensor of shape (B, H, W, D)\n", + " \"\"\"\n", + " return tf.roll(tf.roll(x, shift, axis=1), shift, axis=2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Swin transformer block\n", "\n", - " Returns:\n", - " output tensor of shape (B, H//2, W//2, 2*D)\n", - " \"\"\"\n", - " x = tf.image.extract_patches(\n", - " x,\n", - " sizes=[1, 2, 2, 1],\n", - " strides=[1, 2, 2, 1],\n", - " rates=[1, 1, 1, 1],\n", - " 
padding=\"VALID\",\n", - " ) # (B, H//2, W//2, 4*D)\n", - " return self.linear(x) # (B, H//2, W//2, 2*D)" + "We can put the above together into a Swin transformer block.\n", + "This applies a regular transformer block, followed by another transformer block sandwiched between a window shifting operation and its inverse." ] }, { "cell_type": "code", - "execution_count": 216, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ - "class SwinTransformerBlockStage(tfk.Model):\n", + "class SwinTransformerBlock(tfk.Model):\n", " def __init__(\n", " self,\n", " seed: Seed,\n", @@ -700,6 +565,7 @@ " mlp_num_layers: int,\n", " num_heads: int,\n", " num_block_pairs: int,\n", + " window_size: int,\n", " name: str = \"swin_transformer_block\", \n", " **kwargs,\n", " ):\n", @@ -729,38 +595,231 @@ " for i in range(num_block_pairs)\n", " ]\n", "\n", + " self.window_size = window_size\n", + "\n", + "\n", + " def call(self, x: tf.Tensor) -> tf.Tensor:\n", + " \"\"\"Apply the Swin Transformer block to input tokens `x`.\n", + "\n", + " Arguments:\n", + " x: input tensor of shape (B, H, W, D)\n", + "\n", + " Returns:\n", + " output tensor of shape (B, H, W, D)\n", + " \"\"\"\n", + " original_shape = tf.shape(x)\n", + " S = self.window_size\n", + "\n", + " for first_block, second_block in zip(self.first_blocks, self.second_blocks):\n", + "\n", + " # Apply first transformer block, extracting windows, applying the\n", + " # transformer block to them, and re-combining them to the original image.\n", + " x = extract_windows(x, S) # (B*H//S*W//S, S**2, D)\n", + " x = first_block(x) # (B*H//S*W//S, S**2, D)\n", + " x = combine_windows(x, original_shape, S) # (B, H, W, D)\n", + "\n", + " # Apply second transformer block same as the first block, but shifting\n", + " # the windows before and after the block.\n", + " x = shift_horizontally_and_vertically(x, S // 2) # (B, H, W, D)\n", + " x = extract_windows(x, S) # (B*H//S*W//S, S**2, D)\n", + " x = second_block(x) # (B*H//S*W//S, 
S**2, D)\n", + " x = combine_windows(x, original_shape, S) # (B, H, W, D)\n", + " x = shift_horizontally_and_vertically(x, -(S // 2)) # (B, H, W, D)\n", + "\n", + " return x" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One last, optional, ingredient that is not specific to Swin is patch merging, which can be viewed as a pooling operation, similar to mean-pooling.\n", + "Patch merging combines a collection of patches by concatenating their features channel-wise and then applying a linear operation to project the concatenated `4*D` features down to `num_out_features` channels (`2*D` in this example)." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "tags": [ + "hide-input", + "remove-output" + ] + }, + "outputs": [], + "source": [ + "class PatchMergingLayer(tfk.Model):\n", + "\n", + " def __init__(\n", + " self,\n", + " seed: Seed,\n", + " num_out_features: int,\n", + " name: str = \"patch_merging\",\n", + " **kwargs,\n", + " ):\n", + " super().__init__(name=name, **kwargs)\n", + "\n", + " self.linear = tfk.layers.Dense(\n", + " num_out_features,\n", + " activation=None,\n", + " use_bias=False,\n", + " kernel_initializer=tf.initializers.GlorotNormal(seed=int(seed[0])),\n", + " )\n", + "\n", + " self.num_out_features = num_out_features\n", + " \n", + " def call(self, x: tf.Tensor) -> tf.Tensor:\n", + " \"\"\"Apply patch merging to input tensor `x`.\n", + "\n", + " Arguments:\n", + " x: input tensor of shape (B, H, W, D)\n", + "\n", + " Returns:\n", + " output tensor of shape (B, H//2, W//2, 2*D)\n", + " \"\"\"\n", + " x = tf.image.extract_patches(\n", + " x,\n", + " sizes=[1, 2, 2, 1],\n", + " strides=[1, 2, 2, 1],\n", + " rates=[1, 1, 1, 1],\n", + " padding=\"VALID\",\n", + " ) # (B, H//2, W//2, 4*D)\n", + " return self.linear(x) # (B, H//2, W//2, 2*D)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + ":::{dropdown} An aside: inverse of patch extraction as 
transpose\n", + "\n", + "While we do not need this, here is a neat aside on patch extraction, the `tf.image.extract_patches` operation in the `PatchMergingLayer` above.\n", + "One might wonder what the inverse of a patch extraction operation is, or rather how to implement it efficiently.\n", + "It turns out that patch extraction operations can't always be inverted, i.e. not for all choices of striding, window sizes and dilation, but for those settings for which an inverse exists, the inverse can be expressed and implemented in a very neat way.\n", + "\n", + "__Invertibility.__\n", + "Under certain conditions patch extraction can be inverted, leveraging the fact that the inverse operation is actually its transpose.\n", + "However, there are examples of patch extraction which are not invertible.\n", + "For example, if we use large strides and small patch sizes, or if we use dilation (also known as à trous; this is the `rates` argument in `tf.image.extract_patches`), then some of the pixels of the input image may be skipped over, making patch extraction non-invertible.\n", + "From now on, we will assume that the patch extraction is performed using a combination of patch size, striding and dilation such that every pixel in the input affects the output, and no pixels are skipped over.\n", + "\n", + "__Matrix view.__\n", + "Suppose the input image $x$ is an array of shape $(H, W, C),$ where $H$ is its height, $W$ is its width and $C$ is the number of channels.\n", + "We can view patch extraction as a linear operation $P$ which maps the image $x$ to another image $y = P(x),$ of shape $(I, J, K),$ where the dimensions denote the vertical patch index, horizontal patch index and patch dimension respectively.\n", + "Note that by the definition of patch extraction, each pixel in the output is affected by exactly one pixel in the input (though each pixel in the input may affect more than one pixel in the output).\n", + "For now, let us assume that each pixel in the input affects 
exactly one pixel in the output, i.e. each $x_{hwc}$ affects exactly one $y_{ijk}.$\n", + "Let's define $\\texttt{flat}$ to be the operation that takes a multi-dimensional array and flattens it into a single-dimensional array.\n", + "Since $\\texttt{flat}$ is an invertible linear operation, and patch extraction is also an invertible linear operation, their composition is also an invertible linear operation $P'.$\n", + "Let $\\mathcal{M}(P')$ be the matrix corresponding to $P',$ so that\n", + "\n", + "$$\\texttt{flat}(y) = \\mathcal{M}(P') \\texttt{flat}(x).$$\n", + "\n", + "Now note that since each element of $\\texttt{flat}(x)$ affects precisely one element of $\\texttt{flat}(y),$ the matrix $\\mathcal{M}(P')$ is actually a permutation matrix, i.e. it contains exactly one $1$ in each row and in each column, and all other entries are $0.$\n", + "Now, the inverse of a permutation matrix is its transpose, that is\n", + "\n", + "$$\\texttt{flat}(x) = \\mathcal{M}(P')^\\top \\texttt{flat}(y).$$\n", + "\n", + "So the inverse of the patch extraction operation is actually its transpose operation, and that's because, in this case, patch extraction is actually a permutation!\n", + "\n", + "__General case.__\n", + "But what happens in the more general case where some $x_{hwc}$ affect more than one $y_{ijk}$?\n", + "The equation $\\texttt{flat}(y) = \\mathcal{M}(P') \\texttt{flat}(x)$ still holds, and the matrix $\\mathcal{M}(P')$ still contains a single $1$ per row.\n", + "However, $\\mathcal{M}(P')$ may now contain multiple $1$ entries in each column.\n", + "Therefore, it is no longer a permutation matrix, and $\\texttt{flat}(x) \\neq \\mathcal{M}(P')^\\top \\texttt{flat}(y).$\n", + "In particular, when we multiply $\\texttt{flat}(y)$ by $\\mathcal{M}(P')^\\top,$ each entry of $\\texttt{flat}(x)$ which affects $n$ entries in $\\texttt{flat}(y)$ will be counted $n$ times.\n", + "To illustrate this problem, consider the following vectors and matrices (which do not correspond 
to patch extraction, and are just an illustration of the problem) and the following matrix-vector multiplication\n", + "\n", + "$$\n", + "u = \\begin{bmatrix}\n", + "u_1 \\\\\n", + "u_2\n", + "\\end{bmatrix}, M = \\begin{bmatrix}\n", + "1 & 0 \\\\\n", + "0 & 1 \\\\\n", + "0 & 1\n", + "\\end{bmatrix} \\implies v = Mu = \\begin{bmatrix}\n", + "u_1 \\\\\n", + "u_2 \\\\\n", + "u_2\n", + "\\end{bmatrix}.\n", + "$$\n", + "\n", + "If we multiply $v = Mu$ by $M^\\top$ we obtain\n", + "\n", + "$$\n", + "M^\\top v = M^\\top Mu = \\begin{bmatrix}\n", + "1 & 0 & 0\\\\\n", + "0 & 1 & 1\n", + "\\end{bmatrix} \\begin{bmatrix}\n", + "u_1 \\\\\n", + "u_2 \\\\\n", + "u_2\n", + "\\end{bmatrix} = \\begin{bmatrix}\n", + "u_1 \\\\\n", + "2u_2\n", + "\\end{bmatrix},\n", + "$$\n", + "\n", + "i.e. we have double-counted $u_2.$\n", + "If instead we divide each row of $M^\\top$ by its sum before multiplying we obtain\n", + "\n", + "$$\n", + "M^\\top v = \\begin{bmatrix}\n", + "1 & 0 & 0\\\\\n", + "0 & \\frac{1}{2} & \\frac{1}{2}\n", + "\\end{bmatrix} \\begin{bmatrix}\n", + "u_1 \\\\\n", + "u_2 \\\\\n", + "u_2\n", + "\\end{bmatrix} = \\begin{bmatrix}\n", + "u_1 \\\\\n", + "u_2\n", + "\\end{bmatrix},\n", + "$$\n", "\n", - " def call(self, x: tf.Tensor) -> tf.Tensor:\n", - " \"\"\"\n", - " Apply the Swin Transformer block to input tokens `x`.\n", + "which is the desired result.\n", + "Therefore, all we have to do is divide each row of $M^\\top$ by its sum, or alternatively divide each entry of $M^\\top v$ by the sum of each row of $M^\\top.$\n", "\n", - " Arguments:\n", - " x: input tensor of shape (B, H, W, D)\n", + "__Implementation.__\n", + "We can perform all of the above in a few lines in Tensorflow, or a similar autodiff framework, as shown above.\n", + "First, the command\n", "\n", - " Returns:\n", - " output tensor of shape (B, H, W, D)\n", - " \"\"\"\n", + "```\n", + "tf.gradients(y, x, grad_ys=y)[0]\n", + "```\n", "\n", - " for first_block, second_block in zip(self.first_blocks, 
self.second_blocks):\n", + "computes the vector-Jacobian product of `grad_ys` with the Jacobian of `y` with respect to `x`, i.e. the gradient of each entry of `y` with respect to `x` is weighted by the corresponding entry of `grad_ys` and the results are summed, as performed in reverse-mode differentiation.\n", + "This is equivalent to the multiplication $\\mathcal{M}(P')^\\top \\texttt{flat}(y)$ except the tensors are not actually flattened, but retain their original shapes.\n", + "Then, the command\n", "\n", - " original_shape, x = extract_windows(x) # (B*H//2*W//2, 4, D)\n", - " x = first_block(x) # (B*H//2*W//2, 4, D)\n", - " x = combine_windows(x, original_shape) # (B, H, W, D)\n", - " x = shift_horizontally_and_vertically(x, 1) # (B, H, W, D)\n", + "```\n", + "tf.gradients(y, x)[0]\n", + "```\n", "\n", - " original_shape, x = extract_windows(x) # (B*H//2*W//2, 4, D)\n", - " x = second_block(x) # (B*H//2*W//2, 4, D)\n", - " x = combine_windows(x, original_shape) # (B, H, W, D)\n", - " x = shift_horizontally_and_vertically(x, -1) # (B, H, W, D)\n", + "computes the derivative of `sum(y)` with respect to `x`.\n", + "If an entry in `x` affects $n$ entries in `y`, then the gradient of `sum(y)` with respect to that entry of `x` will be $n,$ so the division\n", "\n", - " x = combine_windows(x, original_shape) # (B, H, W, D)\n", + "```\n", + "tf.gradients(y, x, grad_ys=y)[0] / tf.gradients(y, x)[0]\n", + "```\n", "\n", - " return x" + "gives the result we were after.\n", + ":::" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Putting it together\n", + "We are now ready to put all of the above together into a `TinySwinTransformer` model which handles tokenisation and position embeddings, and then applies the `SwinTransformerBlock`." 
] }, { "cell_type": "code", - "execution_count": 217, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ @@ -772,6 +831,7 @@ " tokeniser: ImageTokeniser,\n", " embedding: PositionEmbedding,\n", " token_dimension: int,\n", + " window_size: int,\n", " mlp_num_hidden: int,\n", " mlp_num_layers: int,\n", " num_heads: int,\n", @@ -796,13 +856,14 @@ " )\n", " )\n", " self.stages.append(\n", - " SwinTransformerBlockStage(\n", + " SwinTransformerBlock(\n", " seed=seeds[2*i],\n", " token_dimension=(2**i)*token_dimension,\n", " mlp_num_hidden=mlp_num_hidden,\n", " mlp_num_layers=mlp_num_layers,\n", " num_heads=num_heads,\n", " num_block_pairs=n,\n", + " window_size=window_size,\n", " )\n", " )\n", "\n", @@ -843,14 +904,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Dataset\n", + "## Example application\n", + "Now let's train and evaluate our `TinySwinTransformer` on an equally tiny example.\n", "Because this is meant to be a demo that should run on a laptop, we'll use the MNIST dataset.\n", - "We'll use [tensorflow datasets](https://www.tensorflow.org/datasets/api_docs/python/tfds) to load the data and preprocess it." + "\n", + "\n", + "### Dataset\n", + "We'll use [tensorflow datasets](https://www.tensorflow.org/datasets/api_docs/python/tfds) to load and preprocess the MNIST data." ] }, { "cell_type": "code", - "execution_count": 218, + "execution_count": 8, "metadata": { "tags": [ "hide-cell" @@ -894,14 +959,14 @@ "source": [ "### Training\n", "Now let's train the network.\n", - "In general, when training a ViT, a few tricks are typically used, including for example, learning rate scheduling and data augmentation.\n", + "In general, when training a vision transformer, a few tricks are typically used, including for example, learning rate scheduling and data augmentation.\n", "Dropout is also sometimes used in the architecture itself.\n", - "We won't use any of these techniques here to keep it simple." 
+ "We won't use any of these techniques here to keep things simple." ] }, { "cell_type": "code", - "execution_count": 250, + "execution_count": 9, "metadata": { "tags": [ "hide-input", @@ -935,6 +1000,7 @@ "num_mlp_layers = 1\n", "num_heads = 8\n", "num_classes = 10\n", + "window_size = 2\n", "num_blocks_per_stage = [1, 1, 1, 1]\n", "\n", "# Training parameters\n", @@ -965,6 +1031,7 @@ " tokeniser=tokeniser,\n", " embedding=embedding,\n", " token_dimension=token_dimension,\n", + " window_size=window_size,\n", " mlp_num_hidden=num_mlp_hidden,\n", " mlp_num_layers=num_mlp_layers,\n", " num_heads=num_heads,\n", @@ -985,7 +1052,7 @@ }, { "cell_type": "code", - "execution_count": 251, + "execution_count": 10, "metadata": { "tags": [ "remove-input" @@ -995,77 +1062,7 @@ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "db81c9ae1ece4e4e8aae5015311cd800", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "0it [00:00, ?it/s]" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "5d994592c82746b3b7cbd91004ae8db8", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "0it [00:00, ?it/s]" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "414378247fee43b99303f0e3e9e63aa0", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "0it [00:00, ?it/s]" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "b3224cf642c84356961c7c5e8c28b182", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "0it [00:00, ?it/s]" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "31a1d95291db4bf685277244b6104a75", - "version_major": 2, - "version_minor": 0 - }, 
- "text/plain": [ - "0it [00:00, ?it/s]" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "fc638e5b9731407ca4281e40699e6641", + "model_id": "2393c2436bd34cbab881a81a56c670a7", "version_major": 2, "version_minor": 0 }, @@ -1079,7 +1076,7 @@ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "705b39fc26064483bf3a93e6a7046175", + "model_id": "a819c625819f48a9afb0c03c82ac24b0", "version_major": 2, "version_minor": 0 }, @@ -1093,7 +1090,7 @@ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "9cbec332b7bb4d799dc306fb05e17bbf", + "model_id": "6c95e78dc81f443cb81a2d8e7ec842fe", "version_major": 2, "version_minor": 0 }, @@ -1107,7 +1104,7 @@ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "0095bd4551014018808667bc26b395e0", + "model_id": "a35e59e3e62640018e2f74831b7fa75e", "version_major": 2, "version_minor": 0 }, @@ -1121,7 +1118,7 @@ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "77c391fd18ae463aba43a499f35e1e2d", + "model_id": "f2e8bf63d9f14e99ae7e2e6f5bf86708", "version_major": 2, "version_minor": 0 }, @@ -1136,7 +1133,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "Epoch 10: loss 0.089 (train 0.024), acc. 0.973 (train 0.972)\n" + "Epoch 5: loss 0.081 (train 0.045), acc. 
0.959 (train 0.956)\n" ] } ], @@ -1201,7 +1198,7 @@ }, { "cell_type": "code", - "execution_count": 253, + "execution_count": 12, "metadata": { "tags": [ "remove-input", @@ -1211,17 +1208,17 @@ "outputs": [ { "data": { - "application/pdf": "", + "application/pdf": "", "image/svg+xml": [ - " 2024-05-25T08:15:57.399962\n", + " 2024-12-17T15:45:32.764353\n", " image/svg+xml\n",
[Regenerated Matplotlib SVG figure output; the remaining hunks of this cell contain only path and glyph data.]
@@ -11499,10 +9268,11 @@ "metadata": {}, "source": [ "## Conclusion\n", - "We have looked at the details of the transformer architecture.\n", - "It consists of identical blocks, each of which contains a self-attention and a multi-layer perceptron operation, together with normalisation layers and residual connections.\n", - "Coupling these together with position embeddings and an appropriate tokenisation layer makes up the entire transformer architecture.\n", - "We looked at a specific example for computer vision, the ViT, and trained it on MNIST." + "We have looked at the details of the Swin transformer.\n", + "The Swin transformer amounts to modifying the attention operation in a regular transformer by adding windows.\n", + "All tokens within a window attend to one another and there is no attention across windows.\n", + "To allow for information to propagate across windows, Swin applies window shifting between consecutive transformer blocks.\n", + "We looked at a specific example for computer vision and trained it on MNIST." 
 ]
 },
 {
@@ -11533,7 +9303,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
- "version": "3.1.undefined"
+ "version": "3.10.14"
 }
 },
 "nbformat": 4,
diff --git a/book/mira/000-exercises.html b/book/mira/000-exercises.html
index 20ce1293..deb6d8ef 100644
--- a/book/mira/000-exercises.html
+++ b/book/mira/000-exercises.html
@@ -217,6 +217,7 @@

Papers & Miscellaneous