
parameteriz -> parametriz
astonzhang committed Aug 10, 2023
1 parent 11a22b6 commit f120119
Showing 18 changed files with 85 additions and 86 deletions.
@@ -523,7 +523,7 @@ kl_q2p, differ_percentage

If you are curious about applications of information theory in deep learning, here is a quick example. We define the true distribution $P$ with probability distribution $p(x)$, and the estimated distribution $Q$ with probability distribution $q(x)$, and we will use them in the rest of this section.

-Say we need to solve a binary classification problem based on given $n$ data examples {$x_1, \ldots, x_n$}. Assume that we encode $1$ and $0$ as the positive and negative class label $y_i$ respectively, and our neural network is parameterized by $\theta$. If we aim to find a best $\theta$ so that $\hat{y}_i= p_{\theta}(y_i \mid x_i)$, it is natural to apply the maximum log-likelihood approach as was seen in :numref:`sec_maximum_likelihood`. To be specific, for true labels $y_i$ and predictions $\hat{y}_i= p_{\theta}(y_i \mid x_i)$, the probability to be classified as positive is $\pi_i= p_{\theta}(y_i = 1 \mid x_i)$. Hence, the log-likelihood function would be
+Say we need to solve a binary classification problem based on given $n$ data examples {$x_1, \ldots, x_n$}. Assume that we encode $1$ and $0$ as the positive and negative class label $y_i$ respectively, and our neural network is parametrized by $\theta$. If we aim to find a best $\theta$ so that $\hat{y}_i= p_{\theta}(y_i \mid x_i)$, it is natural to apply the maximum log-likelihood approach as was seen in :numref:`sec_maximum_likelihood`. To be specific, for true labels $y_i$ and predictions $\hat{y}_i= p_{\theta}(y_i \mid x_i)$, the probability to be classified as positive is $\pi_i= p_{\theta}(y_i = 1 \mid x_i)$. Hence, the log-likelihood function would be

$$
\begin{aligned}
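
For reference, the log-likelihood being maximized here has the standard binary form $l(\theta) = \sum_{i=1}^n y_i \log \pi_i + (1 - y_i) \log (1 - \pi_i)$. Below is a minimal NumPy sketch of this quantity; the toy labels and probabilities are illustrative assumptions, not values from the text.

```python
import numpy as np

# Toy binary labels y_i and predicted positive-class probabilities pi_i.
y = np.array([1, 0, 1, 1])
pi = np.array([0.9, 0.2, 0.7, 0.6])

# l(theta) = sum_i y_i log(pi_i) + (1 - y_i) log(1 - pi_i)
log_likelihood = np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))
print(log_likelihood)  # maximizing this is minimizing binary cross-entropy
```
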
@@ -624,7 +624,7 @@ $$ y_{ij} = \begin{cases}1 & j \in J; \\ 0 &\text{otherwise.}\end{cases}$$
For instance, if a multi-class classification problem contains three classes $A$, $B$, and $C$, then the labels $\mathbf{y}_i$ can be encoded in {$A: (1, 0, 0); B: (0, 1, 0); C: (0, 0, 1)$}.


-Assume that our neural network is parameterized by $\theta$. For true label vectors $\mathbf{y}_i$ and predictions $$\hat{\mathbf{y}}_i= p_{\theta}(\mathbf{y}_i \mid \mathbf{x}_i) = \sum_{j=1}^k y_{ij} p_{\theta} (y_{ij} \mid \mathbf{x}_i).$$
+Assume that our neural network is parametrized by $\theta$. For true label vectors $\mathbf{y}_i$ and predictions $$\hat{\mathbf{y}}_i= p_{\theta}(\mathbf{y}_i \mid \mathbf{x}_i) = \sum_{j=1}^k y_{ij} p_{\theta} (y_{ij} \mid \mathbf{x}_i).$$

Hence, the *cross-entropy loss* would be

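
For one-hot labels as defined above, the cross-entropy loss reduces to the standard form $-\frac{1}{n}\sum_{i=1}^n \log p_{\theta}(y_i \mid \mathbf{x}_i)$, i.e., the negative log-probability assigned to each true class. A minimal NumPy sketch (the helper name and toy values are illustrative assumptions):

```python
import numpy as np

def cross_entropy(y_hat, y):
    # y_hat: (n, k) predicted class probabilities; y: (n,) integer labels.
    # With one-hot targets, the sum over classes picks out the probability
    # of the true class, so we can index it directly.
    return -np.log(y_hat[np.arange(len(y_hat)), y]).mean()

y = np.array([0, 2])                 # true classes of two examples
y_hat = np.array([[0.3, 0.6, 0.1],
                  [0.1, 0.2, 0.7]])  # predicted class distributions
print(cross_entropy(y_hat, y))       # -(log 0.3 + log 0.7) / 2 ≈ 0.78
```
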
@@ -472,7 +472,7 @@ $$
## Change of Variables in Multiple Integrals
As with single variables in :eqref:`eq_change_var`, the ability to change variables inside a higher dimensional integral is a key tool. Let's summarize the result without derivation.

-We need a function that reparameterizes our domain of integration. We can take this to be $\phi : \mathbb{R}^n \rightarrow \mathbb{R}^n$, that is any function which takes in $n$ real variables and returns another $n$. To keep the expressions clean, we will assume that $\phi$ is *injective* which is to say it never folds over itself ($\phi(\mathbf{x}) = \phi(\mathbf{y}) \implies \mathbf{x} = \mathbf{y}$).
+We need a function that reparametrizes our domain of integration. We can take this to be $\phi : \mathbb{R}^n \rightarrow \mathbb{R}^n$, that is any function which takes in $n$ real variables and returns another $n$. To keep the expressions clean, we will assume that $\phi$ is *injective* which is to say it never folds over itself ($\phi(\mathbf{x}) = \phi(\mathbf{y}) \implies \mathbf{x} = \mathbf{y}$).

In this case, we can say that

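
The result being summarized is the standard change-of-variables formula $\int_{\phi(U)} f(\mathbf{x})\,d\mathbf{x} = \int_{U} f(\phi(\mathbf{u})) \left|\det D\phi(\mathbf{u})\right|\,d\mathbf{u}$. As a quick numerical sanity check (grid sizes and the cutoff radius are arbitrary choices), integrating $e^{-(x^2+y^2)}$ over the plane in polar coordinates, where $|\det D\phi| = r$, recovers the exact value $\pi$:

```python
import numpy as np

# phi(r, t) = (r cos t, r sin t) reparametrizes the plane; |det Dphi| = r.
r = np.linspace(0.0, 10.0, 2001)         # radius grid (10 is "far enough")
t = np.linspace(0.0, 2.0 * np.pi, 2001)  # angle grid
R, _ = np.meshgrid(r, t)
integrand = np.exp(-R ** 2) * R          # f(phi(r, t)) * |det Dphi(r, t)|
dr, dt = r[1] - r[0], t[1] - t[0]
print(integrand.sum() * dr * dt)         # ≈ 3.1416, i.e., pi
```
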
@@ -115,7 +115,7 @@ than the simple weighted average can be expressed.
In our implementation,
we [**choose the scaled dot-product attention
for each head**] of the multi-head attention.
-To avoid significant growth of computational cost and parameterization cost,
+To avoid significant growth of computational cost and parametrization cost,
we set $p_q = p_k = p_v = p_o / h$.
Note that $h$ heads can be computed in parallel
if we set the number of outputs
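
A quick sketch of why this setting keeps the parametrization cost flat: with $p_q = p_k = p_v = p_o / h$, the total number of projection weights is independent of the number of heads $h$. The bias-free projections and the concrete dimensions below are assumptions for illustration only.

```python
def mha_projection_params(d_model, h, p_o):
    p = p_o // h                          # per-head width p_q = p_k = p_v
    per_head = 3 * d_model * p            # W_q, W_k, W_v for a single head
    return h * per_head + d_model * p_o   # all heads plus the output W_o

for h in (1, 2, 4, 8):
    print(h, mha_projection_params(d_model=256, h=h, p_o=256))
# Every h gives the same total: 262144 weights.
```
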
2 changes: 1 addition & 1 deletion chapter_computer-vision/ssd.md
@@ -92,7 +92,7 @@ are generated with
each spatial position of these feature maps as their center,
a total of $hwa$ anchor boxes need to be classified.
This often makes classification with fully connected layers infeasible due to likely
-heavy parameterization costs.
+heavy parametrization costs.
Recall how we used channels of
convolutional layers
to predict classes in :numref:`sec_nin`.
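
For a sense of the scale involved, here is a back-of-the-envelope comparison of a fully connected prediction head against the NiN-style convolutional one; every number below is a made-up illustrative assumption, not a value from the text:

```python
# Feature map of h x w positions, a anchor boxes per position, q classes.
h, w, a, q = 32, 32, 5, 20
num_anchors = h * w * a                        # 5120 boxes to classify
c = 256                                        # assumed input channels

# Fully connected: flatten h*w*c activations into (q+1) logits per box.
fc_params = (h * w * c) * (num_anchors * (q + 1))
# Convolutional: one 3x3 conv layer with a*(q+1) output channels.
conv_params = 3 * 3 * c * (a * (q + 1))

print(num_anchors, fc_params, conv_params)     # ~2.8e10 vs ~2.4e5 weights
```
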
49 changes: 25 additions & 24 deletions chapter_convolutional-neural-networks/conv-layer.md
@@ -90,14 +90,14 @@

Note that along each axis, the output size
is slightly smaller than the input size.
-Because the kernel has width and height greater than one,
+Because the kernel has width and height greater than $1$,
we can only properly compute the cross-correlation
for locations where the kernel fits wholly within the image:
-the output size is given by the input size $n_h \times n_w$
-minus the size of the convolution kernel $k_h \times k_w$
+the output size is given by the input size $n_\text{h} \times n_\text{w}$
+minus the size of the convolution kernel $k_\text{h} \times k_\text{w}$
via

-$$(n_h-k_h+1) \times (n_w-k_w+1).$$
+$$(n_\text{h}-k_\text{h}+1) \times (n_\text{w}-k_\text{w}+1).$$

This is the case since we need enough space
to "shift" the convolution kernel across the image.
@@ -242,11 +242,11 @@ class Conv2D(nn.Module):

In
$h \times w$ convolution
-or a $h \times w$ convolution kernel,
+or an $h \times w$ convolution kernel,
the height and width of the convolution kernel are $h$ and $w$, respectively.
We also refer to
-a convolutional layer with a $h \times w$
-convolution kernel simply as a $h \times w$ convolutional layer.
+a convolutional layer with an $h \times w$
+convolution kernel simply as an $h \times w$ convolutional layer.


## Object Edge Detection in Images
@@ -255,7 +255,7 @@ Let's take a moment to parse [**a simple application of a convolutional layer:
detecting the edge of an object in an image**]
by finding the location of the pixel change.
First, we construct an "image" of $6\times 8$ pixels.
-The middle four columns are black (0) and the rest are white (1).
+The middle four columns are black ($0$) and the rest are white ($1$).

```{.python .input}
%%tab mxnet, pytorch
@@ -281,8 +281,8 @@ X
Next, we construct a kernel `K` with a height of 1 and a width of 2.
When we perform the cross-correlation operation with the input,
if the horizontally adjacent elements are the same,
-the output is 0. Otherwise, the output is non-zero.
-Note that this kernel is special case of a finite difference operator. At location $(i,j)$ it computes $x_{i,j} - x_{(i+1),j}$, i.e., it computes the difference between the values of horizontally adjacent pixels. This is a discrete approximation of the first derivative in the horizontal direction. After all, for a function $f(i,j)$ its derivative $-\partial_i f(i,j) = \lim_{\epsilon \to 0} \frac{f(i,j) - f(i+\epsilon,j)}{\epsilon}$. Let's see how this works in practice.
+the output is 0. Otherwise, the output is nonzero.
+Note that this kernel is a special case of a finite difference operator. At location $(i,j)$ it computes $x_{i,j} - x_{(i+1),j}$, i.e., it computes the difference between the values of horizontally adjacent pixels. This is a discrete approximation of the first derivative in the horizontal direction. After all, for a function $f(i,j)$ its derivative $-\partial_i f(i,j) = \lim_{\epsilon \to 0} \frac{f(i,j) - f(i+\epsilon,j)}{\epsilon}$. Let's see how this works in practice.

```{.python .input}
%%tab all
@@ -291,9 +291,9 @@ K = d2l.tensor([[1.0, -1.0]])

We are ready to perform the cross-correlation operation
with arguments `X` (our input) and `K` (our kernel).
-As you can see, [**we detect 1 for the edge from white to black
-and -1 for the edge from black to white.**]
-All other outputs take value 0.
+As you can see, [**we detect $1$ for the edge from white to black
+and $-1$ for the edge from black to white.**]
+All other outputs take value $0$.

```{.python .input}
%%tab all
@@ -478,9 +478,9 @@ perform
either the strict convolution operations
or the cross-correlation operations.

-To illustrate this, suppose that a convolutional layer performs *cross-correlation* and learns the kernel in :numref:`fig_correlation`, which is denoted as the matrix $\mathbf{K}$ here.
+To illustrate this, suppose that a convolutional layer performs *cross-correlation* and learns the kernel in :numref:`fig_correlation`, which is here denoted as the matrix $\mathbf{K}$.
Assuming that other conditions remain unchanged,
-when this layer performs strict *convolution* instead,
+when this layer performs strict *convolution*,
the learned kernel $\mathbf{K}'$ will be the same as $\mathbf{K}$
after $\mathbf{K}'$ is
flipped both horizontally and vertically.
@@ -493,10 +493,10 @@ the same output in :numref:`fig_correlation`
(cross-correlation of the input and $\mathbf{K}$)
will be obtained.
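
This flipping argument is easy to check numerically, reusing the `corr2d` sketch above (the `conv2d_strict` helper is hypothetical, introduced just for this check):

```python
import torch

def conv2d_strict(X, K):
    # Strict convolution: flip the kernel along both axes, then cross-correlate.
    return corr2d(X, torch.flip(K, dims=(0, 1)))

X = torch.rand(4, 5)
K = torch.tensor([[1.0, 2.0], [3.0, 4.0]])  # kernel learned via cross-correlation
K_prime = torch.flip(K, dims=(0, 1))        # the same kernel, doubly flipped
# Strict convolution with K' reproduces cross-correlation with K:
print(torch.allclose(corr2d(X, K), conv2d_strict(X, K_prime)))  # True
```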

-In keeping with standard terminology with deep learning literature,
+In keeping with standard terminology in deep learning literature,
we will continue to refer to the cross-correlation operation
as a convolution even though, strictly speaking, it is slightly different.
-Besides,
+Furthermore,
we use the term *element* to refer to
an entry (or component) of any tensor representing a layer representation or a convolution kernel.

@@ -543,8 +543,10 @@ needs a larger receptive field
to detect input features over a broader area,
we can build a deeper network.

-Receptive fields derive their name from neurophysiology. In a series of experiments :cite:`Hubel.Wiesel.1959,Hubel.Wiesel.1962,Hubel.Wiesel.1968` on a range of animals
-and different stimuli, Hubel and Wiesel explored the response of what is called the visual
+
+Receptive fields derive their name from neurophysiology.
+A series of experiments on a range of animals using different stimuli
+:cite:`Hubel.Wiesel.1959,Hubel.Wiesel.1962,Hubel.Wiesel.1968` explored the response of what is called the visual
cortex to said stimuli. By and large, they found that lower levels respond to edges and related
shapes. Later on, :citet:`Field.1987` illustrated this effect on natural
images with, what can only be called, convolutional kernels.
@@ -553,14 +555,13 @@ We reprint a key figure in :numref:`field_visual` to illustrate the striking sim
![Figure and caption taken from :citet:`Field.1987`: An example of coding with six different channels. (Left) Examples of the six types of sensor associated with each channel. (Right) Convolution of the image in (Middle) with the six sensors shown in (Left). The response of the individual sensors is determined by sampling these filtered images at a distance proportional to the size of the sensor (shown with dots). This diagram shows the response of only the even symmetric sensors.](../img/field-visual.png)
:label:`field_visual`

-As it turns out, this relation even holds for the features computed by deeper layers of networks trained on image classification tasks, as demonstrated e.g.,
-in :citet:`Kuzovkin.Vicente.Petton.ea.2018`. Suffice it to say, convolutions have proven to be an incredibly powerful tool for computer vision, both in biology and in code. As such, it is not surprising (in hindsight) that they heralded the recent success in deep learning.
+As it turns out, this relation even holds for the features computed by deeper layers of networks trained on image classification tasks, as demonstrated in, for example, :citet:`Kuzovkin.Vicente.Petton.ea.2018`. Suffice it to say, convolutions have proven to be an incredibly powerful tool for computer vision, both in biology and in code. As such, it is not surprising (in hindsight) that they heralded the recent success in deep learning.

## Summary

-The core computation required for a convolutional layer is a cross-correlation operation. We saw that a simple nested for-loop is all that is required to compute its value. If we have multiple input and multiple output channels, we are performing a matrix-matrix operation between channels. As can be seen, the computation is straightforward and, most importantly, highly *local*. This affords significant hardware optimization and many recent results in computer vision are only possible due to that. After all, it means that chip designers can invest into fast computation rather than memory, when it comes to optimizing for convolutions. While this may not lead to optimal designs for other applications, it opens the door to ubiquitous and affordable computer vision.
+The core computation required for a convolutional layer is a cross-correlation operation. We saw that a simple nested for-loop is all that is required to compute its value. If we have multiple input and multiple output channels, we are performing a matrix--matrix operation between channels. As can be seen, the computation is straightforward and, most importantly, highly *local*. This affords significant hardware optimization and many recent results in computer vision are only possible because of that. After all, it means that chip designers can invest in fast computation rather than memory when it comes to optimizing for convolutions. While this may not lead to optimal designs for other applications, it does open the door to ubiquitous and affordable computer vision.

-In terms of convolutions themselves, they can be used for many purposes such as to detect edges and lines, to blur images, or to sharpen them. Most importantly, it is not necessary that the statistician (or engineer) invents suitable filters. Instead, we can simply *learn* them from data. This replaces feature engineering heuristics by evidence-based statistics. Lastly, and quite delightfully, these filters are not just advantageous for building deep networks but they also correspond to receptive fields and feature maps in the brain. This gives us confidence that we are on the right track.
+In terms of convolutions themselves, they can be used for many purposes, for example detecting edges and lines, blurring images, or sharpening them. Most importantly, it is not necessary that the statistician (or engineer) invents suitable filters. Instead, we can simply *learn* them from data. This replaces feature engineering heuristics by evidence-based statistics. Lastly, and quite delightfully, these filters are not just advantageous for building deep networks but they also correspond to receptive fields and feature maps in the brain. This gives us confidence that we are on the right track.

## Exercises

@@ -572,7 +573,7 @@ In terms of convolutions themselves, they can be used for many purposes such as
1. Given a directional vector $\mathbf{v} = (v_1, v_2)$, derive an edge-detection kernel that detects
edges orthogonal to $\mathbf{v}$, i.e., edges in the direction $(v_2, -v_1)$.
1. Derive a finite difference operator for the second derivative. What is the minimum
-size of the convolutional kernel associate with it? Which structures in images respond most strongly to it?
+size of the convolutional kernel associated with it? Which structures in images respond most strongly to it?
1. How would you design a blur kernel? Why might you want to use such a kernel?
1. What is the minimum size of a kernel to obtain a derivative of order $d$?
1. When you try to automatically find the gradient for the `Conv2D` class we created, what kind of error message do you see?
7 changes: 3 additions & 4 deletions chapter_convolutional-neural-networks/index.md
@@ -1,19 +1,18 @@
# Convolutional Neural Networks
:label:`chap_cnn`

-Image data is represented as a two-dimensional grid of pixels, be it
+Image data is represented as a two-dimensional grid of pixels, be the image
monochromatic or in color. Accordingly, each pixel corresponds to one
or multiple numerical values respectively. So far we ignored this rich
-structure and treated them as vectors of numbers by *flattening* the
-images, irrespective of the spatial relation between pixels. This
+structure and treated images as vectors of numbers by *flattening* them, irrespective of the spatial relation between pixels. This
deeply unsatisfying approach was necessary in order to feed the
resulting one-dimensional vectors through a fully connected MLP.

Because these networks are invariant to the order of the features, we
could get similar results regardless of whether we preserve an order
corresponding to the spatial structure of the pixels or if we permute
the columns of our design matrix before fitting the MLP's parameters.
-Preferably, we would leverage our prior knowledge that nearby pixels
+Ideally, we would leverage our prior knowledge that nearby pixels
are typically related to each other, to build efficient models for
learning from image data.
