diff --git a/chapter_appendix-mathematics-for-deep-learning/information-theory.md b/chapter_appendix-mathematics-for-deep-learning/information-theory.md index 913afe9b44..6185e01e0f 100644 --- a/chapter_appendix-mathematics-for-deep-learning/information-theory.md +++ b/chapter_appendix-mathematics-for-deep-learning/information-theory.md @@ -523,7 +523,7 @@ kl_q2p, differ_percentage If you are curious about applications of information theory in deep learning, here is a quick example. We define the true distribution $P$ with probability distribution $p(x)$, and the estimated distribution $Q$ with probability distribution $q(x)$, and we will use them in the rest of this section. -Say we need to solve a binary classification problem based on given $n$ data examples {$x_1, \ldots, x_n$}. Assume that we encode $1$ and $0$ as the positive and negative class label $y_i$ respectively, and our neural network is parameterized by $\theta$. If we aim to find a best $\theta$ so that $\hat{y}_i= p_{\theta}(y_i \mid x_i)$, it is natural to apply the maximum log-likelihood approach as was seen in :numref:`sec_maximum_likelihood`. To be specific, for true labels $y_i$ and predictions $\hat{y}_i= p_{\theta}(y_i \mid x_i)$, the probability to be classified as positive is $\pi_i= p_{\theta}(y_i = 1 \mid x_i)$. Hence, the log-likelihood function would be +Say we need to solve a binary classification problem based on given $n$ data examples {$x_1, \ldots, x_n$}. Assume that we encode $1$ and $0$ as the positive and negative class label $y_i$ respectively, and our neural network is parametrized by $\theta$. If we aim to find a best $\theta$ so that $\hat{y}_i= p_{\theta}(y_i \mid x_i)$, it is natural to apply the maximum log-likelihood approach as was seen in :numref:`sec_maximum_likelihood`. To be specific, for true labels $y_i$ and predictions $\hat{y}_i= p_{\theta}(y_i \mid x_i)$, the probability to be classified as positive is $\pi_i= p_{\theta}(y_i = 1 \mid x_i)$. Hence, the log-likelihood function would be $$ \begin{aligned} @@ -624,7 +624,7 @@ $$ y_{ij} = \begin{cases}1 & j \in J; \\ 0 &\text{otherwise.}\end{cases}$$ For instance, if a multi-class classification problem contains three classes $A$, $B$, and $C$, then the labels $\mathbf{y}_i$ can be encoded in {$A: (1, 0, 0); B: (0, 1, 0); C: (0, 0, 1)$}. -Assume that our neural network is parameterized by $\theta$. For true label vectors $\mathbf{y}_i$ and predictions $$\hat{\mathbf{y}}_i= p_{\theta}(\mathbf{y}_i \mid \mathbf{x}_i) = \sum_{j=1}^k y_{ij} p_{\theta} (y_{ij} \mid \mathbf{x}_i).$$ +Assume that our neural network is parametrized by $\theta$. For true label vectors $\mathbf{y}_i$ and predictions $$\hat{\mathbf{y}}_i= p_{\theta}(\mathbf{y}_i \mid \mathbf{x}_i) = \sum_{j=1}^k y_{ij} p_{\theta} (y_{ij} \mid \mathbf{x}_i).$$ Hence, the *cross-entropy loss* would be diff --git a/chapter_appendix-mathematics-for-deep-learning/integral-calculus.md b/chapter_appendix-mathematics-for-deep-learning/integral-calculus.md index 9f1f5dc576..1a127f44cd 100644 --- a/chapter_appendix-mathematics-for-deep-learning/integral-calculus.md +++ b/chapter_appendix-mathematics-for-deep-learning/integral-calculus.md @@ -472,7 +472,7 @@ $$ ## Change of Variables in Multiple Integrals As with single variables in :eqref:`eq_change_var`, the ability to change variables inside a higher dimensional integral is a key tool. Let's summarize the result without derivation. -We need a function that reparameterizes our domain of integration. 
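To make the cross-entropy discussion in the information-theory hunk above concrete: a quick numerical sketch (toy labels and probabilities of my own choosing, not from the text) showing that the averaged negative log-likelihood from the formula above coincides with the binary cross-entropy loss as computed by a framework.

```python
import numpy as np
import torch
import torch.nn.functional as F

# Toy labels y_i and predicted positive-class probabilities pi_i (made up).
y = np.array([1., 0., 1., 1., 0.])
pi = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Log-likelihood l(theta) = sum_i [ y_i log(pi_i) + (1 - y_i) log(1 - pi_i) ].
log_likelihood = np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))

# Cross-entropy is the negative log-likelihood averaged over the n examples,
# so maximizing the likelihood and minimizing cross-entropy coincide.
manual_ce = -log_likelihood / len(y)
framework_ce = F.binary_cross_entropy(torch.tensor(pi), torch.tensor(y))
assert np.isclose(manual_ce, framework_ce.item())
print(manual_ce)  # approximately 0.26
```

The agreement is exact because cross-entropy is, by definition, the negative log-likelihood averaged over the $n$ examples.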
We can take this to be $\phi : \mathbb{R}^n \rightarrow \mathbb{R}^n$, that is any function which takes in $n$ real variables and returns another $n$. To keep the expressions clean, we will assume that $\phi$ is *injective* which is to say it never folds over itself ($\phi(\mathbf{x}) = \phi(\mathbf{y}) \implies \mathbf{x} = \mathbf{y}$). +We need a function that reparametrizes our domain of integration. We can take this to be $\phi : \mathbb{R}^n \rightarrow \mathbb{R}^n$, that is any function which takes in $n$ real variables and returns another $n$. To keep the expressions clean, we will assume that $\phi$ is *injective* which is to say it never folds over itself ($\phi(\mathbf{x}) = \phi(\mathbf{y}) \implies \mathbf{x} = \mathbf{y}$). In this case, we can say that diff --git a/chapter_attention-mechanisms-and-transformers/multihead-attention.md b/chapter_attention-mechanisms-and-transformers/multihead-attention.md index 9279e0bd33..754bbd0870 100644 --- a/chapter_attention-mechanisms-and-transformers/multihead-attention.md +++ b/chapter_attention-mechanisms-and-transformers/multihead-attention.md @@ -115,7 +115,7 @@ than the simple weighted average can be expressed. In our implementation, we [**choose the scaled dot-product attention for each head**] of the multi-head attention. -To avoid significant growth of computational cost and parameterization cost, +To avoid significant growth of computational cost and parametrization cost, we set $p_q = p_k = p_v = p_o / h$. Note that $h$ heads can be computed in parallel if we set the number of outputs diff --git a/chapter_computer-vision/ssd.md b/chapter_computer-vision/ssd.md index cb219adeeb..5f262c7901 100644 --- a/chapter_computer-vision/ssd.md +++ b/chapter_computer-vision/ssd.md @@ -92,7 +92,7 @@ are generated with each spatial position of these feature maps as their center, a total of $hwa$ anchor boxes need to be classified. This often makes classification with fully connected layers infeasible due to likely -heavy parameterization costs. +heavy parametrization costs. Recall how we used channels of convolutional layers to predict classes in :numref:`sec_nin`. diff --git a/chapter_convolutional-neural-networks/conv-layer.md b/chapter_convolutional-neural-networks/conv-layer.md index 0dfc1b25e9..7f58a96bc1 100644 --- a/chapter_convolutional-neural-networks/conv-layer.md +++ b/chapter_convolutional-neural-networks/conv-layer.md @@ -90,14 +90,14 @@ $$ Note that along each axis, the output size is slightly smaller than the input size. -Because the kernel has width and height greater than one, +Because the kernel has width and height greater than $1$, we can only properly compute the cross-correlation for locations where the kernel fits wholly within the image, -the output size is given by the input size $n_h \times n_w$ -minus the size of the convolution kernel $k_h \times k_w$ +the output size is given by the input size $n_\text{h} \times n_\text{w}$ +minus the size of the convolution kernel $k_\text{h} \times k_\text{w}$ via -$$(n_h-k_h+1) \times (n_w-k_w+1).$$ +$$(n_\text{h}-k_\text{h}+1) \times (n_\text{w}-k_\text{w}+1).$$ This is the case since we need enough space to "shift" the convolution kernel across the image. @@ -242,11 +242,11 @@ class Conv2D(nn.Module): In $h \times w$ convolution -or a $h \times w$ convolution kernel, +or an $h \times w$ convolution kernel, the height and width of the convolution kernel are $h$ and $w$, respectively. 
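As a quick illustration of the output-size formula $(n_\text{h}-k_\text{h}+1) \times (n_\text{w}-k_\text{w}+1)$ in the conv-layer hunk above, here is a minimal NumPy sketch of two-dimensional cross-correlation (it mirrors, rather than reproduces, the section's own implementation; the names and example sizes are mine):

```python
import numpy as np

def corr2d(X, K):
    """Plain 2D cross-correlation: slide the kernel over every valid position."""
    n_h, n_w = X.shape
    k_h, k_w = K.shape
    Y = np.zeros((n_h - k_h + 1, n_w - k_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + k_h, j:j + k_w] * K).sum()
    return Y

X = np.arange(12.0).reshape(3, 4)   # a 3x4 input
K = np.ones((2, 2))                 # a 2x2 kernel
# Output shape is (n_h - k_h + 1) x (n_w - k_w + 1) = (2, 3).
print(corr2d(X, K).shape)  # (2, 3)
```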
We also refer to -a convolutional layer with a $h \times w$ -convolution kernel simply as a $h \times w$ convolutional layer. +a convolutional layer with an $h \times w$ +convolution kernel simply as an $h \times w$ convolutional layer. ## Object Edge Detection in Images @@ -255,7 +255,7 @@ Let's take a moment to parse [**a simple application of a convolutional layer: detecting the edge of an object in an image**] by finding the location of the pixel change. First, we construct an "image" of $6\times 8$ pixels. -The middle four columns are black (0) and the rest are white (1). +The middle four columns are black ($0$) and the rest are white ($1$). ```{.python .input} %%tab mxnet, pytorch @@ -281,8 +281,8 @@ X Next, we construct a kernel `K` with a height of 1 and a width of 2. When we perform the cross-correlation operation with the input, if the horizontally adjacent elements are the same, -the output is 0. Otherwise, the output is non-zero. -Note that this kernel is special case of a finite difference operator. At location $(i,j)$ it computes $x_{i,j} - x_{(i+1),j}$, i.e., it computes the difference between the values of horizontally adjacent pixels. This is a discrete approximation of the first derivative in the horizontal direction. After all, for a function $f(i,j)$ its derivative $-\partial_i f(i,j) = \lim_{\epsilon \to 0} \frac{f(i,j) - f(i+\epsilon,j)}{\epsilon}$. Let's see how this works in practice. +the output is 0. Otherwise, the output is nonzero. +Note that this kernel is a special case of a finite difference operator. At location $(i,j)$ it computes $x_{i,j} - x_{(i+1),j}$, i.e., it computes the difference between the values of horizontally adjacent pixels. This is a discrete approximation of the first derivative in the horizontal direction. After all, for a function $f(i,j)$ its derivative $-\partial_i f(i,j) = \lim_{\epsilon \to 0} \frac{f(i,j) - f(i+\epsilon,j)}{\epsilon}$. Let's see how this works in practice. ```{.python .input} %%tab all @@ -291,9 +291,9 @@ K = d2l.tensor([[1.0, -1.0]]) We are ready to perform the cross-correlation operation with arguments `X` (our input) and `K` (our kernel). -As you can see, [**we detect 1 for the edge from white to black -and -1 for the edge from black to white.**] -All other outputs take value 0. +As you can see, [**we detect $1$ for the edge from white to black +and $-1$ for the edge from black to white.**] +All other outputs take value $0$. ```{.python .input} %%tab all @@ -478,9 +478,9 @@ perform either the strict convolution operations or the cross-correlation operations. -To illustrate this, suppose that a convolutional layer performs *cross-correlation* and learns the kernel in :numref:`fig_correlation`, which is denoted as the matrix $\mathbf{K}$ here. +To illustrate this, suppose that a convolutional layer performs *cross-correlation* and learns the kernel in :numref:`fig_correlation`, which is here denoted as the matrix $\mathbf{K}$. Assuming that other conditions remain unchanged, -when this layer performs strict *convolution* instead, +when this layer performs strict *convolution*, the learned kernel $\mathbf{K}'$ will be the same as $\mathbf{K}$ after $\mathbf{K}'$ is flipped both horizontally and vertically. @@ -493,10 +493,10 @@ the same output in :numref:`fig_correlation` (cross-correlation of the input and $\mathbf{K}$) will be obtained. 
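The claim above, that strict convolution with the doubly flipped kernel $\mathbf{K}'$ reproduces the cross-correlation output obtained with $\mathbf{K}$, is easy to check numerically. A small sketch using SciPy (my choice of library and of random shapes):

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 6))          # an arbitrary input "image"
K = rng.normal(size=(2, 3))          # an arbitrary kernel learned via cross-correlation

# Strict convolution with the kernel flipped horizontally and vertically ...
K_flipped = K[::-1, ::-1]
out_conv = convolve2d(X, K_flipped, mode='valid')

# ... matches cross-correlation with the original kernel K.
out_corr = correlate2d(X, K, mode='valid')
assert np.allclose(out_conv, out_corr)
print(out_conv.shape)  # (4, 4): (5 - 2 + 1) x (6 - 3 + 1)
```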
-In keeping with standard terminology with deep learning literature, +In keeping with standard terminology in deep learning literature, we will continue to refer to the cross-correlation operation as a convolution even though, strictly-speaking, it is slightly different. -Besides, +Furthermore, we use the term *element* to refer to an entry (or component) of any tensor representing a layer representation or a convolution kernel. @@ -543,8 +543,10 @@ needs a larger receptive field to detect input features over a broader area, we can build a deeper network. -Receptive fields derive their name from neurophysiology. In a series of experiments :cite:`Hubel.Wiesel.1959,Hubel.Wiesel.1962,Hubel.Wiesel.1968` on a range of animals -and different stimuli, Hubel and Wiesel explored the response of what is called the visual + +Receptive fields derive their name from neurophysiology. +A series of experiments on a range of animals using different stimuli +:cite:`Hubel.Wiesel.1959,Hubel.Wiesel.1962,Hubel.Wiesel.1968` explored the response of what is called the visual cortex on said stimuli. By and large they found that lower levels respond to edges and related shapes. Later on, :citet:`Field.1987` illustrated this effect on natural images with, what can only be called, convolutional kernels. @@ -553,14 +555,13 @@ We reprint a key figure in :numref:`field_visual` to illustrate the striking sim ![Figure and caption taken from :citet:`Field.1987`: An example of coding with six different channels. (Left) Examples of the six types of sensor associated with each channel. (Right) Convolution of the image in (Middle) with the six sensors shown in (Left). The response of the individual sensors is determined by sampling these filtered images at a distance proportional to the size of the sensor (shown with dots). This diagram shows the response of only the even symmetric sensors.](../img/field-visual.png) :label:`field_visual` -As it turns out, this relation even holds for the features computed by deeper layers of networks trained on image classification tasks, as demonstrated e.g., -in :citet:`Kuzovkin.Vicente.Petton.ea.2018`. Suffice it to say, convolutions have proven to be an incredibly powerful tool for computer vision, both in biology and in code. As such, it is not surprising (in hindsight) that they heralded the recent success in deep learning. +As it turns out, this relation even holds for the features computed by deeper layers of networks trained on image classification tasks, as demonstrated in, for example, :citet:`Kuzovkin.Vicente.Petton.ea.2018`. Suffice it to say, convolutions have proven to be an incredibly powerful tool for computer vision, both in biology and in code. As such, it is not surprising (in hindsight) that they heralded the recent success in deep learning. ## Summary -The core computation required for a convolutional layer is a cross-correlation operation. We saw that a simple nested for-loop is all that is required to compute its value. If we have multiple input and multiple output channels, we are performing a matrix-matrix operation between channels. As can be seen, the computation is straightforward and, most importantly, highly *local*. This affords significant hardware optimization and many recent results in computer vision are only possible due to that. After all, it means that chip designers can invest into fast computation rather than memory, when it comes to optimizing for convolutions. 
While this may not lead to optimal designs for other applications, it opens the door to ubiquitous and affordable computer vision. +The core computation required for a convolutional layer is a cross-correlation operation. We saw that a simple nested for-loop is all that is required to compute its value. If we have multiple input and multiple output channels, we are performing a matrix--matrix operation between channels. As can be seen, the computation is straightforward and, most importantly, highly *local*. This affords significant hardware optimization and many recent results in computer vision are only possible because of that. After all, it means that chip designers can invest in fast computation rather than memory when it comes to optimizing for convolutions. While this may not lead to optimal designs for other applications, it does open the door to ubiquitous and affordable computer vision. -In terms of convolutions themselves, they can be used for many purposes such as to detect edges and lines, to blur images, or to sharpen them. Most importantly, it is not necessary that the statistician (or engineer) invents suitable filters. Instead, we can simply *learn* them from data. This replaces feature engineering heuristics by evidence-based statistics. Lastly, and quite delightfully, these filters are not just advantageous for building deep networks but they also correspond to receptive fields and feature maps in the brain. This gives us confidence that we are on the right track. +In terms of convolutions themselves, they can be used for many purposes, for example detecting edges and lines, blurring images, or sharpening them. Most importantly, it is not necessary that the statistician (or engineer) invents suitable filters. Instead, we can simply *learn* them from data. This replaces feature engineering heuristics by evidence-based statistics. Lastly, and quite delightfully, these filters are not just advantageous for building deep networks but they also correspond to receptive fields and feature maps in the brain. This gives us confidence that we are on the right track. ## Exercises @@ -572,7 +573,7 @@ In terms of convolutions themselves, they can be used for many purposes such as 1. Given a directional vector $\mathbf{v} = (v_1, v_2)$, derive an edge-detection kernel that detects edges orthogonal to $\mathbf{v}$, i.e., edges in the direction $(v_2, -v_1)$. 1. Derive a finite difference operator for the second derivative. What is the minimum - size of the convolutional kernel associate with it? Which structures in images respond most strongly to it? + size of the convolutional kernel associated with it? Which structures in images respond most strongly to it? 1. How would you design a blur kernel? Why might you want to use such a kernel? 1. What is the minimum size of a kernel to obtain a derivative of order $d$? 1. When you try to automatically find the gradient for the `Conv2D` class we created, what kind of error message do you see? diff --git a/chapter_convolutional-neural-networks/index.md b/chapter_convolutional-neural-networks/index.md index af158831b8..ce22361eb9 100644 --- a/chapter_convolutional-neural-networks/index.md +++ b/chapter_convolutional-neural-networks/index.md @@ -1,11 +1,10 @@ # Convolutional Neural Networks :label:`chap_cnn` -Image data is represented as a two-dimensional grid of pixels, be it +Image data is represented as a two-dimensional grid of pixels, be the image monochromatic or in color. 
Accordingly each pixel corresponds to one or multiple numerical values respectively. So far we ignored this rich -structure and treated them as vectors of numbers by *flattening* the -images, irrespective of the spatial relation between pixels. This +structure and treated images as vectors of numbers by *flattening* them, irrespective of the spatial relation between pixels. This deeply unsatisfying approach was necessary in order to feed the resulting one-dimensional vectors through a fully connected MLP. @@ -13,7 +12,7 @@ Because these networks are invariant to the order of the features, we could get similar results regardless of whether we preserve an order corresponding to the spatial structure of the pixels or if we permute the columns of our design matrix before fitting the MLP's parameters. -Preferably, we would leverage our prior knowledge that nearby pixels +Ideally, we would leverage our prior knowledge that nearby pixels are typically related to each other, to build efficient models for learning from image data. diff --git a/chapter_convolutional-neural-networks/padding-and-strides.md b/chapter_convolutional-neural-networks/padding-and-strides.md index 3ebe68b93e..6fcca6699c 100644 --- a/chapter_convolutional-neural-networks/padding-and-strides.md +++ b/chapter_convolutional-neural-networks/padding-and-strides.md @@ -10,9 +10,9 @@ Recall the example of a convolution in :numref:`fig_correlation`. The input had both a height and width of 3 and the convolution kernel had both a height and width of 2, yielding an output representation with dimension $2\times2$. -Assuming that the input shape is $n_h\times n_w$ -and the convolution kernel shape is $k_h\times k_w$, -the output shape will be $(n_h-k_h+1) \times (n_w-k_w+1)$: +Assuming that the input shape is $n_\text{h}\times n_\text{w}$ +and the convolution kernel shape is $k_\text{h}\times k_\text{w}$, +the output shape will be $(n_\text{h}-k_\text{h}+1) \times (n_\text{w}-k_\text{w}+1)$: we can only shift the convolution kernel so far until it runs out of pixels to apply the convolution to. @@ -25,7 +25,7 @@ after applying many successive convolutions, we tend to wind up with outputs that are considerably smaller than our input. If we start with a $240 \times 240$ pixel image, -$10$ layers of $5 \times 5$ convolutions +ten layers of $5 \times 5$ convolutions reduce the image to $200 \times 200$ pixels, slicing off $30 \%$ of the image and with it obliterating any interesting information @@ -70,8 +70,8 @@ is that we tend to lose pixels on the perimeter of our image. Consider :numref:` :label:`img_conv_reuse` Since we typically use small kernels, -for any given convolution, -we might only lose a few pixels, +for any given convolution +we might only lose a few pixels but this can add up as we apply many successive convolutional layers. 
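A one-line sanity check of the shrinkage arithmetic in the padding discussion above (ten unpadded $5 \times 5$ convolutions applied to a $240 \times 240$ input), as a small Python sketch:

```python
# Track how the spatial size shrinks as we stack valid (unpadded) convolutions.
def shrink(n, k, num_layers):
    for _ in range(num_layers):
        n = n - k + 1          # output size of one unpadded k x k convolution
    return n

# Ten 5x5 convolutions on a 240x240 image: each layer removes k - 1 = 4 pixels
# per dimension, so we end up with 240 - 10 * 4 = 200 pixels per side.
print(shrink(240, 5, 10))      # 200
```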
One straightforward solution to this problem @@ -86,26 +86,26 @@ The shaded portions are the first output element as well as the input and kernel ![Two-dimensional cross-correlation with padding.](../img/conv-pad.svg) :label:`img_conv_pad` -In general, if we add a total of $p_h$ rows of padding +In general, if we add a total of $p_\text{h}$ rows of padding (roughly half on top and half on bottom) -and a total of $p_w$ columns of padding +and a total of $p_\text{w}$ columns of padding (roughly half on the left and half on the right), the output shape will be -$$(n_h-k_h+p_h+1)\times(n_w-k_w+p_w+1).$$ +$$(n_\text{h}-k_\text{h}+p_\text{h}+1)\times(n_\text{w}-k_\text{w}+p_\text{w}+1).$$ This means that the height and width of the output -will increase by $p_h$ and $p_w$, respectively. +will increase by $p_\text{h}$ and $p_\text{w}$, respectively. -In many cases, we will want to set $p_h=k_h-1$ and $p_w=k_w-1$ +In many cases, we will want to set $p_\text{h}=k_\text{h}-1$ and $p_\text{w}=k_\text{w}-1$ to give the input and output the same height and width. This will make it easier to predict the output shape of each layer when constructing the network. -Assuming that $k_h$ is odd here, -we will pad $p_h/2$ rows on both sides of the height. -If $k_h$ is even, one possibility is to -pad $\lceil p_h/2\rceil$ rows on the top of the input -and $\lfloor p_h/2\rfloor$ rows on the bottom. +Assuming that $k_\text{h}$ is odd here, +we will pad $p_\text{h}/2$ rows on both sides of the height. +If $k_\text{h}$ is even, one possibility is to +pad $\lceil p_\text{h}/2\rceil$ rows on the top of the input +and $\lfloor p_\text{h}/2\rfloor$ rows on the bottom. We will pad both sides of the width in the same way. CNNs commonly use convolution kernels @@ -273,17 +273,17 @@ there is no output because the input element cannot fill the window ![Cross-correlation with strides of 3 and 2 for height and width, respectively.](../img/conv-stride.svg) :label:`img_conv_stride` -In general, when the stride for the height is $s_h$ -and the stride for the width is $s_w$, the output shape is +In general, when the stride for the height is $s_\text{h}$ +and the stride for the width is $s_\text{w}$, the output shape is -$$\lfloor(n_h-k_h+p_h+s_h)/s_h\rfloor \times \lfloor(n_w-k_w+p_w+s_w)/s_w\rfloor.$$ +$$\lfloor(n_\text{h}-k_\text{h}+p_\text{h}+s_\text{h})/s_\text{h}\rfloor \times \lfloor(n_\text{w}-k_\text{w}+p_\text{w}+s_\text{w})/s_\text{w}\rfloor.$$ -If we set $p_h=k_h-1$ and $p_w=k_w-1$, +If we set $p_\text{h}=k_\text{h}-1$ and $p_\text{w}=k_\text{w}-1$, then the output shape can be simplified to -$\lfloor(n_h+s_h-1)/s_h\rfloor \times \lfloor(n_w+s_w-1)/s_w\rfloor$. +$\lfloor(n_\text{h}+s_\text{h}-1)/s_\text{h}\rfloor \times \lfloor(n_\text{w}+s_\text{w}-1)/s_\text{w}\rfloor$. Going a step further, if the input height and width are divisible by the strides on the height and width, -then the output shape will be $(n_h/s_h) \times (n_w/s_w)$. +then the output shape will be $(n_\text{h}/s_\text{h}) \times (n_\text{w}/s_\text{w})$. Below, we [**set the strides on both the height and width to 2**], thus halving the input height and width. @@ -341,16 +341,16 @@ comp_conv2d(conv2d, X).shape ## Summary and Discussion -Padding can increase the height and width of the output. This is often used to give the output the same height and width as the input to avoid undesirable shrinkage of the output. Moreover, it ensures that all pixels are used equally frequently. 
Typically we pick symmetric padding on both sides of the input height and width. In this case we refer to $(p_h, p_w)$ padding. Most commonly we set $p_h = p_w$, in which case we simply state that we choose padding $p$. +Padding can increase the height and width of the output. This is often used to give the output the same height and width as the input to avoid undesirable shrinkage of the output. Moreover, it ensures that all pixels are used equally frequently. Typically we pick symmetric padding on both sides of the input height and width. In this case we refer to $(p_\text{h}, p_\text{w})$ padding. Most commonly we set $p_\text{h} = p_\text{w}$, in which case we simply state that we choose padding $p$. -A similar convention applies to strides. When horizontal stride $s_h$ and vertical stride $s_w$ match, we simply talk about stride $s$. The stride can reduce the resolution of the output, for example reducing the height and width of the output to only $1/n$ of the height and width of the input for $n > 1$. By default, the padding is 0 and the stride is 1. +A similar convention applies to strides. When horizontal stride $s_\text{h}$ and vertical stride $s_\text{w}$ match, we simply talk about stride $s$. The stride can reduce the resolution of the output, for example reducing the height and width of the output to only $1/n$ of the height and width of the input for $n > 1$. By default, the padding is 0 and the stride is 1. -So far all padding that we discussed simply extended images with zeros. This has significant computational benefit since it is trivial to accomplish. Moreover, operators can be engineered to take advantage of this padding implicitly without the need to allocate additional memory. At the same time, it allows CNNs to encode implicit position information within an image, simply by learning where the "whitespace" is. There are many alternatives to zero-padding. :citet:`Alsallakh.Kokhlikyan.Miglani.ea.2020` provided an extensive overview of alternatives (albeit without a clear case to use nonzero paddings unless artifacts occur). +So far all padding that we discussed simply extended images with zeros. This has significant computational benefit since it is trivial to accomplish. Moreover, operators can be engineered to take advantage of this padding implicitly without the need to allocate additional memory. At the same time, it allows CNNs to encode implicit position information within an image, simply by learning where the "whitespace" is. There are many alternatives to zero-padding. :citet:`Alsallakh.Kokhlikyan.Miglani.ea.2020` provided an extensive overview of those (albeit without a clear case for when to use nonzero paddings unless artifacts occur). ## Exercises -1. Given the last code example in this section with kernel size $(3, 5)$, padding $(0, 1)$, and stride $(3, 4)$, +1. Given the final code example in this section with kernel size $(3, 5)$, padding $(0, 1)$, and stride $(3, 4)$, calculate the output shape to check if it is consistent with the experimental result. 1. For audio signals, what does a stride of 2 correspond to? 1. Implement mirror padding, i.e., padding where the border values are simply mirrored to extend tensors. 
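As a companion to the padding and stride formulas above, here is a small helper (function name and example sizes are mine) that evaluates $\lfloor(n-k+p+s)/s\rfloor$ and cross-checks it against a framework convolution. Note that the formula's $p$ is the *total* padding along an axis, whereas PyTorch's `padding` argument is per side.

```python
import torch
from torch import nn

def out_size(n, k, p, s):
    """Output length along one axis: floor((n - k + p + s) / s),
    where p is the *total* padding added along that axis."""
    return (n - k + p + s) // s

# "Same" height/width: p = k - 1 with stride 1 keeps n unchanged.
assert out_size(8, 3, 3 - 1, 1) == 8
# Stride 2 with p = k - 1 roughly halves the input, as in the section.
assert out_size(8, 3, 3 - 1, 2) == 4

# Cross-check against PyTorch (padding=1 per side corresponds to p = 2 above).
conv2d = nn.Conv2d(1, 1, kernel_size=3, padding=1, stride=2)
X = torch.rand(1, 1, 8, 8)
assert conv2d(X).shape[-2:] == (out_size(8, 3, 2, 2), out_size(8, 3, 2, 2))
```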
diff --git a/chapter_convolutional-neural-networks/why-conv.md b/chapter_convolutional-neural-networks/why-conv.md index 066865d7d7..989b85617e 100644 --- a/chapter_convolutional-neural-networks/why-conv.md +++ b/chapter_convolutional-neural-networks/why-conv.md @@ -15,11 +15,11 @@ but we do not assume any structure *a priori* concerning how the features interact. Sometimes, we truly lack knowledge to guide -the construction of craftier architectures. +the construction of fancier architectures. In these cases, an MLP may be the best that we can do. However, for high-dimensional perceptual data, -such structure-less networks can grow unwieldy. +such structureless networks can grow unwieldy. For instance, let's return to our running example of distinguishing cats from dogs. @@ -116,8 +116,8 @@ with two-dimensional images $\mathbf{X}$ as inputs and their immediate hidden representations $\mathbf{H}$ similarly represented as matrices (they are two-dimensional tensors in code), where both $\mathbf{X}$ and $\mathbf{H}$ have the same shape. Let that sink in. -We now conceive of not only the inputs but -also the hidden representations as possessing spatial structure. +We now imagine that not only the inputs but +also the hidden representations possess spatial structure. Let $[\mathbf{X}]_{i, j}$ and $[\mathbf{H}]_{i, j}$ denote the pixel at location $(i,j)$ @@ -164,7 +164,7 @@ We are effectively weighting pixels at $(i+a, j+b)$ in the vicinity of location $(i, j)$ with coefficients $[\mathbf{V}]_{a, b}$ to obtain the value $[\mathbf{H}]_{i, j}$. Note that $[\mathbf{V}]_{a, b}$ needs many fewer coefficients than $[\mathsf{V}]_{i, j, a, b}$ since it -no longer depends on the location within the image. Consequently, the number of parameters required is no longer $10^{12}$ but a much more reasonable $4 \cdot 10^6$: we still have the dependency on $a, b \in (-1000, 1000)$. In short, we have made significant progress. Time-delay neural networks (TDNNs) are some of the first examples to exploit this idea :cite:`Waibel.Hanazawa.Hinton.ea.1989`. +no longer depends on the location within the image. Consequently, the number of parameters required is no longer $10^{12}$ but a much more reasonable $4 \times 10^6$: we still have the dependency on $a, b \in (-1000, 1000)$. In short, we have made significant progress. Time-delay neural networks (TDNNs) are some of the first examples to exploit this idea :cite:`Waibel.Hanazawa.Hinton.ea.1989`. ### Locality @@ -180,7 +180,7 @@ Equivalently, we can rewrite $[\mathbf{H}]_{i, j}$ as $$[\mathbf{H}]_{i, j} = u + \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} [\mathbf{V}]_{a, b} [\mathbf{X}]_{i+a, j+b}.$$ :eqlabel:`eq_conv-layer` -This reduces the number of parameters from $4 \cdot 10^6$ to $4 \Delta^2$, where $\Delta$ is typically smaller than $10$. As such, we reduced the number of parameters by another 4 orders of magnitude. Note that :eqref:`eq_conv-layer`, in a nutshell, is what is called a *convolutional layer*. +This reduces the number of parameters from $4 \times 10^6$ to $4 \Delta^2$, where $\Delta$ is typically smaller than $10$. As such, we reduced the number of parameters by another four orders of magnitude. Note that :eqref:`eq_conv-layer`, is what is called, in a nutshell, a *convolutional layer*. *Convolutional neural networks* (CNNs) are a special family of neural networks that contain convolutional layers. In the deep learning research community, @@ -258,7 +258,7 @@ we should find a peak in the hidden layer representations. 
There is just one problem with this approach. So far, we blissfully ignored that images consist -of 3 channels: red, green, and blue. +of three channels: red, green, and blue. In sum, images are not two-dimensional objects but rather third-order tensors, characterized by a height, width, and channel, @@ -282,7 +282,7 @@ a number of two-dimensional grids stacked on top of each other. As in the inputs, these are sometimes called *channels*. They are also sometimes called *feature maps*, as each provides a spatialized set -of learned features to the subsequent layer. +of learned features for the subsequent layer. Intuitively, you might imagine that at lower layers that are closer to inputs, some channels could become specialized to recognize edges while others could recognize textures. @@ -295,8 +295,9 @@ $$[\mathsf{H}]_{i,j,d} = \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} :eqlabel:`eq_conv-layer-channels` where $d$ indexes the output channels in the hidden representations $\mathsf{H}$. The subsequent convolutional layer will go on to take a third-order tensor, $\mathsf{H}$, as input. -Being more general, -:eqref:`eq_conv-layer-channels` is +We take +:eqref:`eq_conv-layer-channels`, +because of its generality, as the definition of a convolutional layer for multiple channels, where $\mathsf{V}$ is a kernel or filter of the layer. There are still many operations that we need to address. @@ -311,11 +312,11 @@ We turn to these issues in the remainder of the chapter. ## Summary and Discussion -In this section we derived the structure of convolutional neural networks from first principles. While it is unclear whether this is what led to the invention of CNNs, it is satisfying to know that they are the *right* choice when applying reasonable principles to how image processing and computer vision algorithms should operate, at least at lower levels. In particular, translation invariance in images implies that all patches of an image will be treated in the same manner. Locality means that only a small neighborhood of pixels will be used to compute the corresponding hidden representations. Some of the earliest references to CNNs are in the form of the Neocognitron :cite:`Fukushima.1982`. +In this section we derived the structure of convolutional neural networks from first principles. While it is unclear whether this was the route taken to the invention of CNNs, it is satisfying to know that they are the *right* choice when applying reasonable principles to how image processing and computer vision algorithms should operate, at least at lower levels. In particular, translation invariance in images implies that all patches of an image will be treated in the same manner. Locality means that only a small neighborhood of pixels will be used to compute the corresponding hidden representations. Some of the earliest references to CNNs are in the form of the Neocognitron :cite:`Fukushima.1982`. A second principle that we encountered in our reasoning is how to reduce the number of parameters in a function class without limiting its expressive power, at least, whenever certain assumptions on the model hold. We saw a dramatic reduction of complexity as a result of this restriction, turning computationally and statistically infeasible problems into tractable models. -Adding channels allowed us to bring back some of the complexity that was lost due to the restrictions imposed on the convolutional kernel by locality and translation invariance. 
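To make :eqref:`eq_conv-layer-channels` concrete, here is a brute-force sketch (names and sizes are mine) that evaluates the multi-channel convolutional layer with explicit loops and cross-checks it against a framework convolution, which likewise computes a cross-correlation:

```python
import numpy as np
import torch
import torch.nn.functional as F

def conv_layer_channels(X, V):
    """Direct-loop evaluation of [H]_{i,j,d} = sum_{a,b,c} [V]_{a,b,c,d} [X]_{i+a,j+b,c},
    with a, b in [-Delta, Delta]; output kept only where the window fits."""
    n_h, n_w, c_in = X.shape
    k, _, _, c_out = V.shape            # k = 2 * Delta + 1
    delta = (k - 1) // 2
    H = np.zeros((n_h - 2 * delta, n_w - 2 * delta, c_out))
    for i in range(delta, n_h - delta):
        for j in range(delta, n_w - delta):
            for d in range(c_out):
                for a in range(-delta, delta + 1):
                    for b in range(-delta, delta + 1):
                        for c in range(c_in):
                            H[i - delta, j - delta, d] += (
                                V[a + delta, b + delta, c, d] * X[i + a, j + b, c])
    return H

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 7, 3))          # height x width x input channels
V = rng.normal(size=(3, 3, 3, 4))       # (2*Delta+1) x (2*Delta+1) x c_in x c_out

H = conv_layer_channels(X, V)
# Cross-check against a framework convolution (also a cross-correlation):
ref = F.conv2d(torch.tensor(X).permute(2, 0, 1)[None],   # (1, c_in, h, w)
               torch.tensor(V).permute(3, 2, 0, 1))      # (c_out, c_in, k, k)
assert np.allclose(H, ref[0].permute(1, 2, 0).numpy())
print(H.shape)  # (4, 5, 4)
```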
Note that channels are quite a natural addition beyond red, green, and blue. Many satellite +Adding channels allowed us to bring back some of the complexity that was lost due to the restrictions imposed on the convolutional kernel by locality and translation invariance. Note that it is quite natural to add channels other than just red, green, and blue. Many satellite images, in particular for agriculture and meteorology, have tens to hundreds of channels, generating hyperspectral images instead. They report data on many different wavelengths. In the following we will see how to use convolutions effectively to manipulate the dimensionality of the images they operate on, how to move from location-based to channel-based representations and how to deal with large numbers of categories efficiently. @@ -332,9 +333,7 @@ generating hyperspectral images instead. They report data on many different wave 1. Why might translation invariance not be a good idea after all? Give an example. 1. Do you think that convolutional layers might also be applicable for text data? Which problems might you encounter with language? -1. What happens with convolutions when an object is at the boundary of an image. +1. What happens with convolutions when an object is at the boundary of an image? 1. Prove that the convolution is symmetric, i.e., $f * g = g * f$. -1. Prove the convolution theorem, i.e., $f * g = \mathcal{F}^{-1}\left[\mathcal{F}[f] \cdot \mathcal{F}[g]\right]$. - Can you use it to accelerate convolutions? [Discussions](https://discuss.d2l.ai/t/64) diff --git a/chapter_linear-classification/environment-and-distribution-shift.md b/chapter_linear-classification/environment-and-distribution-shift.md index 2907f8a3bf..9877861034 100644 --- a/chapter_linear-classification/environment-and-distribution-shift.md +++ b/chapter_linear-classification/environment-and-distribution-shift.md @@ -420,7 +420,7 @@ Then the probability in a mixed dataset is given by $$P(z=1 \mid \mathbf{x}) = \frac{p(\mathbf{x})}{p(\mathbf{x})+q(\mathbf{x})} \text{ and hence } \frac{P(z=1 \mid \mathbf{x})}{P(z=-1 \mid \mathbf{x})} = \frac{p(\mathbf{x})}{q(\mathbf{x})}.$$ Thus, if we use a logistic regression approach, -where $P(z=1 \mid \mathbf{x})=\frac{1}{1+\exp(-h(\mathbf{x}))}$ ($h$ is a parameterized function), +where $P(z=1 \mid \mathbf{x})=\frac{1}{1+\exp(-h(\mathbf{x}))}$ ($h$ is a parametrized function), it follows that $$ diff --git a/chapter_linear-classification/index.md b/chapter_linear-classification/index.md index d1fb1901cc..9a27837da7 100644 --- a/chapter_linear-classification/index.md +++ b/chapter_linear-classification/index.md @@ -10,7 +10,7 @@ generating output, calculating the loss, taking gradients with respect to weights, and updating the model. However, the precise form of the targets, -the parameterization of the output layer, +the parametrization of the output layer, and the choice of loss function will adapt to suit the *classification* setting. diff --git a/chapter_linear-classification/softmax-regression.md b/chapter_linear-classification/softmax-regression.md index 62380ec93f..dce449ef4e 100644 --- a/chapter_linear-classification/softmax-regression.md +++ b/chapter_linear-classification/softmax-regression.md @@ -490,7 +490,7 @@ and hopefully enough to whet your appetite, we hardly dived deep here. Among other things, we skipped over computational considerations. 
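Relating to the logistic-regression estimate of the density ratio $p(\mathbf{x})/q(\mathbf{x})$ in the distribution-shift hunk above: a sketch under simple assumptions (one-dimensional Gaussians of my own choosing for $p$ and $q$, scikit-learn's logistic regression as the parametrized $h$) showing that $\exp(h(\mathbf{x}))$ approximately recovers the true ratio.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

# Source distribution p and target distribution q (two Gaussians, my choice).
p_dist, q_dist = norm(loc=0.0, scale=1.0), norm(loc=1.0, scale=1.0)
x_p = p_dist.rvs(size=5000, random_state=1)
x_q = q_dist.rvs(size=5000, random_state=2)

# Mix the samples and label where each point came from: z = 1 for p, z = 0 for q.
X = np.concatenate([x_p, x_q]).reshape(-1, 1)
z = np.concatenate([np.ones_like(x_p), np.zeros_like(x_q)])

# h(x) is the classifier's logit, so exp(h(x)) estimates p(x) / q(x).
clf = LogisticRegression().fit(X, z)
x_test = np.array([[-1.0], [0.0], [1.0]])
est_ratio = np.exp(clf.decision_function(x_test))
true_ratio = p_dist.pdf(x_test.ravel()) / q_dist.pdf(x_test.ravel())
print(np.round(est_ratio, 2), np.round(true_ratio, 2))  # the two should be close
```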
Specifically, for any fully connected layer with $d$ inputs and $q$ outputs, -the parameterization and computational cost is $\mathcal{O}(dq)$, +the parametrization and computational cost is $\mathcal{O}(dq)$, which can be prohibitively high in practice. Fortunately, this cost of transforming $d$ inputs into $q$ outputs can be reduced through approximation and compression. diff --git a/chapter_linear-regression/index.md b/chapter_linear-regression/index.md index 6671275565..94192a92c0 100644 --- a/chapter_linear-regression/index.md +++ b/chapter_linear-regression/index.md @@ -7,7 +7,7 @@ for which the inputs connect directly to the outputs. This will prove important for a few reasons. First, rather than getting distracted by complicated architectures, we can focus on the basics of neural network training, -including parameterizing the output layer, handling data, +including parametrizing the output layer, handling data, specifying a loss function, and training the model. Second, this class of shallow networks happens to comprise the set of linear models, diff --git a/chapter_multilayer-perceptrons/generalization-deep.md b/chapter_multilayer-perceptrons/generalization-deep.md index 9394c6d16f..3256402b65 100644 --- a/chapter_multilayer-perceptrons/generalization-deep.md +++ b/chapter_multilayer-perceptrons/generalization-deep.md @@ -215,7 +215,7 @@ and the performance of the different predictors will depend on how compatible the assumptions are with the observed data. -In a sense, because neural networks are over-parameterized, +In a sense, because neural networks are over-parametrized, possessing many more parameters than are needed to fit the training data, they tend to *interpolate* the training data (fitting it perfectly) and thus behave, in some ways, more like nonparametric models. @@ -233,7 +233,7 @@ While current neural tangent kernel models may not fully explain the behavior of modern deep networks, their success as an analytical tool underscores the usefulness of nonparametric modeling -for understanding the behavior of over-parameterized deep networks. +for understanding the behavior of over-parametrized deep networks. ## Early Stopping @@ -337,7 +337,7 @@ remains similarly mysterious. Unlike classical linear models, which tend to have fewer parameters than examples, -deep networks tend to be over-parameterized, +deep networks tend to be over-parametrized, and for most tasks are capable of perfectly fitting the training set. This *interpolation regime* challenges diff --git a/chapter_multilayer-perceptrons/mlp.md b/chapter_multilayer-perceptrons/mlp.md index d2d931f63d..746124f8db 100644 --- a/chapter_multilayer-perceptrons/mlp.md +++ b/chapter_multilayer-perceptrons/mlp.md @@ -404,7 +404,7 @@ of vanishing gradients that plagued previous versions of neural networks (more on this later). Note that there are many variants to the ReLU function, -including the *parameterized ReLU* (*pReLU*) function :cite:`He.Zhang.Ren.ea.2015`. +including the *parametrized ReLU* (*pReLU*) function :cite:`He.Zhang.Ren.ea.2015`. 
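A tiny check of the $\mathcal{O}(dq)$ parametrization cost mentioned in the softmax-regression hunk above, with $d$ and $q$ chosen arbitrarily for illustration:

```python
import torch
from torch import nn

# A fully connected layer has d * q weights plus q biases, i.e., O(dq) parameters.
d, q = 512, 1000
fc = nn.Linear(d, q)
n_params = sum(p.numel() for p in fc.parameters())
assert n_params == d * q + q
print(n_params)  # 513000
```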
This variation adds a linear term to ReLU, so some information still gets through, even when the argument is negative: diff --git a/chapter_multilayer-perceptrons/numerical-stability-and-init.md b/chapter_multilayer-perceptrons/numerical-stability-and-init.md index cc0a205769..a59d5ae244 100644 --- a/chapter_multilayer-perceptrons/numerical-stability-and-init.md +++ b/chapter_multilayer-perceptrons/numerical-stability-and-init.md @@ -64,7 +64,7 @@ from jax import grad, vmap Consider a deep network with $L$ layers, input $\mathbf{x}$ and output $\mathbf{o}$. With each layer $l$ defined by a transformation $f_l$ -parameterized by weights $\mathbf{W}^{(l)}$, +parametrized by weights $\mathbf{W}^{(l)}$, whose hidden layer output is $\mathbf{h}^{(l)}$ (let $\mathbf{h}^{(0)} = \mathbf{x}$), our network can be expressed as: diff --git a/chapter_optimization/gd.md b/chapter_optimization/gd.md index 3e75ba5224..70c64c15b5 100644 --- a/chapter_optimization/gd.md +++ b/chapter_optimization/gd.md @@ -329,7 +329,7 @@ $$\mathbf{x} \leftarrow \mathbf{x} - \eta \mathrm{diag}(\mathbf{H})^{-1} \nabla While this is not quite as good as the full Newton's method, it is still much better than not using it. -To see why this might be a good idea consider a situation where one variable denotes height in millimeters and the other one denotes height in kilometers. Assuming that for both the natural scale is in meters, we have a terrible mismatch in parameterizations. Fortunately, using preconditioning removes this. Effectively preconditioning with gradient descent amounts to selecting a different learning rate for each variable (coordinate of vector $\mathbf{x}$). +To see why this might be a good idea consider a situation where one variable denotes height in millimeters and the other one denotes height in kilometers. Assuming that for both the natural scale is in meters, we have a terrible mismatch in parametrizations. Fortunately, using preconditioning removes this. Effectively preconditioning with gradient descent amounts to selecting a different learning rate for each variable (coordinate of vector $\mathbf{x}$). As we will see later, preconditioning drives some of the innovation in stochastic gradient descent optimization algorithms. diff --git a/chapter_optimization/optimization-intro.md b/chapter_optimization/optimization-intro.md index 5aa61a7116..1509290762 100644 --- a/chapter_optimization/optimization-intro.md +++ b/chapter_optimization/optimization-intro.md @@ -220,7 +220,7 @@ As we saw, optimization for deep learning is full of challenges. Fortunately the * Minimizing the training error does *not* guarantee that we find the best set of parameters to minimize the generalization error. * The optimization problems may have many local minima. * The problem may have even more saddle points, as generally the problems are not convex. -* Vanishing gradients can cause optimization to stall. Often a reparameterization of the problem helps. Good initialization of the parameters can be beneficial, too. +* Vanishing gradients can cause optimization to stall. Often a reparametrization of the problem helps. Good initialization of the parameters can be beneficial, too. ## Exercises diff --git a/chapter_recurrent-neural-networks/rnn.md b/chapter_recurrent-neural-networks/rnn.md index f0f09ff954..18f7d915e5 100644 --- a/chapter_recurrent-neural-networks/rnn.md +++ b/chapter_recurrent-neural-networks/rnn.md @@ -129,7 +129,7 @@ of the output layer. 
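Relating to the preconditioning update $\mathbf{x} \leftarrow \mathbf{x} - \eta\,\mathrm{diag}(\mathbf{H})^{-1} \nabla f(\mathbf{x})$ in the gd.md hunk above: a toy quadratic of my own choosing with badly mismatched coordinate scales, showing that dividing by the Hessian diagonal amounts to choosing a per-coordinate learning rate.

```python
import numpy as np

# f(x) = 0.5 * (h1 * x1^2 + h2 * x2^2): the two coordinates live on wildly
# different scales (think millimeters vs. kilometers), so the diagonal of the
# Hessian, h = (h1, h2), is badly mismatched.
h = np.array([0.01, 100.0])

def grad(x):
    return h * x

def descend(x, eta, precondition, steps=100):
    for _ in range(steps):
        g = grad(x)
        if precondition:
            g = g / h              # diag(H)^{-1} grad: per-coordinate learning rates
        x = x - eta * g
    return x

x0 = np.array([1.0, 1.0])
print(descend(x0, eta=0.01, precondition=False))  # first coordinate has barely moved
print(descend(x0, eta=0.5, precondition=True))    # both coordinates shrink geometrically
```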
It is worth mentioning that even at different time steps, RNNs always use these model parameters. -Therefore, the parameterization cost of an RNN +Therefore, the parametrization cost of an RNN does not grow as the number of time steps increases. :numref:`fig_rnn` illustrates the computational logic of an RNN at three adjacent time steps.
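To illustrate the point about parametrization cost in the rnn.md hunk above, a short sketch (hyperparameters are mine) showing that the parameter count of an RNN is fixed by the input and hidden sizes, not by how many time steps it is unrolled over:

```python
import torch
from torch import nn

# The same weights are reused at every time step, so the parameter count does
# not depend on the sequence length.
rnn = nn.RNN(input_size=10, hidden_size=32)
n_params = sum(p.numel() for p in rnn.parameters())
print(n_params)  # 32*10 + 32*32 + 32 + 32 = 1408

# The very same module handles sequences of very different lengths:
for num_steps in (5, 500):
    X = torch.randn(num_steps, 1, 10)      # (time steps, batch, input features)
    output, state = rnn(X)
    print(num_steps, tuple(output.shape), n_params)
```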