This is the course page for 18.063 Matrix Calculus at MIT, taught in January 2025 (IAP) by Professors Alan Edelman and Steven G. Johnson.
- For past versions of this course, see Matrix Calculus in IAP 2023 (OCW) on OpenCourseWare (also on github, with videos on YouTube). See also Matrix Calculus in IAP 2022 (OCW) (also on github), and Matrix Calculus 2024 (github); the previous years used the temporary 18.S096 "special subject" course number.
Lectures: MWF 11am–1pm, Jan 13–Jan 31 in room 4-370; lecture recordings to be posted (MIT only). 3 units, 2 problem sets due Jan 24 and Jan 31 — submitted electronically via Canvas, no exams. TA/grader: TBD.
Course Notes: Draft notes from IAP 2024. Other materials to be posted.
Piazza forum: Online discussions at Piazza.
Description:
We all know that calculus courses such as 18.01 and 18.02 are univariate and vector calculus, respectively. Modern applications such as machine learning and large-scale optimization require the next big step, "matrix calculus" and calculus on arbitrary vector spaces.
This class covers a coherent approach to matrix calculus showing techniques that allow you to think of a matrix holistically (not just as an array of scalars), generalize and compute derivatives of important matrix factorizations and many other complicated-looking operations, and understand how differentiation formulas must be re-imagined in large-scale computing. We will discuss reverse/adjoint/backpropagation differentiation, custom vector-Jacobian products, and how modern automatic differentiation is more computer science than calculus (it is neither symbolic formulas nor finite differences).
Prerequisites: Linear Algebra such as 18.06 and multivariate calculus such as 18.02.
The course will involve simple numerical computations using the Julia language. Ideally, install it on your own computer following these instructions, but as a fallback you can run it in the cloud here:
Topics:
Here are some of the planned topics:
- Derivatives as linear operators and linear approximation on arbitrary vector spaces: beyond gradients and Jacobians.
- Derivatives of functions with matrix inputs and/or outputs (e.g. matrix inverses and determinants). Kronecker products and matrix "vectorization".
- Derivatives of matrix factorizations (e.g. eigenvalues/SVD) and derivatives with constraints (e.g. orthogonal matrices).
- Multidimensional chain rules, and the significance of right-to-left ("forward") vs. left-to-right ("reverse") composition. Chain rules on computational graphs (e.g. neural networks).
- Forward- and reverse-mode manual and automatic multivariate differentiation.
- Adjoint methods (vJp/pullback rules) for derivatives of solutions of linear, nonlinear, and differential equations.
- Application to nonlinear root-finding and optimization. Multidimensional Newton and steepest-descent methods.
- Applications in engineering/scientific optimization and machine learning.
- Second derivatives, Hessian matrices, quadratic approximations, and quasi-Newton methods.
- part 1: overview (slides)
- part 2: derivatives as linear operators: matrix functions, gradients, product and chain rule
- video (MIT only)
Re-thinking derivatives as linear operators: f(x+dx)-f(x)=df=f′(x)[dx] — f′ is the linear operator that gives the change df in the output from a "tiny" change dx in the inputs, to first order in dx (i.e. dropping higher-order terms). When we have a vector function f(x)∈ℝᵐ of vector inputs x∈ℝⁿ, then f'(x) is a linear operator that takes n inputs to m outputs, which we can think of as an m×n matrix called the Jacobian matrix (typically covered only superficially in 18.02).
In the same way, we can define derivatives of matrix-valued operators as linear operators on matrices. For example, f(X)=X² gives f'(X)[dX] = X dX + dX X. Or f(X) = X⁻¹ gives f'(X)[dX] = –X⁻¹ dX X⁻¹. These are perfectly good linear operators acting on matrices dX, even though they are not written in the form (Jacobian matrix)×(column vector)! (We could rewrite them in the latter form by reshaping the inputs dX and the outputs df into column vectors, more formally by choosing a basis, and we will later cover how this process can be made more elegant using Kronecker products. But for the most part it is neither necessary nor desirable to express all linear operators as Jacobian matrices in this way.)
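For example, a minimal numerical check of these linear-operator formulas (an illustrative sketch, not part of the posted course materials):

```julia
# Check that the linear-operator formulas above match an actual small perturbation dX.
using LinearAlgebra

X  = rand(4, 4)
dX = rand(4, 4) * 1e-6          # a "tiny" random perturbation

# f(X) = X^2:  df ≈ X*dX + dX*X
df_exact  = (X + dX)^2 - X^2
df_linear = X*dX + dX*X
@show norm(df_exact - df_linear) / norm(df_linear)   # ≈ 1e-6, shrinking linearly with ‖dX‖

# f(X) = X⁻¹:  df ≈ -X⁻¹ dX X⁻¹
df_exact  = inv(X + dX) - inv(X)
df_linear = -inv(X) * dX * inv(X)
@show norm(df_exact - df_linear) / norm(df_linear)   # again first-order accurate
```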
Reviewed the (easy) derivations of the sum rule d(f+g)=df+dg and the product rule d(fg) = (df)g+f(dg), directly from the definition of f(x+dx)-f(x)=df=f′(x)[dx], dropping higher-order terms.
Discussed the chain rule for f(x) = g(h(x)): f′(x) = g′(h(x)) h′(x), where this is a composition of two linear operations, performing h′ then g′ (and g′h′ ≠ h′g′!). For functions from vectors to vectors, the chain rule is simply the product of Jacobians. Moreover, as soon as you compose 3 or more functions, it can make a huge difference whether you multiply the Jacobians from left-to-right ("reverse-mode", or "backpropagation", or "adjoint differentiation") or right-to-left ("forward-mode"). Showed, for example, that if you have many inputs but a single output (as is common in machine learning and other types of optimization problems), it is vastly more efficient to multiply left-to-right than right-to-left, and such "backpropagation algorithms" are a key factor in the practicality of large-scale optimization.
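To make the cost difference concrete, here is a small timing sketch (illustrative only; the sizes and random Jacobians are arbitrary). For a scalar-valued composition, the overall Jacobian is a 1×n row (the gradient) times square Jacobians, and the associativity you choose changes the arithmetic from O(n²) to O(n³):

```julia
# J1 is 1×n (gradient row of the final scalar-valued step); J2 and J3 are n×n Jacobians.
# Left-to-right ("reverse mode"): two O(n^2) vector-matrix products.
# Right-to-left ("forward mode"): an O(n^3) matrix-matrix product first.
n  = 2000
J1 = rand(1, n); J2 = rand(n, n); J3 = rand(n, n)

reverse_mode() = (J1 * J2) * J3   # row vector times matrix, twice
forward_mode() = J1 * (J2 * J3)   # matrix times matrix first

reverse_mode(); forward_mode()    # warm up (compile) before timing
@time reverse_mode()
@time forward_mode()              # same answer, dramatically slower for large n
```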
Further reading: Draft Course Notes (link above), chapters 1 and 2. matrixcalculus.org (linked in the slides) is a fun site to play with derivatives of matrix and vector functions. The Matrix Cookbook has a lot of formulas for these derivatives, but no derivations. Some notes on vector and matrix differentiation were posted for 6.S087 from IAP 2021.
Further reading (gradients): A fancy name for a row vector is a "covector" or linear form, and the fancy version of the relationship between row and column vectors is the Riesz representation theorem, but until you get to non-Euclidean geometry you may be happier thinking of a row vector as the transpose of a column vector.
Further reading (chain rule): The terms "forward-mode" and "reverse-mode" differentiation are most prevalent in automatic differentiation (AD), which we will cover later in this course. You can find many, many articles online about backpropagation in neural networks. There are many other versions of this, e.g. in differential geometry the derivative linear operator (multiplying Jacobians and perturbations dx right-to-left) is called a pushforward, whereas multiplying a gradient row vector (covector) by a Jacobian left-to-right is called a pullback. This video on the principles of AD in Julia by Dr. Mohamed Tarek also starts with a similar left-to-right (reverse) vs right-to-left (forward) viewpoint and goes into how it translates to Julia code, and how you define custom chain-rule steps for Julia AD. In other fields, "reverse mode" is sometimes called an "adjoint method": see the notes on adjoint methods and slides from 18.335 (video).
Further reading (fancier math): the perspective of derivatives as linear operators is sometimes called a Fréchet derivative and you can find lots of very abstract (what I'm calling "fancy") presentations of this online, chock full of weird terminology whose purpose is basically to generalize the concept to weird types of vector spaces. The "little-o notation" o(δx) we're using here for "infinitesimal asymptotics" is closely related to the asymptotic notation used in computer science, but in computer science people are typically taking the limit as the argument (often called "n") becomes very large instead of very small.
- part 1: matrix Jacobians via vectorization; notes: 2×2 Matrix Jacobians (html) (pluto notebook source code) (jupyter notebook). Course notes: chapter 3.
- example of reverse-mode/adjoint differentiation for optimizing g(p) = f(A(p)⁻¹b), showing why a linear-operator formula like d(A⁻¹)=–A⁻¹ dA A⁻¹ is actually perfectly practical and usable, and why evaluating the chain rule outputs-to-inputs is so practically important. (Last few slides of lecture 1.)
Further reading: Wikipedia has a useful list of properties of the matrix trace. The "matrix dot product" introduced today is also called the Frobenius inner product, and the corresponding norm ("length" of the matrix viewed as a vector) is the Frobenius norm. When you "flatten" a matrix A by stacking its columns into a single vector, the result is called vec(A), and many important linear operations on matrices can be expressed as Kronecker products. The Matrix Cookbook has lots of formulas for derivatives of matrix functions. See the notes on adjoint methods and slides from 18.335 (video).
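As a small illustration of the vec/Kronecker-product viewpoint (a sketch, not from the posted notes), the linear operator dX ↦ X dX + dX X from lecture 1 becomes an ordinary n²×n² Jacobian matrix after vectorization, via the identity vec(AXB) = (Bᵀ⊗A)vec(X):

```julia
# Jacobian of f(X) = X^2 acting on vec(dX), for square X.
using LinearAlgebra

n  = 3
X  = rand(n, n)
Id = Matrix{Float64}(I, n, n)

# vec(X*dX) = (I ⊗ X) vec(dX)  and  vec(dX*X) = (Xᵀ ⊗ I) vec(dX):
J = kron(Id, X) + kron(transpose(X), Id)       # an n² × n² Jacobian matrix

dX  = rand(n, n) * 1e-6
lhs = vec((X + dX)^2 - X^2)                    # actual change, flattened
rhs = J * vec(dX)                              # Jacobian matrix times vec(dX)
@show norm(lhs - rhs) / norm(rhs)              # small: agrees to first order
```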
- generalized gradients and inner products — handwritten notes and course notes chapter 5
- also norms and derivatives: why norms on the input and output are needed to define a derivative
- more on handling units: when the components of a vector are quantities with different units, defining the inner product (and hence the norm) requires dimensional weight factors to scale the quantities. (Using the standard gradient/inner product implicitly uses weights given by whatever units you are using.) A change of variables (to nondimensionalize the problem) is equivalent (for steepest descent) to a nondimensionalization of the inner-product/norm, but the former is typically easier for use with off-the-shelf optimization software. Usually, you want to choose units/scaling so that all your quantities have similar scales; otherwise steepest descent may converge very slowly!
- The gradient of the determinant is ∇(det A) = det(A)A⁻ᵀ (course notes chapter 7)
- an amazing trick by Mathias (1996): f([A dA; 0I A]) = [f(A) f′(A)[dA]; 0I f(A)] (in Julia notation) for any analytic/smooth function f(A) acting on square matrices (e.g. matrix powers/polynomials, matrix exponentials, etcetera).
- pset 1 (due Jan 24)
Generalizing gradients to scalar functions f(x) for x in an arbitrary vector space V. The key thing is that we need not just a vector space, but an inner product x⋅y (a "dot product", also denoted ⟨x,y⟩ or ⟨x|y⟩); V is then formally called a Hilbert space. Then, for any scalar function, since df=f'(x)[dx] is a linear operator mapping dx∈V to scalars df∈ℝ (a "linear form"), it turns out that it must be a dot product of dx with "something", and we call that "something" the gradient! That is, once we define a dot product, then for any scalar function f(x) we can define ∇f by f'(x)[dx]=∇f⋅dx. So ∇f is always something with the same "shape" as x (the steepest-ascent direction).
Talked about the general requirements for an inner product: linearity, positivity, and (conjugate) symmetry (and also mentioned the Cauchy–Schwarz inequality, which follows from these properties). Gave some examples of inner products, such as the familiar Euclidean inner product xᵀy or a weighted inner product. Defined the most obvious inner product of m×n matrices: the Frobenius inner product A⋅B=sum(A .* B)=trace(AᵀB)=vec(A)ᵀvec(B), the sum of the products of the matrix entries. This also gives us the "Frobenius norm" ‖A‖=√(A⋅A)=√trace(AᵀA)=‖vec(A)‖, the square root of the sum of the squares of the entries. Using this, we can now take the derivatives of various scalar functions of matrices, e.g. we considered:
- f(A)=tr(A) ⇒ ∇f = I
- f(A)=‖A‖ ⇒ ∇f = A/‖A‖
- f(A)=xᵀAy ⇒ ∇f = xyᵀ (for constant x, y)
- f(A)=det(A) ⇒ ∇f = det(A)(A⁻¹)ᵀ = transpose of the adjugate of A
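A quick numerical sanity check of two of these formulas, using the Frobenius inner product (an illustrative sketch, not part of the course materials):

```julia
# dot(A, B) for matrices in Julia is the Frobenius inner product trace(AᵀB).
using LinearAlgebra

A  = rand(4, 4)
dA = rand(4, 4) * 1e-6

# f(A) = det(A):  ∇f = det(A) * inv(A)'  (transpose of the adjugate)
df_exact = det(A + dA) - det(A)
∇f = det(A) * transpose(inv(A))
@show (df_exact - dot(∇f, dA)) / df_exact      # ≈ 0 to first order

# f(A) = ‖A‖ (Frobenius norm):  ∇f = A / ‖A‖
df_exact = norm(A + dA) - norm(A)
∇f = A / norm(A)
@show (df_exact - dot(∇f, dA)) / df_exact
```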
Also talked about the definition of a norm (which can be obtained from an inner product if you have one, but can also be defined by itself), and why a norm is necessary to define a derivative: it is embedded in the definition of what a higher-order term o(δx) means. (Although there are many possible norms, in finite dimensions all norms are equivalent up to constant factors, so the definition of a derivative does not depend on the choice of norm.)
Made precise and derived (with the help of Cauchy–Schwarz) the well known fact that ∇f is the steepest-ascent direction, for any scalar-valued function on a vector space with an inner product (any Hilbert space), in the norm corresponding to that inner product. That is, if you take a step δx with a fixed length ‖δx‖=s, the greatest increase in f(x) to first order is found in a direction parallel to ∇f.
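Numerically, the steepest-ascent property is easy to see (a sketch with arbitrary choices: f(A)=det(A), a fixed step length, and random trial steps): no step of the same Frobenius norm beats the one parallel to ∇f, to first order.

```julia
# Among steps δA of fixed Frobenius norm s, the step parallel to ∇f maximizes
# the first-order change ∇f ⋅ δA (Cauchy–Schwarz).
using LinearAlgebra

A  = rand(4, 4)
∇f = det(A) * transpose(inv(A))        # gradient of f(A) = det(A), from above
s  = 1e-3

best = ∇f * (s / norm(∇f))             # step of length s parallel to ∇f
increase(δA) = dot(∇f, δA)             # first-order change f(A+δA) - f(A)

random_steps = [(δ = rand(4, 4) .- 0.5; δ * (s / norm(δ))) for _ in 1:1000]
@show increase(best)
@show maximum(increase.(random_steps)) # never exceeds increase(best)
```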
Closed with a sketch of an amazing formula by Mathias for the derivatives of smooth functions from square matrices to square matrices: for a sufficiently smooth f(A), it turns out that f([A δA; 0I A]) = [f(A) f′(A)[δA]; 0I f(A)] (in Julia block-matrix notation). This is exact for any δA, even if it is not small! You will investigate this more closely for homework.
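A short check of this identity for f = exp, using Julia's built-in matrix exponential (an illustrative sketch):

```julia
# The upper-right block of exp([A δA; 0 A]) is exactly the directional derivative exp′(A)[δA].
using LinearAlgebra

n  = 4
A  = rand(n, n)
δA = rand(n, n)                     # need NOT be small: the identity is exact

M = [A δA; zeros(n, n) A]           # block upper-triangular matrix
F = exp(M)                          # matrix exponential of the block matrix
dfdA = F[1:n, n+1:2n]               # upper-right block = exp′(A)[δA]

# Compare against a centered finite difference (only the difference is approximate):
h  = 1e-5
fd = (exp(A + h*δA) - exp(A - h*δA)) / (2h)
@show norm(dfdA - fd) / norm(fd)    # small, limited only by the finite-difference error
```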
Further reading (∇det): Course notes, chapter 7. There are lots of discussions of the derivative of a determinant online, involving the "adjugate" matrix det(A)A⁻¹. Not as well documented is that the gradient of the determinant is the cofactor matrix widely used for the Laplace expansion of a determinant. The formula for the derivative of log(det A) is also nice, and logs of determinants appear in surprisingly many applications (from statistics to quantum field theory). The Matrix Cookbook contains many of these formulas, but no derivations. A nice application of d(det(A)) is solving for eigenvalues λ by applying Newton's method to det(A-λI)=0, and more generally one can solve det(M(λ))=0 for any function M(λ) — the resulting roots λ are called nonlinear eigenvalues (if M is nonlinear in λ), and one can apply Newton's method using the determinant-derivative formula here.
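For instance, here is a sketch (an illustrative example, not from the notes) of that Newton iteration: d(det M) = det(M) tr(M⁻¹ dM) turns Newton's method on det(A−λI)=0 into the simple update λ ← λ + 1/tr((A−λI)⁻¹).

```julia
# Newton's method on g(λ) = det(A - λI) = 0, using g′(λ) = -det(A - λI) * tr((A - λI)⁻¹),
# so the update λ - g/g′ simplifies to λ + 1 / tr((A - λI)⁻¹).
using LinearAlgebra

function newton_eig(A, λ; iters = 20)
    for _ in 1:iters
        λ += 1 / tr(inv(A - λ*I))
    end
    return λ
end

A = Symmetric(rand(5, 5))               # symmetric, so the eigenvalues are real
λ = newton_eig(A, 1.0)                  # typically converges to a nearby eigenvalue
@show minimum(abs.(eigvals(A) .- λ))    # distance to the closest true eigenvalue ≈ 0
```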
Further reading (Mathias formula): The earliest derivation of this formula seems to be by Mathias (1996). A generalization is proved in Higham (2008).
- part 1: forward-mode automatic differentiation (AD) via dual numbers (Julia notebook) - course notes, chapter 8
- part 2: reverse-mode differentiation examples - course notes, section 6.3
- slides on nonlinear root-finding, optimization, and adjoint-method differentiation - talked about reverse-mode "implicit differentiation" involving implicit functions defined by solutions of nonlinear systems of equations
- Reverse-mode gradients for neural networks: handwritten backpropagation notes
- video (MIT only)
Further reading on forward AD: Course notes, chapter 8. Googling "automatic differentiation" will turn up many, many resources — this is a huge practical field these days. ForwardDiff.jl (described in detail by this paper) in Julia uses dual number arithmetic similar to the lecture to compute derivatives; see also this AMS review article, or google "dual number automatic differentiation" for many other reviews. Adrian Hill posted some nice lecture notes on automatic differentiation (Julia-based) for an ML course at TU Berlin (Summer 2023). TaylorDiff.jl extends this to higher-order derivatives.
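To give the flavor of the dual-number approach (a minimal sketch, not the course notebook and not how ForwardDiff.jl is actually implemented), carrying (value, derivative) pairs through arithmetic computes exact forward-mode derivatives:

```julia
# A minimal "dual number" type: each operation propagates the derivative by the
# sum, product, and chain rules.
struct Dual
    val::Float64    # value
    der::Float64    # derivative
end

Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)                  # sum rule
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)  # product rule
Base.:*(c::Real, a::Dual) = Dual(c * a.val, c * a.der)
Base.sin(a::Dual) = Dual(sin(a.val), cos(a.val) * a.der)                        # chain rule

f(x) = sin(x * x) + 3x           # an ordinary Julia function
x0 = 1.7
fx = f(Dual(x0, 1.0))            # seed with dx/dx = 1
@show fx.val ≈ f(x0)
@show fx.der ≈ 2x0 * cos(x0^2) + 3   # matches the analytic derivative
```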
Further reading on adjoint methods: Course notes, section 6.3. A useful review of topology-optimization methods in engineering (where "every pixel/voxel" of a structure is a degree of freedom) can be found in Sigmund and Maute (2013). See the notes on adjoint methods and slides from 18.335 (video). "Implicit" differentiation, for passing the chain rule through implicit functions (usually in reverse mode) via the implicit function theorem, has become an increasingly prominent topic in AD; see e.g. the paper "Efficient and modular implicit differentiation" (Blondel et al., 2021), and the ImplicitDifferentiation.jl package in Julia. A non-obvious example of an implicit function is the solution of a "nested" optimization problem x(p) = argminₓ g(p,x), called a bilevel optimization problem, which (locally) satisfies the nonlinear equations ∇ₓg=0 (or more generally the KKT conditions), and these can be differentiated via the implicit function theorem; see e.g. Amos and Kolter (2017).
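Here is a sketch of the adjoint idea for the running example g(p) = f(A(p)⁻¹b), with an assumed parameterization A(p) = A₀ + Diagonal(p) and f(x) = cᵀx chosen purely for illustration: one extra solve with Aᵀ yields the entire gradient, no matter how many parameters p there are.

```julia
# Adjoint (reverse-mode) gradient of g(p) = cᵀ x(p) where A(p) x = b and
# A(p) = A0 + Diagonal(p):  dA = Diagonal(dp)  ⇒  ∂g/∂pₖ = -λₖ xₖ,  with Aᵀλ = c.
using LinearAlgebra

n  = 50
A0 = rand(n, n) + n*I              # well-conditioned base matrix (illustrative)
b  = rand(n)
c  = rand(n)

A(p) = A0 + Diagonal(p)
g(p) = dot(c, A(p) \ b)            # g(p) = cᵀ x,  where A(p) x = b

function grad_g(p)
    x = A(p) \ b                   # one forward solve
    λ = A(p)' \ c                  # one adjoint solve: Aᵀ λ = ∂f/∂x = c
    return -(λ .* x)               # the whole gradient ∇g
end

p  = rand(n)
dp = rand(n) * 1e-6
@show (g(p + dp) - g(p)) - dot(grad_g(p), dp)   # ≈ 0 to first order
```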
Further reading on backpropagation for NNs: Strang (2019) section VII.3 and 18.065 OCW lecture 27. You can find many, many articles online about backpropagation in neural networks. Backpropagation for neural networks is closely related to backpropagation/adjoint methods for recurrence relations (course notes), and on computational graphs (blog post). We will return to computational graphs in a future lecture.
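For concreteness, a tiny manual-backpropagation sketch (an illustrative example, not from the linked notes): a two-layer network y = W₂ tanh.(W₁x) with loss L = ½‖y−t‖², with the reverse pass written out explicitly and spot-checked against a finite difference.

```julia
# Manual reverse-mode ("backpropagation") gradients for a tiny two-layer network.
using LinearAlgebra

W1, W2 = rand(4, 3), rand(2, 4)
x, t   = rand(3), rand(2)

# forward pass (keep the intermediates)
a = W1 * x
h = tanh.(a)
y = W2 * h
L = 0.5 * sum(abs2, y - t)

# reverse pass: propagate the sensitivity (gradient covector) from outputs to inputs
dLdy = y - t                    # ∂L/∂y
dLdh = W2' * dLdy               # ∂L/∂h
dLda = dLdh .* (1 .- h.^2)      # ∂L/∂a,  since tanh′(a) = 1 - tanh(a)²
∇W1  = dLda * x'                # ∂L/∂W1 (outer product)
∇W2  = dLdy * h'                # ∂L/∂W2

# finite-difference spot check of one entry of ∇W1
δ = 1e-6
W1p = copy(W1); W1p[2, 1] += δ
Lp  = 0.5 * sum(abs2, W2 * tanh.(W1p * x) - t)
@show (Lp - L) / δ              # finite-difference estimate
@show ∇W1[2, 1]                 # agrees to several digits
```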