From e71d847e6f3f2ae9e38bbb2b7b355675ebcf6f71 Mon Sep 17 00:00:00 2001 From: Christopher Paciorek Date: Thu, 2 Nov 2023 10:45:22 -0700 Subject: [PATCH] a bit more edits to unit 10 --- _freeze/units/unit10-linalg/execute-results/html.json | 4 ++-- _freeze/units/unit10-linalg/execute-results/tex.json | 4 ++-- units/unit10-linalg.qmd | 6 +++--- 3 files changed, 7 insertions(+), 7 deletions(-) diff --git a/_freeze/units/unit10-linalg/execute-results/html.json b/_freeze/units/unit10-linalg/execute-results/html.json index 69590e4..b040609 100644 --- a/_freeze/units/unit10-linalg/execute-results/html.json +++ b/_freeze/units/unit10-linalg/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "04df3b9264ae4187721edd4da62e08c5", + "hash": "d1d0a6f17e0930b47afaca760a06ac18", "result": { - "markdown": "---\ntitle: \"Numerical linear algebra\"\nauthor: \"Chris Paciorek\"\ndate: \"2023-10-24\"\nformat:\n pdf:\n documentclass: article\n margin-left: 30mm\n margin-right: 30mm\n toc: true\n html:\n theme: cosmo\n css: ../styles.css\n toc: true\n code-copy: true\n code-block-background: true\nexecute:\n freeze: auto\nfrom: markdown+tex_math_single_backslash\n---\n\n\n\n\n[PDF](./unit10-linalg.pdf){.btn .btn-primary}\n\nReferences:\n\n- Gentle: Numerical Linear Algebra for Applications in Statistics\n (available via UC Library Search) (my notes here are based primarily\n on this source) [Gentle-NLA]\n - Gentle: Matrix Algebra also has much of this material.\n- Gentle: Computational Statistics [Gentle-CS]\n- Lange: Numerical Analysis for Statisticians\n- Monahan: Numerical Methods of Statistics\n\nVideos (optional): \n\nThere are various videos from 2020 in the bCourses Media Gallery that you\ncan use for reference if you want to. \n\n - Video 1. Ill-conditioned problems, part 1\n - Video 2. Ill-conditioned problems, part 2\n - Video 3. Triangular systems of equations\n - Video 4. Solving systems of equations via LU, part 1\n - Video 5. Solving systems of equations via LU, part 2\n - Video 6. Solving systems of equations via LU, part 3\n - Video 7. Cholesky decomposition\n\n\nIn working through how to compute something or understanding an\nalgorithm, it can be very helpful to depict the matrices and vectors\ngraphically. We'll see this on the board in class.\n\n# 1. Preliminaries\n\n## Context\n\nMany statistical and machine learning methods involve linear algebra of\nsome sort - at the very least matrix multiplication and very often some\nsort of matrix decomposition to fit models and do analysis: linear\nregression, various more sophisticated forms of regression, deep neural\nnetworks, principle components analysis (PCA) and the wide varieties of\ngeneralizations and variations on PCA, etc., etc.\n\n## Goals\n\nHere's what I'd like you to get out of this unit:\n\n1. How to think about the computational order (number of computations\n involved) of a problem\n2. How to choose a computational approach to a given linear algebra\n calculation you need to do.\n3. An understanding of how issues with computer numbers (Unit 8) affect\n linear algebra calculations.\n\n## Key principle\n\n**The form of a mathematical expression and how it should be evaluated\non a computer may be very different.** Better computational approaches\ncan increase speed and improve the numerical properties of the\ncalculation.\n\n- Example 1 (already seen in Unit 5): If $X$ and $Y$ are matrices and $z$\nis a vector, we should compute $X(Yz)$ rather than $(XY)z$; the former\nis much more computationally efficient. 
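  As a quick, machine-dependent check of this claim (a small sketch added here, not part of the original example):

  ```python
  import numpy as np
  import time

  n = 3000
  X = np.random.normal(size=(n, n))
  Y = np.random.normal(size=(n, n))
  z = np.random.normal(size=n)

  t0 = time.time(); r1 = (X @ Y) @ z; t_xy = time.time() - t0   # forms the n x n product XY first: O(n^3)
  t0 = time.time(); r2 = X @ (Y @ z); t_yz = time.time() - t0   # two matrix-vector products: O(n^2)
  print(t_xy, t_yz, np.allclose(r1, r2))                        # same answer, very different cost
  ```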
\n- Example 2: We do not compute $(X^{\\top}X)^{-1}X^{\\top}Y$ by computing\n$X^{\\top}X$ and finding its inverse. In fact, perhaps more surprisingly,\nwe may never actually form $X^{\\top}X$ in some implementations.\n- Example 3: Suppose I have a matrix $A$, and I want to permute (switch)\ntwo rows. I can do this with a permutation matrix, $P$, which is mostly\nzeroes. On a computer, in general I wouldn't need to even change the\nvalues of $A$ in memory in some cases (e.g., if I were to calculate\n$PAB$). Why not?\n\n## Computational complexity\n\nWe can assess the computational complexity of a linear algebra\ncalculation by counting the number multiplys/divides and the number of\nadds/subtracts. Sidenote: addition is a bit faster than multiplication,\nso some algorithms attempt to trade multiplication for addition.\n\nIn general we do not try to count the actual number of calculations, but\njust their order, though in some cases in this unit we'll actually get a\nmore exact count. In general, we denote this as $O(f(n))$ which means\nthat the number of calculations approaches $cf(n)$ as $n\\to\\infty$\n(i.e., we know the calculation is approximately proportional to $f(n)$).\nConsider matrix multiplication, $AB$, with matrices of size $a\\times b$\nand $b\\times c$. Each column of the second matrix is multiplied by all\nthe rows of the first. For any given inner product of a row by a column,\nwe have $b$ multiplies. We repeat these operations for each column and\nthen for each row, so we have $abc$ multiplies so $O(abc)$ operations.\nWe could count the additions as well, but there's usually an addition\nfor each multiply, so we can usually just count the multiplys and then\nsay there are such and such {multiply and add}s. This is Monahan's\napproach, but you may see other counting approaches where one counts the\nmultiplys and the adds separately.\n\nFor two symmetric, $n\\times n$ matrices, this is $O(n^{3})$. Similarly,\nmatrix factorization (e.g., the Cholesky decomposition) is $O(n^{3})$\nunless the matrix has special structure, such as being sparse. As\nmatrices get large, the speed of calculations decreases drastically\nbecause of the scaling as $n^{3}$ and memory use increases drastically.\nIn terms of memory use, to hold the result of the multiply indicated\nabove, we need to hold $ab+bc+ac$ total elements, which for symmetric\nmatrices sums to $3n^{2}$. So for a matrix with $n=10000$, we have\n$3\\cdot10000^{2}\\cdot8/1e9=2.4$Gb.\n\nWhen we have $O(n^{q})$ this is known as polynomial time. Much worse is\n$O(b^{n})$ (exponential time), while much better is $O(\\log n$) (log\ntime). Computer scientists talk about NP-complete problems; these are\nessentially problems for which there is not a polynomial time\nalgorithm - it turns out all such problems can be rewritten such that\nthey are equivalent to one another.\n\nIn real calculations, it's possible to have the actual time ordering of\ntwo approaches differ from what the order approximations tell us. For\nexample, something that involves $n^{2}$ operations may be faster than\none that involves $1000(n\\log n+n)$ even though the former is $O(n^{2})$\nand the latter $O(n\\log n)$. 
The problem is that the constant, $c=1000$,\ncan matter (depending on how big $n$ is), as can the extra calculations\nfrom the lower order term(s), in this case $1000n$.\n\nA note on terminology: *flops* stands for both floating point operations\n(the number of operations required) and floating point operations per\nsecond, the speed of calculation.\n\n## Notation and dimensions\n\nI'll try to use capital letters for matrices, $A$, and lower-case for\nvectors, $x$. Then $x_{i}$ is the ith element of $x$, $A_{ij}$ is the\n$i$th row, $j$th column element, and $A_{\\cdot j}$ is the $j$th column\nand $A_{i\\cdot}$ the $i$th row. By default, we'll consider a vector,\n$x$, to be a one-column matrix, and $x^{\\top}$ to be a one-row matrix.\nSome of the references given at the start of this Unit also use $a_{ij}$ for $A_{ij}$ and\n$a_{j}$ for the $j$th column.\n\nThroughout, we'll need to be careful that the matrices involved in an\noperation are conformable: for $A+B$ both matrices need to be of the\nsame dimension, while for $AB$ the number of columns of $A$ must match\nthe number of rows of $B$. Note that this allows for $B$ to be a column\nvector, with only one column, $Ab$. Just checking dimensions is a good\nway to catch many errors. Example: is\n$\\mbox{Cov}(Ax)=A\\mbox{Cov}(x)A^{\\top}$ or\n$\\mbox{Cov}(Ax)=A^{\\top}\\mbox{Cov}(x)A$? Well, if $A$ is $m\\times n$, it\nmust be the former, as the latter is not conformable.\n\nThe **inner product** of two vectors is\n$\\sum_{i}x_{i}y_{i}=x^{\\top}y\\equiv\\langle x,y\\rangle\\equiv x\\cdot y$.\n\nThe **outer product** is $xy^{\\top}$, which comes from all pairwise\nproducts of the elements.\n\nWhen the indices of summation should be obvious, I'll sometimes leave\nthem implicit. Ask me if it's not clear.\n\n## Norms\n\nFor a vector, $\\|x\\|_{p}=(\\sum_{i}|x_{i}|^{p})^{1/p}$ and the standard (Euclidean)\nnorm is $\\|x\\|_{2}=\\sqrt{\\sum x_{i}^{2}}=\\sqrt{x^{\\top}x}$, just the\nlength of the vector in Euclidean space, which we'll refer to as\n$\\|x\\|$, unless noted otherwise.\n\nOne commonly used norm for a matrix is\nthe Frobenius norm, $\\|A\\|_{F}=(\\sum_{i,j}a_{ij}^{2})^{1/2}$.\n\nIn this Unit, we'll often make use of the **induced matrix norm**, which is\ndefined relative to a corresponding vector norm, $\\|\\cdot\\|$, as:\n$$\\|A\\|=\\sup_{x\\ne0}\\frac{\\|Ax\\|}{\\|x\\|}$$ So we have\n$$\\|A\\|_{2}=\\sup_{x\\ne0}\\frac{\\|Ax\\|_{2}}{\\|x\\|_{2}}=\\sup_{\\|x\\|_{2}=1}\\|Ax\\|_{2}$$\nIf you're not familiar with the supremum (\"sup\" above), you can just\nthink of it as taking the maximum. In the case of the 2-norm, the norm turns\nout to be the largest singular value in the singular value decomposition (SVD)\nof the matrix.\n\nWe can interpret the norm of a matrix as the most that the matrix\ncan stretch a vector when multiplying by the vector (relative to the\nlength of the vector).\n\nA property of any legitimate matrix norm (including the induced norm) is\nthat $\\|AB\\|\\leq\\|A\\|\\|B\\|$. Also recall that norms must obey the triangle\ninequality, $\\|A+B\\|\\leq\\|A\\|+\\|B\\|$.\n\nA normalized vector is one with \"length\", i.e., Euclidean norm, of one.\nWe can easily normalize a vector: $\\tilde{x}=x/\\|x\\|$\n\nThe angle between two vectors is\n$$\\theta=\\cos^{-1}\\left(\\frac{\\langle x,y\\rangle}{\\sqrt{\\langle x,x\\rangle\\langle y,y\\rangle}}\\right)$$\n\n## Orthogonality\n\nTwo vectors are orthogonal if $x^{\\top}y=0$, in which case we say\n$x\\perp y$. 
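To make these definitions concrete, here is a small NumPy sketch (added here for illustration) that computes a norm, normalizes a vector, finds the angle between two vectors, and checks orthogonality:

```python
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([-4.0, 3.0])

norm_x = np.sqrt(np.sum(x**2))      # same as np.linalg.norm(x); equals 5 here
x_tilde = x / norm_x                # normalized (unit-length) version of x

cos_theta = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
theta = np.arccos(cos_theta)        # angle between x and y; pi/2 here

print(norm_x, np.linalg.norm(x_tilde), theta, np.dot(x, y))   # 5.0 1.0 1.5707... 0.0 (orthogonal)
```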
An **orthogonal matrix** is a square matrix in which all of the\ncolumns are orthogonal to each other and normalized. The same holds for\nthe rows. Orthogonal matrices\ncan be shown to have full rank. Furthermore if $A$ is orthogonal,\n$A^{\\top}A=I$, so $A^{-1}=A^{\\top}$. Given all this, the determinant of\northogonal $A$ is either 1 or -1. Finally the product of two orthogonal\nmatrices, $A$ and $B$, is also orthogonal since\n$(AB)^{\\top}AB=B^{\\top}A^{\\top}AB=B^{\\top}B=I$.\n\n#### Permutations\n\nSometimes we make use of matrices that permute two rows (or two columns)\nof another matrix when multiplied. Such a matrix is known as an\nelementary permutation matrix and is an orthogonal matrix with a\ndeterminant of -1. You can multiply such matrices to get more general\npermutation matrices that are also orthogonal. If you premultiply by\n$P$, you permute rows, and if you postmultiply by $P$ you permute\ncolumns. Note that on a computer, you wouldn't need to actually do the\nmultiply (and if you did, you should use a sparse matrix routine), but\nrather one can often just rework index values that indicate where\nrelevant pieces of the matrix are stored (more in the next section).\n\n## Some vector and matrix properties\n\n$AB\\ne BA$ but $A+B=B+A$ and $A(BC)=(AB)C$.\n\nIn Python, recall the syntax is\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nA + B\n\n# Matrix multiplication\nnp.matmul(A, B) \nA @ B # alternative\nA.dot(B) # not recommended by the NumPy docs\n\nA * B # Hadamard (direct) product\n```\n:::\n\n\n\nYou don't need the spaces, but they're nice for code readability.\n\n## Trace and determinant of square matrices\n\nThe trace of a matrix is the sum of the diagonal elements. For square\nmatrices, $\\mbox{tr}(A+B)=\\mbox{tr}(A)+\\mbox{tr}(B)$,\n$\\mbox{tr}(A)=\\mbox{tr}(A^{\\top})$.\n\nWe also have $\\mbox{tr}(ABC)=\\mbox{tr}(CAB)=\\mbox{tr}(BCA)$ - basically\nyou can move a matrix from the beginning to the end or end to beginning,\nprovided they are conformable for this operation. This is helpful for a\ncouple reasons:\n\n1. We can find the ordering that reduces computation the most if the\n individual matrices are not square.\n2. $x^{\\top}Ax=\\mbox{tr}(x^{\\top}Ax)$ since the quadratic form,\n $x^{\\top}Ax$, is a scalar, and this is equal to\n $\\mbox{tr}(xx^{\\top}A)$ where $xx^{\\top}A$ is a matrix. It can be\n helpful to be able to go back and forth between a scalar and a trace\n in some statistical calculations.\n\nFor square matrices, the determinant exists and we have $|AB|=|A||B|$\nand therefore, $|A^{-1}|=1/|A|$ since $|I|=|AA^{-1}|=1$. Also\n$|A|=|A^{\\top}|$, which can be seen using the QR decomposition for $A$\nand understanding properties of determinants of triangular matrices (in\nthis case $R$) and orthogonal matrices (in this case $Q$).\n\n## Transposes and inverses\n\nFor square, invertible matrices, we have that\n$(A^{-1})^{\\top}=(A^{\\top})^{-1}$. Why? Since we have\n$(AB)^{\\top}=B^{\\top}A^{\\top}$, we have:\n$$A^{\\top}(A^{-1})^{\\top}=(A^{-1}A)^{\\top}=I$$ so\n$(A^{\\top})^{-1}=(A^{-1})^{\\top}$.\n\nFor two invertible matrices, we have that\n$(AB)^{-1} = B^{-1}A^{-1}$ since\n$B^{-1}A^{-1} AB = I$.\n\n#### Other matrix multiplications\n\nThe Hadamard or direct product is simply multiplication of the\ncorrespoding elements of two matrices by each other. 
In R this is\nsimply` A * B`.\\\n**Challenge**: How can I find $\\mbox{tr}(AB)$ without using `A %*% B` ?\n\nThe Kronecker product is the product of each element of one matrix with\nthe entire other matrix\"\n\n$$A\\otimes B=\\left(\\begin{array}{ccc}\nA_{11}B & \\cdots & A_{1m}B\\\\\n\\vdots & \\ddots & \\vdots\\\\\nA_{n1}B & \\cdots & A_{nm}B\n\\end{array}\\right)$$\n\nThe inverse of a Kronecker product is the Kronecker product of the\ninverses,\n\n$$ B^{-1} \\otimes A^{-1} $$ \n\nwhich is obviously quite a bit faster because\nthe inverse (i.e., solving a system of equations) in this special case\nis $O(n^{3}+m^{3})$ rather than the naive approach being $O((nm)^{3})$.\n\n## Matrix decompositions\n\nA matrix decomposition is a re-expression of a matrix, $A$, in terms of\na product of two or three other, simpler matrices, where the\ndecomposition reveals structure or relationships present in the original\nmatrix, $A$. The \"simpler\" matrices may be simpler in various ways,\nincluding\n\n- having fewer rows or columns;\n- being diagonal, triangular or sparse in some way,\n- being orthogonal matrices.\n\nIn addition, once you have a decomposition, computation is generally\neasier, because of the special structure of the simpler matrices.\n\nWe'll see this in great detail in Section 3.\n\n# 2. Statistical interpretations of matrix invertibility, rank, etc.\n\n## Linear independence, rank, and basis vectors\n\nA set of vectors, $v_{1},\\ldots v_{n}$, is linearly independent (LIN)\nwhen none of the vectors can be represented as a linear combination,\n$\\sum c_{i}v_{i}$, of the others for scalars, $c_{1},\\ldots,c_{n}$. If\nwe have vectors of length $n$, we can have at most $n$ linearly\nindependent vectors. The rank of a matrix is the number of linearly\nindependent rows (or columns - it's the same), and is at most the\nminimum of the number of rows and number of columns. We'll generally\nthink about it in terms of the dimension of the column space - so we can\njust think about the number of linearly independent columns.\n\nAny set of linearly independent vectors (say $v_{1},\\ldots,v_{n}$) span\na space made up of all linear combinations of those vectors\n($\\sum_{i=1}^{n}c_{i}v_{i}$). The spanning vectors are known as basis\nvectors. We can express a vector $y$ that is in the space with respect\nto (as a linear combination of) basis vectors as $y=\\sum_{i}c_{i}v_{i}$,\nwhere if the basis vectors are normalized and orthogonal, we can find\nthe weights as $c_{i}=\\langle y,v_{i}\\rangle$.\n\nConsider a regression context. We have $p$ covariates ($p$ columns in\nthe design matrix, $X$), of which $q\\leq p$ are linearly independent\ncovariates. This means that $p-q$ of the vectors can be written as\nlinear combos of the $q$ vectors. The space spanned by the covariate\nvectors is of dimension $q$, rather than $p$, and $X^{\\top}X$ has $p-q$\neigenvalues that are zero. The $q$ LIN vectors are basis vectors for the\nspace - we can represent any point in the space as a linear combination\nof the basis vectors. 
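Here is a small sketch (added for illustration) of the point above about $X^{\top}X$: if one column of $X$ is a linear combination of the others, then $X^{\top}X$ has a (numerically) zero eigenvalue and the rank is $q=p-1$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 4
X = rng.normal(size=(n, p))
X[:, 3] = X[:, 1] + 2 * X[:, 2]        # last column is a linear combination of two others

evals = np.linalg.eigvalsh(X.T @ X)    # eigenvalues of the symmetric crossproduct matrix
print(np.sort(evals))                  # smallest is zero up to roundoff
print(np.linalg.matrix_rank(X))        # 3, i.e., q = p - 1
```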
You can think of the basis vectors as being like\nthe axes of the space, except that the basis vectors are not orthogonal.\nSo it's like denoting a point in $\\Re^{q}$ as a set of $q$ numbers\ntelling us where on each of the axes we are - this is the same as a\nlinear combination of axis-oriented vectors.\n\nWhen fitting a regression, if $n=p=q$, a vector of $n$ observations can\nbe represented exactly as a linear combination of the $p$ basis vectors,\nso there is no residual and we have a single unique (and exact) solution\n(e.g., with $n=p=2$, the observations fall exactly on the simple linear\nregression line). If $np$, so the system is overdetermined - there is no exact\nsolution, but regression is all about finding solutions that minimize\nsome criterion about the differences between the observations and linear\ncombinations of the columns of the $X$ matrix (such as least squares or\npenalized least squares). In standard regression, we project the\nobservation vector onto the space spanned by the columns of the $X$\nmatrix, so we find the point in the space closest to the observation\nvector.\n\n## Invertibility, singularity, rank, and positive definiteness\n\nFor square matrices, let's consider how invertibility, singularity, rank\nand positive (or non-negative) definiteness relate.\n\nSquare matrices that are \"regular\" have an eigendecomposition,\n$A=\\Gamma\\Lambda\\Gamma^{-1}$ where $\\Gamma$ is a matrix with the\neigenvectors as the columns and $\\Lambda$ is a diagonal matrix of\neigenvalues, $\\Lambda_{ii}=\\lambda_{i}$. Symmetric matrices and matrices\nwith unique eigenvalues are regular, as are some other matrices. The\nnumber of non-zero eigenvalues is the same as the rank of the matrix.\nSquare matrices that have an inverse are also called nonsingular, and\nthis is equivalent to having full rank. If the matrix is symmetric, the\neigenvectors and eigenvalues are real and $\\Gamma$ is orthogonal, so we\nhave $A=\\Gamma\\Lambda\\Gamma^{\\top}$. The determinant of the matrix is\nthe product of the eigenvalues (why?), which is zero if it is less than\nfull rank. Note that if none of the eigenvalues are zero then\n$A^{-1}=\\Gamma\\Lambda^{-1}\\Gamma^{\\top}$.\n\nLet's focus on symmetric matrices. The symmetric matrices that tend to\narise in statistics are either positive definite (p.d.) or non-negative\ndefinite (n.n.d.). If a matrix is positive definite, then by definition\n$x^{\\top}Ax>0$ for any $x$. Note that if $\\mbox{Cov}(y)=A$ then\n$x^{\\top}Ax=x^{\\top}\\mbox{Cov}(y)x=\\mbox{Cov}(x^{\\top}y)=\\mbox{Var}(x^{\\top}y)$\nif so positive definiteness amounts to having linear combinations of\nrandom variables (with the elements of $x$ here being the weights)\nhaving positive variance. So we must have that positive definite\nmatrices are equivalent to variance-covariance matrices (I'll just refer\nto this as a variance matrix or as a covariance matrix). If $A$ is p.d.\nthen it has all positive eigenvalues and it must have an inverse, though\nas we'll see, from a numerical perspective, we may not be able to\ncompute it if some of the eigenvalues are very close to zero. 
In Python,\n`numpy.linalg.eig(A)[1]` is $\\Gamma$, with each column a vector, and\n`numpy.linalg.eig(A)[0]` contains the (unordered) eigenvalues.\n\nTo summarize, here are some of the various connections between\nmathematical and statistical properties of **positive definite**\nmatrices:\n\n$A$ positive definite $\\Leftrightarrow$ $A$ is a covariance matrix\n$\\Leftrightarrow$ $x^{\\top}Ax>0$ $\\Leftrightarrow$ $\\lambda_{i}>0$\n(positive eigenvalues) $\\Rightarrow$$|A|>0$ $\\Rightarrow$$A$ is\ninvertible $\\Leftrightarrow$ $A$ is non singular $\\Leftrightarrow$ $A$ is\nfull rank.\n\nAnd here are connections for positive semi-definite matrices:\n\n$A$ positive semi-definite $\\Leftrightarrow$ $A$ is a constrained\ncovariance matrix $\\Leftrightarrow$ $x^{\\top}Ax\\geq0$ and equal to 0 for\nsome $x$ $\\Leftrightarrow$ $\\lambda_{i}\\geq 0$ (non-negative eigenvalues),\nwith at least one zero $\\Rightarrow$ $|A|=0$ $\\Leftrightarrow$ $A$ is not\ninvertible $\\Leftrightarrow$ $A$ is singular $\\Leftrightarrow$ $A$ is not\nfull rank.\n\n## Interpreting an eigendecomposition\n\nLet's interpret the eigendecomposition in a generative context as a way\nof generating random vectors. We can generate $y$ s.t. $\\mbox{Cov}(y)=A$\nif we generate $y=\\Gamma\\Lambda^{1/2}z$ where $\\mbox{Cov}(z)=I$ and\n$\\Lambda^{1/2}$ is formed by taking the square roots of the eigenvalues.\nSo $\\sqrt{\\lambda_{i}}$ is the standard deviation associated with the\nbasis vector $\\Gamma_{\\cdot i}$. That is, the $z$'s provide the weights\non the basis vectors, with scaling based on the eigenvalues. So $y$ is\nproduced as a linear combination of eigenvectors as basis vectors, with\nthe variance attributable to the basis vectors determined by the\neigenvalues.\n\nTo go the other direction, we can project a vector $y$ onto the space \nspanned by the eigenvectors: $w = (\\Gamma^{\\top}\\Gamma)^{-1}\\Gamma^{\\top}y = \\Gamma^{\\top}y = \\Lambda^{1/2}z$,\nwhere the simplification of course comes from $\\Gamma$ being orthogonal.\n\nIf $x^{\\top}Ax\\geq0$ then $A$ is nonnegative definite (also called\npositive semi-definite). In this case one or more eigenvalues can be\nzero. Let's interpret this a bit more in the context of generating\nrandom vectors based on non-negative definite matrices,\n$y=\\Gamma\\Lambda^{1/2}z$ where $\\mbox{Cov}(z)=I$. Questions:\n\n1. What does it mean when one or more eigenvalue (i.e.,\n $\\lambda_{i}=\\Lambda_{ii}$) is zero?\n\n2. Suppose I have an eigenvalue that is very small and I set it to\n zero? What will be the impact upon $y$ and $\\mbox{Cov}(y)$?\n\n3. Now let's consider the inverse of a covariance matrix, known as the\n precision matrix, $A^{-1}=\\Gamma\\Lambda^{-1}\\Gamma^{\\top}$. What\n does it mean if a $(\\Lambda^{-1})_{ii}$ is very large? What if\n $(\\Lambda^{-1})_{ii}$ is very small?\n\nConsider an arbitrary $n\\times p$ matrix, $X$. Any crossproduct or sum\nof squares matrix, such as $X^{\\top}X$ is positive definite\n(non-negative definite if $p>n$). This makes sense as it's just a\nscaling of an empirical covariance matrix.\n\n## Generalized inverses (optional)\n\nSuppose I want to find $x$ such that $Ax=b$. Mathematically the answer\n(provided $A$ is invertible, i.e. of full rank) is $x=A^{-1}b$.\n\nGeneralized inverses arise in solving equations when $A$ is not full\nrank. A generalized inverse is a matrix, $A^{-}$ s.t. $AA^{-}A=A$. 
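As a quick numerical check of this defining property (a sketch added here; `np.linalg.pinv` computes the particular generalized inverse discussed next):

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.normal(size=(4, 2))
A = B @ B.T                             # 4 x 4 but rank 2, hence singular

A_ginv = np.linalg.pinv(A)              # a generalized inverse (the Moore-Penrose one)
print(np.allclose(A @ A_ginv @ A, A))   # True: A A^- A = A
print(np.linalg.matrix_rank(A))         # 2
```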
The\nMoore-Penrose inverse (the pseudo-inverse), $A^{+}$, is a (unique)\ngeneralized inverse that also satisfies some additional properties.\n$x=A^{+}b$ is the solution to the linear system, $Ax=b$, that has the\nshortest length for $x$.\n\nWe can find the pseudo-inverse based on an eigendecomposition (or an\nSVD) as $\\Gamma\\Lambda^{+}\\Gamma^{\\top}$. We obtain $\\Lambda^{+}$ from\n$\\Lambda$ as follows. For values $\\lambda_{i}>0$, compute\n$1/\\lambda_{i}$. All other values are set to 0. Let's interpret this\nstatistically. Suppose we have a precision matrix with one or more zero\neigenvalues and we want to find the covariance matrix. A zero eigenvalue\nmeans we have no precision, or infinite variance, for some linear\ncombination (i.e., for some basis vector). We take the pseudo-inverse\nand assign that linear combination zero variance.\n\nLet's consider a specific example. Autoregressive models are often used\nfor smoothing (in time, in space, and in covariates). A first order\nautoregressive model for $y_{1},y_{2},\\ldots,y_{T}$ has\n$E(y_{i}|y_{-i})=\\frac{1}{2}(y_{i-1}+y_{i+1})$. Another way of writing\nthe model is in time-order: $y_{i}=y_{i-1}+\\epsilon_{i}$. A second order\nautoregressive model has\n$E(y_{i}|y_{-i})=\\frac{1}{6}(4y_{i-1}+4y_{i+1}-y_{i-2}-y_{i+2})$. These\nconstructions basically state that each value should be a smoothed\nversion of its neighbors. One can figure out that the **precision**\nmatrix for $y$ in the first order model is $$\\left(\\begin{array}{ccccc}\n\\ddots & & \\vdots\\\\\n-1 & 2 & -1 & 0\\\\\n\\cdots & -1 & 2 & -1 & \\dots\\\\\n & 0 & -1 & 2 & -1\\\\\n & & \\vdots & & \\ddots\n\\end{array}\\right)$$ and in the second order model is\n\n$$\\left( \\begin{array}{ccccccc} \\ddots & & & \\vdots \\\\ 1 & -4 & 6 & -4 & 1 \\\\ \\cdots & 1 & -4 & 6 & -4 & 1 & \\cdots \\\\ & & 1 & -4 & 6 & -4 & 1 \\\\ & & & \\vdots \\end{array} \\right).$$ \n\nIf we look at the eigendecomposition of such\nmatrices, we see that in the first order case, the eigenvalue\ncorresponding to the constant eigenvector is zero.\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\nprecMat = np.array([[1,-1,0,0,0],[-1,2,-1,0,0],[0,-1,2,-1,0],[0,0,-1,2,-1],[0,0,0,-1,1]])\ne = np.linalg.eig(precMat)\ne[0] # 4th eigenvalue is numerically zero\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([3.61803399e+00, 2.61803399e+00, 1.38196601e+00, 4.97762256e-17,\n 3.81966011e-01])\n```\n:::\n\n```{.python .cell-code}\ne[1][:,3] # constant eigenvector\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([0.4472136, 0.4472136, 0.4472136, 0.4472136, 0.4472136])\n```\n:::\n:::\n\n\n\nThis means we have no information about the overall level of $y$. So how\nwould we generate sample $y$ vectors? We can't put infinite variance on\nthe constant basis vector and still generate samples. 
Instead we use the\npseudo-inverse and assign ZERO variance to the constant basis vector.\nThis corresponds to generating realizations under the constraint that\n$\\sum y_{i}$ has no variation, i.e., $\\sum y_{i}=\\bar{y}=0$ - you can\nsee this by seeing that $\\mbox{Var}(\\Gamma_{\\cdot i}^{\\top}y)=0$ when\n$\\lambda_{i}=0$.\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# generate a realization\nevals = e[0]\nevals = 1/evals # variances\nevals[3] = 0 # generalized inverse\ny = e[1] @ ((evals ** 0.5) * np.random.normal(size = 5))\ny.sum()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n-8.881784197001252e-16\n```\n:::\n:::\n\n\n\nIn the second order case, we have two non-identifiabilities: for the sum\nand for the linear component of the variation in $y$ (linear in the\nindices of $y$).\n\nI could parameterize a statistical model as $\\mu+y$ where $y$ has\ncovariance that is the generalized inverse discussed above. Then I allow\nfor both a non-zero mean and for smooth variation governed by the\nautoregressive structure. In the second-order case, I would need to add\na linear component as well, given the second non-identifiability.\n\n## Matrices arising in regression\n\nIn regression, we work with $X^{\\top}X$. Some properties of this matrix\nare that it is symmetric and non-negative definite (hence our use of\n$(X^{\\top}X)^{-1}$ in the OLS estimator). When is it not positive\ndefinite?\n\nFitted values are $X\\hat{\\beta}=X(X^{\\top}X)^{-1}X^{\\top}Y=HY$. The\n\"hat\" matrix, $H$, projects $Y$ into the column space of $X$. $H$ is\nidempotent: $HH=H$, which makes sense - once you've projected into the\nspace, any subsequent projection just gives you the same thing back. $H$\nis singular. Why? Also, under what special circumstance would it not be\nsingular?\n\n# 3. Computational issues\n\n## Storing matrices\n\nWe've discussed column-major and row-major storage of matrices. First,\nretrieval of matrix elements from memory is quickest when multiple\nelements are contiguous in memory. So in a column-major language (e.g.,\nR, Fortran), it is best to work with values in a common column (or\nentire columns) while in a row-major language (e.g., Python, C) for\nvalues in a common row.\n\nIn some cases, one can save space (and potentially speed) by overwriting\nthe output from a matrix calculation into the space occupied by an\ninput. This occurs in some clever implementations of matrix\nfactorizations.\n\n## Algorithms\n\nGood algorithms can change the efficiency of an algorithm by one or more\norders of magnitude, and many of the improvements in computational speed\nover recent decades have been in algorithms rather than in computer\nspeed.\n\nMost matrix algebra calculations can be done in multiple ways. For\nexample, we could compute $b=Ax$ in either of the following ways,\ndenoted here in pseudocode.\n\n1. Stack the inner products of the rows of $A$ with $x$.\\\n\n```\n for(i=1:n){ \n b_i = 0\n for(j=1:m){\n b_i = b_i + a_{ij} x_j\n }\n }\n```\n\n2. Take the linear combination (based on $x$) of the columns of $A$\\\n\n```\n for(i=1:n){ \n b_i = 0\n }\n for(j=1:m){\n for(i = 1:n){\n b_i = b_i + a_{ij} x_j \n }\n }\n```\n\nIn this case the two approaches involve the same number of operations\nbut the first might be better for row-major matrices (so might be how we\nwould implement in C) and the second for column-major (so might be how\nwe would implement in Fortran). \n\n**Challenge**: check whether the first\napproach is faster in Python. 
(Write the code just doing the outer loop and\ndoing the inner loop using vectorized calculation.)\n\n#### General computational issues\n\nThe same caveats we discussed in terms of computer arithmetic hold\nnaturally for linear algebra, since this involves arithmetic with many\nelements. Good implementations of algorithms are aware of the danger of\ncatastrophic cancellation and of the possibility of dividing by zero or\nby values that are near zero.\n\n## Ill-conditioned problems\n\n#### Basics\n\nA problem is ill-conditioned if small changes to values in the\ncomputation result in large changes in the result. This is quantified by\nsomething called the *condition number* of a calculation. For different\noperations there are different condition numbers.\n\nIll-conditionedness arises most often in terms of matrix inversion, so\nthe standard condition number is the \"condition number with respect to\ninversion\", which when using the $L_{2}$ norm is the ratio of the\nabsolute values of the largest to smallest eigenvalue. Here's an\nexample: $$A=\\left(\\begin{array}{cccc}\n10 & 7 & 8 & 7\\\\\n7 & 5 & 6 & 5\\\\\n8 & 6 & 10 & 9\\\\\n7 & 5 & 9 & 10\n\\end{array}\\right).$$ The solution of $Ax=b$ for $b=(32,23,33,31)$ is\n$x=(1,1,1,1)$, while the solution for $b+\\delta b=(32.1,22.9,33.1,30.9)$\nis $x+\\delta x=(9.2,-12.6,4.5,-1.1)$, where $\\delta$ is notation for a\nperturbation to the vector or matrix.\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ndef norm2(x):\n return(np.sum(x**2) ** 0.5)\n\nA = np.array([[10,7,8,7],[7,5,6,5],[8,6,10,9],[7,5,9,10]])\nb = np.array([32,23,33,31])\nx = np.linalg.solve(A, b)\n\nbPerturbed = np.array([32.1, 22.9, 33.1, 30.9])\nxPerturbed = np.linalg.solve(A, bPerturbed)\n\ndelta_b = bPerturbed - b\ndelta_x = xPerturbed - x\n```\n:::\n\n\n\n\nWhat's going on? Some manipulations with inequalities involving the\ninduced matrix norm (for any chosen vector norm, but we might as well\njust think about the Euclidean norm) (see Gentle-CS Sec. 5.1 or the derivation in class) give\n$$\\frac{\\|\\delta x\\|}{\\|x\\|}\\leq\\|A\\|\\|A^{-1}\\|\\frac{\\|\\delta b\\|}{\\|b\\|}$$\nwhere we define the condition number w.r.t. inversion as\n$\\mbox{cond}(A)\\equiv\\|A\\|\\|A^{-1}\\|$. We'll generally work with the\n$L_{2}$ norm, and for a nonsingular square matrix the result is that the\ncondition number is the ratio of the absolute values of the largest and\nsmallest magnitude eigenvalues. 
This makes sense since $\\|A\\|_{2}$ is\nthe absolute value of the largest magnitude eigenvalue of $A$ and\n$\\|A^{-1}\\|_{2}$ that of the inverse of the absolute value of the\nsmallest magnitude eigenvalue of $A$.\n\nWe see in the code below that the large disparity in eigenvalues of $A$\nleads to an effect predictable from our inequality above, with the\ncondition number helping us find an upper bound.\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ne = np.linalg.eig(A)\nevals = e[0]\nprint(evals)\n\n## relative perturbation in x much bigger than in b\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[3.02886853e+01 3.85805746e+00 1.01500484e-02 8.43107150e-01]\n```\n:::\n\n```{.python .cell-code}\nnorm2(delta_x) / norm2(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n8.19847546803699\n```\n:::\n\n```{.python .cell-code}\nnorm2(delta_b) / norm2(b)\n\n## ratio of relative perturbations\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n0.0033319453118976702\n```\n:::\n\n```{.python .cell-code}\n(norm2(delta_x) / norm2(x)) / (norm2(delta_b) / norm2(b))\n\n## ratio of largest and smallest magnitude eigenvalues\n## confusingly evals[2] is the smallest, not evals[3]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n2460.567236431514\n```\n:::\n\n```{.python .cell-code}\n(evals[0]/evals[2])\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n2984.092701676269\n```\n:::\n:::\n\n\n\nThe main use of these ideas for our purposes is in thinking about the\nnumerical accuracy of a linear system solution (Gentle-NLA Sec 3.4). On\na computer we have the system $$(A+\\delta A)(x+\\delta x)=b+\\delta b$$\nwhere the 'perturbation' is from the inaccuracy of computer numbers. Our\nexploration of computer numbers tells us that\n$$\\frac{\\|\\delta b\\|}{\\|b\\|}\\approx10^{-p};\\,\\,\\,\\frac{\\|\\delta A\\|}{\\|A\\|}\\approx10^{-p}$$\nwhere $p=16$ for standard double precision floating points. Following\nGentle, one gets the approximation\n\n$$\\frac{\\|\\delta x\\|}{\\|x\\|}\\approx\\mbox{cond}(A)10^{-p},$$ so if\n$\\mbox{cond}(A)\\approx10^{t}$, we have accuracy of order $10^{t-p}$\ninstead of $10^{-p}$. (Gentle cautions that this holds only if\n$10^{t-p}\\ll1$). So we can think of the condition number as giving us\nthe number of digits of accuracy lost during a computation relative to\nthe precision of numbers on the computer. E.g., a condition number of\n$10^{8}$ means we lose 8 digits of accuracy relative to our original 16\non standard systems. One issue is that estimating the condition number\nis itself subject to numerical error and requires computation of\n$A^{-1}$ (albeit not in the case of $L_{2}$ norm with square,\nnonsingular $A$) but see Golub and van Loan (1996; p. 76-78) for an\nalgorithm.\n\n#### Improving conditioning\n\nIll-conditioned problems in statistics often arise from collinearity of\nregressors. Often the best solution is not a numerical one, but\nre-thinking the modeling approach, as this generally indicates\nstatistical issues beyond just the numerical difficulties.\n\nA general comment on improving conditioning is that we want to avoid\nlarge differences in the magnitudes of numbers involved in a\ncalculation. In some contexts such as regression, we can center and\nscale the columns to avoid such differences - this will improve the\ncondition of the problem. 
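Before the regression example below, here is one more small illustration (added here, using the classically ill-conditioned Hilbert matrix from `scipy.linalg`) of the "digits lost" interpretation of the condition number:

```python
import numpy as np
from scipy.linalg import hilbert

n = 12
H = hilbert(n)                     # notoriously ill-conditioned
x_true = np.ones(n)
b = H @ x_true

x_hat = np.linalg.solve(H, b)
print(np.linalg.cond(H))                 # roughly 1e16
print(np.max(np.abs(x_hat - x_true)))    # sizable error, consistent with losing ~16 digits (cond(H) * 1e-16)
```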
E.g., in simple quadratic regression with\n$x=\\{1990,\\ldots,2010\\}$ (e.g., regressing on calendar years), we see\nthat centering and scaling the matrix columns makes a huge difference on\nthe condition number\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport statsmodels.api as sm\n\nt1 = np.arange(1990, 2011) # naive covariate\nX1 = np.column_stack((np.ones(21), t1, t1 ** 2))\n\nbeta = np.array([5, .1, .0001])\ny = X1 @ beta + np.random.normal(size = len(t1))\n\ne1 = np.linalg.eig(np.dot(X1.T, X1))\nnp.sort(e1[0])[::-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([3.36018564e+14, 7.69949736e+02, 2.24079720e-08])\n```\n:::\n\n```{.python .cell-code}\nnp.linalg.cond(np.dot(X1.T, X1)) # built-in!\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n4.653329826789808e+21\n```\n:::\n\n```{.python .cell-code}\nsm.OLS(y, X1).fit().params\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([-3.61355412e+04, 3.61987534e+01, -8.91405553e-03])\n```\n:::\n\n```{.python .cell-code}\nt2 = t1 - 2000 # centered\nX2 = np.column_stack((np.ones(21), t2, t2 ** 2))\ne2 = np.linalg.eig(np.dot(X2.T, X2))\nwith np.printoptions(suppress=True):\n np.sort(e2[0])[::-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([50677.70427505, 770. , 9.29572495])\n```\n:::\n\n```{.python .cell-code}\nt3 = t2/10 # centered and scaled\nX3 = np.column_stack((np.ones(21), t3, t3 ** 2))\ne3 = np.linalg.eig(np.dot(X3.T, X3))\nwith np.printoptions(suppress=True):\n np.sort(e3[0])[::-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([24.11293487, 7.7 , 1.95366513])\n```\n:::\n:::\n\n\n\nI haven't shown the OLS results for the transformed version\nof the problem because the transformations modify the true\nparameter values, so there is a bit of arithmetic to be done,\nbut you should be able to verify that the second and third\napproaches give reasonable answers.\n\n\nThe basic story is that simple strategies often solve the problem, and\nthat you should be aware of the absolute and relative magnitudes\ninvolved in your calculations.\n\nOne rule of thumb is to try to work with numbers whose magnitude is\naround 1. We can often scale the values in our problem in order to do\nthis. I.e., change the units of your variables. Instead of personal\nincome in dollars, use personal income in thousands or hundreds of\nthousands of dollars.\n\n# 4. Matrix factorizations (decompositions) and solving systems of linear equations\n\nSuppose we want to solve the following linear system: \n\n$$\\begin{aligned} Ax & = & b\\\\ x & = & A^{-1}b \\end{aligned}$$ \n\n\nNumerically, this is never done by\nfinding the inverse and multiplying. Rather we solve the system using a\nmatrix decomposition (or equivalent set of steps). One approach uses\nGaussian elimination (equivalent to the LU decomposition), while another\nuses the Cholesky decomposition. There are also iterative methods that\ngenerate a sequence of approximations to the solution but reduce\ncomputation (provided they are stopped before the exact solution is\nfound).\n\nGentle-CS has a nice table overviewing the various factorizations (Table\n5.1, page 219). 
I've reproduced a variation on it here.\n\n\n\n| Name | Representation | Restrictions | Properties | Uses |\n| ----------| ----------------------| ------------| --------------------|-------------|\n| LU | $A_{nn}= L_{nn}U_{nn}$ | $A$ generally square | $L$ lower triangular; $U$ upper triangular | solving equations; inversion | \n| QR | $A_{nm}= Q_{nn}R_{nm}$ or $A_{nm}=Q_{nm}R_{mm}$(skinny) | | $Q$ orthogonal; $R$ upper triangular | regression |\n| Cholesky | $A_{nn}=U_{nn}^{\\top}U_{nn}$ | $A$ positive (semi-) definite | $U$ upper triangular | multivariate normal; covariance; solving equations; inversion |\n| Eigen decomposition | $A_{nn}=\\Gamma_{nn}\\Lambda_{nn}\\Gamma_{nn}^{\\top}$ | $A$ square, symmetric*| $\\Gamma$ orthogonal; $\\Lambda$ (non-negative**) diagonal | principal components analysis and related | \n| SVD | $A_{nm}=U_{nn}D_{nm}V_{mm}^{\\top}$ or $A_{nm}= U_{nk}D_{kk}V_{mk}^{\\top}$ | | $U, V$ orthogonal; $D$ (non-negative) diagonal | machine learning, topic models |\n\nTable: Matrix factorizations useful for statistics / data science / machine learning\n\n*For the eigen decomposition, I assume $A$ is symmetric, though there\nis a decomposition for non-symmetric $A$.\n\n** For positive definite or positive semi-definite $A$.\n\n## Triangular systems\n\nAs a preface, let's figure out how to solve $Ax=b$ if $A$ is upper\ntriangular. The basic algorithm proceeds from the bottom up (and\ntherefore is called a 'backsolve'. We solve for $x_{n}$ trivially, and\nthen move upwards plugging in the known values of $x$ and solving for\nthe remaining unknown in each row (each equation).\n\n1. $x_{n}=b_{n}/A_{nn}$\n2. Now for $kp$ then the leading $p$ rows of $R$ provide an upper triangular\nmatrix ($R_{1}$) and the remaining rows are 0. (I'm using $p$ because\nthe QR is generally applied to design matrices in regression). In this\ncase we really only need the first $p$ columns of $Q$, and we have\n$X=Q_{1}R_{1}$, the 'skinny' QR (this is what R's QR provides). For\nuniqueness, we can require the diagonals of $R$ to be nonnegative, and\nthen $R$ will be the same as the upper-triangular Cholesky factor of\n$X^{\\top}X$: \n\n$$\\begin{aligned} X^{\\top}X & = & R^{\\top}Q^{\\top}QR \\\\ & = & R^{\\top}R\\end{aligned}.$$ \n\n\nThere are three standard approaches for\ncomputing the QR, using (1) reflections (Householder transformations),\n(2) rotations (Givens transformations), or (3) Gram-Schmidt\northogonalization (see below for details).\n\nFor $n\\times n$ $X$, the QR (for the Householder approach) requires\n$2n^{3}/3$ flops, so QR is less efficient than LU or Cholesky.\n\nWe can also obtain the pseudo-inverse of $X$ from the QR:\n$X^{+}=[R_{1}^{-1}\\,0]Q^{\\top}$. In the case that $X$ is not full-rank,\nthere is a version of the QR that will work (involving pivoting) and we\nend up with some additional zeroes on the diagonal of $R_{1}$.\n\n### Regression and the QR\n\nOften QR is used to fit linear models, including in R. Consider the\nlinear model in the form $Y=X\\beta+\\epsilon$, finding\n$\\hat{\\beta}=(X^{\\top}X)^{-1}X^{\\top}Y$. Let's consider the skinny QR\nand note that $R^{\\top}$ is invertible. Therefore, we can express the\nnormal equations as \n\n$$\n\\begin{aligned} \nX^{\\top}X\\beta & = & X^{\\top} Y \\\\\nR^{\\top}Q^{\\top}QR\\beta & = & R^{\\top}Q^{\\top} Y \\\\\nR \\beta & = & Q^{\\top} Y\n\\end{aligned}\n$$ \n\n\nand solving for $\\beta$ is just a\nbacksolve since $R$ is upper-triangular. 
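Here's a minimal sketch of those steps on simulated data (added here; it uses `scipy.linalg.solve_triangular` for the backsolve):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack((np.ones(n), rng.normal(size=(n, p - 1))))
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(size=n)

Q, R = np.linalg.qr(X)                   # skinny QR: Q is n x p, R is p x p upper triangular
beta_hat = solve_triangular(R, Q.T @ Y)  # backsolve R beta = Q^T Y
print(beta_hat)
print(np.allclose(beta_hat, np.linalg.lstsq(X, Y, rcond=None)[0]))   # matches least squares
```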
Furthermore the standard\nregression quantities, such as the hat matrix, the SSE, the residuals,\netc. can be easily expressed in terms of $Q$ and $R$.\n\nWhy use the QR instead of the Cholesky on $X^{\\top}X$? The condition\nnumber of $X$ is the square root of that of $X^{\\top}X$, and the $QR$\nfactorizes $X$. Monahan has a discussion of the condition of the\nregression problem, but from a larger perspective, the situations where\nnumerical accuracy is a concern are generally cases where the OLS\nestimators are not particularly helpful anyway (e.g., highly collinear\npredictors).\n\nWhat about computational order of the different approaches to least\nsquares? The Cholesky is $np^{2}+\\frac{1}{3}p^{3}$, an algorithm called\nsweeping is $np^{2}+p^{3}$ , the Householder method for QR is\n$2np^{2}-\\frac{2}{3}p^{3}$, and the modified Gram-Schmidt approach for\nQR is $2np^{2}$. So if $n\\gg p$ then Cholesky (and sweeping) are faster\nthan the QR approaches. According to Monahan, modified Gram-Schmidt is\nmost numerically stable and sweeping least. In general, regression is\npretty quick unless $p$ is large since it is linear in $n$, so it may\nnot be worth worrying too much about computational differences of the\nsort noted here.\n\n### Regression and the QR in Python and R\n\nWe can get the Q and R matrices easily in Python.\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nQ,R = np.linalg.qr(X)\n```\n:::\n\n\n\nOne of the methods used by the [`statsmodel` package in Python\nuses the QR to fit a regression](https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.fit.html).\n\nNote that by default in Python (and in R), you get the skinny QR, namely only the\nfirst $p$ rows of $R$ and the first $p$ columns of $Q$, where the latter\nform an orthonormal basis for the column space of $X$. The remaining\ncolumns form an orthonormal basis for the null space of $X$ (the space\northogonal to the column space of $X$). The analogy in regression is\nthat we get the basis vectors for the regression, while adding the\nremaining columns gives us the full $n$-dimensional space of the\nobservations.\n\nRegression in R uses the QR decomposition via `qr()`, which calls a\nFortran function. `qr()` (and the Fortran functions that are called) is\nspecifically designed to output quantities useful in fitting linear\nmodels. \n\nIn R, `qr()` returns the result as a list meant for use by other tools. R\nstores the $R$ matrix in the upper triangle of `\\$qr`, while the lower\ntriangle of *\\$qr* and *\\$aux* store the information for constructing\n$Q$ (this relates to the Householder-related vectors $u$ below). One can\nmultiply by $Q$ using *qr.qy()* and by $Q^{\\top}$ using *qr.qty()*. If\nyou want to extract $R$ and $Q$, the following will work:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nX.qr = qr(X)\nQ = qr.Q(X.qr)\nR = qr.R(X.qr) \n```\n:::\n\n\n\nAs a side note, there are QR-based functions that provide\nregression-related quantities, such as *qr.resid()*, *qr.fitted()* and\n*qr.coef()*. These functions (and their Fortran counterparts) exist\nbecause one can work through the various regression quantities of\ninterest and find their expressions in terms of $Q$ and $R$, with nice\nproperties resulting from $Q$ being orthogonal and $R$ triangular.\n\n### Computing the QR decomposition\n\nHere we'll see some of the details of the different approaches to the\nQR, in part because they involve some concepts that may be useful in\nother contexts. 
I won't expect you to see all of how this works, but\nplease skim through this to get an idea of how things are done.\n\nOne approach involves reflections of vectors and a second rotations of\nvectors. Reflections and rotations are transformations that are\nperformed by orthogonal matrices. The determinant of a reflection matrix\nis -1 and the determinant of a rotation matrix is 1. We'll see some of\nthe details in the demo code.\n\n#### QR Method 1: Reflections\n\nIf $u$ and $v$ are orthonormal vectors and $x$ is in the space spanned\nby $u$ and $v$, $x=c_{1}u+c_{2}v$, then $\\tilde{x}=-c_{1}u+c_{2}v$ is a\nreflection (a *Householder* reflection) along the $u$ dimension (since\nwe are using the negative of that basis vector). We can think of this as\nreflecting across the plane perpendicular to $u$. This extends simply to\nhigher dimensions with orthonormal vectors, $u,v_{1},v_{2},\\ldots$\n\nSuppose we want to formulate the reflection in terms of a \"Householder\"\nmatrix, $Q$. It turns out that $$Qx=\\tilde{x}$$ if $Q=I-2uu^{\\top}$. $Q$\nhas the following properties: (1) $Qu=-u$, (2) $Qv=v$ for $u^{\\top}v=0$,\n(3) $Q$ is orthogonal and symmetric.\n\nOne way to create the QR decomposition is by a series of Householder\ntransformations that create an upper triangular $R$ from $X$:\n\n$$\n\\begin{aligned}\nR & = & Q_{p}\\cdots Q_{1} X \\\\\nQ & = & (Q_{p}\\cdots Q_{1})^{\\top}\n\\end{aligned}\n$$ \n\n\nwhere we make use of\nthe symmetry in defining $Q$.\n\nBasically $Q_{1}$ reflects the first column of $X$ with respect to a\ncarefully chosen $u$, so that the result is all zeroes except for the\nfirst element. We want $Q_{1}x=\\tilde{x}=(||x||,0,\\ldots,0)$. This can\nbe achieved with $u=\\frac{x-\\tilde{x}}{||x-\\tilde{x}||}$. Then $Q_{2}$\nmakes the last $n-2$ rows of the second column equal to zero. We'll work\nthrough this a bit in class.\n\nIn the regression context, as we work through the individual\ntransformations, $Q_{j}=I-2u_{j}u_{j}^{\\top}$, we apply them to $X$ and\n$Y$ to create $R$ (note this would not involve doing the full matrix\nmultiplication - think about what calculations are actually needed) and\n$QY=Q^{\\top}Y$, and then solve $R\\beta=Q^{\\top}Y$. To find\n$\\mbox{Cov}(\\hat{\\beta})\\propto(X^{\\top}X)^{-1}=(R^{\\top}R)^{-1}=R^{-1}R^{-\\top}$\nwe do need to invert $R$, but it's upper-triangular and of dimension\n$p\\times p$. It turns out that $Q^{\\top}Y$ can be partitioned into the\nfirst $p$ and the last $n-p$ elements, $z^{(1)}$ and $z^{(2)}$. The SSR\nis $\\|z^{(1)}\\|^{2}$ and SSE is $\\|z^{(2)}\\|^{2}$.\n\nFinal side note: if $X$ is square (so $n=p)$ you might wonder why we\nneed $Q_{p}$ since after $p-1$ reflections, we don't need to zero\nanything else out (since the last column of $R$ has $n$ non-zero\nelements). It turns out that if we go back to thinking about a\nHouseholder reflection in general, there is a lack of uniqueness in\nchoosing $\\tilde{x}$. It could either be $(||x||,0,\\ldots,0)$ or\n$(-||x||,0,\\ldots,0)$. For better numerical stability, one chooses from\nthe two of those such that $x_{1}$ is of the opposite sign to\n$\\tilde{x}_{1}$, so that one avoids cancellation of numbers that may be\nof the same magnitude when doing $x-\\tilde{x}$. The transformation\n$Q_{p}$ is the last step of taking that approach of choosing the sign at\neach step. $Q_{p}$ doesn't zero anything out; it just basically just\ninvolves potentially setting $R_{pp}$ to be $-R_{pp}$. 
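Here is a tiny numerical sketch (added here) of a single Householder reflection of the kind used in the first step, taking $x$ to $\tilde{x}=(\pm\|x\|,0,\ldots,0)$:

```python
import numpy as np

x = np.array([2.0, 1.0, 2.0])                    # ||x|| = 3
x_tilde = np.zeros_like(x)
x_tilde[0] = -np.sign(x[0]) * np.linalg.norm(x)  # sign chosen opposite x[0] to avoid cancellation
u = (x - x_tilde) / np.linalg.norm(x - x_tilde)
Q = np.eye(len(x)) - 2 * np.outer(u, u)          # Householder matrix: orthogonal and symmetric

print(Q @ x)                              # (-3, 0, 0): zeroed out below the first element
print(np.allclose(Q.T @ Q, np.eye(3)))    # True: Q is orthogonal
```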
(To be honest,\nI'm not clear on why one would bother to do that last step, but that\nseems to be how it is presented in discussions of the Householder\napproach.) Of course in the case of $p2$, find interim vectors, $x_{k}^{(2)}$, by\n orthogonalizing with respect to $\\tilde{x}_{1}$\n\n3. Proceed for $k=3,\\ldots$, in turn orthogonalizing and normalizing\n the first of the remaining vectors w.r.t. $\\tilde{x}_{k-1}$ and\n orthogonalizing the remaining vectors w.r.t. $\\tilde{x}_{k-1}$ to\n get new interim vectors\n\nMathematically, we could instead orthogonalize $x_{2}$ w.r.t.\n$\\tilde{x}_{1}$, then orthogonalize $x_{3}$ w.r.t.\n$\\{\\tilde{x}_{1},\\tilde{x}_{2}\\}$, etc. The algorithm above is the\n*modified* G-S, and is known to be more numerically stable if the\ncolumns of $X$ are close to collinear, giving vectors that are closer to\northogonal. The resulting $\\tilde{x}$ vectors are the columns of $Q$.\nThe elements of $R$ are obtained as we proceed: the diagonal values are\nthe the normalization values in the denominators, while the\noff-diagonals are the inner products with the already-computed columns\nof $Q$ that are computed as part of the numerators.\n\nAnother way to think about this is that $R=Q^{\\top}X$, which is the same\nas regressing the columns of $X$ on $Q,$ since\n$(Q^{\\top}Q)^{-1}Q^{\\top}X=Q^{\\top}X$. By construction, the first column\nof $X$ is a scaling of the first column of $Q$, the second column of $X$\nis a linear combination of the first two columns of $Q$, etc., so $R$\nbeing upper triangular makes sense.\n\n### The \"tall-skinny\" QR\n\nSuppose you have a very large regression problem, with $n$ very large,\nand $n\\gg p$. There is a variant of the QR, called the tall-skinny QR\n(see for details) that allows us\nto find the decomposition in a parallel fashion. The basic idea is to do\na nested set of QR decompositions on blocks of rows of $X$:\n\n$$\nX = \\left( \\begin{array}{c}\nX_{0} \\\\\nX_{1} \\\\\nX_{2} \\\\\nX_{3}\n\\end{array}\n\\right) =\n\\left(\n\\begin{array}{c}\nQ_{0} R_{0} \\\\\nQ_{1} R_{1} \\\\\nQ_{2} R_{2} \\\\\nQ_{3} R_{3}\n\\end{array} \\right),\n$$\n\nfollowed by 'reduction' steps (this\ncan be done in a map-reduce context) that do the $QR$ of pairs of the\n$R$ factors: \n$$\\left(\\begin{array}{c}\nR_{0}\\\\\nR_{1}\\\\\nR_{2}\\\\\nR_{3}\n\\end{array}\\right)=\\left(\\begin{array}{c}\n\\left(\\begin{array}{c}\nR_{0}\\\\\nR_{1}\n\\end{array}\\right)\\\\\n\\left(\\begin{array}{c}\nR_{2}\\\\\nR_{3}\n\\end{array}\\right)\n\\end{array}\\right)=\\left(\\begin{array}{c}\nQ_{01}R_{01}\\\\\nQ_{23}R_{23}\n\\end{array}\\right)$$ and $$\\left(\\begin{array}{c}\nR_{01}\\\\\nR_{23}\n\\end{array}\\right)=Q_{0123}R_{0123}.$$ \n\nThe full decomposition is then\n\n$$X=\\left( \\begin{array}{cccc} Q_{0} & 0 & 0 & 0 \\\\ 0 & Q_{1} & 0 & 0 \\\\ 0 & 0 & Q_{2} & 0 \\\\ 0 & 0 & 0 & Q_{3} \\end{array} \\right) \\left( \\begin{array}{cc} Q_{01} & 0 \\\\ 0 & Q_{23} \\end{array} \\right) Q_{0123} R_{0123} = QR.$$\n\n\nThe computation can be done in\nparallel (in particular it can be done with map-reduce) and the $Q$\nmatrix for big problems would generally not be computed explicitly but\nwould be stored in its constituent pieces.\n\nAlternatively, there is a variant on the algorithm that processes the\nrow-blocks of $X$ serially, allowing you to do QR on a large tall-skinny\nmatrix that you can't fit in memory (or possibly even on disk). First\nyou do $QR$ on $X_{0}$ to get $Q_{0}R_{0}$. Then you stack $R_{0}$ on\ntop of $X_{1}$ and do QR to get $R_{01}$. 
Then stack $R_{01}$ on top of\n$X_{2}$ to get $R_{012}$, etc.\n\n## Determinants\n\nThe absolute value of the determinant of a square matrix can be found\nfrom the product of the diagonals of the triangular matrix in any\nfactorization that gives a triangular (including diagonal) matrix times\nan orthogonal matrix (or matrices) since the determinant of an\northogonal matrix is either one or minus one.\n\n$|A|=|QR|=|Q||R|=\\pm|R|$\n\n$|A^{\\top}A|=|(QR)^{\\top}QR|=|R^{\\top}R|=|R_{1}^{\\top}R_{1}|=|R_{1}|^{2}$\\\nIn Python, the following will do it (on the log scale).\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nQ,R = qr(A)\nmagn = np.sum(np.log(np.abs(np.diag(R)))) \n```\n:::\n\n\n\nAn alternative is the product of the diagonal elements of $D$ (the\nsingular values) in the SVD factorization, $A=UDV^{\\top}$.\n\nFor non-negative definite matrices, we know the determinant is\nnon-negative, so the uncertainty about the sign is not an issue. For\npositive definite matrices, a good approach is to use the product of the\ndiagonal elements of the Cholesky decomposition.\n\nOne can also use the product of the eigenvalues:\n$|A|=|\\Gamma\\Lambda\\Gamma^{-1}|=|\\Gamma||\\Gamma^{-1}||\\Lambda|=|\\Lambda|$\n\n#### Computation\n\nComputing from any of these diagonal or triangular matrices as the\nproduct of the diagonals is prone to overflow and underflow, so we\n**always** work on the log scale as the sum of the log of the values.\nWhen some of these may be negative, we can always keep track of the\nnumber of negative values and take the log of the absolute values.\n\nOften we will have the factorization as a result of other parts of the\ncomputation, so we get the determinant for free.\n\nWe can use `np.linalg.logdet()` or (definitely not recommended)\n`np.linalg.det()` to calculate the determinant in Python.\nThese functions use the LU decomposition.\n\n\n# 5. Eigendecomposition and SVD\n\n## Eigendecomposition \n\nThe eigendecomposition (spectral decomposition) is useful in considering\nconvergence of algorithms and of course for statistical decompositions\nsuch as PCA. We think of decomposing the components of variation into\northogonal patterns (the eigenvectors) with variances (eigenvalues)\nassociated with each pattern.\n\nSquare symmetric matrices have real eigenvectors and eigenvalues, with\nthe factorization into orthogonal $\\Gamma$ and diagonal $\\Lambda$,\n$A=\\Gamma\\Lambda\\Gamma^{\\top}$, where the eigenvalues on the diagonal of\n$\\Lambda$ are ordered in decreasing value. Of course this is equivalent\nto the definition of an eigenvalue/eigenvector pair as a pair such that\n$Ax=\\lambda x$ where $x$ is the eigenvector and $\\lambda$ is a scalar,\nthe eigenvalue. The inverse of the eigendecomposition is simply\n$\\Gamma\\Lambda^{-1}\\Gamma^{\\top}$. On a similar note, we can create a\nsquare root matrix, $\\Gamma\\Lambda^{1/2}$, by taking the square roots of\nthe eigenvalues.\n\nThe spectral radius of $A$, denoted $\\rho(A)$, is the maximum of the\nabsolute values of the eigenvalues. As we saw when talking about\nill-conditionedness, for symmetric matrices, this maximum is the induced\nnorm, so we have $\\rho(A)=\\|A\\|_{2}$. It turns out that\n$\\rho(A)\\leq\\|A\\|$ for any induced matrix norm. The spectral radius\ncomes up in determining the rate of convergence of some iterative\nalgorithms.\n\n#### Computation\n\nThere are several methods for eigenvalues; a common one for doing the\nfull eigendecomposition is the *QR algorithm*. 
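Before getting into the details of the QR algorithm, here is a small sketch (added here) tying together the determinant comments above: for a positive definite matrix, the log determinant is twice the sum of the logs of the diagonal of the Cholesky factor, which matches NumPy's `slogdet`.

```python
import numpy as np

rng = np.random.default_rng(7)
B = rng.normal(size=(5, 5))
A = B @ B.T + 5 * np.eye(5)              # positive definite

L = np.linalg.cholesky(A)                # A = L L^T, L lower triangular
logdet_chol = 2 * np.sum(np.log(np.diag(L)))
sign, logdet = np.linalg.slogdet(A)      # built-in log-determinant (sign and log|A|)
print(logdet_chol, logdet, sign)         # the two log-determinants agree; sign is 1.0
```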
The first step is to\nreduce $A$ to upper Hessenburg form, which is an upper triangular matrix\nexcept that the first subdiagonal in the lower triangular part can be\nnon-zero. For symmetric matrices, the result is actually tridiagonal. We\ncan do the reduction using Householder reflections or Givens rotations.\nAt this point the QR decomposition (using Givens rotations) is applied\niteratively (to a version of the matrix in which the diagonals are\nshifted), and the result converges to a diagonal matrix, which provides\nthe eigenvalues. It's more work to get the eigenvectors, but they are\nobtained as a product of Householder matrices (required for the initial\nreduction) multiplied by the product of the $Q$ matrices from the\nsuccessive QR decompositions.\n\nWe won't go into the algorithm in detail, but note that it involves\nmanipulations and ideas we've seen already.\n\nIf only the largest (or the first few largest) eigenvalues and their\neigenvectors are needed, which can come up in time series and Markov\nchain contexts, the problem is easier and can be solved by the *power\nmethod*. E.g., in a Markov chain context, steady state is reached\nthrough $x_{t}=A^{t}x_{0}$. One can find the largest eigenvector by\nmultiplying by $A$ many times, normalizing at each step.\n$v^{(k)}=Az^{(k-1)}$ and $z^{(k)}=v^{(k)}/\\|v^{(k)}\\|$. There is an\nextension to find the $p$ largest eigenvalues and their vectors. See the\ndemo code in the qmd source file for an implementation (in R).\n\n\n\n\n\n\n\n## Singular value decomposition\n\nLet's consider an $n\\times m$ matrix, $A$, with $n\\geq m$ (if $m>n$, we\ncan always work with $A^{\\top})$. This often is a matrix representing\n$m$ features of $n$ observations. We could have $n$ documents and $m$\nwords, or $n$ gene expression levels and $m$ experimental conditions,\netc. $A$ can always be decomposed as $$A=UDV^{\\top}$$ where $U$ and $V$\nare matrices with orthonormal columns (left and right eigenvectors) and\n$D$ is diagonal with non-negative values (which correspond to\neigenvalues in the case of square $A$ and to squared eigenvalues of\n$A^{\\top}A$).\n\nThe SVD can be represented in more than one way. One representation is\n$$A_{n\\times m}=U_{n\\times k}D_{k\\times k}V_{k\\times m}^{\\top}=\\sum_{j=1}^{k}D_{jj}u_{j}v_{j}^{\\top}$$\nwhere $u_{j}$ and $v_{j}$ are the columns of $U$ and $V$ and where $k$\nis the rank of $A$ (which is at most the minimum of $n$ and $m$ of\ncourse). The diagonal elements of $D$ are the singular values.\n\nThat representation is as the sum of rank-one matrices (since\neach term is the scaled outer product of two vectors). \n\nIf $A$ is positive semi-definite, the eigendecomposition is an SVD.\nFurthermore, $A^{\\top}A=VD^{2}V^{\\top}$ and $AA^{\\top}=UD^{2}U^{\\top}$,\nso we can find the eigendecomposition of such matrices using the SVD of\n$A$ (for $AA^{\\top}$ we need to fill out $U$ to have $n$ columns). 
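A quick numerical check of that relationship (a sketch added here):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(6, 3))

U, d, Vt = np.linalg.svd(A, full_matrices=False)   # skinny SVD: U is 6x3, d holds 3 singular values
evals = np.linalg.eigvalsh(A.T @ A)[::-1]          # eigenvalues of A^T A, sorted decreasing

print(np.allclose(d**2, evals))      # True: squared singular values are the eigenvalues of A^T A
```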
Note\nthat the squares of the singular values of $A$ are the eigenvalues of\n$A^{\\top}A$ and $AA^{\\top}$.\n\nWe can also fill out the matrices to get\n$$A=U_{n\\times n}D_{n\\times m}V_{m\\times m}^{\\top}$$ where the added\nrows and columns of $D$ are zero with the upper left block the\n$D_{k\\times k}$ from above.\n\n#### Uses\n\nThe SVD is an excellent way to determine a matrix rank and to construct\na pseudo-inverse ($A^{+}=VD^{+}U^{\\top})$.\n\nWe can use the SVD to approximate $A$ by taking\n$A\\approx\\tilde{A}=\\sum_{j=1}^{p}D_{jj}u_{j}v_{j}^{\\top}$ for $pj+p$ and the upper bandwidth is $q$ ($A_{ij}=0$\nfor $j>i+q$). An alternative to reducing to $Ux=b^{*}$ is to compute\n$A=LU$ and then do two solutions, $U^{-1}(L^{-1}b)$. One can show that\nthe computational complexity of the LU factorization is $O(npq)$ for\nbanded matrices, while solving the two triangular systems is $O(np+nq)$,\nso for small $p$ and $q$, the speedup can be dramatic.\n\nBanded matrices come up in time series analysis. E.g., moving average (MA) models produce\nbanded covariance structures because the covariance is zero after a\ncertain number of lags.\n\n## Low rank updates (optional)\n\nA transformation of the form $A-uv^{\\top}$ is a rank-one update because\n$uv^{\\top}$ is of rank one.\n\nMore generally a low rank update of $A$ is $\\tilde{A}=A-UV^{\\top}$ where\n$U$ and $V$ are $n\\times m$ with $n\\geq m$. The\nSherman-Morrison-Woodbury formula tells us that\n$$\\tilde{A}^{-1}=A^{-1}+A^{-1}U(I_{m}-V^{\\top}A^{-1}U)^{-1}V^{\\top}A^{-1}$$\nso if we know $x_{0}=A^{-1}b$, then the solution to $\\tilde{A}x=b$ is\n$x+A^{-1}U(I_{m}-V^{\\top}A^{-1}U)^{-1}V^{\\top}x$. Provided $m$ is not\ntoo large, and particularly if we already have a factorization of $A$,\nthen $A^{-1}U$ is not too bad computationally, and\n$I_{m}-V^{\\top}A^{-1}U$ is $m\\times m$. As a result\n$A^{-1}(U(\\cdots)^{-1}V^{\\top}x)$ isn't too bad.\n\nThis also comes up in working with precision matrices in Bayesian\nproblems where we may have $A^{-1}$ but not $A$ (we often add precision\nmatrices to find conditional normal distributions). An alternative\nexpression for the formula is $\\tilde{A}=A+UCV^{\\top}$, and the identity\ntells us\n$$\\tilde{A}^{-1}=A^{-1}-A^{-1}U(C^{-1}+V^{\\top}A^{-1}U)^{-1}V^{\\top}A^{-1}$$\n\nBasically Sherman-Morrison-Woodbury gives us matrix identities that we\ncan use in combination with our knowledge of smart ways of solving\nsystems of equations.\n\n# 7. Iterative solutions of linear systems (optional)\n\n#### Gauss-Seidel\n\nSuppose we want to iteratively solve $Ax=b$. Here's the algorithm, which\nsequentially updates each element of $x$ in turn.\n\n- Start with an initial approximation, $x^{(0)}$.\n- Hold all but $x_{1}^{(0)}$ constant and solve to find\n $x_{1}^{(1)}=\\frac{1}{a_{11}}(b_{1}-\\sum_{j=2}^{n}a_{1j}x_{j}^{(0)})$.\n- Repeat for the other rows of $A$ (i.e., the other elements of $x$),\n finding $x^{(1)}$.\n- Now iterate to get $x^{(2)}$, $x^{(3)}$, etc. until a convergence\n criterion is achieved, such as $\\|x^{(k)}-x^{(k-1)}\\|\\leq\\epsilon$\n or $\\|r^{(k)}-r^{(k-1)}\\|\\leq\\epsilon$ for $r^{(k)}=b-Ax^{(k)}$.\n\nLet's consider how many operations are involved in a single update:\n$O(n)$ for each element, so $O(n^{2})$ for each update. 
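\n\nHere's a minimal sketch of a single Gauss-Seidel sweep, applied to a toy\ndiagonally dominant system (for illustration only, not a library routine):\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\ndef gauss_seidel_sweep(A, b, x):\n    ## Update each element of x in turn, using elements already\n    ## updated earlier in this same sweep.\n    n = len(b)\n    for i in range(n):\n        x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]\n    return x\n\nA = np.array([[4., 1, 0], [1, 4, 1], [0, 1, 4]])\nb = np.array([1., 2, 3])\nx = np.zeros(3)\nfor it in range(25):\n    x = gauss_seidel_sweep(A, b, x)\n\nnp.allclose(A @ x, b)\n```\n:::\n\n\n\n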
Thus if we can\nstop well before $n$ iterations, we've saved computation relative to\nexact methods.\n\nIf we decompose $A=L+D+U$ where $L$ is strictly lower triangular, $U$ is\nstrictly upper triangular, then Gauss-Seidel is equivalent to solving\n$$(L+D)x^{(k+1)}=b-Ux^{(k)}$$ and we know that solving the lower\ntriangular system is $O(n^{2})$.\n\nIt turns out that the rate of convergence depends on the spectral radius\nof $(L+D)^{-1}U$.\n\nGauss-Seidel amounts to optimizing by moving in axis-oriented\ndirections, so it can be slow in some cases.\n\n#### Conjugate gradient\n\nFor positive definite $A$, conjugate gradient (CG) reexpresses the\nsolution to $Ax=b$ as an optimization problem, minimizing\n$$f(x)=\\frac{1}{2}x^{\\top}Ax-x^{\\top}b,$$ since the derivative of $f(x)$\nis $Ax-b$ and at the minimum this gives $Ax-b=0$.\n\nInstead of finding the minimum by following the gradient at each step\n(so-called steepest descent, which can give slow convergence - we'll see\na demonstration of this in the optimization unit), CG chooses directions\nthat are mutually conjugate w.r.t. $A$, $d_{i}^{\\top}Ad_{j}=0$ for\n$i\\ne j$. The method successively chooses vectors giving the direction,\n$d_{k}$, in which to move down towards the minimum and a scaling of how\nmuch to move, $\\alpha_{k}$. If we start at $x_{(0)}$, the $k$th point we\nmove to is $x_{(k)}=x_{(k-1)}+\\alpha_{k}d_{k}$ so we have\n$$x_{(k)}=x_{(0)}+\\sum_{j\\leq k}\\alpha_{j}d_{j}$$ and we use a\nconvergence criterion such as given above for Gauss-Seidel. The\ndirections are chosen to be the residuals, $b-Ax_{(k)}$. Here's the\nbasic algorithm:\n\n- Choose $x_{(0)}$ and define the residual, $r_{(0)}=b-Ax_{(0)}$ (the\n error on the scale of $b$) and the direction, $d_{0}=r_{(0)}$ and\n set $k=0$.\n\n- Then iterate:\n\n - $\\alpha_{k}=\\frac{r_{(k)}^{\\top}r_{(k)}}{d_{k}^{\\top}Ad_{k}}$\n (choose step size so next error will be orthogonal to current\n direction - which we can express in terms of the residual, which\n is easily computable)\n\n - $x_{(k+1)}=x_{(k)}+\\alpha_{k}d_{k}$ (update current value)\n\n - $r_{(k+1)}=r_{(k)}-\\alpha_{k}Ad_{k}$ (update current residual)\n\n - $d_{k+1}=r_{(k+1)}+\\frac{r_{(k+1)}^{\\top}r_{(k+1)}}{r_{(k)}^{\\top}r_{(k)}}d_{k}$\n (choose next direction by conjugate Gram-Schmidt, starting with\n $r_{(k+1)}$ and removing components that are not $A$-orthogonal\n to previous directions, but it turns out that $r_{(k+1)}$ is\n already $A$-orthogonal to all but $d_{k}$).\n\n- Stop when $\\|r^{(k+1)}\\|$ is sufficiently small.\n\nThe convergence of the algorithm depends in a complicated way on the\neigenvalues, but in general convergence is faster when the condition\nnumber is smaller (the eigenvalues are not too spread out). 
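\n\nHere's a sketch of the algorithm above in plain NumPy (for illustration\nonly; in practice one would use a library implementation such as\n`scipy.sparse.linalg.cg`, particularly for large sparse systems):\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\ndef cg(A, b, x, tol = 1e-10, maxiter = 1000):\n    ## Conjugate gradient for symmetric positive definite A.\n    r = b - A @ x                      # initial residual\n    d = r.copy()                       # initial direction\n    for k in range(maxiter):\n        Ad = A @ d\n        rTr = r @ r\n        alpha = rTr / (d @ Ad)         # step size\n        x = x + alpha * d              # update solution\n        r = r - alpha * Ad             # update residual\n        if np.sqrt(r @ r) < tol:\n            break\n        d = r + ((r @ r) / rTr) * d    # next (A-conjugate) direction\n    return x\n\nA = np.array([[4., 1, 0], [1, 3, 1], [0, 1, 2]])   # symmetric positive definite\nb = np.array([1., 2, 3])\nx = cg(A, b, np.zeros(3))\n\nnp.allclose(A @ x, b)\n```\n:::\n\n\n\n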
CG will in\nprinciple give the exact answer in $n$ steps (where $A$ is $n\\times n$).\nHowever, computationally we lose accuracy and interest in the algorithm\nis really as an iterative approximation where we stop before $n$ steps.\nThe approach basically amounts to moving in axis-oriented directions in\na space stretched by $A$.\n\nIn general, CG is used for large sparse systems.\n\nSee the [extensive description from\nShewchuk](http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf)\nfor more details, as well as the use\nof CG when $A$ is not positive definite.\n\n#### Updating a solution\n\nSometimes we have solved a system, $Ax=b$ and then need to solve $Ax=c$.\nIf we have solved the initial system using a factorization, we can reuse\nthat factorization and solve the new system in $O(n^{2})$. Iterative\napproaches can do a nice job if $c=b+\\delta b$. Start with the solution\n$x$ for $Ax=b$ as $x^{(0)}$ and use one of the methods above.\n", + "markdown": "---\ntitle: \"Numerical linear algebra\"\nauthor: \"Chris Paciorek\"\ndate: \"2023-10-24\"\nformat:\n pdf:\n documentclass: article\n margin-left: 30mm\n margin-right: 30mm\n toc: true\n html:\n theme: cosmo\n css: ../styles.css\n toc: true\n code-copy: true\n code-block-background: true\nexecute:\n freeze: auto\nfrom: markdown+tex_math_single_backslash\n---\n\n\n\n\n[PDF](./unit10-linalg.pdf){.btn .btn-primary}\n\nReferences:\n\n- Gentle: Numerical Linear Algebra for Applications in Statistics\n (available via UC Library Search) (my notes here are based primarily\n on this source) [Gentle-NLA]\n - Gentle: Matrix Algebra also has much of this material.\n- Gentle: Computational Statistics [Gentle-CS]\n- Lange: Numerical Analysis for Statisticians\n- Monahan: Numerical Methods of Statistics\n\nVideos (optional): \n\nThere are various videos from 2020 in the bCourses Media Gallery that you\ncan use for reference if you want to. \n\n - Video 1. Ill-conditioned problems, part 1\n - Video 2. Ill-conditioned problems, part 2\n - Video 3. Triangular systems of equations\n - Video 4. Solving systems of equations via LU, part 1\n - Video 5. Solving systems of equations via LU, part 2\n - Video 6. Solving systems of equations via LU, part 3\n - Video 7. Cholesky decomposition\n\n\nIn working through how to compute something or understanding an\nalgorithm, it can be very helpful to depict the matrices and vectors\ngraphically. We'll see this on the board in class.\n\n# 1. Preliminaries\n\n## Context\n\nMany statistical and machine learning methods involve linear algebra of\nsome sort - at the very least matrix multiplication and very often some\nsort of matrix decomposition to fit models and do analysis: linear\nregression, various more sophisticated forms of regression, deep neural\nnetworks, principle components analysis (PCA) and the wide varieties of\ngeneralizations and variations on PCA, etc., etc.\n\n## Goals\n\nHere's what I'd like you to get out of this unit:\n\n1. How to think about the computational order (number of computations\n involved) of a problem\n2. How to choose a computational approach to a given linear algebra\n calculation you need to do.\n3. 
An understanding of how issues with computer numbers (Unit 8) affect\n linear algebra calculations.\n\n## Key principle\n\n**The form of a mathematical expression and how it should be evaluated\non a computer may be very different.** Better computational approaches\ncan increase speed and improve the numerical properties of the\ncalculation.\n\n- Example 1 (already seen in Unit 5): If $X$ and $Y$ are matrices and $z$\nis a vector, we should compute $X(Yz)$ rather than $(XY)z$; the former\nis much more computationally efficient. \n- Example 2: We do not compute $(X^{\\top}X)^{-1}X^{\\top}Y$ by computing\n$X^{\\top}X$ and finding its inverse. In fact, perhaps more surprisingly,\nwe may never actually form $X^{\\top}X$ in some implementations.\n- Example 3: Suppose I have a matrix $A$, and I want to permute (switch)\ntwo rows. I can do this with a permutation matrix, $P$, which is mostly\nzeroes. On a computer, in general I wouldn't need to even change the\nvalues of $A$ in memory in some cases (e.g., if I were to calculate\n$PAB$). Why not?\n\n## Computational complexity\n\nWe can assess the computational complexity of a linear algebra\ncalculation by counting the number multiplys/divides and the number of\nadds/subtracts. Sidenote: addition is a bit faster than multiplication,\nso some algorithms attempt to trade multiplication for addition.\n\nIn general we do not try to count the actual number of calculations, but\njust their order, though in some cases in this unit we'll actually get a\nmore exact count. In general, we denote this as $O(f(n))$ which means\nthat the number of calculations approaches $cf(n)$ as $n\\to\\infty$\n(i.e., we know the calculation is approximately proportional to $f(n)$).\nConsider matrix multiplication, $AB$, with matrices of size $a\\times b$\nand $b\\times c$. Each column of the second matrix is multiplied by all\nthe rows of the first. For any given inner product of a row by a column,\nwe have $b$ multiplies. We repeat these operations for each column and\nthen for each row, so we have $abc$ multiplies so $O(abc)$ operations.\nWe could count the additions as well, but there's usually an addition\nfor each multiply, so we can usually just count the multiplys and then\nsay there are such and such {multiply and add}s. This is Monahan's\napproach, but you may see other counting approaches where one counts the\nmultiplys and the adds separately.\n\nFor two symmetric, $n\\times n$ matrices, this is $O(n^{3})$. Similarly,\nmatrix factorization (e.g., the Cholesky decomposition) is $O(n^{3})$\nunless the matrix has special structure, such as being sparse. As\nmatrices get large, the speed of calculations decreases drastically\nbecause of the scaling as $n^{3}$ and memory use increases drastically.\nIn terms of memory use, to hold the result of the multiply indicated\nabove, we need to hold $ab+bc+ac$ total elements, which for symmetric\nmatrices sums to $3n^{2}$. So for a matrix with $n=10000$, we have\n$3\\cdot10000^{2}\\cdot8/1e9=2.4$Gb.\n\nWhen we have $O(n^{q})$ this is known as polynomial time. Much worse is\n$O(b^{n})$ (exponential time), while much better is $O(\\log n$) (log\ntime). Computer scientists talk about NP-complete problems; these are\nessentially problems for which there is not a polynomial time\nalgorithm - it turns out all such problems can be rewritten such that\nthey are equivalent to one another.\n\nIn real calculations, it's possible to have the actual time ordering of\ntwo approaches differ from what the order approximations tell us. 
For\nexample, something that involves $n^{2}$ operations may be faster than\none that involves $1000(n\\log n+n)$ even though the former is $O(n^{2})$\nand the latter $O(n\\log n)$. The problem is that the constant, $c=1000$,\ncan matter (depending on how big $n$ is), as can the extra calculations\nfrom the lower order term(s), in this case $1000n$.\n\nA note on terminology: *flops* stands for both floating point operations\n(the number of operations required) and floating point operations per\nsecond, the speed of calculation.\n\n## Notation and dimensions\n\nI'll try to use capital letters for matrices, $A$, and lower-case for\nvectors, $x$. Then $x_{i}$ is the ith element of $x$, $A_{ij}$ is the\n$i$th row, $j$th column element, and $A_{\\cdot j}$ is the $j$th column\nand $A_{i\\cdot}$ the $i$th row. By default, we'll consider a vector,\n$x$, to be a one-column matrix, and $x^{\\top}$ to be a one-row matrix.\nSome of the references given at the start of this Unit also use $a_{ij}$ for $A_{ij}$ and\n$a_{j}$ for the $j$th column.\n\nThroughout, we'll need to be careful that the matrices involved in an\noperation are conformable: for $A+B$ both matrices need to be of the\nsame dimension, while for $AB$ the number of columns of $A$ must match\nthe number of rows of $B$. Note that this allows for $B$ to be a column\nvector, with only one column, $Ab$. Just checking dimensions is a good\nway to catch many errors. Example: is\n$\\mbox{Cov}(Ax)=A\\mbox{Cov}(x)A^{\\top}$ or\n$\\mbox{Cov}(Ax)=A^{\\top}\\mbox{Cov}(x)A$? Well, if $A$ is $m\\times n$, it\nmust be the former, as the latter is not conformable.\n\nThe **inner product** of two vectors is\n$\\sum_{i}x_{i}y_{i}=x^{\\top}y\\equiv\\langle x,y\\rangle\\equiv x\\cdot y$.\n\nThe **outer product** is $xy^{\\top}$, which comes from all pairwise\nproducts of the elements.\n\nWhen the indices of summation should be obvious, I'll sometimes leave\nthem implicit. Ask me if it's not clear.\n\n## Norms\n\nFor a vector, $\\|x\\|_{p}=(\\sum_{i}|x_{i}|^{p})^{1/p}$ and the standard (Euclidean)\nnorm is $\\|x\\|_{2}=\\sqrt{\\sum x_{i}^{2}}=\\sqrt{x^{\\top}x}$, just the\nlength of the vector in Euclidean space, which we'll refer to as\n$\\|x\\|$, unless noted otherwise.\n\nOne commonly used norm for a matrix is\nthe Frobenius norm, $\\|A\\|_{F}=(\\sum_{i,j}a_{ij}^{2})^{1/2}$.\n\nIn this Unit, we'll often make use of the **induced matrix norm**, which is\ndefined relative to a corresponding vector norm, $\\|\\cdot\\|$, as:\n$$\\|A\\|=\\sup_{x\\ne0}\\frac{\\|Ax\\|}{\\|x\\|}$$ So we have\n$$\\|A\\|_{2}=\\sup_{x\\ne0}\\frac{\\|Ax\\|_{2}}{\\|x\\|_{2}}=\\sup_{\\|x\\|_{2}=1}\\|Ax\\|_{2}$$\nIf you're not familiar with the supremum (\"sup\" above), you can just\nthink of it as taking the maximum. In the case of the 2-norm, the norm turns\nout to be the largest singular value in the singular value decomposition (SVD)\nof the matrix.\n\nWe can interpret the norm of a matrix as the most that the matrix\ncan stretch a vector when multiplying by the vector (relative to the\nlength of the vector).\n\nA property of any legitimate matrix norm (including the induced norm) is\nthat $\\|AB\\|\\leq\\|A\\|\\|B\\|$. 
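\n\nAs a quick numerical illustration (a sketch), the induced 2-norm is the\nlargest singular value, and the submultiplicative property above holds:\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\nrng = np.random.default_rng(1)\nA = rng.normal(size = (4, 4))\nB = rng.normal(size = (4, 4))\n\n## induced 2-norm equals the largest singular value\nnp.allclose(np.linalg.norm(A, 2), np.linalg.svd(A)[1][0])\n\n## ||AB|| <= ||A|| ||B||\nnp.linalg.norm(A @ B, 2) <= np.linalg.norm(A, 2) * np.linalg.norm(B, 2)\n```\n:::\n\n\n\n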
Also recall that norms must obey the triangle\ninequality, $\\|A+B\\|\\leq\\|A\\|+\\|B\\|$.\n\nA normalized vector is one with \"length\", i.e., Euclidean norm, of one.\nWe can easily normalize a vector: $\\tilde{x}=x/\\|x\\|$\n\nThe angle between two vectors is\n$$\\theta=\\cos^{-1}\\left(\\frac{\\langle x,y\\rangle}{\\sqrt{\\langle x,x\\rangle\\langle y,y\\rangle}}\\right)$$\n\n## Orthogonality\n\nTwo vectors are orthogonal if $x^{\\top}y=0$, in which case we say\n$x\\perp y$. An **orthogonal matrix** is a square matrix in which all of the\ncolumns are orthogonal to each other and normalized. The same holds for\nthe rows. Orthogonal matrices\ncan be shown to have full rank. Furthermore if $A$ is orthogonal,\n$A^{\\top}A=I$, so $A^{-1}=A^{\\top}$. Given all this, the determinant of\northogonal $A$ is either 1 or -1. Finally the product of two orthogonal\nmatrices, $A$ and $B$, is also orthogonal since\n$(AB)^{\\top}AB=B^{\\top}A^{\\top}AB=B^{\\top}B=I$.\n\n#### Permutations\n\nSometimes we make use of matrices that permute two rows (or two columns)\nof another matrix when multiplied. Such a matrix is known as an\nelementary permutation matrix and is an orthogonal matrix with a\ndeterminant of -1. You can multiply such matrices to get more general\npermutation matrices that are also orthogonal. If you premultiply by\n$P$, you permute rows, and if you postmultiply by $P$ you permute\ncolumns. Note that on a computer, you wouldn't need to actually do the\nmultiply (and if you did, you should use a sparse matrix routine), but\nrather one can often just rework index values that indicate where\nrelevant pieces of the matrix are stored (more in the next section).\n\n## Some vector and matrix properties\n\n$AB\\ne BA$ but $A+B=B+A$ and $A(BC)=(AB)C$.\n\nIn Python, recall the syntax is\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nA + B\n\n# Matrix multiplication\nnp.matmul(A, B) \nA @ B # alternative\nA.dot(B) # not recommended by the NumPy docs\n\nA * B # Hadamard (direct) product\n```\n:::\n\n\n\nYou don't need the spaces, but they're nice for code readability.\n\n## Trace and determinant of square matrices\n\nThe trace of a matrix is the sum of the diagonal elements. For square\nmatrices, $\\mbox{tr}(A+B)=\\mbox{tr}(A)+\\mbox{tr}(B)$,\n$\\mbox{tr}(A)=\\mbox{tr}(A^{\\top})$.\n\nWe also have $\\mbox{tr}(ABC)=\\mbox{tr}(CAB)=\\mbox{tr}(BCA)$ - basically\nyou can move a matrix from the beginning to the end or end to beginning,\nprovided they are conformable for this operation. This is helpful for a\ncouple reasons:\n\n1. We can find the ordering that reduces computation the most if the\n individual matrices are not square.\n2. $x^{\\top}Ax=\\mbox{tr}(x^{\\top}Ax)$ since the quadratic form,\n $x^{\\top}Ax$, is a scalar, and this is equal to\n $\\mbox{tr}(xx^{\\top}A)$ where $xx^{\\top}A$ is a matrix. It can be\n helpful to be able to go back and forth between a scalar and a trace\n in some statistical calculations.\n\nFor square matrices, the determinant exists and we have $|AB|=|A||B|$\nand therefore, $|A^{-1}|=1/|A|$ since $|I|=|AA^{-1}|=1$. Also\n$|A|=|A^{\\top}|$, which can be seen using the QR decomposition for $A$\nand understanding properties of determinants of triangular matrices (in\nthis case $R$) and orthogonal matrices (in this case $Q$).\n\n## Transposes and inverses\n\nFor square, invertible matrices, we have that\n$(A^{-1})^{\\top}=(A^{\\top})^{-1}$. Why? 
Since we have\n$(AB)^{\\top}=B^{\\top}A^{\\top}$, we have:\n$$A^{\\top}(A^{-1})^{\\top}=(A^{-1}A)^{\\top}=I$$ so\n$(A^{\\top})^{-1}=(A^{-1})^{\\top}$.\n\nFor two invertible matrices, we have that\n$(AB)^{-1} = B^{-1}A^{-1}$ since\n$B^{-1}A^{-1} AB = I$.\n\n#### Other matrix multiplications\n\nThe Hadamard or direct product is simply multiplication of the\ncorrespoding elements of two matrices by each other. In R this is\nsimply` A * B`.\\\n**Challenge**: How can I find $\\mbox{tr}(AB)$ without using `A %*% B` ?\n\nThe Kronecker product is the product of each element of one matrix with\nthe entire other matrix\"\n\n$$A\\otimes B=\\left(\\begin{array}{ccc}\nA_{11}B & \\cdots & A_{1m}B\\\\\n\\vdots & \\ddots & \\vdots\\\\\nA_{n1}B & \\cdots & A_{nm}B\n\\end{array}\\right)$$\n\nThe inverse of a Kronecker product is the Kronecker product of the\ninverses,\n\n$$ B^{-1} \\otimes A^{-1} $$ \n\nwhich is obviously quite a bit faster because\nthe inverse (i.e., solving a system of equations) in this special case\nis $O(n^{3}+m^{3})$ rather than the naive approach being $O((nm)^{3})$.\n\n## Matrix decompositions\n\nA matrix decomposition is a re-expression of a matrix, $A$, in terms of\na product of two or three other, simpler matrices, where the\ndecomposition reveals structure or relationships present in the original\nmatrix, $A$. The \"simpler\" matrices may be simpler in various ways,\nincluding\n\n- having fewer rows or columns;\n- being diagonal, triangular or sparse in some way,\n- being orthogonal matrices.\n\nIn addition, once you have a decomposition, computation is generally\neasier, because of the special structure of the simpler matrices.\n\nWe'll see this in great detail in Section 3.\n\n# 2. Statistical interpretations of matrix invertibility, rank, etc.\n\n## Linear independence, rank, and basis vectors\n\nA set of vectors, $v_{1},\\ldots v_{n}$, is linearly independent (LIN)\nwhen none of the vectors can be represented as a linear combination,\n$\\sum c_{i}v_{i}$, of the others for scalars, $c_{1},\\ldots,c_{n}$. If\nwe have vectors of length $n$, we can have at most $n$ linearly\nindependent vectors. The rank of a matrix is the number of linearly\nindependent rows (or columns - it's the same), and is at most the\nminimum of the number of rows and number of columns. We'll generally\nthink about it in terms of the dimension of the column space - so we can\njust think about the number of linearly independent columns.\n\nAny set of linearly independent vectors (say $v_{1},\\ldots,v_{n}$) span\na space made up of all linear combinations of those vectors\n($\\sum_{i=1}^{n}c_{i}v_{i}$). The spanning vectors are known as basis\nvectors. We can express a vector $y$ that is in the space with respect\nto (as a linear combination of) basis vectors as $y=\\sum_{i}c_{i}v_{i}$,\nwhere if the basis vectors are normalized and orthogonal, we can find\nthe weights as $c_{i}=\\langle y,v_{i}\\rangle$.\n\nConsider a regression context. We have $p$ covariates ($p$ columns in\nthe design matrix, $X$), of which $q\\leq p$ are linearly independent\ncovariates. This means that $p-q$ of the vectors can be written as\nlinear combos of the $q$ vectors. The space spanned by the covariate\nvectors is of dimension $q$, rather than $p$, and $X^{\\top}X$ has $p-q$\neigenvalues that are zero. The $q$ LIN vectors are basis vectors for the\nspace - we can represent any point in the space as a linear combination\nof the basis vectors. 
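\n\nHere's a small numerical sketch of that situation, using a toy design\nmatrix in which one column is redundant: the rank drops below the number\nof columns and $X^{\\top}X$ picks up a zero eigenvalue.\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\nrng = np.random.default_rng(1)\nx1 = rng.normal(size = 10)\nx2 = rng.normal(size = 10)\nX = np.column_stack((np.ones(10), x1, x2, x1 + x2))  # last column is redundant\n\nnp.linalg.matrix_rank(X)       # 3, not 4\nnp.linalg.eigvalsh(X.T @ X)    # smallest eigenvalue is numerically zero\n```\n:::\n\n\n\n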
You can think of the basis vectors as being like\nthe axes of the space, except that the basis vectors are not orthogonal.\nSo it's like denoting a point in $\\Re^{q}$ as a set of $q$ numbers\ntelling us where on each of the axes we are - this is the same as a\nlinear combination of axis-oriented vectors.\n\nWhen fitting a regression, if $n=p=q$, a vector of $n$ observations can\nbe represented exactly as a linear combination of the $p$ basis vectors,\nso there is no residual and we have a single unique (and exact) solution\n(e.g., with $n=p=2$, the observations fall exactly on the simple linear\nregression line). If $np$, so the system is overdetermined - there is no exact\nsolution, but regression is all about finding solutions that minimize\nsome criterion about the differences between the observations and linear\ncombinations of the columns of the $X$ matrix (such as least squares or\npenalized least squares). In standard regression, we project the\nobservation vector onto the space spanned by the columns of the $X$\nmatrix, so we find the point in the space closest to the observation\nvector.\n\n## Invertibility, singularity, rank, and positive definiteness\n\nFor square matrices, let's consider how invertibility, singularity, rank\nand positive (or non-negative) definiteness relate.\n\nSquare matrices that are \"regular\" have an eigendecomposition,\n$A=\\Gamma\\Lambda\\Gamma^{-1}$ where $\\Gamma$ is a matrix with the\neigenvectors as the columns and $\\Lambda$ is a diagonal matrix of\neigenvalues, $\\Lambda_{ii}=\\lambda_{i}$. Symmetric matrices and matrices\nwith unique eigenvalues are regular, as are some other matrices. The\nnumber of non-zero eigenvalues is the same as the rank of the matrix.\nSquare matrices that have an inverse are also called nonsingular, and\nthis is equivalent to having full rank. If the matrix is symmetric, the\neigenvectors and eigenvalues are real and $\\Gamma$ is orthogonal, so we\nhave $A=\\Gamma\\Lambda\\Gamma^{\\top}$. The determinant of the matrix is\nthe product of the eigenvalues (why?), which is zero if it is less than\nfull rank. Note that if none of the eigenvalues are zero then\n$A^{-1}=\\Gamma\\Lambda^{-1}\\Gamma^{\\top}$.\n\nLet's focus on symmetric matrices. The symmetric matrices that tend to\narise in statistics are either positive definite (p.d.) or non-negative\ndefinite (n.n.d.). If a matrix is positive definite, then by definition\n$x^{\\top}Ax>0$ for any $x$. Note that if $\\mbox{Cov}(y)=A$ then\n$x^{\\top}Ax=x^{\\top}\\mbox{Cov}(y)x=\\mbox{Cov}(x^{\\top}y)=\\mbox{Var}(x^{\\top}y)$\nif so positive definiteness amounts to having linear combinations of\nrandom variables (with the elements of $x$ here being the weights)\nhaving positive variance. So we must have that positive definite\nmatrices are equivalent to variance-covariance matrices (I'll just refer\nto this as a variance matrix or as a covariance matrix). If $A$ is p.d.\nthen it has all positive eigenvalues and it must have an inverse, though\nas we'll see, from a numerical perspective, we may not be able to\ncompute it if some of the eigenvalues are very close to zero. 
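\n\nA common numerical check of positive definiteness (just a sketch) is to\nattempt a Cholesky factorization, which exists only for positive definite\nmatrices; `np.linalg.cholesky` raises an error otherwise.\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\nA = np.array([[3., 1], [1, 2]])   # positive definite\nB = np.array([[1., 2], [2, 1]])   # indefinite (eigenvalues 3 and -1)\n\nnp.linalg.cholesky(A)             # succeeds\ntry:\n    np.linalg.cholesky(B)\nexcept np.linalg.LinAlgError:\n    print(\"B is not positive definite\")\n```\n:::\n\n\n\n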
In Python,\n`numpy.linalg.eig(A)[1]` is $\\Gamma$, with each column a vector, and\n`numpy.linalg.eig(A)[0]` contains the (unordered) eigenvalues.\n\nTo summarize, here are some of the various connections between\nmathematical and statistical properties of **positive definite**\nmatrices:\n\n$A$ positive definite $\\Leftrightarrow$ $A$ is a covariance matrix\n$\\Leftrightarrow$ $x^{\\top}Ax>0$ $\\Leftrightarrow$ $\\lambda_{i}>0$\n(positive eigenvalues) $\\Rightarrow$$|A|>0$ $\\Rightarrow$$A$ is\ninvertible $\\Leftrightarrow$ $A$ is non singular $\\Leftrightarrow$ $A$ is\nfull rank.\n\nAnd here are connections for positive semi-definite matrices:\n\n$A$ positive semi-definite $\\Leftrightarrow$ $A$ is a constrained\ncovariance matrix $\\Leftrightarrow$ $x^{\\top}Ax\\geq0$ and equal to 0 for\nsome $x$ $\\Leftrightarrow$ $\\lambda_{i}\\geq 0$ (non-negative eigenvalues),\nwith at least one zero $\\Rightarrow$ $|A|=0$ $\\Leftrightarrow$ $A$ is not\ninvertible $\\Leftrightarrow$ $A$ is singular $\\Leftrightarrow$ $A$ is not\nfull rank.\n\n## Interpreting an eigendecomposition\n\nLet's interpret the eigendecomposition in a generative context as a way\nof generating random vectors. We can generate $y$ s.t. $\\mbox{Cov}(y)=A$\nif we generate $y=\\Gamma\\Lambda^{1/2}z$ where $\\mbox{Cov}(z)=I$ and\n$\\Lambda^{1/2}$ is formed by taking the square roots of the eigenvalues.\nSo $\\sqrt{\\lambda_{i}}$ is the standard deviation associated with the\nbasis vector $\\Gamma_{\\cdot i}$. That is, the $z$'s provide the weights\non the basis vectors, with scaling based on the eigenvalues. So $y$ is\nproduced as a linear combination of eigenvectors as basis vectors, with\nthe variance attributable to the basis vectors determined by the\neigenvalues.\n\nTo go the other direction, we can project a vector $y$ onto the space \nspanned by the eigenvectors: $w = (\\Gamma^{\\top}\\Gamma)^{-1}\\Gamma^{\\top}y = \\Gamma^{\\top}y = \\Lambda^{1/2}z$,\nwhere the simplification of course comes from $\\Gamma$ being orthogonal.\n\nIf $x^{\\top}Ax\\geq0$ then $A$ is nonnegative definite (also called\npositive semi-definite). In this case one or more eigenvalues can be\nzero. Let's interpret this a bit more in the context of generating\nrandom vectors based on non-negative definite matrices,\n$y=\\Gamma\\Lambda^{1/2}z$ where $\\mbox{Cov}(z)=I$. Questions:\n\n1. What does it mean when one or more eigenvalue (i.e.,\n $\\lambda_{i}=\\Lambda_{ii}$) is zero?\n\n2. Suppose I have an eigenvalue that is very small and I set it to\n zero? What will be the impact upon $y$ and $\\mbox{Cov}(y)$?\n\n3. Now let's consider the inverse of a covariance matrix, known as the\n precision matrix, $A^{-1}=\\Gamma\\Lambda^{-1}\\Gamma^{\\top}$. What\n does it mean if a $(\\Lambda^{-1})_{ii}$ is very large? What if\n $(\\Lambda^{-1})_{ii}$ is very small?\n\nConsider an arbitrary $n\\times p$ matrix, $X$. Any crossproduct or sum\nof squares matrix, such as $X^{\\top}X$ is positive definite\n(non-negative definite if $p>n$). This makes sense as it's just a\nscaling of an empirical covariance matrix.\n\n## Generalized inverses (optional)\n\nSuppose I want to find $x$ such that $Ax=b$. Mathematically the answer\n(provided $A$ is invertible, i.e. of full rank) is $x=A^{-1}b$.\n\nGeneralized inverses arise in solving equations when $A$ is not full\nrank. A generalized inverse is a matrix, $A^{-}$ s.t. $AA^{-}A=A$. 
The\nMoore-Penrose inverse (the pseudo-inverse), $A^{+}$, is a (unique)\ngeneralized inverse that also satisfies some additional properties.\n$x=A^{+}b$ is the solution to the linear system, $Ax=b$, that has the\nshortest length for $x$.\n\nWe can find the pseudo-inverse based on an eigendecomposition (or an\nSVD) as $\\Gamma\\Lambda^{+}\\Gamma^{\\top}$. We obtain $\\Lambda^{+}$ from\n$\\Lambda$ as follows. For values $\\lambda_{i}>0$, compute\n$1/\\lambda_{i}$. All other values are set to 0. Let's interpret this\nstatistically. Suppose we have a precision matrix with one or more zero\neigenvalues and we want to find the covariance matrix. A zero eigenvalue\nmeans we have no precision, or infinite variance, for some linear\ncombination (i.e., for some basis vector). We take the pseudo-inverse\nand assign that linear combination zero variance.\n\nLet's consider a specific example. Autoregressive models are often used\nfor smoothing (in time, in space, and in covariates). A first order\nautoregressive model for $y_{1},y_{2},\\ldots,y_{T}$ has\n$E(y_{i}|y_{-i})=\\frac{1}{2}(y_{i-1}+y_{i+1})$. Another way of writing\nthe model is in time-order: $y_{i}=y_{i-1}+\\epsilon_{i}$. A second order\nautoregressive model has\n$E(y_{i}|y_{-i})=\\frac{1}{6}(4y_{i-1}+4y_{i+1}-y_{i-2}-y_{i+2})$. These\nconstructions basically state that each value should be a smoothed\nversion of its neighbors. One can figure out that the **precision**\nmatrix for $y$ in the first order model is $$\\left(\\begin{array}{ccccc}\n\\ddots & & \\vdots\\\\\n-1 & 2 & -1 & 0\\\\\n\\cdots & -1 & 2 & -1 & \\dots\\\\\n & 0 & -1 & 2 & -1\\\\\n & & \\vdots & & \\ddots\n\\end{array}\\right)$$ and in the second order model is\n\n$$\\left( \\begin{array}{ccccccc} \\ddots & & & \\vdots \\\\ 1 & -4 & 6 & -4 & 1 \\\\ \\cdots & 1 & -4 & 6 & -4 & 1 & \\cdots \\\\ & & 1 & -4 & 6 & -4 & 1 \\\\ & & & \\vdots \\end{array} \\right).$$ \n\nIf we look at the eigendecomposition of such\nmatrices, we see that in the first order case, the eigenvalue\ncorresponding to the constant eigenvector is zero.\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\nprecMat = np.array([[1,-1,0,0,0],[-1,2,-1,0,0],[0,-1,2,-1,0],[0,0,-1,2,-1],[0,0,0,-1,1]])\ne = np.linalg.eig(precMat)\ne[0] # 4th eigenvalue is numerically zero\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([3.61803399e+00, 2.61803399e+00, 1.38196601e+00, 4.97762256e-17,\n 3.81966011e-01])\n```\n:::\n\n```{.python .cell-code}\ne[1][:,3] # constant eigenvector\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([0.4472136, 0.4472136, 0.4472136, 0.4472136, 0.4472136])\n```\n:::\n:::\n\n\n\nThis means we have no information about the overall level of $y$. So how\nwould we generate sample $y$ vectors? We can't put infinite variance on\nthe constant basis vector and still generate samples. 
Instead we use the\npseudo-inverse and assign ZERO variance to the constant basis vector.\nThis corresponds to generating realizations under the constraint that\n$\\sum y_{i}$ has no variation, i.e., $\\sum y_{i}=\\bar{y}=0$ - you can\nsee this by seeing that $\\mbox{Var}(\\Gamma_{\\cdot i}^{\\top}y)=0$ when\n$\\lambda_{i}=0$.\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# generate a realization\nevals = e[0]\nevals = 1/evals # variances\nevals[3] = 0 # generalized inverse\ny = e[1] @ ((evals ** 0.5) * np.random.normal(size = 5))\ny.sum()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n4.440892098500626e-16\n```\n:::\n:::\n\n\n\nIn the second order case, we have two non-identifiabilities: for the sum\nand for the linear component of the variation in $y$ (linear in the\nindices of $y$).\n\nI could parameterize a statistical model as $\\mu+y$ where $y$ has\ncovariance that is the generalized inverse discussed above. Then I allow\nfor both a non-zero mean and for smooth variation governed by the\nautoregressive structure. In the second-order case, I would need to add\na linear component as well, given the second non-identifiability.\n\n## Matrices arising in regression\n\nIn regression, we work with $X^{\\top}X$. Some properties of this matrix\nare that it is symmetric and non-negative definite (hence our use of\n$(X^{\\top}X)^{-1}$ in the OLS estimator). When is it not positive\ndefinite?\n\nFitted values are $X\\hat{\\beta}=X(X^{\\top}X)^{-1}X^{\\top}Y=HY$. The\n\"hat\" matrix, $H$, projects $Y$ into the column space of $X$. $H$ is\nidempotent: $HH=H$, which makes sense - once you've projected into the\nspace, any subsequent projection just gives you the same thing back. $H$\nis singular. Why? Also, under what special circumstance would it not be\nsingular?\n\n# 3. Computational issues\n\n## Storing matrices\n\nWe've discussed column-major and row-major storage of matrices. First,\nretrieval of matrix elements from memory is quickest when multiple\nelements are contiguous in memory. So in a column-major language (e.g.,\nR, Fortran), it is best to work with values in a common column (or\nentire columns) while in a row-major language (e.g., Python, C) for\nvalues in a common row.\n\nIn some cases, one can save space (and potentially speed) by overwriting\nthe output from a matrix calculation into the space occupied by an\ninput. This occurs in some clever implementations of matrix\nfactorizations.\n\n## Algorithms\n\nGood algorithms can change the efficiency of an algorithm by one or more\norders of magnitude, and many of the improvements in computational speed\nover recent decades have been in algorithms rather than in computer\nspeed.\n\nMost matrix algebra calculations can be done in multiple ways. For\nexample, we could compute $b=Ax$ in either of the following ways,\ndenoted here in pseudocode.\n\n1. Stack the inner products of the rows of $A$ with $x$.\\\n\n```\n for(i=1:n){ \n b_i = 0\n for(j=1:m){\n b_i = b_i + a_{ij} x_j\n }\n }\n```\n\n2. Take the linear combination (based on $x$) of the columns of $A$\\\n\n```\n for(i=1:n){ \n b_i = 0\n }\n for(j=1:m){\n for(i = 1:n){\n b_i = b_i + a_{ij} x_j \n }\n }\n```\n\nIn this case the two approaches involve the same number of operations\nbut the first might be better for row-major matrices (so might be how we\nwould implement in C) and the second for column-major (so might be how\nwe would implement in Fortran). \n\n**Challenge**: check whether the first\napproach is faster in Python. 
(Write the code just doing the outer loop and\ndoing the inner loop using vectorized calculation.)\n\n#### General computational issues\n\nThe same caveats we discussed in terms of computer arithmetic hold\nnaturally for linear algebra, since this involves arithmetic with many\nelements. Good implementations of algorithms are aware of the danger of\ncatastrophic cancellation and of the possibility of dividing by zero or\nby values that are near zero.\n\n## Ill-conditioned problems\n\n#### Basics\n\nA problem is ill-conditioned if small changes to values in the\ncomputation result in large changes in the result. This is quantified by\nsomething called the *condition number* of a calculation. For different\noperations there are different condition numbers.\n\nIll-conditionedness arises most often in terms of matrix inversion, so\nthe standard condition number is the \"condition number with respect to\ninversion\", which when using the $L_{2}$ norm is the ratio of the\nabsolute values of the largest to smallest eigenvalue. Here's an\nexample: $$A=\\left(\\begin{array}{cccc}\n10 & 7 & 8 & 7\\\\\n7 & 5 & 6 & 5\\\\\n8 & 6 & 10 & 9\\\\\n7 & 5 & 9 & 10\n\\end{array}\\right).$$ The solution of $Ax=b$ for $b=(32,23,33,31)$ is\n$x=(1,1,1,1)$, while the solution for $b+\\delta b=(32.1,22.9,33.1,30.9)$\nis $x+\\delta x=(9.2,-12.6,4.5,-1.1)$, where $\\delta$ is notation for a\nperturbation to the vector or matrix.\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ndef norm2(x):\n return(np.sum(x**2) ** 0.5)\n\nA = np.array([[10,7,8,7],[7,5,6,5],[8,6,10,9],[7,5,9,10]])\nb = np.array([32,23,33,31])\nx = np.linalg.solve(A, b)\n\nbPerturbed = np.array([32.1, 22.9, 33.1, 30.9])\nxPerturbed = np.linalg.solve(A, bPerturbed)\n\ndelta_b = bPerturbed - b\ndelta_x = xPerturbed - x\n```\n:::\n\n\n\n\nWhat's going on? Some manipulations with inequalities involving the\ninduced matrix norm (for any chosen vector norm, but we might as well\njust think about the Euclidean norm) (see Gentle-CS Sec. 5.1 or the derivation in class) give\n$$\\frac{\\|\\delta x\\|}{\\|x\\|}\\leq\\|A\\|\\|A^{-1}\\|\\frac{\\|\\delta b\\|}{\\|b\\|}$$\nwhere we define the condition number w.r.t. inversion as\n$\\mbox{cond}(A)\\equiv\\|A\\|\\|A^{-1}\\|$. We'll generally work with the\n$L_{2}$ norm, and for a nonsingular square matrix the result is that the\ncondition number is the ratio of the absolute values of the largest and\nsmallest magnitude eigenvalues. 
This makes sense since $\\|A\\|_{2}$ is\nthe absolute value of the largest magnitude eigenvalue of $A$ and\n$\\|A^{-1}\\|_{2}$ that of the inverse of the absolute value of the\nsmallest magnitude eigenvalue of $A$.\n\nWe see in the code below that the large disparity in eigenvalues of $A$\nleads to an effect predictable from our inequality above, with the\ncondition number helping us find an upper bound.\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ne = np.linalg.eig(A)\nevals = e[0]\nprint(evals)\n\n## relative perturbation in x much bigger than in b\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[3.02886853e+01 3.85805746e+00 1.01500484e-02 8.43107150e-01]\n```\n:::\n\n```{.python .cell-code}\nnorm2(delta_x) / norm2(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n8.19847546803699\n```\n:::\n\n```{.python .cell-code}\nnorm2(delta_b) / norm2(b)\n\n## ratio of relative perturbations\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n0.0033319453118976702\n```\n:::\n\n```{.python .cell-code}\n(norm2(delta_x) / norm2(x)) / (norm2(delta_b) / norm2(b))\n\n## ratio of largest and smallest magnitude eigenvalues\n## confusingly evals[2] is the smallest, not evals[3]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n2460.567236431514\n```\n:::\n\n```{.python .cell-code}\n(evals[0]/evals[2])\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n2984.092701676269\n```\n:::\n:::\n\n\n\nThe main use of these ideas for our purposes is in thinking about the\nnumerical accuracy of a linear system solution (Gentle-NLA Sec 3.4). On\na computer we have the system $$(A+\\delta A)(x+\\delta x)=b+\\delta b$$\nwhere the 'perturbation' is from the inaccuracy of computer numbers. Our\nexploration of computer numbers tells us that\n$$\\frac{\\|\\delta b\\|}{\\|b\\|}\\approx10^{-p};\\,\\,\\,\\frac{\\|\\delta A\\|}{\\|A\\|}\\approx10^{-p}$$\nwhere $p=16$ for standard double precision floating points. Following\nGentle, one gets the approximation\n\n$$\\frac{\\|\\delta x\\|}{\\|x\\|}\\approx\\mbox{cond}(A)10^{-p},$$ so if\n$\\mbox{cond}(A)\\approx10^{t}$, we have accuracy of order $10^{t-p}$\ninstead of $10^{-p}$. (Gentle cautions that this holds only if\n$10^{t-p}\\ll1$). So we can think of the condition number as giving us\nthe number of digits of accuracy lost during a computation relative to\nthe precision of numbers on the computer. E.g., a condition number of\n$10^{8}$ means we lose 8 digits of accuracy relative to our original 16\non standard systems. One issue is that estimating the condition number\nis itself subject to numerical error and requires computation of\n$A^{-1}$ (albeit not in the case of $L_{2}$ norm with square,\nnonsingular $A$) but see Golub and van Loan (1996; p. 76-78) for an\nalgorithm.\n\n#### Improving conditioning\n\nIll-conditioned problems in statistics often arise from collinearity of\nregressors. Often the best solution is not a numerical one, but\nre-thinking the modeling approach, as this generally indicates\nstatistical issues beyond just the numerical difficulties.\n\nA general comment on improving conditioning is that we want to avoid\nlarge differences in the magnitudes of numbers involved in a\ncalculation. In some contexts such as regression, we can center and\nscale the columns to avoid such differences - this will improve the\ncondition of the problem. 
E.g., in simple quadratic regression with\n$x=\\{1990,\\ldots,2010\\}$ (e.g., regressing on calendar years), we see\nthat centering and scaling the matrix columns makes a huge difference on\nthe condition number\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport statsmodels.api as sm\n\nt1 = np.arange(1990, 2011) # naive covariate\nX1 = np.column_stack((np.ones(21), t1, t1 ** 2))\n\nbeta = np.array([5, .1, .0001])\ny = X1 @ beta + np.random.normal(size = len(t1))\n\ne1 = np.linalg.eig(np.dot(X1.T, X1))\nnp.sort(e1[0])[::-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([3.36018564e+14, 7.69949736e+02, 2.24079720e-08])\n```\n:::\n\n```{.python .cell-code}\nnp.linalg.cond(np.dot(X1.T, X1)) # built-in!\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n4.653329826789808e+21\n```\n:::\n\n```{.python .cell-code}\nsm.OLS(y, X1).fit().params\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([-1.02879965e+03, 1.16747888e+00, -1.75251012e-04])\n```\n:::\n\n```{.python .cell-code}\nt2 = t1 - 2000 # centered\nX2 = np.column_stack((np.ones(21), t2, t2 ** 2))\ne2 = np.linalg.eig(np.dot(X2.T, X2))\nwith np.printoptions(suppress=True):\n np.sort(e2[0])[::-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([50677.70427505, 770. , 9.29572495])\n```\n:::\n\n```{.python .cell-code}\nt3 = t2/10 # centered and scaled\nX3 = np.column_stack((np.ones(21), t3, t3 ** 2))\ne3 = np.linalg.eig(np.dot(X3.T, X3))\nwith np.printoptions(suppress=True):\n np.sort(e3[0])[::-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([24.11293487, 7.7 , 1.95366513])\n```\n:::\n:::\n\n\n\nI haven't shown the OLS results for the transformed version\nof the problem because the transformations modify the true\nparameter values, so there is a bit of arithmetic to be done,\nbut you should be able to verify that the second and third\napproaches give reasonable answers.\n\n\nThe basic story is that simple strategies often solve the problem, and\nthat you should be aware of the absolute and relative magnitudes\ninvolved in your calculations.\n\nOne rule of thumb is to try to work with numbers whose magnitude is\naround 1. We can often scale the values in our problem in order to do\nthis. I.e., change the units of your variables. Instead of personal\nincome in dollars, use personal income in thousands or hundreds of\nthousands of dollars.\n\n# 4. Matrix factorizations (decompositions) and solving systems of linear equations\n\nSuppose we want to solve the following linear system: \n\n$$\\begin{aligned} Ax & = & b\\\\ x & = & A^{-1}b \\end{aligned}$$ \n\n\nNumerically, this is never done by\nfinding the inverse and multiplying. Rather we solve the system using a\nmatrix decomposition (or equivalent set of steps). One approach uses\nGaussian elimination (equivalent to the LU decomposition), while another\nuses the Cholesky decomposition. There are also iterative methods that\ngenerate a sequence of approximations to the solution but reduce\ncomputation (provided they are stopped before the exact solution is\nfound).\n\nGentle-CS has a nice table overviewing the various factorizations (Table\n5.1, page 219). 
I've reproduced a variation on it here.\n\n\n\n| Name | Representation | Restrictions | Properties | Uses |\n| ----------| ----------------------| ------------| --------------------|-------------|\n| LU | $A_{nn}= L_{nn}U_{nn}$ | $A$ generally square | $L$ lower triangular; $U$ upper triangular | solving equations; inversion | \n| QR | $A_{nm}= Q_{nn}R_{nm}$ or $A_{nm}=Q_{nm}R_{mm}$(skinny) | | $Q$ orthogonal; $R$ upper triangular | regression |\n| Cholesky | $A_{nn}=U_{nn}^{\\top}U_{nn}$ | $A$ positive (semi-) definite | $U$ upper triangular | multivariate normal; covariance; solving equations; inversion |\n| Eigen decomposition | $A_{nn}=\\Gamma_{nn}\\Lambda_{nn}\\Gamma_{nn}^{\\top}$ | $A$ square, symmetric*| $\\Gamma$ orthogonal; $\\Lambda$ (non-negative**) diagonal | principal components analysis and related | \n| SVD | $A_{nm}=U_{nn}D_{nm}V_{mm}^{\\top}$ or $A_{nm}= U_{nk}D_{kk}V_{mk}^{\\top}$ | | $U, V$ orthogonal; $D$ (non-negative) diagonal | machine learning, topic models |\n\nTable: Matrix factorizations useful for statistics / data science / machine learning\n\n*For the eigen decomposition, I assume $A$ is symmetric, though there\nis a decomposition for non-symmetric $A$.\n\n** For positive definite or positive semi-definite $A$.\n\n## Triangular systems\n\nAs a preface, let's figure out how to solve $Ax=b$ if $A$ is upper\ntriangular. The basic algorithm proceeds from the bottom up (and\ntherefore is called a 'backsolve'. We solve for $x_{n}$ trivially, and\nthen move upwards plugging in the known values of $x$ and solving for\nthe remaining unknown in each row (each equation).\n\n1. $x_{n}=b_{n}/A_{nn}$\n2. Now for $kp$ then the leading $p$ rows of $R$ provide an upper triangular\nmatrix ($R_{1}$) and the remaining rows are 0. (I'm using $p$ because\nthe QR is generally applied to design matrices in regression). In this\ncase we really only need the first $p$ columns of $Q$, and we have\n$X=Q_{1}R_{1}$, the 'skinny' QR (this is what R's QR provides). For\nuniqueness, we can require the diagonals of $R$ to be nonnegative, and\nthen $R$ will be the same as the upper-triangular Cholesky factor of\n$X^{\\top}X$: \n\n$$\\begin{aligned} X^{\\top}X & = & R^{\\top}Q^{\\top}QR \\\\ & = & R^{\\top}R\\end{aligned}.$$ \n\n\nThere are three standard approaches for\ncomputing the QR, using (1) reflections (Householder transformations),\n(2) rotations (Givens transformations), or (3) Gram-Schmidt\northogonalization (see below for details).\n\nFor $n\\times n$ $X$, the QR (for the Householder approach) requires\n$2n^{3}/3$ flops, so QR is less efficient than LU or Cholesky.\n\nWe can also obtain the pseudo-inverse of $X$ from the QR:\n$X^{+}=[R_{1}^{-1}\\,0]Q^{\\top}$. In the case that $X$ is not full-rank,\nthere is a version of the QR that will work (involving pivoting) and we\nend up with some additional zeroes on the diagonal of $R_{1}$.\n\n### Regression and the QR\n\nOften QR is used to fit linear models, including in R. Consider the\nlinear model in the form $Y=X\\beta+\\epsilon$, finding\n$\\hat{\\beta}=(X^{\\top}X)^{-1}X^{\\top}Y$. Let's consider the skinny QR\nand note that $R^{\\top}$ is invertible. Therefore, we can express the\nnormal equations as \n\n$$\n\\begin{aligned} \nX^{\\top}X\\beta & = & X^{\\top} Y \\\\\nR^{\\top}Q^{\\top}QR\\beta & = & R^{\\top}Q^{\\top} Y \\\\\nR \\beta & = & Q^{\\top} Y\n\\end{aligned}\n$$ \n\n\nand solving for $\\beta$ is just a\nbacksolve since $R$ is upper-triangular. 
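\n\nHere's a sketch of those steps (using `scipy.linalg.solve_triangular` for\nthe backsolve, with a small simulated $X$ and $Y$ just for illustration):\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\nfrom scipy.linalg import solve_triangular\n\nrng = np.random.default_rng(1)\nn = 100\nX = np.column_stack((np.ones(n), rng.normal(size = (n, 2))))\nY = X @ np.array([1., 2., -1.]) + rng.normal(size = n)\n\nQ, R = np.linalg.qr(X)                  # skinny QR\nbeta = solve_triangular(R, Q.T @ Y)     # backsolve R beta = Q'Y\n\nnp.allclose(beta, np.linalg.lstsq(X, Y, rcond = None)[0])\n```\n:::\n\n\n\n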
Furthermore the standard\nregression quantities, such as the hat matrix, the SSE, the residuals,\netc. can be easily expressed in terms of $Q$ and $R$.\n\nWhy use the QR instead of the Cholesky on $X^{\\top}X$? The condition\nnumber of $X$ is the square root of that of $X^{\\top}X$, and the $QR$\nfactorizes $X$. Monahan has a discussion of the condition of the\nregression problem, but from a larger perspective, the situations where\nnumerical accuracy is a concern are generally cases where the OLS\nestimators are not particularly helpful anyway (e.g., highly collinear\npredictors).\n\nWhat about computational order of the different approaches to least\nsquares? The Cholesky is $np^{2}+\\frac{1}{3}p^{3}$, an algorithm called\nsweeping is $np^{2}+p^{3}$ , the Householder method for QR is\n$2np^{2}-\\frac{2}{3}p^{3}$, and the modified Gram-Schmidt approach for\nQR is $2np^{2}$. So if $n\\gg p$ then Cholesky (and sweeping) are faster\nthan the QR approaches. According to Monahan, modified Gram-Schmidt is\nmost numerically stable and sweeping least. In general, regression is\npretty quick unless $p$ is large since it is linear in $n$, so it may\nnot be worth worrying too much about computational differences of the\nsort noted here.\n\n### Regression and the QR in Python and R\n\nWe can get the Q and R matrices easily in Python.\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nQ,R = np.linalg.qr(X)\n```\n:::\n\n\n\nOne of the methods used by the [`statsmodel` package in Python\nuses the QR to fit a regression](https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.fit.html).\n\nNote that by default in Python (and in R), you get the skinny QR, namely only the\nfirst $p$ rows of $R$ and the first $p$ columns of $Q$, where the latter\nform an orthonormal basis for the column space of $X$. The remaining\ncolumns form an orthonormal basis for the null space of $X$ (the space\northogonal to the column space of $X$). The analogy in regression is\nthat we get the basis vectors for the regression, while adding the\nremaining columns gives us the full $n$-dimensional space of the\nobservations.\n\nRegression in R uses the QR decomposition via `qr()`, which calls a\nFortran function. `qr()` (and the Fortran functions that are called) is\nspecifically designed to output quantities useful in fitting linear\nmodels. \n\nIn R, `qr()` returns the result as a list meant for use by other tools. R\nstores the $R$ matrix in the upper triangle of `\\$qr`, while the lower\ntriangle of *\\$qr* and *\\$aux* store the information for constructing\n$Q$ (this relates to the Householder-related vectors $u$ below). One can\nmultiply by $Q$ using *qr.qy()* and by $Q^{\\top}$ using *qr.qty()*. If\nyou want to extract $R$ and $Q$, the following will work:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nX.qr = qr(X)\nQ = qr.Q(X.qr)\nR = qr.R(X.qr) \n```\n:::\n\n\n\nAs a side note, there are QR-based functions that provide\nregression-related quantities, such as *qr.resid()*, *qr.fitted()* and\n*qr.coef()*. These functions (and their Fortran counterparts) exist\nbecause one can work through the various regression quantities of\ninterest and find their expressions in terms of $Q$ and $R$, with nice\nproperties resulting from $Q$ being orthogonal and $R$ triangular.\n\n### Computing the QR decomposition\n\nHere we'll see some of the details of the different approaches to the\nQR, in part because they involve some concepts that may be useful in\nother contexts. 
I won't expect you to see all of how this works, but\nplease skim through this to get an idea of how things are done.\n\nOne approach involves reflections of vectors and a second rotations of\nvectors. Reflections and rotations are transformations that are\nperformed by orthogonal matrices. The determinant of a reflection matrix\nis -1 and the determinant of a rotation matrix is 1. We'll see some of\nthe details in the demo code.\n\n#### QR Method 1: Reflections\n\nIf $u$ and $v$ are orthonormal vectors and $x$ is in the space spanned\nby $u$ and $v$, $x=c_{1}u+c_{2}v$, then $\\tilde{x}=-c_{1}u+c_{2}v$ is a\nreflection (a *Householder* reflection) along the $u$ dimension (since\nwe are using the negative of that basis vector). We can think of this as\nreflecting across the plane perpendicular to $u$. This extends simply to\nhigher dimensions with orthonormal vectors, $u,v_{1},v_{2},\\ldots$\n\nSuppose we want to formulate the reflection in terms of a \"Householder\"\nmatrix, $Q$. It turns out that $$Qx=\\tilde{x}$$ if $Q=I-2uu^{\\top}$. $Q$\nhas the following properties: (1) $Qu=-u$, (2) $Qv=v$ for $u^{\\top}v=0$,\n(3) $Q$ is orthogonal and symmetric.\n\nOne way to create the QR decomposition is by a series of Householder\ntransformations that create an upper triangular $R$ from $X$:\n\n$$\n\\begin{aligned}\nR & = & Q_{p}\\cdots Q_{1} X \\\\\nQ & = & (Q_{p}\\cdots Q_{1})^{\\top}\n\\end{aligned}\n$$ \n\n\nwhere we make use of\nthe symmetry in defining $Q$.\n\nBasically $Q_{1}$ reflects the first column of $X$ with respect to a\ncarefully chosen $u$, so that the result is all zeroes except for the\nfirst element. We want $Q_{1}x=\\tilde{x}=(||x||,0,\\ldots,0)$. This can\nbe achieved with $u=\\frac{x-\\tilde{x}}{||x-\\tilde{x}||}$. Then $Q_{2}$\nmakes the last $n-2$ rows of the second column equal to zero. We'll work\nthrough this a bit in class.\n\nIn the regression context, as we work through the individual\ntransformations, $Q_{j}=I-2u_{j}u_{j}^{\\top}$, we apply them to $X$ and\n$Y$ to create $R$ (note this would not involve doing the full matrix\nmultiplication - think about what calculations are actually needed) and\n$QY=Q^{\\top}Y$, and then solve $R\\beta=Q^{\\top}Y$. To find\n$\\mbox{Cov}(\\hat{\\beta})\\propto(X^{\\top}X)^{-1}=(R^{\\top}R)^{-1}=R^{-1}R^{-\\top}$\nwe do need to invert $R$, but it's upper-triangular and of dimension\n$p\\times p$. It turns out that $Q^{\\top}Y$ can be partitioned into the\nfirst $p$ and the last $n-p$ elements, $z^{(1)}$ and $z^{(2)}$. The SSR\nis $\\|z^{(1)}\\|^{2}$ and SSE is $\\|z^{(2)}\\|^{2}$.\n\nFinal side note: if $X$ is square (so $n=p)$ you might wonder why we\nneed $Q_{p}$ since after $p-1$ reflections, we don't need to zero\nanything else out (since the last column of $R$ has $n$ non-zero\nelements). It turns out that if we go back to thinking about a\nHouseholder reflection in general, there is a lack of uniqueness in\nchoosing $\\tilde{x}$. It could either be $(||x||,0,\\ldots,0)$ or\n$(-||x||,0,\\ldots,0)$. For better numerical stability, one chooses from\nthe two of those such that $x_{1}$ is of the opposite sign to\n$\\tilde{x}_{1}$, so that one avoids cancellation of numbers that may be\nof the same magnitude when doing $x-\\tilde{x}$. The transformation\n$Q_{p}$ is the last step of taking that approach of choosing the sign at\neach step. $Q_{p}$ doesn't zero anything out; it just basically just\ninvolves potentially setting $R_{pp}$ to be $-R_{pp}$. 
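\n\nHere's a small sketch of a single Householder reflection that zeroes out\nall but the first element of a vector (ignoring the sign-choice refinement\njust discussed):\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\nx = np.array([3., 1., 2.])\nxtilde = np.zeros_like(x)\nxtilde[0] = np.linalg.norm(x)          # target: (||x||, 0, 0)\n\nu = (x - xtilde) / np.linalg.norm(x - xtilde)\nQ1 = np.eye(3) - 2 * np.outer(u, u)    # Householder matrix\n\nQ1 @ x                                 # approximately (||x||, 0, 0)\n```\n:::\n\n\n\n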
(To be honest,\nI'm not clear on why one would bother to do that last step, but that\nseems to be how it is presented in discussions of the Householder\napproach.) Of course in the case of $p2$, find interim vectors, $x_{k}^{(2)}$, by\n orthogonalizing with respect to $\\tilde{x}_{1}$\n\n3. Proceed for $k=3,\\ldots$, in turn orthogonalizing and normalizing\n the first of the remaining vectors w.r.t. $\\tilde{x}_{k-1}$ and\n orthogonalizing the remaining vectors w.r.t. $\\tilde{x}_{k-1}$ to\n get new interim vectors\n\nMathematically, we could instead orthogonalize $x_{2}$ w.r.t.\n$\\tilde{x}_{1}$, then orthogonalize $x_{3}$ w.r.t.\n$\\{\\tilde{x}_{1},\\tilde{x}_{2}\\}$, etc. The algorithm above is the\n*modified* G-S, and is known to be more numerically stable if the\ncolumns of $X$ are close to collinear, giving vectors that are closer to\northogonal. The resulting $\\tilde{x}$ vectors are the columns of $Q$.\nThe elements of $R$ are obtained as we proceed: the diagonal values are\nthe the normalization values in the denominators, while the\noff-diagonals are the inner products with the already-computed columns\nof $Q$ that are computed as part of the numerators.\n\nAnother way to think about this is that $R=Q^{\\top}X$, which is the same\nas regressing the columns of $X$ on $Q,$ since\n$(Q^{\\top}Q)^{-1}Q^{\\top}X=Q^{\\top}X$. By construction, the first column\nof $X$ is a scaling of the first column of $Q$, the second column of $X$\nis a linear combination of the first two columns of $Q$, etc., so $R$\nbeing upper triangular makes sense.\n\n### The \"tall-skinny\" QR\n\nSuppose you have a very large regression problem, with $n$ very large,\nand $n\\gg p$. There is a variant of the QR, called the tall-skinny QR\n(see for details) that allows us\nto find the decomposition in a parallel fashion. The basic idea is to do\na nested set of QR decompositions on blocks of rows of $X$:\n\n$$\nX = \\left( \\begin{array}{c}\nX_{0} \\\\\nX_{1} \\\\\nX_{2} \\\\\nX_{3}\n\\end{array}\n\\right) =\n\\left(\n\\begin{array}{c}\nQ_{0} R_{0} \\\\\nQ_{1} R_{1} \\\\\nQ_{2} R_{2} \\\\\nQ_{3} R_{3}\n\\end{array} \\right),\n$$\n\nfollowed by 'reduction' steps (this\ncan be done in a map-reduce context) that do the $QR$ of pairs of the\n$R$ factors: \n$$\\left(\\begin{array}{c}\nR_{0}\\\\\nR_{1}\\\\\nR_{2}\\\\\nR_{3}\n\\end{array}\\right)=\\left(\\begin{array}{c}\n\\left(\\begin{array}{c}\nR_{0}\\\\\nR_{1}\n\\end{array}\\right)\\\\\n\\left(\\begin{array}{c}\nR_{2}\\\\\nR_{3}\n\\end{array}\\right)\n\\end{array}\\right)=\\left(\\begin{array}{c}\nQ_{01}R_{01}\\\\\nQ_{23}R_{23}\n\\end{array}\\right)$$ and $$\\left(\\begin{array}{c}\nR_{01}\\\\\nR_{23}\n\\end{array}\\right)=Q_{0123}R_{0123}.$$ \n\nThe full decomposition is then\n\n$$X=\\left( \\begin{array}{cccc} Q_{0} & 0 & 0 & 0 \\\\ 0 & Q_{1} & 0 & 0 \\\\ 0 & 0 & Q_{2} & 0 \\\\ 0 & 0 & 0 & Q_{3} \\end{array} \\right) \\left( \\begin{array}{cc} Q_{01} & 0 \\\\ 0 & Q_{23} \\end{array} \\right) Q_{0123} R_{0123} = QR.$$\n\n\nThe computation can be done in\nparallel (in particular it can be done with map-reduce) and the $Q$\nmatrix for big problems would generally not be computed explicitly but\nwould be stored in its constituent pieces.\n\nAlternatively, there is a variant on the algorithm that processes the\nrow-blocks of $X$ serially, allowing you to do QR on a large tall-skinny\nmatrix that you can't fit in memory (or possibly even on disk). First\nyou do $QR$ on $X_{0}$ to get $Q_{0}R_{0}$. Then you stack $R_{0}$ on\ntop of $X_{1}$ and do QR to get $R_{01}$. 
Then stack $R_{01}$ on top of\n$X_{2}$ to get $R_{012}$, etc.\n\n## Determinants\n\nThe absolute value of the determinant of a square matrix can be found\nfrom the product of the diagonals of the triangular matrix in any\nfactorization that gives a triangular (including diagonal) matrix times\nan orthogonal matrix (or matrices) since the determinant of an\northogonal matrix is either one or minus one.\n\n$|A|=|QR|=|Q||R|=\\pm|R|$\n\n$|A^{\\top}A|=|(QR)^{\\top}QR|=|R^{\\top}R|=|R_{1}^{\\top}R_{1}|=|R_{1}|^{2}$\\\nIn Python, the following will do it (on the log scale).\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nQ, R = np.linalg.qr(A)\nmagn = np.sum(np.log(np.abs(np.diag(R)))) \n```\n:::\n\n\n\nAn alternative is the product of the diagonal elements of $D$ (the\nsingular values) in the SVD factorization, $A=UDV^{\\top}$.\n\nFor non-negative definite matrices, we know the determinant is\nnon-negative, so the uncertainty about the sign is not an issue. For\npositive definite matrices, a good approach is to use the product of the\ndiagonal elements of the Cholesky decomposition.\n\nOne can also use the product of the eigenvalues:\n$|A|=|\\Gamma\\Lambda\\Gamma^{-1}|=|\\Gamma||\\Gamma^{-1}||\\Lambda|=|\\Lambda|$\n\n#### Computation\n\nComputing from any of these diagonal or triangular matrices as the\nproduct of the diagonals is prone to overflow and underflow, so we\n**always** work on the log scale as the sum of the log of the values.\nWhen some of these may be negative, we can always keep track of the\nnumber of negative values and take the log of the absolute values.\n\nOften we will have the factorization as a result of other parts of the\ncomputation, so we get the determinant for free.\n\nWe can use `np.linalg.slogdet()` (which returns the sign and the log of the\nabsolute value of the determinant) or (definitely not recommended)\n`np.linalg.det()` to calculate the determinant in Python.\nThese functions use the LU decomposition.\n\n\n# 5. Eigendecomposition and SVD\n\n## Eigendecomposition \n\nThe eigendecomposition (spectral decomposition) is useful in considering\nconvergence of algorithms and of course for statistical decompositions\nsuch as PCA. We think of decomposing the components of variation into\northogonal patterns (the eigenvectors) with variances (eigenvalues)\nassociated with each pattern.\n\nSquare symmetric matrices have real eigenvectors and eigenvalues, with\nthe factorization into orthogonal $\\Gamma$ and diagonal $\\Lambda$,\n$A=\\Gamma\\Lambda\\Gamma^{\\top}$, where the eigenvalues on the diagonal of\n$\\Lambda$ are ordered in decreasing value. Of course this is equivalent\nto the definition of an eigenvalue/eigenvector pair as a pair such that\n$Ax=\\lambda x$ where $x$ is the eigenvector and $\\lambda$ is a scalar,\nthe eigenvalue. The inverse of the eigendecomposition is simply\n$\\Gamma\\Lambda^{-1}\\Gamma^{\\top}$. On a similar note, we can create a\nsquare root matrix, $\\Gamma\\Lambda^{1/2}$, by taking the square roots of\nthe eigenvalues.\n\nThe spectral radius of $A$, denoted $\\rho(A)$, is the maximum of the\nabsolute values of the eigenvalues. As we saw when talking about\nill-conditionedness, for symmetric matrices, this maximum is the induced\nnorm, so we have $\\rho(A)=\\|A\\|_{2}$. It turns out that\n$\\rho(A)\\leq\\|A\\|$ for any induced matrix norm. The spectral radius\ncomes up in determining the rate of convergence of some iterative\nalgorithms.\n\n#### Computation\n\nThere are several methods for eigenvalues; a common one for doing the\nfull eigendecomposition is the *QR algorithm*. 
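In practice, of course, we simply call a routine that wraps compiled LAPACK code, such as `np.linalg.eigh` for symmetric matrices. Here's a small check I've added that the pieces behave as described above (the variable names and the random matrix are mine):

::: {.cell}

```{.python .cell-code}
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(4, 4))
A = Z + Z.T                        # a symmetric matrix

evals, Gamma = np.linalg.eigh(A)   # eigh is for symmetric/Hermitian matrices
np.allclose(Gamma @ np.diag(evals) @ Gamma.T, A)           # A = Gamma Lambda Gamma^T
np.allclose(np.max(np.abs(evals)), np.linalg.norm(A, 2))   # spectral radius equals ||A||_2
```
:::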
The first step is to\nreduce $A$ to upper Hessenburg form, which is an upper triangular matrix\nexcept that the first subdiagonal in the lower triangular part can be\nnon-zero. For symmetric matrices, the result is actually tridiagonal. We\ncan do the reduction using Householder reflections or Givens rotations.\nAt this point the QR decomposition (using Givens rotations) is applied\niteratively (to a version of the matrix in which the diagonals are\nshifted), and the result converges to a diagonal matrix, which provides\nthe eigenvalues. It's more work to get the eigenvectors, but they are\nobtained as a product of Householder matrices (required for the initial\nreduction) multiplied by the product of the $Q$ matrices from the\nsuccessive QR decompositions.\n\nWe won't go into the algorithm in detail, but note that it involves\nmanipulations and ideas we've seen already.\n\nIf only the largest (or the first few largest) eigenvalues and their\neigenvectors are needed, which can come up in time series and Markov\nchain contexts, the problem is easier and can be solved by the *power\nmethod*. E.g., in a Markov chain context, steady state is reached\nthrough $x_{t}=A^{t}x_{0}$. One can find the largest eigenvector by\nmultiplying by $A$ many times, normalizing at each step.\n$v^{(k)}=Az^{(k-1)}$ and $z^{(k)}=v^{(k)}/\\|v^{(k)}\\|$. There is an\nextension to find the $p$ largest eigenvalues and their vectors. See the\ndemo code in the qmd source file for an implementation (in R).\n\n\n\n\n\n\n\n## Singular value decomposition\n\nLet's consider an $n\\times m$ matrix, $A$, with $n\\geq m$ (if $m>n$, we\ncan always work with $A^{\\top})$. This often is a matrix representing\n$m$ features of $n$ observations. We could have $n$ documents and $m$\nwords, or $n$ gene expression levels and $m$ experimental conditions,\netc. $A$ can always be decomposed as $$A=UDV^{\\top}$$ where $U$ and $V$\nare matrices with orthonormal columns (left and right eigenvectors) and\n$D$ is diagonal with non-negative values (which correspond to\neigenvalues in the case of square $A$ and to squared eigenvalues of\n$A^{\\top}A$).\n\nThe SVD can be represented in more than one way. One representation is\n$$A_{n\\times m}=U_{n\\times k}D_{k\\times k}V_{k\\times m}^{\\top}=\\sum_{j=1}^{k}D_{jj}u_{j}v_{j}^{\\top}$$\nwhere $u_{j}$ and $v_{j}$ are the columns of $U$ and $V$ and where $k$\nis the rank of $A$ (which is at most the minimum of $n$ and $m$ of\ncourse). The diagonal elements of $D$ are the singular values.\n\nThat representation is as the sum of rank-one matrices (since\neach term is the scaled outer product of two vectors). \n\nIf $A$ is positive semi-definite, the eigendecomposition is an SVD.\nFurthermore, $A^{\\top}A=VD^{2}V^{\\top}$ and $AA^{\\top}=UD^{2}U^{\\top}$,\nso we can find the eigendecomposition of such matrices using the SVD of\n$A$ (for $AA^{\\top}$ we need to fill out $U$ to have $n$ columns). 
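Here's a quick numerical check of that correspondence (an example I'm adding; the matrix is just a small random one):

::: {.cell}

```{.python .cell-code}
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 3))

U, d, Vt = np.linalg.svd(A, full_matrices=False)
evals = np.linalg.eigh(A.T @ A)[0]        # eigh returns eigenvalues in increasing order

np.allclose(np.sort(d**2), evals)         # squared singular values = eigenvalues of A^T A
np.allclose(A, U @ np.diag(d) @ Vt)       # U D V^T reconstructs A
```
:::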
Note\nthat the squares of the singular values of $A$ are the eigenvalues of\n$A^{\\top}A$ and $AA^{\\top}$.\n\nWe can also fill out the matrices to get\n$$A=U_{n\\times n}D_{n\\times m}V_{m\\times m}^{\\top}$$ where the added\nrows and columns of $D$ are zero with the upper left block the\n$D_{k\\times k}$ from above.\n\n#### Uses\n\nThe SVD is an excellent way to determine a matrix rank and to construct\na pseudo-inverse ($A^{+}=VD^{+}U^{\\top})$.\n\nWe can use the SVD to approximate $A$ by taking\n$A\\approx\\tilde{A}=\\sum_{j=1}^{p}D_{jj}u_{j}v_{j}^{\\top}$ for $pj+p$ and the upper bandwidth is $q$ ($A_{ij}=0$\nfor $j>i+q$). An alternative to reducing to $Ux=b^{*}$ is to compute\n$A=LU$ and then do two solutions, $U^{-1}(L^{-1}b)$. One can show that\nthe computational complexity of the LU factorization is $O(npq)$ for\nbanded matrices, while solving the two triangular systems is $O(np+nq)$,\nso for small $p$ and $q$, the speedup can be dramatic.\n\nBanded matrices come up in time series analysis. E.g., moving average (MA) models produce\nbanded covariance structures because the covariance is zero after a\ncertain number of lags.\n\n## Low rank updates (optional)\n\nA transformation of the form $A-uv^{\\top}$ is a rank-one update because\n$uv^{\\top}$ is of rank one.\n\nMore generally a low rank update of $A$ is $\\tilde{A}=A-UV^{\\top}$ where\n$U$ and $V$ are $n\\times m$ with $n\\geq m$. The\nSherman-Morrison-Woodbury formula tells us that\n$$\\tilde{A}^{-1}=A^{-1}+A^{-1}U(I_{m}-V^{\\top}A^{-1}U)^{-1}V^{\\top}A^{-1}$$\nso if we know $x_{0}=A^{-1}b$, then the solution to $\\tilde{A}x=b$ is\n$x+A^{-1}U(I_{m}-V^{\\top}A^{-1}U)^{-1}V^{\\top}x$. Provided $m$ is not\ntoo large, and particularly if we already have a factorization of $A$,\nthen $A^{-1}U$ is not too bad computationally, and\n$I_{m}-V^{\\top}A^{-1}U$ is $m\\times m$. As a result\n$A^{-1}(U(\\cdots)^{-1}V^{\\top}x)$ isn't too bad.\n\nThis also comes up in working with precision matrices in Bayesian\nproblems where we may have $A^{-1}$ but not $A$ (we often add precision\nmatrices to find conditional normal distributions). An alternative\nexpression for the formula is $\\tilde{A}=A+UCV^{\\top}$, and the identity\ntells us\n$$\\tilde{A}^{-1}=A^{-1}-A^{-1}U(C^{-1}+V^{\\top}A^{-1}U)^{-1}V^{\\top}A^{-1}$$\n\nBasically Sherman-Morrison-Woodbury gives us matrix identities that we\ncan use in combination with our knowledge of smart ways of solving\nsystems of equations.\n\n# 7. Iterative solutions of linear systems (optional)\n\n#### Gauss-Seidel\n\nSuppose we want to iteratively solve $Ax=b$. Here's the algorithm, which\nsequentially updates each element of $x$ in turn.\n\n- Start with an initial approximation, $x^{(0)}$.\n- Hold all but $x_{1}^{(0)}$ constant and solve to find\n $x_{1}^{(1)}=\\frac{1}{a_{11}}(b_{1}-\\sum_{j=2}^{n}a_{1j}x_{j}^{(0)})$.\n- Repeat for the other rows of $A$ (i.e., the other elements of $x$),\n finding $x^{(1)}$.\n- Now iterate to get $x^{(2)}$, $x^{(3)}$, etc. until a convergence\n criterion is achieved, such as $\\|x^{(k)}-x^{(k-1)}\\|\\leq\\epsilon$\n or $\\|r^{(k)}-r^{(k-1)}\\|\\leq\\epsilon$ for $r^{(k)}=b-Ax^{(k)}$.\n\nLet's consider how many operations are involved in a single update:\n$O(n)$ for each element, so $O(n^{2})$ for each update. 
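Here's a bare-bones sketch of Gauss-Seidel (my own illustration, not production code); it assumes the diagonal elements of $A$ are nonzero and that the iteration converges, which is guaranteed, e.g., for a strictly diagonally dominant $A$:

::: {.cell}

```{.python .cell-code}
import numpy as np

def gauss_seidel(A, b, x0, tol=1e-8, max_iter=1000):
    """Iteratively solve Ax = b, updating one element of x at a time."""
    x = x0.astype(float).copy()
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(len(b)):
            # Newest values of x[j] for j < i; previous values for j > i.
            x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
        if np.linalg.norm(x - x_old) < tol:
            break
    return x

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5)) + 20 * np.eye(5)   # strongly diagonally dominant, so GS converges
b = rng.normal(size=5)
np.allclose(gauss_seidel(A, b, np.zeros(5)), np.linalg.solve(A, b))
```
:::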
Thus if we can\nstop well before $n$ iterations, we've saved computation relative to\nexact methods.\n\nIf we decompose $A=L+D+U$ where $L$ is strictly lower triangular, $U$ is\nstrictly upper triangular, then Gauss-Seidel is equivalent to solving\n$$(L+D)x^{(k+1)}=b-Ux^{(k)}$$ and we know that solving the lower\ntriangular system is $O(n^{2})$.\n\nIt turns out that the rate of convergence depends on the spectral radius\nof $(L+D)^{-1}U$.\n\nGauss-Seidel amounts to optimizing by moving in axis-oriented\ndirections, so it can be slow in some cases.\n\n#### Conjugate gradient\n\nFor positive definite $A$, conjugate gradient (CG) reexpresses the\nsolution to $Ax=b$ as an optimization problem, minimizing\n$$f(x)=\\frac{1}{2}x^{\\top}Ax-x^{\\top}b,$$ since the derivative of $f(x)$\nis $Ax-b$ and at the minimum this gives $Ax-b=0$.\n\nInstead of finding the minimum by following the gradient at each step\n(so-called steepest descent, which can give slow convergence - we'll see\na demonstration of this in the optimization unit), CG chooses directions\nthat are mutually conjugate w.r.t. $A$, $d_{i}^{\\top}Ad_{j}=0$ for\n$i\\ne j$. The method successively chooses vectors giving the direction,\n$d_{k}$, in which to move down towards the minimum and a scaling of how\nmuch to move, $\\alpha_{k}$. If we start at $x_{(0)}$, the $k$th point we\nmove to is $x_{(k)}=x_{(k-1)}+\\alpha_{k}d_{k}$ so we have\n$$x_{(k)}=x_{(0)}+\\sum_{j\\leq k}\\alpha_{j}d_{j}$$ and we use a\nconvergence criterion such as given above for Gauss-Seidel. The\ndirections are chosen to be the residuals, $b-Ax_{(k)}$. Here's the\nbasic algorithm:\n\n- Choose $x_{(0)}$ and define the residual, $r_{(0)}=b-Ax_{(0)}$ (the\n error on the scale of $b$) and the direction, $d_{0}=r_{(0)}$ and\n set $k=0$.\n\n- Then iterate:\n\n - $\\alpha_{k}=\\frac{r_{(k)}^{\\top}r_{(k)}}{d_{k}^{\\top}Ad_{k}}$\n (choose step size so next error will be orthogonal to current\n direction - which we can express in terms of the residual, which\n is easily computable)\n\n - $x_{(k+1)}=x_{(k)}+\\alpha_{k}d_{k}$ (update current value)\n\n - $r_{(k+1)}=r_{(k)}-\\alpha_{k}Ad_{k}$ (update current residual)\n\n - $d_{k+1}=r_{(k+1)}+\\frac{r_{(k+1)}^{\\top}r_{(k+1)}}{r_{(k)}^{\\top}r_{(k)}}d_{k}$\n (choose next direction by conjugate Gram-Schmidt, starting with\n $r_{(k+1)}$ and removing components that are not $A$-orthogonal\n to previous directions, but it turns out that $r_{(k+1)}$ is\n already $A$-orthogonal to all but $d_{k}$).\n\n- Stop when $\\|r^{(k+1)}\\|$ is sufficiently small.\n\nThe convergence of the algorithm depends in a complicated way on the\neigenvalues, but in general convergence is faster when the condition\nnumber is smaller (the eigenvalues are not too spread out). 
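And here's that algorithm translated directly into NumPy (a sketch I've written for illustration; for real problems you'd typically use a library routine such as `scipy.sparse.linalg.cg` and supply $A$ only through matrix-vector products):

::: {.cell}

```{.python .cell-code}
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
    """Solve Ax = b for symmetric positive definite A by conjugate gradients."""
    x = x0.astype(float).copy()
    r = b - A @ x                      # initial residual
    d = r.copy()                       # initial direction
    max_iter = len(b) if max_iter is None else max_iter
    for _ in range(max_iter):
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)     # step size
        x = x + alpha * d
        r_new = r - alpha * Ad         # updated residual
        if np.linalg.norm(r_new) < tol:
            break
        d = r_new + (r_new @ r_new) / (r @ r) * d   # next A-conjugate direction
        r = r_new
    return x

rng = np.random.default_rng(1)
Z = rng.normal(size=(8, 8))
A = Z @ Z.T + 8 * np.eye(8)            # positive definite
b = rng.normal(size=8)
np.allclose(conjugate_gradient(A, b, np.zeros(8)), np.linalg.solve(A, b))
```
:::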
CG will in\nprinciple give the exact answer in $n$ steps (where $A$ is $n\\times n$).\nHowever, computationally we lose accuracy and interest in the algorithm\nis really as an iterative approximation where we stop before $n$ steps.\nThe approach basically amounts to moving in axis-oriented directions in\na space stretched by $A$.\n\nIn general, CG is used for large sparse systems.\n\nSee the [extensive description from\nShewchuk](http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf)\nfor more details, as well as the use\nof CG when $A$ is not positive definite.\n\n#### Updating a solution\n\nSometimes we have solved a system, $Ax=b$ and then need to solve $Ax=c$.\nIf we have solved the initial system using a factorization, we can reuse\nthat factorization and solve the new system in $O(n^{2})$. Iterative\napproaches can do a nice job if $c=b+\\delta b$. Start with the solution\n$x$ for $Ax=b$ as $x^{(0)}$ and use one of the methods above.\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/units/unit10-linalg/execute-results/tex.json b/_freeze/units/unit10-linalg/execute-results/tex.json index 9658164..a6242b7 100644 --- a/_freeze/units/unit10-linalg/execute-results/tex.json +++ b/_freeze/units/unit10-linalg/execute-results/tex.json @@ -1,7 +1,7 @@ { - "hash": "04df3b9264ae4187721edd4da62e08c5", + "hash": "d1d0a6f17e0930b47afaca760a06ac18", "result": { - "markdown": "---\ntitle: \"Numerical linear algebra\"\nauthor: \"Chris Paciorek\"\ndate: \"2023-10-24\"\nformat:\n pdf:\n documentclass: article\n margin-left: 30mm\n margin-right: 30mm\n toc: true\n html:\n theme: cosmo\n css: ../styles.css\n toc: true\n code-copy: true\n code-block-background: true\nexecute:\n freeze: auto\nfrom: markdown+tex_math_single_backslash\n---\n\n\n\n[PDF](./unit10-linalg.pdf){.btn .btn-primary}\n\nReferences:\n\n- Gentle: Numerical Linear Algebra for Applications in Statistics\n (available via UC Library Search) (my notes here are based primarily\n on this source) [Gentle-NLA]\n - Gentle: Matrix Algebra also has much of this material.\n- Gentle: Computational Statistics [Gentle-CS]\n- Lange: Numerical Analysis for Statisticians\n- Monahan: Numerical Methods of Statistics\n\nVideos (optional): \n\nThere are various videos from 2020 in the bCourses Media Gallery that you\ncan use for reference if you want to. \n\n - Video 1. Ill-conditioned problems, part 1\n - Video 2. Ill-conditioned problems, part 2\n - Video 3. Triangular systems of equations\n - Video 4. Solving systems of equations via LU, part 1\n - Video 5. Solving systems of equations via LU, part 2\n - Video 6. Solving systems of equations via LU, part 3\n - Video 7. Cholesky decomposition\n\n\nIn working through how to compute something or understanding an\nalgorithm, it can be very helpful to depict the matrices and vectors\ngraphically. We'll see this on the board in class.\n\n# 1. Preliminaries\n\n## Context\n\nMany statistical and machine learning methods involve linear algebra of\nsome sort - at the very least matrix multiplication and very often some\nsort of matrix decomposition to fit models and do analysis: linear\nregression, various more sophisticated forms of regression, deep neural\nnetworks, principle components analysis (PCA) and the wide varieties of\ngeneralizations and variations on PCA, etc., etc.\n\n## Goals\n\nHere's what I'd like you to get out of this unit:\n\n1. How to think about the computational order (number of computations\n involved) of a problem\n2. 
How to choose a computational approach to a given linear algebra\n calculation you need to do.\n3. An understanding of how issues with computer numbers (Unit 8) affect\n linear algebra calculations.\n\n## Key principle\n\n**The form of a mathematical expression and how it should be evaluated\non a computer may be very different.** Better computational approaches\ncan increase speed and improve the numerical properties of the\ncalculation.\n\n- Example 1 (already seen in Unit 5): If $X$ and $Y$ are matrices and $z$\nis a vector, we should compute $X(Yz)$ rather than $(XY)z$; the former\nis much more computationally efficient. \n- Example 2: We do not compute $(X^{\\top}X)^{-1}X^{\\top}Y$ by computing\n$X^{\\top}X$ and finding its inverse. In fact, perhaps more surprisingly,\nwe may never actually form $X^{\\top}X$ in some implementations.\n- Example 3: Suppose I have a matrix $A$, and I want to permute (switch)\ntwo rows. I can do this with a permutation matrix, $P$, which is mostly\nzeroes. On a computer, in general I wouldn't need to even change the\nvalues of $A$ in memory in some cases (e.g., if I were to calculate\n$PAB$). Why not?\n\n## Computational complexity\n\nWe can assess the computational complexity of a linear algebra\ncalculation by counting the number multiplys/divides and the number of\nadds/subtracts. Sidenote: addition is a bit faster than multiplication,\nso some algorithms attempt to trade multiplication for addition.\n\nIn general we do not try to count the actual number of calculations, but\njust their order, though in some cases in this unit we'll actually get a\nmore exact count. In general, we denote this as $O(f(n))$ which means\nthat the number of calculations approaches $cf(n)$ as $n\\to\\infty$\n(i.e., we know the calculation is approximately proportional to $f(n)$).\nConsider matrix multiplication, $AB$, with matrices of size $a\\times b$\nand $b\\times c$. Each column of the second matrix is multiplied by all\nthe rows of the first. For any given inner product of a row by a column,\nwe have $b$ multiplies. We repeat these operations for each column and\nthen for each row, so we have $abc$ multiplies so $O(abc)$ operations.\nWe could count the additions as well, but there's usually an addition\nfor each multiply, so we can usually just count the multiplys and then\nsay there are such and such {multiply and add}s. This is Monahan's\napproach, but you may see other counting approaches where one counts the\nmultiplys and the adds separately.\n\nFor two symmetric, $n\\times n$ matrices, this is $O(n^{3})$. Similarly,\nmatrix factorization (e.g., the Cholesky decomposition) is $O(n^{3})$\nunless the matrix has special structure, such as being sparse. As\nmatrices get large, the speed of calculations decreases drastically\nbecause of the scaling as $n^{3}$ and memory use increases drastically.\nIn terms of memory use, to hold the result of the multiply indicated\nabove, we need to hold $ab+bc+ac$ total elements, which for symmetric\nmatrices sums to $3n^{2}$. So for a matrix with $n=10000$, we have\n$3\\cdot10000^{2}\\cdot8/1e9=2.4$Gb.\n\nWhen we have $O(n^{q})$ this is known as polynomial time. Much worse is\n$O(b^{n})$ (exponential time), while much better is $O(\\log n$) (log\ntime). 
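To connect this back to Example 1 under the key principle above, here's a quick timing comparison (my own example; the exact numbers will depend on your machine and the BLAS that NumPy is linked against):

::: {.cell}

```{.python .cell-code}
import time
import numpy as np

rng = np.random.default_rng(1)
n = 3000
X = rng.normal(size=(n, n))
Y = rng.normal(size=(n, n))
z = rng.normal(size=n)

t0 = time.time()
res1 = (X @ Y) @ z    # O(n^3): forms an n x n matrix first
t1 = time.time()
res2 = X @ (Y @ z)    # O(n^2): just two matrix-vector products
t2 = time.time()

np.allclose(res1, res2)
t1 - t0, t2 - t1      # the second ordering should be far faster
```
:::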
Computer scientists talk about NP-complete problems; these are\nessentially problems for which there is not a polynomial time\nalgorithm - it turns out all such problems can be rewritten such that\nthey are equivalent to one another.\n\nIn real calculations, it's possible to have the actual time ordering of\ntwo approaches differ from what the order approximations tell us. For\nexample, something that involves $n^{2}$ operations may be faster than\none that involves $1000(n\\log n+n)$ even though the former is $O(n^{2})$\nand the latter $O(n\\log n)$. The problem is that the constant, $c=1000$,\ncan matter (depending on how big $n$ is), as can the extra calculations\nfrom the lower order term(s), in this case $1000n$.\n\nA note on terminology: *flops* stands for both floating point operations\n(the number of operations required) and floating point operations per\nsecond, the speed of calculation.\n\n## Notation and dimensions\n\nI'll try to use capital letters for matrices, $A$, and lower-case for\nvectors, $x$. Then $x_{i}$ is the ith element of $x$, $A_{ij}$ is the\n$i$th row, $j$th column element, and $A_{\\cdot j}$ is the $j$th column\nand $A_{i\\cdot}$ the $i$th row. By default, we'll consider a vector,\n$x$, to be a one-column matrix, and $x^{\\top}$ to be a one-row matrix.\nSome of the references given at the start of this Unit also use $a_{ij}$ for $A_{ij}$ and\n$a_{j}$ for the $j$th column.\n\nThroughout, we'll need to be careful that the matrices involved in an\noperation are conformable: for $A+B$ both matrices need to be of the\nsame dimension, while for $AB$ the number of columns of $A$ must match\nthe number of rows of $B$. Note that this allows for $B$ to be a column\nvector, with only one column, $Ab$. Just checking dimensions is a good\nway to catch many errors. Example: is\n$\\mbox{Cov}(Ax)=A\\mbox{Cov}(x)A^{\\top}$ or\n$\\mbox{Cov}(Ax)=A^{\\top}\\mbox{Cov}(x)A$? Well, if $A$ is $m\\times n$, it\nmust be the former, as the latter is not conformable.\n\nThe **inner product** of two vectors is\n$\\sum_{i}x_{i}y_{i}=x^{\\top}y\\equiv\\langle x,y\\rangle\\equiv x\\cdot y$.\n\nThe **outer product** is $xy^{\\top}$, which comes from all pairwise\nproducts of the elements.\n\nWhen the indices of summation should be obvious, I'll sometimes leave\nthem implicit. Ask me if it's not clear.\n\n## Norms\n\nFor a vector, $\\|x\\|_{p}=(\\sum_{i}|x_{i}|^{p})^{1/p}$ and the standard (Euclidean)\nnorm is $\\|x\\|_{2}=\\sqrt{\\sum x_{i}^{2}}=\\sqrt{x^{\\top}x}$, just the\nlength of the vector in Euclidean space, which we'll refer to as\n$\\|x\\|$, unless noted otherwise.\n\nOne commonly used norm for a matrix is\nthe Frobenius norm, $\\|A\\|_{F}=(\\sum_{i,j}a_{ij}^{2})^{1/2}$.\n\nIn this Unit, we'll often make use of the **induced matrix norm**, which is\ndefined relative to a corresponding vector norm, $\\|\\cdot\\|$, as:\n$$\\|A\\|=\\sup_{x\\ne0}\\frac{\\|Ax\\|}{\\|x\\|}$$ So we have\n$$\\|A\\|_{2}=\\sup_{x\\ne0}\\frac{\\|Ax\\|_{2}}{\\|x\\|_{2}}=\\sup_{\\|x\\|_{2}=1}\\|Ax\\|_{2}$$\nIf you're not familiar with the supremum (\"sup\" above), you can just\nthink of it as taking the maximum. 
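Here's a small numerical illustration (mine) of the induced 2-norm as the largest factor by which $A$ can stretch a vector; the random sampling is just a crude way of approximating the supremum:

::: {.cell}

```{.python .cell-code}
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))

# Crude approximation to sup ||Ax|| / ||x|| using many random directions.
xs = rng.normal(size=(4, 100000))
stretch = np.linalg.norm(A @ xs, axis=0) / np.linalg.norm(xs, axis=0)

stretch.max(), np.linalg.norm(A, 2)   # the max stretch approaches the matrix 2-norm
```
:::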
In the case of the 2-norm, the norm turns\nout to be the largest singular value in the singular value decomposition (SVD)\nof the matrix.\n\nWe can interpret the norm of a matrix as the most that the matrix\ncan stretch a vector when multiplying by the vector (relative to the\nlength of the vector).\n\nA property of any legitimate matrix norm (including the induced norm) is\nthat $\\|AB\\|\\leq\\|A\\|\\|B\\|$. Also recall that norms must obey the triangle\ninequality, $\\|A+B\\|\\leq\\|A\\|+\\|B\\|$.\n\nA normalized vector is one with \"length\", i.e., Euclidean norm, of one.\nWe can easily normalize a vector: $\\tilde{x}=x/\\|x\\|$\n\nThe angle between two vectors is\n$$\\theta=\\cos^{-1}\\left(\\frac{\\langle x,y\\rangle}{\\sqrt{\\langle x,x\\rangle\\langle y,y\\rangle}}\\right)$$\n\n## Orthogonality\n\nTwo vectors are orthogonal if $x^{\\top}y=0$, in which case we say\n$x\\perp y$. An **orthogonal matrix** is a square matrix in which all of the\ncolumns are orthogonal to each other and normalized. The same holds for\nthe rows. Orthogonal matrices\ncan be shown to have full rank. Furthermore if $A$ is orthogonal,\n$A^{\\top}A=I$, so $A^{-1}=A^{\\top}$. Given all this, the determinant of\northogonal $A$ is either 1 or -1. Finally the product of two orthogonal\nmatrices, $A$ and $B$, is also orthogonal since\n$(AB)^{\\top}AB=B^{\\top}A^{\\top}AB=B^{\\top}B=I$.\n\n#### Permutations\n\nSometimes we make use of matrices that permute two rows (or two columns)\nof another matrix when multiplied. Such a matrix is known as an\nelementary permutation matrix and is an orthogonal matrix with a\ndeterminant of -1. You can multiply such matrices to get more general\npermutation matrices that are also orthogonal. If you premultiply by\n$P$, you permute rows, and if you postmultiply by $P$ you permute\ncolumns. Note that on a computer, you wouldn't need to actually do the\nmultiply (and if you did, you should use a sparse matrix routine), but\nrather one can often just rework index values that indicate where\nrelevant pieces of the matrix are stored (more in the next section).\n\n## Some vector and matrix properties\n\n$AB\\ne BA$ but $A+B=B+A$ and $A(BC)=(AB)C$.\n\nIn Python, recall the syntax is\n\n\n::: {.cell}\n\n```{.python .cell-code}\nA + B\n\n# Matrix multiplication\nnp.matmul(A, B) \nA @ B # alternative\nA.dot(B) # not recommended by the NumPy docs\n\nA * B # Hadamard (direct) product\n```\n:::\n\n\nYou don't need the spaces, but they're nice for code readability.\n\n## Trace and determinant of square matrices\n\nThe trace of a matrix is the sum of the diagonal elements. For square\nmatrices, $\\mbox{tr}(A+B)=\\mbox{tr}(A)+\\mbox{tr}(B)$,\n$\\mbox{tr}(A)=\\mbox{tr}(A^{\\top})$.\n\nWe also have $\\mbox{tr}(ABC)=\\mbox{tr}(CAB)=\\mbox{tr}(BCA)$ - basically\nyou can move a matrix from the beginning to the end or end to beginning,\nprovided they are conformable for this operation. This is helpful for a\ncouple reasons:\n\n1. We can find the ordering that reduces computation the most if the\n individual matrices are not square.\n2. $x^{\\top}Ax=\\mbox{tr}(x^{\\top}Ax)$ since the quadratic form,\n $x^{\\top}Ax$, is a scalar, and this is equal to\n $\\mbox{tr}(xx^{\\top}A)$ where $xx^{\\top}A$ is a matrix. It can be\n helpful to be able to go back and forth between a scalar and a trace\n in some statistical calculations.\n\nFor square matrices, the determinant exists and we have $|AB|=|A||B|$\nand therefore, $|A^{-1}|=1/|A|$ since $|I|=|AA^{-1}|=1$. 
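Here's a quick check of the trace manipulations above (my own example):

::: {.cell}

```{.python .cell-code}
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))
C = rng.normal(size=(3, 3))
x = rng.normal(size=3)

np.trace(A @ B @ C), np.trace(C @ A @ B), np.trace(B @ C @ A)   # all equal
x @ A @ x, np.trace(np.outer(x, x) @ A)                         # quadratic form as a trace
```
:::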
Also\n$|A|=|A^{\\top}|$, which can be seen using the QR decomposition for $A$\nand understanding properties of determinants of triangular matrices (in\nthis case $R$) and orthogonal matrices (in this case $Q$).\n\n## Transposes and inverses\n\nFor square, invertible matrices, we have that\n$(A^{-1})^{\\top}=(A^{\\top})^{-1}$. Why? Since we have\n$(AB)^{\\top}=B^{\\top}A^{\\top}$, we have:\n$$A^{\\top}(A^{-1})^{\\top}=(A^{-1}A)^{\\top}=I$$ so\n$(A^{\\top})^{-1}=(A^{-1})^{\\top}$.\n\nFor two invertible matrices, we have that\n$(AB)^{-1} = B^{-1}A^{-1}$ since\n$B^{-1}A^{-1} AB = I$.\n\n#### Other matrix multiplications\n\nThe Hadamard or direct product is simply multiplication of the\ncorrespoding elements of two matrices by each other. In R this is\nsimply` A * B`.\\\n**Challenge**: How can I find $\\mbox{tr}(AB)$ without using `A %*% B` ?\n\nThe Kronecker product is the product of each element of one matrix with\nthe entire other matrix\"\n\n$$A\\otimes B=\\left(\\begin{array}{ccc}\nA_{11}B & \\cdots & A_{1m}B\\\\\n\\vdots & \\ddots & \\vdots\\\\\nA_{n1}B & \\cdots & A_{nm}B\n\\end{array}\\right)$$\n\nThe inverse of a Kronecker product is the Kronecker product of the\ninverses,\n\n$$ B^{-1} \\otimes A^{-1} $$ \n\nwhich is obviously quite a bit faster because\nthe inverse (i.e., solving a system of equations) in this special case\nis $O(n^{3}+m^{3})$ rather than the naive approach being $O((nm)^{3})$.\n\n## Matrix decompositions\n\nA matrix decomposition is a re-expression of a matrix, $A$, in terms of\na product of two or three other, simpler matrices, where the\ndecomposition reveals structure or relationships present in the original\nmatrix, $A$. The \"simpler\" matrices may be simpler in various ways,\nincluding\n\n- having fewer rows or columns;\n- being diagonal, triangular or sparse in some way,\n- being orthogonal matrices.\n\nIn addition, once you have a decomposition, computation is generally\neasier, because of the special structure of the simpler matrices.\n\nWe'll see this in great detail in Section 3.\n\n# 2. Statistical interpretations of matrix invertibility, rank, etc.\n\n## Linear independence, rank, and basis vectors\n\nA set of vectors, $v_{1},\\ldots v_{n}$, is linearly independent (LIN)\nwhen none of the vectors can be represented as a linear combination,\n$\\sum c_{i}v_{i}$, of the others for scalars, $c_{1},\\ldots,c_{n}$. If\nwe have vectors of length $n$, we can have at most $n$ linearly\nindependent vectors. The rank of a matrix is the number of linearly\nindependent rows (or columns - it's the same), and is at most the\nminimum of the number of rows and number of columns. We'll generally\nthink about it in terms of the dimension of the column space - so we can\njust think about the number of linearly independent columns.\n\nAny set of linearly independent vectors (say $v_{1},\\ldots,v_{n}$) span\na space made up of all linear combinations of those vectors\n($\\sum_{i=1}^{n}c_{i}v_{i}$). The spanning vectors are known as basis\nvectors. We can express a vector $y$ that is in the space with respect\nto (as a linear combination of) basis vectors as $y=\\sum_{i}c_{i}v_{i}$,\nwhere if the basis vectors are normalized and orthogonal, we can find\nthe weights as $c_{i}=\\langle y,v_{i}\\rangle$.\n\nConsider a regression context. We have $p$ covariates ($p$ columns in\nthe design matrix, $X$), of which $q\\leq p$ are linearly independent\ncovariates. This means that $p-q$ of the vectors can be written as\nlinear combos of the $q$ vectors. 
The space spanned by the covariate\nvectors is of dimension $q$, rather than $p$, and $X^{\\top}X$ has $p-q$\neigenvalues that are zero. The $q$ LIN vectors are basis vectors for the\nspace - we can represent any point in the space as a linear combination\nof the basis vectors. You can think of the basis vectors as being like\nthe axes of the space, except that the basis vectors are not orthogonal.\nSo it's like denoting a point in $\\Re^{q}$ as a set of $q$ numbers\ntelling us where on each of the axes we are - this is the same as a\nlinear combination of axis-oriented vectors.\n\nWhen fitting a regression, if $n=p=q$, a vector of $n$ observations can\nbe represented exactly as a linear combination of the $p$ basis vectors,\nso there is no residual and we have a single unique (and exact) solution\n(e.g., with $n=p=2$, the observations fall exactly on the simple linear\nregression line). If $np$, so the system is overdetermined - there is no exact\nsolution, but regression is all about finding solutions that minimize\nsome criterion about the differences between the observations and linear\ncombinations of the columns of the $X$ matrix (such as least squares or\npenalized least squares). In standard regression, we project the\nobservation vector onto the space spanned by the columns of the $X$\nmatrix, so we find the point in the space closest to the observation\nvector.\n\n## Invertibility, singularity, rank, and positive definiteness\n\nFor square matrices, let's consider how invertibility, singularity, rank\nand positive (or non-negative) definiteness relate.\n\nSquare matrices that are \"regular\" have an eigendecomposition,\n$A=\\Gamma\\Lambda\\Gamma^{-1}$ where $\\Gamma$ is a matrix with the\neigenvectors as the columns and $\\Lambda$ is a diagonal matrix of\neigenvalues, $\\Lambda_{ii}=\\lambda_{i}$. Symmetric matrices and matrices\nwith unique eigenvalues are regular, as are some other matrices. The\nnumber of non-zero eigenvalues is the same as the rank of the matrix.\nSquare matrices that have an inverse are also called nonsingular, and\nthis is equivalent to having full rank. If the matrix is symmetric, the\neigenvectors and eigenvalues are real and $\\Gamma$ is orthogonal, so we\nhave $A=\\Gamma\\Lambda\\Gamma^{\\top}$. The determinant of the matrix is\nthe product of the eigenvalues (why?), which is zero if it is less than\nfull rank. Note that if none of the eigenvalues are zero then\n$A^{-1}=\\Gamma\\Lambda^{-1}\\Gamma^{\\top}$.\n\nLet's focus on symmetric matrices. The symmetric matrices that tend to\narise in statistics are either positive definite (p.d.) or non-negative\ndefinite (n.n.d.). If a matrix is positive definite, then by definition\n$x^{\\top}Ax>0$ for any $x$. Note that if $\\mbox{Cov}(y)=A$ then\n$x^{\\top}Ax=x^{\\top}\\mbox{Cov}(y)x=\\mbox{Cov}(x^{\\top}y)=\\mbox{Var}(x^{\\top}y)$\nif so positive definiteness amounts to having linear combinations of\nrandom variables (with the elements of $x$ here being the weights)\nhaving positive variance. So we must have that positive definite\nmatrices are equivalent to variance-covariance matrices (I'll just refer\nto this as a variance matrix or as a covariance matrix). If $A$ is p.d.\nthen it has all positive eigenvalues and it must have an inverse, though\nas we'll see, from a numerical perspective, we may not be able to\ncompute it if some of the eigenvalues are very close to zero. 
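Here's a small illustration (my own) of that connection between rank and zero eigenvalues, using a design matrix with one exactly collinear column:

::: {.cell}

```{.python .cell-code}
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
X = np.column_stack((X, X[:, 0] + X[:, 1]))   # a 5th column that is exactly collinear

XtX = X.T @ X
np.linalg.matrix_rank(XtX)    # 4, not 5
np.linalg.eigvalsh(XtX)       # one eigenvalue is zero up to floating point error
```
:::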
In Python,\n`numpy.linalg.eig(A)[1]` is $\\Gamma$, with each column a vector, and\n`numpy.linalg.eig(A)[0]` contains the (unordered) eigenvalues.\n\nTo summarize, here are some of the various connections between\nmathematical and statistical properties of **positive definite**\nmatrices:\n\n$A$ positive definite $\\Leftrightarrow$ $A$ is a covariance matrix\n$\\Leftrightarrow$ $x^{\\top}Ax>0$ $\\Leftrightarrow$ $\\lambda_{i}>0$\n(positive eigenvalues) $\\Rightarrow$$|A|>0$ $\\Rightarrow$$A$ is\ninvertible $\\Leftrightarrow$ $A$ is non singular $\\Leftrightarrow$ $A$ is\nfull rank.\n\nAnd here are connections for positive semi-definite matrices:\n\n$A$ positive semi-definite $\\Leftrightarrow$ $A$ is a constrained\ncovariance matrix $\\Leftrightarrow$ $x^{\\top}Ax\\geq0$ and equal to 0 for\nsome $x$ $\\Leftrightarrow$ $\\lambda_{i}\\geq 0$ (non-negative eigenvalues),\nwith at least one zero $\\Rightarrow$ $|A|=0$ $\\Leftrightarrow$ $A$ is not\ninvertible $\\Leftrightarrow$ $A$ is singular $\\Leftrightarrow$ $A$ is not\nfull rank.\n\n## Interpreting an eigendecomposition\n\nLet's interpret the eigendecomposition in a generative context as a way\nof generating random vectors. We can generate $y$ s.t. $\\mbox{Cov}(y)=A$\nif we generate $y=\\Gamma\\Lambda^{1/2}z$ where $\\mbox{Cov}(z)=I$ and\n$\\Lambda^{1/2}$ is formed by taking the square roots of the eigenvalues.\nSo $\\sqrt{\\lambda_{i}}$ is the standard deviation associated with the\nbasis vector $\\Gamma_{\\cdot i}$. That is, the $z$'s provide the weights\non the basis vectors, with scaling based on the eigenvalues. So $y$ is\nproduced as a linear combination of eigenvectors as basis vectors, with\nthe variance attributable to the basis vectors determined by the\neigenvalues.\n\nTo go the other direction, we can project a vector $y$ onto the space \nspanned by the eigenvectors: $w = (\\Gamma^{\\top}\\Gamma)^{-1}\\Gamma^{\\top}y = \\Gamma^{\\top}y = \\Lambda^{1/2}z$,\nwhere the simplification of course comes from $\\Gamma$ being orthogonal.\n\nIf $x^{\\top}Ax\\geq0$ then $A$ is nonnegative definite (also called\npositive semi-definite). In this case one or more eigenvalues can be\nzero. Let's interpret this a bit more in the context of generating\nrandom vectors based on non-negative definite matrices,\n$y=\\Gamma\\Lambda^{1/2}z$ where $\\mbox{Cov}(z)=I$. Questions:\n\n1. What does it mean when one or more eigenvalue (i.e.,\n $\\lambda_{i}=\\Lambda_{ii}$) is zero?\n\n2. Suppose I have an eigenvalue that is very small and I set it to\n zero? What will be the impact upon $y$ and $\\mbox{Cov}(y)$?\n\n3. Now let's consider the inverse of a covariance matrix, known as the\n precision matrix, $A^{-1}=\\Gamma\\Lambda^{-1}\\Gamma^{\\top}$. What\n does it mean if a $(\\Lambda^{-1})_{ii}$ is very large? What if\n $(\\Lambda^{-1})_{ii}$ is very small?\n\nConsider an arbitrary $n\\times p$ matrix, $X$. Any crossproduct or sum\nof squares matrix, such as $X^{\\top}X$ is positive definite\n(non-negative definite if $p>n$). This makes sense as it's just a\nscaling of an empirical covariance matrix.\n\n## Generalized inverses (optional)\n\nSuppose I want to find $x$ such that $Ax=b$. Mathematically the answer\n(provided $A$ is invertible, i.e. of full rank) is $x=A^{-1}b$.\n\nGeneralized inverses arise in solving equations when $A$ is not full\nrank. A generalized inverse is a matrix, $A^{-}$ s.t. $AA^{-}A=A$. 
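For example (a small check I've added), NumPy's `np.linalg.pinv` gives us such a generalized inverse for a rank-deficient matrix:

::: {.cell}

```{.python .cell-code}
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(3, 2))
A = B @ B.T                      # 3 x 3 but only rank 2, so it has no inverse

Aplus = np.linalg.pinv(A)
np.allclose(A @ Aplus @ A, A)    # the defining property of a generalized inverse
```
:::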
The\nMoore-Penrose inverse (the pseudo-inverse), $A^{+}$, is a (unique)\ngeneralized inverse that also satisfies some additional properties.\n$x=A^{+}b$ is the solution to the linear system, $Ax=b$, that has the\nshortest length for $x$.\n\nWe can find the pseudo-inverse based on an eigendecomposition (or an\nSVD) as $\\Gamma\\Lambda^{+}\\Gamma^{\\top}$. We obtain $\\Lambda^{+}$ from\n$\\Lambda$ as follows. For values $\\lambda_{i}>0$, compute\n$1/\\lambda_{i}$. All other values are set to 0. Let's interpret this\nstatistically. Suppose we have a precision matrix with one or more zero\neigenvalues and we want to find the covariance matrix. A zero eigenvalue\nmeans we have no precision, or infinite variance, for some linear\ncombination (i.e., for some basis vector). We take the pseudo-inverse\nand assign that linear combination zero variance.\n\nLet's consider a specific example. Autoregressive models are often used\nfor smoothing (in time, in space, and in covariates). A first order\nautoregressive model for $y_{1},y_{2},\\ldots,y_{T}$ has\n$E(y_{i}|y_{-i})=\\frac{1}{2}(y_{i-1}+y_{i+1})$. Another way of writing\nthe model is in time-order: $y_{i}=y_{i-1}+\\epsilon_{i}$. A second order\nautoregressive model has\n$E(y_{i}|y_{-i})=\\frac{1}{6}(4y_{i-1}+4y_{i+1}-y_{i-2}-y_{i+2})$. These\nconstructions basically state that each value should be a smoothed\nversion of its neighbors. One can figure out that the **precision**\nmatrix for $y$ in the first order model is $$\\left(\\begin{array}{ccccc}\n\\ddots & & \\vdots\\\\\n-1 & 2 & -1 & 0\\\\\n\\cdots & -1 & 2 & -1 & \\dots\\\\\n & 0 & -1 & 2 & -1\\\\\n & & \\vdots & & \\ddots\n\\end{array}\\right)$$ and in the second order model is\n\n$$\\left( \\begin{array}{ccccccc} \\ddots & & & \\vdots \\\\ 1 & -4 & 6 & -4 & 1 \\\\ \\cdots & 1 & -4 & 6 & -4 & 1 & \\cdots \\\\ & & 1 & -4 & 6 & -4 & 1 \\\\ & & & \\vdots \\end{array} \\right).$$ \n\nIf we look at the eigendecomposition of such\nmatrices, we see that in the first order case, the eigenvalue\ncorresponding to the constant eigenvector is zero.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\nprecMat = np.array([[1,-1,0,0,0],[-1,2,-1,0,0],[0,-1,2,-1,0],[0,0,-1,2,-1],[0,0,0,-1,1]])\ne = np.linalg.eig(precMat)\ne[0] # 4th eigenvalue is numerically zero\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([3.61803399e+00, 2.61803399e+00, 1.38196601e+00, 4.97762256e-17,\n 3.81966011e-01])\n```\n:::\n\n```{.python .cell-code}\ne[1][:,3] # constant eigenvector\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([0.4472136, 0.4472136, 0.4472136, 0.4472136, 0.4472136])\n```\n:::\n:::\n\n\nThis means we have no information about the overall level of $y$. So how\nwould we generate sample $y$ vectors? We can't put infinite variance on\nthe constant basis vector and still generate samples. 
Instead we use the\npseudo-inverse and assign ZERO variance to the constant basis vector.\nThis corresponds to generating realizations under the constraint that\n$\\sum y_{i}$ has no variation, i.e., $\\sum y_{i}=\\bar{y}=0$ - you can\nsee this by seeing that $\\mbox{Var}(\\Gamma_{\\cdot i}^{\\top}y)=0$ when\n$\\lambda_{i}=0$.\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# generate a realization\nevals = e[0]\nevals = 1/evals # variances\nevals[3] = 0 # generalized inverse\ny = e[1] @ ((evals ** 0.5) * np.random.normal(size = 5))\ny.sum()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n-2.220446049250313e-16\n```\n:::\n:::\n\n\nIn the second order case, we have two non-identifiabilities: for the sum\nand for the linear component of the variation in $y$ (linear in the\nindices of $y$).\n\nI could parameterize a statistical model as $\\mu+y$ where $y$ has\ncovariance that is the generalized inverse discussed above. Then I allow\nfor both a non-zero mean and for smooth variation governed by the\nautoregressive structure. In the second-order case, I would need to add\na linear component as well, given the second non-identifiability.\n\n## Matrices arising in regression\n\nIn regression, we work with $X^{\\top}X$. Some properties of this matrix\nare that it is symmetric and non-negative definite (hence our use of\n$(X^{\\top}X)^{-1}$ in the OLS estimator). When is it not positive\ndefinite?\n\nFitted values are $X\\hat{\\beta}=X(X^{\\top}X)^{-1}X^{\\top}Y=HY$. The\n\"hat\" matrix, $H$, projects $Y$ into the column space of $X$. $H$ is\nidempotent: $HH=H$, which makes sense - once you've projected into the\nspace, any subsequent projection just gives you the same thing back. $H$\nis singular. Why? Also, under what special circumstance would it not be\nsingular?\n\n# 3. Computational issues\n\n## Storing matrices\n\nWe've discussed column-major and row-major storage of matrices. First,\nretrieval of matrix elements from memory is quickest when multiple\nelements are contiguous in memory. So in a column-major language (e.g.,\nR, Fortran), it is best to work with values in a common column (or\nentire columns) while in a row-major language (e.g., Python, C) for\nvalues in a common row.\n\nIn some cases, one can save space (and potentially speed) by overwriting\nthe output from a matrix calculation into the space occupied by an\ninput. This occurs in some clever implementations of matrix\nfactorizations.\n\n## Algorithms\n\nGood algorithms can change the efficiency of an algorithm by one or more\norders of magnitude, and many of the improvements in computational speed\nover recent decades have been in algorithms rather than in computer\nspeed.\n\nMost matrix algebra calculations can be done in multiple ways. For\nexample, we could compute $b=Ax$ in either of the following ways,\ndenoted here in pseudocode.\n\n1. Stack the inner products of the rows of $A$ with $x$.\\\n\n```\n for(i=1:n){ \n b_i = 0\n for(j=1:m){\n b_i = b_i + a_{ij} x_j\n }\n }\n```\n\n2. Take the linear combination (based on $x$) of the columns of $A$\\\n\n```\n for(i=1:n){ \n b_i = 0\n }\n for(j=1:m){\n for(i = 1:n){\n b_i = b_i + a_{ij} x_j \n }\n }\n```\n\nIn this case the two approaches involve the same number of operations\nbut the first might be better for row-major matrices (so might be how we\nwould implement in C) and the second for column-major (so might be how\nwe would implement in Fortran). \n\n**Challenge**: check whether the first\napproach is faster in Python. 
(Write the code just doing the outer loop and\ndoing the inner loop using vectorized calculation.)\n\n#### General computational issues\n\nThe same caveats we discussed in terms of computer arithmetic hold\nnaturally for linear algebra, since this involves arithmetic with many\nelements. Good implementations of algorithms are aware of the danger of\ncatastrophic cancellation and of the possibility of dividing by zero or\nby values that are near zero.\n\n## Ill-conditioned problems\n\n#### Basics\n\nA problem is ill-conditioned if small changes to values in the\ncomputation result in large changes in the result. This is quantified by\nsomething called the *condition number* of a calculation. For different\noperations there are different condition numbers.\n\nIll-conditionedness arises most often in terms of matrix inversion, so\nthe standard condition number is the \"condition number with respect to\ninversion\", which when using the $L_{2}$ norm is the ratio of the\nabsolute values of the largest to smallest eigenvalue. Here's an\nexample: $$A=\\left(\\begin{array}{cccc}\n10 & 7 & 8 & 7\\\\\n7 & 5 & 6 & 5\\\\\n8 & 6 & 10 & 9\\\\\n7 & 5 & 9 & 10\n\\end{array}\\right).$$ The solution of $Ax=b$ for $b=(32,23,33,31)$ is\n$x=(1,1,1,1)$, while the solution for $b+\\delta b=(32.1,22.9,33.1,30.9)$\nis $x+\\delta x=(9.2,-12.6,4.5,-1.1)$, where $\\delta$ is notation for a\nperturbation to the vector or matrix.\n\n\n::: {.cell}\n\n```{.python .cell-code}\ndef norm2(x):\n return(np.sum(x**2) ** 0.5)\n\nA = np.array([[10,7,8,7],[7,5,6,5],[8,6,10,9],[7,5,9,10]])\nb = np.array([32,23,33,31])\nx = np.linalg.solve(A, b)\n\nbPerturbed = np.array([32.1, 22.9, 33.1, 30.9])\nxPerturbed = np.linalg.solve(A, bPerturbed)\n\ndelta_b = bPerturbed - b\ndelta_x = xPerturbed - x\n```\n:::\n\n\n\nWhat's going on? Some manipulations with inequalities involving the\ninduced matrix norm (for any chosen vector norm, but we might as well\njust think about the Euclidean norm) (see Gentle-CS Sec. 5.1 or the derivation in class) give\n$$\\frac{\\|\\delta x\\|}{\\|x\\|}\\leq\\|A\\|\\|A^{-1}\\|\\frac{\\|\\delta b\\|}{\\|b\\|}$$\nwhere we define the condition number w.r.t. inversion as\n$\\mbox{cond}(A)\\equiv\\|A\\|\\|A^{-1}\\|$. We'll generally work with the\n$L_{2}$ norm, and for a nonsingular square matrix the result is that the\ncondition number is the ratio of the absolute values of the largest and\nsmallest magnitude eigenvalues. 
This makes sense since $\\|A\\|_{2}$ is\nthe absolute value of the largest magnitude eigenvalue of $A$ and\n$\\|A^{-1}\\|_{2}$ that of the inverse of the absolute value of the\nsmallest magnitude eigenvalue of $A$.\n\nWe see in the code below that the large disparity in eigenvalues of $A$\nleads to an effect predictable from our inequality above, with the\ncondition number helping us find an upper bound.\n\n\n::: {.cell}\n\n```{.python .cell-code}\ne = np.linalg.eig(A)\nevals = e[0]\nprint(evals)\n\n## relative perturbation in x much bigger than in b\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[3.02886853e+01 3.85805746e+00 1.01500484e-02 8.43107150e-01]\n```\n:::\n\n```{.python .cell-code}\nnorm2(delta_x) / norm2(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n8.19847546803699\n```\n:::\n\n```{.python .cell-code}\nnorm2(delta_b) / norm2(b)\n\n## ratio of relative perturbations\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n0.0033319453118976702\n```\n:::\n\n```{.python .cell-code}\n(norm2(delta_x) / norm2(x)) / (norm2(delta_b) / norm2(b))\n\n## ratio of largest and smallest magnitude eigenvalues\n## confusingly evals[2] is the smallest, not evals[3]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n2460.567236431514\n```\n:::\n\n```{.python .cell-code}\n(evals[0]/evals[2])\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n2984.092701676269\n```\n:::\n:::\n\n\nThe main use of these ideas for our purposes is in thinking about the\nnumerical accuracy of a linear system solution (Gentle-NLA Sec 3.4). On\na computer we have the system $$(A+\\delta A)(x+\\delta x)=b+\\delta b$$\nwhere the 'perturbation' is from the inaccuracy of computer numbers. Our\nexploration of computer numbers tells us that\n$$\\frac{\\|\\delta b\\|}{\\|b\\|}\\approx10^{-p};\\,\\,\\,\\frac{\\|\\delta A\\|}{\\|A\\|}\\approx10^{-p}$$\nwhere $p=16$ for standard double precision floating points. Following\nGentle, one gets the approximation\n\n$$\\frac{\\|\\delta x\\|}{\\|x\\|}\\approx\\mbox{cond}(A)10^{-p},$$ so if\n$\\mbox{cond}(A)\\approx10^{t}$, we have accuracy of order $10^{t-p}$\ninstead of $10^{-p}$. (Gentle cautions that this holds only if\n$10^{t-p}\\ll1$). So we can think of the condition number as giving us\nthe number of digits of accuracy lost during a computation relative to\nthe precision of numbers on the computer. E.g., a condition number of\n$10^{8}$ means we lose 8 digits of accuracy relative to our original 16\non standard systems. One issue is that estimating the condition number\nis itself subject to numerical error and requires computation of\n$A^{-1}$ (albeit not in the case of $L_{2}$ norm with square,\nnonsingular $A$) but see Golub and van Loan (1996; p. 76-78) for an\nalgorithm.\n\n#### Improving conditioning\n\nIll-conditioned problems in statistics often arise from collinearity of\nregressors. Often the best solution is not a numerical one, but\nre-thinking the modeling approach, as this generally indicates\nstatistical issues beyond just the numerical difficulties.\n\nA general comment on improving conditioning is that we want to avoid\nlarge differences in the magnitudes of numbers involved in a\ncalculation. In some contexts such as regression, we can center and\nscale the columns to avoid such differences - this will improve the\ncondition of the problem. 
E.g., in simple quadratic regression with\n$x=\\{1990,\\ldots,2010\\}$ (e.g., regressing on calendar years), we see\nthat centering and scaling the matrix columns makes a huge difference on\nthe condition number\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport statsmodels.api as sm\n\nt1 = np.arange(1990, 2011) # naive covariate\nX1 = np.column_stack((np.ones(21), t1, t1 ** 2))\n\nbeta = np.array([5, .1, .0001])\ny = X1 @ beta + np.random.normal(size = len(t1))\n\ne1 = np.linalg.eig(np.dot(X1.T, X1))\nnp.sort(e1[0])[::-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([3.36018564e+14, 7.69949736e+02, 2.24079720e-08])\n```\n:::\n\n```{.python .cell-code}\nnp.linalg.cond(np.dot(X1.T, X1)) # built-in!\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n4.653329826789808e+21\n```\n:::\n\n```{.python .cell-code}\nsm.OLS(y, X1).fit().params\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([-7.08730861e+03, 7.17362346e+00, -1.66371671e-03])\n```\n:::\n\n```{.python .cell-code}\nt2 = t1 - 2000 # centered\nX2 = np.column_stack((np.ones(21), t2, t2 ** 2))\ne2 = np.linalg.eig(np.dot(X2.T, X2))\nwith np.printoptions(suppress=True):\n np.sort(e2[0])[::-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([50677.70427505, 770. , 9.29572495])\n```\n:::\n\n```{.python .cell-code}\nt3 = t2/10 # centered and scaled\nX3 = np.column_stack((np.ones(21), t3, t3 ** 2))\ne3 = np.linalg.eig(np.dot(X3.T, X3))\nwith np.printoptions(suppress=True):\n np.sort(e3[0])[::-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([24.11293487, 7.7 , 1.95366513])\n```\n:::\n:::\n\n\nI haven't shown the OLS results for the transformed version\nof the problem because the transformations modify the true\nparameter values, so there is a bit of arithmetic to be done,\nbut you should be able to verify that the second and third\napproaches give reasonable answers.\n\n\nThe basic story is that simple strategies often solve the problem, and\nthat you should be aware of the absolute and relative magnitudes\ninvolved in your calculations.\n\nOne rule of thumb is to try to work with numbers whose magnitude is\naround 1. We can often scale the values in our problem in order to do\nthis. I.e., change the units of your variables. Instead of personal\nincome in dollars, use personal income in thousands or hundreds of\nthousands of dollars.\n\n# 4. Matrix factorizations (decompositions) and solving systems of linear equations\n\nSuppose we want to solve the following linear system: \n\n$$\\begin{aligned} Ax & = & b\\\\ x & = & A^{-1}b \\end{aligned}$$ \n\n\nNumerically, this is never done by\nfinding the inverse and multiplying. Rather we solve the system using a\nmatrix decomposition (or equivalent set of steps). One approach uses\nGaussian elimination (equivalent to the LU decomposition), while another\nuses the Cholesky decomposition. There are also iterative methods that\ngenerate a sequence of approximations to the solution but reduce\ncomputation (provided they are stopped before the exact solution is\nfound).\n\nGentle-CS has a nice table overviewing the various factorizations (Table\n5.1, page 219). 
I've reproduced a variation on it here.\n\n\n\n| Name | Representation | Restrictions | Properties | Uses |\n| ----------| ----------------------| ------------| --------------------|-------------|\n| LU | $A_{nn}= L_{nn}U_{nn}$ | $A$ generally square | $L$ lower triangular; $U$ upper triangular | solving equations; inversion | \n| QR | $A_{nm}= Q_{nn}R_{nm}$ or $A_{nm}=Q_{nm}R_{mm}$(skinny) | | $Q$ orthogonal; $R$ upper triangular | regression |\n| Cholesky | $A_{nn}=U_{nn}^{\\top}U_{nn}$ | $A$ positive (semi-) definite | $U$ upper triangular | multivariate normal; covariance; solving equations; inversion |\n| Eigen decomposition | $A_{nn}=\\Gamma_{nn}\\Lambda_{nn}\\Gamma_{nn}^{\\top}$ | $A$ square, symmetric*| $\\Gamma$ orthogonal; $\\Lambda$ (non-negative**) diagonal | principal components analysis and related | \n| SVD | $A_{nm}=U_{nn}D_{nm}V_{mm}^{\\top}$ or $A_{nm}= U_{nk}D_{kk}V_{mk}^{\\top}$ | | $U, V$ orthogonal; $D$ (non-negative) diagonal | machine learning, topic models |\n\nTable: Matrix factorizations useful for statistics / data science / machine learning\n\n*For the eigen decomposition, I assume $A$ is symmetric, though there\nis a decomposition for non-symmetric $A$.\n\n** For positive definite or positive semi-definite $A$.\n\n## Triangular systems\n\nAs a preface, let's figure out how to solve $Ax=b$ if $A$ is upper\ntriangular. The basic algorithm proceeds from the bottom up (and\ntherefore is called a 'backsolve'. We solve for $x_{n}$ trivially, and\nthen move upwards plugging in the known values of $x$ and solving for\nthe remaining unknown in each row (each equation).\n\n1. $x_{n}=b_{n}/A_{nn}$\n2. Now for $kp$ then the leading $p$ rows of $R$ provide an upper triangular\nmatrix ($R_{1}$) and the remaining rows are 0. (I'm using $p$ because\nthe QR is generally applied to design matrices in regression). In this\ncase we really only need the first $p$ columns of $Q$, and we have\n$X=Q_{1}R_{1}$, the 'skinny' QR (this is what R's QR provides). For\nuniqueness, we can require the diagonals of $R$ to be nonnegative, and\nthen $R$ will be the same as the upper-triangular Cholesky factor of\n$X^{\\top}X$: \n\n$$\\begin{aligned} X^{\\top}X & = & R^{\\top}Q^{\\top}QR \\\\ & = & R^{\\top}R\\end{aligned}.$$ \n\n\nThere are three standard approaches for\ncomputing the QR, using (1) reflections (Householder transformations),\n(2) rotations (Givens transformations), or (3) Gram-Schmidt\northogonalization (see below for details).\n\nFor $n\\times n$ $X$, the QR (for the Householder approach) requires\n$2n^{3}/3$ flops, so QR is less efficient than LU or Cholesky.\n\nWe can also obtain the pseudo-inverse of $X$ from the QR:\n$X^{+}=[R_{1}^{-1}\\,0]Q^{\\top}$. In the case that $X$ is not full-rank,\nthere is a version of the QR that will work (involving pivoting) and we\nend up with some additional zeroes on the diagonal of $R_{1}$.\n\n### Regression and the QR\n\nOften QR is used to fit linear models, including in R. Consider the\nlinear model in the form $Y=X\\beta+\\epsilon$, finding\n$\\hat{\\beta}=(X^{\\top}X)^{-1}X^{\\top}Y$. Let's consider the skinny QR\nand note that $R^{\\top}$ is invertible. Therefore, we can express the\nnormal equations as \n\n$$\n\\begin{aligned} \nX^{\\top}X\\beta & = & X^{\\top} Y \\\\\nR^{\\top}Q^{\\top}QR\\beta & = & R^{\\top}Q^{\\top} Y \\\\\nR \\beta & = & Q^{\\top} Y\n\\end{aligned}\n$$ \n\n\nand solving for $\\beta$ is just a\nbacksolve since $R$ is upper-triangular. 
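Here's that computation in code (a sketch I've added, using `scipy.linalg.solve_triangular` for the backsolve; the simulated data and variable names are just for illustration):

::: {.cell}

```{.python .cell-code}
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack((np.ones(n), rng.normal(size=(n, p - 1))))
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=n)

Q, R = np.linalg.qr(X)                    # skinny QR
beta_hat = solve_triangular(R, Q.T @ y)   # backsolve R beta = Q^T y

np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0])
```
:::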
Furthermore the standard\nregression quantities, such as the hat matrix, the SSE, the residuals,\netc. can be easily expressed in terms of $Q$ and $R$.\n\nWhy use the QR instead of the Cholesky on $X^{\\top}X$? The condition\nnumber of $X$ is the square root of that of $X^{\\top}X$, and the $QR$\nfactorizes $X$. Monahan has a discussion of the condition of the\nregression problem, but from a larger perspective, the situations where\nnumerical accuracy is a concern are generally cases where the OLS\nestimators are not particularly helpful anyway (e.g., highly collinear\npredictors).\n\nWhat about computational order of the different approaches to least\nsquares? The Cholesky is $np^{2}+\\frac{1}{3}p^{3}$, an algorithm called\nsweeping is $np^{2}+p^{3}$ , the Householder method for QR is\n$2np^{2}-\\frac{2}{3}p^{3}$, and the modified Gram-Schmidt approach for\nQR is $2np^{2}$. So if $n\\gg p$ then Cholesky (and sweeping) are faster\nthan the QR approaches. According to Monahan, modified Gram-Schmidt is\nmost numerically stable and sweeping least. In general, regression is\npretty quick unless $p$ is large since it is linear in $n$, so it may\nnot be worth worrying too much about computational differences of the\nsort noted here.\n\n### Regression and the QR in Python and R\n\nWe can get the Q and R matrices easily in Python.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nQ,R = np.linalg.qr(X)\n```\n:::\n\n\nOne of the methods used by the [`statsmodel` package in Python\nuses the QR to fit a regression](https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.fit.html).\n\nNote that by default in Python (and in R), you get the skinny QR, namely only the\nfirst $p$ rows of $R$ and the first $p$ columns of $Q$, where the latter\nform an orthonormal basis for the column space of $X$. The remaining\ncolumns form an orthonormal basis for the null space of $X$ (the space\northogonal to the column space of $X$). The analogy in regression is\nthat we get the basis vectors for the regression, while adding the\nremaining columns gives us the full $n$-dimensional space of the\nobservations.\n\nRegression in R uses the QR decomposition via `qr()`, which calls a\nFortran function. `qr()` (and the Fortran functions that are called) is\nspecifically designed to output quantities useful in fitting linear\nmodels. \n\nIn R, `qr()` returns the result as a list meant for use by other tools. R\nstores the $R$ matrix in the upper triangle of `\\$qr`, while the lower\ntriangle of *\\$qr* and *\\$aux* store the information for constructing\n$Q$ (this relates to the Householder-related vectors $u$ below). One can\nmultiply by $Q$ using *qr.qy()* and by $Q^{\\top}$ using *qr.qty()*. If\nyou want to extract $R$ and $Q$, the following will work:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nX.qr = qr(X)\nQ = qr.Q(X.qr)\nR = qr.R(X.qr) \n```\n:::\n\n\nAs a side note, there are QR-based functions that provide\nregression-related quantities, such as *qr.resid()*, *qr.fitted()* and\n*qr.coef()*. These functions (and their Fortran counterparts) exist\nbecause one can work through the various regression quantities of\ninterest and find their expressions in terms of $Q$ and $R$, with nice\nproperties resulting from $Q$ being orthogonal and $R$ triangular.\n\n### Computing the QR decomposition\n\nHere we'll see some of the details of the different approaches to the\nQR, in part because they involve some concepts that may be useful in\nother contexts. 
I won't expect you to see all of how this works, but\nplease skim through this to get an idea of how things are done.\n\nOne approach involves reflections of vectors and a second rotations of\nvectors. Reflections and rotations are transformations that are\nperformed by orthogonal matrices. The determinant of a reflection matrix\nis -1 and the determinant of a rotation matrix is 1. We'll see some of\nthe details in the demo code.\n\n#### QR Method 1: Reflections\n\nIf $u$ and $v$ are orthonormal vectors and $x$ is in the space spanned\nby $u$ and $v$, $x=c_{1}u+c_{2}v$, then $\\tilde{x}=-c_{1}u+c_{2}v$ is a\nreflection (a *Householder* reflection) along the $u$ dimension (since\nwe are using the negative of that basis vector). We can think of this as\nreflecting across the plane perpendicular to $u$. This extends simply to\nhigher dimensions with orthonormal vectors, $u,v_{1},v_{2},\\ldots$\n\nSuppose we want to formulate the reflection in terms of a \"Householder\"\nmatrix, $Q$. It turns out that $$Qx=\\tilde{x}$$ if $Q=I-2uu^{\\top}$. $Q$\nhas the following properties: (1) $Qu=-u$, (2) $Qv=v$ for $u^{\\top}v=0$,\n(3) $Q$ is orthogonal and symmetric.\n\nOne way to create the QR decomposition is by a series of Householder\ntransformations that create an upper triangular $R$ from $X$:\n\n$$\n\\begin{aligned}\nR & = & Q_{p}\\cdots Q_{1} X \\\\\nQ & = & (Q_{p}\\cdots Q_{1})^{\\top}\n\\end{aligned}\n$$ \n\n\nwhere we make use of\nthe symmetry in defining $Q$.\n\nBasically $Q_{1}$ reflects the first column of $X$ with respect to a\ncarefully chosen $u$, so that the result is all zeroes except for the\nfirst element. We want $Q_{1}x=\\tilde{x}=(||x||,0,\\ldots,0)$. This can\nbe achieved with $u=\\frac{x-\\tilde{x}}{||x-\\tilde{x}||}$. Then $Q_{2}$\nmakes the last $n-2$ rows of the second column equal to zero. We'll work\nthrough this a bit in class.\n\nIn the regression context, as we work through the individual\ntransformations, $Q_{j}=I-2u_{j}u_{j}^{\\top}$, we apply them to $X$ and\n$Y$ to create $R$ (note this would not involve doing the full matrix\nmultiplication - think about what calculations are actually needed) and\n$QY=Q^{\\top}Y$, and then solve $R\\beta=Q^{\\top}Y$. To find\n$\\mbox{Cov}(\\hat{\\beta})\\propto(X^{\\top}X)^{-1}=(R^{\\top}R)^{-1}=R^{-1}R^{-\\top}$\nwe do need to invert $R$, but it's upper-triangular and of dimension\n$p\\times p$. It turns out that $Q^{\\top}Y$ can be partitioned into the\nfirst $p$ and the last $n-p$ elements, $z^{(1)}$ and $z^{(2)}$. The SSR\nis $\\|z^{(1)}\\|^{2}$ and SSE is $\\|z^{(2)}\\|^{2}$.\n\nFinal side note: if $X$ is square (so $n=p)$ you might wonder why we\nneed $Q_{p}$ since after $p-1$ reflections, we don't need to zero\nanything else out (since the last column of $R$ has $n$ non-zero\nelements). It turns out that if we go back to thinking about a\nHouseholder reflection in general, there is a lack of uniqueness in\nchoosing $\\tilde{x}$. It could either be $(||x||,0,\\ldots,0)$ or\n$(-||x||,0,\\ldots,0)$. For better numerical stability, one chooses from\nthe two of those such that $x_{1}$ is of the opposite sign to\n$\\tilde{x}_{1}$, so that one avoids cancellation of numbers that may be\nof the same magnitude when doing $x-\\tilde{x}$. The transformation\n$Q_{p}$ is the last step of taking that approach of choosing the sign at\neach step. $Q_{p}$ doesn't zero anything out; it just basically just\ninvolves potentially setting $R_{pp}$ to be $-R_{pp}$. 
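\n\nTo see a reflection numerically, here is a small sketch (using a random\nmatrix purely for illustration) that forms $Q_{1}=I-2uu^{\\top}$ with\n$u=(x-\\tilde{x})/\\|x-\\tilde{x}\\|$ and checks that it zeroes out all but\nthe first element of the first column of $X$.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\nnp.random.seed(1)\nX = np.random.normal(size = (6, 3))\nx = X[:, 0]\nxtilde = np.zeros(len(x))\nxtilde[0] = np.linalg.norm(x)  # in practice, choose the sign to avoid cancellation\nu = (x - xtilde) / np.linalg.norm(x - xtilde)\nQ1 = np.eye(len(x)) - 2 * np.outer(u, u)\nnp.round(Q1 @ X, 10)  # first column is now (||x||, 0, ..., 0)\n```\n:::\n\n\n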
(To be honest,\nI'm not clear on why one would bother to do that last step, but that\nseems to be how it is presented in discussions of the Householder\napproach.) Of course in the case of $p2$, find interim vectors, $x_{k}^{(2)}$, by\n orthogonalizing with respect to $\\tilde{x}_{1}$\n\n3. Proceed for $k=3,\\ldots$, in turn orthogonalizing and normalizing\n the first of the remaining vectors w.r.t. $\\tilde{x}_{k-1}$ and\n orthogonalizing the remaining vectors w.r.t. $\\tilde{x}_{k-1}$ to\n get new interim vectors\n\nMathematically, we could instead orthogonalize $x_{2}$ w.r.t.\n$\\tilde{x}_{1}$, then orthogonalize $x_{3}$ w.r.t.\n$\\{\\tilde{x}_{1},\\tilde{x}_{2}\\}$, etc. The algorithm above is the\n*modified* G-S, and is known to be more numerically stable if the\ncolumns of $X$ are close to collinear, giving vectors that are closer to\northogonal. The resulting $\\tilde{x}$ vectors are the columns of $Q$.\nThe elements of $R$ are obtained as we proceed: the diagonal values are\nthe the normalization values in the denominators, while the\noff-diagonals are the inner products with the already-computed columns\nof $Q$ that are computed as part of the numerators.\n\nAnother way to think about this is that $R=Q^{\\top}X$, which is the same\nas regressing the columns of $X$ on $Q,$ since\n$(Q^{\\top}Q)^{-1}Q^{\\top}X=Q^{\\top}X$. By construction, the first column\nof $X$ is a scaling of the first column of $Q$, the second column of $X$\nis a linear combination of the first two columns of $Q$, etc., so $R$\nbeing upper triangular makes sense.\n\n### The \"tall-skinny\" QR\n\nSuppose you have a very large regression problem, with $n$ very large,\nand $n\\gg p$. There is a variant of the QR, called the tall-skinny QR\n(see for details) that allows us\nto find the decomposition in a parallel fashion. The basic idea is to do\na nested set of QR decompositions on blocks of rows of $X$:\n\n$$\nX = \\left( \\begin{array}{c}\nX_{0} \\\\\nX_{1} \\\\\nX_{2} \\\\\nX_{3}\n\\end{array}\n\\right) =\n\\left(\n\\begin{array}{c}\nQ_{0} R_{0} \\\\\nQ_{1} R_{1} \\\\\nQ_{2} R_{2} \\\\\nQ_{3} R_{3}\n\\end{array} \\right),\n$$\n\nfollowed by 'reduction' steps (this\ncan be done in a map-reduce context) that do the $QR$ of pairs of the\n$R$ factors: \n$$\\left(\\begin{array}{c}\nR_{0}\\\\\nR_{1}\\\\\nR_{2}\\\\\nR_{3}\n\\end{array}\\right)=\\left(\\begin{array}{c}\n\\left(\\begin{array}{c}\nR_{0}\\\\\nR_{1}\n\\end{array}\\right)\\\\\n\\left(\\begin{array}{c}\nR_{2}\\\\\nR_{3}\n\\end{array}\\right)\n\\end{array}\\right)=\\left(\\begin{array}{c}\nQ_{01}R_{01}\\\\\nQ_{23}R_{23}\n\\end{array}\\right)$$ and $$\\left(\\begin{array}{c}\nR_{01}\\\\\nR_{23}\n\\end{array}\\right)=Q_{0123}R_{0123}.$$ \n\nThe full decomposition is then\n\n$$X=\\left( \\begin{array}{cccc} Q_{0} & 0 & 0 & 0 \\\\ 0 & Q_{1} & 0 & 0 \\\\ 0 & 0 & Q_{2} & 0 \\\\ 0 & 0 & 0 & Q_{3} \\end{array} \\right) \\left( \\begin{array}{cc} Q_{01} & 0 \\\\ 0 & Q_{23} \\end{array} \\right) Q_{0123} R_{0123} = QR.$$\n\n\nThe computation can be done in\nparallel (in particular it can be done with map-reduce) and the $Q$\nmatrix for big problems would generally not be computed explicitly but\nwould be stored in its constituent pieces.\n\nAlternatively, there is a variant on the algorithm that processes the\nrow-blocks of $X$ serially, allowing you to do QR on a large tall-skinny\nmatrix that you can't fit in memory (or possibly even on disk). First\nyou do $QR$ on $X_{0}$ to get $Q_{0}R_{0}$. Then you stack $R_{0}$ on\ntop of $X_{1}$ and do QR to get $R_{01}$. 
Then stack $R_{01}$ on top of\n$X_{2}$ to get $R_{012}$, etc.\n\n## Determinants\n\nThe absolute value of the determinant of a square matrix can be found\nfrom the product of the diagonals of the triangular matrix in any\nfactorization that gives a triangular (including diagonal) matrix times\nan orthogonal matrix (or matrices) since the determinant of an\northogonal matrix is either one or minus one.\n\n$|A|=|QR|=|Q||R|=\\pm|R|$\n\n$|A^{\\top}A|=|(QR)^{\\top}QR|=|R^{\\top}R|=|R_{1}^{\\top}R_{1}|=|R_{1}|^{2}$\\\nIn Python, the following will do it (on the log scale).\n\n\n::: {.cell}\n\n```{.python .cell-code}\nQ,R = qr(A)\nmagn = np.sum(np.log(np.abs(np.diag(R)))) \n```\n:::\n\n\nAn alternative is the product of the diagonal elements of $D$ (the\nsingular values) in the SVD factorization, $A=UDV^{\\top}$.\n\nFor non-negative definite matrices, we know the determinant is\nnon-negative, so the uncertainty about the sign is not an issue. For\npositive definite matrices, a good approach is to use the product of the\ndiagonal elements of the Cholesky decomposition.\n\nOne can also use the product of the eigenvalues:\n$|A|=|\\Gamma\\Lambda\\Gamma^{-1}|=|\\Gamma||\\Gamma^{-1}||\\Lambda|=|\\Lambda|$\n\n#### Computation\n\nComputing from any of these diagonal or triangular matrices as the\nproduct of the diagonals is prone to overflow and underflow, so we\n**always** work on the log scale as the sum of the log of the values.\nWhen some of these may be negative, we can always keep track of the\nnumber of negative values and take the log of the absolute values.\n\nOften we will have the factorization as a result of other parts of the\ncomputation, so we get the determinant for free.\n\nWe can use `np.linalg.logdet()` or (definitely not recommended)\n`np.linalg.det()` to calculate the determinant in Python.\nThese functions use the LU decomposition.\n\n\n# 5. Eigendecomposition and SVD\n\n## Eigendecomposition \n\nThe eigendecomposition (spectral decomposition) is useful in considering\nconvergence of algorithms and of course for statistical decompositions\nsuch as PCA. We think of decomposing the components of variation into\northogonal patterns (the eigenvectors) with variances (eigenvalues)\nassociated with each pattern.\n\nSquare symmetric matrices have real eigenvectors and eigenvalues, with\nthe factorization into orthogonal $\\Gamma$ and diagonal $\\Lambda$,\n$A=\\Gamma\\Lambda\\Gamma^{\\top}$, where the eigenvalues on the diagonal of\n$\\Lambda$ are ordered in decreasing value. Of course this is equivalent\nto the definition of an eigenvalue/eigenvector pair as a pair such that\n$Ax=\\lambda x$ where $x$ is the eigenvector and $\\lambda$ is a scalar,\nthe eigenvalue. The inverse of the eigendecomposition is simply\n$\\Gamma\\Lambda^{-1}\\Gamma^{\\top}$. On a similar note, we can create a\nsquare root matrix, $\\Gamma\\Lambda^{1/2}$, by taking the square roots of\nthe eigenvalues.\n\nThe spectral radius of $A$, denoted $\\rho(A)$, is the maximum of the\nabsolute values of the eigenvalues. As we saw when talking about\nill-conditionedness, for symmetric matrices, this maximum is the induced\nnorm, so we have $\\rho(A)=\\|A\\|_{2}$. It turns out that\n$\\rho(A)\\leq\\|A\\|$ for any induced matrix norm. The spectral radius\ncomes up in determining the rate of convergence of some iterative\nalgorithms.\n\n#### Computation\n\nThere are several methods for eigenvalues; a common one for doing the\nfull eigendecomposition is the *QR algorithm*. 
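\n\nIn practice we just call a library routine; for instance, a quick sketch\nfor a symmetric matrix `A` assumed to already be in memory would use\n`numpy.linalg.eigh()`, the routine intended for symmetric (Hermitian)\nmatrices. The discussion below describes the sort of algorithm such\nroutines (or close relatives) use under the hood.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\n# assumes a symmetric matrix A is already defined\nevals, evecs = np.linalg.eigh(A)  # eigenvalues in increasing order; eigenvectors in the columns\nnp.allclose(evecs @ np.diag(evals) @ evecs.T, A)  # A = Gamma Lambda Gamma^T\n```\n:::\n\n\n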
The first step is to\nreduce $A$ to upper Hessenburg form, which is an upper triangular matrix\nexcept that the first subdiagonal in the lower triangular part can be\nnon-zero. For symmetric matrices, the result is actually tridiagonal. We\ncan do the reduction using Householder reflections or Givens rotations.\nAt this point the QR decomposition (using Givens rotations) is applied\niteratively (to a version of the matrix in which the diagonals are\nshifted), and the result converges to a diagonal matrix, which provides\nthe eigenvalues. It's more work to get the eigenvectors, but they are\nobtained as a product of Householder matrices (required for the initial\nreduction) multiplied by the product of the $Q$ matrices from the\nsuccessive QR decompositions.\n\nWe won't go into the algorithm in detail, but note that it involves\nmanipulations and ideas we've seen already.\n\nIf only the largest (or the first few largest) eigenvalues and their\neigenvectors are needed, which can come up in time series and Markov\nchain contexts, the problem is easier and can be solved by the *power\nmethod*. E.g., in a Markov chain context, steady state is reached\nthrough $x_{t}=A^{t}x_{0}$. One can find the largest eigenvector by\nmultiplying by $A$ many times, normalizing at each step.\n$v^{(k)}=Az^{(k-1)}$ and $z^{(k)}=v^{(k)}/\\|v^{(k)}\\|$. There is an\nextension to find the $p$ largest eigenvalues and their vectors. See the\ndemo code in the qmd source file for an implementation (in R).\n\n\n\n\n\n## Singular value decomposition\n\nLet's consider an $n\\times m$ matrix, $A$, with $n\\geq m$ (if $m>n$, we\ncan always work with $A^{\\top})$. This often is a matrix representing\n$m$ features of $n$ observations. We could have $n$ documents and $m$\nwords, or $n$ gene expression levels and $m$ experimental conditions,\netc. $A$ can always be decomposed as $$A=UDV^{\\top}$$ where $U$ and $V$\nare matrices with orthonormal columns (left and right eigenvectors) and\n$D$ is diagonal with non-negative values (which correspond to\neigenvalues in the case of square $A$ and to squared eigenvalues of\n$A^{\\top}A$).\n\nThe SVD can be represented in more than one way. One representation is\n$$A_{n\\times m}=U_{n\\times k}D_{k\\times k}V_{k\\times m}^{\\top}=\\sum_{j=1}^{k}D_{jj}u_{j}v_{j}^{\\top}$$\nwhere $u_{j}$ and $v_{j}$ are the columns of $U$ and $V$ and where $k$\nis the rank of $A$ (which is at most the minimum of $n$ and $m$ of\ncourse). The diagonal elements of $D$ are the singular values.\n\nThat representation is as the sum of rank-one matrices (since\neach term is the scaled outer product of two vectors). \n\nIf $A$ is positive semi-definite, the eigendecomposition is an SVD.\nFurthermore, $A^{\\top}A=VD^{2}V^{\\top}$ and $AA^{\\top}=UD^{2}U^{\\top}$,\nso we can find the eigendecomposition of such matrices using the SVD of\n$A$ (for $AA^{\\top}$ we need to fill out $U$ to have $n$ columns). 
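\n\nHere is a quick numerical check of that relationship (a sketch using a\nrandom matrix).\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\nnp.random.seed(1)\nA = np.random.normal(size = (5, 3))\nU, d, Vt = np.linalg.svd(A, full_matrices = False)\nevals = np.linalg.eigh(A.T @ A)[0]  # ascending order\nevals[::-1]  # should match the squared singular values (up to numerical error)\nd ** 2\n```\n:::\n\n\n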
Note\nthat the squares of the singular values of $A$ are the eigenvalues of\n$A^{\\top}A$ and $AA^{\\top}$.\n\nWe can also fill out the matrices to get\n$$A=U_{n\\times n}D_{n\\times m}V_{m\\times m}^{\\top}$$ where the added\nrows and columns of $D$ are zero with the upper left block the\n$D_{k\\times k}$ from above.\n\n#### Uses\n\nThe SVD is an excellent way to determine a matrix rank and to construct\na pseudo-inverse ($A^{+}=VD^{+}U^{\\top})$.\n\nWe can use the SVD to approximate $A$ by taking\n$A\\approx\\tilde{A}=\\sum_{j=1}^{p}D_{jj}u_{j}v_{j}^{\\top}$ for $pj+p$ and the upper bandwidth is $q$ ($A_{ij}=0$\nfor $j>i+q$). An alternative to reducing to $Ux=b^{*}$ is to compute\n$A=LU$ and then do two solutions, $U^{-1}(L^{-1}b)$. One can show that\nthe computational complexity of the LU factorization is $O(npq)$ for\nbanded matrices, while solving the two triangular systems is $O(np+nq)$,\nso for small $p$ and $q$, the speedup can be dramatic.\n\nBanded matrices come up in time series analysis. E.g., moving average (MA) models produce\nbanded covariance structures because the covariance is zero after a\ncertain number of lags.\n\n## Low rank updates (optional)\n\nA transformation of the form $A-uv^{\\top}$ is a rank-one update because\n$uv^{\\top}$ is of rank one.\n\nMore generally a low rank update of $A$ is $\\tilde{A}=A-UV^{\\top}$ where\n$U$ and $V$ are $n\\times m$ with $n\\geq m$. The\nSherman-Morrison-Woodbury formula tells us that\n$$\\tilde{A}^{-1}=A^{-1}+A^{-1}U(I_{m}-V^{\\top}A^{-1}U)^{-1}V^{\\top}A^{-1}$$\nso if we know $x_{0}=A^{-1}b$, then the solution to $\\tilde{A}x=b$ is\n$x+A^{-1}U(I_{m}-V^{\\top}A^{-1}U)^{-1}V^{\\top}x$. Provided $m$ is not\ntoo large, and particularly if we already have a factorization of $A$,\nthen $A^{-1}U$ is not too bad computationally, and\n$I_{m}-V^{\\top}A^{-1}U$ is $m\\times m$. As a result\n$A^{-1}(U(\\cdots)^{-1}V^{\\top}x)$ isn't too bad.\n\nThis also comes up in working with precision matrices in Bayesian\nproblems where we may have $A^{-1}$ but not $A$ (we often add precision\nmatrices to find conditional normal distributions). An alternative\nexpression for the formula is $\\tilde{A}=A+UCV^{\\top}$, and the identity\ntells us\n$$\\tilde{A}^{-1}=A^{-1}-A^{-1}U(C^{-1}+V^{\\top}A^{-1}U)^{-1}V^{\\top}A^{-1}$$\n\nBasically Sherman-Morrison-Woodbury gives us matrix identities that we\ncan use in combination with our knowledge of smart ways of solving\nsystems of equations.\n\n# 7. Iterative solutions of linear systems (optional)\n\n#### Gauss-Seidel\n\nSuppose we want to iteratively solve $Ax=b$. Here's the algorithm, which\nsequentially updates each element of $x$ in turn.\n\n- Start with an initial approximation, $x^{(0)}$.\n- Hold all but $x_{1}^{(0)}$ constant and solve to find\n $x_{1}^{(1)}=\\frac{1}{a_{11}}(b_{1}-\\sum_{j=2}^{n}a_{1j}x_{j}^{(0)})$.\n- Repeat for the other rows of $A$ (i.e., the other elements of $x$),\n finding $x^{(1)}$.\n- Now iterate to get $x^{(2)}$, $x^{(3)}$, etc. until a convergence\n criterion is achieved, such as $\\|x^{(k)}-x^{(k-1)}\\|\\leq\\epsilon$\n or $\\|r^{(k)}-r^{(k-1)}\\|\\leq\\epsilon$ for $r^{(k)}=b-Ax^{(k)}$.\n\nLet's consider how many operations are involved in a single update:\n$O(n)$ for each element, so $O(n^{2})$ for each update. 
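\n\nTo make the updates concrete, here is a minimal sketch of a few\nGauss-Seidel sweeps in Python, assuming `A` and `b` are already defined\nand that the diagonal of `A` is non-zero; a real implementation would\nalso check a convergence criterion like those above.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\ndef gauss_seidel(A, b, x0, n_sweeps = 50):\n    # Each sweep updates the elements of x in turn, using the newest values.\n    x = x0.astype(float)  # copies x0\n    n = len(b)\n    for sweep in range(n_sweeps):\n        for i in range(n):\n            x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]\n    return x\n```\n:::\n\n\n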
Thus if we can\nstop well before $n$ iterations, we've saved computation relative to\nexact methods.\n\nIf we decompose $A=L+D+U$ where $L$ is strictly lower triangular, $U$ is\nstrictly upper triangular, then Gauss-Seidel is equivalent to solving\n$$(L+D)x^{(k+1)}=b-Ux^{(k)}$$ and we know that solving the lower\ntriangular system is $O(n^{2})$.\n\nIt turns out that the rate of convergence depends on the spectral radius\nof $(L+D)^{-1}U$.\n\nGauss-Seidel amounts to optimizing by moving in axis-oriented\ndirections, so it can be slow in some cases.\n\n#### Conjugate gradient\n\nFor positive definite $A$, conjugate gradient (CG) reexpresses the\nsolution to $Ax=b$ as an optimization problem, minimizing\n$$f(x)=\\frac{1}{2}x^{\\top}Ax-x^{\\top}b,$$ since the derivative of $f(x)$\nis $Ax-b$ and at the minimum this gives $Ax-b=0$.\n\nInstead of finding the minimum by following the gradient at each step\n(so-called steepest descent, which can give slow convergence - we'll see\na demonstration of this in the optimization unit), CG chooses directions\nthat are mutually conjugate w.r.t. $A$, $d_{i}^{\\top}Ad_{j}=0$ for\n$i\\ne j$. The method successively chooses vectors giving the direction,\n$d_{k}$, in which to move down towards the minimum and a scaling of how\nmuch to move, $\\alpha_{k}$. If we start at $x_{(0)}$, the $k$th point we\nmove to is $x_{(k)}=x_{(k-1)}+\\alpha_{k}d_{k}$ so we have\n$$x_{(k)}=x_{(0)}+\\sum_{j\\leq k}\\alpha_{j}d_{j}$$ and we use a\nconvergence criterion such as given above for Gauss-Seidel. The\ndirections are chosen to be the residuals, $b-Ax_{(k)}$. Here's the\nbasic algorithm:\n\n- Choose $x_{(0)}$ and define the residual, $r_{(0)}=b-Ax_{(0)}$ (the\n error on the scale of $b$) and the direction, $d_{0}=r_{(0)}$ and\n set $k=0$.\n\n- Then iterate:\n\n - $\\alpha_{k}=\\frac{r_{(k)}^{\\top}r_{(k)}}{d_{k}^{\\top}Ad_{k}}$\n (choose step size so next error will be orthogonal to current\n direction - which we can express in terms of the residual, which\n is easily computable)\n\n - $x_{(k+1)}=x_{(k)}+\\alpha_{k}d_{k}$ (update current value)\n\n - $r_{(k+1)}=r_{(k)}-\\alpha_{k}Ad_{k}$ (update current residual)\n\n - $d_{k+1}=r_{(k+1)}+\\frac{r_{(k+1)}^{\\top}r_{(k+1)}}{r_{(k)}^{\\top}r_{(k)}}d_{k}$\n (choose next direction by conjugate Gram-Schmidt, starting with\n $r_{(k+1)}$ and removing components that are not $A$-orthogonal\n to previous directions, but it turns out that $r_{(k+1)}$ is\n already $A$-orthogonal to all but $d_{k}$).\n\n- Stop when $\\|r^{(k+1)}\\|$ is sufficiently small.\n\nThe convergence of the algorithm depends in a complicated way on the\neigenvalues, but in general convergence is faster when the condition\nnumber is smaller (the eigenvalues are not too spread out). 
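\n\nHere is a minimal sketch of the iteration above in Python, for\nillustration only and assuming a positive definite `A`; in practice one\nwould typically use a library routine such as `scipy.sparse.linalg.cg()`.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\ndef conjugate_gradient(A, b, x0, n_iter = 25):\n    # Basic CG for positive definite A; in practice, stop based on the residual.\n    x = x0.astype(float)\n    r = b - A @ x   # initial residual\n    d = r.copy()    # initial direction\n    for k in range(n_iter):\n        Ad = A @ d\n        alpha = (r @ r) / (d @ Ad)  # step size\n        x = x + alpha * d\n        r_new = r - alpha * Ad      # updated residual\n        beta = (r_new @ r_new) / (r @ r)\n        d = r_new + beta * d        # next direction, A-orthogonal to the previous one\n        r = r_new\n    return x\n```\n:::\n\n\n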
CG will in\nprinciple give the exact answer in $n$ steps (where $A$ is $n\\times n$).\nHowever, computationally we lose accuracy and interest in the algorithm\nis really as an iterative approximation where we stop before $n$ steps.\nThe approach basically amounts to moving in axis-oriented directions in\na space stretched by $A$.\n\nIn general, CG is used for large sparse systems.\n\nSee the [extensive description from\nShewchuk](http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf)\nfor more details, as well as the use\nof CG when $A$ is not positive definite.\n\n#### Updating a solution\n\nSometimes we have solved a system, $Ax=b$ and then need to solve $Ax=c$.\nIf we have solved the initial system using a factorization, we can reuse\nthat factorization and solve the new system in $O(n^{2})$. Iterative\napproaches can do a nice job if $c=b+\\delta b$. Start with the solution\n$x$ for $Ax=b$ as $x^{(0)}$ and use one of the methods above.\n", + "markdown": "---\ntitle: \"Numerical linear algebra\"\nauthor: \"Chris Paciorek\"\ndate: \"2023-10-24\"\nformat:\n pdf:\n documentclass: article\n margin-left: 30mm\n margin-right: 30mm\n toc: true\n html:\n theme: cosmo\n css: ../styles.css\n toc: true\n code-copy: true\n code-block-background: true\nexecute:\n freeze: auto\nfrom: markdown+tex_math_single_backslash\n---\n\n\n\n[PDF](./unit10-linalg.pdf){.btn .btn-primary}\n\nReferences:\n\n- Gentle: Numerical Linear Algebra for Applications in Statistics\n (available via UC Library Search) (my notes here are based primarily\n on this source) [Gentle-NLA]\n - Gentle: Matrix Algebra also has much of this material.\n- Gentle: Computational Statistics [Gentle-CS]\n- Lange: Numerical Analysis for Statisticians\n- Monahan: Numerical Methods of Statistics\n\nVideos (optional): \n\nThere are various videos from 2020 in the bCourses Media Gallery that you\ncan use for reference if you want to. \n\n - Video 1. Ill-conditioned problems, part 1\n - Video 2. Ill-conditioned problems, part 2\n - Video 3. Triangular systems of equations\n - Video 4. Solving systems of equations via LU, part 1\n - Video 5. Solving systems of equations via LU, part 2\n - Video 6. Solving systems of equations via LU, part 3\n - Video 7. Cholesky decomposition\n\n\nIn working through how to compute something or understanding an\nalgorithm, it can be very helpful to depict the matrices and vectors\ngraphically. We'll see this on the board in class.\n\n# 1. Preliminaries\n\n## Context\n\nMany statistical and machine learning methods involve linear algebra of\nsome sort - at the very least matrix multiplication and very often some\nsort of matrix decomposition to fit models and do analysis: linear\nregression, various more sophisticated forms of regression, deep neural\nnetworks, principle components analysis (PCA) and the wide varieties of\ngeneralizations and variations on PCA, etc., etc.\n\n## Goals\n\nHere's what I'd like you to get out of this unit:\n\n1. How to think about the computational order (number of computations\n involved) of a problem\n2. How to choose a computational approach to a given linear algebra\n calculation you need to do.\n3. 
An understanding of how issues with computer numbers (Unit 8) affect\n linear algebra calculations.\n\n## Key principle\n\n**The form of a mathematical expression and how it should be evaluated\non a computer may be very different.** Better computational approaches\ncan increase speed and improve the numerical properties of the\ncalculation.\n\n- Example 1 (already seen in Unit 5): If $X$ and $Y$ are matrices and $z$\nis a vector, we should compute $X(Yz)$ rather than $(XY)z$; the former\nis much more computationally efficient. \n- Example 2: We do not compute $(X^{\\top}X)^{-1}X^{\\top}Y$ by computing\n$X^{\\top}X$ and finding its inverse. In fact, perhaps more surprisingly,\nwe may never actually form $X^{\\top}X$ in some implementations.\n- Example 3: Suppose I have a matrix $A$, and I want to permute (switch)\ntwo rows. I can do this with a permutation matrix, $P$, which is mostly\nzeroes. On a computer, in general I wouldn't need to even change the\nvalues of $A$ in memory in some cases (e.g., if I were to calculate\n$PAB$). Why not?\n\n## Computational complexity\n\nWe can assess the computational complexity of a linear algebra\ncalculation by counting the number multiplys/divides and the number of\nadds/subtracts. Sidenote: addition is a bit faster than multiplication,\nso some algorithms attempt to trade multiplication for addition.\n\nIn general we do not try to count the actual number of calculations, but\njust their order, though in some cases in this unit we'll actually get a\nmore exact count. In general, we denote this as $O(f(n))$ which means\nthat the number of calculations approaches $cf(n)$ as $n\\to\\infty$\n(i.e., we know the calculation is approximately proportional to $f(n)$).\nConsider matrix multiplication, $AB$, with matrices of size $a\\times b$\nand $b\\times c$. Each column of the second matrix is multiplied by all\nthe rows of the first. For any given inner product of a row by a column,\nwe have $b$ multiplies. We repeat these operations for each column and\nthen for each row, so we have $abc$ multiplies so $O(abc)$ operations.\nWe could count the additions as well, but there's usually an addition\nfor each multiply, so we can usually just count the multiplys and then\nsay there are such and such {multiply and add}s. This is Monahan's\napproach, but you may see other counting approaches where one counts the\nmultiplys and the adds separately.\n\nFor two symmetric, $n\\times n$ matrices, this is $O(n^{3})$. Similarly,\nmatrix factorization (e.g., the Cholesky decomposition) is $O(n^{3})$\nunless the matrix has special structure, such as being sparse. As\nmatrices get large, the speed of calculations decreases drastically\nbecause of the scaling as $n^{3}$ and memory use increases drastically.\nIn terms of memory use, to hold the result of the multiply indicated\nabove, we need to hold $ab+bc+ac$ total elements, which for symmetric\nmatrices sums to $3n^{2}$. So for a matrix with $n=10000$, we have\n$3\\cdot10000^{2}\\cdot8/1e9=2.4$Gb.\n\nWhen we have $O(n^{q})$ this is known as polynomial time. Much worse is\n$O(b^{n})$ (exponential time), while much better is $O(\\log n$) (log\ntime). Computer scientists talk about NP-complete problems; these are\nessentially problems for which there is not a polynomial time\nalgorithm - it turns out all such problems can be rewritten such that\nthey are equivalent to one another.\n\nIn real calculations, it's possible to have the actual time ordering of\ntwo approaches differ from what the order approximations tell us. 
For\nexample, something that involves $n^{2}$ operations may be faster than\none that involves $1000(n\\log n+n)$ even though the former is $O(n^{2})$\nand the latter $O(n\\log n)$. The problem is that the constant, $c=1000$,\ncan matter (depending on how big $n$ is), as can the extra calculations\nfrom the lower order term(s), in this case $1000n$.\n\nA note on terminology: *flops* stands for both floating point operations\n(the number of operations required) and floating point operations per\nsecond, the speed of calculation.\n\n## Notation and dimensions\n\nI'll try to use capital letters for matrices, $A$, and lower-case for\nvectors, $x$. Then $x_{i}$ is the ith element of $x$, $A_{ij}$ is the\n$i$th row, $j$th column element, and $A_{\\cdot j}$ is the $j$th column\nand $A_{i\\cdot}$ the $i$th row. By default, we'll consider a vector,\n$x$, to be a one-column matrix, and $x^{\\top}$ to be a one-row matrix.\nSome of the references given at the start of this Unit also use $a_{ij}$ for $A_{ij}$ and\n$a_{j}$ for the $j$th column.\n\nThroughout, we'll need to be careful that the matrices involved in an\noperation are conformable: for $A+B$ both matrices need to be of the\nsame dimension, while for $AB$ the number of columns of $A$ must match\nthe number of rows of $B$. Note that this allows for $B$ to be a column\nvector, with only one column, $Ab$. Just checking dimensions is a good\nway to catch many errors. Example: is\n$\\mbox{Cov}(Ax)=A\\mbox{Cov}(x)A^{\\top}$ or\n$\\mbox{Cov}(Ax)=A^{\\top}\\mbox{Cov}(x)A$? Well, if $A$ is $m\\times n$, it\nmust be the former, as the latter is not conformable.\n\nThe **inner product** of two vectors is\n$\\sum_{i}x_{i}y_{i}=x^{\\top}y\\equiv\\langle x,y\\rangle\\equiv x\\cdot y$.\n\nThe **outer product** is $xy^{\\top}$, which comes from all pairwise\nproducts of the elements.\n\nWhen the indices of summation should be obvious, I'll sometimes leave\nthem implicit. Ask me if it's not clear.\n\n## Norms\n\nFor a vector, $\\|x\\|_{p}=(\\sum_{i}|x_{i}|^{p})^{1/p}$ and the standard (Euclidean)\nnorm is $\\|x\\|_{2}=\\sqrt{\\sum x_{i}^{2}}=\\sqrt{x^{\\top}x}$, just the\nlength of the vector in Euclidean space, which we'll refer to as\n$\\|x\\|$, unless noted otherwise.\n\nOne commonly used norm for a matrix is\nthe Frobenius norm, $\\|A\\|_{F}=(\\sum_{i,j}a_{ij}^{2})^{1/2}$.\n\nIn this Unit, we'll often make use of the **induced matrix norm**, which is\ndefined relative to a corresponding vector norm, $\\|\\cdot\\|$, as:\n$$\\|A\\|=\\sup_{x\\ne0}\\frac{\\|Ax\\|}{\\|x\\|}$$ So we have\n$$\\|A\\|_{2}=\\sup_{x\\ne0}\\frac{\\|Ax\\|_{2}}{\\|x\\|_{2}}=\\sup_{\\|x\\|_{2}=1}\\|Ax\\|_{2}$$\nIf you're not familiar with the supremum (\"sup\" above), you can just\nthink of it as taking the maximum. In the case of the 2-norm, the norm turns\nout to be the largest singular value in the singular value decomposition (SVD)\nof the matrix.\n\nWe can interpret the norm of a matrix as the most that the matrix\ncan stretch a vector when multiplying by the vector (relative to the\nlength of the vector).\n\nA property of any legitimate matrix norm (including the induced norm) is\nthat $\\|AB\\|\\leq\\|A\\|\\|B\\|$. 
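\n\nAs a quick illustration (a sketch using a random vector and matrix), the\ncorresponding norms are easy to compute with `numpy`.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\nnp.random.seed(1)\nx = np.random.normal(size = 5)\nA = np.random.normal(size = (4, 5))\n\nnp.linalg.norm(x)          # Euclidean norm of the vector\nnp.linalg.norm(A, 'fro')   # Frobenius norm\nnp.linalg.norm(A, 2)       # induced 2-norm\nnp.linalg.svd(A)[1][0]     # equals the largest singular value\n```\n:::\n\n\n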
Also recall that norms must obey the triangle\ninequality, $\\|A+B\\|\\leq\\|A\\|+\\|B\\|$.\n\nA normalized vector is one with \"length\", i.e., Euclidean norm, of one.\nWe can easily normalize a vector: $\\tilde{x}=x/\\|x\\|$\n\nThe angle between two vectors is\n$$\\theta=\\cos^{-1}\\left(\\frac{\\langle x,y\\rangle}{\\sqrt{\\langle x,x\\rangle\\langle y,y\\rangle}}\\right)$$\n\n## Orthogonality\n\nTwo vectors are orthogonal if $x^{\\top}y=0$, in which case we say\n$x\\perp y$. An **orthogonal matrix** is a square matrix in which all of the\ncolumns are orthogonal to each other and normalized. The same holds for\nthe rows. Orthogonal matrices\ncan be shown to have full rank. Furthermore if $A$ is orthogonal,\n$A^{\\top}A=I$, so $A^{-1}=A^{\\top}$. Given all this, the determinant of\northogonal $A$ is either 1 or -1. Finally the product of two orthogonal\nmatrices, $A$ and $B$, is also orthogonal since\n$(AB)^{\\top}AB=B^{\\top}A^{\\top}AB=B^{\\top}B=I$.\n\n#### Permutations\n\nSometimes we make use of matrices that permute two rows (or two columns)\nof another matrix when multiplied. Such a matrix is known as an\nelementary permutation matrix and is an orthogonal matrix with a\ndeterminant of -1. You can multiply such matrices to get more general\npermutation matrices that are also orthogonal. If you premultiply by\n$P$, you permute rows, and if you postmultiply by $P$ you permute\ncolumns. Note that on a computer, you wouldn't need to actually do the\nmultiply (and if you did, you should use a sparse matrix routine), but\nrather one can often just rework index values that indicate where\nrelevant pieces of the matrix are stored (more in the next section).\n\n## Some vector and matrix properties\n\n$AB\\ne BA$ but $A+B=B+A$ and $A(BC)=(AB)C$.\n\nIn Python, recall the syntax is\n\n\n::: {.cell}\n\n```{.python .cell-code}\nA + B\n\n# Matrix multiplication\nnp.matmul(A, B) \nA @ B # alternative\nA.dot(B) # not recommended by the NumPy docs\n\nA * B # Hadamard (direct) product\n```\n:::\n\n\nYou don't need the spaces, but they're nice for code readability.\n\n## Trace and determinant of square matrices\n\nThe trace of a matrix is the sum of the diagonal elements. For square\nmatrices, $\\mbox{tr}(A+B)=\\mbox{tr}(A)+\\mbox{tr}(B)$,\n$\\mbox{tr}(A)=\\mbox{tr}(A^{\\top})$.\n\nWe also have $\\mbox{tr}(ABC)=\\mbox{tr}(CAB)=\\mbox{tr}(BCA)$ - basically\nyou can move a matrix from the beginning to the end or end to beginning,\nprovided they are conformable for this operation. This is helpful for a\ncouple reasons:\n\n1. We can find the ordering that reduces computation the most if the\n individual matrices are not square.\n2. $x^{\\top}Ax=\\mbox{tr}(x^{\\top}Ax)$ since the quadratic form,\n $x^{\\top}Ax$, is a scalar, and this is equal to\n $\\mbox{tr}(xx^{\\top}A)$ where $xx^{\\top}A$ is a matrix. It can be\n helpful to be able to go back and forth between a scalar and a trace\n in some statistical calculations.\n\nFor square matrices, the determinant exists and we have $|AB|=|A||B|$\nand therefore, $|A^{-1}|=1/|A|$ since $|I|=|AA^{-1}|=1$. Also\n$|A|=|A^{\\top}|$, which can be seen using the QR decomposition for $A$\nand understanding properties of determinants of triangular matrices (in\nthis case $R$) and orthogonal matrices (in this case $Q$).\n\n## Transposes and inverses\n\nFor square, invertible matrices, we have that\n$(A^{-1})^{\\top}=(A^{\\top})^{-1}$. Why? 
Since we have\n$(AB)^{\\top}=B^{\\top}A^{\\top}$, we have:\n$$A^{\\top}(A^{-1})^{\\top}=(A^{-1}A)^{\\top}=I$$ so\n$(A^{\\top})^{-1}=(A^{-1})^{\\top}$.\n\nFor two invertible matrices, we have that\n$(AB)^{-1} = B^{-1}A^{-1}$ since\n$B^{-1}A^{-1} AB = I$.\n\n#### Other matrix multiplications\n\nThe Hadamard or direct product is simply multiplication of the\ncorrespoding elements of two matrices by each other. In R this is\nsimply` A * B`.\\\n**Challenge**: How can I find $\\mbox{tr}(AB)$ without using `A %*% B` ?\n\nThe Kronecker product is the product of each element of one matrix with\nthe entire other matrix\"\n\n$$A\\otimes B=\\left(\\begin{array}{ccc}\nA_{11}B & \\cdots & A_{1m}B\\\\\n\\vdots & \\ddots & \\vdots\\\\\nA_{n1}B & \\cdots & A_{nm}B\n\\end{array}\\right)$$\n\nThe inverse of a Kronecker product is the Kronecker product of the\ninverses,\n\n$$ B^{-1} \\otimes A^{-1} $$ \n\nwhich is obviously quite a bit faster because\nthe inverse (i.e., solving a system of equations) in this special case\nis $O(n^{3}+m^{3})$ rather than the naive approach being $O((nm)^{3})$.\n\n## Matrix decompositions\n\nA matrix decomposition is a re-expression of a matrix, $A$, in terms of\na product of two or three other, simpler matrices, where the\ndecomposition reveals structure or relationships present in the original\nmatrix, $A$. The \"simpler\" matrices may be simpler in various ways,\nincluding\n\n- having fewer rows or columns;\n- being diagonal, triangular or sparse in some way,\n- being orthogonal matrices.\n\nIn addition, once you have a decomposition, computation is generally\neasier, because of the special structure of the simpler matrices.\n\nWe'll see this in great detail in Section 3.\n\n# 2. Statistical interpretations of matrix invertibility, rank, etc.\n\n## Linear independence, rank, and basis vectors\n\nA set of vectors, $v_{1},\\ldots v_{n}$, is linearly independent (LIN)\nwhen none of the vectors can be represented as a linear combination,\n$\\sum c_{i}v_{i}$, of the others for scalars, $c_{1},\\ldots,c_{n}$. If\nwe have vectors of length $n$, we can have at most $n$ linearly\nindependent vectors. The rank of a matrix is the number of linearly\nindependent rows (or columns - it's the same), and is at most the\nminimum of the number of rows and number of columns. We'll generally\nthink about it in terms of the dimension of the column space - so we can\njust think about the number of linearly independent columns.\n\nAny set of linearly independent vectors (say $v_{1},\\ldots,v_{n}$) span\na space made up of all linear combinations of those vectors\n($\\sum_{i=1}^{n}c_{i}v_{i}$). The spanning vectors are known as basis\nvectors. We can express a vector $y$ that is in the space with respect\nto (as a linear combination of) basis vectors as $y=\\sum_{i}c_{i}v_{i}$,\nwhere if the basis vectors are normalized and orthogonal, we can find\nthe weights as $c_{i}=\\langle y,v_{i}\\rangle$.\n\nConsider a regression context. We have $p$ covariates ($p$ columns in\nthe design matrix, $X$), of which $q\\leq p$ are linearly independent\ncovariates. This means that $p-q$ of the vectors can be written as\nlinear combos of the $q$ vectors. The space spanned by the covariate\nvectors is of dimension $q$, rather than $p$, and $X^{\\top}X$ has $p-q$\neigenvalues that are zero. The $q$ LIN vectors are basis vectors for the\nspace - we can represent any point in the space as a linear combination\nof the basis vectors. 
You can think of the basis vectors as being like\nthe axes of the space, except that the basis vectors are not orthogonal.\nSo it's like denoting a point in $\\Re^{q}$ as a set of $q$ numbers\ntelling us where on each of the axes we are - this is the same as a\nlinear combination of axis-oriented vectors.\n\nWhen fitting a regression, if $n=p=q$, a vector of $n$ observations can\nbe represented exactly as a linear combination of the $p$ basis vectors,\nso there is no residual and we have a single unique (and exact) solution\n(e.g., with $n=p=2$, the observations fall exactly on the simple linear\nregression line). If $np$, so the system is overdetermined - there is no exact\nsolution, but regression is all about finding solutions that minimize\nsome criterion about the differences between the observations and linear\ncombinations of the columns of the $X$ matrix (such as least squares or\npenalized least squares). In standard regression, we project the\nobservation vector onto the space spanned by the columns of the $X$\nmatrix, so we find the point in the space closest to the observation\nvector.\n\n## Invertibility, singularity, rank, and positive definiteness\n\nFor square matrices, let's consider how invertibility, singularity, rank\nand positive (or non-negative) definiteness relate.\n\nSquare matrices that are \"regular\" have an eigendecomposition,\n$A=\\Gamma\\Lambda\\Gamma^{-1}$ where $\\Gamma$ is a matrix with the\neigenvectors as the columns and $\\Lambda$ is a diagonal matrix of\neigenvalues, $\\Lambda_{ii}=\\lambda_{i}$. Symmetric matrices and matrices\nwith unique eigenvalues are regular, as are some other matrices. The\nnumber of non-zero eigenvalues is the same as the rank of the matrix.\nSquare matrices that have an inverse are also called nonsingular, and\nthis is equivalent to having full rank. If the matrix is symmetric, the\neigenvectors and eigenvalues are real and $\\Gamma$ is orthogonal, so we\nhave $A=\\Gamma\\Lambda\\Gamma^{\\top}$. The determinant of the matrix is\nthe product of the eigenvalues (why?), which is zero if it is less than\nfull rank. Note that if none of the eigenvalues are zero then\n$A^{-1}=\\Gamma\\Lambda^{-1}\\Gamma^{\\top}$.\n\nLet's focus on symmetric matrices. The symmetric matrices that tend to\narise in statistics are either positive definite (p.d.) or non-negative\ndefinite (n.n.d.). If a matrix is positive definite, then by definition\n$x^{\\top}Ax>0$ for any $x$. Note that if $\\mbox{Cov}(y)=A$ then\n$x^{\\top}Ax=x^{\\top}\\mbox{Cov}(y)x=\\mbox{Cov}(x^{\\top}y)=\\mbox{Var}(x^{\\top}y)$\nif so positive definiteness amounts to having linear combinations of\nrandom variables (with the elements of $x$ here being the weights)\nhaving positive variance. So we must have that positive definite\nmatrices are equivalent to variance-covariance matrices (I'll just refer\nto this as a variance matrix or as a covariance matrix). If $A$ is p.d.\nthen it has all positive eigenvalues and it must have an inverse, though\nas we'll see, from a numerical perspective, we may not be able to\ncompute it if some of the eigenvalues are very close to zero. 
In Python,\n`numpy.linalg.eig(A)[1]` is $\\Gamma$, with each column a vector, and\n`numpy.linalg.eig(A)[0]` contains the (unordered) eigenvalues.\n\nTo summarize, here are some of the various connections between\nmathematical and statistical properties of **positive definite**\nmatrices:\n\n$A$ positive definite $\\Leftrightarrow$ $A$ is a covariance matrix\n$\\Leftrightarrow$ $x^{\\top}Ax>0$ $\\Leftrightarrow$ $\\lambda_{i}>0$\n(positive eigenvalues) $\\Rightarrow$$|A|>0$ $\\Rightarrow$$A$ is\ninvertible $\\Leftrightarrow$ $A$ is non singular $\\Leftrightarrow$ $A$ is\nfull rank.\n\nAnd here are connections for positive semi-definite matrices:\n\n$A$ positive semi-definite $\\Leftrightarrow$ $A$ is a constrained\ncovariance matrix $\\Leftrightarrow$ $x^{\\top}Ax\\geq0$ and equal to 0 for\nsome $x$ $\\Leftrightarrow$ $\\lambda_{i}\\geq 0$ (non-negative eigenvalues),\nwith at least one zero $\\Rightarrow$ $|A|=0$ $\\Leftrightarrow$ $A$ is not\ninvertible $\\Leftrightarrow$ $A$ is singular $\\Leftrightarrow$ $A$ is not\nfull rank.\n\n## Interpreting an eigendecomposition\n\nLet's interpret the eigendecomposition in a generative context as a way\nof generating random vectors. We can generate $y$ s.t. $\\mbox{Cov}(y)=A$\nif we generate $y=\\Gamma\\Lambda^{1/2}z$ where $\\mbox{Cov}(z)=I$ and\n$\\Lambda^{1/2}$ is formed by taking the square roots of the eigenvalues.\nSo $\\sqrt{\\lambda_{i}}$ is the standard deviation associated with the\nbasis vector $\\Gamma_{\\cdot i}$. That is, the $z$'s provide the weights\non the basis vectors, with scaling based on the eigenvalues. So $y$ is\nproduced as a linear combination of eigenvectors as basis vectors, with\nthe variance attributable to the basis vectors determined by the\neigenvalues.\n\nTo go the other direction, we can project a vector $y$ onto the space \nspanned by the eigenvectors: $w = (\\Gamma^{\\top}\\Gamma)^{-1}\\Gamma^{\\top}y = \\Gamma^{\\top}y = \\Lambda^{1/2}z$,\nwhere the simplification of course comes from $\\Gamma$ being orthogonal.\n\nIf $x^{\\top}Ax\\geq0$ then $A$ is nonnegative definite (also called\npositive semi-definite). In this case one or more eigenvalues can be\nzero. Let's interpret this a bit more in the context of generating\nrandom vectors based on non-negative definite matrices,\n$y=\\Gamma\\Lambda^{1/2}z$ where $\\mbox{Cov}(z)=I$. Questions:\n\n1. What does it mean when one or more eigenvalue (i.e.,\n $\\lambda_{i}=\\Lambda_{ii}$) is zero?\n\n2. Suppose I have an eigenvalue that is very small and I set it to\n zero? What will be the impact upon $y$ and $\\mbox{Cov}(y)$?\n\n3. Now let's consider the inverse of a covariance matrix, known as the\n precision matrix, $A^{-1}=\\Gamma\\Lambda^{-1}\\Gamma^{\\top}$. What\n does it mean if a $(\\Lambda^{-1})_{ii}$ is very large? What if\n $(\\Lambda^{-1})_{ii}$ is very small?\n\nConsider an arbitrary $n\\times p$ matrix, $X$. Any crossproduct or sum\nof squares matrix, such as $X^{\\top}X$ is positive definite\n(non-negative definite if $p>n$). This makes sense as it's just a\nscaling of an empirical covariance matrix.\n\n## Generalized inverses (optional)\n\nSuppose I want to find $x$ such that $Ax=b$. Mathematically the answer\n(provided $A$ is invertible, i.e. of full rank) is $x=A^{-1}b$.\n\nGeneralized inverses arise in solving equations when $A$ is not full\nrank. A generalized inverse is a matrix, $A^{-}$ s.t. $AA^{-}A=A$. 
The\nMoore-Penrose inverse (the pseudo-inverse), $A^{+}$, is a (unique)\ngeneralized inverse that also satisfies some additional properties.\n$x=A^{+}b$ is the solution to the linear system, $Ax=b$, that has the\nshortest length for $x$.\n\nWe can find the pseudo-inverse based on an eigendecomposition (or an\nSVD) as $\\Gamma\\Lambda^{+}\\Gamma^{\\top}$. We obtain $\\Lambda^{+}$ from\n$\\Lambda$ as follows. For values $\\lambda_{i}>0$, compute\n$1/\\lambda_{i}$. All other values are set to 0. Let's interpret this\nstatistically. Suppose we have a precision matrix with one or more zero\neigenvalues and we want to find the covariance matrix. A zero eigenvalue\nmeans we have no precision, or infinite variance, for some linear\ncombination (i.e., for some basis vector). We take the pseudo-inverse\nand assign that linear combination zero variance.\n\nLet's consider a specific example. Autoregressive models are often used\nfor smoothing (in time, in space, and in covariates). A first order\nautoregressive model for $y_{1},y_{2},\\ldots,y_{T}$ has\n$E(y_{i}|y_{-i})=\\frac{1}{2}(y_{i-1}+y_{i+1})$. Another way of writing\nthe model is in time-order: $y_{i}=y_{i-1}+\\epsilon_{i}$. A second order\nautoregressive model has\n$E(y_{i}|y_{-i})=\\frac{1}{6}(4y_{i-1}+4y_{i+1}-y_{i-2}-y_{i+2})$. These\nconstructions basically state that each value should be a smoothed\nversion of its neighbors. One can figure out that the **precision**\nmatrix for $y$ in the first order model is $$\\left(\\begin{array}{ccccc}\n\\ddots & & \\vdots\\\\\n-1 & 2 & -1 & 0\\\\\n\\cdots & -1 & 2 & -1 & \\dots\\\\\n & 0 & -1 & 2 & -1\\\\\n & & \\vdots & & \\ddots\n\\end{array}\\right)$$ and in the second order model is\n\n$$\\left( \\begin{array}{ccccccc} \\ddots & & & \\vdots \\\\ 1 & -4 & 6 & -4 & 1 \\\\ \\cdots & 1 & -4 & 6 & -4 & 1 & \\cdots \\\\ & & 1 & -4 & 6 & -4 & 1 \\\\ & & & \\vdots \\end{array} \\right).$$ \n\nIf we look at the eigendecomposition of such\nmatrices, we see that in the first order case, the eigenvalue\ncorresponding to the constant eigenvector is zero.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\nprecMat = np.array([[1,-1,0,0,0],[-1,2,-1,0,0],[0,-1,2,-1,0],[0,0,-1,2,-1],[0,0,0,-1,1]])\ne = np.linalg.eig(precMat)\ne[0] # 4th eigenvalue is numerically zero\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([3.61803399e+00, 2.61803399e+00, 1.38196601e+00, 4.97762256e-17,\n 3.81966011e-01])\n```\n:::\n\n```{.python .cell-code}\ne[1][:,3] # constant eigenvector\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([0.4472136, 0.4472136, 0.4472136, 0.4472136, 0.4472136])\n```\n:::\n:::\n\n\nThis means we have no information about the overall level of $y$. So how\nwould we generate sample $y$ vectors? We can't put infinite variance on\nthe constant basis vector and still generate samples. 
Instead we use the\npseudo-inverse and assign ZERO variance to the constant basis vector.\nThis corresponds to generating realizations under the constraint that\n$\\sum y_{i}$ has no variation, i.e., $\\sum y_{i}=\\bar{y}=0$ - you can\nsee this by seeing that $\\mbox{Var}(\\Gamma_{\\cdot i}^{\\top}y)=0$ when\n$\\lambda_{i}=0$.\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# generate a realization\nevals = e[0]\nevals = 1/evals # variances\nevals[3] = 0 # generalized inverse\ny = e[1] @ ((evals ** 0.5) * np.random.normal(size = 5))\ny.sum()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n6.661338147750939e-16\n```\n:::\n:::\n\n\nIn the second order case, we have two non-identifiabilities: for the sum\nand for the linear component of the variation in $y$ (linear in the\nindices of $y$).\n\nI could parameterize a statistical model as $\\mu+y$ where $y$ has\ncovariance that is the generalized inverse discussed above. Then I allow\nfor both a non-zero mean and for smooth variation governed by the\nautoregressive structure. In the second-order case, I would need to add\na linear component as well, given the second non-identifiability.\n\n## Matrices arising in regression\n\nIn regression, we work with $X^{\\top}X$. Some properties of this matrix\nare that it is symmetric and non-negative definite (hence our use of\n$(X^{\\top}X)^{-1}$ in the OLS estimator). When is it not positive\ndefinite?\n\nFitted values are $X\\hat{\\beta}=X(X^{\\top}X)^{-1}X^{\\top}Y=HY$. The\n\"hat\" matrix, $H$, projects $Y$ into the column space of $X$. $H$ is\nidempotent: $HH=H$, which makes sense - once you've projected into the\nspace, any subsequent projection just gives you the same thing back. $H$\nis singular. Why? Also, under what special circumstance would it not be\nsingular?\n\n# 3. Computational issues\n\n## Storing matrices\n\nWe've discussed column-major and row-major storage of matrices. First,\nretrieval of matrix elements from memory is quickest when multiple\nelements are contiguous in memory. So in a column-major language (e.g.,\nR, Fortran), it is best to work with values in a common column (or\nentire columns) while in a row-major language (e.g., Python, C) for\nvalues in a common row.\n\nIn some cases, one can save space (and potentially speed) by overwriting\nthe output from a matrix calculation into the space occupied by an\ninput. This occurs in some clever implementations of matrix\nfactorizations.\n\n## Algorithms\n\nGood algorithms can change the efficiency of an algorithm by one or more\norders of magnitude, and many of the improvements in computational speed\nover recent decades have been in algorithms rather than in computer\nspeed.\n\nMost matrix algebra calculations can be done in multiple ways. For\nexample, we could compute $b=Ax$ in either of the following ways,\ndenoted here in pseudocode.\n\n1. Stack the inner products of the rows of $A$ with $x$.\\\n\n```\n for(i=1:n){ \n b_i = 0\n for(j=1:m){\n b_i = b_i + a_{ij} x_j\n }\n }\n```\n\n2. Take the linear combination (based on $x$) of the columns of $A$\\\n\n```\n for(i=1:n){ \n b_i = 0\n }\n for(j=1:m){\n for(i = 1:n){\n b_i = b_i + a_{ij} x_j \n }\n }\n```\n\nIn this case the two approaches involve the same number of operations\nbut the first might be better for row-major matrices (so might be how we\nwould implement in C) and the second for column-major (so might be how\nwe would implement in Fortran). \n\n**Challenge**: check whether the first\napproach is faster in Python. 
(Write the code just doing the outer loop and\ndoing the inner loop using vectorized calculation.)\n\n#### General computational issues\n\nThe same caveats we discussed in terms of computer arithmetic hold\nnaturally for linear algebra, since this involves arithmetic with many\nelements. Good implementations of algorithms are aware of the danger of\ncatastrophic cancellation and of the possibility of dividing by zero or\nby values that are near zero.\n\n## Ill-conditioned problems\n\n#### Basics\n\nA problem is ill-conditioned if small changes to values in the\ncomputation result in large changes in the result. This is quantified by\nsomething called the *condition number* of a calculation. For different\noperations there are different condition numbers.\n\nIll-conditionedness arises most often in terms of matrix inversion, so\nthe standard condition number is the \"condition number with respect to\ninversion\", which when using the $L_{2}$ norm is the ratio of the\nabsolute values of the largest to smallest eigenvalue. Here's an\nexample: $$A=\\left(\\begin{array}{cccc}\n10 & 7 & 8 & 7\\\\\n7 & 5 & 6 & 5\\\\\n8 & 6 & 10 & 9\\\\\n7 & 5 & 9 & 10\n\\end{array}\\right).$$ The solution of $Ax=b$ for $b=(32,23,33,31)$ is\n$x=(1,1,1,1)$, while the solution for $b+\\delta b=(32.1,22.9,33.1,30.9)$\nis $x+\\delta x=(9.2,-12.6,4.5,-1.1)$, where $\\delta$ is notation for a\nperturbation to the vector or matrix.\n\n\n::: {.cell}\n\n```{.python .cell-code}\ndef norm2(x):\n return(np.sum(x**2) ** 0.5)\n\nA = np.array([[10,7,8,7],[7,5,6,5],[8,6,10,9],[7,5,9,10]])\nb = np.array([32,23,33,31])\nx = np.linalg.solve(A, b)\n\nbPerturbed = np.array([32.1, 22.9, 33.1, 30.9])\nxPerturbed = np.linalg.solve(A, bPerturbed)\n\ndelta_b = bPerturbed - b\ndelta_x = xPerturbed - x\n```\n:::\n\n\n\nWhat's going on? Some manipulations with inequalities involving the\ninduced matrix norm (for any chosen vector norm, but we might as well\njust think about the Euclidean norm) (see Gentle-CS Sec. 5.1 or the derivation in class) give\n$$\\frac{\\|\\delta x\\|}{\\|x\\|}\\leq\\|A\\|\\|A^{-1}\\|\\frac{\\|\\delta b\\|}{\\|b\\|}$$\nwhere we define the condition number w.r.t. inversion as\n$\\mbox{cond}(A)\\equiv\\|A\\|\\|A^{-1}\\|$. We'll generally work with the\n$L_{2}$ norm, and for a nonsingular square matrix the result is that the\ncondition number is the ratio of the absolute values of the largest and\nsmallest magnitude eigenvalues. 
This makes sense since $\\|A\\|_{2}$ is\nthe absolute value of the largest magnitude eigenvalue of $A$ and\n$\\|A^{-1}\\|_{2}$ that of the inverse of the absolute value of the\nsmallest magnitude eigenvalue of $A$.\n\nWe see in the code below that the large disparity in eigenvalues of $A$\nleads to an effect predictable from our inequality above, with the\ncondition number helping us find an upper bound.\n\n\n::: {.cell}\n\n```{.python .cell-code}\ne = np.linalg.eig(A)\nevals = e[0]\nprint(evals)\n\n## relative perturbation in x much bigger than in b\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[3.02886853e+01 3.85805746e+00 1.01500484e-02 8.43107150e-01]\n```\n:::\n\n```{.python .cell-code}\nnorm2(delta_x) / norm2(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n8.19847546803699\n```\n:::\n\n```{.python .cell-code}\nnorm2(delta_b) / norm2(b)\n\n## ratio of relative perturbations\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n0.0033319453118976702\n```\n:::\n\n```{.python .cell-code}\n(norm2(delta_x) / norm2(x)) / (norm2(delta_b) / norm2(b))\n\n## ratio of largest and smallest magnitude eigenvalues\n## confusingly evals[2] is the smallest, not evals[3]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n2460.567236431514\n```\n:::\n\n```{.python .cell-code}\n(evals[0]/evals[2])\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n2984.092701676269\n```\n:::\n:::\n\n\nThe main use of these ideas for our purposes is in thinking about the\nnumerical accuracy of a linear system solution (Gentle-NLA Sec 3.4). On\na computer we have the system $$(A+\\delta A)(x+\\delta x)=b+\\delta b$$\nwhere the 'perturbation' is from the inaccuracy of computer numbers. Our\nexploration of computer numbers tells us that\n$$\\frac{\\|\\delta b\\|}{\\|b\\|}\\approx10^{-p};\\,\\,\\,\\frac{\\|\\delta A\\|}{\\|A\\|}\\approx10^{-p}$$\nwhere $p=16$ for standard double precision floating points. Following\nGentle, one gets the approximation\n\n$$\\frac{\\|\\delta x\\|}{\\|x\\|}\\approx\\mbox{cond}(A)10^{-p},$$ so if\n$\\mbox{cond}(A)\\approx10^{t}$, we have accuracy of order $10^{t-p}$\ninstead of $10^{-p}$. (Gentle cautions that this holds only if\n$10^{t-p}\\ll1$). So we can think of the condition number as giving us\nthe number of digits of accuracy lost during a computation relative to\nthe precision of numbers on the computer. E.g., a condition number of\n$10^{8}$ means we lose 8 digits of accuracy relative to our original 16\non standard systems. One issue is that estimating the condition number\nis itself subject to numerical error and requires computation of\n$A^{-1}$ (albeit not in the case of $L_{2}$ norm with square,\nnonsingular $A$) but see Golub and van Loan (1996; p. 76-78) for an\nalgorithm.\n\n#### Improving conditioning\n\nIll-conditioned problems in statistics often arise from collinearity of\nregressors. Often the best solution is not a numerical one, but\nre-thinking the modeling approach, as this generally indicates\nstatistical issues beyond just the numerical difficulties.\n\nA general comment on improving conditioning is that we want to avoid\nlarge differences in the magnitudes of numbers involved in a\ncalculation. In some contexts such as regression, we can center and\nscale the columns to avoid such differences - this will improve the\ncondition of the problem. 
E.g., in simple quadratic regression with\n$x=\\{1990,\\ldots,2010\\}$ (e.g., regressing on calendar years), we see\nthat centering and scaling the matrix columns makes a huge difference on\nthe condition number\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport statsmodels.api as sm\n\nt1 = np.arange(1990, 2011) # naive covariate\nX1 = np.column_stack((np.ones(21), t1, t1 ** 2))\n\nbeta = np.array([5, .1, .0001])\ny = X1 @ beta + np.random.normal(size = len(t1))\n\ne1 = np.linalg.eig(np.dot(X1.T, X1))\nnp.sort(e1[0])[::-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([3.36018564e+14, 7.69949736e+02, 2.24079720e-08])\n```\n:::\n\n```{.python .cell-code}\nnp.linalg.cond(np.dot(X1.T, X1)) # built-in!\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n4.653329826789808e+21\n```\n:::\n\n```{.python .cell-code}\nsm.OLS(y, X1).fit().params\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([ 1.60671997e+04, -1.59483225e+01, 4.10859375e-03])\n```\n:::\n\n```{.python .cell-code}\nt2 = t1 - 2000 # centered\nX2 = np.column_stack((np.ones(21), t2, t2 ** 2))\ne2 = np.linalg.eig(np.dot(X2.T, X2))\nwith np.printoptions(suppress=True):\n np.sort(e2[0])[::-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([50677.70427505, 770. , 9.29572495])\n```\n:::\n\n```{.python .cell-code}\nt3 = t2/10 # centered and scaled\nX3 = np.column_stack((np.ones(21), t3, t3 ** 2))\ne3 = np.linalg.eig(np.dot(X3.T, X3))\nwith np.printoptions(suppress=True):\n np.sort(e3[0])[::-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([24.11293487, 7.7 , 1.95366513])\n```\n:::\n:::\n\n\nI haven't shown the OLS results for the transformed version\nof the problem because the transformations modify the true\nparameter values, so there is a bit of arithmetic to be done,\nbut you should be able to verify that the second and third\napproaches give reasonable answers.\n\n\nThe basic story is that simple strategies often solve the problem, and\nthat you should be aware of the absolute and relative magnitudes\ninvolved in your calculations.\n\nOne rule of thumb is to try to work with numbers whose magnitude is\naround 1. We can often scale the values in our problem in order to do\nthis. I.e., change the units of your variables. Instead of personal\nincome in dollars, use personal income in thousands or hundreds of\nthousands of dollars.\n\n# 4. Matrix factorizations (decompositions) and solving systems of linear equations\n\nSuppose we want to solve the following linear system: \n\n$$\\begin{aligned} Ax & = & b\\\\ x & = & A^{-1}b \\end{aligned}$$ \n\n\nNumerically, this is never done by\nfinding the inverse and multiplying. Rather we solve the system using a\nmatrix decomposition (or equivalent set of steps). One approach uses\nGaussian elimination (equivalent to the LU decomposition), while another\nuses the Cholesky decomposition. There are also iterative methods that\ngenerate a sequence of approximations to the solution but reduce\ncomputation (provided they are stopped before the exact solution is\nfound).\n\nGentle-CS has a nice table overviewing the various factorizations (Table\n5.1, page 219). 
I've reproduced a variation on it here.\n\n\n\n| Name | Representation | Restrictions | Properties | Uses |\n| ----------| ----------------------| ------------| --------------------|-------------|\n| LU | $A_{nn}= L_{nn}U_{nn}$ | $A$ generally square | $L$ lower triangular; $U$ upper triangular | solving equations; inversion | \n| QR | $A_{nm}= Q_{nn}R_{nm}$ or $A_{nm}=Q_{nm}R_{mm}$(skinny) | | $Q$ orthogonal; $R$ upper triangular | regression |\n| Cholesky | $A_{nn}=U_{nn}^{\\top}U_{nn}$ | $A$ positive (semi-) definite | $U$ upper triangular | multivariate normal; covariance; solving equations; inversion |\n| Eigen decomposition | $A_{nn}=\\Gamma_{nn}\\Lambda_{nn}\\Gamma_{nn}^{\\top}$ | $A$ square, symmetric*| $\\Gamma$ orthogonal; $\\Lambda$ (non-negative**) diagonal | principal components analysis and related | \n| SVD | $A_{nm}=U_{nn}D_{nm}V_{mm}^{\\top}$ or $A_{nm}= U_{nk}D_{kk}V_{mk}^{\\top}$ | | $U, V$ orthogonal; $D$ (non-negative) diagonal | machine learning, topic models |\n\nTable: Matrix factorizations useful for statistics / data science / machine learning\n\n*For the eigen decomposition, I assume $A$ is symmetric, though there\nis a decomposition for non-symmetric $A$.\n\n** For positive definite or positive semi-definite $A$.\n\n## Triangular systems\n\nAs a preface, let's figure out how to solve $Ax=b$ if $A$ is upper\ntriangular. The basic algorithm proceeds from the bottom up (and\ntherefore is called a 'backsolve'. We solve for $x_{n}$ trivially, and\nthen move upwards plugging in the known values of $x$ and solving for\nthe remaining unknown in each row (each equation).\n\n1. $x_{n}=b_{n}/A_{nn}$\n2. Now for $kp$ then the leading $p$ rows of $R$ provide an upper triangular\nmatrix ($R_{1}$) and the remaining rows are 0. (I'm using $p$ because\nthe QR is generally applied to design matrices in regression). In this\ncase we really only need the first $p$ columns of $Q$, and we have\n$X=Q_{1}R_{1}$, the 'skinny' QR (this is what R's QR provides). For\nuniqueness, we can require the diagonals of $R$ to be nonnegative, and\nthen $R$ will be the same as the upper-triangular Cholesky factor of\n$X^{\\top}X$: \n\n$$\\begin{aligned} X^{\\top}X & = & R^{\\top}Q^{\\top}QR \\\\ & = & R^{\\top}R\\end{aligned}.$$ \n\n\nThere are three standard approaches for\ncomputing the QR, using (1) reflections (Householder transformations),\n(2) rotations (Givens transformations), or (3) Gram-Schmidt\northogonalization (see below for details).\n\nFor $n\\times n$ $X$, the QR (for the Householder approach) requires\n$2n^{3}/3$ flops, so QR is less efficient than LU or Cholesky.\n\nWe can also obtain the pseudo-inverse of $X$ from the QR:\n$X^{+}=[R_{1}^{-1}\\,0]Q^{\\top}$. In the case that $X$ is not full-rank,\nthere is a version of the QR that will work (involving pivoting) and we\nend up with some additional zeroes on the diagonal of $R_{1}$.\n\n### Regression and the QR\n\nOften QR is used to fit linear models, including in R. Consider the\nlinear model in the form $Y=X\\beta+\\epsilon$, finding\n$\\hat{\\beta}=(X^{\\top}X)^{-1}X^{\\top}Y$. Let's consider the skinny QR\nand note that $R^{\\top}$ is invertible. Therefore, we can express the\nnormal equations as \n\n$$\n\\begin{aligned} \nX^{\\top}X\\beta & = & X^{\\top} Y \\\\\nR^{\\top}Q^{\\top}QR\\beta & = & R^{\\top}Q^{\\top} Y \\\\\nR \\beta & = & Q^{\\top} Y\n\\end{aligned}\n$$ \n\n\nand solving for $\\beta$ is just a\nbacksolve since $R$ is upper-triangular. 
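\n\nAs a small illustration of that last step (a minimal sketch, not the code that R or `statsmodels` actually uses; the simulated `X`, `y`, and coefficient values here are hypothetical), we can compute $\\hat{\\beta}$ from the skinny QR with a single triangular backsolve:\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\nfrom scipy.linalg import solve_triangular\n\n## simulate a small regression problem (hypothetical values)\nrng = np.random.default_rng(1)\nn = 100\nX = np.column_stack((np.ones(n), rng.normal(size = (n, 2))))\ny = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size = n)\n\n## skinny QR: Q is n x p with orthonormal columns, R is p x p upper triangular\nQ, R = np.linalg.qr(X)\n## backsolve R beta = Q^T y; we never form X^T X or an explicit inverse\nbeta_hat = solve_triangular(R, Q.T @ y, lower=False)\n```\n:::\n\nUp to numerical error, `beta_hat` should match the usual least squares estimates (e.g., those from `sm.OLS` used earlier).\n\n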
Furthermore the standard\nregression quantities, such as the hat matrix, the SSE, the residuals,\netc. can be easily expressed in terms of $Q$ and $R$.\n\nWhy use the QR instead of the Cholesky on $X^{\\top}X$? The condition\nnumber of $X$ is the square root of that of $X^{\\top}X$, and the $QR$\nfactorizes $X$. Monahan has a discussion of the condition of the\nregression problem, but from a larger perspective, the situations where\nnumerical accuracy is a concern are generally cases where the OLS\nestimators are not particularly helpful anyway (e.g., highly collinear\npredictors).\n\nWhat about computational order of the different approaches to least\nsquares? The Cholesky is $np^{2}+\\frac{1}{3}p^{3}$, an algorithm called\nsweeping is $np^{2}+p^{3}$ , the Householder method for QR is\n$2np^{2}-\\frac{2}{3}p^{3}$, and the modified Gram-Schmidt approach for\nQR is $2np^{2}$. So if $n\\gg p$ then Cholesky (and sweeping) are faster\nthan the QR approaches. According to Monahan, modified Gram-Schmidt is\nmost numerically stable and sweeping least. In general, regression is\npretty quick unless $p$ is large since it is linear in $n$, so it may\nnot be worth worrying too much about computational differences of the\nsort noted here.\n\n### Regression and the QR in Python and R\n\nWe can get the Q and R matrices easily in Python.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nQ,R = np.linalg.qr(X)\n```\n:::\n\n\nOne of the methods used by the [`statsmodel` package in Python\nuses the QR to fit a regression](https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.fit.html).\n\nNote that by default in Python (and in R), you get the skinny QR, namely only the\nfirst $p$ rows of $R$ and the first $p$ columns of $Q$, where the latter\nform an orthonormal basis for the column space of $X$. The remaining\ncolumns form an orthonormal basis for the null space of $X$ (the space\northogonal to the column space of $X$). The analogy in regression is\nthat we get the basis vectors for the regression, while adding the\nremaining columns gives us the full $n$-dimensional space of the\nobservations.\n\nRegression in R uses the QR decomposition via `qr()`, which calls a\nFortran function. `qr()` (and the Fortran functions that are called) is\nspecifically designed to output quantities useful in fitting linear\nmodels. \n\nIn R, `qr()` returns the result as a list meant for use by other tools. R\nstores the $R$ matrix in the upper triangle of `\\$qr`, while the lower\ntriangle of *\\$qr* and *\\$aux* store the information for constructing\n$Q$ (this relates to the Householder-related vectors $u$ below). One can\nmultiply by $Q$ using *qr.qy()* and by $Q^{\\top}$ using *qr.qty()*. If\nyou want to extract $R$ and $Q$, the following will work:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nX.qr = qr(X)\nQ = qr.Q(X.qr)\nR = qr.R(X.qr) \n```\n:::\n\n\nAs a side note, there are QR-based functions that provide\nregression-related quantities, such as *qr.resid()*, *qr.fitted()* and\n*qr.coef()*. These functions (and their Fortran counterparts) exist\nbecause one can work through the various regression quantities of\ninterest and find their expressions in terms of $Q$ and $R$, with nice\nproperties resulting from $Q$ being orthogonal and $R$ triangular.\n\n### Computing the QR decomposition\n\nHere we'll see some of the details of the different approaches to the\nQR, in part because they involve some concepts that may be useful in\nother contexts. 
I won't expect you to see all of how this works, but\nplease skim through this to get an idea of how things are done.\n\nOne approach involves reflections of vectors and a second rotations of\nvectors. Reflections and rotations are transformations that are\nperformed by orthogonal matrices. The determinant of a reflection matrix\nis -1 and the determinant of a rotation matrix is 1. We'll see some of\nthe details in the demo code.\n\n#### QR Method 1: Reflections\n\nIf $u$ and $v$ are orthonormal vectors and $x$ is in the space spanned\nby $u$ and $v$, $x=c_{1}u+c_{2}v$, then $\\tilde{x}=-c_{1}u+c_{2}v$ is a\nreflection (a *Householder* reflection) along the $u$ dimension (since\nwe are using the negative of that basis vector). We can think of this as\nreflecting across the plane perpendicular to $u$. This extends simply to\nhigher dimensions with orthonormal vectors, $u,v_{1},v_{2},\\ldots$\n\nSuppose we want to formulate the reflection in terms of a \"Householder\"\nmatrix, $Q$. It turns out that $$Qx=\\tilde{x}$$ if $Q=I-2uu^{\\top}$. $Q$\nhas the following properties: (1) $Qu=-u$, (2) $Qv=v$ for $u^{\\top}v=0$,\n(3) $Q$ is orthogonal and symmetric.\n\nOne way to create the QR decomposition is by a series of Householder\ntransformations that create an upper triangular $R$ from $X$:\n\n$$\n\\begin{aligned}\nR & = & Q_{p}\\cdots Q_{1} X \\\\\nQ & = & (Q_{p}\\cdots Q_{1})^{\\top}\n\\end{aligned}\n$$ \n\n\nwhere we make use of\nthe symmetry in defining $Q$.\n\nBasically $Q_{1}$ reflects the first column of $X$ with respect to a\ncarefully chosen $u$, so that the result is all zeroes except for the\nfirst element. We want $Q_{1}x=\\tilde{x}=(||x||,0,\\ldots,0)$. This can\nbe achieved with $u=\\frac{x-\\tilde{x}}{||x-\\tilde{x}||}$. Then $Q_{2}$\nmakes the last $n-2$ rows of the second column equal to zero. We'll work\nthrough this a bit in class.\n\nIn the regression context, as we work through the individual\ntransformations, $Q_{j}=I-2u_{j}u_{j}^{\\top}$, we apply them to $X$ and\n$Y$ to create $R$ (note this would not involve doing the full matrix\nmultiplication - think about what calculations are actually needed) and\n$QY=Q^{\\top}Y$, and then solve $R\\beta=Q^{\\top}Y$. To find\n$\\mbox{Cov}(\\hat{\\beta})\\propto(X^{\\top}X)^{-1}=(R^{\\top}R)^{-1}=R^{-1}R^{-\\top}$\nwe do need to invert $R$, but it's upper-triangular and of dimension\n$p\\times p$. It turns out that $Q^{\\top}Y$ can be partitioned into the\nfirst $p$ and the last $n-p$ elements, $z^{(1)}$ and $z^{(2)}$. The SSR\nis $\\|z^{(1)}\\|^{2}$ and SSE is $\\|z^{(2)}\\|^{2}$.\n\nFinal side note: if $X$ is square (so $n=p)$ you might wonder why we\nneed $Q_{p}$ since after $p-1$ reflections, we don't need to zero\nanything else out (since the last column of $R$ has $n$ non-zero\nelements). It turns out that if we go back to thinking about a\nHouseholder reflection in general, there is a lack of uniqueness in\nchoosing $\\tilde{x}$. It could either be $(||x||,0,\\ldots,0)$ or\n$(-||x||,0,\\ldots,0)$. For better numerical stability, one chooses from\nthe two of those such that $x_{1}$ is of the opposite sign to\n$\\tilde{x}_{1}$, so that one avoids cancellation of numbers that may be\nof the same magnitude when doing $x-\\tilde{x}$. The transformation\n$Q_{p}$ is the last step of taking that approach of choosing the sign at\neach step. $Q_{p}$ doesn't zero anything out; it just basically just\ninvolves potentially setting $R_{pp}$ to be $-R_{pp}$. 
(To be honest,\nI'm not clear on why one would bother to do that last step, but that\nseems to be how it is presented in discussions of the Householder\napproach.) Of course in the case of $p2$, find interim vectors, $x_{k}^{(2)}$, by\n orthogonalizing with respect to $\\tilde{x}_{1}$\n\n3. Proceed for $k=3,\\ldots$, in turn orthogonalizing and normalizing\n the first of the remaining vectors w.r.t. $\\tilde{x}_{k-1}$ and\n orthogonalizing the remaining vectors w.r.t. $\\tilde{x}_{k-1}$ to\n get new interim vectors\n\nMathematically, we could instead orthogonalize $x_{2}$ w.r.t.\n$\\tilde{x}_{1}$, then orthogonalize $x_{3}$ w.r.t.\n$\\{\\tilde{x}_{1},\\tilde{x}_{2}\\}$, etc. The algorithm above is the\n*modified* G-S, and is known to be more numerically stable if the\ncolumns of $X$ are close to collinear, giving vectors that are closer to\northogonal. The resulting $\\tilde{x}$ vectors are the columns of $Q$.\nThe elements of $R$ are obtained as we proceed: the diagonal values are\nthe the normalization values in the denominators, while the\noff-diagonals are the inner products with the already-computed columns\nof $Q$ that are computed as part of the numerators.\n\nAnother way to think about this is that $R=Q^{\\top}X$, which is the same\nas regressing the columns of $X$ on $Q,$ since\n$(Q^{\\top}Q)^{-1}Q^{\\top}X=Q^{\\top}X$. By construction, the first column\nof $X$ is a scaling of the first column of $Q$, the second column of $X$\nis a linear combination of the first two columns of $Q$, etc., so $R$\nbeing upper triangular makes sense.\n\n### The \"tall-skinny\" QR\n\nSuppose you have a very large regression problem, with $n$ very large,\nand $n\\gg p$. There is a variant of the QR, called the tall-skinny QR\n(see for details) that allows us\nto find the decomposition in a parallel fashion. The basic idea is to do\na nested set of QR decompositions on blocks of rows of $X$:\n\n$$\nX = \\left( \\begin{array}{c}\nX_{0} \\\\\nX_{1} \\\\\nX_{2} \\\\\nX_{3}\n\\end{array}\n\\right) =\n\\left(\n\\begin{array}{c}\nQ_{0} R_{0} \\\\\nQ_{1} R_{1} \\\\\nQ_{2} R_{2} \\\\\nQ_{3} R_{3}\n\\end{array} \\right),\n$$\n\nfollowed by 'reduction' steps (this\ncan be done in a map-reduce context) that do the $QR$ of pairs of the\n$R$ factors: \n$$\\left(\\begin{array}{c}\nR_{0}\\\\\nR_{1}\\\\\nR_{2}\\\\\nR_{3}\n\\end{array}\\right)=\\left(\\begin{array}{c}\n\\left(\\begin{array}{c}\nR_{0}\\\\\nR_{1}\n\\end{array}\\right)\\\\\n\\left(\\begin{array}{c}\nR_{2}\\\\\nR_{3}\n\\end{array}\\right)\n\\end{array}\\right)=\\left(\\begin{array}{c}\nQ_{01}R_{01}\\\\\nQ_{23}R_{23}\n\\end{array}\\right)$$ and $$\\left(\\begin{array}{c}\nR_{01}\\\\\nR_{23}\n\\end{array}\\right)=Q_{0123}R_{0123}.$$ \n\nThe full decomposition is then\n\n$$X=\\left( \\begin{array}{cccc} Q_{0} & 0 & 0 & 0 \\\\ 0 & Q_{1} & 0 & 0 \\\\ 0 & 0 & Q_{2} & 0 \\\\ 0 & 0 & 0 & Q_{3} \\end{array} \\right) \\left( \\begin{array}{cc} Q_{01} & 0 \\\\ 0 & Q_{23} \\end{array} \\right) Q_{0123} R_{0123} = QR.$$\n\n\nThe computation can be done in\nparallel (in particular it can be done with map-reduce) and the $Q$\nmatrix for big problems would generally not be computed explicitly but\nwould be stored in its constituent pieces.\n\nAlternatively, there is a variant on the algorithm that processes the\nrow-blocks of $X$ serially, allowing you to do QR on a large tall-skinny\nmatrix that you can't fit in memory (or possibly even on disk). First\nyou do $QR$ on $X_{0}$ to get $Q_{0}R_{0}$. Then you stack $R_{0}$ on\ntop of $X_{1}$ and do QR to get $R_{01}$. 
Then stack $R_{01}$ on top of\n$X_{2}$ to get $R_{012}$, etc.\n\n## Determinants\n\nThe absolute value of the determinant of a square matrix can be found\nfrom the product of the diagonals of the triangular matrix in any\nfactorization that gives a triangular (including diagonal) matrix times\nan orthogonal matrix (or matrices) since the determinant of an\northogonal matrix is either one or minus one.\n\n$|A|=|QR|=|Q||R|=\\pm|R|$\n\n$|A^{\\top}A|=|(QR)^{\\top}QR|=|R^{\\top}R|=|R_{1}^{\\top}R_{1}|=|R_{1}|^{2}$\\\nIn Python, the following will do it (on the log scale).\n\n\n::: {.cell}\n\n```{.python .cell-code}\nQ, R = np.linalg.qr(A)\nmagn = np.sum(np.log(np.abs(np.diag(R))))\n```\n:::\n\n\nAn alternative is the product of the diagonal elements of $D$ (the\nsingular values) in the SVD factorization, $A=UDV^{\\top}$.\n\nFor non-negative definite matrices, we know the determinant is\nnon-negative, so the uncertainty about the sign is not an issue. For\npositive definite matrices, a good approach is to use the product of the\ndiagonal elements of the Cholesky decomposition.\n\nOne can also use the product of the eigenvalues:\n$|A|=|\\Gamma\\Lambda\\Gamma^{-1}|=|\\Gamma||\\Gamma^{-1}||\\Lambda|=|\\Lambda|$\n\n#### Computation\n\nComputing from any of these diagonal or triangular matrices as the\nproduct of the diagonals is prone to overflow and underflow, so we\n**always** work on the log scale as the sum of the logs of the values.\nWhen some of these may be negative, we can always keep track of the\nnumber of negative values and take the log of the absolute values.\n\nOften we will have the factorization as a result of other parts of the\ncomputation, so we get the determinant for free.\n\nWe can use `np.linalg.slogdet()` or (definitely not recommended)\n`np.linalg.det()` to calculate the determinant in Python.\nThese functions use the LU decomposition.\n\n\n# 5. Eigendecomposition and SVD\n\n## Eigendecomposition\n\nThe eigendecomposition (spectral decomposition) is useful in considering\nconvergence of algorithms and of course for statistical decompositions\nsuch as PCA. We think of decomposing the components of variation into\northogonal patterns (the eigenvectors) with variances (eigenvalues)\nassociated with each pattern.\n\nSquare symmetric matrices have real eigenvectors and eigenvalues, with\nthe factorization into orthogonal $\\Gamma$ and diagonal $\\Lambda$,\n$A=\\Gamma\\Lambda\\Gamma^{\\top}$, where the eigenvalues on the diagonal of\n$\\Lambda$ are ordered in decreasing value. Of course this is equivalent\nto the definition of an eigenvalue/eigenvector pair as a pair such that\n$Ax=\\lambda x$ where $x$ is the eigenvector and $\\lambda$ is a scalar,\nthe eigenvalue. Using the eigendecomposition, the inverse of $A$ is simply\n$\\Gamma\\Lambda^{-1}\\Gamma^{\\top}$. On a similar note, we can create a\nsquare root matrix, $\\Gamma\\Lambda^{1/2}$, by taking the square roots of\nthe eigenvalues.\n\nThe spectral radius of $A$, denoted $\\rho(A)$, is the maximum of the\nabsolute values of the eigenvalues. As we saw when talking about\nill-conditionedness, for symmetric matrices, this maximum is the induced\nnorm, so we have $\\rho(A)=\\|A\\|_{2}$. It turns out that\n$\\rho(A)\\leq\\|A\\|$ for any induced matrix norm. The spectral radius\ncomes up in determining the rate of convergence of some iterative\nalgorithms.\n\n#### Computation\n\nThere are several methods for eigenvalues; a common one for doing the\nfull eigendecomposition is the *QR algorithm*. 
The first step is to\nreduce $A$ to upper Hessenburg form, which is an upper triangular matrix\nexcept that the first subdiagonal in the lower triangular part can be\nnon-zero. For symmetric matrices, the result is actually tridiagonal. We\ncan do the reduction using Householder reflections or Givens rotations.\nAt this point the QR decomposition (using Givens rotations) is applied\niteratively (to a version of the matrix in which the diagonals are\nshifted), and the result converges to a diagonal matrix, which provides\nthe eigenvalues. It's more work to get the eigenvectors, but they are\nobtained as a product of Householder matrices (required for the initial\nreduction) multiplied by the product of the $Q$ matrices from the\nsuccessive QR decompositions.\n\nWe won't go into the algorithm in detail, but note that it involves\nmanipulations and ideas we've seen already.\n\nIf only the largest (or the first few largest) eigenvalues and their\neigenvectors are needed, which can come up in time series and Markov\nchain contexts, the problem is easier and can be solved by the *power\nmethod*. E.g., in a Markov chain context, steady state is reached\nthrough $x_{t}=A^{t}x_{0}$. One can find the largest eigenvector by\nmultiplying by $A$ many times, normalizing at each step.\n$v^{(k)}=Az^{(k-1)}$ and $z^{(k)}=v^{(k)}/\\|v^{(k)}\\|$. There is an\nextension to find the $p$ largest eigenvalues and their vectors. See the\ndemo code in the qmd source file for an implementation (in R).\n\n\n\n\n\n## Singular value decomposition\n\nLet's consider an $n\\times m$ matrix, $A$, with $n\\geq m$ (if $m>n$, we\ncan always work with $A^{\\top})$. This often is a matrix representing\n$m$ features of $n$ observations. We could have $n$ documents and $m$\nwords, or $n$ gene expression levels and $m$ experimental conditions,\netc. $A$ can always be decomposed as $$A=UDV^{\\top}$$ where $U$ and $V$\nare matrices with orthonormal columns (left and right eigenvectors) and\n$D$ is diagonal with non-negative values (which correspond to\neigenvalues in the case of square $A$ and to squared eigenvalues of\n$A^{\\top}A$).\n\nThe SVD can be represented in more than one way. One representation is\n$$A_{n\\times m}=U_{n\\times k}D_{k\\times k}V_{k\\times m}^{\\top}=\\sum_{j=1}^{k}D_{jj}u_{j}v_{j}^{\\top}$$\nwhere $u_{j}$ and $v_{j}$ are the columns of $U$ and $V$ and where $k$\nis the rank of $A$ (which is at most the minimum of $n$ and $m$ of\ncourse). The diagonal elements of $D$ are the singular values.\n\nThat representation is as the sum of rank-one matrices (since\neach term is the scaled outer product of two vectors). \n\nIf $A$ is positive semi-definite, the eigendecomposition is an SVD.\nFurthermore, $A^{\\top}A=VD^{2}V^{\\top}$ and $AA^{\\top}=UD^{2}U^{\\top}$,\nso we can find the eigendecomposition of such matrices using the SVD of\n$A$ (for $AA^{\\top}$ we need to fill out $U$ to have $n$ columns). 
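\n\nAs a quick numerical check of this connection (a minimal sketch; the small random matrix is arbitrary and just for illustration), we can compare the SVD of $A$ with an eigendecomposition of $A^{\\top}A$ computed directly:\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\nrng = np.random.default_rng(0)\nA = rng.normal(size = (6, 4))    # n > m\n\nU, d, Vt = np.linalg.svd(A, full_matrices = False)\nevals, evecs = np.linalg.eigh(A.T @ A)   # eigh is for symmetric matrices\n\n## eigenvalues of A^T A (reordered to be decreasing) equal the squared singular values\nnp.allclose(np.sort(evals)[::-1], d ** 2)\n## and the corresponding eigenvectors match the right singular vectors (columns of V) up to sign\n```\n:::\n\n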
Note\nthat the squares of the singular values of $A$ are the eigenvalues of\n$A^{\\top}A$ and $AA^{\\top}$.\n\nWe can also fill out the matrices to get\n$$A=U_{n\\times n}D_{n\\times m}V_{m\\times m}^{\\top}$$ where the added\nrows and columns of $D$ are zero with the upper left block the\n$D_{k\\times k}$ from above.\n\n#### Uses\n\nThe SVD is an excellent way to determine a matrix rank and to construct\na pseudo-inverse ($A^{+}=VD^{+}U^{\\top})$.\n\nWe can use the SVD to approximate $A$ by taking\n$A\\approx\\tilde{A}=\\sum_{j=1}^{p}D_{jj}u_{j}v_{j}^{\\top}$ for $pj+p$ and the upper bandwidth is $q$ ($A_{ij}=0$\nfor $j>i+q$). An alternative to reducing to $Ux=b^{*}$ is to compute\n$A=LU$ and then do two solutions, $U^{-1}(L^{-1}b)$. One can show that\nthe computational complexity of the LU factorization is $O(npq)$ for\nbanded matrices, while solving the two triangular systems is $O(np+nq)$,\nso for small $p$ and $q$, the speedup can be dramatic.\n\nBanded matrices come up in time series analysis. E.g., moving average (MA) models produce\nbanded covariance structures because the covariance is zero after a\ncertain number of lags.\n\n## Low rank updates (optional)\n\nA transformation of the form $A-uv^{\\top}$ is a rank-one update because\n$uv^{\\top}$ is of rank one.\n\nMore generally a low rank update of $A$ is $\\tilde{A}=A-UV^{\\top}$ where\n$U$ and $V$ are $n\\times m$ with $n\\geq m$. The\nSherman-Morrison-Woodbury formula tells us that\n$$\\tilde{A}^{-1}=A^{-1}+A^{-1}U(I_{m}-V^{\\top}A^{-1}U)^{-1}V^{\\top}A^{-1}$$\nso if we know $x_{0}=A^{-1}b$, then the solution to $\\tilde{A}x=b$ is\n$x+A^{-1}U(I_{m}-V^{\\top}A^{-1}U)^{-1}V^{\\top}x$. Provided $m$ is not\ntoo large, and particularly if we already have a factorization of $A$,\nthen $A^{-1}U$ is not too bad computationally, and\n$I_{m}-V^{\\top}A^{-1}U$ is $m\\times m$. As a result\n$A^{-1}(U(\\cdots)^{-1}V^{\\top}x)$ isn't too bad.\n\nThis also comes up in working with precision matrices in Bayesian\nproblems where we may have $A^{-1}$ but not $A$ (we often add precision\nmatrices to find conditional normal distributions). An alternative\nexpression for the formula is $\\tilde{A}=A+UCV^{\\top}$, and the identity\ntells us\n$$\\tilde{A}^{-1}=A^{-1}-A^{-1}U(C^{-1}+V^{\\top}A^{-1}U)^{-1}V^{\\top}A^{-1}$$\n\nBasically Sherman-Morrison-Woodbury gives us matrix identities that we\ncan use in combination with our knowledge of smart ways of solving\nsystems of equations.\n\n# 7. Iterative solutions of linear systems (optional)\n\n#### Gauss-Seidel\n\nSuppose we want to iteratively solve $Ax=b$. Here's the algorithm, which\nsequentially updates each element of $x$ in turn.\n\n- Start with an initial approximation, $x^{(0)}$.\n- Hold all but $x_{1}^{(0)}$ constant and solve to find\n $x_{1}^{(1)}=\\frac{1}{a_{11}}(b_{1}-\\sum_{j=2}^{n}a_{1j}x_{j}^{(0)})$.\n- Repeat for the other rows of $A$ (i.e., the other elements of $x$),\n finding $x^{(1)}$.\n- Now iterate to get $x^{(2)}$, $x^{(3)}$, etc. until a convergence\n criterion is achieved, such as $\\|x^{(k)}-x^{(k-1)}\\|\\leq\\epsilon$\n or $\\|r^{(k)}-r^{(k-1)}\\|\\leq\\epsilon$ for $r^{(k)}=b-Ax^{(k)}$.\n\nLet's consider how many operations are involved in a single update:\n$O(n)$ for each element, so $O(n^{2})$ for each update. 
Thus if we can\nstop well before $n$ iterations, we've saved computation relative to\nexact methods.\n\nIf we decompose $A=L+D+U$ where $L$ is strictly lower triangular, $U$ is\nstrictly upper triangular, then Gauss-Seidel is equivalent to solving\n$$(L+D)x^{(k+1)}=b-Ux^{(k)}$$ and we know that solving the lower\ntriangular system is $O(n^{2})$.\n\nIt turns out that the rate of convergence depends on the spectral radius\nof $(L+D)^{-1}U$.\n\nGauss-Seidel amounts to optimizing by moving in axis-oriented\ndirections, so it can be slow in some cases.\n\n#### Conjugate gradient\n\nFor positive definite $A$, conjugate gradient (CG) reexpresses the\nsolution to $Ax=b$ as an optimization problem, minimizing\n$$f(x)=\\frac{1}{2}x^{\\top}Ax-x^{\\top}b,$$ since the derivative of $f(x)$\nis $Ax-b$ and at the minimum this gives $Ax-b=0$.\n\nInstead of finding the minimum by following the gradient at each step\n(so-called steepest descent, which can give slow convergence - we'll see\na demonstration of this in the optimization unit), CG chooses directions\nthat are mutually conjugate w.r.t. $A$, $d_{i}^{\\top}Ad_{j}=0$ for\n$i\\ne j$. The method successively chooses vectors giving the direction,\n$d_{k}$, in which to move down towards the minimum and a scaling of how\nmuch to move, $\\alpha_{k}$. If we start at $x_{(0)}$, the $k$th point we\nmove to is $x_{(k)}=x_{(k-1)}+\\alpha_{k}d_{k}$ so we have\n$$x_{(k)}=x_{(0)}+\\sum_{j\\leq k}\\alpha_{j}d_{j}$$ and we use a\nconvergence criterion such as given above for Gauss-Seidel. The\ndirections are chosen to be the residuals, $b-Ax_{(k)}$. Here's the\nbasic algorithm:\n\n- Choose $x_{(0)}$ and define the residual, $r_{(0)}=b-Ax_{(0)}$ (the\n error on the scale of $b$) and the direction, $d_{0}=r_{(0)}$ and\n set $k=0$.\n\n- Then iterate:\n\n - $\\alpha_{k}=\\frac{r_{(k)}^{\\top}r_{(k)}}{d_{k}^{\\top}Ad_{k}}$\n (choose step size so next error will be orthogonal to current\n direction - which we can express in terms of the residual, which\n is easily computable)\n\n - $x_{(k+1)}=x_{(k)}+\\alpha_{k}d_{k}$ (update current value)\n\n - $r_{(k+1)}=r_{(k)}-\\alpha_{k}Ad_{k}$ (update current residual)\n\n - $d_{k+1}=r_{(k+1)}+\\frac{r_{(k+1)}^{\\top}r_{(k+1)}}{r_{(k)}^{\\top}r_{(k)}}d_{k}$\n (choose next direction by conjugate Gram-Schmidt, starting with\n $r_{(k+1)}$ and removing components that are not $A$-orthogonal\n to previous directions, but it turns out that $r_{(k+1)}$ is\n already $A$-orthogonal to all but $d_{k}$).\n\n- Stop when $\\|r^{(k+1)}\\|$ is sufficiently small.\n\nThe convergence of the algorithm depends in a complicated way on the\neigenvalues, but in general convergence is faster when the condition\nnumber is smaller (the eigenvalues are not too spread out). 
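\n\nHere is a bare-bones implementation of the iteration just described (a sketch for illustration only; the test matrix and right-hand side are simulated, and a dense matrix is used even though CG is typically applied to large sparse problems):\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\ndef cg(A, b, tol = 1e-10, max_iter = None):\n    ## conjugate gradient for symmetric positive definite A, following the steps above\n    n = len(b)\n    max_iter = n if max_iter is None else max_iter\n    x = np.zeros(n)\n    r = b - A @ x             # initial residual\n    d = r.copy()              # initial direction\n    for k in range(max_iter):\n        Ad = A @ d\n        rs_old = r @ r\n        alpha = rs_old / (d @ Ad)     # step size\n        x = x + alpha * d\n        r = r - alpha * Ad            # updated residual\n        if np.sqrt(r @ r) < tol:\n            break\n        beta = (r @ r) / rs_old\n        d = r + beta * d              # next direction, conjugate to the previous ones\n    return x\n\nrng = np.random.default_rng(3)\nM = rng.normal(size = (200, 200))\nA = M.T @ M + 200 * np.eye(200)    # ensure positive definiteness\nb = rng.normal(size = 200)\nx = cg(A, b)\nnp.allclose(A @ x, b)              # should be True\n```\n:::\n\nFor genuinely large problems, one would store $A$ in a sparse format and typically rely on a library implementation such as `scipy.sparse.linalg.cg`, which only needs a way to compute the product $Av$.\n\n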
CG will in\nprinciple give the exact answer in $n$ steps (where $A$ is $n\\times n$).\nHowever, computationally we lose accuracy, and the real interest in the\nalgorithm is as an iterative approximation where we stop well before $n$\nsteps. The approach basically amounts to moving in axis-oriented\ndirections in a space stretched by $A$.\n\nIn general, CG is used for large sparse systems.\n\nSee the [extensive description from\nShewchuk](http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf)\nfor more details, as well as the use\nof CG when $A$ is not positive definite.\n\n#### Updating a solution\n\nSometimes we have solved a system, $Ax=b$, and then need to solve $Ax=c$.\nIf we have solved the initial system using a factorization, we can reuse\nthat factorization and solve the new system in $O(n^{2})$. Iterative\napproaches can do a nice job if $c=b+\\delta b$. Start with the solution\n$x$ for $Ax=b$ as $x^{(0)}$ and use one of the methods above.\n",
    "supporting": [],
    "filters": [
      "rmarkdown/pagebreak.lua"
diff --git a/units/unit10-linalg.qmd b/units/unit10-linalg.qmd
index c4dea77..17e9369 100644
--- a/units/unit10-linalg.qmd
+++ b/units/unit10-linalg.qmd
@@ -1009,14 +1009,14 @@ diagonals to be positive). One algorithm for computing $U$ is:
 We can then solve a system of equations as: $U^{-1}(U^{\top-1}b)$.
 
-Confusingly, while numpy's `cholesky` gives $L = U^\top$, scipy
-can return either $L$ or $U$ but defaults to $U$.
+Confusingly, while numpy's `cholesky` gives $L = U^\top$, scipy's
+`cholesky` can return either $L$ or $U$ but defaults to $U$.
 
 Here are two ways we can use the Cholesky to solve a system of equations:
 
 ```{python, eval=FALSE}
 U = sp.linalg.cholesky(A)
-sp.linalg.solve_triangular(L.T,
+sp.linalg.solve_triangular(U,
                            sp.linalg.solve_triangular(U, b, lower=False, trans='T'),
                            lower=False)