From e71d847e6f3f2ae9e38bbb2b7b355675ebcf6f71 Mon Sep 17 00:00:00 2001 From: Christopher Paciorek Date: Thu, 2 Nov 2023 10:45:22 -0700 Subject: [PATCH] a bit more edits to unit 10 --- _freeze/units/unit10-linalg/execute-results/html.json | 4 ++-- _freeze/units/unit10-linalg/execute-results/tex.json | 4 ++-- units/unit10-linalg.qmd | 6 +++--- 3 files changed, 7 insertions(+), 7 deletions(-) diff --git a/_freeze/units/unit10-linalg/execute-results/html.json b/_freeze/units/unit10-linalg/execute-results/html.json index 69590e4..b040609 100644 --- a/_freeze/units/unit10-linalg/execute-results/html.json +++ b/_freeze/units/unit10-linalg/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "04df3b9264ae4187721edd4da62e08c5", + "hash": "d1d0a6f17e0930b47afaca760a06ac18", "result": { - "markdown": "---\ntitle: \"Numerical linear algebra\"\nauthor: \"Chris Paciorek\"\ndate: \"2023-10-24\"\nformat:\n pdf:\n documentclass: article\n margin-left: 30mm\n margin-right: 30mm\n toc: true\n html:\n theme: cosmo\n css: ../styles.css\n toc: true\n code-copy: true\n code-block-background: true\nexecute:\n freeze: auto\nfrom: markdown+tex_math_single_backslash\n---\n\n\n\n\n[PDF](./unit10-linalg.pdf){.btn .btn-primary}\n\nReferences:\n\n- Gentle: Numerical Linear Algebra for Applications in Statistics\n (available via UC Library Search) (my notes here are based primarily\n on this source) [Gentle-NLA]\n - Gentle: Matrix Algebra also has much of this material.\n- Gentle: Computational Statistics [Gentle-CS]\n- Lange: Numerical Analysis for Statisticians\n- Monahan: Numerical Methods of Statistics\n\nVideos (optional): \n\nThere are various videos from 2020 in the bCourses Media Gallery that you\ncan use for reference if you want to. \n\n - Video 1. Ill-conditioned problems, part 1\n - Video 2. Ill-conditioned problems, part 2\n - Video 3. Triangular systems of equations\n - Video 4. Solving systems of equations via LU, part 1\n - Video 5. Solving systems of equations via LU, part 2\n - Video 6. Solving systems of equations via LU, part 3\n - Video 7. Cholesky decomposition\n\n\nIn working through how to compute something or understanding an\nalgorithm, it can be very helpful to depict the matrices and vectors\ngraphically. We'll see this on the board in class.\n\n# 1. Preliminaries\n\n## Context\n\nMany statistical and machine learning methods involve linear algebra of\nsome sort - at the very least matrix multiplication and very often some\nsort of matrix decomposition to fit models and do analysis: linear\nregression, various more sophisticated forms of regression, deep neural\nnetworks, principle components analysis (PCA) and the wide varieties of\ngeneralizations and variations on PCA, etc., etc.\n\n## Goals\n\nHere's what I'd like you to get out of this unit:\n\n1. How to think about the computational order (number of computations\n involved) of a problem\n2. How to choose a computational approach to a given linear algebra\n calculation you need to do.\n3. An understanding of how issues with computer numbers (Unit 8) affect\n linear algebra calculations.\n\n## Key principle\n\n**The form of a mathematical expression and how it should be evaluated\non a computer may be very different.** Better computational approaches\ncan increase speed and improve the numerical properties of the\ncalculation.\n\n- Example 1 (already seen in Unit 5): If $X$ and $Y$ are matrices and $z$\nis a vector, we should compute $X(Yz)$ rather than $(XY)z$; the former\nis much more computationally efficient. 
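  As a quick, machine-dependent check of this claim (a small sketch added here, not part of the original example):

  ```python
  import numpy as np
  import time

  n = 3000
  X = np.random.normal(size=(n, n))
  Y = np.random.normal(size=(n, n))
  z = np.random.normal(size=n)

  t0 = time.time(); r1 = (X @ Y) @ z; t_xy = time.time() - t0   # forms the n x n product XY first: O(n^3)
  t0 = time.time(); r2 = X @ (Y @ z); t_yz = time.time() - t0   # two matrix-vector products: O(n^2)
  print(t_xy, t_yz, np.allclose(r1, r2))                        # same answer, very different cost
  ```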
\n- Example 2: We do not compute $(X^{\\top}X)^{-1}X^{\\top}Y$ by computing\n$X^{\\top}X$ and finding its inverse. In fact, perhaps more surprisingly,\nwe may never actually form $X^{\\top}X$ in some implementations.\n- Example 3: Suppose I have a matrix $A$, and I want to permute (switch)\ntwo rows. I can do this with a permutation matrix, $P$, which is mostly\nzeroes. On a computer, in general I wouldn't need to even change the\nvalues of $A$ in memory in some cases (e.g., if I were to calculate\n$PAB$). Why not?\n\n## Computational complexity\n\nWe can assess the computational complexity of a linear algebra\ncalculation by counting the number multiplys/divides and the number of\nadds/subtracts. Sidenote: addition is a bit faster than multiplication,\nso some algorithms attempt to trade multiplication for addition.\n\nIn general we do not try to count the actual number of calculations, but\njust their order, though in some cases in this unit we'll actually get a\nmore exact count. In general, we denote this as $O(f(n))$ which means\nthat the number of calculations approaches $cf(n)$ as $n\\to\\infty$\n(i.e., we know the calculation is approximately proportional to $f(n)$).\nConsider matrix multiplication, $AB$, with matrices of size $a\\times b$\nand $b\\times c$. Each column of the second matrix is multiplied by all\nthe rows of the first. For any given inner product of a row by a column,\nwe have $b$ multiplies. We repeat these operations for each column and\nthen for each row, so we have $abc$ multiplies so $O(abc)$ operations.\nWe could count the additions as well, but there's usually an addition\nfor each multiply, so we can usually just count the multiplys and then\nsay there are such and such {multiply and add}s. This is Monahan's\napproach, but you may see other counting approaches where one counts the\nmultiplys and the adds separately.\n\nFor two symmetric, $n\\times n$ matrices, this is $O(n^{3})$. Similarly,\nmatrix factorization (e.g., the Cholesky decomposition) is $O(n^{3})$\nunless the matrix has special structure, such as being sparse. As\nmatrices get large, the speed of calculations decreases drastically\nbecause of the scaling as $n^{3}$ and memory use increases drastically.\nIn terms of memory use, to hold the result of the multiply indicated\nabove, we need to hold $ab+bc+ac$ total elements, which for symmetric\nmatrices sums to $3n^{2}$. So for a matrix with $n=10000$, we have\n$3\\cdot10000^{2}\\cdot8/1e9=2.4$Gb.\n\nWhen we have $O(n^{q})$ this is known as polynomial time. Much worse is\n$O(b^{n})$ (exponential time), while much better is $O(\\log n$) (log\ntime). Computer scientists talk about NP-complete problems; these are\nessentially problems for which there is not a polynomial time\nalgorithm - it turns out all such problems can be rewritten such that\nthey are equivalent to one another.\n\nIn real calculations, it's possible to have the actual time ordering of\ntwo approaches differ from what the order approximations tell us. For\nexample, something that involves $n^{2}$ operations may be faster than\none that involves $1000(n\\log n+n)$ even though the former is $O(n^{2})$\nand the latter $O(n\\log n)$. 
The problem is that the constant, $c=1000$,\ncan matter (depending on how big $n$ is), as can the extra calculations\nfrom the lower order term(s), in this case $1000n$.\n\nA note on terminology: *flops* stands for both floating point operations\n(the number of operations required) and floating point operations per\nsecond, the speed of calculation.\n\n## Notation and dimensions\n\nI'll try to use capital letters for matrices, $A$, and lower-case for\nvectors, $x$. Then $x_{i}$ is the ith element of $x$, $A_{ij}$ is the\n$i$th row, $j$th column element, and $A_{\\cdot j}$ is the $j$th column\nand $A_{i\\cdot}$ the $i$th row. By default, we'll consider a vector,\n$x$, to be a one-column matrix, and $x^{\\top}$ to be a one-row matrix.\nSome of the references given at the start of this Unit also use $a_{ij}$ for $A_{ij}$ and\n$a_{j}$ for the $j$th column.\n\nThroughout, we'll need to be careful that the matrices involved in an\noperation are conformable: for $A+B$ both matrices need to be of the\nsame dimension, while for $AB$ the number of columns of $A$ must match\nthe number of rows of $B$. Note that this allows for $B$ to be a column\nvector, with only one column, $Ab$. Just checking dimensions is a good\nway to catch many errors. Example: is\n$\\mbox{Cov}(Ax)=A\\mbox{Cov}(x)A^{\\top}$ or\n$\\mbox{Cov}(Ax)=A^{\\top}\\mbox{Cov}(x)A$? Well, if $A$ is $m\\times n$, it\nmust be the former, as the latter is not conformable.\n\nThe **inner product** of two vectors is\n$\\sum_{i}x_{i}y_{i}=x^{\\top}y\\equiv\\langle x,y\\rangle\\equiv x\\cdot y$.\n\nThe **outer product** is $xy^{\\top}$, which comes from all pairwise\nproducts of the elements.\n\nWhen the indices of summation should be obvious, I'll sometimes leave\nthem implicit. Ask me if it's not clear.\n\n## Norms\n\nFor a vector, $\\|x\\|_{p}=(\\sum_{i}|x_{i}|^{p})^{1/p}$ and the standard (Euclidean)\nnorm is $\\|x\\|_{2}=\\sqrt{\\sum x_{i}^{2}}=\\sqrt{x^{\\top}x}$, just the\nlength of the vector in Euclidean space, which we'll refer to as\n$\\|x\\|$, unless noted otherwise.\n\nOne commonly used norm for a matrix is\nthe Frobenius norm, $\\|A\\|_{F}=(\\sum_{i,j}a_{ij}^{2})^{1/2}$.\n\nIn this Unit, we'll often make use of the **induced matrix norm**, which is\ndefined relative to a corresponding vector norm, $\\|\\cdot\\|$, as:\n$$\\|A\\|=\\sup_{x\\ne0}\\frac{\\|Ax\\|}{\\|x\\|}$$ So we have\n$$\\|A\\|_{2}=\\sup_{x\\ne0}\\frac{\\|Ax\\|_{2}}{\\|x\\|_{2}}=\\sup_{\\|x\\|_{2}=1}\\|Ax\\|_{2}$$\nIf you're not familiar with the supremum (\"sup\" above), you can just\nthink of it as taking the maximum. In the case of the 2-norm, the norm turns\nout to be the largest singular value in the singular value decomposition (SVD)\nof the matrix.\n\nWe can interpret the norm of a matrix as the most that the matrix\ncan stretch a vector when multiplying by the vector (relative to the\nlength of the vector).\n\nA property of any legitimate matrix norm (including the induced norm) is\nthat $\\|AB\\|\\leq\\|A\\|\\|B\\|$. Also recall that norms must obey the triangle\ninequality, $\\|A+B\\|\\leq\\|A\\|+\\|B\\|$.\n\nA normalized vector is one with \"length\", i.e., Euclidean norm, of one.\nWe can easily normalize a vector: $\\tilde{x}=x/\\|x\\|$\n\nThe angle between two vectors is\n$$\\theta=\\cos^{-1}\\left(\\frac{\\langle x,y\\rangle}{\\sqrt{\\langle x,x\\rangle\\langle y,y\\rangle}}\\right)$$\n\n## Orthogonality\n\nTwo vectors are orthogonal if $x^{\\top}y=0$, in which case we say\n$x\\perp y$. 
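To make these definitions concrete, here is a small NumPy sketch (added here for illustration) that computes a norm, normalizes a vector, finds the angle between two vectors, and checks orthogonality:

```python
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([-4.0, 3.0])

norm_x = np.sqrt(np.sum(x**2))      # same as np.linalg.norm(x); equals 5 here
x_tilde = x / norm_x                # normalized (unit-length) version of x

cos_theta = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
theta = np.arccos(cos_theta)        # angle between x and y; pi/2 here

print(norm_x, np.linalg.norm(x_tilde), theta, np.dot(x, y))   # 5.0 1.0 1.5707... 0.0 (orthogonal)
```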
An **orthogonal matrix** is a square matrix in which all of the\ncolumns are orthogonal to each other and normalized. The same holds for\nthe rows. Orthogonal matrices\ncan be shown to have full rank. Furthermore if $A$ is orthogonal,\n$A^{\\top}A=I$, so $A^{-1}=A^{\\top}$. Given all this, the determinant of\northogonal $A$ is either 1 or -1. Finally the product of two orthogonal\nmatrices, $A$ and $B$, is also orthogonal since\n$(AB)^{\\top}AB=B^{\\top}A^{\\top}AB=B^{\\top}B=I$.\n\n#### Permutations\n\nSometimes we make use of matrices that permute two rows (or two columns)\nof another matrix when multiplied. Such a matrix is known as an\nelementary permutation matrix and is an orthogonal matrix with a\ndeterminant of -1. You can multiply such matrices to get more general\npermutation matrices that are also orthogonal. If you premultiply by\n$P$, you permute rows, and if you postmultiply by $P$ you permute\ncolumns. Note that on a computer, you wouldn't need to actually do the\nmultiply (and if you did, you should use a sparse matrix routine), but\nrather one can often just rework index values that indicate where\nrelevant pieces of the matrix are stored (more in the next section).\n\n## Some vector and matrix properties\n\n$AB\\ne BA$ but $A+B=B+A$ and $A(BC)=(AB)C$.\n\nIn Python, recall the syntax is\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nA + B\n\n# Matrix multiplication\nnp.matmul(A, B) \nA @ B # alternative\nA.dot(B) # not recommended by the NumPy docs\n\nA * B # Hadamard (direct) product\n```\n:::\n\n\n\nYou don't need the spaces, but they're nice for code readability.\n\n## Trace and determinant of square matrices\n\nThe trace of a matrix is the sum of the diagonal elements. For square\nmatrices, $\\mbox{tr}(A+B)=\\mbox{tr}(A)+\\mbox{tr}(B)$,\n$\\mbox{tr}(A)=\\mbox{tr}(A^{\\top})$.\n\nWe also have $\\mbox{tr}(ABC)=\\mbox{tr}(CAB)=\\mbox{tr}(BCA)$ - basically\nyou can move a matrix from the beginning to the end or end to beginning,\nprovided they are conformable for this operation. This is helpful for a\ncouple reasons:\n\n1. We can find the ordering that reduces computation the most if the\n individual matrices are not square.\n2. $x^{\\top}Ax=\\mbox{tr}(x^{\\top}Ax)$ since the quadratic form,\n $x^{\\top}Ax$, is a scalar, and this is equal to\n $\\mbox{tr}(xx^{\\top}A)$ where $xx^{\\top}A$ is a matrix. It can be\n helpful to be able to go back and forth between a scalar and a trace\n in some statistical calculations.\n\nFor square matrices, the determinant exists and we have $|AB|=|A||B|$\nand therefore, $|A^{-1}|=1/|A|$ since $|I|=|AA^{-1}|=1$. Also\n$|A|=|A^{\\top}|$, which can be seen using the QR decomposition for $A$\nand understanding properties of determinants of triangular matrices (in\nthis case $R$) and orthogonal matrices (in this case $Q$).\n\n## Transposes and inverses\n\nFor square, invertible matrices, we have that\n$(A^{-1})^{\\top}=(A^{\\top})^{-1}$. Why? Since we have\n$(AB)^{\\top}=B^{\\top}A^{\\top}$, we have:\n$$A^{\\top}(A^{-1})^{\\top}=(A^{-1}A)^{\\top}=I$$ so\n$(A^{\\top})^{-1}=(A^{-1})^{\\top}$.\n\nFor two invertible matrices, we have that\n$(AB)^{-1} = B^{-1}A^{-1}$ since\n$B^{-1}A^{-1} AB = I$.\n\n#### Other matrix multiplications\n\nThe Hadamard or direct product is simply multiplication of the\ncorrespoding elements of two matrices by each other. 
In R this is\nsimply` A * B`.\\\n**Challenge**: How can I find $\\mbox{tr}(AB)$ without using `A %*% B` ?\n\nThe Kronecker product is the product of each element of one matrix with\nthe entire other matrix\"\n\n$$A\\otimes B=\\left(\\begin{array}{ccc}\nA_{11}B & \\cdots & A_{1m}B\\\\\n\\vdots & \\ddots & \\vdots\\\\\nA_{n1}B & \\cdots & A_{nm}B\n\\end{array}\\right)$$\n\nThe inverse of a Kronecker product is the Kronecker product of the\ninverses,\n\n$$ B^{-1} \\otimes A^{-1} $$ \n\nwhich is obviously quite a bit faster because\nthe inverse (i.e., solving a system of equations) in this special case\nis $O(n^{3}+m^{3})$ rather than the naive approach being $O((nm)^{3})$.\n\n## Matrix decompositions\n\nA matrix decomposition is a re-expression of a matrix, $A$, in terms of\na product of two or three other, simpler matrices, where the\ndecomposition reveals structure or relationships present in the original\nmatrix, $A$. The \"simpler\" matrices may be simpler in various ways,\nincluding\n\n- having fewer rows or columns;\n- being diagonal, triangular or sparse in some way,\n- being orthogonal matrices.\n\nIn addition, once you have a decomposition, computation is generally\neasier, because of the special structure of the simpler matrices.\n\nWe'll see this in great detail in Section 3.\n\n# 2. Statistical interpretations of matrix invertibility, rank, etc.\n\n## Linear independence, rank, and basis vectors\n\nA set of vectors, $v_{1},\\ldots v_{n}$, is linearly independent (LIN)\nwhen none of the vectors can be represented as a linear combination,\n$\\sum c_{i}v_{i}$, of the others for scalars, $c_{1},\\ldots,c_{n}$. If\nwe have vectors of length $n$, we can have at most $n$ linearly\nindependent vectors. The rank of a matrix is the number of linearly\nindependent rows (or columns - it's the same), and is at most the\nminimum of the number of rows and number of columns. We'll generally\nthink about it in terms of the dimension of the column space - so we can\njust think about the number of linearly independent columns.\n\nAny set of linearly independent vectors (say $v_{1},\\ldots,v_{n}$) span\na space made up of all linear combinations of those vectors\n($\\sum_{i=1}^{n}c_{i}v_{i}$). The spanning vectors are known as basis\nvectors. We can express a vector $y$ that is in the space with respect\nto (as a linear combination of) basis vectors as $y=\\sum_{i}c_{i}v_{i}$,\nwhere if the basis vectors are normalized and orthogonal, we can find\nthe weights as $c_{i}=\\langle y,v_{i}\\rangle$.\n\nConsider a regression context. We have $p$ covariates ($p$ columns in\nthe design matrix, $X$), of which $q\\leq p$ are linearly independent\ncovariates. This means that $p-q$ of the vectors can be written as\nlinear combos of the $q$ vectors. The space spanned by the covariate\nvectors is of dimension $q$, rather than $p$, and $X^{\\top}X$ has $p-q$\neigenvalues that are zero. The $q$ LIN vectors are basis vectors for the\nspace - we can represent any point in the space as a linear combination\nof the basis vectors. 
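Here is a small sketch (added for illustration) of the point above about $X^{\top}X$: if one column of $X$ is a linear combination of the others, then $X^{\top}X$ has a (numerically) zero eigenvalue and the rank is $q=p-1$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 4
X = rng.normal(size=(n, p))
X[:, 3] = X[:, 1] + 2 * X[:, 2]        # last column is a linear combination of two others

evals = np.linalg.eigvalsh(X.T @ X)    # eigenvalues of the symmetric crossproduct matrix
print(np.sort(evals))                  # smallest is zero up to roundoff
print(np.linalg.matrix_rank(X))        # 3, i.e., q = p - 1
```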
You can think of the basis vectors as being like\nthe axes of the space, except that the basis vectors are not orthogonal.\nSo it's like denoting a point in $\\Re^{q}$ as a set of $q$ numbers\ntelling us where on each of the axes we are - this is the same as a\nlinear combination of axis-oriented vectors.\n\nWhen fitting a regression, if $n=p=q$, a vector of $n$ observations can\nbe represented exactly as a linear combination of the $p$ basis vectors,\nso there is no residual and we have a single unique (and exact) solution\n(e.g., with $n=p=2$, the observations fall exactly on the simple linear\nregression line). If $np$, so the system is overdetermined - there is no exact\nsolution, but regression is all about finding solutions that minimize\nsome criterion about the differences between the observations and linear\ncombinations of the columns of the $X$ matrix (such as least squares or\npenalized least squares). In standard regression, we project the\nobservation vector onto the space spanned by the columns of the $X$\nmatrix, so we find the point in the space closest to the observation\nvector.\n\n## Invertibility, singularity, rank, and positive definiteness\n\nFor square matrices, let's consider how invertibility, singularity, rank\nand positive (or non-negative) definiteness relate.\n\nSquare matrices that are \"regular\" have an eigendecomposition,\n$A=\\Gamma\\Lambda\\Gamma^{-1}$ where $\\Gamma$ is a matrix with the\neigenvectors as the columns and $\\Lambda$ is a diagonal matrix of\neigenvalues, $\\Lambda_{ii}=\\lambda_{i}$. Symmetric matrices and matrices\nwith unique eigenvalues are regular, as are some other matrices. The\nnumber of non-zero eigenvalues is the same as the rank of the matrix.\nSquare matrices that have an inverse are also called nonsingular, and\nthis is equivalent to having full rank. If the matrix is symmetric, the\neigenvectors and eigenvalues are real and $\\Gamma$ is orthogonal, so we\nhave $A=\\Gamma\\Lambda\\Gamma^{\\top}$. The determinant of the matrix is\nthe product of the eigenvalues (why?), which is zero if it is less than\nfull rank. Note that if none of the eigenvalues are zero then\n$A^{-1}=\\Gamma\\Lambda^{-1}\\Gamma^{\\top}$.\n\nLet's focus on symmetric matrices. The symmetric matrices that tend to\narise in statistics are either positive definite (p.d.) or non-negative\ndefinite (n.n.d.). If a matrix is positive definite, then by definition\n$x^{\\top}Ax>0$ for any $x$. Note that if $\\mbox{Cov}(y)=A$ then\n$x^{\\top}Ax=x^{\\top}\\mbox{Cov}(y)x=\\mbox{Cov}(x^{\\top}y)=\\mbox{Var}(x^{\\top}y)$\nif so positive definiteness amounts to having linear combinations of\nrandom variables (with the elements of $x$ here being the weights)\nhaving positive variance. So we must have that positive definite\nmatrices are equivalent to variance-covariance matrices (I'll just refer\nto this as a variance matrix or as a covariance matrix). If $A$ is p.d.\nthen it has all positive eigenvalues and it must have an inverse, though\nas we'll see, from a numerical perspective, we may not be able to\ncompute it if some of the eigenvalues are very close to zero. 
In Python,\n`numpy.linalg.eig(A)[1]` is $\\Gamma$, with each column a vector, and\n`numpy.linalg.eig(A)[0]` contains the (unordered) eigenvalues.\n\nTo summarize, here are some of the various connections between\nmathematical and statistical properties of **positive definite**\nmatrices:\n\n$A$ positive definite $\\Leftrightarrow$ $A$ is a covariance matrix\n$\\Leftrightarrow$ $x^{\\top}Ax>0$ $\\Leftrightarrow$ $\\lambda_{i}>0$\n(positive eigenvalues) $\\Rightarrow$$|A|>0$ $\\Rightarrow$$A$ is\ninvertible $\\Leftrightarrow$ $A$ is non singular $\\Leftrightarrow$ $A$ is\nfull rank.\n\nAnd here are connections for positive semi-definite matrices:\n\n$A$ positive semi-definite $\\Leftrightarrow$ $A$ is a constrained\ncovariance matrix $\\Leftrightarrow$ $x^{\\top}Ax\\geq0$ and equal to 0 for\nsome $x$ $\\Leftrightarrow$ $\\lambda_{i}\\geq 0$ (non-negative eigenvalues),\nwith at least one zero $\\Rightarrow$ $|A|=0$ $\\Leftrightarrow$ $A$ is not\ninvertible $\\Leftrightarrow$ $A$ is singular $\\Leftrightarrow$ $A$ is not\nfull rank.\n\n## Interpreting an eigendecomposition\n\nLet's interpret the eigendecomposition in a generative context as a way\nof generating random vectors. We can generate $y$ s.t. $\\mbox{Cov}(y)=A$\nif we generate $y=\\Gamma\\Lambda^{1/2}z$ where $\\mbox{Cov}(z)=I$ and\n$\\Lambda^{1/2}$ is formed by taking the square roots of the eigenvalues.\nSo $\\sqrt{\\lambda_{i}}$ is the standard deviation associated with the\nbasis vector $\\Gamma_{\\cdot i}$. That is, the $z$'s provide the weights\non the basis vectors, with scaling based on the eigenvalues. So $y$ is\nproduced as a linear combination of eigenvectors as basis vectors, with\nthe variance attributable to the basis vectors determined by the\neigenvalues.\n\nTo go the other direction, we can project a vector $y$ onto the space \nspanned by the eigenvectors: $w = (\\Gamma^{\\top}\\Gamma)^{-1}\\Gamma^{\\top}y = \\Gamma^{\\top}y = \\Lambda^{1/2}z$,\nwhere the simplification of course comes from $\\Gamma$ being orthogonal.\n\nIf $x^{\\top}Ax\\geq0$ then $A$ is nonnegative definite (also called\npositive semi-definite). In this case one or more eigenvalues can be\nzero. Let's interpret this a bit more in the context of generating\nrandom vectors based on non-negative definite matrices,\n$y=\\Gamma\\Lambda^{1/2}z$ where $\\mbox{Cov}(z)=I$. Questions:\n\n1. What does it mean when one or more eigenvalue (i.e.,\n $\\lambda_{i}=\\Lambda_{ii}$) is zero?\n\n2. Suppose I have an eigenvalue that is very small and I set it to\n zero? What will be the impact upon $y$ and $\\mbox{Cov}(y)$?\n\n3. Now let's consider the inverse of a covariance matrix, known as the\n precision matrix, $A^{-1}=\\Gamma\\Lambda^{-1}\\Gamma^{\\top}$. What\n does it mean if a $(\\Lambda^{-1})_{ii}$ is very large? What if\n $(\\Lambda^{-1})_{ii}$ is very small?\n\nConsider an arbitrary $n\\times p$ matrix, $X$. Any crossproduct or sum\nof squares matrix, such as $X^{\\top}X$ is positive definite\n(non-negative definite if $p>n$). This makes sense as it's just a\nscaling of an empirical covariance matrix.\n\n## Generalized inverses (optional)\n\nSuppose I want to find $x$ such that $Ax=b$. Mathematically the answer\n(provided $A$ is invertible, i.e. of full rank) is $x=A^{-1}b$.\n\nGeneralized inverses arise in solving equations when $A$ is not full\nrank. A generalized inverse is a matrix, $A^{-}$ s.t. $AA^{-}A=A$. 
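As a quick numerical check of this defining property (a sketch added here; `np.linalg.pinv` computes the particular generalized inverse discussed next):

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.normal(size=(4, 2))
A = B @ B.T                             # 4 x 4 but rank 2, hence singular

A_ginv = np.linalg.pinv(A)              # a generalized inverse (the Moore-Penrose one)
print(np.allclose(A @ A_ginv @ A, A))   # True: A A^- A = A
print(np.linalg.matrix_rank(A))         # 2
```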
The\nMoore-Penrose inverse (the pseudo-inverse), $A^{+}$, is a (unique)\ngeneralized inverse that also satisfies some additional properties.\n$x=A^{+}b$ is the solution to the linear system, $Ax=b$, that has the\nshortest length for $x$.\n\nWe can find the pseudo-inverse based on an eigendecomposition (or an\nSVD) as $\\Gamma\\Lambda^{+}\\Gamma^{\\top}$. We obtain $\\Lambda^{+}$ from\n$\\Lambda$ as follows. For values $\\lambda_{i}>0$, compute\n$1/\\lambda_{i}$. All other values are set to 0. Let's interpret this\nstatistically. Suppose we have a precision matrix with one or more zero\neigenvalues and we want to find the covariance matrix. A zero eigenvalue\nmeans we have no precision, or infinite variance, for some linear\ncombination (i.e., for some basis vector). We take the pseudo-inverse\nand assign that linear combination zero variance.\n\nLet's consider a specific example. Autoregressive models are often used\nfor smoothing (in time, in space, and in covariates). A first order\nautoregressive model for $y_{1},y_{2},\\ldots,y_{T}$ has\n$E(y_{i}|y_{-i})=\\frac{1}{2}(y_{i-1}+y_{i+1})$. Another way of writing\nthe model is in time-order: $y_{i}=y_{i-1}+\\epsilon_{i}$. A second order\nautoregressive model has\n$E(y_{i}|y_{-i})=\\frac{1}{6}(4y_{i-1}+4y_{i+1}-y_{i-2}-y_{i+2})$. These\nconstructions basically state that each value should be a smoothed\nversion of its neighbors. One can figure out that the **precision**\nmatrix for $y$ in the first order model is $$\\left(\\begin{array}{ccccc}\n\\ddots & & \\vdots\\\\\n-1 & 2 & -1 & 0\\\\\n\\cdots & -1 & 2 & -1 & \\dots\\\\\n & 0 & -1 & 2 & -1\\\\\n & & \\vdots & & \\ddots\n\\end{array}\\right)$$ and in the second order model is\n\n$$\\left( \\begin{array}{ccccccc} \\ddots & & & \\vdots \\\\ 1 & -4 & 6 & -4 & 1 \\\\ \\cdots & 1 & -4 & 6 & -4 & 1 & \\cdots \\\\ & & 1 & -4 & 6 & -4 & 1 \\\\ & & & \\vdots \\end{array} \\right).$$ \n\nIf we look at the eigendecomposition of such\nmatrices, we see that in the first order case, the eigenvalue\ncorresponding to the constant eigenvector is zero.\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\nprecMat = np.array([[1,-1,0,0,0],[-1,2,-1,0,0],[0,-1,2,-1,0],[0,0,-1,2,-1],[0,0,0,-1,1]])\ne = np.linalg.eig(precMat)\ne[0] # 4th eigenvalue is numerically zero\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([3.61803399e+00, 2.61803399e+00, 1.38196601e+00, 4.97762256e-17,\n 3.81966011e-01])\n```\n:::\n\n```{.python .cell-code}\ne[1][:,3] # constant eigenvector\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([0.4472136, 0.4472136, 0.4472136, 0.4472136, 0.4472136])\n```\n:::\n:::\n\n\n\nThis means we have no information about the overall level of $y$. So how\nwould we generate sample $y$ vectors? We can't put infinite variance on\nthe constant basis vector and still generate samples. 
Instead we use the\npseudo-inverse and assign ZERO variance to the constant basis vector.\nThis corresponds to generating realizations under the constraint that\n$\\sum y_{i}$ has no variation, i.e., $\\sum y_{i}=\\bar{y}=0$ - you can\nsee this by seeing that $\\mbox{Var}(\\Gamma_{\\cdot i}^{\\top}y)=0$ when\n$\\lambda_{i}=0$.\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# generate a realization\nevals = e[0]\nevals = 1/evals # variances\nevals[3] = 0 # generalized inverse\ny = e[1] @ ((evals ** 0.5) * np.random.normal(size = 5))\ny.sum()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n-8.881784197001252e-16\n```\n:::\n:::\n\n\n\nIn the second order case, we have two non-identifiabilities: for the sum\nand for the linear component of the variation in $y$ (linear in the\nindices of $y$).\n\nI could parameterize a statistical model as $\\mu+y$ where $y$ has\ncovariance that is the generalized inverse discussed above. Then I allow\nfor both a non-zero mean and for smooth variation governed by the\nautoregressive structure. In the second-order case, I would need to add\na linear component as well, given the second non-identifiability.\n\n## Matrices arising in regression\n\nIn regression, we work with $X^{\\top}X$. Some properties of this matrix\nare that it is symmetric and non-negative definite (hence our use of\n$(X^{\\top}X)^{-1}$ in the OLS estimator). When is it not positive\ndefinite?\n\nFitted values are $X\\hat{\\beta}=X(X^{\\top}X)^{-1}X^{\\top}Y=HY$. The\n\"hat\" matrix, $H$, projects $Y$ into the column space of $X$. $H$ is\nidempotent: $HH=H$, which makes sense - once you've projected into the\nspace, any subsequent projection just gives you the same thing back. $H$\nis singular. Why? Also, under what special circumstance would it not be\nsingular?\n\n# 3. Computational issues\n\n## Storing matrices\n\nWe've discussed column-major and row-major storage of matrices. First,\nretrieval of matrix elements from memory is quickest when multiple\nelements are contiguous in memory. So in a column-major language (e.g.,\nR, Fortran), it is best to work with values in a common column (or\nentire columns) while in a row-major language (e.g., Python, C) for\nvalues in a common row.\n\nIn some cases, one can save space (and potentially speed) by overwriting\nthe output from a matrix calculation into the space occupied by an\ninput. This occurs in some clever implementations of matrix\nfactorizations.\n\n## Algorithms\n\nGood algorithms can change the efficiency of an algorithm by one or more\norders of magnitude, and many of the improvements in computational speed\nover recent decades have been in algorithms rather than in computer\nspeed.\n\nMost matrix algebra calculations can be done in multiple ways. For\nexample, we could compute $b=Ax$ in either of the following ways,\ndenoted here in pseudocode.\n\n1. Stack the inner products of the rows of $A$ with $x$.\\\n\n```\n for(i=1:n){ \n b_i = 0\n for(j=1:m){\n b_i = b_i + a_{ij} x_j\n }\n }\n```\n\n2. Take the linear combination (based on $x$) of the columns of $A$\\\n\n```\n for(i=1:n){ \n b_i = 0\n }\n for(j=1:m){\n for(i = 1:n){\n b_i = b_i + a_{ij} x_j \n }\n }\n```\n\nIn this case the two approaches involve the same number of operations\nbut the first might be better for row-major matrices (so might be how we\nwould implement in C) and the second for column-major (so might be how\nwe would implement in Fortran). \n\n**Challenge**: check whether the first\napproach is faster in Python. 
(Write the code just doing the outer loop and\ndoing the inner loop using vectorized calculation.)\n\n#### General computational issues\n\nThe same caveats we discussed in terms of computer arithmetic hold\nnaturally for linear algebra, since this involves arithmetic with many\nelements. Good implementations of algorithms are aware of the danger of\ncatastrophic cancellation and of the possibility of dividing by zero or\nby values that are near zero.\n\n## Ill-conditioned problems\n\n#### Basics\n\nA problem is ill-conditioned if small changes to values in the\ncomputation result in large changes in the result. This is quantified by\nsomething called the *condition number* of a calculation. For different\noperations there are different condition numbers.\n\nIll-conditionedness arises most often in terms of matrix inversion, so\nthe standard condition number is the \"condition number with respect to\ninversion\", which when using the $L_{2}$ norm is the ratio of the\nabsolute values of the largest to smallest eigenvalue. Here's an\nexample: $$A=\\left(\\begin{array}{cccc}\n10 & 7 & 8 & 7\\\\\n7 & 5 & 6 & 5\\\\\n8 & 6 & 10 & 9\\\\\n7 & 5 & 9 & 10\n\\end{array}\\right).$$ The solution of $Ax=b$ for $b=(32,23,33,31)$ is\n$x=(1,1,1,1)$, while the solution for $b+\\delta b=(32.1,22.9,33.1,30.9)$\nis $x+\\delta x=(9.2,-12.6,4.5,-1.1)$, where $\\delta$ is notation for a\nperturbation to the vector or matrix.\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ndef norm2(x):\n return(np.sum(x**2) ** 0.5)\n\nA = np.array([[10,7,8,7],[7,5,6,5],[8,6,10,9],[7,5,9,10]])\nb = np.array([32,23,33,31])\nx = np.linalg.solve(A, b)\n\nbPerturbed = np.array([32.1, 22.9, 33.1, 30.9])\nxPerturbed = np.linalg.solve(A, bPerturbed)\n\ndelta_b = bPerturbed - b\ndelta_x = xPerturbed - x\n```\n:::\n\n\n\n\nWhat's going on? Some manipulations with inequalities involving the\ninduced matrix norm (for any chosen vector norm, but we might as well\njust think about the Euclidean norm) (see Gentle-CS Sec. 5.1 or the derivation in class) give\n$$\\frac{\\|\\delta x\\|}{\\|x\\|}\\leq\\|A\\|\\|A^{-1}\\|\\frac{\\|\\delta b\\|}{\\|b\\|}$$\nwhere we define the condition number w.r.t. inversion as\n$\\mbox{cond}(A)\\equiv\\|A\\|\\|A^{-1}\\|$. We'll generally work with the\n$L_{2}$ norm, and for a nonsingular square matrix the result is that the\ncondition number is the ratio of the absolute values of the largest and\nsmallest magnitude eigenvalues. 
This makes sense since $\\|A\\|_{2}$ is\nthe absolute value of the largest magnitude eigenvalue of $A$ and\n$\\|A^{-1}\\|_{2}$ that of the inverse of the absolute value of the\nsmallest magnitude eigenvalue of $A$.\n\nWe see in the code below that the large disparity in eigenvalues of $A$\nleads to an effect predictable from our inequality above, with the\ncondition number helping us find an upper bound.\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ne = np.linalg.eig(A)\nevals = e[0]\nprint(evals)\n\n## relative perturbation in x much bigger than in b\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[3.02886853e+01 3.85805746e+00 1.01500484e-02 8.43107150e-01]\n```\n:::\n\n```{.python .cell-code}\nnorm2(delta_x) / norm2(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n8.19847546803699\n```\n:::\n\n```{.python .cell-code}\nnorm2(delta_b) / norm2(b)\n\n## ratio of relative perturbations\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n0.0033319453118976702\n```\n:::\n\n```{.python .cell-code}\n(norm2(delta_x) / norm2(x)) / (norm2(delta_b) / norm2(b))\n\n## ratio of largest and smallest magnitude eigenvalues\n## confusingly evals[2] is the smallest, not evals[3]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n2460.567236431514\n```\n:::\n\n```{.python .cell-code}\n(evals[0]/evals[2])\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n2984.092701676269\n```\n:::\n:::\n\n\n\nThe main use of these ideas for our purposes is in thinking about the\nnumerical accuracy of a linear system solution (Gentle-NLA Sec 3.4). On\na computer we have the system $$(A+\\delta A)(x+\\delta x)=b+\\delta b$$\nwhere the 'perturbation' is from the inaccuracy of computer numbers. Our\nexploration of computer numbers tells us that\n$$\\frac{\\|\\delta b\\|}{\\|b\\|}\\approx10^{-p};\\,\\,\\,\\frac{\\|\\delta A\\|}{\\|A\\|}\\approx10^{-p}$$\nwhere $p=16$ for standard double precision floating points. Following\nGentle, one gets the approximation\n\n$$\\frac{\\|\\delta x\\|}{\\|x\\|}\\approx\\mbox{cond}(A)10^{-p},$$ so if\n$\\mbox{cond}(A)\\approx10^{t}$, we have accuracy of order $10^{t-p}$\ninstead of $10^{-p}$. (Gentle cautions that this holds only if\n$10^{t-p}\\ll1$). So we can think of the condition number as giving us\nthe number of digits of accuracy lost during a computation relative to\nthe precision of numbers on the computer. E.g., a condition number of\n$10^{8}$ means we lose 8 digits of accuracy relative to our original 16\non standard systems. One issue is that estimating the condition number\nis itself subject to numerical error and requires computation of\n$A^{-1}$ (albeit not in the case of $L_{2}$ norm with square,\nnonsingular $A$) but see Golub and van Loan (1996; p. 76-78) for an\nalgorithm.\n\n#### Improving conditioning\n\nIll-conditioned problems in statistics often arise from collinearity of\nregressors. Often the best solution is not a numerical one, but\nre-thinking the modeling approach, as this generally indicates\nstatistical issues beyond just the numerical difficulties.\n\nA general comment on improving conditioning is that we want to avoid\nlarge differences in the magnitudes of numbers involved in a\ncalculation. In some contexts such as regression, we can center and\nscale the columns to avoid such differences - this will improve the\ncondition of the problem. 
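Before the regression example below, here is one more small illustration (added here, using the classically ill-conditioned Hilbert matrix from `scipy.linalg`) of the "digits lost" interpretation of the condition number:

```python
import numpy as np
from scipy.linalg import hilbert

n = 12
H = hilbert(n)                     # notoriously ill-conditioned
x_true = np.ones(n)
b = H @ x_true

x_hat = np.linalg.solve(H, b)
print(np.linalg.cond(H))                 # roughly 1e16
print(np.max(np.abs(x_hat - x_true)))    # sizable error, consistent with losing ~16 digits (cond(H) * 1e-16)
```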
E.g., in simple quadratic regression with\n$x=\\{1990,\\ldots,2010\\}$ (e.g., regressing on calendar years), we see\nthat centering and scaling the matrix columns makes a huge difference on\nthe condition number\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport statsmodels.api as sm\n\nt1 = np.arange(1990, 2011) # naive covariate\nX1 = np.column_stack((np.ones(21), t1, t1 ** 2))\n\nbeta = np.array([5, .1, .0001])\ny = X1 @ beta + np.random.normal(size = len(t1))\n\ne1 = np.linalg.eig(np.dot(X1.T, X1))\nnp.sort(e1[0])[::-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([3.36018564e+14, 7.69949736e+02, 2.24079720e-08])\n```\n:::\n\n```{.python .cell-code}\nnp.linalg.cond(np.dot(X1.T, X1)) # built-in!\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n4.653329826789808e+21\n```\n:::\n\n```{.python .cell-code}\nsm.OLS(y, X1).fit().params\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([-3.61355412e+04, 3.61987534e+01, -8.91405553e-03])\n```\n:::\n\n```{.python .cell-code}\nt2 = t1 - 2000 # centered\nX2 = np.column_stack((np.ones(21), t2, t2 ** 2))\ne2 = np.linalg.eig(np.dot(X2.T, X2))\nwith np.printoptions(suppress=True):\n np.sort(e2[0])[::-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([50677.70427505, 770. , 9.29572495])\n```\n:::\n\n```{.python .cell-code}\nt3 = t2/10 # centered and scaled\nX3 = np.column_stack((np.ones(21), t3, t3 ** 2))\ne3 = np.linalg.eig(np.dot(X3.T, X3))\nwith np.printoptions(suppress=True):\n np.sort(e3[0])[::-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([24.11293487, 7.7 , 1.95366513])\n```\n:::\n:::\n\n\n\nI haven't shown the OLS results for the transformed version\nof the problem because the transformations modify the true\nparameter values, so there is a bit of arithmetic to be done,\nbut you should be able to verify that the second and third\napproaches give reasonable answers.\n\n\nThe basic story is that simple strategies often solve the problem, and\nthat you should be aware of the absolute and relative magnitudes\ninvolved in your calculations.\n\nOne rule of thumb is to try to work with numbers whose magnitude is\naround 1. We can often scale the values in our problem in order to do\nthis. I.e., change the units of your variables. Instead of personal\nincome in dollars, use personal income in thousands or hundreds of\nthousands of dollars.\n\n# 4. Matrix factorizations (decompositions) and solving systems of linear equations\n\nSuppose we want to solve the following linear system: \n\n$$\\begin{aligned} Ax & = & b\\\\ x & = & A^{-1}b \\end{aligned}$$ \n\n\nNumerically, this is never done by\nfinding the inverse and multiplying. Rather we solve the system using a\nmatrix decomposition (or equivalent set of steps). One approach uses\nGaussian elimination (equivalent to the LU decomposition), while another\nuses the Cholesky decomposition. There are also iterative methods that\ngenerate a sequence of approximations to the solution but reduce\ncomputation (provided they are stopped before the exact solution is\nfound).\n\nGentle-CS has a nice table overviewing the various factorizations (Table\n5.1, page 219). 
I've reproduced a variation on it here.\n\n\n\n| Name | Representation | Restrictions | Properties | Uses |\n| ----------| ----------------------| ------------| --------------------|-------------|\n| LU | $A_{nn}= L_{nn}U_{nn}$ | $A$ generally square | $L$ lower triangular; $U$ upper triangular | solving equations; inversion | \n| QR | $A_{nm}= Q_{nn}R_{nm}$ or $A_{nm}=Q_{nm}R_{mm}$(skinny) | | $Q$ orthogonal; $R$ upper triangular | regression |\n| Cholesky | $A_{nn}=U_{nn}^{\\top}U_{nn}$ | $A$ positive (semi-) definite | $U$ upper triangular | multivariate normal; covariance; solving equations; inversion |\n| Eigen decomposition | $A_{nn}=\\Gamma_{nn}\\Lambda_{nn}\\Gamma_{nn}^{\\top}$ | $A$ square, symmetric*| $\\Gamma$ orthogonal; $\\Lambda$ (non-negative**) diagonal | principal components analysis and related | \n| SVD | $A_{nm}=U_{nn}D_{nm}V_{mm}^{\\top}$ or $A_{nm}= U_{nk}D_{kk}V_{mk}^{\\top}$ | | $U, V$ orthogonal; $D$ (non-negative) diagonal | machine learning, topic models |\n\nTable: Matrix factorizations useful for statistics / data science / machine learning\n\n*For the eigen decomposition, I assume $A$ is symmetric, though there\nis a decomposition for non-symmetric $A$.\n\n** For positive definite or positive semi-definite $A$.\n\n## Triangular systems\n\nAs a preface, let's figure out how to solve $Ax=b$ if $A$ is upper\ntriangular. The basic algorithm proceeds from the bottom up (and\ntherefore is called a 'backsolve'. We solve for $x_{n}$ trivially, and\nthen move upwards plugging in the known values of $x$ and solving for\nthe remaining unknown in each row (each equation).\n\n1. $x_{n}=b_{n}/A_{nn}$\n2. Now for $kp$ then the leading $p$ rows of $R$ provide an upper triangular\nmatrix ($R_{1}$) and the remaining rows are 0. (I'm using $p$ because\nthe QR is generally applied to design matrices in regression). In this\ncase we really only need the first $p$ columns of $Q$, and we have\n$X=Q_{1}R_{1}$, the 'skinny' QR (this is what R's QR provides). For\nuniqueness, we can require the diagonals of $R$ to be nonnegative, and\nthen $R$ will be the same as the upper-triangular Cholesky factor of\n$X^{\\top}X$: \n\n$$\\begin{aligned} X^{\\top}X & = & R^{\\top}Q^{\\top}QR \\\\ & = & R^{\\top}R\\end{aligned}.$$ \n\n\nThere are three standard approaches for\ncomputing the QR, using (1) reflections (Householder transformations),\n(2) rotations (Givens transformations), or (3) Gram-Schmidt\northogonalization (see below for details).\n\nFor $n\\times n$ $X$, the QR (for the Householder approach) requires\n$2n^{3}/3$ flops, so QR is less efficient than LU or Cholesky.\n\nWe can also obtain the pseudo-inverse of $X$ from the QR:\n$X^{+}=[R_{1}^{-1}\\,0]Q^{\\top}$. In the case that $X$ is not full-rank,\nthere is a version of the QR that will work (involving pivoting) and we\nend up with some additional zeroes on the diagonal of $R_{1}$.\n\n### Regression and the QR\n\nOften QR is used to fit linear models, including in R. Consider the\nlinear model in the form $Y=X\\beta+\\epsilon$, finding\n$\\hat{\\beta}=(X^{\\top}X)^{-1}X^{\\top}Y$. Let's consider the skinny QR\nand note that $R^{\\top}$ is invertible. Therefore, we can express the\nnormal equations as \n\n$$\n\\begin{aligned} \nX^{\\top}X\\beta & = & X^{\\top} Y \\\\\nR^{\\top}Q^{\\top}QR\\beta & = & R^{\\top}Q^{\\top} Y \\\\\nR \\beta & = & Q^{\\top} Y\n\\end{aligned}\n$$ \n\n\nand solving for $\\beta$ is just a\nbacksolve since $R$ is upper-triangular. 
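Here's a minimal sketch of those steps on simulated data (added here; it uses `scipy.linalg.solve_triangular` for the backsolve):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack((np.ones(n), rng.normal(size=(n, p - 1))))
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(size=n)

Q, R = np.linalg.qr(X)                   # skinny QR: Q is n x p, R is p x p upper triangular
beta_hat = solve_triangular(R, Q.T @ Y)  # backsolve R beta = Q^T Y
print(beta_hat)
print(np.allclose(beta_hat, np.linalg.lstsq(X, Y, rcond=None)[0]))   # matches least squares
```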
Furthermore the standard\nregression quantities, such as the hat matrix, the SSE, the residuals,\netc. can be easily expressed in terms of $Q$ and $R$.\n\nWhy use the QR instead of the Cholesky on $X^{\\top}X$? The condition\nnumber of $X$ is the square root of that of $X^{\\top}X$, and the $QR$\nfactorizes $X$. Monahan has a discussion of the condition of the\nregression problem, but from a larger perspective, the situations where\nnumerical accuracy is a concern are generally cases where the OLS\nestimators are not particularly helpful anyway (e.g., highly collinear\npredictors).\n\nWhat about computational order of the different approaches to least\nsquares? The Cholesky is $np^{2}+\\frac{1}{3}p^{3}$, an algorithm called\nsweeping is $np^{2}+p^{3}$ , the Householder method for QR is\n$2np^{2}-\\frac{2}{3}p^{3}$, and the modified Gram-Schmidt approach for\nQR is $2np^{2}$. So if $n\\gg p$ then Cholesky (and sweeping) are faster\nthan the QR approaches. According to Monahan, modified Gram-Schmidt is\nmost numerically stable and sweeping least. In general, regression is\npretty quick unless $p$ is large since it is linear in $n$, so it may\nnot be worth worrying too much about computational differences of the\nsort noted here.\n\n### Regression and the QR in Python and R\n\nWe can get the Q and R matrices easily in Python.\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nQ,R = np.linalg.qr(X)\n```\n:::\n\n\n\nOne of the methods used by the [`statsmodel` package in Python\nuses the QR to fit a regression](https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.fit.html).\n\nNote that by default in Python (and in R), you get the skinny QR, namely only the\nfirst $p$ rows of $R$ and the first $p$ columns of $Q$, where the latter\nform an orthonormal basis for the column space of $X$. The remaining\ncolumns form an orthonormal basis for the null space of $X$ (the space\northogonal to the column space of $X$). The analogy in regression is\nthat we get the basis vectors for the regression, while adding the\nremaining columns gives us the full $n$-dimensional space of the\nobservations.\n\nRegression in R uses the QR decomposition via `qr()`, which calls a\nFortran function. `qr()` (and the Fortran functions that are called) is\nspecifically designed to output quantities useful in fitting linear\nmodels. \n\nIn R, `qr()` returns the result as a list meant for use by other tools. R\nstores the $R$ matrix in the upper triangle of `\\$qr`, while the lower\ntriangle of *\\$qr* and *\\$aux* store the information for constructing\n$Q$ (this relates to the Householder-related vectors $u$ below). One can\nmultiply by $Q$ using *qr.qy()* and by $Q^{\\top}$ using *qr.qty()*. If\nyou want to extract $R$ and $Q$, the following will work:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nX.qr = qr(X)\nQ = qr.Q(X.qr)\nR = qr.R(X.qr) \n```\n:::\n\n\n\nAs a side note, there are QR-based functions that provide\nregression-related quantities, such as *qr.resid()*, *qr.fitted()* and\n*qr.coef()*. These functions (and their Fortran counterparts) exist\nbecause one can work through the various regression quantities of\ninterest and find their expressions in terms of $Q$ and $R$, with nice\nproperties resulting from $Q$ being orthogonal and $R$ triangular.\n\n### Computing the QR decomposition\n\nHere we'll see some of the details of the different approaches to the\nQR, in part because they involve some concepts that may be useful in\nother contexts. 
I won't expect you to see all of how this works, but\nplease skim through this to get an idea of how things are done.\n\nOne approach involves reflections of vectors and a second rotations of\nvectors. Reflections and rotations are transformations that are\nperformed by orthogonal matrices. The determinant of a reflection matrix\nis -1 and the determinant of a rotation matrix is 1. We'll see some of\nthe details in the demo code.\n\n#### QR Method 1: Reflections\n\nIf $u$ and $v$ are orthonormal vectors and $x$ is in the space spanned\nby $u$ and $v$, $x=c_{1}u+c_{2}v$, then $\\tilde{x}=-c_{1}u+c_{2}v$ is a\nreflection (a *Householder* reflection) along the $u$ dimension (since\nwe are using the negative of that basis vector). We can think of this as\nreflecting across the plane perpendicular to $u$. This extends simply to\nhigher dimensions with orthonormal vectors, $u,v_{1},v_{2},\\ldots$\n\nSuppose we want to formulate the reflection in terms of a \"Householder\"\nmatrix, $Q$. It turns out that $$Qx=\\tilde{x}$$ if $Q=I-2uu^{\\top}$. $Q$\nhas the following properties: (1) $Qu=-u$, (2) $Qv=v$ for $u^{\\top}v=0$,\n(3) $Q$ is orthogonal and symmetric.\n\nOne way to create the QR decomposition is by a series of Householder\ntransformations that create an upper triangular $R$ from $X$:\n\n$$\n\\begin{aligned}\nR & = & Q_{p}\\cdots Q_{1} X \\\\\nQ & = & (Q_{p}\\cdots Q_{1})^{\\top}\n\\end{aligned}\n$$ \n\n\nwhere we make use of\nthe symmetry in defining $Q$.\n\nBasically $Q_{1}$ reflects the first column of $X$ with respect to a\ncarefully chosen $u$, so that the result is all zeroes except for the\nfirst element. We want $Q_{1}x=\\tilde{x}=(||x||,0,\\ldots,0)$. This can\nbe achieved with $u=\\frac{x-\\tilde{x}}{||x-\\tilde{x}||}$. Then $Q_{2}$\nmakes the last $n-2$ rows of the second column equal to zero. We'll work\nthrough this a bit in class.\n\nIn the regression context, as we work through the individual\ntransformations, $Q_{j}=I-2u_{j}u_{j}^{\\top}$, we apply them to $X$ and\n$Y$ to create $R$ (note this would not involve doing the full matrix\nmultiplication - think about what calculations are actually needed) and\n$QY=Q^{\\top}Y$, and then solve $R\\beta=Q^{\\top}Y$. To find\n$\\mbox{Cov}(\\hat{\\beta})\\propto(X^{\\top}X)^{-1}=(R^{\\top}R)^{-1}=R^{-1}R^{-\\top}$\nwe do need to invert $R$, but it's upper-triangular and of dimension\n$p\\times p$. It turns out that $Q^{\\top}Y$ can be partitioned into the\nfirst $p$ and the last $n-p$ elements, $z^{(1)}$ and $z^{(2)}$. The SSR\nis $\\|z^{(1)}\\|^{2}$ and SSE is $\\|z^{(2)}\\|^{2}$.\n\nFinal side note: if $X$ is square (so $n=p)$ you might wonder why we\nneed $Q_{p}$ since after $p-1$ reflections, we don't need to zero\nanything else out (since the last column of $R$ has $n$ non-zero\nelements). It turns out that if we go back to thinking about a\nHouseholder reflection in general, there is a lack of uniqueness in\nchoosing $\\tilde{x}$. It could either be $(||x||,0,\\ldots,0)$ or\n$(-||x||,0,\\ldots,0)$. For better numerical stability, one chooses from\nthe two of those such that $x_{1}$ is of the opposite sign to\n$\\tilde{x}_{1}$, so that one avoids cancellation of numbers that may be\nof the same magnitude when doing $x-\\tilde{x}$. The transformation\n$Q_{p}$ is the last step of taking that approach of choosing the sign at\neach step. $Q_{p}$ doesn't zero anything out; it just basically just\ninvolves potentially setting $R_{pp}$ to be $-R_{pp}$. 
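Here is a tiny numerical sketch (added here) of a single Householder reflection of the kind used in the first step, taking $x$ to $\tilde{x}=(\pm\|x\|,0,\ldots,0)$:

```python
import numpy as np

x = np.array([2.0, 1.0, 2.0])                    # ||x|| = 3
x_tilde = np.zeros_like(x)
x_tilde[0] = -np.sign(x[0]) * np.linalg.norm(x)  # sign chosen opposite x[0] to avoid cancellation
u = (x - x_tilde) / np.linalg.norm(x - x_tilde)
Q = np.eye(len(x)) - 2 * np.outer(u, u)          # Householder matrix: orthogonal and symmetric

print(Q @ x)                              # (-3, 0, 0): zeroed out below the first element
print(np.allclose(Q.T @ Q, np.eye(3)))    # True: Q is orthogonal
```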
(To be honest,\nI'm not clear on why one would bother to do that last step, but that\nseems to be how it is presented in discussions of the Householder\napproach.) Of course in the case of $p2$, find interim vectors, $x_{k}^{(2)}$, by\n orthogonalizing with respect to $\\tilde{x}_{1}$\n\n3. Proceed for $k=3,\\ldots$, in turn orthogonalizing and normalizing\n the first of the remaining vectors w.r.t. $\\tilde{x}_{k-1}$ and\n orthogonalizing the remaining vectors w.r.t. $\\tilde{x}_{k-1}$ to\n get new interim vectors\n\nMathematically, we could instead orthogonalize $x_{2}$ w.r.t.\n$\\tilde{x}_{1}$, then orthogonalize $x_{3}$ w.r.t.\n$\\{\\tilde{x}_{1},\\tilde{x}_{2}\\}$, etc. The algorithm above is the\n*modified* G-S, and is known to be more numerically stable if the\ncolumns of $X$ are close to collinear, giving vectors that are closer to\northogonal. The resulting $\\tilde{x}$ vectors are the columns of $Q$.\nThe elements of $R$ are obtained as we proceed: the diagonal values are\nthe the normalization values in the denominators, while the\noff-diagonals are the inner products with the already-computed columns\nof $Q$ that are computed as part of the numerators.\n\nAnother way to think about this is that $R=Q^{\\top}X$, which is the same\nas regressing the columns of $X$ on $Q,$ since\n$(Q^{\\top}Q)^{-1}Q^{\\top}X=Q^{\\top}X$. By construction, the first column\nof $X$ is a scaling of the first column of $Q$, the second column of $X$\nis a linear combination of the first two columns of $Q$, etc., so $R$\nbeing upper triangular makes sense.\n\n### The \"tall-skinny\" QR\n\nSuppose you have a very large regression problem, with $n$ very large,\nand $n\\gg p$. There is a variant of the QR, called the tall-skinny QR\n(see for details) that allows us\nto find the decomposition in a parallel fashion. The basic idea is to do\na nested set of QR decompositions on blocks of rows of $X$:\n\n$$\nX = \\left( \\begin{array}{c}\nX_{0} \\\\\nX_{1} \\\\\nX_{2} \\\\\nX_{3}\n\\end{array}\n\\right) =\n\\left(\n\\begin{array}{c}\nQ_{0} R_{0} \\\\\nQ_{1} R_{1} \\\\\nQ_{2} R_{2} \\\\\nQ_{3} R_{3}\n\\end{array} \\right),\n$$\n\nfollowed by 'reduction' steps (this\ncan be done in a map-reduce context) that do the $QR$ of pairs of the\n$R$ factors: \n$$\\left(\\begin{array}{c}\nR_{0}\\\\\nR_{1}\\\\\nR_{2}\\\\\nR_{3}\n\\end{array}\\right)=\\left(\\begin{array}{c}\n\\left(\\begin{array}{c}\nR_{0}\\\\\nR_{1}\n\\end{array}\\right)\\\\\n\\left(\\begin{array}{c}\nR_{2}\\\\\nR_{3}\n\\end{array}\\right)\n\\end{array}\\right)=\\left(\\begin{array}{c}\nQ_{01}R_{01}\\\\\nQ_{23}R_{23}\n\\end{array}\\right)$$ and $$\\left(\\begin{array}{c}\nR_{01}\\\\\nR_{23}\n\\end{array}\\right)=Q_{0123}R_{0123}.$$ \n\nThe full decomposition is then\n\n$$X=\\left( \\begin{array}{cccc} Q_{0} & 0 & 0 & 0 \\\\ 0 & Q_{1} & 0 & 0 \\\\ 0 & 0 & Q_{2} & 0 \\\\ 0 & 0 & 0 & Q_{3} \\end{array} \\right) \\left( \\begin{array}{cc} Q_{01} & 0 \\\\ 0 & Q_{23} \\end{array} \\right) Q_{0123} R_{0123} = QR.$$\n\n\nThe computation can be done in\nparallel (in particular it can be done with map-reduce) and the $Q$\nmatrix for big problems would generally not be computed explicitly but\nwould be stored in its constituent pieces.\n\nAlternatively, there is a variant on the algorithm that processes the\nrow-blocks of $X$ serially, allowing you to do QR on a large tall-skinny\nmatrix that you can't fit in memory (or possibly even on disk). First\nyou do $QR$ on $X_{0}$ to get $Q_{0}R_{0}$. Then you stack $R_{0}$ on\ntop of $X_{1}$ and do QR to get $R_{01}$. 
Then stack $R_{01}$ on top of\n$X_{2}$ to get $R_{012}$, etc.\n\n## Determinants\n\nThe absolute value of the determinant of a square matrix can be found\nfrom the product of the diagonals of the triangular matrix in any\nfactorization that gives a triangular (including diagonal) matrix times\nan orthogonal matrix (or matrices) since the determinant of an\northogonal matrix is either one or minus one.\n\n$|A|=|QR|=|Q||R|=\\pm|R|$\n\n$|A^{\\top}A|=|(QR)^{\\top}QR|=|R^{\\top}R|=|R_{1}^{\\top}R_{1}|=|R_{1}|^{2}$\\\nIn Python, the following will do it (on the log scale).\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nQ,R = qr(A)\nmagn = np.sum(np.log(np.abs(np.diag(R)))) \n```\n:::\n\n\n\nAn alternative is the product of the diagonal elements of $D$ (the\nsingular values) in the SVD factorization, $A=UDV^{\\top}$.\n\nFor non-negative definite matrices, we know the determinant is\nnon-negative, so the uncertainty about the sign is not an issue. For\npositive definite matrices, a good approach is to use the product of the\ndiagonal elements of the Cholesky decomposition.\n\nOne can also use the product of the eigenvalues:\n$|A|=|\\Gamma\\Lambda\\Gamma^{-1}|=|\\Gamma||\\Gamma^{-1}||\\Lambda|=|\\Lambda|$\n\n#### Computation\n\nComputing from any of these diagonal or triangular matrices as the\nproduct of the diagonals is prone to overflow and underflow, so we\n**always** work on the log scale as the sum of the log of the values.\nWhen some of these may be negative, we can always keep track of the\nnumber of negative values and take the log of the absolute values.\n\nOften we will have the factorization as a result of other parts of the\ncomputation, so we get the determinant for free.\n\nWe can use `np.linalg.logdet()` or (definitely not recommended)\n`np.linalg.det()` to calculate the determinant in Python.\nThese functions use the LU decomposition.\n\n\n# 5. Eigendecomposition and SVD\n\n## Eigendecomposition \n\nThe eigendecomposition (spectral decomposition) is useful in considering\nconvergence of algorithms and of course for statistical decompositions\nsuch as PCA. We think of decomposing the components of variation into\northogonal patterns (the eigenvectors) with variances (eigenvalues)\nassociated with each pattern.\n\nSquare symmetric matrices have real eigenvectors and eigenvalues, with\nthe factorization into orthogonal $\\Gamma$ and diagonal $\\Lambda$,\n$A=\\Gamma\\Lambda\\Gamma^{\\top}$, where the eigenvalues on the diagonal of\n$\\Lambda$ are ordered in decreasing value. Of course this is equivalent\nto the definition of an eigenvalue/eigenvector pair as a pair such that\n$Ax=\\lambda x$ where $x$ is the eigenvector and $\\lambda$ is a scalar,\nthe eigenvalue. The inverse of the eigendecomposition is simply\n$\\Gamma\\Lambda^{-1}\\Gamma^{\\top}$. On a similar note, we can create a\nsquare root matrix, $\\Gamma\\Lambda^{1/2}$, by taking the square roots of\nthe eigenvalues.\n\nThe spectral radius of $A$, denoted $\\rho(A)$, is the maximum of the\nabsolute values of the eigenvalues. As we saw when talking about\nill-conditionedness, for symmetric matrices, this maximum is the induced\nnorm, so we have $\\rho(A)=\\|A\\|_{2}$. It turns out that\n$\\rho(A)\\leq\\|A\\|$ for any induced matrix norm. The spectral radius\ncomes up in determining the rate of convergence of some iterative\nalgorithms.\n\n#### Computation\n\nThere are several methods for eigenvalues; a common one for doing the\nfull eigendecomposition is the *QR algorithm*. 
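Before getting into the details of the QR algorithm, here is a small sketch (added here) tying together the determinant comments above: for a positive definite matrix, the log determinant is twice the sum of the logs of the diagonal of the Cholesky factor, which matches NumPy's `slogdet`.

```python
import numpy as np

rng = np.random.default_rng(7)
B = rng.normal(size=(5, 5))
A = B @ B.T + 5 * np.eye(5)              # positive definite

L = np.linalg.cholesky(A)                # A = L L^T, L lower triangular
logdet_chol = 2 * np.sum(np.log(np.diag(L)))
sign, logdet = np.linalg.slogdet(A)      # built-in log-determinant (sign and log|A|)
print(logdet_chol, logdet, sign)         # the two log-determinants agree; sign is 1.0
```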
The first step is to\nreduce $A$ to upper Hessenburg form, which is an upper triangular matrix\nexcept that the first subdiagonal in the lower triangular part can be\nnon-zero. For symmetric matrices, the result is actually tridiagonal. We\ncan do the reduction using Householder reflections or Givens rotations.\nAt this point the QR decomposition (using Givens rotations) is applied\niteratively (to a version of the matrix in which the diagonals are\nshifted), and the result converges to a diagonal matrix, which provides\nthe eigenvalues. It's more work to get the eigenvectors, but they are\nobtained as a product of Householder matrices (required for the initial\nreduction) multiplied by the product of the $Q$ matrices from the\nsuccessive QR decompositions.\n\nWe won't go into the algorithm in detail, but note that it involves\nmanipulations and ideas we've seen already.\n\nIf only the largest (or the first few largest) eigenvalues and their\neigenvectors are needed, which can come up in time series and Markov\nchain contexts, the problem is easier and can be solved by the *power\nmethod*. E.g., in a Markov chain context, steady state is reached\nthrough $x_{t}=A^{t}x_{0}$. One can find the largest eigenvector by\nmultiplying by $A$ many times, normalizing at each step.\n$v^{(k)}=Az^{(k-1)}$ and $z^{(k)}=v^{(k)}/\\|v^{(k)}\\|$. There is an\nextension to find the $p$ largest eigenvalues and their vectors. See the\ndemo code in the qmd source file for an implementation (in R).\n\n\n\n\n\n\n\n## Singular value decomposition\n\nLet's consider an $n\\times m$ matrix, $A$, with $n\\geq m$ (if $m>n$, we\ncan always work with $A^{\\top})$. This often is a matrix representing\n$m$ features of $n$ observations. We could have $n$ documents and $m$\nwords, or $n$ gene expression levels and $m$ experimental conditions,\netc. $A$ can always be decomposed as $$A=UDV^{\\top}$$ where $U$ and $V$\nare matrices with orthonormal columns (left and right eigenvectors) and\n$D$ is diagonal with non-negative values (which correspond to\neigenvalues in the case of square $A$ and to squared eigenvalues of\n$A^{\\top}A$).\n\nThe SVD can be represented in more than one way. One representation is\n$$A_{n\\times m}=U_{n\\times k}D_{k\\times k}V_{k\\times m}^{\\top}=\\sum_{j=1}^{k}D_{jj}u_{j}v_{j}^{\\top}$$\nwhere $u_{j}$ and $v_{j}$ are the columns of $U$ and $V$ and where $k$\nis the rank of $A$ (which is at most the minimum of $n$ and $m$ of\ncourse). The diagonal elements of $D$ are the singular values.\n\nThat representation is as the sum of rank-one matrices (since\neach term is the scaled outer product of two vectors). \n\nIf $A$ is positive semi-definite, the eigendecomposition is an SVD.\nFurthermore, $A^{\\top}A=VD^{2}V^{\\top}$ and $AA^{\\top}=UD^{2}U^{\\top}$,\nso we can find the eigendecomposition of such matrices using the SVD of\n$A$ (for $AA^{\\top}$ we need to fill out $U$ to have $n$ columns). 
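A quick numerical check of that relationship (a sketch added here):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(6, 3))

U, d, Vt = np.linalg.svd(A, full_matrices=False)   # skinny SVD: U is 6x3, d holds 3 singular values
evals = np.linalg.eigvalsh(A.T @ A)[::-1]          # eigenvalues of A^T A, sorted decreasing

print(np.allclose(d**2, evals))      # True: squared singular values are the eigenvalues of A^T A
```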
Note\nthat the squares of the singular values of $A$ are the eigenvalues of\n$A^{\\top}A$ and $AA^{\\top}$.\n\nWe can also fill out the matrices to get\n$$A=U_{n\\times n}D_{n\\times m}V_{m\\times m}^{\\top}$$ where the added\nrows and columns of $D$ are zero with the upper left block the\n$D_{k\\times k}$ from above.\n\n#### Uses\n\nThe SVD is an excellent way to determine a matrix rank and to construct\na pseudo-inverse ($A^{+}=VD^{+}U^{\\top})$.\n\nWe can use the SVD to approximate $A$ by taking\n$A\\approx\\tilde{A}=\\sum_{j=1}^{p}D_{jj}u_{j}v_{j}^{\\top}$ for $pj+p$ and the upper bandwidth is $q$ ($A_{ij}=0$\nfor $j>i+q$). An alternative to reducing to $Ux=b^{*}$ is to compute\n$A=LU$ and then do two solutions, $U^{-1}(L^{-1}b)$. One can show that\nthe computational complexity of the LU factorization is $O(npq)$ for\nbanded matrices, while solving the two triangular systems is $O(np+nq)$,\nso for small $p$ and $q$, the speedup can be dramatic.\n\nBanded matrices come up in time series analysis. E.g., moving average (MA) models produce\nbanded covariance structures because the covariance is zero after a\ncertain number of lags.\n\n## Low rank updates (optional)\n\nA transformation of the form $A-uv^{\\top}$ is a rank-one update because\n$uv^{\\top}$ is of rank one.\n\nMore generally a low rank update of $A$ is $\\tilde{A}=A-UV^{\\top}$ where\n$U$ and $V$ are $n\\times m$ with $n\\geq m$. The\nSherman-Morrison-Woodbury formula tells us that\n$$\\tilde{A}^{-1}=A^{-1}+A^{-1}U(I_{m}-V^{\\top}A^{-1}U)^{-1}V^{\\top}A^{-1}$$\nso if we know $x_{0}=A^{-1}b$, then the solution to $\\tilde{A}x=b$ is\n$x+A^{-1}U(I_{m}-V^{\\top}A^{-1}U)^{-1}V^{\\top}x$. Provided $m$ is not\ntoo large, and particularly if we already have a factorization of $A$,\nthen $A^{-1}U$ is not too bad computationally, and\n$I_{m}-V^{\\top}A^{-1}U$ is $m\\times m$. As a result\n$A^{-1}(U(\\cdots)^{-1}V^{\\top}x)$ isn't too bad.\n\nThis also comes up in working with precision matrices in Bayesian\nproblems where we may have $A^{-1}$ but not $A$ (we often add precision\nmatrices to find conditional normal distributions). An alternative\nexpression for the formula is $\\tilde{A}=A+UCV^{\\top}$, and the identity\ntells us\n$$\\tilde{A}^{-1}=A^{-1}-A^{-1}U(C^{-1}+V^{\\top}A^{-1}U)^{-1}V^{\\top}A^{-1}$$\n\nBasically Sherman-Morrison-Woodbury gives us matrix identities that we\ncan use in combination with our knowledge of smart ways of solving\nsystems of equations.\n\n# 7. Iterative solutions of linear systems (optional)\n\n#### Gauss-Seidel\n\nSuppose we want to iteratively solve $Ax=b$. Here's the algorithm, which\nsequentially updates each element of $x$ in turn.\n\n- Start with an initial approximation, $x^{(0)}$.\n- Hold all but $x_{1}^{(0)}$ constant and solve to find\n $x_{1}^{(1)}=\\frac{1}{a_{11}}(b_{1}-\\sum_{j=2}^{n}a_{1j}x_{j}^{(0)})$.\n- Repeat for the other rows of $A$ (i.e., the other elements of $x$),\n finding $x^{(1)}$.\n- Now iterate to get $x^{(2)}$, $x^{(3)}$, etc. until a convergence\n criterion is achieved, such as $\\|x^{(k)}-x^{(k-1)}\\|\\leq\\epsilon$\n or $\\|r^{(k)}-r^{(k-1)}\\|\\leq\\epsilon$ for $r^{(k)}=b-Ax^{(k)}$.\n\nLet's consider how many operations are involved in a single update:\n$O(n)$ for each element, so $O(n^{2})$ for each update. 
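\n\nHere's a minimal sketch of a single Gauss-Seidel sweep, applied to a toy\ndiagonally dominant system (for illustration only, not a library routine):\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\ndef gauss_seidel_sweep(A, b, x):\n    ## Update each element of x in turn, using elements already\n    ## updated earlier in this same sweep.\n    n = len(b)\n    for i in range(n):\n        x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]\n    return x\n\nA = np.array([[4., 1, 0], [1, 4, 1], [0, 1, 4]])\nb = np.array([1., 2, 3])\nx = np.zeros(3)\nfor it in range(25):\n    x = gauss_seidel_sweep(A, b, x)\n\nnp.allclose(A @ x, b)\n```\n:::\n\n\n\n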
Thus if we can\nstop well before $n$ iterations, we've saved computation relative to\nexact methods.\n\nIf we decompose $A=L+D+U$ where $L$ is strictly lower triangular, $U$ is\nstrictly upper triangular, then Gauss-Seidel is equivalent to solving\n$$(L+D)x^{(k+1)}=b-Ux^{(k)}$$ and we know that solving the lower\ntriangular system is $O(n^{2})$.\n\nIt turns out that the rate of convergence depends on the spectral radius\nof $(L+D)^{-1}U$.\n\nGauss-Seidel amounts to optimizing by moving in axis-oriented\ndirections, so it can be slow in some cases.\n\n#### Conjugate gradient\n\nFor positive definite $A$, conjugate gradient (CG) reexpresses the\nsolution to $Ax=b$ as an optimization problem, minimizing\n$$f(x)=\\frac{1}{2}x^{\\top}Ax-x^{\\top}b,$$ since the derivative of $f(x)$\nis $Ax-b$ and at the minimum this gives $Ax-b=0$.\n\nInstead of finding the minimum by following the gradient at each step\n(so-called steepest descent, which can give slow convergence - we'll see\na demonstration of this in the optimization unit), CG chooses directions\nthat are mutually conjugate w.r.t. $A$, $d_{i}^{\\top}Ad_{j}=0$ for\n$i\\ne j$. The method successively chooses vectors giving the direction,\n$d_{k}$, in which to move down towards the minimum and a scaling of how\nmuch to move, $\\alpha_{k}$. If we start at $x_{(0)}$, the $k$th point we\nmove to is $x_{(k)}=x_{(k-1)}+\\alpha_{k}d_{k}$ so we have\n$$x_{(k)}=x_{(0)}+\\sum_{j\\leq k}\\alpha_{j}d_{j}$$ and we use a\nconvergence criterion such as given above for Gauss-Seidel. The\ndirections are chosen to be the residuals, $b-Ax_{(k)}$. Here's the\nbasic algorithm:\n\n- Choose $x_{(0)}$ and define the residual, $r_{(0)}=b-Ax_{(0)}$ (the\n error on the scale of $b$) and the direction, $d_{0}=r_{(0)}$ and\n set $k=0$.\n\n- Then iterate:\n\n - $\\alpha_{k}=\\frac{r_{(k)}^{\\top}r_{(k)}}{d_{k}^{\\top}Ad_{k}}$\n (choose step size so next error will be orthogonal to current\n direction - which we can express in terms of the residual, which\n is easily computable)\n\n - $x_{(k+1)}=x_{(k)}+\\alpha_{k}d_{k}$ (update current value)\n\n - $r_{(k+1)}=r_{(k)}-\\alpha_{k}Ad_{k}$ (update current residual)\n\n - $d_{k+1}=r_{(k+1)}+\\frac{r_{(k+1)}^{\\top}r_{(k+1)}}{r_{(k)}^{\\top}r_{(k)}}d_{k}$\n (choose next direction by conjugate Gram-Schmidt, starting with\n $r_{(k+1)}$ and removing components that are not $A$-orthogonal\n to previous directions, but it turns out that $r_{(k+1)}$ is\n already $A$-orthogonal to all but $d_{k}$).\n\n- Stop when $\\|r^{(k+1)}\\|$ is sufficiently small.\n\nThe convergence of the algorithm depends in a complicated way on the\neigenvalues, but in general convergence is faster when the condition\nnumber is smaller (the eigenvalues are not too spread out). 
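\n\nHere's a sketch of the algorithm above in plain NumPy (for illustration\nonly; in practice one would use a library implementation such as\n`scipy.sparse.linalg.cg`, particularly for large sparse systems):\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\ndef cg(A, b, x, tol = 1e-10, maxiter = 1000):\n    ## Conjugate gradient for symmetric positive definite A.\n    r = b - A @ x                      # initial residual\n    d = r.copy()                       # initial direction\n    for k in range(maxiter):\n        Ad = A @ d\n        rTr = r @ r\n        alpha = rTr / (d @ Ad)         # step size\n        x = x + alpha * d              # update solution\n        r = r - alpha * Ad             # update residual\n        if np.sqrt(r @ r) < tol:\n            break\n        d = r + ((r @ r) / rTr) * d    # next (A-conjugate) direction\n    return x\n\nA = np.array([[4., 1, 0], [1, 3, 1], [0, 1, 2]])   # symmetric positive definite\nb = np.array([1., 2, 3])\nx = cg(A, b, np.zeros(3))\n\nnp.allclose(A @ x, b)\n```\n:::\n\n\n\n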
CG will in\nprinciple give the exact answer in $n$ steps (where $A$ is $n\\times n$).\nHowever, computationally we lose accuracy and interest in the algorithm\nis really as an iterative approximation where we stop before $n$ steps.\nThe approach basically amounts to moving in axis-oriented directions in\na space stretched by $A$.\n\nIn general, CG is used for large sparse systems.\n\nSee the [extensive description from\nShewchuk](http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf)\nfor more details, as well as the use\nof CG when $A$ is not positive definite.\n\n#### Updating a solution\n\nSometimes we have solved a system, $Ax=b$ and then need to solve $Ax=c$.\nIf we have solved the initial system using a factorization, we can reuse\nthat factorization and solve the new system in $O(n^{2})$. Iterative\napproaches can do a nice job if $c=b+\\delta b$. Start with the solution\n$x$ for $Ax=b$ as $x^{(0)}$ and use one of the methods above.\n", + "markdown": "---\ntitle: \"Numerical linear algebra\"\nauthor: \"Chris Paciorek\"\ndate: \"2023-10-24\"\nformat:\n pdf:\n documentclass: article\n margin-left: 30mm\n margin-right: 30mm\n toc: true\n html:\n theme: cosmo\n css: ../styles.css\n toc: true\n code-copy: true\n code-block-background: true\nexecute:\n freeze: auto\nfrom: markdown+tex_math_single_backslash\n---\n\n\n\n\n[PDF](./unit10-linalg.pdf){.btn .btn-primary}\n\nReferences:\n\n- Gentle: Numerical Linear Algebra for Applications in Statistics\n (available via UC Library Search) (my notes here are based primarily\n on this source) [Gentle-NLA]\n - Gentle: Matrix Algebra also has much of this material.\n- Gentle: Computational Statistics [Gentle-CS]\n- Lange: Numerical Analysis for Statisticians\n- Monahan: Numerical Methods of Statistics\n\nVideos (optional): \n\nThere are various videos from 2020 in the bCourses Media Gallery that you\ncan use for reference if you want to. \n\n - Video 1. Ill-conditioned problems, part 1\n - Video 2. Ill-conditioned problems, part 2\n - Video 3. Triangular systems of equations\n - Video 4. Solving systems of equations via LU, part 1\n - Video 5. Solving systems of equations via LU, part 2\n - Video 6. Solving systems of equations via LU, part 3\n - Video 7. Cholesky decomposition\n\n\nIn working through how to compute something or understanding an\nalgorithm, it can be very helpful to depict the matrices and vectors\ngraphically. We'll see this on the board in class.\n\n# 1. Preliminaries\n\n## Context\n\nMany statistical and machine learning methods involve linear algebra of\nsome sort - at the very least matrix multiplication and very often some\nsort of matrix decomposition to fit models and do analysis: linear\nregression, various more sophisticated forms of regression, deep neural\nnetworks, principle components analysis (PCA) and the wide varieties of\ngeneralizations and variations on PCA, etc., etc.\n\n## Goals\n\nHere's what I'd like you to get out of this unit:\n\n1. How to think about the computational order (number of computations\n involved) of a problem\n2. How to choose a computational approach to a given linear algebra\n calculation you need to do.\n3. 
An understanding of how issues with computer numbers (Unit 8) affect\n linear algebra calculations.\n\n## Key principle\n\n**The form of a mathematical expression and how it should be evaluated\non a computer may be very different.** Better computational approaches\ncan increase speed and improve the numerical properties of the\ncalculation.\n\n- Example 1 (already seen in Unit 5): If $X$ and $Y$ are matrices and $z$\nis a vector, we should compute $X(Yz)$ rather than $(XY)z$; the former\nis much more computationally efficient. \n- Example 2: We do not compute $(X^{\\top}X)^{-1}X^{\\top}Y$ by computing\n$X^{\\top}X$ and finding its inverse. In fact, perhaps more surprisingly,\nwe may never actually form $X^{\\top}X$ in some implementations.\n- Example 3: Suppose I have a matrix $A$, and I want to permute (switch)\ntwo rows. I can do this with a permutation matrix, $P$, which is mostly\nzeroes. On a computer, in general I wouldn't need to even change the\nvalues of $A$ in memory in some cases (e.g., if I were to calculate\n$PAB$). Why not?\n\n## Computational complexity\n\nWe can assess the computational complexity of a linear algebra\ncalculation by counting the number multiplys/divides and the number of\nadds/subtracts. Sidenote: addition is a bit faster than multiplication,\nso some algorithms attempt to trade multiplication for addition.\n\nIn general we do not try to count the actual number of calculations, but\njust their order, though in some cases in this unit we'll actually get a\nmore exact count. In general, we denote this as $O(f(n))$ which means\nthat the number of calculations approaches $cf(n)$ as $n\\to\\infty$\n(i.e., we know the calculation is approximately proportional to $f(n)$).\nConsider matrix multiplication, $AB$, with matrices of size $a\\times b$\nand $b\\times c$. Each column of the second matrix is multiplied by all\nthe rows of the first. For any given inner product of a row by a column,\nwe have $b$ multiplies. We repeat these operations for each column and\nthen for each row, so we have $abc$ multiplies so $O(abc)$ operations.\nWe could count the additions as well, but there's usually an addition\nfor each multiply, so we can usually just count the multiplys and then\nsay there are such and such {multiply and add}s. This is Monahan's\napproach, but you may see other counting approaches where one counts the\nmultiplys and the adds separately.\n\nFor two symmetric, $n\\times n$ matrices, this is $O(n^{3})$. Similarly,\nmatrix factorization (e.g., the Cholesky decomposition) is $O(n^{3})$\nunless the matrix has special structure, such as being sparse. As\nmatrices get large, the speed of calculations decreases drastically\nbecause of the scaling as $n^{3}$ and memory use increases drastically.\nIn terms of memory use, to hold the result of the multiply indicated\nabove, we need to hold $ab+bc+ac$ total elements, which for symmetric\nmatrices sums to $3n^{2}$. So for a matrix with $n=10000$, we have\n$3\\cdot10000^{2}\\cdot8/1e9=2.4$Gb.\n\nWhen we have $O(n^{q})$ this is known as polynomial time. Much worse is\n$O(b^{n})$ (exponential time), while much better is $O(\\log n$) (log\ntime). Computer scientists talk about NP-complete problems; these are\nessentially problems for which there is not a polynomial time\nalgorithm - it turns out all such problems can be rewritten such that\nthey are equivalent to one another.\n\nIn real calculations, it's possible to have the actual time ordering of\ntwo approaches differ from what the order approximations tell us. 
For\nexample, something that involves $n^{2}$ operations may be faster than\none that involves $1000(n\\log n+n)$ even though the former is $O(n^{2})$\nand the latter $O(n\\log n)$. The problem is that the constant, $c=1000$,\ncan matter (depending on how big $n$ is), as can the extra calculations\nfrom the lower order term(s), in this case $1000n$.\n\nA note on terminology: *flops* stands for both floating point operations\n(the number of operations required) and floating point operations per\nsecond, the speed of calculation.\n\n## Notation and dimensions\n\nI'll try to use capital letters for matrices, $A$, and lower-case for\nvectors, $x$. Then $x_{i}$ is the ith element of $x$, $A_{ij}$ is the\n$i$th row, $j$th column element, and $A_{\\cdot j}$ is the $j$th column\nand $A_{i\\cdot}$ the $i$th row. By default, we'll consider a vector,\n$x$, to be a one-column matrix, and $x^{\\top}$ to be a one-row matrix.\nSome of the references given at the start of this Unit also use $a_{ij}$ for $A_{ij}$ and\n$a_{j}$ for the $j$th column.\n\nThroughout, we'll need to be careful that the matrices involved in an\noperation are conformable: for $A+B$ both matrices need to be of the\nsame dimension, while for $AB$ the number of columns of $A$ must match\nthe number of rows of $B$. Note that this allows for $B$ to be a column\nvector, with only one column, $Ab$. Just checking dimensions is a good\nway to catch many errors. Example: is\n$\\mbox{Cov}(Ax)=A\\mbox{Cov}(x)A^{\\top}$ or\n$\\mbox{Cov}(Ax)=A^{\\top}\\mbox{Cov}(x)A$? Well, if $A$ is $m\\times n$, it\nmust be the former, as the latter is not conformable.\n\nThe **inner product** of two vectors is\n$\\sum_{i}x_{i}y_{i}=x^{\\top}y\\equiv\\langle x,y\\rangle\\equiv x\\cdot y$.\n\nThe **outer product** is $xy^{\\top}$, which comes from all pairwise\nproducts of the elements.\n\nWhen the indices of summation should be obvious, I'll sometimes leave\nthem implicit. Ask me if it's not clear.\n\n## Norms\n\nFor a vector, $\\|x\\|_{p}=(\\sum_{i}|x_{i}|^{p})^{1/p}$ and the standard (Euclidean)\nnorm is $\\|x\\|_{2}=\\sqrt{\\sum x_{i}^{2}}=\\sqrt{x^{\\top}x}$, just the\nlength of the vector in Euclidean space, which we'll refer to as\n$\\|x\\|$, unless noted otherwise.\n\nOne commonly used norm for a matrix is\nthe Frobenius norm, $\\|A\\|_{F}=(\\sum_{i,j}a_{ij}^{2})^{1/2}$.\n\nIn this Unit, we'll often make use of the **induced matrix norm**, which is\ndefined relative to a corresponding vector norm, $\\|\\cdot\\|$, as:\n$$\\|A\\|=\\sup_{x\\ne0}\\frac{\\|Ax\\|}{\\|x\\|}$$ So we have\n$$\\|A\\|_{2}=\\sup_{x\\ne0}\\frac{\\|Ax\\|_{2}}{\\|x\\|_{2}}=\\sup_{\\|x\\|_{2}=1}\\|Ax\\|_{2}$$\nIf you're not familiar with the supremum (\"sup\" above), you can just\nthink of it as taking the maximum. In the case of the 2-norm, the norm turns\nout to be the largest singular value in the singular value decomposition (SVD)\nof the matrix.\n\nWe can interpret the norm of a matrix as the most that the matrix\ncan stretch a vector when multiplying by the vector (relative to the\nlength of the vector).\n\nA property of any legitimate matrix norm (including the induced norm) is\nthat $\\|AB\\|\\leq\\|A\\|\\|B\\|$. 
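\n\nAs a quick numerical illustration (a sketch), the induced 2-norm is the\nlargest singular value, and the submultiplicative property above holds:\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\nrng = np.random.default_rng(1)\nA = rng.normal(size = (4, 4))\nB = rng.normal(size = (4, 4))\n\n## induced 2-norm equals the largest singular value\nnp.allclose(np.linalg.norm(A, 2), np.linalg.svd(A)[1][0])\n\n## ||AB|| <= ||A|| ||B||\nnp.linalg.norm(A @ B, 2) <= np.linalg.norm(A, 2) * np.linalg.norm(B, 2)\n```\n:::\n\n\n\n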
Also recall that norms must obey the triangle\ninequality, $\\|A+B\\|\\leq\\|A\\|+\\|B\\|$.\n\nA normalized vector is one with \"length\", i.e., Euclidean norm, of one.\nWe can easily normalize a vector: $\\tilde{x}=x/\\|x\\|$\n\nThe angle between two vectors is\n$$\\theta=\\cos^{-1}\\left(\\frac{\\langle x,y\\rangle}{\\sqrt{\\langle x,x\\rangle\\langle y,y\\rangle}}\\right)$$\n\n## Orthogonality\n\nTwo vectors are orthogonal if $x^{\\top}y=0$, in which case we say\n$x\\perp y$. An **orthogonal matrix** is a square matrix in which all of the\ncolumns are orthogonal to each other and normalized. The same holds for\nthe rows. Orthogonal matrices\ncan be shown to have full rank. Furthermore if $A$ is orthogonal,\n$A^{\\top}A=I$, so $A^{-1}=A^{\\top}$. Given all this, the determinant of\northogonal $A$ is either 1 or -1. Finally the product of two orthogonal\nmatrices, $A$ and $B$, is also orthogonal since\n$(AB)^{\\top}AB=B^{\\top}A^{\\top}AB=B^{\\top}B=I$.\n\n#### Permutations\n\nSometimes we make use of matrices that permute two rows (or two columns)\nof another matrix when multiplied. Such a matrix is known as an\nelementary permutation matrix and is an orthogonal matrix with a\ndeterminant of -1. You can multiply such matrices to get more general\npermutation matrices that are also orthogonal. If you premultiply by\n$P$, you permute rows, and if you postmultiply by $P$ you permute\ncolumns. Note that on a computer, you wouldn't need to actually do the\nmultiply (and if you did, you should use a sparse matrix routine), but\nrather one can often just rework index values that indicate where\nrelevant pieces of the matrix are stored (more in the next section).\n\n## Some vector and matrix properties\n\n$AB\\ne BA$ but $A+B=B+A$ and $A(BC)=(AB)C$.\n\nIn Python, recall the syntax is\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nA + B\n\n# Matrix multiplication\nnp.matmul(A, B) \nA @ B # alternative\nA.dot(B) # not recommended by the NumPy docs\n\nA * B # Hadamard (direct) product\n```\n:::\n\n\n\nYou don't need the spaces, but they're nice for code readability.\n\n## Trace and determinant of square matrices\n\nThe trace of a matrix is the sum of the diagonal elements. For square\nmatrices, $\\mbox{tr}(A+B)=\\mbox{tr}(A)+\\mbox{tr}(B)$,\n$\\mbox{tr}(A)=\\mbox{tr}(A^{\\top})$.\n\nWe also have $\\mbox{tr}(ABC)=\\mbox{tr}(CAB)=\\mbox{tr}(BCA)$ - basically\nyou can move a matrix from the beginning to the end or end to beginning,\nprovided they are conformable for this operation. This is helpful for a\ncouple reasons:\n\n1. We can find the ordering that reduces computation the most if the\n individual matrices are not square.\n2. $x^{\\top}Ax=\\mbox{tr}(x^{\\top}Ax)$ since the quadratic form,\n $x^{\\top}Ax$, is a scalar, and this is equal to\n $\\mbox{tr}(xx^{\\top}A)$ where $xx^{\\top}A$ is a matrix. It can be\n helpful to be able to go back and forth between a scalar and a trace\n in some statistical calculations.\n\nFor square matrices, the determinant exists and we have $|AB|=|A||B|$\nand therefore, $|A^{-1}|=1/|A|$ since $|I|=|AA^{-1}|=1$. Also\n$|A|=|A^{\\top}|$, which can be seen using the QR decomposition for $A$\nand understanding properties of determinants of triangular matrices (in\nthis case $R$) and orthogonal matrices (in this case $Q$).\n\n## Transposes and inverses\n\nFor square, invertible matrices, we have that\n$(A^{-1})^{\\top}=(A^{\\top})^{-1}$. Why? 
Since we have\n$(AB)^{\\top}=B^{\\top}A^{\\top}$, we have:\n$$A^{\\top}(A^{-1})^{\\top}=(A^{-1}A)^{\\top}=I$$ so\n$(A^{\\top})^{-1}=(A^{-1})^{\\top}$.\n\nFor two invertible matrices, we have that\n$(AB)^{-1} = B^{-1}A^{-1}$ since\n$B^{-1}A^{-1} AB = I$.\n\n#### Other matrix multiplications\n\nThe Hadamard or direct product is simply multiplication of the\ncorrespoding elements of two matrices by each other. In R this is\nsimply` A * B`.\\\n**Challenge**: How can I find $\\mbox{tr}(AB)$ without using `A %*% B` ?\n\nThe Kronecker product is the product of each element of one matrix with\nthe entire other matrix\"\n\n$$A\\otimes B=\\left(\\begin{array}{ccc}\nA_{11}B & \\cdots & A_{1m}B\\\\\n\\vdots & \\ddots & \\vdots\\\\\nA_{n1}B & \\cdots & A_{nm}B\n\\end{array}\\right)$$\n\nThe inverse of a Kronecker product is the Kronecker product of the\ninverses,\n\n$$ B^{-1} \\otimes A^{-1} $$ \n\nwhich is obviously quite a bit faster because\nthe inverse (i.e., solving a system of equations) in this special case\nis $O(n^{3}+m^{3})$ rather than the naive approach being $O((nm)^{3})$.\n\n## Matrix decompositions\n\nA matrix decomposition is a re-expression of a matrix, $A$, in terms of\na product of two or three other, simpler matrices, where the\ndecomposition reveals structure or relationships present in the original\nmatrix, $A$. The \"simpler\" matrices may be simpler in various ways,\nincluding\n\n- having fewer rows or columns;\n- being diagonal, triangular or sparse in some way,\n- being orthogonal matrices.\n\nIn addition, once you have a decomposition, computation is generally\neasier, because of the special structure of the simpler matrices.\n\nWe'll see this in great detail in Section 3.\n\n# 2. Statistical interpretations of matrix invertibility, rank, etc.\n\n## Linear independence, rank, and basis vectors\n\nA set of vectors, $v_{1},\\ldots v_{n}$, is linearly independent (LIN)\nwhen none of the vectors can be represented as a linear combination,\n$\\sum c_{i}v_{i}$, of the others for scalars, $c_{1},\\ldots,c_{n}$. If\nwe have vectors of length $n$, we can have at most $n$ linearly\nindependent vectors. The rank of a matrix is the number of linearly\nindependent rows (or columns - it's the same), and is at most the\nminimum of the number of rows and number of columns. We'll generally\nthink about it in terms of the dimension of the column space - so we can\njust think about the number of linearly independent columns.\n\nAny set of linearly independent vectors (say $v_{1},\\ldots,v_{n}$) span\na space made up of all linear combinations of those vectors\n($\\sum_{i=1}^{n}c_{i}v_{i}$). The spanning vectors are known as basis\nvectors. We can express a vector $y$ that is in the space with respect\nto (as a linear combination of) basis vectors as $y=\\sum_{i}c_{i}v_{i}$,\nwhere if the basis vectors are normalized and orthogonal, we can find\nthe weights as $c_{i}=\\langle y,v_{i}\\rangle$.\n\nConsider a regression context. We have $p$ covariates ($p$ columns in\nthe design matrix, $X$), of which $q\\leq p$ are linearly independent\ncovariates. This means that $p-q$ of the vectors can be written as\nlinear combos of the $q$ vectors. The space spanned by the covariate\nvectors is of dimension $q$, rather than $p$, and $X^{\\top}X$ has $p-q$\neigenvalues that are zero. The $q$ LIN vectors are basis vectors for the\nspace - we can represent any point in the space as a linear combination\nof the basis vectors. 
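\n\nHere's a small numerical sketch of that situation, using a toy design\nmatrix in which one column is redundant: the rank drops below the number\nof columns and $X^{\\top}X$ picks up a zero eigenvalue.\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\nrng = np.random.default_rng(1)\nx1 = rng.normal(size = 10)\nx2 = rng.normal(size = 10)\nX = np.column_stack((np.ones(10), x1, x2, x1 + x2))  # last column is redundant\n\nnp.linalg.matrix_rank(X)       # 3, not 4\nnp.linalg.eigvalsh(X.T @ X)    # smallest eigenvalue is numerically zero\n```\n:::\n\n\n\n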
You can think of the basis vectors as being like\nthe axes of the space, except that the basis vectors are not orthogonal.\nSo it's like denoting a point in $\\Re^{q}$ as a set of $q$ numbers\ntelling us where on each of the axes we are - this is the same as a\nlinear combination of axis-oriented vectors.\n\nWhen fitting a regression, if $n=p=q$, a vector of $n$ observations can\nbe represented exactly as a linear combination of the $p$ basis vectors,\nso there is no residual and we have a single unique (and exact) solution\n(e.g., with $n=p=2$, the observations fall exactly on the simple linear\nregression line). If $np$, so the system is overdetermined - there is no exact\nsolution, but regression is all about finding solutions that minimize\nsome criterion about the differences between the observations and linear\ncombinations of the columns of the $X$ matrix (such as least squares or\npenalized least squares). In standard regression, we project the\nobservation vector onto the space spanned by the columns of the $X$\nmatrix, so we find the point in the space closest to the observation\nvector.\n\n## Invertibility, singularity, rank, and positive definiteness\n\nFor square matrices, let's consider how invertibility, singularity, rank\nand positive (or non-negative) definiteness relate.\n\nSquare matrices that are \"regular\" have an eigendecomposition,\n$A=\\Gamma\\Lambda\\Gamma^{-1}$ where $\\Gamma$ is a matrix with the\neigenvectors as the columns and $\\Lambda$ is a diagonal matrix of\neigenvalues, $\\Lambda_{ii}=\\lambda_{i}$. Symmetric matrices and matrices\nwith unique eigenvalues are regular, as are some other matrices. The\nnumber of non-zero eigenvalues is the same as the rank of the matrix.\nSquare matrices that have an inverse are also called nonsingular, and\nthis is equivalent to having full rank. If the matrix is symmetric, the\neigenvectors and eigenvalues are real and $\\Gamma$ is orthogonal, so we\nhave $A=\\Gamma\\Lambda\\Gamma^{\\top}$. The determinant of the matrix is\nthe product of the eigenvalues (why?), which is zero if it is less than\nfull rank. Note that if none of the eigenvalues are zero then\n$A^{-1}=\\Gamma\\Lambda^{-1}\\Gamma^{\\top}$.\n\nLet's focus on symmetric matrices. The symmetric matrices that tend to\narise in statistics are either positive definite (p.d.) or non-negative\ndefinite (n.n.d.). If a matrix is positive definite, then by definition\n$x^{\\top}Ax>0$ for any $x$. Note that if $\\mbox{Cov}(y)=A$ then\n$x^{\\top}Ax=x^{\\top}\\mbox{Cov}(y)x=\\mbox{Cov}(x^{\\top}y)=\\mbox{Var}(x^{\\top}y)$\nif so positive definiteness amounts to having linear combinations of\nrandom variables (with the elements of $x$ here being the weights)\nhaving positive variance. So we must have that positive definite\nmatrices are equivalent to variance-covariance matrices (I'll just refer\nto this as a variance matrix or as a covariance matrix). If $A$ is p.d.\nthen it has all positive eigenvalues and it must have an inverse, though\nas we'll see, from a numerical perspective, we may not be able to\ncompute it if some of the eigenvalues are very close to zero. 
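\n\nA common numerical check of positive definiteness (just a sketch) is to\nattempt a Cholesky factorization, which exists only for positive definite\nmatrices; `np.linalg.cholesky` raises an error otherwise.\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\nA = np.array([[3., 1], [1, 2]])   # positive definite\nB = np.array([[1., 2], [2, 1]])   # indefinite (eigenvalues 3 and -1)\n\nnp.linalg.cholesky(A)             # succeeds\ntry:\n    np.linalg.cholesky(B)\nexcept np.linalg.LinAlgError:\n    print(\"B is not positive definite\")\n```\n:::\n\n\n\n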
In Python,\n`numpy.linalg.eig(A)[1]` is $\\Gamma$, with each column a vector, and\n`numpy.linalg.eig(A)[0]` contains the (unordered) eigenvalues.\n\nTo summarize, here are some of the various connections between\nmathematical and statistical properties of **positive definite**\nmatrices:\n\n$A$ positive definite $\\Leftrightarrow$ $A$ is a covariance matrix\n$\\Leftrightarrow$ $x^{\\top}Ax>0$ $\\Leftrightarrow$ $\\lambda_{i}>0$\n(positive eigenvalues) $\\Rightarrow$$|A|>0$ $\\Rightarrow$$A$ is\ninvertible $\\Leftrightarrow$ $A$ is non singular $\\Leftrightarrow$ $A$ is\nfull rank.\n\nAnd here are connections for positive semi-definite matrices:\n\n$A$ positive semi-definite $\\Leftrightarrow$ $A$ is a constrained\ncovariance matrix $\\Leftrightarrow$ $x^{\\top}Ax\\geq0$ and equal to 0 for\nsome $x$ $\\Leftrightarrow$ $\\lambda_{i}\\geq 0$ (non-negative eigenvalues),\nwith at least one zero $\\Rightarrow$ $|A|=0$ $\\Leftrightarrow$ $A$ is not\ninvertible $\\Leftrightarrow$ $A$ is singular $\\Leftrightarrow$ $A$ is not\nfull rank.\n\n## Interpreting an eigendecomposition\n\nLet's interpret the eigendecomposition in a generative context as a way\nof generating random vectors. We can generate $y$ s.t. $\\mbox{Cov}(y)=A$\nif we generate $y=\\Gamma\\Lambda^{1/2}z$ where $\\mbox{Cov}(z)=I$ and\n$\\Lambda^{1/2}$ is formed by taking the square roots of the eigenvalues.\nSo $\\sqrt{\\lambda_{i}}$ is the standard deviation associated with the\nbasis vector $\\Gamma_{\\cdot i}$. That is, the $z$'s provide the weights\non the basis vectors, with scaling based on the eigenvalues. So $y$ is\nproduced as a linear combination of eigenvectors as basis vectors, with\nthe variance attributable to the basis vectors determined by the\neigenvalues.\n\nTo go the other direction, we can project a vector $y$ onto the space \nspanned by the eigenvectors: $w = (\\Gamma^{\\top}\\Gamma)^{-1}\\Gamma^{\\top}y = \\Gamma^{\\top}y = \\Lambda^{1/2}z$,\nwhere the simplification of course comes from $\\Gamma$ being orthogonal.\n\nIf $x^{\\top}Ax\\geq0$ then $A$ is nonnegative definite (also called\npositive semi-definite). In this case one or more eigenvalues can be\nzero. Let's interpret this a bit more in the context of generating\nrandom vectors based on non-negative definite matrices,\n$y=\\Gamma\\Lambda^{1/2}z$ where $\\mbox{Cov}(z)=I$. Questions:\n\n1. What does it mean when one or more eigenvalue (i.e.,\n $\\lambda_{i}=\\Lambda_{ii}$) is zero?\n\n2. Suppose I have an eigenvalue that is very small and I set it to\n zero? What will be the impact upon $y$ and $\\mbox{Cov}(y)$?\n\n3. Now let's consider the inverse of a covariance matrix, known as the\n precision matrix, $A^{-1}=\\Gamma\\Lambda^{-1}\\Gamma^{\\top}$. What\n does it mean if a $(\\Lambda^{-1})_{ii}$ is very large? What if\n $(\\Lambda^{-1})_{ii}$ is very small?\n\nConsider an arbitrary $n\\times p$ matrix, $X$. Any crossproduct or sum\nof squares matrix, such as $X^{\\top}X$ is positive definite\n(non-negative definite if $p>n$). This makes sense as it's just a\nscaling of an empirical covariance matrix.\n\n## Generalized inverses (optional)\n\nSuppose I want to find $x$ such that $Ax=b$. Mathematically the answer\n(provided $A$ is invertible, i.e. of full rank) is $x=A^{-1}b$.\n\nGeneralized inverses arise in solving equations when $A$ is not full\nrank. A generalized inverse is a matrix, $A^{-}$ s.t. $AA^{-}A=A$. 
The\nMoore-Penrose inverse (the pseudo-inverse), $A^{+}$, is a (unique)\ngeneralized inverse that also satisfies some additional properties.\n$x=A^{+}b$ is the solution to the linear system, $Ax=b$, that has the\nshortest length for $x$.\n\nWe can find the pseudo-inverse based on an eigendecomposition (or an\nSVD) as $\\Gamma\\Lambda^{+}\\Gamma^{\\top}$. We obtain $\\Lambda^{+}$ from\n$\\Lambda$ as follows. For values $\\lambda_{i}>0$, compute\n$1/\\lambda_{i}$. All other values are set to 0. Let's interpret this\nstatistically. Suppose we have a precision matrix with one or more zero\neigenvalues and we want to find the covariance matrix. A zero eigenvalue\nmeans we have no precision, or infinite variance, for some linear\ncombination (i.e., for some basis vector). We take the pseudo-inverse\nand assign that linear combination zero variance.\n\nLet's consider a specific example. Autoregressive models are often used\nfor smoothing (in time, in space, and in covariates). A first order\nautoregressive model for $y_{1},y_{2},\\ldots,y_{T}$ has\n$E(y_{i}|y_{-i})=\\frac{1}{2}(y_{i-1}+y_{i+1})$. Another way of writing\nthe model is in time-order: $y_{i}=y_{i-1}+\\epsilon_{i}$. A second order\nautoregressive model has\n$E(y_{i}|y_{-i})=\\frac{1}{6}(4y_{i-1}+4y_{i+1}-y_{i-2}-y_{i+2})$. These\nconstructions basically state that each value should be a smoothed\nversion of its neighbors. One can figure out that the **precision**\nmatrix for $y$ in the first order model is $$\\left(\\begin{array}{ccccc}\n\\ddots & & \\vdots\\\\\n-1 & 2 & -1 & 0\\\\\n\\cdots & -1 & 2 & -1 & \\dots\\\\\n & 0 & -1 & 2 & -1\\\\\n & & \\vdots & & \\ddots\n\\end{array}\\right)$$ and in the second order model is\n\n$$\\left( \\begin{array}{ccccccc} \\ddots & & & \\vdots \\\\ 1 & -4 & 6 & -4 & 1 \\\\ \\cdots & 1 & -4 & 6 & -4 & 1 & \\cdots \\\\ & & 1 & -4 & 6 & -4 & 1 \\\\ & & & \\vdots \\end{array} \\right).$$ \n\nIf we look at the eigendecomposition of such\nmatrices, we see that in the first order case, the eigenvalue\ncorresponding to the constant eigenvector is zero.\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\nprecMat = np.array([[1,-1,0,0,0],[-1,2,-1,0,0],[0,-1,2,-1,0],[0,0,-1,2,-1],[0,0,0,-1,1]])\ne = np.linalg.eig(precMat)\ne[0] # 4th eigenvalue is numerically zero\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([3.61803399e+00, 2.61803399e+00, 1.38196601e+00, 4.97762256e-17,\n 3.81966011e-01])\n```\n:::\n\n```{.python .cell-code}\ne[1][:,3] # constant eigenvector\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([0.4472136, 0.4472136, 0.4472136, 0.4472136, 0.4472136])\n```\n:::\n:::\n\n\n\nThis means we have no information about the overall level of $y$. So how\nwould we generate sample $y$ vectors? We can't put infinite variance on\nthe constant basis vector and still generate samples. 
Instead we use the\npseudo-inverse and assign ZERO variance to the constant basis vector.\nThis corresponds to generating realizations under the constraint that\n$\\sum y_{i}$ has no variation, i.e., $\\sum y_{i}=\\bar{y}=0$ - you can\nsee this by seeing that $\\mbox{Var}(\\Gamma_{\\cdot i}^{\\top}y)=0$ when\n$\\lambda_{i}=0$.\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# generate a realization\nevals = e[0]\nevals = 1/evals # variances\nevals[3] = 0 # generalized inverse\ny = e[1] @ ((evals ** 0.5) * np.random.normal(size = 5))\ny.sum()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n4.440892098500626e-16\n```\n:::\n:::\n\n\n\nIn the second order case, we have two non-identifiabilities: for the sum\nand for the linear component of the variation in $y$ (linear in the\nindices of $y$).\n\nI could parameterize a statistical model as $\\mu+y$ where $y$ has\ncovariance that is the generalized inverse discussed above. Then I allow\nfor both a non-zero mean and for smooth variation governed by the\nautoregressive structure. In the second-order case, I would need to add\na linear component as well, given the second non-identifiability.\n\n## Matrices arising in regression\n\nIn regression, we work with $X^{\\top}X$. Some properties of this matrix\nare that it is symmetric and non-negative definite (hence our use of\n$(X^{\\top}X)^{-1}$ in the OLS estimator). When is it not positive\ndefinite?\n\nFitted values are $X\\hat{\\beta}=X(X^{\\top}X)^{-1}X^{\\top}Y=HY$. The\n\"hat\" matrix, $H$, projects $Y$ into the column space of $X$. $H$ is\nidempotent: $HH=H$, which makes sense - once you've projected into the\nspace, any subsequent projection just gives you the same thing back. $H$\nis singular. Why? Also, under what special circumstance would it not be\nsingular?\n\n# 3. Computational issues\n\n## Storing matrices\n\nWe've discussed column-major and row-major storage of matrices. First,\nretrieval of matrix elements from memory is quickest when multiple\nelements are contiguous in memory. So in a column-major language (e.g.,\nR, Fortran), it is best to work with values in a common column (or\nentire columns) while in a row-major language (e.g., Python, C) for\nvalues in a common row.\n\nIn some cases, one can save space (and potentially speed) by overwriting\nthe output from a matrix calculation into the space occupied by an\ninput. This occurs in some clever implementations of matrix\nfactorizations.\n\n## Algorithms\n\nGood algorithms can change the efficiency of an algorithm by one or more\norders of magnitude, and many of the improvements in computational speed\nover recent decades have been in algorithms rather than in computer\nspeed.\n\nMost matrix algebra calculations can be done in multiple ways. For\nexample, we could compute $b=Ax$ in either of the following ways,\ndenoted here in pseudocode.\n\n1. Stack the inner products of the rows of $A$ with $x$.\\\n\n```\n for(i=1:n){ \n b_i = 0\n for(j=1:m){\n b_i = b_i + a_{ij} x_j\n }\n }\n```\n\n2. Take the linear combination (based on $x$) of the columns of $A$\\\n\n```\n for(i=1:n){ \n b_i = 0\n }\n for(j=1:m){\n for(i = 1:n){\n b_i = b_i + a_{ij} x_j \n }\n }\n```\n\nIn this case the two approaches involve the same number of operations\nbut the first might be better for row-major matrices (so might be how we\nwould implement in C) and the second for column-major (so might be how\nwe would implement in Fortran). \n\n**Challenge**: check whether the first\napproach is faster in Python. 
(Write the code just doing the outer loop and\ndoing the inner loop using vectorized calculation.)\n\n#### General computational issues\n\nThe same caveats we discussed in terms of computer arithmetic hold\nnaturally for linear algebra, since this involves arithmetic with many\nelements. Good implementations of algorithms are aware of the danger of\ncatastrophic cancellation and of the possibility of dividing by zero or\nby values that are near zero.\n\n## Ill-conditioned problems\n\n#### Basics\n\nA problem is ill-conditioned if small changes to values in the\ncomputation result in large changes in the result. This is quantified by\nsomething called the *condition number* of a calculation. For different\noperations there are different condition numbers.\n\nIll-conditionedness arises most often in terms of matrix inversion, so\nthe standard condition number is the \"condition number with respect to\ninversion\", which when using the $L_{2}$ norm is the ratio of the\nabsolute values of the largest to smallest eigenvalue. Here's an\nexample: $$A=\\left(\\begin{array}{cccc}\n10 & 7 & 8 & 7\\\\\n7 & 5 & 6 & 5\\\\\n8 & 6 & 10 & 9\\\\\n7 & 5 & 9 & 10\n\\end{array}\\right).$$ The solution of $Ax=b$ for $b=(32,23,33,31)$ is\n$x=(1,1,1,1)$, while the solution for $b+\\delta b=(32.1,22.9,33.1,30.9)$\nis $x+\\delta x=(9.2,-12.6,4.5,-1.1)$, where $\\delta$ is notation for a\nperturbation to the vector or matrix.\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ndef norm2(x):\n return(np.sum(x**2) ** 0.5)\n\nA = np.array([[10,7,8,7],[7,5,6,5],[8,6,10,9],[7,5,9,10]])\nb = np.array([32,23,33,31])\nx = np.linalg.solve(A, b)\n\nbPerturbed = np.array([32.1, 22.9, 33.1, 30.9])\nxPerturbed = np.linalg.solve(A, bPerturbed)\n\ndelta_b = bPerturbed - b\ndelta_x = xPerturbed - x\n```\n:::\n\n\n\n\nWhat's going on? Some manipulations with inequalities involving the\ninduced matrix norm (for any chosen vector norm, but we might as well\njust think about the Euclidean norm) (see Gentle-CS Sec. 5.1 or the derivation in class) give\n$$\\frac{\\|\\delta x\\|}{\\|x\\|}\\leq\\|A\\|\\|A^{-1}\\|\\frac{\\|\\delta b\\|}{\\|b\\|}$$\nwhere we define the condition number w.r.t. inversion as\n$\\mbox{cond}(A)\\equiv\\|A\\|\\|A^{-1}\\|$. We'll generally work with the\n$L_{2}$ norm, and for a nonsingular square matrix the result is that the\ncondition number is the ratio of the absolute values of the largest and\nsmallest magnitude eigenvalues. 
This makes sense since $\\|A\\|_{2}$ is\nthe absolute value of the largest magnitude eigenvalue of $A$ and\n$\\|A^{-1}\\|_{2}$ that of the inverse of the absolute value of the\nsmallest magnitude eigenvalue of $A$.\n\nWe see in the code below that the large disparity in eigenvalues of $A$\nleads to an effect predictable from our inequality above, with the\ncondition number helping us find an upper bound.\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ne = np.linalg.eig(A)\nevals = e[0]\nprint(evals)\n\n## relative perturbation in x much bigger than in b\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[3.02886853e+01 3.85805746e+00 1.01500484e-02 8.43107150e-01]\n```\n:::\n\n```{.python .cell-code}\nnorm2(delta_x) / norm2(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n8.19847546803699\n```\n:::\n\n```{.python .cell-code}\nnorm2(delta_b) / norm2(b)\n\n## ratio of relative perturbations\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n0.0033319453118976702\n```\n:::\n\n```{.python .cell-code}\n(norm2(delta_x) / norm2(x)) / (norm2(delta_b) / norm2(b))\n\n## ratio of largest and smallest magnitude eigenvalues\n## confusingly evals[2] is the smallest, not evals[3]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n2460.567236431514\n```\n:::\n\n```{.python .cell-code}\n(evals[0]/evals[2])\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n2984.092701676269\n```\n:::\n:::\n\n\n\nThe main use of these ideas for our purposes is in thinking about the\nnumerical accuracy of a linear system solution (Gentle-NLA Sec 3.4). On\na computer we have the system $$(A+\\delta A)(x+\\delta x)=b+\\delta b$$\nwhere the 'perturbation' is from the inaccuracy of computer numbers. Our\nexploration of computer numbers tells us that\n$$\\frac{\\|\\delta b\\|}{\\|b\\|}\\approx10^{-p};\\,\\,\\,\\frac{\\|\\delta A\\|}{\\|A\\|}\\approx10^{-p}$$\nwhere $p=16$ for standard double precision floating points. Following\nGentle, one gets the approximation\n\n$$\\frac{\\|\\delta x\\|}{\\|x\\|}\\approx\\mbox{cond}(A)10^{-p},$$ so if\n$\\mbox{cond}(A)\\approx10^{t}$, we have accuracy of order $10^{t-p}$\ninstead of $10^{-p}$. (Gentle cautions that this holds only if\n$10^{t-p}\\ll1$). So we can think of the condition number as giving us\nthe number of digits of accuracy lost during a computation relative to\nthe precision of numbers on the computer. E.g., a condition number of\n$10^{8}$ means we lose 8 digits of accuracy relative to our original 16\non standard systems. One issue is that estimating the condition number\nis itself subject to numerical error and requires computation of\n$A^{-1}$ (albeit not in the case of $L_{2}$ norm with square,\nnonsingular $A$) but see Golub and van Loan (1996; p. 76-78) for an\nalgorithm.\n\n#### Improving conditioning\n\nIll-conditioned problems in statistics often arise from collinearity of\nregressors. Often the best solution is not a numerical one, but\nre-thinking the modeling approach, as this generally indicates\nstatistical issues beyond just the numerical difficulties.\n\nA general comment on improving conditioning is that we want to avoid\nlarge differences in the magnitudes of numbers involved in a\ncalculation. In some contexts such as regression, we can center and\nscale the columns to avoid such differences - this will improve the\ncondition of the problem. 
E.g., in simple quadratic regression with\n$x=\\{1990,\\ldots,2010\\}$ (e.g., regressing on calendar years), we see\nthat centering and scaling the matrix columns makes a huge difference on\nthe condition number\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport statsmodels.api as sm\n\nt1 = np.arange(1990, 2011) # naive covariate\nX1 = np.column_stack((np.ones(21), t1, t1 ** 2))\n\nbeta = np.array([5, .1, .0001])\ny = X1 @ beta + np.random.normal(size = len(t1))\n\ne1 = np.linalg.eig(np.dot(X1.T, X1))\nnp.sort(e1[0])[::-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([3.36018564e+14, 7.69949736e+02, 2.24079720e-08])\n```\n:::\n\n```{.python .cell-code}\nnp.linalg.cond(np.dot(X1.T, X1)) # built-in!\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n4.653329826789808e+21\n```\n:::\n\n```{.python .cell-code}\nsm.OLS(y, X1).fit().params\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([-1.02879965e+03, 1.16747888e+00, -1.75251012e-04])\n```\n:::\n\n```{.python .cell-code}\nt2 = t1 - 2000 # centered\nX2 = np.column_stack((np.ones(21), t2, t2 ** 2))\ne2 = np.linalg.eig(np.dot(X2.T, X2))\nwith np.printoptions(suppress=True):\n np.sort(e2[0])[::-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([50677.70427505, 770. , 9.29572495])\n```\n:::\n\n```{.python .cell-code}\nt3 = t2/10 # centered and scaled\nX3 = np.column_stack((np.ones(21), t3, t3 ** 2))\ne3 = np.linalg.eig(np.dot(X3.T, X3))\nwith np.printoptions(suppress=True):\n np.sort(e3[0])[::-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([24.11293487, 7.7 , 1.95366513])\n```\n:::\n:::\n\n\n\nI haven't shown the OLS results for the transformed version\nof the problem because the transformations modify the true\nparameter values, so there is a bit of arithmetic to be done,\nbut you should be able to verify that the second and third\napproaches give reasonable answers.\n\n\nThe basic story is that simple strategies often solve the problem, and\nthat you should be aware of the absolute and relative magnitudes\ninvolved in your calculations.\n\nOne rule of thumb is to try to work with numbers whose magnitude is\naround 1. We can often scale the values in our problem in order to do\nthis. I.e., change the units of your variables. Instead of personal\nincome in dollars, use personal income in thousands or hundreds of\nthousands of dollars.\n\n# 4. Matrix factorizations (decompositions) and solving systems of linear equations\n\nSuppose we want to solve the following linear system: \n\n$$\\begin{aligned} Ax & = & b\\\\ x & = & A^{-1}b \\end{aligned}$$ \n\n\nNumerically, this is never done by\nfinding the inverse and multiplying. Rather we solve the system using a\nmatrix decomposition (or equivalent set of steps). One approach uses\nGaussian elimination (equivalent to the LU decomposition), while another\nuses the Cholesky decomposition. There are also iterative methods that\ngenerate a sequence of approximations to the solution but reduce\ncomputation (provided they are stopped before the exact solution is\nfound).\n\nGentle-CS has a nice table overviewing the various factorizations (Table\n5.1, page 219). 
I've reproduced a variation on it here.\n\n\n\n| Name | Representation | Restrictions | Properties | Uses |\n| ----------| ----------------------| ------------| --------------------|-------------|\n| LU | $A_{nn}= L_{nn}U_{nn}$ | $A$ generally square | $L$ lower triangular; $U$ upper triangular | solving equations; inversion | \n| QR | $A_{nm}= Q_{nn}R_{nm}$ or $A_{nm}=Q_{nm}R_{mm}$(skinny) | | $Q$ orthogonal; $R$ upper triangular | regression |\n| Cholesky | $A_{nn}=U_{nn}^{\\top}U_{nn}$ | $A$ positive (semi-) definite | $U$ upper triangular | multivariate normal; covariance; solving equations; inversion |\n| Eigen decomposition | $A_{nn}=\\Gamma_{nn}\\Lambda_{nn}\\Gamma_{nn}^{\\top}$ | $A$ square, symmetric*| $\\Gamma$ orthogonal; $\\Lambda$ (non-negative**) diagonal | principal components analysis and related | \n| SVD | $A_{nm}=U_{nn}D_{nm}V_{mm}^{\\top}$ or $A_{nm}= U_{nk}D_{kk}V_{mk}^{\\top}$ | | $U, V$ orthogonal; $D$ (non-negative) diagonal | machine learning, topic models |\n\nTable: Matrix factorizations useful for statistics / data science / machine learning\n\n*For the eigen decomposition, I assume $A$ is symmetric, though there\nis a decomposition for non-symmetric $A$.\n\n** For positive definite or positive semi-definite $A$.\n\n## Triangular systems\n\nAs a preface, let's figure out how to solve $Ax=b$ if $A$ is upper\ntriangular. The basic algorithm proceeds from the bottom up (and\ntherefore is called a 'backsolve'. We solve for $x_{n}$ trivially, and\nthen move upwards plugging in the known values of $x$ and solving for\nthe remaining unknown in each row (each equation).\n\n1. $x_{n}=b_{n}/A_{nn}$\n2. Now for $kp$ then the leading $p$ rows of $R$ provide an upper triangular\nmatrix ($R_{1}$) and the remaining rows are 0. (I'm using $p$ because\nthe QR is generally applied to design matrices in regression). In this\ncase we really only need the first $p$ columns of $Q$, and we have\n$X=Q_{1}R_{1}$, the 'skinny' QR (this is what R's QR provides). For\nuniqueness, we can require the diagonals of $R$ to be nonnegative, and\nthen $R$ will be the same as the upper-triangular Cholesky factor of\n$X^{\\top}X$: \n\n$$\\begin{aligned} X^{\\top}X & = & R^{\\top}Q^{\\top}QR \\\\ & = & R^{\\top}R\\end{aligned}.$$ \n\n\nThere are three standard approaches for\ncomputing the QR, using (1) reflections (Householder transformations),\n(2) rotations (Givens transformations), or (3) Gram-Schmidt\northogonalization (see below for details).\n\nFor $n\\times n$ $X$, the QR (for the Householder approach) requires\n$2n^{3}/3$ flops, so QR is less efficient than LU or Cholesky.\n\nWe can also obtain the pseudo-inverse of $X$ from the QR:\n$X^{+}=[R_{1}^{-1}\\,0]Q^{\\top}$. In the case that $X$ is not full-rank,\nthere is a version of the QR that will work (involving pivoting) and we\nend up with some additional zeroes on the diagonal of $R_{1}$.\n\n### Regression and the QR\n\nOften QR is used to fit linear models, including in R. Consider the\nlinear model in the form $Y=X\\beta+\\epsilon$, finding\n$\\hat{\\beta}=(X^{\\top}X)^{-1}X^{\\top}Y$. Let's consider the skinny QR\nand note that $R^{\\top}$ is invertible. Therefore, we can express the\nnormal equations as \n\n$$\n\\begin{aligned} \nX^{\\top}X\\beta & = & X^{\\top} Y \\\\\nR^{\\top}Q^{\\top}QR\\beta & = & R^{\\top}Q^{\\top} Y \\\\\nR \\beta & = & Q^{\\top} Y\n\\end{aligned}\n$$ \n\n\nand solving for $\\beta$ is just a\nbacksolve since $R$ is upper-triangular. 
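\n\nHere's a sketch of those steps (using `scipy.linalg.solve_triangular` for\nthe backsolve, with a small simulated $X$ and $Y$ just for illustration):\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\nfrom scipy.linalg import solve_triangular\n\nrng = np.random.default_rng(1)\nn = 100\nX = np.column_stack((np.ones(n), rng.normal(size = (n, 2))))\nY = X @ np.array([1., 2., -1.]) + rng.normal(size = n)\n\nQ, R = np.linalg.qr(X)                  # skinny QR\nbeta = solve_triangular(R, Q.T @ Y)     # backsolve R beta = Q'Y\n\nnp.allclose(beta, np.linalg.lstsq(X, Y, rcond = None)[0])\n```\n:::\n\n\n\n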
Furthermore the standard\nregression quantities, such as the hat matrix, the SSE, the residuals,\netc. can be easily expressed in terms of $Q$ and $R$.\n\nWhy use the QR instead of the Cholesky on $X^{\\top}X$? The condition\nnumber of $X$ is the square root of that of $X^{\\top}X$, and the $QR$\nfactorizes $X$. Monahan has a discussion of the condition of the\nregression problem, but from a larger perspective, the situations where\nnumerical accuracy is a concern are generally cases where the OLS\nestimators are not particularly helpful anyway (e.g., highly collinear\npredictors).\n\nWhat about computational order of the different approaches to least\nsquares? The Cholesky is $np^{2}+\\frac{1}{3}p^{3}$, an algorithm called\nsweeping is $np^{2}+p^{3}$ , the Householder method for QR is\n$2np^{2}-\\frac{2}{3}p^{3}$, and the modified Gram-Schmidt approach for\nQR is $2np^{2}$. So if $n\\gg p$ then Cholesky (and sweeping) are faster\nthan the QR approaches. According to Monahan, modified Gram-Schmidt is\nmost numerically stable and sweeping least. In general, regression is\npretty quick unless $p$ is large since it is linear in $n$, so it may\nnot be worth worrying too much about computational differences of the\nsort noted here.\n\n### Regression and the QR in Python and R\n\nWe can get the Q and R matrices easily in Python.\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nQ,R = np.linalg.qr(X)\n```\n:::\n\n\n\nOne of the methods used by the [`statsmodel` package in Python\nuses the QR to fit a regression](https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.fit.html).\n\nNote that by default in Python (and in R), you get the skinny QR, namely only the\nfirst $p$ rows of $R$ and the first $p$ columns of $Q$, where the latter\nform an orthonormal basis for the column space of $X$. The remaining\ncolumns form an orthonormal basis for the null space of $X$ (the space\northogonal to the column space of $X$). The analogy in regression is\nthat we get the basis vectors for the regression, while adding the\nremaining columns gives us the full $n$-dimensional space of the\nobservations.\n\nRegression in R uses the QR decomposition via `qr()`, which calls a\nFortran function. `qr()` (and the Fortran functions that are called) is\nspecifically designed to output quantities useful in fitting linear\nmodels. \n\nIn R, `qr()` returns the result as a list meant for use by other tools. R\nstores the $R$ matrix in the upper triangle of `\\$qr`, while the lower\ntriangle of *\\$qr* and *\\$aux* store the information for constructing\n$Q$ (this relates to the Householder-related vectors $u$ below). One can\nmultiply by $Q$ using *qr.qy()* and by $Q^{\\top}$ using *qr.qty()*. If\nyou want to extract $R$ and $Q$, the following will work:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nX.qr = qr(X)\nQ = qr.Q(X.qr)\nR = qr.R(X.qr) \n```\n:::\n\n\n\nAs a side note, there are QR-based functions that provide\nregression-related quantities, such as *qr.resid()*, *qr.fitted()* and\n*qr.coef()*. These functions (and their Fortran counterparts) exist\nbecause one can work through the various regression quantities of\ninterest and find their expressions in terms of $Q$ and $R$, with nice\nproperties resulting from $Q$ being orthogonal and $R$ triangular.\n\n### Computing the QR decomposition\n\nHere we'll see some of the details of the different approaches to the\nQR, in part because they involve some concepts that may be useful in\nother contexts. 
I won't expect you to see all of how this works, but\nplease skim through this to get an idea of how things are done.\n\nOne approach involves reflections of vectors and a second rotations of\nvectors. Reflections and rotations are transformations that are\nperformed by orthogonal matrices. The determinant of a reflection matrix\nis -1 and the determinant of a rotation matrix is 1. We'll see some of\nthe details in the demo code.\n\n#### QR Method 1: Reflections\n\nIf $u$ and $v$ are orthonormal vectors and $x$ is in the space spanned\nby $u$ and $v$, $x=c_{1}u+c_{2}v$, then $\\tilde{x}=-c_{1}u+c_{2}v$ is a\nreflection (a *Householder* reflection) along the $u$ dimension (since\nwe are using the negative of that basis vector). We can think of this as\nreflecting across the plane perpendicular to $u$. This extends simply to\nhigher dimensions with orthonormal vectors, $u,v_{1},v_{2},\\ldots$\n\nSuppose we want to formulate the reflection in terms of a \"Householder\"\nmatrix, $Q$. It turns out that $$Qx=\\tilde{x}$$ if $Q=I-2uu^{\\top}$. $Q$\nhas the following properties: (1) $Qu=-u$, (2) $Qv=v$ for $u^{\\top}v=0$,\n(3) $Q$ is orthogonal and symmetric.\n\nOne way to create the QR decomposition is by a series of Householder\ntransformations that create an upper triangular $R$ from $X$:\n\n$$\n\\begin{aligned}\nR & = & Q_{p}\\cdots Q_{1} X \\\\\nQ & = & (Q_{p}\\cdots Q_{1})^{\\top}\n\\end{aligned}\n$$ \n\n\nwhere we make use of\nthe symmetry in defining $Q$.\n\nBasically $Q_{1}$ reflects the first column of $X$ with respect to a\ncarefully chosen $u$, so that the result is all zeroes except for the\nfirst element. We want $Q_{1}x=\\tilde{x}=(||x||,0,\\ldots,0)$. This can\nbe achieved with $u=\\frac{x-\\tilde{x}}{||x-\\tilde{x}||}$. Then $Q_{2}$\nmakes the last $n-2$ rows of the second column equal to zero. We'll work\nthrough this a bit in class.\n\nIn the regression context, as we work through the individual\ntransformations, $Q_{j}=I-2u_{j}u_{j}^{\\top}$, we apply them to $X$ and\n$Y$ to create $R$ (note this would not involve doing the full matrix\nmultiplication - think about what calculations are actually needed) and\n$QY=Q^{\\top}Y$, and then solve $R\\beta=Q^{\\top}Y$. To find\n$\\mbox{Cov}(\\hat{\\beta})\\propto(X^{\\top}X)^{-1}=(R^{\\top}R)^{-1}=R^{-1}R^{-\\top}$\nwe do need to invert $R$, but it's upper-triangular and of dimension\n$p\\times p$. It turns out that $Q^{\\top}Y$ can be partitioned into the\nfirst $p$ and the last $n-p$ elements, $z^{(1)}$ and $z^{(2)}$. The SSR\nis $\\|z^{(1)}\\|^{2}$ and SSE is $\\|z^{(2)}\\|^{2}$.\n\nFinal side note: if $X$ is square (so $n=p)$ you might wonder why we\nneed $Q_{p}$ since after $p-1$ reflections, we don't need to zero\nanything else out (since the last column of $R$ has $n$ non-zero\nelements). It turns out that if we go back to thinking about a\nHouseholder reflection in general, there is a lack of uniqueness in\nchoosing $\\tilde{x}$. It could either be $(||x||,0,\\ldots,0)$ or\n$(-||x||,0,\\ldots,0)$. For better numerical stability, one chooses from\nthe two of those such that $x_{1}$ is of the opposite sign to\n$\\tilde{x}_{1}$, so that one avoids cancellation of numbers that may be\nof the same magnitude when doing $x-\\tilde{x}$. The transformation\n$Q_{p}$ is the last step of taking that approach of choosing the sign at\neach step. $Q_{p}$ doesn't zero anything out; it just basically just\ninvolves potentially setting $R_{pp}$ to be $-R_{pp}$. 
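\n\nHere's a small sketch of a single Householder reflection that zeroes out\nall but the first element of a vector (ignoring the sign-choice refinement\njust discussed):\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\nx = np.array([3., 1., 2.])\nxtilde = np.zeros_like(x)\nxtilde[0] = np.linalg.norm(x)          # target: (||x||, 0, 0)\n\nu = (x - xtilde) / np.linalg.norm(x - xtilde)\nQ1 = np.eye(3) - 2 * np.outer(u, u)    # Householder matrix\n\nQ1 @ x                                 # approximately (||x||, 0, 0)\n```\n:::\n\n\n\n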
(To be honest,\nI'm not clear on why one would bother to do that last step, but that\nseems to be how it is presented in discussions of the Householder\napproach.) Of course in the case of $p2$, find interim vectors, $x_{k}^{(2)}$, by\n orthogonalizing with respect to $\\tilde{x}_{1}$\n\n3. Proceed for $k=3,\\ldots$, in turn orthogonalizing and normalizing\n the first of the remaining vectors w.r.t. $\\tilde{x}_{k-1}$ and\n orthogonalizing the remaining vectors w.r.t. $\\tilde{x}_{k-1}$ to\n get new interim vectors\n\nMathematically, we could instead orthogonalize $x_{2}$ w.r.t.\n$\\tilde{x}_{1}$, then orthogonalize $x_{3}$ w.r.t.\n$\\{\\tilde{x}_{1},\\tilde{x}_{2}\\}$, etc. The algorithm above is the\n*modified* G-S, and is known to be more numerically stable if the\ncolumns of $X$ are close to collinear, giving vectors that are closer to\northogonal. The resulting $\\tilde{x}$ vectors are the columns of $Q$.\nThe elements of $R$ are obtained as we proceed: the diagonal values are\nthe the normalization values in the denominators, while the\noff-diagonals are the inner products with the already-computed columns\nof $Q$ that are computed as part of the numerators.\n\nAnother way to think about this is that $R=Q^{\\top}X$, which is the same\nas regressing the columns of $X$ on $Q,$ since\n$(Q^{\\top}Q)^{-1}Q^{\\top}X=Q^{\\top}X$. By construction, the first column\nof $X$ is a scaling of the first column of $Q$, the second column of $X$\nis a linear combination of the first two columns of $Q$, etc., so $R$\nbeing upper triangular makes sense.\n\n### The \"tall-skinny\" QR\n\nSuppose you have a very large regression problem, with $n$ very large,\nand $n\\gg p$. There is a variant of the QR, called the tall-skinny QR\n(see for details) that allows us\nto find the decomposition in a parallel fashion. The basic idea is to do\na nested set of QR decompositions on blocks of rows of $X$:\n\n$$\nX = \\left( \\begin{array}{c}\nX_{0} \\\\\nX_{1} \\\\\nX_{2} \\\\\nX_{3}\n\\end{array}\n\\right) =\n\\left(\n\\begin{array}{c}\nQ_{0} R_{0} \\\\\nQ_{1} R_{1} \\\\\nQ_{2} R_{2} \\\\\nQ_{3} R_{3}\n\\end{array} \\right),\n$$\n\nfollowed by 'reduction' steps (this\ncan be done in a map-reduce context) that do the $QR$ of pairs of the\n$R$ factors: \n$$\\left(\\begin{array}{c}\nR_{0}\\\\\nR_{1}\\\\\nR_{2}\\\\\nR_{3}\n\\end{array}\\right)=\\left(\\begin{array}{c}\n\\left(\\begin{array}{c}\nR_{0}\\\\\nR_{1}\n\\end{array}\\right)\\\\\n\\left(\\begin{array}{c}\nR_{2}\\\\\nR_{3}\n\\end{array}\\right)\n\\end{array}\\right)=\\left(\\begin{array}{c}\nQ_{01}R_{01}\\\\\nQ_{23}R_{23}\n\\end{array}\\right)$$ and $$\\left(\\begin{array}{c}\nR_{01}\\\\\nR_{23}\n\\end{array}\\right)=Q_{0123}R_{0123}.$$ \n\nThe full decomposition is then\n\n$$X=\\left( \\begin{array}{cccc} Q_{0} & 0 & 0 & 0 \\\\ 0 & Q_{1} & 0 & 0 \\\\ 0 & 0 & Q_{2} & 0 \\\\ 0 & 0 & 0 & Q_{3} \\end{array} \\right) \\left( \\begin{array}{cc} Q_{01} & 0 \\\\ 0 & Q_{23} \\end{array} \\right) Q_{0123} R_{0123} = QR.$$\n\n\nThe computation can be done in\nparallel (in particular it can be done with map-reduce) and the $Q$\nmatrix for big problems would generally not be computed explicitly but\nwould be stored in its constituent pieces.\n\nAlternatively, there is a variant on the algorithm that processes the\nrow-blocks of $X$ serially, allowing you to do QR on a large tall-skinny\nmatrix that you can't fit in memory (or possibly even on disk). First\nyou do $QR$ on $X_{0}$ to get $Q_{0}R_{0}$. Then you stack $R_{0}$ on\ntop of $X_{1}$ and do QR to get $R_{01}$. 
Then stack $R_{01}$ on top of\n$X_{2}$ to get $R_{012}$, etc.\n\n## Determinants\n\nThe absolute value of the determinant of a square matrix can be found\nfrom the product of the diagonals of the triangular matrix in any\nfactorization that gives a triangular (including diagonal) matrix times\nan orthogonal matrix (or matrices) since the determinant of an\northogonal matrix is either one or minus one.\n\n$|A|=|QR|=|Q||R|=\\pm|R|$\n\n$|A^{\\top}A|=|(QR)^{\\top}QR|=|R^{\\top}R|=|R_{1}^{\\top}R_{1}|=|R_{1}|^{2}$\\\nIn Python, the following will do it (on the log scale).\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nQ, R = np.linalg.qr(A)\nmagn = np.sum(np.log(np.abs(np.diag(R)))) \n```\n:::\n\n\n\nAn alternative is the product of the diagonal elements of $D$ (the\nsingular values) in the SVD factorization, $A=UDV^{\\top}$.\n\nFor non-negative definite matrices, we know the determinant is\nnon-negative, so the uncertainty about the sign is not an issue. For\npositive definite matrices, a good approach is to use the product of the\ndiagonal elements of the Cholesky decomposition.\n\nOne can also use the product of the eigenvalues:\n$|A|=|\\Gamma\\Lambda\\Gamma^{-1}|=|\\Gamma||\\Gamma^{-1}||\\Lambda|=|\\Lambda|$\n\n#### Computation\n\nComputing from any of these diagonal or triangular matrices as the\nproduct of the diagonals is prone to overflow and underflow, so we\n**always** work on the log scale as the sum of the log of the values.\nWhen some of these may be negative, we can always keep track of the\nnumber of negative values and take the log of the absolute values.\n\nOften we will have the factorization as a result of other parts of the\ncomputation, so we get the determinant for free.\n\nWe can use `np.linalg.slogdet()` (which returns the sign and the log of the\nabsolute value of the determinant) or (definitely not recommended)\n`np.linalg.det()` to calculate the determinant in Python.\nThese functions use the LU decomposition.\n\n\n# 5. Eigendecomposition and SVD\n\n## Eigendecomposition \n\nThe eigendecomposition (spectral decomposition) is useful in considering\nconvergence of algorithms and of course for statistical decompositions\nsuch as PCA. We think of decomposing the components of variation into\northogonal patterns (the eigenvectors) with variances (eigenvalues)\nassociated with each pattern.\n\nSquare symmetric matrices have real eigenvectors and eigenvalues, with\nthe factorization into orthogonal $\\Gamma$ and diagonal $\\Lambda$,\n$A=\\Gamma\\Lambda\\Gamma^{\\top}$, where the eigenvalues on the diagonal of\n$\\Lambda$ are ordered in decreasing value. Of course this is equivalent\nto the definition of an eigenvalue/eigenvector pair as a pair such that\n$Ax=\\lambda x$ where $x$ is the eigenvector and $\\lambda$ is a scalar,\nthe eigenvalue. The inverse of the eigendecomposition is simply\n$\\Gamma\\Lambda^{-1}\\Gamma^{\\top}$. On a similar note, we can create a\nsquare root matrix, $\\Gamma\\Lambda^{1/2}$, by taking the square roots of\nthe eigenvalues.\n\nThe spectral radius of $A$, denoted $\\rho(A)$, is the maximum of the\nabsolute values of the eigenvalues. As we saw when talking about\nill-conditionedness, for symmetric matrices, this maximum is the induced\nnorm, so we have $\\rho(A)=\\|A\\|_{2}$. It turns out that\n$\\rho(A)\\leq\\|A\\|$ for any induced matrix norm. The spectral radius\ncomes up in determining the rate of convergence of some iterative\nalgorithms.\n\n#### Computation\n\nThere are several methods for eigenvalues; a common one for doing the\nfull eigendecomposition is the *QR algorithm*. 
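In practice, of course, we simply call a routine that wraps compiled LAPACK code, such as `np.linalg.eigh` for symmetric matrices. Here's a small check I've added that the pieces behave as described above (the variable names and the random matrix are mine):

::: {.cell}

```{.python .cell-code}
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(4, 4))
A = Z + Z.T                        # a symmetric matrix

evals, Gamma = np.linalg.eigh(A)   # eigh is for symmetric/Hermitian matrices
np.allclose(Gamma @ np.diag(evals) @ Gamma.T, A)           # A = Gamma Lambda Gamma^T
np.allclose(np.max(np.abs(evals)), np.linalg.norm(A, 2))   # spectral radius equals ||A||_2
```
:::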
The first step is to\nreduce $A$ to upper Hessenburg form, which is an upper triangular matrix\nexcept that the first subdiagonal in the lower triangular part can be\nnon-zero. For symmetric matrices, the result is actually tridiagonal. We\ncan do the reduction using Householder reflections or Givens rotations.\nAt this point the QR decomposition (using Givens rotations) is applied\niteratively (to a version of the matrix in which the diagonals are\nshifted), and the result converges to a diagonal matrix, which provides\nthe eigenvalues. It's more work to get the eigenvectors, but they are\nobtained as a product of Householder matrices (required for the initial\nreduction) multiplied by the product of the $Q$ matrices from the\nsuccessive QR decompositions.\n\nWe won't go into the algorithm in detail, but note that it involves\nmanipulations and ideas we've seen already.\n\nIf only the largest (or the first few largest) eigenvalues and their\neigenvectors are needed, which can come up in time series and Markov\nchain contexts, the problem is easier and can be solved by the *power\nmethod*. E.g., in a Markov chain context, steady state is reached\nthrough $x_{t}=A^{t}x_{0}$. One can find the largest eigenvector by\nmultiplying by $A$ many times, normalizing at each step.\n$v^{(k)}=Az^{(k-1)}$ and $z^{(k)}=v^{(k)}/\\|v^{(k)}\\|$. There is an\nextension to find the $p$ largest eigenvalues and their vectors. See the\ndemo code in the qmd source file for an implementation (in R).\n\n\n\n\n\n\n\n## Singular value decomposition\n\nLet's consider an $n\\times m$ matrix, $A$, with $n\\geq m$ (if $m>n$, we\ncan always work with $A^{\\top})$. This often is a matrix representing\n$m$ features of $n$ observations. We could have $n$ documents and $m$\nwords, or $n$ gene expression levels and $m$ experimental conditions,\netc. $A$ can always be decomposed as $$A=UDV^{\\top}$$ where $U$ and $V$\nare matrices with orthonormal columns (left and right eigenvectors) and\n$D$ is diagonal with non-negative values (which correspond to\neigenvalues in the case of square $A$ and to squared eigenvalues of\n$A^{\\top}A$).\n\nThe SVD can be represented in more than one way. One representation is\n$$A_{n\\times m}=U_{n\\times k}D_{k\\times k}V_{k\\times m}^{\\top}=\\sum_{j=1}^{k}D_{jj}u_{j}v_{j}^{\\top}$$\nwhere $u_{j}$ and $v_{j}$ are the columns of $U$ and $V$ and where $k$\nis the rank of $A$ (which is at most the minimum of $n$ and $m$ of\ncourse). The diagonal elements of $D$ are the singular values.\n\nThat representation is as the sum of rank-one matrices (since\neach term is the scaled outer product of two vectors). \n\nIf $A$ is positive semi-definite, the eigendecomposition is an SVD.\nFurthermore, $A^{\\top}A=VD^{2}V^{\\top}$ and $AA^{\\top}=UD^{2}U^{\\top}$,\nso we can find the eigendecomposition of such matrices using the SVD of\n$A$ (for $AA^{\\top}$ we need to fill out $U$ to have $n$ columns). 
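Here's a quick numerical check of that correspondence (an example I'm adding; the matrix is just a small random one):

::: {.cell}

```{.python .cell-code}
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 3))

U, d, Vt = np.linalg.svd(A, full_matrices=False)
evals = np.linalg.eigh(A.T @ A)[0]        # eigh returns eigenvalues in increasing order

np.allclose(np.sort(d**2), evals)         # squared singular values = eigenvalues of A^T A
np.allclose(A, U @ np.diag(d) @ Vt)       # U D V^T reconstructs A
```
:::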
Note\nthat the squares of the singular values of $A$ are the eigenvalues of\n$A^{\\top}A$ and $AA^{\\top}$.\n\nWe can also fill out the matrices to get\n$$A=U_{n\\times n}D_{n\\times m}V_{m\\times m}^{\\top}$$ where the added\nrows and columns of $D$ are zero with the upper left block the\n$D_{k\\times k}$ from above.\n\n#### Uses\n\nThe SVD is an excellent way to determine a matrix rank and to construct\na pseudo-inverse ($A^{+}=VD^{+}U^{\\top})$.\n\nWe can use the SVD to approximate $A$ by taking\n$A\\approx\\tilde{A}=\\sum_{j=1}^{p}D_{jj}u_{j}v_{j}^{\\top}$ for $pj+p$ and the upper bandwidth is $q$ ($A_{ij}=0$\nfor $j>i+q$). An alternative to reducing to $Ux=b^{*}$ is to compute\n$A=LU$ and then do two solutions, $U^{-1}(L^{-1}b)$. One can show that\nthe computational complexity of the LU factorization is $O(npq)$ for\nbanded matrices, while solving the two triangular systems is $O(np+nq)$,\nso for small $p$ and $q$, the speedup can be dramatic.\n\nBanded matrices come up in time series analysis. E.g., moving average (MA) models produce\nbanded covariance structures because the covariance is zero after a\ncertain number of lags.\n\n## Low rank updates (optional)\n\nA transformation of the form $A-uv^{\\top}$ is a rank-one update because\n$uv^{\\top}$ is of rank one.\n\nMore generally a low rank update of $A$ is $\\tilde{A}=A-UV^{\\top}$ where\n$U$ and $V$ are $n\\times m$ with $n\\geq m$. The\nSherman-Morrison-Woodbury formula tells us that\n$$\\tilde{A}^{-1}=A^{-1}+A^{-1}U(I_{m}-V^{\\top}A^{-1}U)^{-1}V^{\\top}A^{-1}$$\nso if we know $x_{0}=A^{-1}b$, then the solution to $\\tilde{A}x=b$ is\n$x+A^{-1}U(I_{m}-V^{\\top}A^{-1}U)^{-1}V^{\\top}x$. Provided $m$ is not\ntoo large, and particularly if we already have a factorization of $A$,\nthen $A^{-1}U$ is not too bad computationally, and\n$I_{m}-V^{\\top}A^{-1}U$ is $m\\times m$. As a result\n$A^{-1}(U(\\cdots)^{-1}V^{\\top}x)$ isn't too bad.\n\nThis also comes up in working with precision matrices in Bayesian\nproblems where we may have $A^{-1}$ but not $A$ (we often add precision\nmatrices to find conditional normal distributions). An alternative\nexpression for the formula is $\\tilde{A}=A+UCV^{\\top}$, and the identity\ntells us\n$$\\tilde{A}^{-1}=A^{-1}-A^{-1}U(C^{-1}+V^{\\top}A^{-1}U)^{-1}V^{\\top}A^{-1}$$\n\nBasically Sherman-Morrison-Woodbury gives us matrix identities that we\ncan use in combination with our knowledge of smart ways of solving\nsystems of equations.\n\n# 7. Iterative solutions of linear systems (optional)\n\n#### Gauss-Seidel\n\nSuppose we want to iteratively solve $Ax=b$. Here's the algorithm, which\nsequentially updates each element of $x$ in turn.\n\n- Start with an initial approximation, $x^{(0)}$.\n- Hold all but $x_{1}^{(0)}$ constant and solve to find\n $x_{1}^{(1)}=\\frac{1}{a_{11}}(b_{1}-\\sum_{j=2}^{n}a_{1j}x_{j}^{(0)})$.\n- Repeat for the other rows of $A$ (i.e., the other elements of $x$),\n finding $x^{(1)}$.\n- Now iterate to get $x^{(2)}$, $x^{(3)}$, etc. until a convergence\n criterion is achieved, such as $\\|x^{(k)}-x^{(k-1)}\\|\\leq\\epsilon$\n or $\\|r^{(k)}-r^{(k-1)}\\|\\leq\\epsilon$ for $r^{(k)}=b-Ax^{(k)}$.\n\nLet's consider how many operations are involved in a single update:\n$O(n)$ for each element, so $O(n^{2})$ for each update. 
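Here's a bare-bones sketch of Gauss-Seidel (my own illustration, not production code); it assumes the diagonal elements of $A$ are nonzero and that the iteration converges, which is guaranteed, e.g., for a strictly diagonally dominant $A$:

::: {.cell}

```{.python .cell-code}
import numpy as np

def gauss_seidel(A, b, x0, tol=1e-8, max_iter=1000):
    """Iteratively solve Ax = b, updating one element of x at a time."""
    x = x0.astype(float).copy()
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(len(b)):
            # Newest values of x[j] for j < i; previous values for j > i.
            x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
        if np.linalg.norm(x - x_old) < tol:
            break
    return x

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5)) + 20 * np.eye(5)   # strongly diagonally dominant, so GS converges
b = rng.normal(size=5)
np.allclose(gauss_seidel(A, b, np.zeros(5)), np.linalg.solve(A, b))
```
:::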
Thus if we can\nstop well before $n$ iterations, we've saved computation relative to\nexact methods.\n\nIf we decompose $A=L+D+U$ where $L$ is strictly lower triangular, $U$ is\nstrictly upper triangular, then Gauss-Seidel is equivalent to solving\n$$(L+D)x^{(k+1)}=b-Ux^{(k)}$$ and we know that solving the lower\ntriangular system is $O(n^{2})$.\n\nIt turns out that the rate of convergence depends on the spectral radius\nof $(L+D)^{-1}U$.\n\nGauss-Seidel amounts to optimizing by moving in axis-oriented\ndirections, so it can be slow in some cases.\n\n#### Conjugate gradient\n\nFor positive definite $A$, conjugate gradient (CG) reexpresses the\nsolution to $Ax=b$ as an optimization problem, minimizing\n$$f(x)=\\frac{1}{2}x^{\\top}Ax-x^{\\top}b,$$ since the derivative of $f(x)$\nis $Ax-b$ and at the minimum this gives $Ax-b=0$.\n\nInstead of finding the minimum by following the gradient at each step\n(so-called steepest descent, which can give slow convergence - we'll see\na demonstration of this in the optimization unit), CG chooses directions\nthat are mutually conjugate w.r.t. $A$, $d_{i}^{\\top}Ad_{j}=0$ for\n$i\\ne j$. The method successively chooses vectors giving the direction,\n$d_{k}$, in which to move down towards the minimum and a scaling of how\nmuch to move, $\\alpha_{k}$. If we start at $x_{(0)}$, the $k$th point we\nmove to is $x_{(k)}=x_{(k-1)}+\\alpha_{k}d_{k}$ so we have\n$$x_{(k)}=x_{(0)}+\\sum_{j\\leq k}\\alpha_{j}d_{j}$$ and we use a\nconvergence criterion such as given above for Gauss-Seidel. The\ndirections are chosen to be the residuals, $b-Ax_{(k)}$. Here's the\nbasic algorithm:\n\n- Choose $x_{(0)}$ and define the residual, $r_{(0)}=b-Ax_{(0)}$ (the\n error on the scale of $b$) and the direction, $d_{0}=r_{(0)}$ and\n set $k=0$.\n\n- Then iterate:\n\n - $\\alpha_{k}=\\frac{r_{(k)}^{\\top}r_{(k)}}{d_{k}^{\\top}Ad_{k}}$\n (choose step size so next error will be orthogonal to current\n direction - which we can express in terms of the residual, which\n is easily computable)\n\n - $x_{(k+1)}=x_{(k)}+\\alpha_{k}d_{k}$ (update current value)\n\n - $r_{(k+1)}=r_{(k)}-\\alpha_{k}Ad_{k}$ (update current residual)\n\n - $d_{k+1}=r_{(k+1)}+\\frac{r_{(k+1)}^{\\top}r_{(k+1)}}{r_{(k)}^{\\top}r_{(k)}}d_{k}$\n (choose next direction by conjugate Gram-Schmidt, starting with\n $r_{(k+1)}$ and removing components that are not $A$-orthogonal\n to previous directions, but it turns out that $r_{(k+1)}$ is\n already $A$-orthogonal to all but $d_{k}$).\n\n- Stop when $\\|r^{(k+1)}\\|$ is sufficiently small.\n\nThe convergence of the algorithm depends in a complicated way on the\neigenvalues, but in general convergence is faster when the condition\nnumber is smaller (the eigenvalues are not too spread out). 
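And here's that algorithm translated directly into NumPy (a sketch I've written for illustration; for real problems you'd typically use a library routine such as `scipy.sparse.linalg.cg` and supply $A$ only through matrix-vector products):

::: {.cell}

```{.python .cell-code}
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
    """Solve Ax = b for symmetric positive definite A by conjugate gradients."""
    x = x0.astype(float).copy()
    r = b - A @ x                      # initial residual
    d = r.copy()                       # initial direction
    max_iter = len(b) if max_iter is None else max_iter
    for _ in range(max_iter):
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)     # step size
        x = x + alpha * d
        r_new = r - alpha * Ad         # updated residual
        if np.linalg.norm(r_new) < tol:
            break
        d = r_new + (r_new @ r_new) / (r @ r) * d   # next A-conjugate direction
        r = r_new
    return x

rng = np.random.default_rng(1)
Z = rng.normal(size=(8, 8))
A = Z @ Z.T + 8 * np.eye(8)            # positive definite
b = rng.normal(size=8)
np.allclose(conjugate_gradient(A, b, np.zeros(8)), np.linalg.solve(A, b))
```
:::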
CG will in\nprinciple give the exact answer in $n$ steps (where $A$ is $n\\times n$).\nHowever, computationally we lose accuracy and interest in the algorithm\nis really as an iterative approximation where we stop before $n$ steps.\nThe approach basically amounts to moving in axis-oriented directions in\na space stretched by $A$.\n\nIn general, CG is used for large sparse systems.\n\nSee the [extensive description from\nShewchuk](http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf)\nfor more details, as well as the use\nof CG when $A$ is not positive definite.\n\n#### Updating a solution\n\nSometimes we have solved a system, $Ax=b$ and then need to solve $Ax=c$.\nIf we have solved the initial system using a factorization, we can reuse\nthat factorization and solve the new system in $O(n^{2})$. Iterative\napproaches can do a nice job if $c=b+\\delta b$. Start with the solution\n$x$ for $Ax=b$ as $x^{(0)}$ and use one of the methods above.\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/units/unit10-linalg/execute-results/tex.json b/_freeze/units/unit10-linalg/execute-results/tex.json index 9658164..a6242b7 100644 --- a/_freeze/units/unit10-linalg/execute-results/tex.json +++ b/_freeze/units/unit10-linalg/execute-results/tex.json @@ -1,7 +1,7 @@ { - "hash": "04df3b9264ae4187721edd4da62e08c5", + "hash": "d1d0a6f17e0930b47afaca760a06ac18", "result": { - "markdown": "---\ntitle: \"Numerical linear algebra\"\nauthor: \"Chris Paciorek\"\ndate: \"2023-10-24\"\nformat:\n pdf:\n documentclass: article\n margin-left: 30mm\n margin-right: 30mm\n toc: true\n html:\n theme: cosmo\n css: ../styles.css\n toc: true\n code-copy: true\n code-block-background: true\nexecute:\n freeze: auto\nfrom: markdown+tex_math_single_backslash\n---\n\n\n\n[PDF](./unit10-linalg.pdf){.btn .btn-primary}\n\nReferences:\n\n- Gentle: Numerical Linear Algebra for Applications in Statistics\n (available via UC Library Search) (my notes here are based primarily\n on this source) [Gentle-NLA]\n - Gentle: Matrix Algebra also has much of this material.\n- Gentle: Computational Statistics [Gentle-CS]\n- Lange: Numerical Analysis for Statisticians\n- Monahan: Numerical Methods of Statistics\n\nVideos (optional): \n\nThere are various videos from 2020 in the bCourses Media Gallery that you\ncan use for reference if you want to. \n\n - Video 1. Ill-conditioned problems, part 1\n - Video 2. Ill-conditioned problems, part 2\n - Video 3. Triangular systems of equations\n - Video 4. Solving systems of equations via LU, part 1\n - Video 5. Solving systems of equations via LU, part 2\n - Video 6. Solving systems of equations via LU, part 3\n - Video 7. Cholesky decomposition\n\n\nIn working through how to compute something or understanding an\nalgorithm, it can be very helpful to depict the matrices and vectors\ngraphically. We'll see this on the board in class.\n\n# 1. Preliminaries\n\n## Context\n\nMany statistical and machine learning methods involve linear algebra of\nsome sort - at the very least matrix multiplication and very often some\nsort of matrix decomposition to fit models and do analysis: linear\nregression, various more sophisticated forms of regression, deep neural\nnetworks, principle components analysis (PCA) and the wide varieties of\ngeneralizations and variations on PCA, etc., etc.\n\n## Goals\n\nHere's what I'd like you to get out of this unit:\n\n1. How to think about the computational order (number of computations\n involved) of a problem\n2. 
How to choose a computational approach to a given linear algebra\n calculation you need to do.\n3. An understanding of how issues with computer numbers (Unit 8) affect\n linear algebra calculations.\n\n## Key principle\n\n**The form of a mathematical expression and how it should be evaluated\non a computer may be very different.** Better computational approaches\ncan increase speed and improve the numerical properties of the\ncalculation.\n\n- Example 1 (already seen in Unit 5): If $X$ and $Y$ are matrices and $z$\nis a vector, we should compute $X(Yz)$ rather than $(XY)z$; the former\nis much more computationally efficient. \n- Example 2: We do not compute $(X^{\\top}X)^{-1}X^{\\top}Y$ by computing\n$X^{\\top}X$ and finding its inverse. In fact, perhaps more surprisingly,\nwe may never actually form $X^{\\top}X$ in some implementations.\n- Example 3: Suppose I have a matrix $A$, and I want to permute (switch)\ntwo rows. I can do this with a permutation matrix, $P$, which is mostly\nzeroes. On a computer, in general I wouldn't need to even change the\nvalues of $A$ in memory in some cases (e.g., if I were to calculate\n$PAB$). Why not?\n\n## Computational complexity\n\nWe can assess the computational complexity of a linear algebra\ncalculation by counting the number multiplys/divides and the number of\nadds/subtracts. Sidenote: addition is a bit faster than multiplication,\nso some algorithms attempt to trade multiplication for addition.\n\nIn general we do not try to count the actual number of calculations, but\njust their order, though in some cases in this unit we'll actually get a\nmore exact count. In general, we denote this as $O(f(n))$ which means\nthat the number of calculations approaches $cf(n)$ as $n\\to\\infty$\n(i.e., we know the calculation is approximately proportional to $f(n)$).\nConsider matrix multiplication, $AB$, with matrices of size $a\\times b$\nand $b\\times c$. Each column of the second matrix is multiplied by all\nthe rows of the first. For any given inner product of a row by a column,\nwe have $b$ multiplies. We repeat these operations for each column and\nthen for each row, so we have $abc$ multiplies so $O(abc)$ operations.\nWe could count the additions as well, but there's usually an addition\nfor each multiply, so we can usually just count the multiplys and then\nsay there are such and such {multiply and add}s. This is Monahan's\napproach, but you may see other counting approaches where one counts the\nmultiplys and the adds separately.\n\nFor two symmetric, $n\\times n$ matrices, this is $O(n^{3})$. Similarly,\nmatrix factorization (e.g., the Cholesky decomposition) is $O(n^{3})$\nunless the matrix has special structure, such as being sparse. As\nmatrices get large, the speed of calculations decreases drastically\nbecause of the scaling as $n^{3}$ and memory use increases drastically.\nIn terms of memory use, to hold the result of the multiply indicated\nabove, we need to hold $ab+bc+ac$ total elements, which for symmetric\nmatrices sums to $3n^{2}$. So for a matrix with $n=10000$, we have\n$3\\cdot10000^{2}\\cdot8/1e9=2.4$Gb.\n\nWhen we have $O(n^{q})$ this is known as polynomial time. Much worse is\n$O(b^{n})$ (exponential time), while much better is $O(\\log n$) (log\ntime). 
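To connect this back to Example 1 under the key principle above, here's a quick timing comparison (my own example; the exact numbers will depend on your machine and the BLAS that NumPy is linked against):

::: {.cell}

```{.python .cell-code}
import time
import numpy as np

rng = np.random.default_rng(1)
n = 3000
X = rng.normal(size=(n, n))
Y = rng.normal(size=(n, n))
z = rng.normal(size=n)

t0 = time.time()
res1 = (X @ Y) @ z    # O(n^3): forms an n x n matrix first
t1 = time.time()
res2 = X @ (Y @ z)    # O(n^2): just two matrix-vector products
t2 = time.time()

np.allclose(res1, res2)
t1 - t0, t2 - t1      # the second ordering should be far faster
```
:::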
Computer scientists talk about NP-complete problems; these are\nessentially problems for which there is not a polynomial time\nalgorithm - it turns out all such problems can be rewritten such that\nthey are equivalent to one another.\n\nIn real calculations, it's possible to have the actual time ordering of\ntwo approaches differ from what the order approximations tell us. For\nexample, something that involves $n^{2}$ operations may be faster than\none that involves $1000(n\\log n+n)$ even though the former is $O(n^{2})$\nand the latter $O(n\\log n)$. The problem is that the constant, $c=1000$,\ncan matter (depending on how big $n$ is), as can the extra calculations\nfrom the lower order term(s), in this case $1000n$.\n\nA note on terminology: *flops* stands for both floating point operations\n(the number of operations required) and floating point operations per\nsecond, the speed of calculation.\n\n## Notation and dimensions\n\nI'll try to use capital letters for matrices, $A$, and lower-case for\nvectors, $x$. Then $x_{i}$ is the ith element of $x$, $A_{ij}$ is the\n$i$th row, $j$th column element, and $A_{\\cdot j}$ is the $j$th column\nand $A_{i\\cdot}$ the $i$th row. By default, we'll consider a vector,\n$x$, to be a one-column matrix, and $x^{\\top}$ to be a one-row matrix.\nSome of the references given at the start of this Unit also use $a_{ij}$ for $A_{ij}$ and\n$a_{j}$ for the $j$th column.\n\nThroughout, we'll need to be careful that the matrices involved in an\noperation are conformable: for $A+B$ both matrices need to be of the\nsame dimension, while for $AB$ the number of columns of $A$ must match\nthe number of rows of $B$. Note that this allows for $B$ to be a column\nvector, with only one column, $Ab$. Just checking dimensions is a good\nway to catch many errors. Example: is\n$\\mbox{Cov}(Ax)=A\\mbox{Cov}(x)A^{\\top}$ or\n$\\mbox{Cov}(Ax)=A^{\\top}\\mbox{Cov}(x)A$? Well, if $A$ is $m\\times n$, it\nmust be the former, as the latter is not conformable.\n\nThe **inner product** of two vectors is\n$\\sum_{i}x_{i}y_{i}=x^{\\top}y\\equiv\\langle x,y\\rangle\\equiv x\\cdot y$.\n\nThe **outer product** is $xy^{\\top}$, which comes from all pairwise\nproducts of the elements.\n\nWhen the indices of summation should be obvious, I'll sometimes leave\nthem implicit. Ask me if it's not clear.\n\n## Norms\n\nFor a vector, $\\|x\\|_{p}=(\\sum_{i}|x_{i}|^{p})^{1/p}$ and the standard (Euclidean)\nnorm is $\\|x\\|_{2}=\\sqrt{\\sum x_{i}^{2}}=\\sqrt{x^{\\top}x}$, just the\nlength of the vector in Euclidean space, which we'll refer to as\n$\\|x\\|$, unless noted otherwise.\n\nOne commonly used norm for a matrix is\nthe Frobenius norm, $\\|A\\|_{F}=(\\sum_{i,j}a_{ij}^{2})^{1/2}$.\n\nIn this Unit, we'll often make use of the **induced matrix norm**, which is\ndefined relative to a corresponding vector norm, $\\|\\cdot\\|$, as:\n$$\\|A\\|=\\sup_{x\\ne0}\\frac{\\|Ax\\|}{\\|x\\|}$$ So we have\n$$\\|A\\|_{2}=\\sup_{x\\ne0}\\frac{\\|Ax\\|_{2}}{\\|x\\|_{2}}=\\sup_{\\|x\\|_{2}=1}\\|Ax\\|_{2}$$\nIf you're not familiar with the supremum (\"sup\" above), you can just\nthink of it as taking the maximum. 
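Here's a small numerical illustration (mine) of the induced 2-norm as the largest factor by which $A$ can stretch a vector; the random sampling is just a crude way of approximating the supremum:

::: {.cell}

```{.python .cell-code}
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))

# Crude approximation to sup ||Ax|| / ||x|| using many random directions.
xs = rng.normal(size=(4, 100000))
stretch = np.linalg.norm(A @ xs, axis=0) / np.linalg.norm(xs, axis=0)

stretch.max(), np.linalg.norm(A, 2)   # the max stretch approaches the matrix 2-norm
```
:::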
In the case of the 2-norm, the norm turns\nout to be the largest singular value in the singular value decomposition (SVD)\nof the matrix.\n\nWe can interpret the norm of a matrix as the most that the matrix\ncan stretch a vector when multiplying by the vector (relative to the\nlength of the vector).\n\nA property of any legitimate matrix norm (including the induced norm) is\nthat $\\|AB\\|\\leq\\|A\\|\\|B\\|$. Also recall that norms must obey the triangle\ninequality, $\\|A+B\\|\\leq\\|A\\|+\\|B\\|$.\n\nA normalized vector is one with \"length\", i.e., Euclidean norm, of one.\nWe can easily normalize a vector: $\\tilde{x}=x/\\|x\\|$\n\nThe angle between two vectors is\n$$\\theta=\\cos^{-1}\\left(\\frac{\\langle x,y\\rangle}{\\sqrt{\\langle x,x\\rangle\\langle y,y\\rangle}}\\right)$$\n\n## Orthogonality\n\nTwo vectors are orthogonal if $x^{\\top}y=0$, in which case we say\n$x\\perp y$. An **orthogonal matrix** is a square matrix in which all of the\ncolumns are orthogonal to each other and normalized. The same holds for\nthe rows. Orthogonal matrices\ncan be shown to have full rank. Furthermore if $A$ is orthogonal,\n$A^{\\top}A=I$, so $A^{-1}=A^{\\top}$. Given all this, the determinant of\northogonal $A$ is either 1 or -1. Finally the product of two orthogonal\nmatrices, $A$ and $B$, is also orthogonal since\n$(AB)^{\\top}AB=B^{\\top}A^{\\top}AB=B^{\\top}B=I$.\n\n#### Permutations\n\nSometimes we make use of matrices that permute two rows (or two columns)\nof another matrix when multiplied. Such a matrix is known as an\nelementary permutation matrix and is an orthogonal matrix with a\ndeterminant of -1. You can multiply such matrices to get more general\npermutation matrices that are also orthogonal. If you premultiply by\n$P$, you permute rows, and if you postmultiply by $P$ you permute\ncolumns. Note that on a computer, you wouldn't need to actually do the\nmultiply (and if you did, you should use a sparse matrix routine), but\nrather one can often just rework index values that indicate where\nrelevant pieces of the matrix are stored (more in the next section).\n\n## Some vector and matrix properties\n\n$AB\\ne BA$ but $A+B=B+A$ and $A(BC)=(AB)C$.\n\nIn Python, recall the syntax is\n\n\n::: {.cell}\n\n```{.python .cell-code}\nA + B\n\n# Matrix multiplication\nnp.matmul(A, B) \nA @ B # alternative\nA.dot(B) # not recommended by the NumPy docs\n\nA * B # Hadamard (direct) product\n```\n:::\n\n\nYou don't need the spaces, but they're nice for code readability.\n\n## Trace and determinant of square matrices\n\nThe trace of a matrix is the sum of the diagonal elements. For square\nmatrices, $\\mbox{tr}(A+B)=\\mbox{tr}(A)+\\mbox{tr}(B)$,\n$\\mbox{tr}(A)=\\mbox{tr}(A^{\\top})$.\n\nWe also have $\\mbox{tr}(ABC)=\\mbox{tr}(CAB)=\\mbox{tr}(BCA)$ - basically\nyou can move a matrix from the beginning to the end or end to beginning,\nprovided they are conformable for this operation. This is helpful for a\ncouple reasons:\n\n1. We can find the ordering that reduces computation the most if the\n individual matrices are not square.\n2. $x^{\\top}Ax=\\mbox{tr}(x^{\\top}Ax)$ since the quadratic form,\n $x^{\\top}Ax$, is a scalar, and this is equal to\n $\\mbox{tr}(xx^{\\top}A)$ where $xx^{\\top}A$ is a matrix. It can be\n helpful to be able to go back and forth between a scalar and a trace\n in some statistical calculations.\n\nFor square matrices, the determinant exists and we have $|AB|=|A||B|$\nand therefore, $|A^{-1}|=1/|A|$ since $|I|=|AA^{-1}|=1$. 
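Here's a quick check of the trace manipulations above (my own example):

::: {.cell}

```{.python .cell-code}
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))
C = rng.normal(size=(3, 3))
x = rng.normal(size=3)

np.trace(A @ B @ C), np.trace(C @ A @ B), np.trace(B @ C @ A)   # all equal
x @ A @ x, np.trace(np.outer(x, x) @ A)                         # quadratic form as a trace
```
:::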
Also\n$|A|=|A^{\\top}|$, which can be seen using the QR decomposition for $A$\nand understanding properties of determinants of triangular matrices (in\nthis case $R$) and orthogonal matrices (in this case $Q$).\n\n## Transposes and inverses\n\nFor square, invertible matrices, we have that\n$(A^{-1})^{\\top}=(A^{\\top})^{-1}$. Why? Since we have\n$(AB)^{\\top}=B^{\\top}A^{\\top}$, we have:\n$$A^{\\top}(A^{-1})^{\\top}=(A^{-1}A)^{\\top}=I$$ so\n$(A^{\\top})^{-1}=(A^{-1})^{\\top}$.\n\nFor two invertible matrices, we have that\n$(AB)^{-1} = B^{-1}A^{-1}$ since\n$B^{-1}A^{-1} AB = I$.\n\n#### Other matrix multiplications\n\nThe Hadamard or direct product is simply multiplication of the\ncorrespoding elements of two matrices by each other. In R this is\nsimply` A * B`.\\\n**Challenge**: How can I find $\\mbox{tr}(AB)$ without using `A %*% B` ?\n\nThe Kronecker product is the product of each element of one matrix with\nthe entire other matrix\"\n\n$$A\\otimes B=\\left(\\begin{array}{ccc}\nA_{11}B & \\cdots & A_{1m}B\\\\\n\\vdots & \\ddots & \\vdots\\\\\nA_{n1}B & \\cdots & A_{nm}B\n\\end{array}\\right)$$\n\nThe inverse of a Kronecker product is the Kronecker product of the\ninverses,\n\n$$ B^{-1} \\otimes A^{-1} $$ \n\nwhich is obviously quite a bit faster because\nthe inverse (i.e., solving a system of equations) in this special case\nis $O(n^{3}+m^{3})$ rather than the naive approach being $O((nm)^{3})$.\n\n## Matrix decompositions\n\nA matrix decomposition is a re-expression of a matrix, $A$, in terms of\na product of two or three other, simpler matrices, where the\ndecomposition reveals structure or relationships present in the original\nmatrix, $A$. The \"simpler\" matrices may be simpler in various ways,\nincluding\n\n- having fewer rows or columns;\n- being diagonal, triangular or sparse in some way,\n- being orthogonal matrices.\n\nIn addition, once you have a decomposition, computation is generally\neasier, because of the special structure of the simpler matrices.\n\nWe'll see this in great detail in Section 3.\n\n# 2. Statistical interpretations of matrix invertibility, rank, etc.\n\n## Linear independence, rank, and basis vectors\n\nA set of vectors, $v_{1},\\ldots v_{n}$, is linearly independent (LIN)\nwhen none of the vectors can be represented as a linear combination,\n$\\sum c_{i}v_{i}$, of the others for scalars, $c_{1},\\ldots,c_{n}$. If\nwe have vectors of length $n$, we can have at most $n$ linearly\nindependent vectors. The rank of a matrix is the number of linearly\nindependent rows (or columns - it's the same), and is at most the\nminimum of the number of rows and number of columns. We'll generally\nthink about it in terms of the dimension of the column space - so we can\njust think about the number of linearly independent columns.\n\nAny set of linearly independent vectors (say $v_{1},\\ldots,v_{n}$) span\na space made up of all linear combinations of those vectors\n($\\sum_{i=1}^{n}c_{i}v_{i}$). The spanning vectors are known as basis\nvectors. We can express a vector $y$ that is in the space with respect\nto (as a linear combination of) basis vectors as $y=\\sum_{i}c_{i}v_{i}$,\nwhere if the basis vectors are normalized and orthogonal, we can find\nthe weights as $c_{i}=\\langle y,v_{i}\\rangle$.\n\nConsider a regression context. We have $p$ covariates ($p$ columns in\nthe design matrix, $X$), of which $q\\leq p$ are linearly independent\ncovariates. This means that $p-q$ of the vectors can be written as\nlinear combos of the $q$ vectors. 
The space spanned by the covariate\nvectors is of dimension $q$, rather than $p$, and $X^{\\top}X$ has $p-q$\neigenvalues that are zero. The $q$ LIN vectors are basis vectors for the\nspace - we can represent any point in the space as a linear combination\nof the basis vectors. You can think of the basis vectors as being like\nthe axes of the space, except that the basis vectors are not orthogonal.\nSo it's like denoting a point in $\\Re^{q}$ as a set of $q$ numbers\ntelling us where on each of the axes we are - this is the same as a\nlinear combination of axis-oriented vectors.\n\nWhen fitting a regression, if $n=p=q$, a vector of $n$ observations can\nbe represented exactly as a linear combination of the $p$ basis vectors,\nso there is no residual and we have a single unique (and exact) solution\n(e.g., with $n=p=2$, the observations fall exactly on the simple linear\nregression line). If $np$, so the system is overdetermined - there is no exact\nsolution, but regression is all about finding solutions that minimize\nsome criterion about the differences between the observations and linear\ncombinations of the columns of the $X$ matrix (such as least squares or\npenalized least squares). In standard regression, we project the\nobservation vector onto the space spanned by the columns of the $X$\nmatrix, so we find the point in the space closest to the observation\nvector.\n\n## Invertibility, singularity, rank, and positive definiteness\n\nFor square matrices, let's consider how invertibility, singularity, rank\nand positive (or non-negative) definiteness relate.\n\nSquare matrices that are \"regular\" have an eigendecomposition,\n$A=\\Gamma\\Lambda\\Gamma^{-1}$ where $\\Gamma$ is a matrix with the\neigenvectors as the columns and $\\Lambda$ is a diagonal matrix of\neigenvalues, $\\Lambda_{ii}=\\lambda_{i}$. Symmetric matrices and matrices\nwith unique eigenvalues are regular, as are some other matrices. The\nnumber of non-zero eigenvalues is the same as the rank of the matrix.\nSquare matrices that have an inverse are also called nonsingular, and\nthis is equivalent to having full rank. If the matrix is symmetric, the\neigenvectors and eigenvalues are real and $\\Gamma$ is orthogonal, so we\nhave $A=\\Gamma\\Lambda\\Gamma^{\\top}$. The determinant of the matrix is\nthe product of the eigenvalues (why?), which is zero if it is less than\nfull rank. Note that if none of the eigenvalues are zero then\n$A^{-1}=\\Gamma\\Lambda^{-1}\\Gamma^{\\top}$.\n\nLet's focus on symmetric matrices. The symmetric matrices that tend to\narise in statistics are either positive definite (p.d.) or non-negative\ndefinite (n.n.d.). If a matrix is positive definite, then by definition\n$x^{\\top}Ax>0$ for any $x$. Note that if $\\mbox{Cov}(y)=A$ then\n$x^{\\top}Ax=x^{\\top}\\mbox{Cov}(y)x=\\mbox{Cov}(x^{\\top}y)=\\mbox{Var}(x^{\\top}y)$\nif so positive definiteness amounts to having linear combinations of\nrandom variables (with the elements of $x$ here being the weights)\nhaving positive variance. So we must have that positive definite\nmatrices are equivalent to variance-covariance matrices (I'll just refer\nto this as a variance matrix or as a covariance matrix). If $A$ is p.d.\nthen it has all positive eigenvalues and it must have an inverse, though\nas we'll see, from a numerical perspective, we may not be able to\ncompute it if some of the eigenvalues are very close to zero. 
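Here's a small illustration (my own) of that connection between rank and zero eigenvalues, using a design matrix with one exactly collinear column:

::: {.cell}

```{.python .cell-code}
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
X = np.column_stack((X, X[:, 0] + X[:, 1]))   # a 5th column that is exactly collinear

XtX = X.T @ X
np.linalg.matrix_rank(XtX)    # 4, not 5
np.linalg.eigvalsh(XtX)       # one eigenvalue is zero up to floating point error
```
:::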
In Python,\n`numpy.linalg.eig(A)[1]` is $\\Gamma$, with each column a vector, and\n`numpy.linalg.eig(A)[0]` contains the (unordered) eigenvalues.\n\nTo summarize, here are some of the various connections between\nmathematical and statistical properties of **positive definite**\nmatrices:\n\n$A$ positive definite $\\Leftrightarrow$ $A$ is a covariance matrix\n$\\Leftrightarrow$ $x^{\\top}Ax>0$ $\\Leftrightarrow$ $\\lambda_{i}>0$\n(positive eigenvalues) $\\Rightarrow$$|A|>0$ $\\Rightarrow$$A$ is\ninvertible $\\Leftrightarrow$ $A$ is non singular $\\Leftrightarrow$ $A$ is\nfull rank.\n\nAnd here are connections for positive semi-definite matrices:\n\n$A$ positive semi-definite $\\Leftrightarrow$ $A$ is a constrained\ncovariance matrix $\\Leftrightarrow$ $x^{\\top}Ax\\geq0$ and equal to 0 for\nsome $x$ $\\Leftrightarrow$ $\\lambda_{i}\\geq 0$ (non-negative eigenvalues),\nwith at least one zero $\\Rightarrow$ $|A|=0$ $\\Leftrightarrow$ $A$ is not\ninvertible $\\Leftrightarrow$ $A$ is singular $\\Leftrightarrow$ $A$ is not\nfull rank.\n\n## Interpreting an eigendecomposition\n\nLet's interpret the eigendecomposition in a generative context as a way\nof generating random vectors. We can generate $y$ s.t. $\\mbox{Cov}(y)=A$\nif we generate $y=\\Gamma\\Lambda^{1/2}z$ where $\\mbox{Cov}(z)=I$ and\n$\\Lambda^{1/2}$ is formed by taking the square roots of the eigenvalues.\nSo $\\sqrt{\\lambda_{i}}$ is the standard deviation associated with the\nbasis vector $\\Gamma_{\\cdot i}$. That is, the $z$'s provide the weights\non the basis vectors, with scaling based on the eigenvalues. So $y$ is\nproduced as a linear combination of eigenvectors as basis vectors, with\nthe variance attributable to the basis vectors determined by the\neigenvalues.\n\nTo go the other direction, we can project a vector $y$ onto the space \nspanned by the eigenvectors: $w = (\\Gamma^{\\top}\\Gamma)^{-1}\\Gamma^{\\top}y = \\Gamma^{\\top}y = \\Lambda^{1/2}z$,\nwhere the simplification of course comes from $\\Gamma$ being orthogonal.\n\nIf $x^{\\top}Ax\\geq0$ then $A$ is nonnegative definite (also called\npositive semi-definite). In this case one or more eigenvalues can be\nzero. Let's interpret this a bit more in the context of generating\nrandom vectors based on non-negative definite matrices,\n$y=\\Gamma\\Lambda^{1/2}z$ where $\\mbox{Cov}(z)=I$. Questions:\n\n1. What does it mean when one or more eigenvalue (i.e.,\n $\\lambda_{i}=\\Lambda_{ii}$) is zero?\n\n2. Suppose I have an eigenvalue that is very small and I set it to\n zero? What will be the impact upon $y$ and $\\mbox{Cov}(y)$?\n\n3. Now let's consider the inverse of a covariance matrix, known as the\n precision matrix, $A^{-1}=\\Gamma\\Lambda^{-1}\\Gamma^{\\top}$. What\n does it mean if a $(\\Lambda^{-1})_{ii}$ is very large? What if\n $(\\Lambda^{-1})_{ii}$ is very small?\n\nConsider an arbitrary $n\\times p$ matrix, $X$. Any crossproduct or sum\nof squares matrix, such as $X^{\\top}X$ is positive definite\n(non-negative definite if $p>n$). This makes sense as it's just a\nscaling of an empirical covariance matrix.\n\n## Generalized inverses (optional)\n\nSuppose I want to find $x$ such that $Ax=b$. Mathematically the answer\n(provided $A$ is invertible, i.e. of full rank) is $x=A^{-1}b$.\n\nGeneralized inverses arise in solving equations when $A$ is not full\nrank. A generalized inverse is a matrix, $A^{-}$ s.t. $AA^{-}A=A$. 
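For example (a small check I've added), NumPy's `np.linalg.pinv` gives us such a generalized inverse for a rank-deficient matrix:

::: {.cell}

```{.python .cell-code}
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(3, 2))
A = B @ B.T                      # 3 x 3 but only rank 2, so it has no inverse

Aplus = np.linalg.pinv(A)
np.allclose(A @ Aplus @ A, A)    # the defining property of a generalized inverse
```
:::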
The\nMoore-Penrose inverse (the pseudo-inverse), $A^{+}$, is a (unique)\ngeneralized inverse that also satisfies some additional properties.\n$x=A^{+}b$ is the solution to the linear system, $Ax=b$, that has the\nshortest length for $x$.\n\nWe can find the pseudo-inverse based on an eigendecomposition (or an\nSVD) as $\\Gamma\\Lambda^{+}\\Gamma^{\\top}$. We obtain $\\Lambda^{+}$ from\n$\\Lambda$ as follows. For values $\\lambda_{i}>0$, compute\n$1/\\lambda_{i}$. All other values are set to 0. Let's interpret this\nstatistically. Suppose we have a precision matrix with one or more zero\neigenvalues and we want to find the covariance matrix. A zero eigenvalue\nmeans we have no precision, or infinite variance, for some linear\ncombination (i.e., for some basis vector). We take the pseudo-inverse\nand assign that linear combination zero variance.\n\nLet's consider a specific example. Autoregressive models are often used\nfor smoothing (in time, in space, and in covariates). A first order\nautoregressive model for $y_{1},y_{2},\\ldots,y_{T}$ has\n$E(y_{i}|y_{-i})=\\frac{1}{2}(y_{i-1}+y_{i+1})$. Another way of writing\nthe model is in time-order: $y_{i}=y_{i-1}+\\epsilon_{i}$. A second order\nautoregressive model has\n$E(y_{i}|y_{-i})=\\frac{1}{6}(4y_{i-1}+4y_{i+1}-y_{i-2}-y_{i+2})$. These\nconstructions basically state that each value should be a smoothed\nversion of its neighbors. One can figure out that the **precision**\nmatrix for $y$ in the first order model is $$\\left(\\begin{array}{ccccc}\n\\ddots & & \\vdots\\\\\n-1 & 2 & -1 & 0\\\\\n\\cdots & -1 & 2 & -1 & \\dots\\\\\n & 0 & -1 & 2 & -1\\\\\n & & \\vdots & & \\ddots\n\\end{array}\\right)$$ and in the second order model is\n\n$$\\left( \\begin{array}{ccccccc} \\ddots & & & \\vdots \\\\ 1 & -4 & 6 & -4 & 1 \\\\ \\cdots & 1 & -4 & 6 & -4 & 1 & \\cdots \\\\ & & 1 & -4 & 6 & -4 & 1 \\\\ & & & \\vdots \\end{array} \\right).$$ \n\nIf we look at the eigendecomposition of such\nmatrices, we see that in the first order case, the eigenvalue\ncorresponding to the constant eigenvector is zero.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\nprecMat = np.array([[1,-1,0,0,0],[-1,2,-1,0,0],[0,-1,2,-1,0],[0,0,-1,2,-1],[0,0,0,-1,1]])\ne = np.linalg.eig(precMat)\ne[0] # 4th eigenvalue is numerically zero\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([3.61803399e+00, 2.61803399e+00, 1.38196601e+00, 4.97762256e-17,\n 3.81966011e-01])\n```\n:::\n\n```{.python .cell-code}\ne[1][:,3] # constant eigenvector\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([0.4472136, 0.4472136, 0.4472136, 0.4472136, 0.4472136])\n```\n:::\n:::\n\n\nThis means we have no information about the overall level of $y$. So how\nwould we generate sample $y$ vectors? We can't put infinite variance on\nthe constant basis vector and still generate samples. 
Instead we use the\npseudo-inverse and assign ZERO variance to the constant basis vector.\nThis corresponds to generating realizations under the constraint that\n$\\sum y_{i}$ has no variation, i.e., $\\sum y_{i}=\\bar{y}=0$ - you can\nsee this by seeing that $\\mbox{Var}(\\Gamma_{\\cdot i}^{\\top}y)=0$ when\n$\\lambda_{i}=0$.\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# generate a realization\nevals = e[0]\nevals = 1/evals # variances\nevals[3] = 0 # generalized inverse\ny = e[1] @ ((evals ** 0.5) * np.random.normal(size = 5))\ny.sum()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n-2.220446049250313e-16\n```\n:::\n:::\n\n\nIn the second order case, we have two non-identifiabilities: for the sum\nand for the linear component of the variation in $y$ (linear in the\nindices of $y$).\n\nI could parameterize a statistical model as $\\mu+y$ where $y$ has\ncovariance that is the generalized inverse discussed above. Then I allow\nfor both a non-zero mean and for smooth variation governed by the\nautoregressive structure. In the second-order case, I would need to add\na linear component as well, given the second non-identifiability.\n\n## Matrices arising in regression\n\nIn regression, we work with $X^{\\top}X$. Some properties of this matrix\nare that it is symmetric and non-negative definite (hence our use of\n$(X^{\\top}X)^{-1}$ in the OLS estimator). When is it not positive\ndefinite?\n\nFitted values are $X\\hat{\\beta}=X(X^{\\top}X)^{-1}X^{\\top}Y=HY$. The\n\"hat\" matrix, $H$, projects $Y$ into the column space of $X$. $H$ is\nidempotent: $HH=H$, which makes sense - once you've projected into the\nspace, any subsequent projection just gives you the same thing back. $H$\nis singular. Why? Also, under what special circumstance would it not be\nsingular?\n\n# 3. Computational issues\n\n## Storing matrices\n\nWe've discussed column-major and row-major storage of matrices. First,\nretrieval of matrix elements from memory is quickest when multiple\nelements are contiguous in memory. So in a column-major language (e.g.,\nR, Fortran), it is best to work with values in a common column (or\nentire columns) while in a row-major language (e.g., Python, C) for\nvalues in a common row.\n\nIn some cases, one can save space (and potentially speed) by overwriting\nthe output from a matrix calculation into the space occupied by an\ninput. This occurs in some clever implementations of matrix\nfactorizations.\n\n## Algorithms\n\nGood algorithms can change the efficiency of an algorithm by one or more\norders of magnitude, and many of the improvements in computational speed\nover recent decades have been in algorithms rather than in computer\nspeed.\n\nMost matrix algebra calculations can be done in multiple ways. For\nexample, we could compute $b=Ax$ in either of the following ways,\ndenoted here in pseudocode.\n\n1. Stack the inner products of the rows of $A$ with $x$.\\\n\n```\n for(i=1:n){ \n b_i = 0\n for(j=1:m){\n b_i = b_i + a_{ij} x_j\n }\n }\n```\n\n2. Take the linear combination (based on $x$) of the columns of $A$\\\n\n```\n for(i=1:n){ \n b_i = 0\n }\n for(j=1:m){\n for(i = 1:n){\n b_i = b_i + a_{ij} x_j \n }\n }\n```\n\nIn this case the two approaches involve the same number of operations\nbut the first might be better for row-major matrices (so might be how we\nwould implement in C) and the second for column-major (so might be how\nwe would implement in Fortran). \n\n**Challenge**: check whether the first\napproach is faster in Python. 
(Write the code just doing the outer loop and\ndoing the inner loop using vectorized calculation.)\n\n#### General computational issues\n\nThe same caveats we discussed in terms of computer arithmetic hold\nnaturally for linear algebra, since this involves arithmetic with many\nelements. Good implementations of algorithms are aware of the danger of\ncatastrophic cancellation and of the possibility of dividing by zero or\nby values that are near zero.\n\n## Ill-conditioned problems\n\n#### Basics\n\nA problem is ill-conditioned if small changes to values in the\ncomputation result in large changes in the result. This is quantified by\nsomething called the *condition number* of a calculation. For different\noperations there are different condition numbers.\n\nIll-conditionedness arises most often in terms of matrix inversion, so\nthe standard condition number is the \"condition number with respect to\ninversion\", which when using the $L_{2}$ norm is the ratio of the\nabsolute values of the largest to smallest eigenvalue. Here's an\nexample: $$A=\\left(\\begin{array}{cccc}\n10 & 7 & 8 & 7\\\\\n7 & 5 & 6 & 5\\\\\n8 & 6 & 10 & 9\\\\\n7 & 5 & 9 & 10\n\\end{array}\\right).$$ The solution of $Ax=b$ for $b=(32,23,33,31)$ is\n$x=(1,1,1,1)$, while the solution for $b+\\delta b=(32.1,22.9,33.1,30.9)$\nis $x+\\delta x=(9.2,-12.6,4.5,-1.1)$, where $\\delta$ is notation for a\nperturbation to the vector or matrix.\n\n\n::: {.cell}\n\n```{.python .cell-code}\ndef norm2(x):\n return(np.sum(x**2) ** 0.5)\n\nA = np.array([[10,7,8,7],[7,5,6,5],[8,6,10,9],[7,5,9,10]])\nb = np.array([32,23,33,31])\nx = np.linalg.solve(A, b)\n\nbPerturbed = np.array([32.1, 22.9, 33.1, 30.9])\nxPerturbed = np.linalg.solve(A, bPerturbed)\n\ndelta_b = bPerturbed - b\ndelta_x = xPerturbed - x\n```\n:::\n\n\n\nWhat's going on? Some manipulations with inequalities involving the\ninduced matrix norm (for any chosen vector norm, but we might as well\njust think about the Euclidean norm) (see Gentle-CS Sec. 5.1 or the derivation in class) give\n$$\\frac{\\|\\delta x\\|}{\\|x\\|}\\leq\\|A\\|\\|A^{-1}\\|\\frac{\\|\\delta b\\|}{\\|b\\|}$$\nwhere we define the condition number w.r.t. inversion as\n$\\mbox{cond}(A)\\equiv\\|A\\|\\|A^{-1}\\|$. We'll generally work with the\n$L_{2}$ norm, and for a nonsingular square matrix the result is that the\ncondition number is the ratio of the absolute values of the largest and\nsmallest magnitude eigenvalues. 
This makes sense since $\\|A\\|_{2}$ is\nthe absolute value of the largest magnitude eigenvalue of $A$ and\n$\\|A^{-1}\\|_{2}$ that of the inverse of the absolute value of the\nsmallest magnitude eigenvalue of $A$.\n\nWe see in the code below that the large disparity in eigenvalues of $A$\nleads to an effect predictable from our inequality above, with the\ncondition number helping us find an upper bound.\n\n\n::: {.cell}\n\n```{.python .cell-code}\ne = np.linalg.eig(A)\nevals = e[0]\nprint(evals)\n\n## relative perturbation in x much bigger than in b\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[3.02886853e+01 3.85805746e+00 1.01500484e-02 8.43107150e-01]\n```\n:::\n\n```{.python .cell-code}\nnorm2(delta_x) / norm2(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n8.19847546803699\n```\n:::\n\n```{.python .cell-code}\nnorm2(delta_b) / norm2(b)\n\n## ratio of relative perturbations\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n0.0033319453118976702\n```\n:::\n\n```{.python .cell-code}\n(norm2(delta_x) / norm2(x)) / (norm2(delta_b) / norm2(b))\n\n## ratio of largest and smallest magnitude eigenvalues\n## confusingly evals[2] is the smallest, not evals[3]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n2460.567236431514\n```\n:::\n\n```{.python .cell-code}\n(evals[0]/evals[2])\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n2984.092701676269\n```\n:::\n:::\n\n\nThe main use of these ideas for our purposes is in thinking about the\nnumerical accuracy of a linear system solution (Gentle-NLA Sec 3.4). On\na computer we have the system $$(A+\\delta A)(x+\\delta x)=b+\\delta b$$\nwhere the 'perturbation' is from the inaccuracy of computer numbers. Our\nexploration of computer numbers tells us that\n$$\\frac{\\|\\delta b\\|}{\\|b\\|}\\approx10^{-p};\\,\\,\\,\\frac{\\|\\delta A\\|}{\\|A\\|}\\approx10^{-p}$$\nwhere $p=16$ for standard double precision floating points. Following\nGentle, one gets the approximation\n\n$$\\frac{\\|\\delta x\\|}{\\|x\\|}\\approx\\mbox{cond}(A)10^{-p},$$ so if\n$\\mbox{cond}(A)\\approx10^{t}$, we have accuracy of order $10^{t-p}$\ninstead of $10^{-p}$. (Gentle cautions that this holds only if\n$10^{t-p}\\ll1$). So we can think of the condition number as giving us\nthe number of digits of accuracy lost during a computation relative to\nthe precision of numbers on the computer. E.g., a condition number of\n$10^{8}$ means we lose 8 digits of accuracy relative to our original 16\non standard systems. One issue is that estimating the condition number\nis itself subject to numerical error and requires computation of\n$A^{-1}$ (albeit not in the case of $L_{2}$ norm with square,\nnonsingular $A$) but see Golub and van Loan (1996; p. 76-78) for an\nalgorithm.\n\n#### Improving conditioning\n\nIll-conditioned problems in statistics often arise from collinearity of\nregressors. Often the best solution is not a numerical one, but\nre-thinking the modeling approach, as this generally indicates\nstatistical issues beyond just the numerical difficulties.\n\nA general comment on improving conditioning is that we want to avoid\nlarge differences in the magnitudes of numbers involved in a\ncalculation. In some contexts such as regression, we can center and\nscale the columns to avoid such differences - this will improve the\ncondition of the problem. 
E.g., in simple quadratic regression with\n$x=\\{1990,\\ldots,2010\\}$ (e.g., regressing on calendar years), we see\nthat centering and scaling the matrix columns makes a huge difference on\nthe condition number\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport statsmodels.api as sm\n\nt1 = np.arange(1990, 2011) # naive covariate\nX1 = np.column_stack((np.ones(21), t1, t1 ** 2))\n\nbeta = np.array([5, .1, .0001])\ny = X1 @ beta + np.random.normal(size = len(t1))\n\ne1 = np.linalg.eig(np.dot(X1.T, X1))\nnp.sort(e1[0])[::-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([3.36018564e+14, 7.69949736e+02, 2.24079720e-08])\n```\n:::\n\n```{.python .cell-code}\nnp.linalg.cond(np.dot(X1.T, X1)) # built-in!\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n4.653329826789808e+21\n```\n:::\n\n```{.python .cell-code}\nsm.OLS(y, X1).fit().params\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([-7.08730861e+03, 7.17362346e+00, -1.66371671e-03])\n```\n:::\n\n```{.python .cell-code}\nt2 = t1 - 2000 # centered\nX2 = np.column_stack((np.ones(21), t2, t2 ** 2))\ne2 = np.linalg.eig(np.dot(X2.T, X2))\nwith np.printoptions(suppress=True):\n np.sort(e2[0])[::-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([50677.70427505, 770. , 9.29572495])\n```\n:::\n\n```{.python .cell-code}\nt3 = t2/10 # centered and scaled\nX3 = np.column_stack((np.ones(21), t3, t3 ** 2))\ne3 = np.linalg.eig(np.dot(X3.T, X3))\nwith np.printoptions(suppress=True):\n np.sort(e3[0])[::-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([24.11293487, 7.7 , 1.95366513])\n```\n:::\n:::\n\n\nI haven't shown the OLS results for the transformed version\nof the problem because the transformations modify the true\nparameter values, so there is a bit of arithmetic to be done,\nbut you should be able to verify that the second and third\napproaches give reasonable answers.\n\n\nThe basic story is that simple strategies often solve the problem, and\nthat you should be aware of the absolute and relative magnitudes\ninvolved in your calculations.\n\nOne rule of thumb is to try to work with numbers whose magnitude is\naround 1. We can often scale the values in our problem in order to do\nthis. I.e., change the units of your variables. Instead of personal\nincome in dollars, use personal income in thousands or hundreds of\nthousands of dollars.\n\n# 4. Matrix factorizations (decompositions) and solving systems of linear equations\n\nSuppose we want to solve the following linear system: \n\n$$\\begin{aligned} Ax & = & b\\\\ x & = & A^{-1}b \\end{aligned}$$ \n\n\nNumerically, this is never done by\nfinding the inverse and multiplying. Rather we solve the system using a\nmatrix decomposition (or equivalent set of steps). One approach uses\nGaussian elimination (equivalent to the LU decomposition), while another\nuses the Cholesky decomposition. There are also iterative methods that\ngenerate a sequence of approximations to the solution but reduce\ncomputation (provided they are stopped before the exact solution is\nfound).\n\nGentle-CS has a nice table overviewing the various factorizations (Table\n5.1, page 219). 
I've reproduced a variation on it here.\n\n\n\n| Name | Representation | Restrictions | Properties | Uses |\n| ----------| ----------------------| ------------| --------------------|-------------|\n| LU | $A_{nn}= L_{nn}U_{nn}$ | $A$ generally square | $L$ lower triangular; $U$ upper triangular | solving equations; inversion | \n| QR | $A_{nm}= Q_{nn}R_{nm}$ or $A_{nm}=Q_{nm}R_{mm}$(skinny) | | $Q$ orthogonal; $R$ upper triangular | regression |\n| Cholesky | $A_{nn}=U_{nn}^{\\top}U_{nn}$ | $A$ positive (semi-) definite | $U$ upper triangular | multivariate normal; covariance; solving equations; inversion |\n| Eigen decomposition | $A_{nn}=\\Gamma_{nn}\\Lambda_{nn}\\Gamma_{nn}^{\\top}$ | $A$ square, symmetric*| $\\Gamma$ orthogonal; $\\Lambda$ (non-negative**) diagonal | principal components analysis and related | \n| SVD | $A_{nm}=U_{nn}D_{nm}V_{mm}^{\\top}$ or $A_{nm}= U_{nk}D_{kk}V_{mk}^{\\top}$ | | $U, V$ orthogonal; $D$ (non-negative) diagonal | machine learning, topic models |\n\nTable: Matrix factorizations useful for statistics / data science / machine learning\n\n*For the eigen decomposition, I assume $A$ is symmetric, though there\nis a decomposition for non-symmetric $A$.\n\n** For positive definite or positive semi-definite $A$.\n\n## Triangular systems\n\nAs a preface, let's figure out how to solve $Ax=b$ if $A$ is upper\ntriangular. The basic algorithm proceeds from the bottom up (and\ntherefore is called a 'backsolve'. We solve for $x_{n}$ trivially, and\nthen move upwards plugging in the known values of $x$ and solving for\nthe remaining unknown in each row (each equation).\n\n1. $x_{n}=b_{n}/A_{nn}$\n2. Now for $kp$ then the leading $p$ rows of $R$ provide an upper triangular\nmatrix ($R_{1}$) and the remaining rows are 0. (I'm using $p$ because\nthe QR is generally applied to design matrices in regression). In this\ncase we really only need the first $p$ columns of $Q$, and we have\n$X=Q_{1}R_{1}$, the 'skinny' QR (this is what R's QR provides). For\nuniqueness, we can require the diagonals of $R$ to be nonnegative, and\nthen $R$ will be the same as the upper-triangular Cholesky factor of\n$X^{\\top}X$: \n\n$$\\begin{aligned} X^{\\top}X & = & R^{\\top}Q^{\\top}QR \\\\ & = & R^{\\top}R\\end{aligned}.$$ \n\n\nThere are three standard approaches for\ncomputing the QR, using (1) reflections (Householder transformations),\n(2) rotations (Givens transformations), or (3) Gram-Schmidt\northogonalization (see below for details).\n\nFor $n\\times n$ $X$, the QR (for the Householder approach) requires\n$2n^{3}/3$ flops, so QR is less efficient than LU or Cholesky.\n\nWe can also obtain the pseudo-inverse of $X$ from the QR:\n$X^{+}=[R_{1}^{-1}\\,0]Q^{\\top}$. In the case that $X$ is not full-rank,\nthere is a version of the QR that will work (involving pivoting) and we\nend up with some additional zeroes on the diagonal of $R_{1}$.\n\n### Regression and the QR\n\nOften QR is used to fit linear models, including in R. Consider the\nlinear model in the form $Y=X\\beta+\\epsilon$, finding\n$\\hat{\\beta}=(X^{\\top}X)^{-1}X^{\\top}Y$. Let's consider the skinny QR\nand note that $R^{\\top}$ is invertible. Therefore, we can express the\nnormal equations as \n\n$$\n\\begin{aligned} \nX^{\\top}X\\beta & = & X^{\\top} Y \\\\\nR^{\\top}Q^{\\top}QR\\beta & = & R^{\\top}Q^{\\top} Y \\\\\nR \\beta & = & Q^{\\top} Y\n\\end{aligned}\n$$ \n\n\nand solving for $\\beta$ is just a\nbacksolve since $R$ is upper-triangular. 
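Here's that computation in code (a sketch I've added, using `scipy.linalg.solve_triangular` for the backsolve; the simulated data and variable names are just for illustration):

::: {.cell}

```{.python .cell-code}
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack((np.ones(n), rng.normal(size=(n, p - 1))))
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=n)

Q, R = np.linalg.qr(X)                    # skinny QR
beta_hat = solve_triangular(R, Q.T @ y)   # backsolve R beta = Q^T y

np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0])
```
:::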
Furthermore the standard\nregression quantities, such as the hat matrix, the SSE, the residuals,\netc. can be easily expressed in terms of $Q$ and $R$.\n\nWhy use the QR instead of the Cholesky on $X^{\\top}X$? The condition\nnumber of $X$ is the square root of that of $X^{\\top}X$, and the $QR$\nfactorizes $X$. Monahan has a discussion of the condition of the\nregression problem, but from a larger perspective, the situations where\nnumerical accuracy is a concern are generally cases where the OLS\nestimators are not particularly helpful anyway (e.g., highly collinear\npredictors).\n\nWhat about computational order of the different approaches to least\nsquares? The Cholesky is $np^{2}+\\frac{1}{3}p^{3}$, an algorithm called\nsweeping is $np^{2}+p^{3}$ , the Householder method for QR is\n$2np^{2}-\\frac{2}{3}p^{3}$, and the modified Gram-Schmidt approach for\nQR is $2np^{2}$. So if $n\\gg p$ then Cholesky (and sweeping) are faster\nthan the QR approaches. According to Monahan, modified Gram-Schmidt is\nmost numerically stable and sweeping least. In general, regression is\npretty quick unless $p$ is large since it is linear in $n$, so it may\nnot be worth worrying too much about computational differences of the\nsort noted here.\n\n### Regression and the QR in Python and R\n\nWe can get the Q and R matrices easily in Python.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nQ,R = np.linalg.qr(X)\n```\n:::\n\n\nOne of the methods used by the [`statsmodel` package in Python\nuses the QR to fit a regression](https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.fit.html).\n\nNote that by default in Python (and in R), you get the skinny QR, namely only the\nfirst $p$ rows of $R$ and the first $p$ columns of $Q$, where the latter\nform an orthonormal basis for the column space of $X$. The remaining\ncolumns form an orthonormal basis for the null space of $X$ (the space\northogonal to the column space of $X$). The analogy in regression is\nthat we get the basis vectors for the regression, while adding the\nremaining columns gives us the full $n$-dimensional space of the\nobservations.\n\nRegression in R uses the QR decomposition via `qr()`, which calls a\nFortran function. `qr()` (and the Fortran functions that are called) is\nspecifically designed to output quantities useful in fitting linear\nmodels. \n\nIn R, `qr()` returns the result as a list meant for use by other tools. R\nstores the $R$ matrix in the upper triangle of `\\$qr`, while the lower\ntriangle of *\\$qr* and *\\$aux* store the information for constructing\n$Q$ (this relates to the Householder-related vectors $u$ below). One can\nmultiply by $Q$ using *qr.qy()* and by $Q^{\\top}$ using *qr.qty()*. If\nyou want to extract $R$ and $Q$, the following will work:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nX.qr = qr(X)\nQ = qr.Q(X.qr)\nR = qr.R(X.qr) \n```\n:::\n\n\nAs a side note, there are QR-based functions that provide\nregression-related quantities, such as *qr.resid()*, *qr.fitted()* and\n*qr.coef()*. These functions (and their Fortran counterparts) exist\nbecause one can work through the various regression quantities of\ninterest and find their expressions in terms of $Q$ and $R$, with nice\nproperties resulting from $Q$ being orthogonal and $R$ triangular.\n\n### Computing the QR decomposition\n\nHere we'll see some of the details of the different approaches to the\nQR, in part because they involve some concepts that may be useful in\nother contexts. 
I won't expect you to see all of how this works, but\nplease skim through this to get an idea of how things are done.\n\nOne approach involves reflections of vectors and a second rotations of\nvectors. Reflections and rotations are transformations that are\nperformed by orthogonal matrices. The determinant of a reflection matrix\nis -1 and the determinant of a rotation matrix is 1. We'll see some of\nthe details in the demo code.\n\n#### QR Method 1: Reflections\n\nIf $u$ and $v$ are orthonormal vectors and $x$ is in the space spanned\nby $u$ and $v$, $x=c_{1}u+c_{2}v$, then $\\tilde{x}=-c_{1}u+c_{2}v$ is a\nreflection (a *Householder* reflection) along the $u$ dimension (since\nwe are using the negative of that basis vector). We can think of this as\nreflecting across the plane perpendicular to $u$. This extends simply to\nhigher dimensions with orthonormal vectors, $u,v_{1},v_{2},\\ldots$\n\nSuppose we want to formulate the reflection in terms of a \"Householder\"\nmatrix, $Q$. It turns out that $$Qx=\\tilde{x}$$ if $Q=I-2uu^{\\top}$. $Q$\nhas the following properties: (1) $Qu=-u$, (2) $Qv=v$ for $u^{\\top}v=0$,\n(3) $Q$ is orthogonal and symmetric.\n\nOne way to create the QR decomposition is by a series of Householder\ntransformations that create an upper triangular $R$ from $X$:\n\n$$\n\\begin{aligned}\nR & = & Q_{p}\\cdots Q_{1} X \\\\\nQ & = & (Q_{p}\\cdots Q_{1})^{\\top}\n\\end{aligned}\n$$ \n\n\nwhere we make use of\nthe symmetry in defining $Q$.\n\nBasically $Q_{1}$ reflects the first column of $X$ with respect to a\ncarefully chosen $u$, so that the result is all zeroes except for the\nfirst element. We want $Q_{1}x=\\tilde{x}=(||x||,0,\\ldots,0)$. This can\nbe achieved with $u=\\frac{x-\\tilde{x}}{||x-\\tilde{x}||}$. Then $Q_{2}$\nmakes the last $n-2$ rows of the second column equal to zero. We'll work\nthrough this a bit in class.\n\nIn the regression context, as we work through the individual\ntransformations, $Q_{j}=I-2u_{j}u_{j}^{\\top}$, we apply them to $X$ and\n$Y$ to create $R$ (note this would not involve doing the full matrix\nmultiplication - think about what calculations are actually needed) and\n$QY=Q^{\\top}Y$, and then solve $R\\beta=Q^{\\top}Y$. To find\n$\\mbox{Cov}(\\hat{\\beta})\\propto(X^{\\top}X)^{-1}=(R^{\\top}R)^{-1}=R^{-1}R^{-\\top}$\nwe do need to invert $R$, but it's upper-triangular and of dimension\n$p\\times p$. It turns out that $Q^{\\top}Y$ can be partitioned into the\nfirst $p$ and the last $n-p$ elements, $z^{(1)}$ and $z^{(2)}$. The SSR\nis $\\|z^{(1)}\\|^{2}$ and SSE is $\\|z^{(2)}\\|^{2}$.\n\nFinal side note: if $X$ is square (so $n=p)$ you might wonder why we\nneed $Q_{p}$ since after $p-1$ reflections, we don't need to zero\nanything else out (since the last column of $R$ has $n$ non-zero\nelements). It turns out that if we go back to thinking about a\nHouseholder reflection in general, there is a lack of uniqueness in\nchoosing $\\tilde{x}$. It could either be $(||x||,0,\\ldots,0)$ or\n$(-||x||,0,\\ldots,0)$. For better numerical stability, one chooses from\nthe two of those such that $x_{1}$ is of the opposite sign to\n$\\tilde{x}_{1}$, so that one avoids cancellation of numbers that may be\nof the same magnitude when doing $x-\\tilde{x}$. The transformation\n$Q_{p}$ is the last step of taking that approach of choosing the sign at\neach step. $Q_{p}$ doesn't zero anything out; it just basically just\ninvolves potentially setting $R_{pp}$ to be $-R_{pp}$. 
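\n\nTo see a reflection numerically, here is a small sketch (using a random\nmatrix purely for illustration) that forms $Q_{1}=I-2uu^{\\top}$ with\n$u=(x-\\tilde{x})/\\|x-\\tilde{x}\\|$ and checks that it zeroes out all but\nthe first element of the first column of $X$.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\nnp.random.seed(1)\nX = np.random.normal(size = (6, 3))\nx = X[:, 0]\nxtilde = np.zeros(len(x))\nxtilde[0] = np.linalg.norm(x)  # in practice, choose the sign to avoid cancellation\nu = (x - xtilde) / np.linalg.norm(x - xtilde)\nQ1 = np.eye(len(x)) - 2 * np.outer(u, u)\nnp.round(Q1 @ X, 10)  # first column is now (||x||, 0, ..., 0)\n```\n:::\n\n\n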
(To be honest,\nI'm not clear on why one would bother to do that last step, but that\nseems to be how it is presented in discussions of the Householder\napproach.) Of course in the case of $p2$, find interim vectors, $x_{k}^{(2)}$, by\n orthogonalizing with respect to $\\tilde{x}_{1}$\n\n3. Proceed for $k=3,\\ldots$, in turn orthogonalizing and normalizing\n the first of the remaining vectors w.r.t. $\\tilde{x}_{k-1}$ and\n orthogonalizing the remaining vectors w.r.t. $\\tilde{x}_{k-1}$ to\n get new interim vectors\n\nMathematically, we could instead orthogonalize $x_{2}$ w.r.t.\n$\\tilde{x}_{1}$, then orthogonalize $x_{3}$ w.r.t.\n$\\{\\tilde{x}_{1},\\tilde{x}_{2}\\}$, etc. The algorithm above is the\n*modified* G-S, and is known to be more numerically stable if the\ncolumns of $X$ are close to collinear, giving vectors that are closer to\northogonal. The resulting $\\tilde{x}$ vectors are the columns of $Q$.\nThe elements of $R$ are obtained as we proceed: the diagonal values are\nthe the normalization values in the denominators, while the\noff-diagonals are the inner products with the already-computed columns\nof $Q$ that are computed as part of the numerators.\n\nAnother way to think about this is that $R=Q^{\\top}X$, which is the same\nas regressing the columns of $X$ on $Q,$ since\n$(Q^{\\top}Q)^{-1}Q^{\\top}X=Q^{\\top}X$. By construction, the first column\nof $X$ is a scaling of the first column of $Q$, the second column of $X$\nis a linear combination of the first two columns of $Q$, etc., so $R$\nbeing upper triangular makes sense.\n\n### The \"tall-skinny\" QR\n\nSuppose you have a very large regression problem, with $n$ very large,\nand $n\\gg p$. There is a variant of the QR, called the tall-skinny QR\n(see for details) that allows us\nto find the decomposition in a parallel fashion. The basic idea is to do\na nested set of QR decompositions on blocks of rows of $X$:\n\n$$\nX = \\left( \\begin{array}{c}\nX_{0} \\\\\nX_{1} \\\\\nX_{2} \\\\\nX_{3}\n\\end{array}\n\\right) =\n\\left(\n\\begin{array}{c}\nQ_{0} R_{0} \\\\\nQ_{1} R_{1} \\\\\nQ_{2} R_{2} \\\\\nQ_{3} R_{3}\n\\end{array} \\right),\n$$\n\nfollowed by 'reduction' steps (this\ncan be done in a map-reduce context) that do the $QR$ of pairs of the\n$R$ factors: \n$$\\left(\\begin{array}{c}\nR_{0}\\\\\nR_{1}\\\\\nR_{2}\\\\\nR_{3}\n\\end{array}\\right)=\\left(\\begin{array}{c}\n\\left(\\begin{array}{c}\nR_{0}\\\\\nR_{1}\n\\end{array}\\right)\\\\\n\\left(\\begin{array}{c}\nR_{2}\\\\\nR_{3}\n\\end{array}\\right)\n\\end{array}\\right)=\\left(\\begin{array}{c}\nQ_{01}R_{01}\\\\\nQ_{23}R_{23}\n\\end{array}\\right)$$ and $$\\left(\\begin{array}{c}\nR_{01}\\\\\nR_{23}\n\\end{array}\\right)=Q_{0123}R_{0123}.$$ \n\nThe full decomposition is then\n\n$$X=\\left( \\begin{array}{cccc} Q_{0} & 0 & 0 & 0 \\\\ 0 & Q_{1} & 0 & 0 \\\\ 0 & 0 & Q_{2} & 0 \\\\ 0 & 0 & 0 & Q_{3} \\end{array} \\right) \\left( \\begin{array}{cc} Q_{01} & 0 \\\\ 0 & Q_{23} \\end{array} \\right) Q_{0123} R_{0123} = QR.$$\n\n\nThe computation can be done in\nparallel (in particular it can be done with map-reduce) and the $Q$\nmatrix for big problems would generally not be computed explicitly but\nwould be stored in its constituent pieces.\n\nAlternatively, there is a variant on the algorithm that processes the\nrow-blocks of $X$ serially, allowing you to do QR on a large tall-skinny\nmatrix that you can't fit in memory (or possibly even on disk). First\nyou do $QR$ on $X_{0}$ to get $Q_{0}R_{0}$. Then you stack $R_{0}$ on\ntop of $X_{1}$ and do QR to get $R_{01}$. 
Then stack $R_{01}$ on top of\n$X_{2}$ to get $R_{012}$, etc.\n\n## Determinants\n\nThe absolute value of the determinant of a square matrix can be found\nfrom the product of the diagonals of the triangular matrix in any\nfactorization that gives a triangular (including diagonal) matrix times\nan orthogonal matrix (or matrices) since the determinant of an\northogonal matrix is either one or minus one.\n\n$|A|=|QR|=|Q||R|=\\pm|R|$\n\n$|A^{\\top}A|=|(QR)^{\\top}QR|=|R^{\\top}R|=|R_{1}^{\\top}R_{1}|=|R_{1}|^{2}$\\\nIn Python, the following will do it (on the log scale).\n\n\n::: {.cell}\n\n```{.python .cell-code}\nQ,R = qr(A)\nmagn = np.sum(np.log(np.abs(np.diag(R)))) \n```\n:::\n\n\nAn alternative is the product of the diagonal elements of $D$ (the\nsingular values) in the SVD factorization, $A=UDV^{\\top}$.\n\nFor non-negative definite matrices, we know the determinant is\nnon-negative, so the uncertainty about the sign is not an issue. For\npositive definite matrices, a good approach is to use the product of the\ndiagonal elements of the Cholesky decomposition.\n\nOne can also use the product of the eigenvalues:\n$|A|=|\\Gamma\\Lambda\\Gamma^{-1}|=|\\Gamma||\\Gamma^{-1}||\\Lambda|=|\\Lambda|$\n\n#### Computation\n\nComputing from any of these diagonal or triangular matrices as the\nproduct of the diagonals is prone to overflow and underflow, so we\n**always** work on the log scale as the sum of the log of the values.\nWhen some of these may be negative, we can always keep track of the\nnumber of negative values and take the log of the absolute values.\n\nOften we will have the factorization as a result of other parts of the\ncomputation, so we get the determinant for free.\n\nWe can use `np.linalg.logdet()` or (definitely not recommended)\n`np.linalg.det()` to calculate the determinant in Python.\nThese functions use the LU decomposition.\n\n\n# 5. Eigendecomposition and SVD\n\n## Eigendecomposition \n\nThe eigendecomposition (spectral decomposition) is useful in considering\nconvergence of algorithms and of course for statistical decompositions\nsuch as PCA. We think of decomposing the components of variation into\northogonal patterns (the eigenvectors) with variances (eigenvalues)\nassociated with each pattern.\n\nSquare symmetric matrices have real eigenvectors and eigenvalues, with\nthe factorization into orthogonal $\\Gamma$ and diagonal $\\Lambda$,\n$A=\\Gamma\\Lambda\\Gamma^{\\top}$, where the eigenvalues on the diagonal of\n$\\Lambda$ are ordered in decreasing value. Of course this is equivalent\nto the definition of an eigenvalue/eigenvector pair as a pair such that\n$Ax=\\lambda x$ where $x$ is the eigenvector and $\\lambda$ is a scalar,\nthe eigenvalue. The inverse of the eigendecomposition is simply\n$\\Gamma\\Lambda^{-1}\\Gamma^{\\top}$. On a similar note, we can create a\nsquare root matrix, $\\Gamma\\Lambda^{1/2}$, by taking the square roots of\nthe eigenvalues.\n\nThe spectral radius of $A$, denoted $\\rho(A)$, is the maximum of the\nabsolute values of the eigenvalues. As we saw when talking about\nill-conditionedness, for symmetric matrices, this maximum is the induced\nnorm, so we have $\\rho(A)=\\|A\\|_{2}$. It turns out that\n$\\rho(A)\\leq\\|A\\|$ for any induced matrix norm. The spectral radius\ncomes up in determining the rate of convergence of some iterative\nalgorithms.\n\n#### Computation\n\nThere are several methods for eigenvalues; a common one for doing the\nfull eigendecomposition is the *QR algorithm*. 
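\n\nIn practice we just call a library routine; for instance, a quick sketch\nfor a symmetric matrix `A` assumed to already be in memory would use\n`numpy.linalg.eigh()`, the routine intended for symmetric (Hermitian)\nmatrices. The discussion below describes the sort of algorithm such\nroutines (or close relatives) use under the hood.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\n# assumes a symmetric matrix A is already defined\nevals, evecs = np.linalg.eigh(A)  # eigenvalues in increasing order; eigenvectors in the columns\nnp.allclose(evecs @ np.diag(evals) @ evecs.T, A)  # A = Gamma Lambda Gamma^T\n```\n:::\n\n\n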
The first step is to\nreduce $A$ to upper Hessenburg form, which is an upper triangular matrix\nexcept that the first subdiagonal in the lower triangular part can be\nnon-zero. For symmetric matrices, the result is actually tridiagonal. We\ncan do the reduction using Householder reflections or Givens rotations.\nAt this point the QR decomposition (using Givens rotations) is applied\niteratively (to a version of the matrix in which the diagonals are\nshifted), and the result converges to a diagonal matrix, which provides\nthe eigenvalues. It's more work to get the eigenvectors, but they are\nobtained as a product of Householder matrices (required for the initial\nreduction) multiplied by the product of the $Q$ matrices from the\nsuccessive QR decompositions.\n\nWe won't go into the algorithm in detail, but note that it involves\nmanipulations and ideas we've seen already.\n\nIf only the largest (or the first few largest) eigenvalues and their\neigenvectors are needed, which can come up in time series and Markov\nchain contexts, the problem is easier and can be solved by the *power\nmethod*. E.g., in a Markov chain context, steady state is reached\nthrough $x_{t}=A^{t}x_{0}$. One can find the largest eigenvector by\nmultiplying by $A$ many times, normalizing at each step.\n$v^{(k)}=Az^{(k-1)}$ and $z^{(k)}=v^{(k)}/\\|v^{(k)}\\|$. There is an\nextension to find the $p$ largest eigenvalues and their vectors. See the\ndemo code in the qmd source file for an implementation (in R).\n\n\n\n\n\n## Singular value decomposition\n\nLet's consider an $n\\times m$ matrix, $A$, with $n\\geq m$ (if $m>n$, we\ncan always work with $A^{\\top})$. This often is a matrix representing\n$m$ features of $n$ observations. We could have $n$ documents and $m$\nwords, or $n$ gene expression levels and $m$ experimental conditions,\netc. $A$ can always be decomposed as $$A=UDV^{\\top}$$ where $U$ and $V$\nare matrices with orthonormal columns (left and right eigenvectors) and\n$D$ is diagonal with non-negative values (which correspond to\neigenvalues in the case of square $A$ and to squared eigenvalues of\n$A^{\\top}A$).\n\nThe SVD can be represented in more than one way. One representation is\n$$A_{n\\times m}=U_{n\\times k}D_{k\\times k}V_{k\\times m}^{\\top}=\\sum_{j=1}^{k}D_{jj}u_{j}v_{j}^{\\top}$$\nwhere $u_{j}$ and $v_{j}$ are the columns of $U$ and $V$ and where $k$\nis the rank of $A$ (which is at most the minimum of $n$ and $m$ of\ncourse). The diagonal elements of $D$ are the singular values.\n\nThat representation is as the sum of rank-one matrices (since\neach term is the scaled outer product of two vectors). \n\nIf $A$ is positive semi-definite, the eigendecomposition is an SVD.\nFurthermore, $A^{\\top}A=VD^{2}V^{\\top}$ and $AA^{\\top}=UD^{2}U^{\\top}$,\nso we can find the eigendecomposition of such matrices using the SVD of\n$A$ (for $AA^{\\top}$ we need to fill out $U$ to have $n$ columns). 
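\n\nHere is a quick numerical check of that relationship (a sketch using a\nrandom matrix).\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\nnp.random.seed(1)\nA = np.random.normal(size = (5, 3))\nU, d, Vt = np.linalg.svd(A, full_matrices = False)\nevals = np.linalg.eigh(A.T @ A)[0]  # ascending order\nevals[::-1]  # should match the squared singular values (up to numerical error)\nd ** 2\n```\n:::\n\n\n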
Note\nthat the squares of the singular values of $A$ are the eigenvalues of\n$A^{\\top}A$ and $AA^{\\top}$.\n\nWe can also fill out the matrices to get\n$$A=U_{n\\times n}D_{n\\times m}V_{m\\times m}^{\\top}$$ where the added\nrows and columns of $D$ are zero with the upper left block the\n$D_{k\\times k}$ from above.\n\n#### Uses\n\nThe SVD is an excellent way to determine a matrix rank and to construct\na pseudo-inverse ($A^{+}=VD^{+}U^{\\top})$.\n\nWe can use the SVD to approximate $A$ by taking\n$A\\approx\\tilde{A}=\\sum_{j=1}^{p}D_{jj}u_{j}v_{j}^{\\top}$ for $pj+p$ and the upper bandwidth is $q$ ($A_{ij}=0$\nfor $j>i+q$). An alternative to reducing to $Ux=b^{*}$ is to compute\n$A=LU$ and then do two solutions, $U^{-1}(L^{-1}b)$. One can show that\nthe computational complexity of the LU factorization is $O(npq)$ for\nbanded matrices, while solving the two triangular systems is $O(np+nq)$,\nso for small $p$ and $q$, the speedup can be dramatic.\n\nBanded matrices come up in time series analysis. E.g., moving average (MA) models produce\nbanded covariance structures because the covariance is zero after a\ncertain number of lags.\n\n## Low rank updates (optional)\n\nA transformation of the form $A-uv^{\\top}$ is a rank-one update because\n$uv^{\\top}$ is of rank one.\n\nMore generally a low rank update of $A$ is $\\tilde{A}=A-UV^{\\top}$ where\n$U$ and $V$ are $n\\times m$ with $n\\geq m$. The\nSherman-Morrison-Woodbury formula tells us that\n$$\\tilde{A}^{-1}=A^{-1}+A^{-1}U(I_{m}-V^{\\top}A^{-1}U)^{-1}V^{\\top}A^{-1}$$\nso if we know $x_{0}=A^{-1}b$, then the solution to $\\tilde{A}x=b$ is\n$x+A^{-1}U(I_{m}-V^{\\top}A^{-1}U)^{-1}V^{\\top}x$. Provided $m$ is not\ntoo large, and particularly if we already have a factorization of $A$,\nthen $A^{-1}U$ is not too bad computationally, and\n$I_{m}-V^{\\top}A^{-1}U$ is $m\\times m$. As a result\n$A^{-1}(U(\\cdots)^{-1}V^{\\top}x)$ isn't too bad.\n\nThis also comes up in working with precision matrices in Bayesian\nproblems where we may have $A^{-1}$ but not $A$ (we often add precision\nmatrices to find conditional normal distributions). An alternative\nexpression for the formula is $\\tilde{A}=A+UCV^{\\top}$, and the identity\ntells us\n$$\\tilde{A}^{-1}=A^{-1}-A^{-1}U(C^{-1}+V^{\\top}A^{-1}U)^{-1}V^{\\top}A^{-1}$$\n\nBasically Sherman-Morrison-Woodbury gives us matrix identities that we\ncan use in combination with our knowledge of smart ways of solving\nsystems of equations.\n\n# 7. Iterative solutions of linear systems (optional)\n\n#### Gauss-Seidel\n\nSuppose we want to iteratively solve $Ax=b$. Here's the algorithm, which\nsequentially updates each element of $x$ in turn.\n\n- Start with an initial approximation, $x^{(0)}$.\n- Hold all but $x_{1}^{(0)}$ constant and solve to find\n $x_{1}^{(1)}=\\frac{1}{a_{11}}(b_{1}-\\sum_{j=2}^{n}a_{1j}x_{j}^{(0)})$.\n- Repeat for the other rows of $A$ (i.e., the other elements of $x$),\n finding $x^{(1)}$.\n- Now iterate to get $x^{(2)}$, $x^{(3)}$, etc. until a convergence\n criterion is achieved, such as $\\|x^{(k)}-x^{(k-1)}\\|\\leq\\epsilon$\n or $\\|r^{(k)}-r^{(k-1)}\\|\\leq\\epsilon$ for $r^{(k)}=b-Ax^{(k)}$.\n\nLet's consider how many operations are involved in a single update:\n$O(n)$ for each element, so $O(n^{2})$ for each update. 
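\n\nTo make the updates concrete, here is a minimal sketch of a few\nGauss-Seidel sweeps in Python, assuming `A` and `b` are already defined\nand that the diagonal of `A` is non-zero; a real implementation would\nalso check a convergence criterion like those above.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\ndef gauss_seidel(A, b, x0, n_sweeps = 50):\n    # Each sweep updates the elements of x in turn, using the newest values.\n    x = x0.astype(float)  # copies x0\n    n = len(b)\n    for sweep in range(n_sweeps):\n        for i in range(n):\n            x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]\n    return x\n```\n:::\n\n\n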
Thus if we can\nstop well before $n$ iterations, we've saved computation relative to\nexact methods.\n\nIf we decompose $A=L+D+U$ where $L$ is strictly lower triangular, $U$ is\nstrictly upper triangular, then Gauss-Seidel is equivalent to solving\n$$(L+D)x^{(k+1)}=b-Ux^{(k)}$$ and we know that solving the lower\ntriangular system is $O(n^{2})$.\n\nIt turns out that the rate of convergence depends on the spectral radius\nof $(L+D)^{-1}U$.\n\nGauss-Seidel amounts to optimizing by moving in axis-oriented\ndirections, so it can be slow in some cases.\n\n#### Conjugate gradient\n\nFor positive definite $A$, conjugate gradient (CG) reexpresses the\nsolution to $Ax=b$ as an optimization problem, minimizing\n$$f(x)=\\frac{1}{2}x^{\\top}Ax-x^{\\top}b,$$ since the derivative of $f(x)$\nis $Ax-b$ and at the minimum this gives $Ax-b=0$.\n\nInstead of finding the minimum by following the gradient at each step\n(so-called steepest descent, which can give slow convergence - we'll see\na demonstration of this in the optimization unit), CG chooses directions\nthat are mutually conjugate w.r.t. $A$, $d_{i}^{\\top}Ad_{j}=0$ for\n$i\\ne j$. The method successively chooses vectors giving the direction,\n$d_{k}$, in which to move down towards the minimum and a scaling of how\nmuch to move, $\\alpha_{k}$. If we start at $x_{(0)}$, the $k$th point we\nmove to is $x_{(k)}=x_{(k-1)}+\\alpha_{k}d_{k}$ so we have\n$$x_{(k)}=x_{(0)}+\\sum_{j\\leq k}\\alpha_{j}d_{j}$$ and we use a\nconvergence criterion such as given above for Gauss-Seidel. The\ndirections are chosen to be the residuals, $b-Ax_{(k)}$. Here's the\nbasic algorithm:\n\n- Choose $x_{(0)}$ and define the residual, $r_{(0)}=b-Ax_{(0)}$ (the\n error on the scale of $b$) and the direction, $d_{0}=r_{(0)}$ and\n set $k=0$.\n\n- Then iterate:\n\n - $\\alpha_{k}=\\frac{r_{(k)}^{\\top}r_{(k)}}{d_{k}^{\\top}Ad_{k}}$\n (choose step size so next error will be orthogonal to current\n direction - which we can express in terms of the residual, which\n is easily computable)\n\n - $x_{(k+1)}=x_{(k)}+\\alpha_{k}d_{k}$ (update current value)\n\n - $r_{(k+1)}=r_{(k)}-\\alpha_{k}Ad_{k}$ (update current residual)\n\n - $d_{k+1}=r_{(k+1)}+\\frac{r_{(k+1)}^{\\top}r_{(k+1)}}{r_{(k)}^{\\top}r_{(k)}}d_{k}$\n (choose next direction by conjugate Gram-Schmidt, starting with\n $r_{(k+1)}$ and removing components that are not $A$-orthogonal\n to previous directions, but it turns out that $r_{(k+1)}$ is\n already $A$-orthogonal to all but $d_{k}$).\n\n- Stop when $\\|r^{(k+1)}\\|$ is sufficiently small.\n\nThe convergence of the algorithm depends in a complicated way on the\neigenvalues, but in general convergence is faster when the condition\nnumber is smaller (the eigenvalues are not too spread out). 
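\n\nHere is a minimal sketch of the iteration above in Python, for\nillustration only and assuming a positive definite `A`; in practice one\nwould typically use a library routine such as `scipy.sparse.linalg.cg()`.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\ndef conjugate_gradient(A, b, x0, n_iter = 25):\n    # Basic CG for positive definite A; in practice, stop based on the residual.\n    x = x0.astype(float)\n    r = b - A @ x   # initial residual\n    d = r.copy()    # initial direction\n    for k in range(n_iter):\n        Ad = A @ d\n        alpha = (r @ r) / (d @ Ad)  # step size\n        x = x + alpha * d\n        r_new = r - alpha * Ad      # updated residual\n        beta = (r_new @ r_new) / (r @ r)\n        d = r_new + beta * d        # next direction, A-orthogonal to the previous one\n        r = r_new\n    return x\n```\n:::\n\n\n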
CG will in\nprinciple give the exact answer in $n$ steps (where $A$ is $n\\times n$).\nHowever, computationally we lose accuracy and interest in the algorithm\nis really as an iterative approximation where we stop before $n$ steps.\nThe approach basically amounts to moving in axis-oriented directions in\na space stretched by $A$.\n\nIn general, CG is used for large sparse systems.\n\nSee the [extensive description from\nShewchuk](http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf)\nfor more details, as well as the use\nof CG when $A$ is not positive definite.\n\n#### Updating a solution\n\nSometimes we have solved a system, $Ax=b$ and then need to solve $Ax=c$.\nIf we have solved the initial system using a factorization, we can reuse\nthat factorization and solve the new system in $O(n^{2})$. Iterative\napproaches can do a nice job if $c=b+\\delta b$. Start with the solution\n$x$ for $Ax=b$ as $x^{(0)}$ and use one of the methods above.\n", + "markdown": "---\ntitle: \"Numerical linear algebra\"\nauthor: \"Chris Paciorek\"\ndate: \"2023-10-24\"\nformat:\n pdf:\n documentclass: article\n margin-left: 30mm\n margin-right: 30mm\n toc: true\n html:\n theme: cosmo\n css: ../styles.css\n toc: true\n code-copy: true\n code-block-background: true\nexecute:\n freeze: auto\nfrom: markdown+tex_math_single_backslash\n---\n\n\n\n[PDF](./unit10-linalg.pdf){.btn .btn-primary}\n\nReferences:\n\n- Gentle: Numerical Linear Algebra for Applications in Statistics\n (available via UC Library Search) (my notes here are based primarily\n on this source) [Gentle-NLA]\n - Gentle: Matrix Algebra also has much of this material.\n- Gentle: Computational Statistics [Gentle-CS]\n- Lange: Numerical Analysis for Statisticians\n- Monahan: Numerical Methods of Statistics\n\nVideos (optional): \n\nThere are various videos from 2020 in the bCourses Media Gallery that you\ncan use for reference if you want to. \n\n - Video 1. Ill-conditioned problems, part 1\n - Video 2. Ill-conditioned problems, part 2\n - Video 3. Triangular systems of equations\n - Video 4. Solving systems of equations via LU, part 1\n - Video 5. Solving systems of equations via LU, part 2\n - Video 6. Solving systems of equations via LU, part 3\n - Video 7. Cholesky decomposition\n\n\nIn working through how to compute something or understanding an\nalgorithm, it can be very helpful to depict the matrices and vectors\ngraphically. We'll see this on the board in class.\n\n# 1. Preliminaries\n\n## Context\n\nMany statistical and machine learning methods involve linear algebra of\nsome sort - at the very least matrix multiplication and very often some\nsort of matrix decomposition to fit models and do analysis: linear\nregression, various more sophisticated forms of regression, deep neural\nnetworks, principle components analysis (PCA) and the wide varieties of\ngeneralizations and variations on PCA, etc., etc.\n\n## Goals\n\nHere's what I'd like you to get out of this unit:\n\n1. How to think about the computational order (number of computations\n involved) of a problem\n2. How to choose a computational approach to a given linear algebra\n calculation you need to do.\n3. 
An understanding of how issues with computer numbers (Unit 8) affect\n linear algebra calculations.\n\n## Key principle\n\n**The form of a mathematical expression and how it should be evaluated\non a computer may be very different.** Better computational approaches\ncan increase speed and improve the numerical properties of the\ncalculation.\n\n- Example 1 (already seen in Unit 5): If $X$ and $Y$ are matrices and $z$\nis a vector, we should compute $X(Yz)$ rather than $(XY)z$; the former\nis much more computationally efficient. \n- Example 2: We do not compute $(X^{\\top}X)^{-1}X^{\\top}Y$ by computing\n$X^{\\top}X$ and finding its inverse. In fact, perhaps more surprisingly,\nwe may never actually form $X^{\\top}X$ in some implementations.\n- Example 3: Suppose I have a matrix $A$, and I want to permute (switch)\ntwo rows. I can do this with a permutation matrix, $P$, which is mostly\nzeroes. On a computer, in general I wouldn't need to even change the\nvalues of $A$ in memory in some cases (e.g., if I were to calculate\n$PAB$). Why not?\n\n## Computational complexity\n\nWe can assess the computational complexity of a linear algebra\ncalculation by counting the number multiplys/divides and the number of\nadds/subtracts. Sidenote: addition is a bit faster than multiplication,\nso some algorithms attempt to trade multiplication for addition.\n\nIn general we do not try to count the actual number of calculations, but\njust their order, though in some cases in this unit we'll actually get a\nmore exact count. In general, we denote this as $O(f(n))$ which means\nthat the number of calculations approaches $cf(n)$ as $n\\to\\infty$\n(i.e., we know the calculation is approximately proportional to $f(n)$).\nConsider matrix multiplication, $AB$, with matrices of size $a\\times b$\nand $b\\times c$. Each column of the second matrix is multiplied by all\nthe rows of the first. For any given inner product of a row by a column,\nwe have $b$ multiplies. We repeat these operations for each column and\nthen for each row, so we have $abc$ multiplies so $O(abc)$ operations.\nWe could count the additions as well, but there's usually an addition\nfor each multiply, so we can usually just count the multiplys and then\nsay there are such and such {multiply and add}s. This is Monahan's\napproach, but you may see other counting approaches where one counts the\nmultiplys and the adds separately.\n\nFor two symmetric, $n\\times n$ matrices, this is $O(n^{3})$. Similarly,\nmatrix factorization (e.g., the Cholesky decomposition) is $O(n^{3})$\nunless the matrix has special structure, such as being sparse. As\nmatrices get large, the speed of calculations decreases drastically\nbecause of the scaling as $n^{3}$ and memory use increases drastically.\nIn terms of memory use, to hold the result of the multiply indicated\nabove, we need to hold $ab+bc+ac$ total elements, which for symmetric\nmatrices sums to $3n^{2}$. So for a matrix with $n=10000$, we have\n$3\\cdot10000^{2}\\cdot8/1e9=2.4$Gb.\n\nWhen we have $O(n^{q})$ this is known as polynomial time. Much worse is\n$O(b^{n})$ (exponential time), while much better is $O(\\log n$) (log\ntime). Computer scientists talk about NP-complete problems; these are\nessentially problems for which there is not a polynomial time\nalgorithm - it turns out all such problems can be rewritten such that\nthey are equivalent to one another.\n\nIn real calculations, it's possible to have the actual time ordering of\ntwo approaches differ from what the order approximations tell us. 
For\nexample, something that involves $n^{2}$ operations may be faster than\none that involves $1000(n\\log n+n)$ even though the former is $O(n^{2})$\nand the latter $O(n\\log n)$. The problem is that the constant, $c=1000$,\ncan matter (depending on how big $n$ is), as can the extra calculations\nfrom the lower order term(s), in this case $1000n$.\n\nA note on terminology: *flops* stands for both floating point operations\n(the number of operations required) and floating point operations per\nsecond, the speed of calculation.\n\n## Notation and dimensions\n\nI'll try to use capital letters for matrices, $A$, and lower-case for\nvectors, $x$. Then $x_{i}$ is the ith element of $x$, $A_{ij}$ is the\n$i$th row, $j$th column element, and $A_{\\cdot j}$ is the $j$th column\nand $A_{i\\cdot}$ the $i$th row. By default, we'll consider a vector,\n$x$, to be a one-column matrix, and $x^{\\top}$ to be a one-row matrix.\nSome of the references given at the start of this Unit also use $a_{ij}$ for $A_{ij}$ and\n$a_{j}$ for the $j$th column.\n\nThroughout, we'll need to be careful that the matrices involved in an\noperation are conformable: for $A+B$ both matrices need to be of the\nsame dimension, while for $AB$ the number of columns of $A$ must match\nthe number of rows of $B$. Note that this allows for $B$ to be a column\nvector, with only one column, $Ab$. Just checking dimensions is a good\nway to catch many errors. Example: is\n$\\mbox{Cov}(Ax)=A\\mbox{Cov}(x)A^{\\top}$ or\n$\\mbox{Cov}(Ax)=A^{\\top}\\mbox{Cov}(x)A$? Well, if $A$ is $m\\times n$, it\nmust be the former, as the latter is not conformable.\n\nThe **inner product** of two vectors is\n$\\sum_{i}x_{i}y_{i}=x^{\\top}y\\equiv\\langle x,y\\rangle\\equiv x\\cdot y$.\n\nThe **outer product** is $xy^{\\top}$, which comes from all pairwise\nproducts of the elements.\n\nWhen the indices of summation should be obvious, I'll sometimes leave\nthem implicit. Ask me if it's not clear.\n\n## Norms\n\nFor a vector, $\\|x\\|_{p}=(\\sum_{i}|x_{i}|^{p})^{1/p}$ and the standard (Euclidean)\nnorm is $\\|x\\|_{2}=\\sqrt{\\sum x_{i}^{2}}=\\sqrt{x^{\\top}x}$, just the\nlength of the vector in Euclidean space, which we'll refer to as\n$\\|x\\|$, unless noted otherwise.\n\nOne commonly used norm for a matrix is\nthe Frobenius norm, $\\|A\\|_{F}=(\\sum_{i,j}a_{ij}^{2})^{1/2}$.\n\nIn this Unit, we'll often make use of the **induced matrix norm**, which is\ndefined relative to a corresponding vector norm, $\\|\\cdot\\|$, as:\n$$\\|A\\|=\\sup_{x\\ne0}\\frac{\\|Ax\\|}{\\|x\\|}$$ So we have\n$$\\|A\\|_{2}=\\sup_{x\\ne0}\\frac{\\|Ax\\|_{2}}{\\|x\\|_{2}}=\\sup_{\\|x\\|_{2}=1}\\|Ax\\|_{2}$$\nIf you're not familiar with the supremum (\"sup\" above), you can just\nthink of it as taking the maximum. In the case of the 2-norm, the norm turns\nout to be the largest singular value in the singular value decomposition (SVD)\nof the matrix.\n\nWe can interpret the norm of a matrix as the most that the matrix\ncan stretch a vector when multiplying by the vector (relative to the\nlength of the vector).\n\nA property of any legitimate matrix norm (including the induced norm) is\nthat $\\|AB\\|\\leq\\|A\\|\\|B\\|$. 
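\n\nAs a quick illustration (a sketch using a random vector and matrix), the\ncorresponding norms are easy to compute with `numpy`.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\nnp.random.seed(1)\nx = np.random.normal(size = 5)\nA = np.random.normal(size = (4, 5))\n\nnp.linalg.norm(x)          # Euclidean norm of the vector\nnp.linalg.norm(A, 'fro')   # Frobenius norm\nnp.linalg.norm(A, 2)       # induced 2-norm\nnp.linalg.svd(A)[1][0]     # equals the largest singular value\n```\n:::\n\n\n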
Also recall that norms must obey the triangle\ninequality, $\\|A+B\\|\\leq\\|A\\|+\\|B\\|$.\n\nA normalized vector is one with \"length\", i.e., Euclidean norm, of one.\nWe can easily normalize a vector: $\\tilde{x}=x/\\|x\\|$\n\nThe angle between two vectors is\n$$\\theta=\\cos^{-1}\\left(\\frac{\\langle x,y\\rangle}{\\sqrt{\\langle x,x\\rangle\\langle y,y\\rangle}}\\right)$$\n\n## Orthogonality\n\nTwo vectors are orthogonal if $x^{\\top}y=0$, in which case we say\n$x\\perp y$. An **orthogonal matrix** is a square matrix in which all of the\ncolumns are orthogonal to each other and normalized. The same holds for\nthe rows. Orthogonal matrices\ncan be shown to have full rank. Furthermore if $A$ is orthogonal,\n$A^{\\top}A=I$, so $A^{-1}=A^{\\top}$. Given all this, the determinant of\northogonal $A$ is either 1 or -1. Finally the product of two orthogonal\nmatrices, $A$ and $B$, is also orthogonal since\n$(AB)^{\\top}AB=B^{\\top}A^{\\top}AB=B^{\\top}B=I$.\n\n#### Permutations\n\nSometimes we make use of matrices that permute two rows (or two columns)\nof another matrix when multiplied. Such a matrix is known as an\nelementary permutation matrix and is an orthogonal matrix with a\ndeterminant of -1. You can multiply such matrices to get more general\npermutation matrices that are also orthogonal. If you premultiply by\n$P$, you permute rows, and if you postmultiply by $P$ you permute\ncolumns. Note that on a computer, you wouldn't need to actually do the\nmultiply (and if you did, you should use a sparse matrix routine), but\nrather one can often just rework index values that indicate where\nrelevant pieces of the matrix are stored (more in the next section).\n\n## Some vector and matrix properties\n\n$AB\\ne BA$ but $A+B=B+A$ and $A(BC)=(AB)C$.\n\nIn Python, recall the syntax is\n\n\n::: {.cell}\n\n```{.python .cell-code}\nA + B\n\n# Matrix multiplication\nnp.matmul(A, B) \nA @ B # alternative\nA.dot(B) # not recommended by the NumPy docs\n\nA * B # Hadamard (direct) product\n```\n:::\n\n\nYou don't need the spaces, but they're nice for code readability.\n\n## Trace and determinant of square matrices\n\nThe trace of a matrix is the sum of the diagonal elements. For square\nmatrices, $\\mbox{tr}(A+B)=\\mbox{tr}(A)+\\mbox{tr}(B)$,\n$\\mbox{tr}(A)=\\mbox{tr}(A^{\\top})$.\n\nWe also have $\\mbox{tr}(ABC)=\\mbox{tr}(CAB)=\\mbox{tr}(BCA)$ - basically\nyou can move a matrix from the beginning to the end or end to beginning,\nprovided they are conformable for this operation. This is helpful for a\ncouple reasons:\n\n1. We can find the ordering that reduces computation the most if the\n individual matrices are not square.\n2. $x^{\\top}Ax=\\mbox{tr}(x^{\\top}Ax)$ since the quadratic form,\n $x^{\\top}Ax$, is a scalar, and this is equal to\n $\\mbox{tr}(xx^{\\top}A)$ where $xx^{\\top}A$ is a matrix. It can be\n helpful to be able to go back and forth between a scalar and a trace\n in some statistical calculations.\n\nFor square matrices, the determinant exists and we have $|AB|=|A||B|$\nand therefore, $|A^{-1}|=1/|A|$ since $|I|=|AA^{-1}|=1$. Also\n$|A|=|A^{\\top}|$, which can be seen using the QR decomposition for $A$\nand understanding properties of determinants of triangular matrices (in\nthis case $R$) and orthogonal matrices (in this case $Q$).\n\n## Transposes and inverses\n\nFor square, invertible matrices, we have that\n$(A^{-1})^{\\top}=(A^{\\top})^{-1}$. Why? 
Since we have\n$(AB)^{\\top}=B^{\\top}A^{\\top}$, we have:\n$$A^{\\top}(A^{-1})^{\\top}=(A^{-1}A)^{\\top}=I$$ so\n$(A^{\\top})^{-1}=(A^{-1})^{\\top}$.\n\nFor two invertible matrices, we have that\n$(AB)^{-1} = B^{-1}A^{-1}$ since\n$B^{-1}A^{-1} AB = I$.\n\n#### Other matrix multiplications\n\nThe Hadamard or direct product is simply multiplication of the\ncorrespoding elements of two matrices by each other. In R this is\nsimply` A * B`.\\\n**Challenge**: How can I find $\\mbox{tr}(AB)$ without using `A %*% B` ?\n\nThe Kronecker product is the product of each element of one matrix with\nthe entire other matrix\"\n\n$$A\\otimes B=\\left(\\begin{array}{ccc}\nA_{11}B & \\cdots & A_{1m}B\\\\\n\\vdots & \\ddots & \\vdots\\\\\nA_{n1}B & \\cdots & A_{nm}B\n\\end{array}\\right)$$\n\nThe inverse of a Kronecker product is the Kronecker product of the\ninverses,\n\n$$ B^{-1} \\otimes A^{-1} $$ \n\nwhich is obviously quite a bit faster because\nthe inverse (i.e., solving a system of equations) in this special case\nis $O(n^{3}+m^{3})$ rather than the naive approach being $O((nm)^{3})$.\n\n## Matrix decompositions\n\nA matrix decomposition is a re-expression of a matrix, $A$, in terms of\na product of two or three other, simpler matrices, where the\ndecomposition reveals structure or relationships present in the original\nmatrix, $A$. The \"simpler\" matrices may be simpler in various ways,\nincluding\n\n- having fewer rows or columns;\n- being diagonal, triangular or sparse in some way,\n- being orthogonal matrices.\n\nIn addition, once you have a decomposition, computation is generally\neasier, because of the special structure of the simpler matrices.\n\nWe'll see this in great detail in Section 3.\n\n# 2. Statistical interpretations of matrix invertibility, rank, etc.\n\n## Linear independence, rank, and basis vectors\n\nA set of vectors, $v_{1},\\ldots v_{n}$, is linearly independent (LIN)\nwhen none of the vectors can be represented as a linear combination,\n$\\sum c_{i}v_{i}$, of the others for scalars, $c_{1},\\ldots,c_{n}$. If\nwe have vectors of length $n$, we can have at most $n$ linearly\nindependent vectors. The rank of a matrix is the number of linearly\nindependent rows (or columns - it's the same), and is at most the\nminimum of the number of rows and number of columns. We'll generally\nthink about it in terms of the dimension of the column space - so we can\njust think about the number of linearly independent columns.\n\nAny set of linearly independent vectors (say $v_{1},\\ldots,v_{n}$) span\na space made up of all linear combinations of those vectors\n($\\sum_{i=1}^{n}c_{i}v_{i}$). The spanning vectors are known as basis\nvectors. We can express a vector $y$ that is in the space with respect\nto (as a linear combination of) basis vectors as $y=\\sum_{i}c_{i}v_{i}$,\nwhere if the basis vectors are normalized and orthogonal, we can find\nthe weights as $c_{i}=\\langle y,v_{i}\\rangle$.\n\nConsider a regression context. We have $p$ covariates ($p$ columns in\nthe design matrix, $X$), of which $q\\leq p$ are linearly independent\ncovariates. This means that $p-q$ of the vectors can be written as\nlinear combos of the $q$ vectors. The space spanned by the covariate\nvectors is of dimension $q$, rather than $p$, and $X^{\\top}X$ has $p-q$\neigenvalues that are zero. The $q$ LIN vectors are basis vectors for the\nspace - we can represent any point in the space as a linear combination\nof the basis vectors. 
You can think of the basis vectors as being like\nthe axes of the space, except that the basis vectors are not orthogonal.\nSo it's like denoting a point in $\\Re^{q}$ as a set of $q$ numbers\ntelling us where on each of the axes we are - this is the same as a\nlinear combination of axis-oriented vectors.\n\nWhen fitting a regression, if $n=p=q$, a vector of $n$ observations can\nbe represented exactly as a linear combination of the $p$ basis vectors,\nso there is no residual and we have a single unique (and exact) solution\n(e.g., with $n=p=2$, the observations fall exactly on the simple linear\nregression line). If $np$, so the system is overdetermined - there is no exact\nsolution, but regression is all about finding solutions that minimize\nsome criterion about the differences between the observations and linear\ncombinations of the columns of the $X$ matrix (such as least squares or\npenalized least squares). In standard regression, we project the\nobservation vector onto the space spanned by the columns of the $X$\nmatrix, so we find the point in the space closest to the observation\nvector.\n\n## Invertibility, singularity, rank, and positive definiteness\n\nFor square matrices, let's consider how invertibility, singularity, rank\nand positive (or non-negative) definiteness relate.\n\nSquare matrices that are \"regular\" have an eigendecomposition,\n$A=\\Gamma\\Lambda\\Gamma^{-1}$ where $\\Gamma$ is a matrix with the\neigenvectors as the columns and $\\Lambda$ is a diagonal matrix of\neigenvalues, $\\Lambda_{ii}=\\lambda_{i}$. Symmetric matrices and matrices\nwith unique eigenvalues are regular, as are some other matrices. The\nnumber of non-zero eigenvalues is the same as the rank of the matrix.\nSquare matrices that have an inverse are also called nonsingular, and\nthis is equivalent to having full rank. If the matrix is symmetric, the\neigenvectors and eigenvalues are real and $\\Gamma$ is orthogonal, so we\nhave $A=\\Gamma\\Lambda\\Gamma^{\\top}$. The determinant of the matrix is\nthe product of the eigenvalues (why?), which is zero if it is less than\nfull rank. Note that if none of the eigenvalues are zero then\n$A^{-1}=\\Gamma\\Lambda^{-1}\\Gamma^{\\top}$.\n\nLet's focus on symmetric matrices. The symmetric matrices that tend to\narise in statistics are either positive definite (p.d.) or non-negative\ndefinite (n.n.d.). If a matrix is positive definite, then by definition\n$x^{\\top}Ax>0$ for any $x$. Note that if $\\mbox{Cov}(y)=A$ then\n$x^{\\top}Ax=x^{\\top}\\mbox{Cov}(y)x=\\mbox{Cov}(x^{\\top}y)=\\mbox{Var}(x^{\\top}y)$\nif so positive definiteness amounts to having linear combinations of\nrandom variables (with the elements of $x$ here being the weights)\nhaving positive variance. So we must have that positive definite\nmatrices are equivalent to variance-covariance matrices (I'll just refer\nto this as a variance matrix or as a covariance matrix). If $A$ is p.d.\nthen it has all positive eigenvalues and it must have an inverse, though\nas we'll see, from a numerical perspective, we may not be able to\ncompute it if some of the eigenvalues are very close to zero. 
In Python,\n`numpy.linalg.eig(A)[1]` is $\\Gamma$, with each column a vector, and\n`numpy.linalg.eig(A)[0]` contains the (unordered) eigenvalues.\n\nTo summarize, here are some of the various connections between\nmathematical and statistical properties of **positive definite**\nmatrices:\n\n$A$ positive definite $\\Leftrightarrow$ $A$ is a covariance matrix\n$\\Leftrightarrow$ $x^{\\top}Ax>0$ $\\Leftrightarrow$ $\\lambda_{i}>0$\n(positive eigenvalues) $\\Rightarrow$$|A|>0$ $\\Rightarrow$$A$ is\ninvertible $\\Leftrightarrow$ $A$ is non singular $\\Leftrightarrow$ $A$ is\nfull rank.\n\nAnd here are connections for positive semi-definite matrices:\n\n$A$ positive semi-definite $\\Leftrightarrow$ $A$ is a constrained\ncovariance matrix $\\Leftrightarrow$ $x^{\\top}Ax\\geq0$ and equal to 0 for\nsome $x$ $\\Leftrightarrow$ $\\lambda_{i}\\geq 0$ (non-negative eigenvalues),\nwith at least one zero $\\Rightarrow$ $|A|=0$ $\\Leftrightarrow$ $A$ is not\ninvertible $\\Leftrightarrow$ $A$ is singular $\\Leftrightarrow$ $A$ is not\nfull rank.\n\n## Interpreting an eigendecomposition\n\nLet's interpret the eigendecomposition in a generative context as a way\nof generating random vectors. We can generate $y$ s.t. $\\mbox{Cov}(y)=A$\nif we generate $y=\\Gamma\\Lambda^{1/2}z$ where $\\mbox{Cov}(z)=I$ and\n$\\Lambda^{1/2}$ is formed by taking the square roots of the eigenvalues.\nSo $\\sqrt{\\lambda_{i}}$ is the standard deviation associated with the\nbasis vector $\\Gamma_{\\cdot i}$. That is, the $z$'s provide the weights\non the basis vectors, with scaling based on the eigenvalues. So $y$ is\nproduced as a linear combination of eigenvectors as basis vectors, with\nthe variance attributable to the basis vectors determined by the\neigenvalues.\n\nTo go the other direction, we can project a vector $y$ onto the space \nspanned by the eigenvectors: $w = (\\Gamma^{\\top}\\Gamma)^{-1}\\Gamma^{\\top}y = \\Gamma^{\\top}y = \\Lambda^{1/2}z$,\nwhere the simplification of course comes from $\\Gamma$ being orthogonal.\n\nIf $x^{\\top}Ax\\geq0$ then $A$ is nonnegative definite (also called\npositive semi-definite). In this case one or more eigenvalues can be\nzero. Let's interpret this a bit more in the context of generating\nrandom vectors based on non-negative definite matrices,\n$y=\\Gamma\\Lambda^{1/2}z$ where $\\mbox{Cov}(z)=I$. Questions:\n\n1. What does it mean when one or more eigenvalue (i.e.,\n $\\lambda_{i}=\\Lambda_{ii}$) is zero?\n\n2. Suppose I have an eigenvalue that is very small and I set it to\n zero? What will be the impact upon $y$ and $\\mbox{Cov}(y)$?\n\n3. Now let's consider the inverse of a covariance matrix, known as the\n precision matrix, $A^{-1}=\\Gamma\\Lambda^{-1}\\Gamma^{\\top}$. What\n does it mean if a $(\\Lambda^{-1})_{ii}$ is very large? What if\n $(\\Lambda^{-1})_{ii}$ is very small?\n\nConsider an arbitrary $n\\times p$ matrix, $X$. Any crossproduct or sum\nof squares matrix, such as $X^{\\top}X$ is positive definite\n(non-negative definite if $p>n$). This makes sense as it's just a\nscaling of an empirical covariance matrix.\n\n## Generalized inverses (optional)\n\nSuppose I want to find $x$ such that $Ax=b$. Mathematically the answer\n(provided $A$ is invertible, i.e. of full rank) is $x=A^{-1}b$.\n\nGeneralized inverses arise in solving equations when $A$ is not full\nrank. A generalized inverse is a matrix, $A^{-}$ s.t. $AA^{-}A=A$. 
The\nMoore-Penrose inverse (the pseudo-inverse), $A^{+}$, is a (unique)\ngeneralized inverse that also satisfies some additional properties.\n$x=A^{+}b$ is the solution to the linear system, $Ax=b$, that has the\nshortest length for $x$.\n\nWe can find the pseudo-inverse based on an eigendecomposition (or an\nSVD) as $\\Gamma\\Lambda^{+}\\Gamma^{\\top}$. We obtain $\\Lambda^{+}$ from\n$\\Lambda$ as follows. For values $\\lambda_{i}>0$, compute\n$1/\\lambda_{i}$. All other values are set to 0. Let's interpret this\nstatistically. Suppose we have a precision matrix with one or more zero\neigenvalues and we want to find the covariance matrix. A zero eigenvalue\nmeans we have no precision, or infinite variance, for some linear\ncombination (i.e., for some basis vector). We take the pseudo-inverse\nand assign that linear combination zero variance.\n\nLet's consider a specific example. Autoregressive models are often used\nfor smoothing (in time, in space, and in covariates). A first order\nautoregressive model for $y_{1},y_{2},\\ldots,y_{T}$ has\n$E(y_{i}|y_{-i})=\\frac{1}{2}(y_{i-1}+y_{i+1})$. Another way of writing\nthe model is in time-order: $y_{i}=y_{i-1}+\\epsilon_{i}$. A second order\nautoregressive model has\n$E(y_{i}|y_{-i})=\\frac{1}{6}(4y_{i-1}+4y_{i+1}-y_{i-2}-y_{i+2})$. These\nconstructions basically state that each value should be a smoothed\nversion of its neighbors. One can figure out that the **precision**\nmatrix for $y$ in the first order model is $$\\left(\\begin{array}{ccccc}\n\\ddots & & \\vdots\\\\\n-1 & 2 & -1 & 0\\\\\n\\cdots & -1 & 2 & -1 & \\dots\\\\\n & 0 & -1 & 2 & -1\\\\\n & & \\vdots & & \\ddots\n\\end{array}\\right)$$ and in the second order model is\n\n$$\\left( \\begin{array}{ccccccc} \\ddots & & & \\vdots \\\\ 1 & -4 & 6 & -4 & 1 \\\\ \\cdots & 1 & -4 & 6 & -4 & 1 & \\cdots \\\\ & & 1 & -4 & 6 & -4 & 1 \\\\ & & & \\vdots \\end{array} \\right).$$ \n\nIf we look at the eigendecomposition of such\nmatrices, we see that in the first order case, the eigenvalue\ncorresponding to the constant eigenvector is zero.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\nprecMat = np.array([[1,-1,0,0,0],[-1,2,-1,0,0],[0,-1,2,-1,0],[0,0,-1,2,-1],[0,0,0,-1,1]])\ne = np.linalg.eig(precMat)\ne[0] # 4th eigenvalue is numerically zero\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([3.61803399e+00, 2.61803399e+00, 1.38196601e+00, 4.97762256e-17,\n 3.81966011e-01])\n```\n:::\n\n```{.python .cell-code}\ne[1][:,3] # constant eigenvector\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([0.4472136, 0.4472136, 0.4472136, 0.4472136, 0.4472136])\n```\n:::\n:::\n\n\nThis means we have no information about the overall level of $y$. So how\nwould we generate sample $y$ vectors? We can't put infinite variance on\nthe constant basis vector and still generate samples. 
Instead we use the\npseudo-inverse and assign ZERO variance to the constant basis vector.\nThis corresponds to generating realizations under the constraint that\n$\\sum y_{i}$ has no variation, i.e., $\\sum y_{i}=\\bar{y}=0$ - you can\nsee this by seeing that $\\mbox{Var}(\\Gamma_{\\cdot i}^{\\top}y)=0$ when\n$\\lambda_{i}=0$.\n\n\n::: {.cell}\n\n```{.python .cell-code}\n# generate a realization\nevals = e[0]\nevals = 1/evals # variances\nevals[3] = 0 # generalized inverse\ny = e[1] @ ((evals ** 0.5) * np.random.normal(size = 5))\ny.sum()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n6.661338147750939e-16\n```\n:::\n:::\n\n\nIn the second order case, we have two non-identifiabilities: for the sum\nand for the linear component of the variation in $y$ (linear in the\nindices of $y$).\n\nI could parameterize a statistical model as $\\mu+y$ where $y$ has\ncovariance that is the generalized inverse discussed above. Then I allow\nfor both a non-zero mean and for smooth variation governed by the\nautoregressive structure. In the second-order case, I would need to add\na linear component as well, given the second non-identifiability.\n\n## Matrices arising in regression\n\nIn regression, we work with $X^{\\top}X$. Some properties of this matrix\nare that it is symmetric and non-negative definite (hence our use of\n$(X^{\\top}X)^{-1}$ in the OLS estimator). When is it not positive\ndefinite?\n\nFitted values are $X\\hat{\\beta}=X(X^{\\top}X)^{-1}X^{\\top}Y=HY$. The\n\"hat\" matrix, $H$, projects $Y$ into the column space of $X$. $H$ is\nidempotent: $HH=H$, which makes sense - once you've projected into the\nspace, any subsequent projection just gives you the same thing back. $H$\nis singular. Why? Also, under what special circumstance would it not be\nsingular?\n\n# 3. Computational issues\n\n## Storing matrices\n\nWe've discussed column-major and row-major storage of matrices. First,\nretrieval of matrix elements from memory is quickest when multiple\nelements are contiguous in memory. So in a column-major language (e.g.,\nR, Fortran), it is best to work with values in a common column (or\nentire columns) while in a row-major language (e.g., Python, C) for\nvalues in a common row.\n\nIn some cases, one can save space (and potentially speed) by overwriting\nthe output from a matrix calculation into the space occupied by an\ninput. This occurs in some clever implementations of matrix\nfactorizations.\n\n## Algorithms\n\nGood algorithms can change the efficiency of an algorithm by one or more\norders of magnitude, and many of the improvements in computational speed\nover recent decades have been in algorithms rather than in computer\nspeed.\n\nMost matrix algebra calculations can be done in multiple ways. For\nexample, we could compute $b=Ax$ in either of the following ways,\ndenoted here in pseudocode.\n\n1. Stack the inner products of the rows of $A$ with $x$.\\\n\n```\n for(i=1:n){ \n b_i = 0\n for(j=1:m){\n b_i = b_i + a_{ij} x_j\n }\n }\n```\n\n2. Take the linear combination (based on $x$) of the columns of $A$\\\n\n```\n for(i=1:n){ \n b_i = 0\n }\n for(j=1:m){\n for(i = 1:n){\n b_i = b_i + a_{ij} x_j \n }\n }\n```\n\nIn this case the two approaches involve the same number of operations\nbut the first might be better for row-major matrices (so might be how we\nwould implement in C) and the second for column-major (so might be how\nwe would implement in Fortran). \n\n**Challenge**: check whether the first\napproach is faster in Python. 
(Write the code just doing the outer loop and\ndoing the inner loop using vectorized calculation.)\n\n#### General computational issues\n\nThe same caveats we discussed in terms of computer arithmetic hold\nnaturally for linear algebra, since this involves arithmetic with many\nelements. Good implementations of algorithms are aware of the danger of\ncatastrophic cancellation and of the possibility of dividing by zero or\nby values that are near zero.\n\n## Ill-conditioned problems\n\n#### Basics\n\nA problem is ill-conditioned if small changes to values in the\ncomputation result in large changes in the result. This is quantified by\nsomething called the *condition number* of a calculation. For different\noperations there are different condition numbers.\n\nIll-conditionedness arises most often in terms of matrix inversion, so\nthe standard condition number is the \"condition number with respect to\ninversion\", which when using the $L_{2}$ norm is the ratio of the\nabsolute values of the largest to smallest eigenvalue. Here's an\nexample: $$A=\\left(\\begin{array}{cccc}\n10 & 7 & 8 & 7\\\\\n7 & 5 & 6 & 5\\\\\n8 & 6 & 10 & 9\\\\\n7 & 5 & 9 & 10\n\\end{array}\\right).$$ The solution of $Ax=b$ for $b=(32,23,33,31)$ is\n$x=(1,1,1,1)$, while the solution for $b+\\delta b=(32.1,22.9,33.1,30.9)$\nis $x+\\delta x=(9.2,-12.6,4.5,-1.1)$, where $\\delta$ is notation for a\nperturbation to the vector or matrix.\n\n\n::: {.cell}\n\n```{.python .cell-code}\ndef norm2(x):\n return(np.sum(x**2) ** 0.5)\n\nA = np.array([[10,7,8,7],[7,5,6,5],[8,6,10,9],[7,5,9,10]])\nb = np.array([32,23,33,31])\nx = np.linalg.solve(A, b)\n\nbPerturbed = np.array([32.1, 22.9, 33.1, 30.9])\nxPerturbed = np.linalg.solve(A, bPerturbed)\n\ndelta_b = bPerturbed - b\ndelta_x = xPerturbed - x\n```\n:::\n\n\n\nWhat's going on? Some manipulations with inequalities involving the\ninduced matrix norm (for any chosen vector norm, but we might as well\njust think about the Euclidean norm) (see Gentle-CS Sec. 5.1 or the derivation in class) give\n$$\\frac{\\|\\delta x\\|}{\\|x\\|}\\leq\\|A\\|\\|A^{-1}\\|\\frac{\\|\\delta b\\|}{\\|b\\|}$$\nwhere we define the condition number w.r.t. inversion as\n$\\mbox{cond}(A)\\equiv\\|A\\|\\|A^{-1}\\|$. We'll generally work with the\n$L_{2}$ norm, and for a nonsingular square matrix the result is that the\ncondition number is the ratio of the absolute values of the largest and\nsmallest magnitude eigenvalues. 
This makes sense since $\\|A\\|_{2}$ is\nthe absolute value of the largest magnitude eigenvalue of $A$ and\n$\\|A^{-1}\\|_{2}$ that of the inverse of the absolute value of the\nsmallest magnitude eigenvalue of $A$.\n\nWe see in the code below that the large disparity in eigenvalues of $A$\nleads to an effect predictable from our inequality above, with the\ncondition number helping us find an upper bound.\n\n\n::: {.cell}\n\n```{.python .cell-code}\ne = np.linalg.eig(A)\nevals = e[0]\nprint(evals)\n\n## relative perturbation in x much bigger than in b\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[3.02886853e+01 3.85805746e+00 1.01500484e-02 8.43107150e-01]\n```\n:::\n\n```{.python .cell-code}\nnorm2(delta_x) / norm2(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n8.19847546803699\n```\n:::\n\n```{.python .cell-code}\nnorm2(delta_b) / norm2(b)\n\n## ratio of relative perturbations\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n0.0033319453118976702\n```\n:::\n\n```{.python .cell-code}\n(norm2(delta_x) / norm2(x)) / (norm2(delta_b) / norm2(b))\n\n## ratio of largest and smallest magnitude eigenvalues\n## confusingly evals[2] is the smallest, not evals[3]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n2460.567236431514\n```\n:::\n\n```{.python .cell-code}\n(evals[0]/evals[2])\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n2984.092701676269\n```\n:::\n:::\n\n\nThe main use of these ideas for our purposes is in thinking about the\nnumerical accuracy of a linear system solution (Gentle-NLA Sec 3.4). On\na computer we have the system $$(A+\\delta A)(x+\\delta x)=b+\\delta b$$\nwhere the 'perturbation' is from the inaccuracy of computer numbers. Our\nexploration of computer numbers tells us that\n$$\\frac{\\|\\delta b\\|}{\\|b\\|}\\approx10^{-p};\\,\\,\\,\\frac{\\|\\delta A\\|}{\\|A\\|}\\approx10^{-p}$$\nwhere $p=16$ for standard double precision floating points. Following\nGentle, one gets the approximation\n\n$$\\frac{\\|\\delta x\\|}{\\|x\\|}\\approx\\mbox{cond}(A)10^{-p},$$ so if\n$\\mbox{cond}(A)\\approx10^{t}$, we have accuracy of order $10^{t-p}$\ninstead of $10^{-p}$. (Gentle cautions that this holds only if\n$10^{t-p}\\ll1$). So we can think of the condition number as giving us\nthe number of digits of accuracy lost during a computation relative to\nthe precision of numbers on the computer. E.g., a condition number of\n$10^{8}$ means we lose 8 digits of accuracy relative to our original 16\non standard systems. One issue is that estimating the condition number\nis itself subject to numerical error and requires computation of\n$A^{-1}$ (albeit not in the case of $L_{2}$ norm with square,\nnonsingular $A$) but see Golub and van Loan (1996; p. 76-78) for an\nalgorithm.\n\n#### Improving conditioning\n\nIll-conditioned problems in statistics often arise from collinearity of\nregressors. Often the best solution is not a numerical one, but\nre-thinking the modeling approach, as this generally indicates\nstatistical issues beyond just the numerical difficulties.\n\nA general comment on improving conditioning is that we want to avoid\nlarge differences in the magnitudes of numbers involved in a\ncalculation. In some contexts such as regression, we can center and\nscale the columns to avoid such differences - this will improve the\ncondition of the problem. 
E.g., in simple quadratic regression with\n$x=\\{1990,\\ldots,2010\\}$ (e.g., regressing on calendar years), we see\nthat centering and scaling the matrix columns makes a huge difference on\nthe condition number\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport statsmodels.api as sm\n\nt1 = np.arange(1990, 2011) # naive covariate\nX1 = np.column_stack((np.ones(21), t1, t1 ** 2))\n\nbeta = np.array([5, .1, .0001])\ny = X1 @ beta + np.random.normal(size = len(t1))\n\ne1 = np.linalg.eig(np.dot(X1.T, X1))\nnp.sort(e1[0])[::-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([3.36018564e+14, 7.69949736e+02, 2.24079720e-08])\n```\n:::\n\n```{.python .cell-code}\nnp.linalg.cond(np.dot(X1.T, X1)) # built-in!\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n4.653329826789808e+21\n```\n:::\n\n```{.python .cell-code}\nsm.OLS(y, X1).fit().params\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([ 1.60671997e+04, -1.59483225e+01, 4.10859375e-03])\n```\n:::\n\n```{.python .cell-code}\nt2 = t1 - 2000 # centered\nX2 = np.column_stack((np.ones(21), t2, t2 ** 2))\ne2 = np.linalg.eig(np.dot(X2.T, X2))\nwith np.printoptions(suppress=True):\n np.sort(e2[0])[::-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([50677.70427505, 770. , 9.29572495])\n```\n:::\n\n```{.python .cell-code}\nt3 = t2/10 # centered and scaled\nX3 = np.column_stack((np.ones(21), t3, t3 ** 2))\ne3 = np.linalg.eig(np.dot(X3.T, X3))\nwith np.printoptions(suppress=True):\n np.sort(e3[0])[::-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\narray([24.11293487, 7.7 , 1.95366513])\n```\n:::\n:::\n\n\nI haven't shown the OLS results for the transformed version\nof the problem because the transformations modify the true\nparameter values, so there is a bit of arithmetic to be done,\nbut you should be able to verify that the second and third\napproaches give reasonable answers.\n\n\nThe basic story is that simple strategies often solve the problem, and\nthat you should be aware of the absolute and relative magnitudes\ninvolved in your calculations.\n\nOne rule of thumb is to try to work with numbers whose magnitude is\naround 1. We can often scale the values in our problem in order to do\nthis. I.e., change the units of your variables. Instead of personal\nincome in dollars, use personal income in thousands or hundreds of\nthousands of dollars.\n\n# 4. Matrix factorizations (decompositions) and solving systems of linear equations\n\nSuppose we want to solve the following linear system: \n\n$$\\begin{aligned} Ax & = & b\\\\ x & = & A^{-1}b \\end{aligned}$$ \n\n\nNumerically, this is never done by\nfinding the inverse and multiplying. Rather we solve the system using a\nmatrix decomposition (or equivalent set of steps). One approach uses\nGaussian elimination (equivalent to the LU decomposition), while another\nuses the Cholesky decomposition. There are also iterative methods that\ngenerate a sequence of approximations to the solution but reduce\ncomputation (provided they are stopped before the exact solution is\nfound).\n\nGentle-CS has a nice table overviewing the various factorizations (Table\n5.1, page 219). 
I've reproduced a variation on it here.\n\n\n\n| Name | Representation | Restrictions | Properties | Uses |\n| ----------| ----------------------| ------------| --------------------|-------------|\n| LU | $A_{nn}= L_{nn}U_{nn}$ | $A$ generally square | $L$ lower triangular; $U$ upper triangular | solving equations; inversion | \n| QR | $A_{nm}= Q_{nn}R_{nm}$ or $A_{nm}=Q_{nm}R_{mm}$(skinny) | | $Q$ orthogonal; $R$ upper triangular | regression |\n| Cholesky | $A_{nn}=U_{nn}^{\\top}U_{nn}$ | $A$ positive (semi-) definite | $U$ upper triangular | multivariate normal; covariance; solving equations; inversion |\n| Eigen decomposition | $A_{nn}=\\Gamma_{nn}\\Lambda_{nn}\\Gamma_{nn}^{\\top}$ | $A$ square, symmetric*| $\\Gamma$ orthogonal; $\\Lambda$ (non-negative**) diagonal | principal components analysis and related | \n| SVD | $A_{nm}=U_{nn}D_{nm}V_{mm}^{\\top}$ or $A_{nm}= U_{nk}D_{kk}V_{mk}^{\\top}$ | | $U, V$ orthogonal; $D$ (non-negative) diagonal | machine learning, topic models |\n\nTable: Matrix factorizations useful for statistics / data science / machine learning\n\n*For the eigen decomposition, I assume $A$ is symmetric, though there\nis a decomposition for non-symmetric $A$.\n\n** For positive definite or positive semi-definite $A$.\n\n## Triangular systems\n\nAs a preface, let's figure out how to solve $Ax=b$ if $A$ is upper\ntriangular. The basic algorithm proceeds from the bottom up (and\ntherefore is called a 'backsolve'. We solve for $x_{n}$ trivially, and\nthen move upwards plugging in the known values of $x$ and solving for\nthe remaining unknown in each row (each equation).\n\n1. $x_{n}=b_{n}/A_{nn}$\n2. Now for $kp$ then the leading $p$ rows of $R$ provide an upper triangular\nmatrix ($R_{1}$) and the remaining rows are 0. (I'm using $p$ because\nthe QR is generally applied to design matrices in regression). In this\ncase we really only need the first $p$ columns of $Q$, and we have\n$X=Q_{1}R_{1}$, the 'skinny' QR (this is what R's QR provides). For\nuniqueness, we can require the diagonals of $R$ to be nonnegative, and\nthen $R$ will be the same as the upper-triangular Cholesky factor of\n$X^{\\top}X$: \n\n$$\\begin{aligned} X^{\\top}X & = & R^{\\top}Q^{\\top}QR \\\\ & = & R^{\\top}R\\end{aligned}.$$ \n\n\nThere are three standard approaches for\ncomputing the QR, using (1) reflections (Householder transformations),\n(2) rotations (Givens transformations), or (3) Gram-Schmidt\northogonalization (see below for details).\n\nFor $n\\times n$ $X$, the QR (for the Householder approach) requires\n$2n^{3}/3$ flops, so QR is less efficient than LU or Cholesky.\n\nWe can also obtain the pseudo-inverse of $X$ from the QR:\n$X^{+}=[R_{1}^{-1}\\,0]Q^{\\top}$. In the case that $X$ is not full-rank,\nthere is a version of the QR that will work (involving pivoting) and we\nend up with some additional zeroes on the diagonal of $R_{1}$.\n\n### Regression and the QR\n\nOften QR is used to fit linear models, including in R. Consider the\nlinear model in the form $Y=X\\beta+\\epsilon$, finding\n$\\hat{\\beta}=(X^{\\top}X)^{-1}X^{\\top}Y$. Let's consider the skinny QR\nand note that $R^{\\top}$ is invertible. Therefore, we can express the\nnormal equations as \n\n$$\n\\begin{aligned} \nX^{\\top}X\\beta & = & X^{\\top} Y \\\\\nR^{\\top}Q^{\\top}QR\\beta & = & R^{\\top}Q^{\\top} Y \\\\\nR \\beta & = & Q^{\\top} Y\n\\end{aligned}\n$$ \n\n\nand solving for $\\beta$ is just a\nbacksolve since $R$ is upper-triangular. 
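\n\nAs a small illustration of that last step (a minimal sketch, not the code that R or `statsmodels` actually uses; the simulated `X`, `y`, and coefficient values here are hypothetical), we can compute $\\hat{\\beta}$ from the skinny QR with a single triangular backsolve:\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\nfrom scipy.linalg import solve_triangular\n\n## simulate a small regression problem (hypothetical values)\nrng = np.random.default_rng(1)\nn = 100\nX = np.column_stack((np.ones(n), rng.normal(size = (n, 2))))\ny = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size = n)\n\n## skinny QR: Q is n x p with orthonormal columns, R is p x p upper triangular\nQ, R = np.linalg.qr(X)\n## backsolve R beta = Q^T y; we never form X^T X or an explicit inverse\nbeta_hat = solve_triangular(R, Q.T @ y, lower=False)\n```\n:::\n\nUp to numerical error, `beta_hat` should match the usual least squares estimates (e.g., those from `sm.OLS` used earlier).\n\n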
Furthermore the standard\nregression quantities, such as the hat matrix, the SSE, the residuals,\netc. can be easily expressed in terms of $Q$ and $R$.\n\nWhy use the QR instead of the Cholesky on $X^{\\top}X$? The condition\nnumber of $X$ is the square root of that of $X^{\\top}X$, and the $QR$\nfactorizes $X$. Monahan has a discussion of the condition of the\nregression problem, but from a larger perspective, the situations where\nnumerical accuracy is a concern are generally cases where the OLS\nestimators are not particularly helpful anyway (e.g., highly collinear\npredictors).\n\nWhat about computational order of the different approaches to least\nsquares? The Cholesky is $np^{2}+\\frac{1}{3}p^{3}$, an algorithm called\nsweeping is $np^{2}+p^{3}$ , the Householder method for QR is\n$2np^{2}-\\frac{2}{3}p^{3}$, and the modified Gram-Schmidt approach for\nQR is $2np^{2}$. So if $n\\gg p$ then Cholesky (and sweeping) are faster\nthan the QR approaches. According to Monahan, modified Gram-Schmidt is\nmost numerically stable and sweeping least. In general, regression is\npretty quick unless $p$ is large since it is linear in $n$, so it may\nnot be worth worrying too much about computational differences of the\nsort noted here.\n\n### Regression and the QR in Python and R\n\nWe can get the Q and R matrices easily in Python.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nQ,R = np.linalg.qr(X)\n```\n:::\n\n\nOne of the methods used by the [`statsmodel` package in Python\nuses the QR to fit a regression](https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.fit.html).\n\nNote that by default in Python (and in R), you get the skinny QR, namely only the\nfirst $p$ rows of $R$ and the first $p$ columns of $Q$, where the latter\nform an orthonormal basis for the column space of $X$. The remaining\ncolumns form an orthonormal basis for the null space of $X$ (the space\northogonal to the column space of $X$). The analogy in regression is\nthat we get the basis vectors for the regression, while adding the\nremaining columns gives us the full $n$-dimensional space of the\nobservations.\n\nRegression in R uses the QR decomposition via `qr()`, which calls a\nFortran function. `qr()` (and the Fortran functions that are called) is\nspecifically designed to output quantities useful in fitting linear\nmodels. \n\nIn R, `qr()` returns the result as a list meant for use by other tools. R\nstores the $R$ matrix in the upper triangle of `\\$qr`, while the lower\ntriangle of *\\$qr* and *\\$aux* store the information for constructing\n$Q$ (this relates to the Householder-related vectors $u$ below). One can\nmultiply by $Q$ using *qr.qy()* and by $Q^{\\top}$ using *qr.qty()*. If\nyou want to extract $R$ and $Q$, the following will work:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nX.qr = qr(X)\nQ = qr.Q(X.qr)\nR = qr.R(X.qr) \n```\n:::\n\n\nAs a side note, there are QR-based functions that provide\nregression-related quantities, such as *qr.resid()*, *qr.fitted()* and\n*qr.coef()*. These functions (and their Fortran counterparts) exist\nbecause one can work through the various regression quantities of\ninterest and find their expressions in terms of $Q$ and $R$, with nice\nproperties resulting from $Q$ being orthogonal and $R$ triangular.\n\n### Computing the QR decomposition\n\nHere we'll see some of the details of the different approaches to the\nQR, in part because they involve some concepts that may be useful in\nother contexts. 
I won't expect you to see all of how this works, but\nplease skim through this to get an idea of how things are done.\n\nOne approach involves reflections of vectors and a second rotations of\nvectors. Reflections and rotations are transformations that are\nperformed by orthogonal matrices. The determinant of a reflection matrix\nis -1 and the determinant of a rotation matrix is 1. We'll see some of\nthe details in the demo code.\n\n#### QR Method 1: Reflections\n\nIf $u$ and $v$ are orthonormal vectors and $x$ is in the space spanned\nby $u$ and $v$, $x=c_{1}u+c_{2}v$, then $\\tilde{x}=-c_{1}u+c_{2}v$ is a\nreflection (a *Householder* reflection) along the $u$ dimension (since\nwe are using the negative of that basis vector). We can think of this as\nreflecting across the plane perpendicular to $u$. This extends simply to\nhigher dimensions with orthonormal vectors, $u,v_{1},v_{2},\\ldots$\n\nSuppose we want to formulate the reflection in terms of a \"Householder\"\nmatrix, $Q$. It turns out that $$Qx=\\tilde{x}$$ if $Q=I-2uu^{\\top}$. $Q$\nhas the following properties: (1) $Qu=-u$, (2) $Qv=v$ for $u^{\\top}v=0$,\n(3) $Q$ is orthogonal and symmetric.\n\nOne way to create the QR decomposition is by a series of Householder\ntransformations that create an upper triangular $R$ from $X$:\n\n$$\n\\begin{aligned}\nR & = & Q_{p}\\cdots Q_{1} X \\\\\nQ & = & (Q_{p}\\cdots Q_{1})^{\\top}\n\\end{aligned}\n$$ \n\n\nwhere we make use of\nthe symmetry in defining $Q$.\n\nBasically $Q_{1}$ reflects the first column of $X$ with respect to a\ncarefully chosen $u$, so that the result is all zeroes except for the\nfirst element. We want $Q_{1}x=\\tilde{x}=(||x||,0,\\ldots,0)$. This can\nbe achieved with $u=\\frac{x-\\tilde{x}}{||x-\\tilde{x}||}$. Then $Q_{2}$\nmakes the last $n-2$ rows of the second column equal to zero. We'll work\nthrough this a bit in class.\n\nIn the regression context, as we work through the individual\ntransformations, $Q_{j}=I-2u_{j}u_{j}^{\\top}$, we apply them to $X$ and\n$Y$ to create $R$ (note this would not involve doing the full matrix\nmultiplication - think about what calculations are actually needed) and\n$QY=Q^{\\top}Y$, and then solve $R\\beta=Q^{\\top}Y$. To find\n$\\mbox{Cov}(\\hat{\\beta})\\propto(X^{\\top}X)^{-1}=(R^{\\top}R)^{-1}=R^{-1}R^{-\\top}$\nwe do need to invert $R$, but it's upper-triangular and of dimension\n$p\\times p$. It turns out that $Q^{\\top}Y$ can be partitioned into the\nfirst $p$ and the last $n-p$ elements, $z^{(1)}$ and $z^{(2)}$. The SSR\nis $\\|z^{(1)}\\|^{2}$ and SSE is $\\|z^{(2)}\\|^{2}$.\n\nFinal side note: if $X$ is square (so $n=p)$ you might wonder why we\nneed $Q_{p}$ since after $p-1$ reflections, we don't need to zero\nanything else out (since the last column of $R$ has $n$ non-zero\nelements). It turns out that if we go back to thinking about a\nHouseholder reflection in general, there is a lack of uniqueness in\nchoosing $\\tilde{x}$. It could either be $(||x||,0,\\ldots,0)$ or\n$(-||x||,0,\\ldots,0)$. For better numerical stability, one chooses from\nthe two of those such that $x_{1}$ is of the opposite sign to\n$\\tilde{x}_{1}$, so that one avoids cancellation of numbers that may be\nof the same magnitude when doing $x-\\tilde{x}$. The transformation\n$Q_{p}$ is the last step of taking that approach of choosing the sign at\neach step. $Q_{p}$ doesn't zero anything out; it just basically just\ninvolves potentially setting $R_{pp}$ to be $-R_{pp}$. 
(To be honest,\nI'm not clear on why one would bother to do that last step, but that\nseems to be how it is presented in discussions of the Householder\napproach.) Of course in the case of $p2$, find interim vectors, $x_{k}^{(2)}$, by\n orthogonalizing with respect to $\\tilde{x}_{1}$\n\n3. Proceed for $k=3,\\ldots$, in turn orthogonalizing and normalizing\n the first of the remaining vectors w.r.t. $\\tilde{x}_{k-1}$ and\n orthogonalizing the remaining vectors w.r.t. $\\tilde{x}_{k-1}$ to\n get new interim vectors\n\nMathematically, we could instead orthogonalize $x_{2}$ w.r.t.\n$\\tilde{x}_{1}$, then orthogonalize $x_{3}$ w.r.t.\n$\\{\\tilde{x}_{1},\\tilde{x}_{2}\\}$, etc. The algorithm above is the\n*modified* G-S, and is known to be more numerically stable if the\ncolumns of $X$ are close to collinear, giving vectors that are closer to\northogonal. The resulting $\\tilde{x}$ vectors are the columns of $Q$.\nThe elements of $R$ are obtained as we proceed: the diagonal values are\nthe the normalization values in the denominators, while the\noff-diagonals are the inner products with the already-computed columns\nof $Q$ that are computed as part of the numerators.\n\nAnother way to think about this is that $R=Q^{\\top}X$, which is the same\nas regressing the columns of $X$ on $Q,$ since\n$(Q^{\\top}Q)^{-1}Q^{\\top}X=Q^{\\top}X$. By construction, the first column\nof $X$ is a scaling of the first column of $Q$, the second column of $X$\nis a linear combination of the first two columns of $Q$, etc., so $R$\nbeing upper triangular makes sense.\n\n### The \"tall-skinny\" QR\n\nSuppose you have a very large regression problem, with $n$ very large,\nand $n\\gg p$. There is a variant of the QR, called the tall-skinny QR\n(see for details) that allows us\nto find the decomposition in a parallel fashion. The basic idea is to do\na nested set of QR decompositions on blocks of rows of $X$:\n\n$$\nX = \\left( \\begin{array}{c}\nX_{0} \\\\\nX_{1} \\\\\nX_{2} \\\\\nX_{3}\n\\end{array}\n\\right) =\n\\left(\n\\begin{array}{c}\nQ_{0} R_{0} \\\\\nQ_{1} R_{1} \\\\\nQ_{2} R_{2} \\\\\nQ_{3} R_{3}\n\\end{array} \\right),\n$$\n\nfollowed by 'reduction' steps (this\ncan be done in a map-reduce context) that do the $QR$ of pairs of the\n$R$ factors: \n$$\\left(\\begin{array}{c}\nR_{0}\\\\\nR_{1}\\\\\nR_{2}\\\\\nR_{3}\n\\end{array}\\right)=\\left(\\begin{array}{c}\n\\left(\\begin{array}{c}\nR_{0}\\\\\nR_{1}\n\\end{array}\\right)\\\\\n\\left(\\begin{array}{c}\nR_{2}\\\\\nR_{3}\n\\end{array}\\right)\n\\end{array}\\right)=\\left(\\begin{array}{c}\nQ_{01}R_{01}\\\\\nQ_{23}R_{23}\n\\end{array}\\right)$$ and $$\\left(\\begin{array}{c}\nR_{01}\\\\\nR_{23}\n\\end{array}\\right)=Q_{0123}R_{0123}.$$ \n\nThe full decomposition is then\n\n$$X=\\left( \\begin{array}{cccc} Q_{0} & 0 & 0 & 0 \\\\ 0 & Q_{1} & 0 & 0 \\\\ 0 & 0 & Q_{2} & 0 \\\\ 0 & 0 & 0 & Q_{3} \\end{array} \\right) \\left( \\begin{array}{cc} Q_{01} & 0 \\\\ 0 & Q_{23} \\end{array} \\right) Q_{0123} R_{0123} = QR.$$\n\n\nThe computation can be done in\nparallel (in particular it can be done with map-reduce) and the $Q$\nmatrix for big problems would generally not be computed explicitly but\nwould be stored in its constituent pieces.\n\nAlternatively, there is a variant on the algorithm that processes the\nrow-blocks of $X$ serially, allowing you to do QR on a large tall-skinny\nmatrix that you can't fit in memory (or possibly even on disk). First\nyou do $QR$ on $X_{0}$ to get $Q_{0}R_{0}$. Then you stack $R_{0}$ on\ntop of $X_{1}$ and do QR to get $R_{01}$. 
Then stack $R_{01}$ on top of\n$X_{2}$ to get $R_{012}$, etc.\n\n## Determinants\n\nThe absolute value of the determinant of a square matrix can be found\nfrom the product of the diagonals of the triangular matrix in any\nfactorization that gives a triangular (including diagonal) matrix times\nan orthogonal matrix (or matrices) since the determinant of an\northogonal matrix is either one or minus one.\n\n$|A|=|QR|=|Q||R|=\\pm|R|$\n\n$|A^{\\top}A|=|(QR)^{\\top}QR|=|R^{\\top}R|=|R_{1}^{\\top}R_{1}|=|R_{1}|^{2}$\\\nIn Python, the following will do it (on the log scale).\n\n\n::: {.cell}\n\n```{.python .cell-code}\nQ, R = np.linalg.qr(A)\nmagn = np.sum(np.log(np.abs(np.diag(R))))\n```\n:::\n\n\nAn alternative is the product of the diagonal elements of $D$ (the\nsingular values) in the SVD factorization, $A=UDV^{\\top}$.\n\nFor non-negative definite matrices, we know the determinant is\nnon-negative, so the uncertainty about the sign is not an issue. For\npositive definite matrices, a good approach is to use the product of the\ndiagonal elements of the Cholesky decomposition.\n\nOne can also use the product of the eigenvalues:\n$|A|=|\\Gamma\\Lambda\\Gamma^{-1}|=|\\Gamma||\\Gamma^{-1}||\\Lambda|=|\\Lambda|$\n\n#### Computation\n\nComputing from any of these diagonal or triangular matrices as the\nproduct of the diagonals is prone to overflow and underflow, so we\n**always** work on the log scale as the sum of the logs of the values.\nWhen some of these may be negative, we can always keep track of the\nnumber of negative values and take the log of the absolute values.\n\nOften we will have the factorization as a result of other parts of the\ncomputation, so we get the determinant for free.\n\nWe can use `np.linalg.slogdet()` or (definitely not recommended)\n`np.linalg.det()` to calculate the determinant in Python.\nThese functions use the LU decomposition.\n\n\n# 5. Eigendecomposition and SVD\n\n## Eigendecomposition\n\nThe eigendecomposition (spectral decomposition) is useful in considering\nconvergence of algorithms and of course for statistical decompositions\nsuch as PCA. We think of decomposing the components of variation into\northogonal patterns (the eigenvectors) with variances (eigenvalues)\nassociated with each pattern.\n\nSquare symmetric matrices have real eigenvectors and eigenvalues, with\nthe factorization into orthogonal $\\Gamma$ and diagonal $\\Lambda$,\n$A=\\Gamma\\Lambda\\Gamma^{\\top}$, where the eigenvalues on the diagonal of\n$\\Lambda$ are ordered in decreasing value. Of course this is equivalent\nto the definition of an eigenvalue/eigenvector pair as a pair such that\n$Ax=\\lambda x$ where $x$ is the eigenvector and $\\lambda$ is a scalar,\nthe eigenvalue. Using the eigendecomposition, the inverse of $A$ is simply\n$\\Gamma\\Lambda^{-1}\\Gamma^{\\top}$. On a similar note, we can create a\nsquare root matrix, $\\Gamma\\Lambda^{1/2}$, by taking the square roots of\nthe eigenvalues.\n\nThe spectral radius of $A$, denoted $\\rho(A)$, is the maximum of the\nabsolute values of the eigenvalues. As we saw when talking about\nill-conditionedness, for symmetric matrices, this maximum is the induced\nnorm, so we have $\\rho(A)=\\|A\\|_{2}$. It turns out that\n$\\rho(A)\\leq\\|A\\|$ for any induced matrix norm. The spectral radius\ncomes up in determining the rate of convergence of some iterative\nalgorithms.\n\n#### Computation\n\nThere are several methods for eigenvalues; a common one for doing the\nfull eigendecomposition is the *QR algorithm*. 
The first step is to\nreduce $A$ to upper Hessenburg form, which is an upper triangular matrix\nexcept that the first subdiagonal in the lower triangular part can be\nnon-zero. For symmetric matrices, the result is actually tridiagonal. We\ncan do the reduction using Householder reflections or Givens rotations.\nAt this point the QR decomposition (using Givens rotations) is applied\niteratively (to a version of the matrix in which the diagonals are\nshifted), and the result converges to a diagonal matrix, which provides\nthe eigenvalues. It's more work to get the eigenvectors, but they are\nobtained as a product of Householder matrices (required for the initial\nreduction) multiplied by the product of the $Q$ matrices from the\nsuccessive QR decompositions.\n\nWe won't go into the algorithm in detail, but note that it involves\nmanipulations and ideas we've seen already.\n\nIf only the largest (or the first few largest) eigenvalues and their\neigenvectors are needed, which can come up in time series and Markov\nchain contexts, the problem is easier and can be solved by the *power\nmethod*. E.g., in a Markov chain context, steady state is reached\nthrough $x_{t}=A^{t}x_{0}$. One can find the largest eigenvector by\nmultiplying by $A$ many times, normalizing at each step.\n$v^{(k)}=Az^{(k-1)}$ and $z^{(k)}=v^{(k)}/\\|v^{(k)}\\|$. There is an\nextension to find the $p$ largest eigenvalues and their vectors. See the\ndemo code in the qmd source file for an implementation (in R).\n\n\n\n\n\n## Singular value decomposition\n\nLet's consider an $n\\times m$ matrix, $A$, with $n\\geq m$ (if $m>n$, we\ncan always work with $A^{\\top})$. This often is a matrix representing\n$m$ features of $n$ observations. We could have $n$ documents and $m$\nwords, or $n$ gene expression levels and $m$ experimental conditions,\netc. $A$ can always be decomposed as $$A=UDV^{\\top}$$ where $U$ and $V$\nare matrices with orthonormal columns (left and right eigenvectors) and\n$D$ is diagonal with non-negative values (which correspond to\neigenvalues in the case of square $A$ and to squared eigenvalues of\n$A^{\\top}A$).\n\nThe SVD can be represented in more than one way. One representation is\n$$A_{n\\times m}=U_{n\\times k}D_{k\\times k}V_{k\\times m}^{\\top}=\\sum_{j=1}^{k}D_{jj}u_{j}v_{j}^{\\top}$$\nwhere $u_{j}$ and $v_{j}$ are the columns of $U$ and $V$ and where $k$\nis the rank of $A$ (which is at most the minimum of $n$ and $m$ of\ncourse). The diagonal elements of $D$ are the singular values.\n\nThat representation is as the sum of rank-one matrices (since\neach term is the scaled outer product of two vectors). \n\nIf $A$ is positive semi-definite, the eigendecomposition is an SVD.\nFurthermore, $A^{\\top}A=VD^{2}V^{\\top}$ and $AA^{\\top}=UD^{2}U^{\\top}$,\nso we can find the eigendecomposition of such matrices using the SVD of\n$A$ (for $AA^{\\top}$ we need to fill out $U$ to have $n$ columns). 
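\n\nAs a quick numerical check of this connection (a minimal sketch; the small random matrix is arbitrary and just for illustration), we can compare the SVD of $A$ with an eigendecomposition of $A^{\\top}A$ computed directly:\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\nrng = np.random.default_rng(0)\nA = rng.normal(size = (6, 4))    # n > m\n\nU, d, Vt = np.linalg.svd(A, full_matrices = False)\nevals, evecs = np.linalg.eigh(A.T @ A)   # eigh is for symmetric matrices\n\n## eigenvalues of A^T A (reordered to be decreasing) equal the squared singular values\nnp.allclose(np.sort(evals)[::-1], d ** 2)\n## and the corresponding eigenvectors match the right singular vectors (columns of V) up to sign\n```\n:::\n\n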
Note\nthat the squares of the singular values of $A$ are the eigenvalues of\n$A^{\\top}A$ and $AA^{\\top}$.\n\nWe can also fill out the matrices to get\n$$A=U_{n\\times n}D_{n\\times m}V_{m\\times m}^{\\top}$$ where the added\nrows and columns of $D$ are zero with the upper left block the\n$D_{k\\times k}$ from above.\n\n#### Uses\n\nThe SVD is an excellent way to determine a matrix rank and to construct\na pseudo-inverse ($A^{+}=VD^{+}U^{\\top})$.\n\nWe can use the SVD to approximate $A$ by taking\n$A\\approx\\tilde{A}=\\sum_{j=1}^{p}D_{jj}u_{j}v_{j}^{\\top}$ for $pj+p$ and the upper bandwidth is $q$ ($A_{ij}=0$\nfor $j>i+q$). An alternative to reducing to $Ux=b^{*}$ is to compute\n$A=LU$ and then do two solutions, $U^{-1}(L^{-1}b)$. One can show that\nthe computational complexity of the LU factorization is $O(npq)$ for\nbanded matrices, while solving the two triangular systems is $O(np+nq)$,\nso for small $p$ and $q$, the speedup can be dramatic.\n\nBanded matrices come up in time series analysis. E.g., moving average (MA) models produce\nbanded covariance structures because the covariance is zero after a\ncertain number of lags.\n\n## Low rank updates (optional)\n\nA transformation of the form $A-uv^{\\top}$ is a rank-one update because\n$uv^{\\top}$ is of rank one.\n\nMore generally a low rank update of $A$ is $\\tilde{A}=A-UV^{\\top}$ where\n$U$ and $V$ are $n\\times m$ with $n\\geq m$. The\nSherman-Morrison-Woodbury formula tells us that\n$$\\tilde{A}^{-1}=A^{-1}+A^{-1}U(I_{m}-V^{\\top}A^{-1}U)^{-1}V^{\\top}A^{-1}$$\nso if we know $x_{0}=A^{-1}b$, then the solution to $\\tilde{A}x=b$ is\n$x+A^{-1}U(I_{m}-V^{\\top}A^{-1}U)^{-1}V^{\\top}x$. Provided $m$ is not\ntoo large, and particularly if we already have a factorization of $A$,\nthen $A^{-1}U$ is not too bad computationally, and\n$I_{m}-V^{\\top}A^{-1}U$ is $m\\times m$. As a result\n$A^{-1}(U(\\cdots)^{-1}V^{\\top}x)$ isn't too bad.\n\nThis also comes up in working with precision matrices in Bayesian\nproblems where we may have $A^{-1}$ but not $A$ (we often add precision\nmatrices to find conditional normal distributions). An alternative\nexpression for the formula is $\\tilde{A}=A+UCV^{\\top}$, and the identity\ntells us\n$$\\tilde{A}^{-1}=A^{-1}-A^{-1}U(C^{-1}+V^{\\top}A^{-1}U)^{-1}V^{\\top}A^{-1}$$\n\nBasically Sherman-Morrison-Woodbury gives us matrix identities that we\ncan use in combination with our knowledge of smart ways of solving\nsystems of equations.\n\n# 7. Iterative solutions of linear systems (optional)\n\n#### Gauss-Seidel\n\nSuppose we want to iteratively solve $Ax=b$. Here's the algorithm, which\nsequentially updates each element of $x$ in turn.\n\n- Start with an initial approximation, $x^{(0)}$.\n- Hold all but $x_{1}^{(0)}$ constant and solve to find\n $x_{1}^{(1)}=\\frac{1}{a_{11}}(b_{1}-\\sum_{j=2}^{n}a_{1j}x_{j}^{(0)})$.\n- Repeat for the other rows of $A$ (i.e., the other elements of $x$),\n finding $x^{(1)}$.\n- Now iterate to get $x^{(2)}$, $x^{(3)}$, etc. until a convergence\n criterion is achieved, such as $\\|x^{(k)}-x^{(k-1)}\\|\\leq\\epsilon$\n or $\\|r^{(k)}-r^{(k-1)}\\|\\leq\\epsilon$ for $r^{(k)}=b-Ax^{(k)}$.\n\nLet's consider how many operations are involved in a single update:\n$O(n)$ for each element, so $O(n^{2})$ for each update. 
Thus if we can\nstop well before $n$ iterations, we've saved computation relative to\nexact methods.\n\nIf we decompose $A=L+D+U$ where $L$ is strictly lower triangular, $U$ is\nstrictly upper triangular, then Gauss-Seidel is equivalent to solving\n$$(L+D)x^{(k+1)}=b-Ux^{(k)}$$ and we know that solving the lower\ntriangular system is $O(n^{2})$.\n\nIt turns out that the rate of convergence depends on the spectral radius\nof $(L+D)^{-1}U$.\n\nGauss-Seidel amounts to optimizing by moving in axis-oriented\ndirections, so it can be slow in some cases.\n\n#### Conjugate gradient\n\nFor positive definite $A$, conjugate gradient (CG) reexpresses the\nsolution to $Ax=b$ as an optimization problem, minimizing\n$$f(x)=\\frac{1}{2}x^{\\top}Ax-x^{\\top}b,$$ since the derivative of $f(x)$\nis $Ax-b$ and at the minimum this gives $Ax-b=0$.\n\nInstead of finding the minimum by following the gradient at each step\n(so-called steepest descent, which can give slow convergence - we'll see\na demonstration of this in the optimization unit), CG chooses directions\nthat are mutually conjugate w.r.t. $A$, $d_{i}^{\\top}Ad_{j}=0$ for\n$i\\ne j$. The method successively chooses vectors giving the direction,\n$d_{k}$, in which to move down towards the minimum and a scaling of how\nmuch to move, $\\alpha_{k}$. If we start at $x_{(0)}$, the $k$th point we\nmove to is $x_{(k)}=x_{(k-1)}+\\alpha_{k}d_{k}$ so we have\n$$x_{(k)}=x_{(0)}+\\sum_{j\\leq k}\\alpha_{j}d_{j}$$ and we use a\nconvergence criterion such as given above for Gauss-Seidel. The\ndirections are chosen to be the residuals, $b-Ax_{(k)}$. Here's the\nbasic algorithm:\n\n- Choose $x_{(0)}$ and define the residual, $r_{(0)}=b-Ax_{(0)}$ (the\n error on the scale of $b$) and the direction, $d_{0}=r_{(0)}$ and\n set $k=0$.\n\n- Then iterate:\n\n - $\\alpha_{k}=\\frac{r_{(k)}^{\\top}r_{(k)}}{d_{k}^{\\top}Ad_{k}}$\n (choose step size so next error will be orthogonal to current\n direction - which we can express in terms of the residual, which\n is easily computable)\n\n - $x_{(k+1)}=x_{(k)}+\\alpha_{k}d_{k}$ (update current value)\n\n - $r_{(k+1)}=r_{(k)}-\\alpha_{k}Ad_{k}$ (update current residual)\n\n - $d_{k+1}=r_{(k+1)}+\\frac{r_{(k+1)}^{\\top}r_{(k+1)}}{r_{(k)}^{\\top}r_{(k)}}d_{k}$\n (choose next direction by conjugate Gram-Schmidt, starting with\n $r_{(k+1)}$ and removing components that are not $A$-orthogonal\n to previous directions, but it turns out that $r_{(k+1)}$ is\n already $A$-orthogonal to all but $d_{k}$).\n\n- Stop when $\\|r^{(k+1)}\\|$ is sufficiently small.\n\nThe convergence of the algorithm depends in a complicated way on the\neigenvalues, but in general convergence is faster when the condition\nnumber is smaller (the eigenvalues are not too spread out). 
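\n\nHere is a bare-bones implementation of the iteration just described (a sketch for illustration only; the test matrix and right-hand side are simulated, and a dense matrix is used even though CG is typically applied to large sparse problems):\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n\ndef cg(A, b, tol = 1e-10, max_iter = None):\n    ## conjugate gradient for symmetric positive definite A, following the steps above\n    n = len(b)\n    max_iter = n if max_iter is None else max_iter\n    x = np.zeros(n)\n    r = b - A @ x             # initial residual\n    d = r.copy()              # initial direction\n    for k in range(max_iter):\n        Ad = A @ d\n        rs_old = r @ r\n        alpha = rs_old / (d @ Ad)     # step size\n        x = x + alpha * d\n        r = r - alpha * Ad            # updated residual\n        if np.sqrt(r @ r) < tol:\n            break\n        beta = (r @ r) / rs_old\n        d = r + beta * d              # next direction, conjugate to the previous ones\n    return x\n\nrng = np.random.default_rng(3)\nM = rng.normal(size = (200, 200))\nA = M.T @ M + 200 * np.eye(200)    # ensure positive definiteness\nb = rng.normal(size = 200)\nx = cg(A, b)\nnp.allclose(A @ x, b)              # should be True\n```\n:::\n\nFor genuinely large problems, one would store $A$ in a sparse format and typically rely on a library implementation such as `scipy.sparse.linalg.cg`, which only needs a way to compute the product $Av$.\n\n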
CG will in\nprinciple give the exact answer in $n$ steps (where $A$ is $n\\times n$).\nHowever, computationally we lose accuracy, and the real interest in the\nalgorithm is as an iterative approximation where we stop well before $n$\nsteps. The approach basically amounts to moving in axis-oriented\ndirections in a space stretched by $A$.\n\nIn general, CG is used for large sparse systems.\n\nSee the [extensive description from\nShewchuk](http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf)\nfor more details, as well as the use\nof CG when $A$ is not positive definite.\n\n#### Updating a solution\n\nSometimes we have solved a system, $Ax=b$, and then need to solve $Ax=c$.\nIf we have solved the initial system using a factorization, we can reuse\nthat factorization and solve the new system in $O(n^{2})$. Iterative\napproaches can do a nice job if $c=b+\\delta b$. Start with the solution\n$x$ for $Ax=b$ as $x^{(0)}$ and use one of the methods above.\n",
    "supporting": [],
    "filters": [
      "rmarkdown/pagebreak.lua"
diff --git a/units/unit10-linalg.qmd b/units/unit10-linalg.qmd
index c4dea77..17e9369 100644
--- a/units/unit10-linalg.qmd
+++ b/units/unit10-linalg.qmd
@@ -1009,14 +1009,14 @@ diagonals to be positive). One algorithm for computing $U$ is:
 We can then solve a system of equations as: $U^{-1}(U^{\top-1}b)$.
 
-Confusingly, while numpy's `cholesky` gives $L = U^\top$, scipy
-can return either $L$ or $U$ but defaults to $U$.
+Confusingly, while numpy's `cholesky` gives $L = U^\top$, scipy's
+`cholesky` can return either $L$ or $U$ but defaults to $U$.
 
 Here are two ways we can use the Cholesky to solve a system of equations:
 
 ```{python, eval=FALSE}
 U = sp.linalg.cholesky(A)
-sp.linalg.solve_triangular(L.T,
+sp.linalg.solve_triangular(U,
                            sp.linalg.solve_triangular(U, b, lower=False, trans='T'),
                            lower=False)