lda.Rmd

---
title: "Linear Discriminant Analysis (LDA)"
author: "Lieven Clement"
date: "statOmics, Ghent University (https://statomics.github.io)"
output:
  bookdown::pdf_document2:
    toc: true
    number_sections: true
    latex_engine: xelatex
---

```{r, child="_setup.Rmd"}
```

```{r}
library(tidyverse)
library(gridExtra)
```

# Breast cancer example

- Schmidt *et al.*, 2008, Cancer Research, **68**, 5405-5413

- Gene expression patterns in n=200 breast tumors were investigated (p=22283 genes)

- After surgery the tumors were graded by a pathologist (stage 1,2,3)


## Data

```{r, message=FALSE, warning=FALSE}
#BiocManager::install("genefu")
#BiocManager::install("breastCancerMAINZ")

library(genefu)
library(breastCancerMAINZ)
data(mainz)

X <- t(exprs(mainz)) # gene expressions
n <- nrow(X)
H <- diag(n)-1/n*matrix(1,ncol=n,nrow=n)
X <- H%*%X
Y <- pData(mainz)$grade
table(Y)

svdX <- svd(X)
k <- 2
Zk <- svdX$u[,1:k] %*% diag(svdX$d[1:k])
colnames(Zk) <- paste0("Z",1:k)

Zk %>%
  as.data.frame %>%
  mutate(grade = Y %>% as.factor) %>%
  ggplot(aes(x= Z1, y = Z2, color = grade)) +
  geom_point(size = 3)
```

# Linear discriminant analysis

Fisher's construction of LDA is simple: it allows for classification in a dimension-reduced subspace of $\mathbb{R}^p$.

  First we assume that $Y$ can only take two values (0/1).

  Fisher aimed for a direction, say $\mathbf{a}$, in the $p$-dimensional predictor space such that the orthogonal projections of the predictors, $\mathbf{x}^t\mathbf{a}$, show maximal ratio between the between and within sums of squares:
  \[
  \mathbf{v} = \text{ArgMax}_a \frac{\mathbf{a}^t\mathbf{B}\mathbf{a}}{\mathbf{a}^t\mathbf{W}\mathbf{a}} \text{ subject to } {\mathbf{a}^t\mathbf{W}\mathbf{a}}=1,
 \]
 where $\mathbf{W}$ and $\mathbf{B}$ are the within and between covariance matrices of $\mathbf{x}$.
 The restriction is introduced to obtain a (convenient) unique solution.

---

## Between and within sums of squares


- In the training dataset, let $\mathbf{x}_{ik}$ denote the $i$th $p$-dimensional observation in the $k$th group ($k=0,1$ referring to $Y=0$ and $Y=1$, resp.), $i=1,\ldots, n_k$.

- Let $z_{ik}=\mathbf{a}^t\mathbf{x}_{ik}$ denote the orthogonal projection of $\mathbf{x}_{ik}$ onto $\mathbf{a}$

- For the one-dimensional $z$-observations, consider the following sum of squares:
  \[
    \text{SSE}=\text{within sum of squares} = \sum_{k=0,1}\sum_{i=1}^{n_k} (z_{ik}-\bar{z}_k)^2
  \]
  \[
    \text{SSB}=\text{between sum of squares} = \sum_{k=0,1}\sum_{i=1}^{n_k} (\bar{z}_{k}-\bar{z})^2 = \sum_{k=0,1} n_k (\bar{z}_{k}-\bar{z})^2
  \]
  with $\bar{z}_k$ the sample mean of $z_{ik}$ within group $k$, and $\bar{z}$ the sample mean of all $z_{ik}$.

- To reformulate SSE and SSB in terms of the $p$-dimensional $\mathbf{x}_{ik}$, we need the sample means
  \[
    \bar{z}_k = \frac{1}{n_k} \sum_{i=1}^{n_k} z_{ik} = \frac{1}{n_k} \sum_{i=1}^{n_k} \mathbf{a}^t\mathbf{x}_{ik} = \mathbf{a}^t \frac{1}{n_k} \sum_{i=1}^{n_k} \mathbf{x}_{ik} = \mathbf{a}^t \bar{\mathbf{x}}_k
  \]
  \[
    \bar{z} = \frac{1}{n}\sum_{k=0,1}\sum_{i=1}^{n_k} z_{ik} = \cdots = \mathbf{a}^t\bar{\mathbf{x}}.
  \]

- SSE becomes
  \[
    \text{SSE} = \sum_{k=0,1}\sum_{i=1}^{n_k} (z_{ik}-\bar{z}_k)^2 = \mathbf{a}^t \left(\sum_{k=0,1}\sum_{i=1}^{n_k} (\mathbf{x}_{ik}-\bar{\mathbf{x}}_k)(\mathbf{x}_{ik}-\bar{\mathbf{x}}_k)^t\right)\mathbf{a}
  \]

- SSB becomes
  \[
    \text{SSB} = \sum_{k=0,1} n_k (\bar{z}_{k}-\bar{z})^2 = \mathbf{a}^t \left(\sum_{k=0,1} n_k (\bar{\mathbf{x}_{k}}-\bar{\mathbf{x}})(\bar{\mathbf{x}_{k}}-\bar{\mathbf{x}})^t \right)\mathbf{a}
  \]

---

- The $p \times p$ matrix
   \[
    \mathbf{W}=\sum_{k=0,1}\sum_{i=1}^{n_k} (\mathbf{x}_{ik}-\bar{\mathbf{x}}_k)(\mathbf{x}_{ik}-\bar{\mathbf{x}}_k)^t
   \]
   is referred to as the matrix of **within** sum of squares and cross products.

- The $p \times p$ matrix
   \[
    \mathbf{B}=\sum_{k=0,1} n_k (\bar{\mathbf{x}_{k}}-\bar{\mathbf{x}})(\bar{\mathbf{x}_{k}}-\bar{\mathbf{x}})^t
   \]
   is referred to as the matrix of **between** sum of squares and cross products.

- Note that on the diagonal of $\mathbf{W}$ and $\mathbf{B}$ you find the ordinary univariate within and between sums of squares of the individual components of $\mathbf{x}$.

## Obtain projections

An equivalent formulation:
  \[
  \mathbf{v} = \text{ArgMax}_a \mathbf{a}^t\mathbf{B}\mathbf{a} \text{ subject to } \mathbf{a}^t\mathbf{W}\mathbf{a}=1.
 \]

 This can be solved by introducing a Langrange multiplier:
  \[
  \mathbf{v} = \text{ArgMax}_a \mathbf{a}^t\mathbf{B}\mathbf{a} -\lambda(\mathbf{a}^t\mathbf{W}\mathbf{a}-1).
 \]

 Calculating the partial derivative w.r.t. $\mathbf{a}$ and setting it to zero gives
 \begin{eqnarray*}
   2\mathbf{B}\mathbf{a} -2\lambda \mathbf{W}\mathbf{a} &=& 0\\
   \mathbf{B}\mathbf{a}  &=& \lambda \mathbf{W}\mathbf{a} \\
   \mathbf{W}^{-1}\mathbf{B}\mathbf{a} &=& \lambda\mathbf{a}.
 \end{eqnarray*}

 From the final equation we recognise that $\mathbf{v}=\mathbf{a}$ is an **eigenvector** of $\mathbf{W}^{-1}\mathbf{B}$, and $\lambda$ is the corresponding **eigenvalue**.

 The equation has in general $\text{rank}(\mathbf{W}^{-1}\mathbf{B})$ solutions. In the case of two classes, the rank equals 1 and thus only one solution exists.

---

- A training data set is used for the calculation of $\mathbf{W}$ and $\mathbf{B}$. $\longrightarrow$ This gives the eigenvector $\mathbf{v}$

- The training data is also used for the calculation of the centroids of the classes (e.g. the sample means, say $\bar{\mathbf{x}}_1$ and $\bar{\mathbf{x}}_2$). $\longrightarrow$ The projected centroids are given by $\bar{\mathbf{x}}_1^t\mathbf{v}$ and $\bar{\mathbf{x}}_2^t\mathbf{v}$.

- A new observation with predictor $\mathbf{x}$ is classified in the class for which the projected centroid is closest to the projected predictor $z=\mathbf{x}^t\mathbf{v}$.

An advantage of this approach is that $\mathbf{v}$ can be interpreted (similar as the loadings in a PCA) in terms of which predictors $x_j$ are important to discriminate between classes 0 and 1.

---

## More than two classes

When the outcome $Y$ refers to more than two classes, say $m$ classes, then Fisher's method is constructed in exactly the same way. Now
\[
  \mathbf{W}^{-1}\mathbf{B}\mathbf{a} = \lambda\mathbf{a}
\]
will have $r=\text{rank}(\mathbf{W}^{-1}\mathbf{B}) = \min(m-1,p,n)$ solutions (eigenvectors and eigenvalues). ($n$: sample size of training data)

Let $\mathbf{v}_j$ and $\lambda_j$ denote the $r$ solutions, and define

- $\mathbf{V}$: $p\times r$ matrix with collums $\mathbf{v}$

- $\mathbf{L}$: $r \times r$ diagonal matrix with elements $\lambda_1 > \lambda_2 > \cdots > \lambda_r$

The $p$-dimensional predictor data in $\mathbf{X}$ may then be transformed to the $r$-dimensional scores
\[
  \mathbf{Z} = \mathbf{X}\mathbf{V}.
\]

---

For eigenvectors $\mathbf{v}_i$ and $\mathbf{v}_j$, it holds that
\[
 \text{cov}\left[Z_i,Z_j\right] = \text{cov}\left[\mathbf{X}\mathbf{v}_i,\mathbf{X}\mathbf{v}_j\right]= \mathbf{v}_i^t \mathbf{W} \mathbf{v}_j = \delta_{ij} ,
\]
in which the covariances are defined within groups. Hence, within the groups (classes) the scores are uncorrelated.

# High dimensional predictors

With high-dimensional predictors

- Replace the $p\times p$ matrices $\mathbf{W}$ and $\mathbf{B}$ by their diagonal matrices (i.e. put zeroes on the off-diagonal positions)

-  **Sparse LDA** by imposing an $L_1$-penalty on $\mathbf{v}$.

Two approaches: Zhou *et al.* (2006), Journal of Computational and Graphical Statistics , **15**, 265-286, and [Clemmensen *et al.* (2011), Technometrics, **53**](https://web.stanford.edu/~hastie/Papers/sda_resubm_daniela-final.pdf).

# Breast cancer example

## All genes

### LDA

- Fisher's LDA is illustrated on the breast cancer data with all three tumor stages as outcome.

- We try to discriminate between the different stages according to the gene expression data of all genes.

- We cache the result because the calculation takes 10 minutes.

```{r run-lda, cache=TRUE}
breast.lda <- MASS::lda(x = X, grouping = Y)
```

```{r}
Vlda <- breast.lda$scaling
colnames(Vlda) <- paste0("V",1:ncol(Vlda))

Zlda <- X%*%Vlda
colnames(Zlda) <- paste0("Z",1:ncol(Zlda))

grid.arrange(
  Zlda %>%
    as.data.frame %>%
    mutate(grade = Y %>% as.factor) %>%
    ggplot(aes(x= Z1, y = Z2, color = grade)) +
    geom_point(size = 3) +
    coord_fixed(),
  ggplot() +
    geom_bar(aes(x = c("z1","z2"), y = breast.lda$svd), stat = "identity") +
    xlab("Discriminant") +
    ylab("Eigen Values"),
  layout_matrix = matrix(
    c(1,1,2),
    nrow=1)
  )
```

---

- The columns of the matrix $\mathbf{V}$ contain the eigenvectors. There are $\min(3-2,22283,200)=2$ eigenvectors. The $200\times 2$ matrix $\mathbf{Z}$ contains the scores on the two Fisher discriminants.

- The eigenvalue $\lambda_j$ can be interpreted as the ratio
\[
  \frac{\mathbf{v}_j^t\mathbf{B}\mathbf{v}_j}{\mathbf{v}_j^t\mathbf{W}\mathbf{v}_j} ,
\]
or (upon using $\mathbf{v}_j^t\mathbf{W}\mathbf{v}_j=1$) the between-centroid sum of squares (in the reduced dimension space of the Fisher discriminants)
\[
  \mathbf{v}_j^t\mathbf{B}\mathbf{v}_j.
\]

- From the screeplot of the eigenvalues we see that the first dimension is more important than the second (not hugely) in terms of discriminating between the groups.

- From the scatterplot we can see that there is no perfect separation (discrimination) between the three tumor stages (quite some overlap).

- To some extent the first Fisher discriminant dimension discriminates stage 3 (green dots) from the other two stages, and the second dimension separates stage 1 (black dots) from the two others.

### Interpretation of loadings

```{r}
grid.arrange(
  Vlda %>%
    as.data.frame %>%
    mutate(geneID = 1:nrow(Vlda)) %>%
    ggplot(aes(x = geneID, y = V1)) +
    geom_point(pch=21) +
    geom_hline(yintercept = c(-2,0,2)*sd(Vlda[,1]), col = "red"),

  Vlda %>%
    as.data.frame %>%
    mutate(geneID = 1:nrow(Vlda)) %>%
    ggplot(aes(x = geneID, y = V2)) +
    geom_point(pch=21) +
    geom_hline(yintercept = c(-2,0,2)*sd(Vlda[,2]), col = "red"),
  ncol = 2)
```

The loadings of the Fisher discriminants are within the columns of the $\mathbf{V}$ matrix.

- Since we have 22283 genes, each discriminant is a linear combination of 22283 gene expression. Instead of looking at the listing of 22283 loadings, we made an index plot (no particular ordering of genes on horizontal axis).

- The red horizontal reference lines correspond to the average of the loading (close to zero) and the average plus and minus twice the standard deviation of the loadings.

- If no genes had any "significant" discriminating power, then we would expect approximately $95\%$ of all loadings within the band. Thus loadings outside of the band are of potential interest and may perhaps be discriminating between the three tumor stages.

- In the graphs presented here we see many loadings within the bands, but also many outside of the band.

We repeat the analysis, but now with the sparse LDA method of Clemmensen et al. (2011).

---

## Sparse LDA based on 150 random genes

- We only present the results of the sparse LDA based on a random subset of 150 genes (ordering of genes in datamatrix is random).

- The discrimination seems better than with classical LDA based on all genes. This is very likely caused by too much noise in the full data matrix with over 20000 predictors.

```{r}
# BiocManager::install("sparseLDA")

library(sparseLDA)

YDummy <- data.frame(
  Y1 = ifelse(Y == 1, 1, 0),
  Y2 = ifelse(Y == 2, 1, 0),
  Y3 = ifelse(Y == 3, 1, 0)
  )

X2 <- X[,1:150]

breast.slda <- sda(x = X2,
  y = as.matrix(YDummy),
  lambda = 1e-6,
  stop = -50,
  maxIte = 25,
  trace = TRUE)

Vsda <- matrix(0, nrow=ncol(X2), ncol=2)
Vsda[breast.slda$varIndex,] <- breast.slda$beta
colnames(Vsda) <- paste0("V",1:ncol(Vsda))

Zsda <- X2%*%Vsda
colnames(Zsda) <- paste0("Z",1:ncol(Zsda))

Zsda %>%
 as.data.frame %>%
 mutate(grade = Y %>% as.factor) %>%
 ggplot(aes(x= Z1, y = Z2, color = grade)) +
 geom_point(size = 3) +
 ggtitle("sparse LDA on 150 genes")

grid.arrange(
   Vsda %>%
     as.data.frame %>%
     mutate(geneID = 1:nrow(Vsda)) %>%
     ggplot(aes(x = geneID, y = V1)) +
     geom_point(pch=21) +
     geom_hline(yintercept = c(-2,0,2)*sd(Vsda[,1]), col = "red") ,

   Vsda %>%
     as.data.frame %>%
     mutate(geneID = 1:nrow(Vsda)) %>%
     ggplot(aes(x = geneID, y = V2)) +
     geom_point(pch=21) +
     geom_hline(yintercept = c(-2,0,2)*sd(Vsda[,2]), col = "red"),
   ncol = 2)
```

## LDA based on 150 random genes

```{r}
breast.lda150 <- MASS::lda(x = X2, grouping = Y)

Vlda <- breast.lda150$scaling
colnames(Vlda) <- paste0("V",1:ncol(Vlda))

Zlda <- X2%*%Vlda
colnames(Zlda) <- paste0("Z",1:ncol(Zlda))

grid.arrange(
  Zlda %>%
    as.data.frame %>%
    mutate(grade = Y %>% as.factor) %>%
    ggplot(aes(x= Z1, y = Z2, color = grade)) +
    geom_point(size = 3) +
    coord_fixed(),
  ggplot() +
    geom_bar(aes(x = c("z1","z2"), y = breast.lda150$svd), stat = "identity") +
    xlab("Discriminant") +
    ylab("Eigen Values"),
  layout_matrix = matrix(
    c(1,1,2),
    nrow=1)
  )

  grid.arrange(
    Vlda %>%
      as.data.frame %>%
      mutate(geneID = 1:nrow(Vlda)) %>%
      ggplot(aes(x = geneID, y = V1)) +
      geom_point(pch=21) +
      geom_hline(yintercept = c(-2,0,2)*sd(Vlda[,1]), col = "red"),

    Vlda %>%
      as.data.frame %>%
      mutate(geneID = 1:nrow(Vlda)) %>%
      ggplot(aes(x = geneID, y = V2)) +
      geom_point(pch=21) +
      geom_hline(yintercept = c(-2,0,2)*sd(Vlda[,2]), col = "red"),
    ncol = 2)
  ```

## Wrapup

LDA on all 22283 genes gave poorer result than on 150 genes.
This is probably caused by numerical instability when working with large $\mathbf{W}$ and $\mathbf{B}$ matrices

- Sparse LDA gave slightly poorer result than LDA on the subset of 150 genes. This may be caused by overfitting of the LDA.

- When (sparse) LDA is used to build a prediction model/classifier, then CV methods, or splitting of dataset into training and test datasets should be used to allow for an honest evaluation of the final prediction model.

- The graphs in the first two Fisher discriminant dimensions shown on the previous slides should only be used for data exploration.

- When the objective is to try to understand differences between groups in a high dimensional space, Fisher LDA is preferred over PCA.


# Acknowledgement {-}

- Olivier Thas for sharing his materials of Analysis of High Dimensional Data 2019-2020, which I used as the starting point for this chapter.

```{r, child="_session-info.Rmd"}
```