forked from statOmics/HDDA
-
Notifications
You must be signed in to change notification settings - Fork 0
/
lda.Rmd
408 lines (297 loc) · 14.3 KB
/
lda.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
---
title: "Linear Discriminant Analysis (LDA)"
author: "Lieven Clement"
date: "statOmics, Ghent University (https://statomics.github.io)"
output:
bookdown::pdf_document2:
toc: true
number_sections: true
latex_engine: xelatex
---
```{r, child="_setup.Rmd"}
```
```{r}
library(tidyverse)
library(gridExtra)
```
# Breast cancer example
- Schmidt *et al.*, 2008, Cancer Research, **68**, 5405-5413
- Gene expression patterns in n=200 breast tumors were investigated (p=22283 genes)
- After surgery the tumors were graded by a pathologist (stage 1,2,3)
## Data
```{r, message=FALSE, warning=FALSE}
#BiocManager::install("genefu")
#BiocManager::install("breastCancerMAINZ")
library(genefu)
library(breastCancerMAINZ)
data(mainz)
X <- t(exprs(mainz)) # gene expressions
n <- nrow(X)
H <- diag(n)-1/n*matrix(1,ncol=n,nrow=n)
X <- H%*%X
Y <- pData(mainz)$grade
table(Y)
svdX <- svd(X)
k <- 2
Zk <- svdX$u[,1:k] %*% diag(svdX$d[1:k])
colnames(Zk) <- paste0("Z",1:k)
Zk %>%
as.data.frame %>%
mutate(grade = Y %>% as.factor) %>%
ggplot(aes(x= Z1, y = Z2, color = grade)) +
geom_point(size = 3)
```
# Linear discriminant analysis
Fisher's construction of LDA is simple: it allows for classification in a dimension-reduced subspace of $\mathbb{R}^p$.
First we assume that $Y$ can only take two values (0/1).
Fisher aimed for a direction, say $\mathbf{a}$, in the $p$-dimensional predictor space such that the orthogonal projections of the predictors, $\mathbf{x}^t\mathbf{a}$, show maximal ratio between the between and within sums of squares:
\[
\mathbf{v} = \text{ArgMax}_a \frac{\mathbf{a}^t\mathbf{B}\mathbf{a}}{\mathbf{a}^t\mathbf{W}\mathbf{a}} \text{ subject to } {\mathbf{a}^t\mathbf{W}\mathbf{a}}=1,
\]
where $\mathbf{W}$ and $\mathbf{B}$ are the within and between covariance matrices of $\mathbf{x}$.
The restriction is introduced to obtain a (convenient) unique solution.
---
## Between and within sums of squares
- In the training dataset, let $\mathbf{x}_{ik}$ denote the $i$th $p$-dimensional observation in the $k$th group ($k=0,1$ referring to $Y=0$ and $Y=1$, resp.), $i=1,\ldots, n_k$.
- Let $z_{ik}=\mathbf{a}^t\mathbf{x}_{ik}$ denote the orthogonal projection of $\mathbf{x}_{ik}$ onto $\mathbf{a}$
- For the one-dimensional $z$-observations, consider the following sum of squares:
\[
\text{SSE}=\text{within sum of squares} = \sum_{k=0,1}\sum_{i=1}^{n_k} (z_{ik}-\bar{z}_k)^2
\]
\[
\text{SSB}=\text{between sum of squares} = \sum_{k=0,1}\sum_{i=1}^{n_k} (\bar{z}_{k}-\bar{z})^2 = \sum_{k=0,1} n_k (\bar{z}_{k}-\bar{z})^2
\]
with $\bar{z}_k$ the sample mean of $z_{ik}$ within group $k$, and $\bar{z}$ the sample mean of all $z_{ik}$.
- To reformulate SSE and SSB in terms of the $p$-dimensional $\mathbf{x}_{ik}$, we need the sample means
\[
\bar{z}_k = \frac{1}{n_k} \sum_{i=1}^{n_k} z_{ik} = \frac{1}{n_k} \sum_{i=1}^{n_k} \mathbf{a}^t\mathbf{x}_{ik} = \mathbf{a}^t \frac{1}{n_k} \sum_{i=1}^{n_k} \mathbf{x}_{ik} = \mathbf{a}^t \bar{\mathbf{x}}_k
\]
\[
\bar{z} = \frac{1}{n}\sum_{k=0,1}\sum_{i=1}^{n_k} z_{ik} = \cdots = \mathbf{a}^t\bar{\mathbf{x}}.
\]
- SSE becomes
\[
\text{SSE} = \sum_{k=0,1}\sum_{i=1}^{n_k} (z_{ik}-\bar{z}_k)^2 = \mathbf{a}^t \left(\sum_{k=0,1}\sum_{i=1}^{n_k} (\mathbf{x}_{ik}-\bar{\mathbf{x}}_k)(\mathbf{x}_{ik}-\bar{\mathbf{x}}_k)^t\right)\mathbf{a}
\]
- SSB becomes
\[
\text{SSB} = \sum_{k=0,1} n_k (\bar{z}_{k}-\bar{z})^2 = \mathbf{a}^t \left(\sum_{k=0,1} n_k (\bar{\mathbf{x}_{k}}-\bar{\mathbf{x}})(\bar{\mathbf{x}_{k}}-\bar{\mathbf{x}})^t \right)\mathbf{a}
\]
---
- The $p \times p$ matrix
\[
\mathbf{W}=\sum_{k=0,1}\sum_{i=1}^{n_k} (\mathbf{x}_{ik}-\bar{\mathbf{x}}_k)(\mathbf{x}_{ik}-\bar{\mathbf{x}}_k)^t
\]
is referred to as the matrix of **within** sum of squares and cross products.
- The $p \times p$ matrix
\[
\mathbf{B}=\sum_{k=0,1} n_k (\bar{\mathbf{x}_{k}}-\bar{\mathbf{x}})(\bar{\mathbf{x}_{k}}-\bar{\mathbf{x}})^t
\]
is referred to as the matrix of **between** sum of squares and cross products.
- Note that on the diagonal of $\mathbf{W}$ and $\mathbf{B}$ you find the ordinary univariate within and between sums of squares of the individual components of $\mathbf{x}$.
## Obtain projections
An equivalent formulation:
\[
\mathbf{v} = \text{ArgMax}_a \mathbf{a}^t\mathbf{B}\mathbf{a} \text{ subject to } \mathbf{a}^t\mathbf{W}\mathbf{a}=1.
\]
This can be solved by introducing a Langrange multiplier:
\[
\mathbf{v} = \text{ArgMax}_a \mathbf{a}^t\mathbf{B}\mathbf{a} -\lambda(\mathbf{a}^t\mathbf{W}\mathbf{a}-1).
\]
Calculating the partial derivative w.r.t. $\mathbf{a}$ and setting it to zero gives
\begin{eqnarray*}
2\mathbf{B}\mathbf{a} -2\lambda \mathbf{W}\mathbf{a} &=& 0\\
\mathbf{B}\mathbf{a} &=& \lambda \mathbf{W}\mathbf{a} \\
\mathbf{W}^{-1}\mathbf{B}\mathbf{a} &=& \lambda\mathbf{a}.
\end{eqnarray*}
From the final equation we recognise that $\mathbf{v}=\mathbf{a}$ is an **eigenvector** of $\mathbf{W}^{-1}\mathbf{B}$, and $\lambda$ is the corresponding **eigenvalue**.
The equation has in general $\text{rank}(\mathbf{W}^{-1}\mathbf{B})$ solutions. In the case of two classes, the rank equals 1 and thus only one solution exists.
---
- A training data set is used for the calculation of $\mathbf{W}$ and $\mathbf{B}$. $\longrightarrow$ This gives the eigenvector $\mathbf{v}$
- The training data is also used for the calculation of the centroids of the classes (e.g. the sample means, say $\bar{\mathbf{x}}_1$ and $\bar{\mathbf{x}}_2$). $\longrightarrow$ The projected centroids are given by $\bar{\mathbf{x}}_1^t\mathbf{v}$ and $\bar{\mathbf{x}}_2^t\mathbf{v}$.
- A new observation with predictor $\mathbf{x}$ is classified in the class for which the projected centroid is closest to the projected predictor $z=\mathbf{x}^t\mathbf{v}$.
An advantage of this approach is that $\mathbf{v}$ can be interpreted (similar as the loadings in a PCA) in terms of which predictors $x_j$ are important to discriminate between classes 0 and 1.
---
## More than two classes
When the outcome $Y$ refers to more than two classes, say $m$ classes, then Fisher's method is constructed in exactly the same way. Now
\[
\mathbf{W}^{-1}\mathbf{B}\mathbf{a} = \lambda\mathbf{a}
\]
will have $r=\text{rank}(\mathbf{W}^{-1}\mathbf{B}) = \min(m-1,p,n)$ solutions (eigenvectors and eigenvalues). ($n$: sample size of training data)
Let $\mathbf{v}_j$ and $\lambda_j$ denote the $r$ solutions, and define
- $\mathbf{V}$: $p\times r$ matrix with collums $\mathbf{v}$
- $\mathbf{L}$: $r \times r$ diagonal matrix with elements $\lambda_1 > \lambda_2 > \cdots > \lambda_r$
The $p$-dimensional predictor data in $\mathbf{X}$ may then be transformed to the $r$-dimensional scores
\[
\mathbf{Z} = \mathbf{X}\mathbf{V}.
\]
---
For eigenvectors $\mathbf{v}_i$ and $\mathbf{v}_j$, it holds that
\[
\text{cov}\left[Z_i,Z_j\right] = \text{cov}\left[\mathbf{X}\mathbf{v}_i,\mathbf{X}\mathbf{v}_j\right]= \mathbf{v}_i^t \mathbf{W} \mathbf{v}_j = \delta_{ij} ,
\]
in which the covariances are defined within groups. Hence, within the groups (classes) the scores are uncorrelated.
# High dimensional predictors
With high-dimensional predictors
- Replace the $p\times p$ matrices $\mathbf{W}$ and $\mathbf{B}$ by their diagonal matrices (i.e. put zeroes on the off-diagonal positions)
- **Sparse LDA** by imposing an $L_1$-penalty on $\mathbf{v}$.
Two approaches: Zhou *et al.* (2006), Journal of Computational and Graphical Statistics , **15**, 265-286, and [Clemmensen *et al.* (2011), Technometrics, **53**](https://web.stanford.edu/~hastie/Papers/sda_resubm_daniela-final.pdf).
# Breast cancer example
## All genes
### LDA
- Fisher's LDA is illustrated on the breast cancer data with all three tumor stages as outcome.
- We try to discriminate between the different stages according to the gene expression data of all genes.
- We cache the result because the calculation takes 10 minutes.
```{r run-lda, cache=TRUE}
breast.lda <- MASS::lda(x = X, grouping = Y)
```
```{r}
Vlda <- breast.lda$scaling
colnames(Vlda) <- paste0("V",1:ncol(Vlda))
Zlda <- X%*%Vlda
colnames(Zlda) <- paste0("Z",1:ncol(Zlda))
grid.arrange(
Zlda %>%
as.data.frame %>%
mutate(grade = Y %>% as.factor) %>%
ggplot(aes(x= Z1, y = Z2, color = grade)) +
geom_point(size = 3) +
coord_fixed(),
ggplot() +
geom_bar(aes(x = c("z1","z2"), y = breast.lda$svd), stat = "identity") +
xlab("Discriminant") +
ylab("Eigen Values"),
layout_matrix = matrix(
c(1,1,2),
nrow=1)
)
```
---
- The columns of the matrix $\mathbf{V}$ contain the eigenvectors. There are $\min(3-2,22283,200)=2$ eigenvectors. The $200\times 2$ matrix $\mathbf{Z}$ contains the scores on the two Fisher discriminants.
- The eigenvalue $\lambda_j$ can be interpreted as the ratio
\[
\frac{\mathbf{v}_j^t\mathbf{B}\mathbf{v}_j}{\mathbf{v}_j^t\mathbf{W}\mathbf{v}_j} ,
\]
or (upon using $\mathbf{v}_j^t\mathbf{W}\mathbf{v}_j=1$) the between-centroid sum of squares (in the reduced dimension space of the Fisher discriminants)
\[
\mathbf{v}_j^t\mathbf{B}\mathbf{v}_j.
\]
- From the screeplot of the eigenvalues we see that the first dimension is more important than the second (not hugely) in terms of discriminating between the groups.
- From the scatterplot we can see that there is no perfect separation (discrimination) between the three tumor stages (quite some overlap).
- To some extent the first Fisher discriminant dimension discriminates stage 3 (green dots) from the other two stages, and the second dimension separates stage 1 (black dots) from the two others.
### Interpretation of loadings
```{r}
grid.arrange(
Vlda %>%
as.data.frame %>%
mutate(geneID = 1:nrow(Vlda)) %>%
ggplot(aes(x = geneID, y = V1)) +
geom_point(pch=21) +
geom_hline(yintercept = c(-2,0,2)*sd(Vlda[,1]), col = "red"),
Vlda %>%
as.data.frame %>%
mutate(geneID = 1:nrow(Vlda)) %>%
ggplot(aes(x = geneID, y = V2)) +
geom_point(pch=21) +
geom_hline(yintercept = c(-2,0,2)*sd(Vlda[,2]), col = "red"),
ncol = 2)
```
The loadings of the Fisher discriminants are within the columns of the $\mathbf{V}$ matrix.
- Since we have 22283 genes, each discriminant is a linear combination of 22283 gene expression. Instead of looking at the listing of 22283 loadings, we made an index plot (no particular ordering of genes on horizontal axis).
- The red horizontal reference lines correspond to the average of the loading (close to zero) and the average plus and minus twice the standard deviation of the loadings.
- If no genes had any "significant" discriminating power, then we would expect approximately $95\%$ of all loadings within the band. Thus loadings outside of the band are of potential interest and may perhaps be discriminating between the three tumor stages.
- In the graphs presented here we see many loadings within the bands, but also many outside of the band.
We repeat the analysis, but now with the sparse LDA method of Clemmensen et al. (2011).
---
## Sparse LDA based on 150 random genes
- We only present the results of the sparse LDA based on a random subset of 150 genes (ordering of genes in datamatrix is random).
- The discrimination seems better than with classical LDA based on all genes. This is very likely caused by too much noise in the full data matrix with over 20000 predictors.
```{r}
# BiocManager::install("sparseLDA")
library(sparseLDA)
YDummy <- data.frame(
Y1 = ifelse(Y == 1, 1, 0),
Y2 = ifelse(Y == 2, 1, 0),
Y3 = ifelse(Y == 3, 1, 0)
)
X2 <- X[,1:150]
breast.slda <- sda(x = X2,
y = as.matrix(YDummy),
lambda = 1e-6,
stop = -50,
maxIte = 25,
trace = TRUE)
Vsda <- matrix(0, nrow=ncol(X2), ncol=2)
Vsda[breast.slda$varIndex,] <- breast.slda$beta
colnames(Vsda) <- paste0("V",1:ncol(Vsda))
Zsda <- X2%*%Vsda
colnames(Zsda) <- paste0("Z",1:ncol(Zsda))
Zsda %>%
as.data.frame %>%
mutate(grade = Y %>% as.factor) %>%
ggplot(aes(x= Z1, y = Z2, color = grade)) +
geom_point(size = 3) +
ggtitle("sparse LDA on 150 genes")
grid.arrange(
Vsda %>%
as.data.frame %>%
mutate(geneID = 1:nrow(Vsda)) %>%
ggplot(aes(x = geneID, y = V1)) +
geom_point(pch=21) +
geom_hline(yintercept = c(-2,0,2)*sd(Vsda[,1]), col = "red") ,
Vsda %>%
as.data.frame %>%
mutate(geneID = 1:nrow(Vsda)) %>%
ggplot(aes(x = geneID, y = V2)) +
geom_point(pch=21) +
geom_hline(yintercept = c(-2,0,2)*sd(Vsda[,2]), col = "red"),
ncol = 2)
```
## LDA based on 150 random genes
```{r}
breast.lda150 <- MASS::lda(x = X2, grouping = Y)
Vlda <- breast.lda150$scaling
colnames(Vlda) <- paste0("V",1:ncol(Vlda))
Zlda <- X2%*%Vlda
colnames(Zlda) <- paste0("Z",1:ncol(Zlda))
grid.arrange(
Zlda %>%
as.data.frame %>%
mutate(grade = Y %>% as.factor) %>%
ggplot(aes(x= Z1, y = Z2, color = grade)) +
geom_point(size = 3) +
coord_fixed(),
ggplot() +
geom_bar(aes(x = c("z1","z2"), y = breast.lda150$svd), stat = "identity") +
xlab("Discriminant") +
ylab("Eigen Values"),
layout_matrix = matrix(
c(1,1,2),
nrow=1)
)
grid.arrange(
Vlda %>%
as.data.frame %>%
mutate(geneID = 1:nrow(Vlda)) %>%
ggplot(aes(x = geneID, y = V1)) +
geom_point(pch=21) +
geom_hline(yintercept = c(-2,0,2)*sd(Vlda[,1]), col = "red"),
Vlda %>%
as.data.frame %>%
mutate(geneID = 1:nrow(Vlda)) %>%
ggplot(aes(x = geneID, y = V2)) +
geom_point(pch=21) +
geom_hline(yintercept = c(-2,0,2)*sd(Vlda[,2]), col = "red"),
ncol = 2)
```
## Wrapup
LDA on all 22283 genes gave poorer result than on 150 genes.
This is probably caused by numerical instability when working with large $\mathbf{W}$ and $\mathbf{B}$ matrices
- Sparse LDA gave slightly poorer result than LDA on the subset of 150 genes. This may be caused by overfitting of the LDA.
- When (sparse) LDA is used to build a prediction model/classifier, then CV methods, or splitting of dataset into training and test datasets should be used to allow for an honest evaluation of the final prediction model.
- The graphs in the first two Fisher discriminant dimensions shown on the previous slides should only be used for data exploration.
- When the objective is to try to understand differences between groups in a high dimensional space, Fisher LDA is preferred over PCA.
# Acknowledgement {-}
- Olivier Thas for sharing his materials of Analysis of High Dimensional Data 2019-2020, which I used as the starting point for this chapter.
```{r, child="_session-info.Rmd"}
```