---
output: html_document
---
# Dealing with confounders
## Introduction
In the previous chapter we normalized for library size, effectively removing it as a confounder. Now we will consider removing other, less well-defined confounders from our data. Technical confounders (aka batch effects) can arise from differences in reagents, isolation methods, the lab/experimenter who performed the experiment, and even the day/time at which the experiment was performed. Accounting for technical confounders, and batch effects in particular, is a large topic that also involves principles of experimental design. Here we address approaches that can be taken to account for confounders when the experimental design is appropriate.
Fundamentally, accounting for technical confounders involves identifying and, ideally, removing sources of variation in the expression data that are not related to (i.e. are confounding) the biological signal of interest. Various approaches exist, some of which use spike-in or housekeeping genes, and some of which use endogenous genes.
The use of spike-ins as control genes is appealing, since the same amount of ERCC (or other) spike-in was added to each cell in our experiment. In principle, all the variability we observe for these genes is due to technical noise, whereas endogenous genes are affected by both technical noise and biological variability. Technical noise can be removed by fitting a model to the spike-ins and "subtracting" this from the endogenous genes. Several methods are available based on this premise (e.g. [BASiCS](https://github.com/catavallejos/BASiCS), [scLVM](https://github.com/PMBio/scLVM), [RUVg](http://bioconductor.org/packages/release/bioc/html/RUVSeq.html)), each using different noise models and fitting procedures. Alternatively, one can identify genes which exhibit significant variation beyond technical noise (e.g. distance to median, [highly variable genes](http://www.nature.com/nmeth/journal/v10/n11/full/nmeth.2645.html)). However, there are issues with the use of spike-ins for normalisation (particularly ERCCs, which are derived from bacterial sequences), including that their variability can, for various reasons, actually be *higher* than that of endogenous genes.
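To make this concrete, below is a highly simplified sketch of the idea behind the highly-variable-genes approach: fit the mean-CV^2 relationship on the spike-ins to model technical noise, then flag endogenous genes whose variability clearly exceeds that fit. The objects `counts_matrix` (a gene-by-cell count matrix) and `is_spike` (a logical vector marking spike-in rows) are placeholders, and the published method uses a more careful fit and a formal statistical test.

```{r, eval=FALSE}
# Illustrative sketch only; 'counts_matrix' and 'is_spike' are placeholders.
spike_counts <- counts_matrix[is_spike, ]
endo_counts <- counts_matrix[!is_spike, ]
# model technical noise: fit the CV^2 ~ 1/mean relationship on the spike-ins
spike_mean <- rowMeans(spike_counts)
spike_cv2 <- apply(spike_counts, 1, var) / spike_mean^2
fit <- glm(spike_cv2 ~ I(1 / spike_mean))
# flag endogenous genes whose CV^2 clearly exceeds the technical expectation
endo_mean <- rowMeans(endo_counts)
endo_cv2 <- apply(endo_counts, 1, var) / endo_mean^2
expected_cv2 <- coef(fit)[1] + coef(fit)[2] / endo_mean
highly_variable <- rownames(endo_counts)[endo_cv2 > 2 * expected_cv2]
```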
Given the issues with using spike-ins, better results can often be obtained by using endogenous genes instead. When we have a large number of endogenous genes that, on average, do not vary systematically between cells, and we expect technical effects to affect a large number of genes (a very common and reasonable assumption), then methods based on endogenous genes (for example, the RUVs method below) can perform well.
We explore both general approaches below.
```{r, echo=FALSE}
library(knitr)
opts_chunk$set(fig.align = 'center')
```
```{r, message=FALSE, warning=FALSE}
library(scRNA.seq.funcs)
library(RUVSeq)
library(scater, quietly = TRUE)
library(scran)
library(edgeR)
options(stringsAsFactors = FALSE)
# load the SCESet object and keep only the cells and genes flagged as usable
# during QC in the previous chapters
umi <- readRDS("blischak/umi.rds")
umi.qc <- umi[fData(umi)$use, pData(umi)$use]
# logical indices of the endogenous genes and of the ERCC spike-ins
endog_genes <- !fData(umi.qc)$is_feature_control
erccs <- fData(umi.qc)$is_feature_control
```
## Remove Unwanted Variation
Factors contributing to technical noise frequently appear as "batch effects", where cells processed on different days or by different technicians systematically vary from one another. Removing technical noise and correcting for batch effects can frequently be performed using the same tool or slight variants of it. We will be considering the [Remove Unwanted Variation (RUVSeq)](http://bioconductor.org/packages/RUVSeq) package. Briefly, RUVSeq works as follows. For $n$ samples and $J$ genes, consider the following generalized linear model (GLM), where the RNA-Seq read counts are regressed on both the known covariates of interest and unknown factors of unwanted variation:
\[\log E[Y|W,X,O] = W\alpha + X\beta + O\]
Here, $Y$ is the $n \times J$ matrix of observed gene-level read counts, $W$ is an $n \times k$ matrix corresponding to the factors of “unwanted variation”, $X$ is an $n \times p$ matrix corresponding to the $p$ covariates of interest (with $\alpha$ and $\beta$ their associated matrices of parameters), and $O$ is an $n \times J$ matrix of offsets that can either be set to zero or estimated with some other normalization procedure (such as upper-quartile normalization). The simultaneous estimation of $W$, $\alpha$, $\beta$, and $k$ is infeasible. Instead, for a given $k$, one of the following three approaches is used to estimate the factors of unwanted variation $W$:
* _RUVg_ uses negative control genes (e.g. ERCCs), assumed to have constant expression across samples;
* _RUVs_ uses centered (technical) replicate/negative control samples for which the covariates of interest are
constant;
* _RUVr_ uses residuals, e.g., from a first-pass GLM regression of the counts on the covariates of interest.
We will concentrate on the first two approaches.
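For reference, the third approach could be applied along the following lines. This sketch follows the general pattern of the RUVSeq vignette and is not used further in this chapter; the first-pass GLM on `individual` and the choice of $k = 1$ are illustrative assumptions.

```{r, eval=FALSE}
# RUVr sketch: estimate W from the deviance residuals of a first-pass GLM that
# regresses the counts on the covariate of interest (here, individual)
design_ind <- model.matrix(~umi.qc$individual)
y <- DGEList(counts = counts(umi.qc), group = umi.qc$individual)
y <- calcNormFactors(y, method = "upperquartile")
y <- estimateGLMCommonDisp(y, design_ind)
y <- estimateGLMTagwiseDisp(y, design_ind)
first_fit <- glmFit(y, design_ind)
first_res <- residuals(first_fit, type = "deviance")
# all genes can serve as controls, since the residuals already have the
# covariate of interest removed
ruvr <- RUVr(counts(umi.qc), rownames(umi.qc), k = 1, first_res)
```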
### RUVg
```{r, message=FALSE, warning=FALSE}
# RUVg: use the ERCC spike-ins as negative control genes to estimate and
# remove k factors of unwanted variation from the raw counts
ruvg <- RUVg(counts(umi.qc), erccs, k = 1)
set_exprs(umi.qc, "ruvg1") <- ruvg$normalizedCounts
ruvg <- RUVg(counts(umi.qc), erccs, k = 2)
set_exprs(umi.qc, "ruvg2") <- ruvg$normalizedCounts
# also store a log2, cell-normalised version of the k = 2 counts
set_exprs(umi.qc, "ruvg2_logcpm") <- log2(t(t(ruvg$normalizedCounts) /
                                              colSums(ruvg$normalizedCounts)) + 1)
```
### RUVs
```{r}
# RUVs requires a matrix (scIdx) in which each row lists the indices of a set
# of "replicate" samples, i.e. cells for which the covariate of interest
# (individual) is constant; unused entries are set to -1
scIdx <- matrix(-1, ncol = max(table(umi.qc$individual)), nrow = 3)
tmp <- which(umi.qc$individual == "NA19098")
scIdx[1, 1:length(tmp)] <- tmp
tmp <- which(umi.qc$individual == "NA19101")
scIdx[2, 1:length(tmp)] <- tmp
tmp <- which(umi.qc$individual == "NA19239")
scIdx[3, 1:length(tmp)] <- tmp
# all genes are used as control genes for RUVs
cIdx <- rownames(umi.qc)
ruvs <- RUVs(counts(umi.qc), cIdx, k = 1, scIdx = scIdx, isLog = FALSE)
set_exprs(umi.qc, "ruvs1") <- ruvs$normalizedCounts
ruvs <- RUVs(counts(umi.qc), cIdx, k = 2, scIdx = scIdx, isLog = FALSE)
set_exprs(umi.qc, "ruvs2") <- ruvs$normalizedCounts
# also store a log2, cell-normalised version of the k = 2 counts
set_exprs(umi.qc, "ruvs2_logcpm") <- log2(t(t(ruvs$normalizedCounts) /
                                              colSums(ruvs$normalizedCounts)) + 1)
```
## Effectiveness 1
We evaluate the effectiveness of the normalization by inspecting the PCA plot, where colour corresponds to the technical replicates (batches) and shape corresponds to the different biological samples (individuals). Separation of biological samples combined with interspersed batches indicates that technical variation has been removed.
```{r}
plotPCA(
umi.qc[endog_genes, ],
colour_by = "batch",
size_by = "total_features",
shape_by = "individual",
exprs_values = "ruvg1") +
ggtitle("PCA - RUVg normalisation: k = 1")
plotPCA(
umi.qc[endog_genes, ],
colour_by = "batch",
size_by = "total_features",
shape_by = "individual",
exprs_values = "ruvg2") +
ggtitle("PCA - RUVg normalisation: k = 2")
plotPCA(
umi.qc[endog_genes, ],
colour_by = "batch",
size_by = "total_features",
shape_by = "individual",
exprs_values = "ruvs1") +
ggtitle("PCA - RUVs normalisation: k = 1")
plotPCA(
umi.qc[endog_genes, ],
colour_by = "batch",
size_by = "total_features",
shape_by = "individual",
exprs_values = "ruvs2") +
ggtitle("PCA - RUVs normalisation: k = 2")
plotPCA(
umi.qc[endog_genes, ],
colour_by = "batch",
size_by = "total_features",
shape_by = "individual",
exprs_values = "ruvs2_logcpm") +
ggtitle("PCA - RUVs normalisation log2-cpm: k = 2")
```
Plotting the log2-normalised CPM from RUVs with k = 2 appears to give the best separation of cells by individual.
## Effectiveness 2
We can also examine the effectiveness of correction using the relative log expression (RLE) across cells to confirm technical noise has been removed from the dataset.
```{r}
boxplot(
list(
"Raw counts" = calc_cell_RLE(counts(umi.qc), erccs),
"RUVg (k = 1)" = calc_cell_RLE(assayData(umi.qc)$ruvg1, erccs),
"RUVg (k = 2)" = calc_cell_RLE(assayData(umi.qc)$ruvg2, erccs),
"RUVs (k = 1)" = calc_cell_RLE(assayData(umi.qc)$ruvs1, erccs),
"RUVs (k = 2)" = calc_cell_RLE(assayData(umi.qc)$ruvs2, erccs)
)
)
```
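For intuition, a per-cell RLE statistic can be computed roughly as follows. This is a simplified sketch and not necessarily identical to the `calc_cell_RLE()` implementation in `scRNA.seq.funcs`: for each cell it takes the median log-ratio of that cell's counts to the gene-wise median counts, over genes with a non-zero median.

```{r}
# Simplified sketch of a per-cell RLE statistic (for illustration only)
cell_rle_sketch <- function(counts_matrix) {
    gene_medians <- apply(counts_matrix, 1, median)
    keep_genes <- gene_medians > 0
    apply(counts_matrix[keep_genes, ], 2, function(cell) {
        median(log2(cell + 1) - log2(gene_medians[keep_genes] + 1))
    })
}
```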
## Effectiveness 3
Another way of evaluating the effectiveness of correction is to look at the differentially expressed (DE) genes among the batches of the same individual. Theoretically, these batches should not differ from each other. Let's take the most promising individual (__NA19101__, whose batches are the closest to each other) and check whether this is true.
For demonstration purposes we will only use a subset of cells. You should not do that with your real dataset, though.
```{r}
# randomly sample 20 cells from each of the three replicate plates of NA19101
keep <- c(
    sample(which(umi.qc$batch == "NA19101.r1"), 20),
    sample(which(umi.qc$batch == "NA19101.r2"), 20),
    sample(which(umi.qc$batch == "NA19101.r3"), 20)
)
design <- model.matrix(~umi.qc[, keep]$batch)
```
We will use the [edgeR](http://bioconductor.org/packages/edgeR) package to calculate DE genes between plates for this particular individual. Recall that the input data for edgeR (and similar methods like DESeq2) must always be raw counts.
The particular coefficient that we test for DE in each case below identifies genes that show a difference in expression between replicate plate 3 and replicate plate 1.
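As a quick sanity check of which coefficient is being tested, we can inspect the columns of the design matrix; by default `glmLRT()` tests the last coefficient, which here corresponds to replicate plate 3 versus the baseline plate 1.

```{r}
colnames(design)
```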
### DE (raw counts)
```{r}
dge1 <- DGEList(
    counts = counts(umi.qc[, keep]),
    norm.factors = rep(1, length(keep)),  # no normalisation applied at this stage
    group = umi.qc[, keep]$batch
)
dge1 <- estimateDisp(dge1, design = design, trend.method = "none")
plotBCV(dge1)
fit1 <- glmFit(dge1, design)
res1 <- glmLRT(fit1)
topTags(res1)
summary(decideTestsDGE(res1))
plotSmear(
res1, lowess = TRUE,
de.tags = rownames(topTags(res1, n = sum(abs(decideTestsDGE(res1))))$table)
)
```
### DE (RUVg, k = 2)
```{r}
# add the RUVg factors of unwanted variation (k = 2) as covariates in the design
design_ruvg <- model.matrix(~ruvg$W[keep,] + umi.qc[, keep]$batch)
head(design_ruvg)
dge_ruvg <- estimateDisp(dge1, design = design_ruvg, trend.method = "none")
plotBCV(dge_ruvg)
fit2 <- glmFit(dge_ruvg, design_ruvg)
res2 <- glmLRT(fit2)
topTags(res2)
summary(decideTestsDGE(res2))
plotSmear(
res2, lowess = TRUE,
de.tags = rownames(topTags(res2, n = sum(abs(decideTestsDGE(res2))))$table)
)
```
### DE (RUVs, k = 2)
```{r}
# add the RUVs factors of unwanted variation (k = 2) as covariates in the design
design_ruvs <- model.matrix(~ruvs$W[keep,] + umi.qc[, keep]$batch)
head(design_ruvs)
dge_ruvs <- estimateDisp(dge1, design = design_ruvs, trend.method = "none")
plotBCV(dge_ruvs)
fit3 <- glmFit(dge_ruvs, design_ruvs)
res3 <- glmLRT(fit3)
topTags(res3)
summary(decideTestsDGE(res3))
plotSmear(
res3, lowess = TRUE,
de.tags = rownames(topTags(res3, n = sum(abs(decideTestsDGE(res3))))$table)
)
```
In the above analyses, we have ignored size factors between cells. A typical edgeR analysis would always include these.
```{r}
# compute scran pooling-based size factors and supply them to edgeR as
# normalisation factors
umi.qc <- scran::computeSumFactors(umi.qc, sizes = 15)
dge_ruvs$samples$norm.factors <- sizeFactors(umi.qc)[keep]
dge_ruvs_sf <- estimateDisp(dge_ruvs, design = design_ruvs, trend.method = "none")
plotBCV(dge_ruvs_sf)
fit4 <- glmFit(dge_ruvs_sf, design_ruvs)
res4 <- glmLRT(fit4)
topTags(res4)
summary(decideTestsDGE(res4))
plotSmear(
res4, lowess = TRUE,
de.tags = rownames(topTags(res4, n = sum(abs(decideTestsDGE(res4))))$table)
)
```
## Exercise
Perform the same analysis with the read counts of the Blischak data. Use the `blischak/reads.rds` file to load the reads SCESet object. Once you have finished, please compare your results to ours (next chapter). Additionally, experiment with other combinations of normalisations and compare the results.
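A possible starting point, assuming the reads object carries the same QC annotations as the UMI object used above:

```{r, eval=FALSE}
# not run: exercise starter, mirroring the UMI setup at the top of this chapter
reads <- readRDS("blischak/reads.rds")
reads.qc <- reads[fData(reads)$use, pData(reads)$use]
endog_genes <- !fData(reads.qc)$is_feature_control
erccs <- fData(reads.qc)$is_feature_control
```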