-
Notifications
You must be signed in to change notification settings - Fork 14
/
slide_preprocessing.Rmd
441 lines (325 loc) · 17.2 KB
/
slide_preprocessing.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
---
title: "Data Preprocessing"
subtitle: "Workshop on RNA-Seq"
author: "`r paste0('<b>Roy Francis and Nima Rafati</b>')`"
institute: NBIS, SciLifeLab
keywords: bioinformatics, course, scilifelab, nbis
output:
xaringan::moon_reader:
encoding: 'UTF-8'
self_contained: false
chakra: 'assets/remark-latest.min.js'
css: 'assets/slide.css'
lib_dir: libs
nature:
ratio: '4:3'
highlightLanguage: r
highlightStyle: github
highlightLines: true
countIncrementalSlides: false
slideNumberFormat: "%current%/%total%"
include: NULL
---
exclude: true
count: false
```{r,echo=FALSE,child="assets/header-slide.Rmd"}
```
<!-- ------------ Only edit title, subtitle & author above this ------------ -->
```{r,include=FALSE}
# load the packages you need
library(dplyr)
library(tidyr)
#library(stringr)
library(ggplot2)
#library(plotly)
library(pheatmap)
library(DESeq2)
library(edgeR)
library(gridExtra)
```
---
name: raw
## Raw data
- Raw count table
```{r,eval=TRUE,echo=FALSE}
cr <- read.csv("./data/gene_counts_raw.csv",header=TRUE,stringsAsFactors=FALSE,row.names=1)
cr[1:6,1:6]
```
- Metadata
```{r,eval=TRUE,echo=FALSE}
mr <- read.csv("./data/metadata_raw.csv",header=TRUE,stringsAsFactors=F,row.names=1)
mr[1:6,]
```
???
A glimpse into the count table and metadata table. Before any downstream analyses, it is important to make sure that the count column names match the metadata row names. Make sure the column in both tables are of the correct data type; ie; numbers are numeric and groups are factors.
---
name: pp
## Filtering
- Remove genes and samples with low counts
```{r,echo=TRUE}
cf1 <- cr[rowSums(cr>0) >= 3, ] # Keep rows/genes that have at least one read in +3 samples
cf2 <- cr[rowSums(cr>3) >= 3, ] # Keep rows/genes that have at least three reads in +3 samples
cf3 <- cr[rowSums(edgeR::cpm(cr)>5) >= 3, ] # need at least three samples to have cpm > 5.
```
_count/read per million (cpm/rpm): a normalized value for sequencing depth._
- Inspect distribution
```{r,fig.height=2.7,fig.width=9,echo=FALSE}
h0 <- cr %>%
gather(key="sample",value="value") %>%
ggplot(aes(x=log2(value+1),group=sample))+
geom_density()+
labs(x=expression('Log'[10]~'Read counts'),y="Density",title="Raw")+
theme_bw()+
theme(panel.border=element_blank(),
axis.ticks=element_blank())
h1 <- cf1 %>%
gather(key="sample",value="value") %>%
ggplot(aes(x=log2(value+1),group=sample))+
geom_density()+
labs(x=expression('Log'[10]~'Read counts'),y="Density",title="Method 1")+
theme_bw()+
theme(panel.border=element_blank(),
axis.ticks=element_blank())
h2 <- cf2 %>%
gather(key="sample",value="value") %>%
ggplot(aes(x=log2(value+1),group=sample))+
geom_density()+
labs(x=expression('Log'[10]~'Read counts'),y="Density",title="Method 2")+
theme_bw()+
theme(panel.border=element_blank(),
axis.ticks=element_blank())
h3 <- cf3 %>%
gather(key="sample",value="value") %>%
ggplot(aes(x=log2(value+1),group=sample))+
geom_density()+
labs(x=expression('Log'[10]~'Read counts'),y="Density",title="Method 3")+
theme_bw()+
theme(panel.border=element_blank(),
axis.ticks=element_blank())
gridExtra::grid.arrange(h0,h1,h2,h3,nrow=1,ncol=4)
```
- Inspect the number of rows (genes) available after filtering
```{r,eval=TRUE,echo=FALSE}
cat(paste0("Raw: ",nrow(cr),", Method 1: ",nrow(cf1),", Method 2: ",nrow(cf2),", Method 3: ",nrow(cf3)))
cf <- cf3
rm(cf1,cf2,cf3)
```
???
Lowly expressed genes are removed to improve the signal-to-noise ratio. Genes not expressed in any sample can be removed completely as they negatively affect multiple testing correction. The stringency of low count filtering can be adjusted based on the researcher's preference. It is a trade-off between data quality vs data size.
In the above example, raw data has a huge number of zeros and the distribution of over higher value counts are barely visible. Three different levels of low count filtering are shown.
In method 1, the detection limit is set at 1 count, ie; any value above 0 is considered as an expressed gene. And we aim to have expression in atleast 3 samples.
In method 2, the detection limit is set at 2 counts, ie; any value above 2 is considered an expressed gene. A value below 2 is considered noise and is discarded and we aim to have 2 count expression in at least 3 samples. Note that changing the minimum limit of detection has a dramatic effect on the count distribution. Notice how many rows (genes) are discarded.
In method 3, the limit of detection is a bit less subjective. A count that is greater than 5 count per million reads is considered a positive detection. This is a stringent and fairly robust method at the expense of losing the most number of genes.
---
name: norm-1
## Normalisation
.pull-left-50[
- Removing technical biases in sequencing data (e.g. sequencing depth and gene length)
- Make counts comparable across features (genes).
- Make counts comparable across samples
<!-- Control for sequencing depth -->
![](data/normalization_methods_depth.png)
```{r,echo=FALSE}
dfr <- data.frame(A=c(20,25,15),B=c(6,6,4))
rownames(dfr) <- c("x","y","z")
dfr$A_tc <- round(dfr$A/sum(dfr$A),2)
dfr$B_tc <- round(dfr$B/sum(dfr$B),2)
dfr
```
]
???
Normalisation of raw count data is necessary to make samples comparable. The comparison may be within sample, within groups, between groups and possibly between studies.
Also, by sequencing deep we generate more read counts for a given gene and differences in gene length will result in an uneven counts for genes expressed in the same level.
**Total count normalisation**
Imagine two samples A and B with three genes x, y and z. A has higher counts than B. Are the genes highly expressed in A? Probably not. A was sequenced deeper which resulted in more reads overall. Controlling for sequencing depth is one of the the first steps with normalisation. Total count normalisation controls for sequencing depth. See columns A_tc and B_tc for the count values after total count normalisation. If the total number of starting mRNA is comparable between samples, then this value reflects absolute expression for each gene.
Quantile normalisation, Upper quartile normalisation and Median normalisation all work in a similar way. Rather than the total count, a high quantile value or the median is used.
--
.pull-right-50[
- Control for compositional bias
![](data/normalization_methods_composition.png)
```{r,echo=FALSE}
dfr <- data.frame(A=c(20,25,15,100),B=c(6,6,4,2))
rownames(dfr) <- c("x","y","z","de")
dfr$A_tc <- round(dfr$A/sum(dfr$A),2)
dfr$B_tc <- round(dfr$B/sum(dfr$B),2)
dfr
#dfr$s3_tc <- round(dfr$s3/sum(dfr$s3),2)
#sf <- apply(dfr[,1:3]/(apply(dfr[,1:3],1,mean)),2,median)
#cb <- round(t(t(dfr[,1:3])/sf),2); colnames(cb) <- c("s1_c","s2_c","s3_c")
#cbind(dfr,cb)
```
]
???
**Controlling for compositional bias**
In another scenario, imagine the same two samples and four genes x,y,z and de. This time, both samples are sequenced to the same depth and the gene de is highly overexpressed in A than in B. Now, look at the total count normalised value for gene y. They have different normalised values in A and B although A and B had identical expression for gene y. This effect of few highly overexpressed genes seemingly changing the relative expression of other genes is called compositional bias.
---
name: norm-2
## Normalisation
- Controlling for gene length: It can be useful for gene to gene comparisons.
.size-60[![](data/normalization_methods_length.png)]
```{r,echo=FALSE}
dfr <- data.frame(counts=c(50,25),gene_length=c(10,5))
dfr$norm_counts <- round(dfr$counts/dfr$gene_length,2)
rownames(dfr) <- c("x","y")
dfr
```
--
- Bring counts to a human-friendly scale
???
**Controlling for gene length**
For two genes X and Y within a sample A, the longer gene will produce more reads than the shorter gene. For comparing expression of these two genes to each other, they need to be controlled for gene length.
In this example, gene x has higher counts than y. But when controlled for gene length, they both have the same expression.
**Counts per million reads**
The last point with normalisation is to bring the numbers to a human friendly scale. This is the reason for the per million part of CPM, RPKM etc.
---
name: norm-3a
## Normalisation
**Normalisation by library size**
- Assumes total expression is the same under different experimental conditions
- Methods include TC, RPKM, FPKM, TPM
- RPKM, FPKM and TPM control for sequencing depth and gene length
- Total number of RPKM/FPKM normalized counts for each sample will be different, therefore, you cannot compare the normalized counts for each gene equally between samples.
- TPM is proportional to RPKM and enables better comparison between samples because total per sample sums to equal value
```{r,echo=FALSE}
dfr <- data.frame(A=c(20,25,4),B=c(6,6,15),len=c(2*10^3,4*10^3,1*10^3))
rownames(dfr) <- c("x","y","z")
dfr$A_rpm <- round(dfr$A/(sum(dfr$A)/(10^6)))
dfr$B_rpm <- round(dfr$B/(sum(dfr$B)/(10^6)))
dfr$A_rpkm <- round(dfr$A_rpm * 10^3/dfr$len,2)
dfr$B_rpkm <- round(dfr$B_rpm * 10^3/dfr$len,2)
dfr$A_rpk <- dfr$A * 10^3/dfr$len
dfr$B_rpk <- dfr$B * 10^3/dfr$len
dfr$A_tpm <- round((dfr$A_rpk*10^6)/sum(dfr$A_rpk))
dfr$B_tpm <- round((dfr$B_rpk*10^6)/sum(dfr$B_rpk))
cs <- as.data.frame(t(as.data.frame(colSums(dfr))))
rownames(cs) <- "sum"
rbind(dfr,cs)
```
rpm = cpm.
.citation[
.cite[<i class="fas fa-link"></i> Evans, Ciaran, Johanna Hardin, and Daniel M. Stoebel. "Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions." [Briefings in bioinformatics (2017)](https://arxiv.org/abs/1609.00959)]
]
???
Normalisation strategies can be roughly grouped into four approaches: Normalisation by library size, normalisation by distribution, normalisation by testing and normalisation by using controls.
Normalisation by library size is the most basic. TPM is probably the only method that must be used.
**RPKM**
RPKM (Reads per kilo base of transcript per million mapped reads) in each sample sums up to different total values which make it hard to compare a gene from one sample to another sample.
```
RPM or CPM = number of reads mapped to gene * 10^6 / total number of mapped reads
RPKM = RPM * 10^3/ features length
```
Features length can be transcript or gene. 10^3 is for gene length in Kb and 10^6 is for million reads.
**TPM**
TPM (Transcripts per million) in every sample sums up to 1 million. This makes it easier to compare a gene from one sample to another.
```
RPK = count * 10^3 / feature length
TPM = (RPK * 10^6)/total_RPK
```
Features length can be transcript or gene. 10^3 is for gene length in Kb and 10^6 is for million reads.
---
name: norm-3b
## Normalisation
**Normalisation by distribution**
- Assumes technical effects are same for DE and non-DE genes
- Assumes number of over and under-expressed genes are roughly same across conditions
- Corrects for compositional bias
- Methods include Q, UQ, M, RLE, TMM, MRN
- `edgeR::calcNormFactors()` implements TMM, TMMwsp, RLE & UQ
- `DESeq2::estimateSizeFactors()` implements relative log expression (RLE)
- Does not correct for gene length
- **[geTMM](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2246-7)** is gene length corrected TMM
```{r,echo=FALSE}
dfr <- data.frame(A=c(20,25,4),B=c(6,6,15),len=c(2*10^3,4*10^3,1*10^3))
rownames(dfr) <- c("x","y","z")
dfr$ref <- round(sqrt(dfr$A*dfr$B),2)
dfr$A_ratio <- round(dfr$A/dfr$ref,2)
dfr$B_ratio <- round(dfr$B/dfr$ref,2)
noA <- median(dfr$A_ratio)
noB <- median(dfr$B_ratio)
dfr$A_mrn <- dfr$A/noA
dfr$B_mrn <- dfr$B/noB
dfr
```
.citation[
.cite[<i class="fas fa-link"></i> Evans, Ciaran, Johanna Hardin, and Daniel M. Stoebel. "Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions." [Briefings in bioinformatics (2017)](https://arxiv.org/abs/1609.00959)]
]
???
**Quartile (Q)**
In this method expression values are scaled in a way to keep the order based on expression values in each sample and expression values are normalized based on mean value of top genes. Then after assigning top genes to calculated mean then this procedure will be repeated for second top genes. It is called Quantile Normalization because the normalized datasets have identical quantile.
**Upper Quartile (UQ)**
In Upper Quartile normalization, the expression values are divided by 75th quantile of the counts for each samples.
**Trimmed Mean of M-values (TMM)**
The assumption of this method is that, there are many genes which are not deferentially expressed. By considering one of the sample as reference TMM is calculated. First the most expressed genes and the genes with the largest log ratios is excluded and then TMM, as the weighted mean of log ratios between reference and other samples, are calculated. TMM values will be used as correction of the library size and should be close to **one**. TMM is implemented in **edgeR**.
**TMM with singleton pairing (TMMwsp)**
It is similar to TMM which recommended for data with a lot of unexpressed genes (zero values). In TMM, zero counts are ignored while in TMMwsp method the positive counts from such genes are reused to increase the number of features by which the libraries are compared.
**Relative Log Expression (RLE)**
Similar to TMM, in RLE the hypothesis is that majority of the genes are not DE. The scaling factor in RLE is calculated as the median of the ratio of its expression level (read count) over its geometric mean across all samples. This method is implemented in **DESeq** and **DESeq2**.
**MRN**
MRN method assumes most samples are not deferentially expressed
- Find a reference sample for each gene
- Divide each count by the reference sample to create a ratio
- Compute median of this ratio for each sample
- Divide counts by this median ratio
Normalisation by distribution controls for composition bias. Most software use a mix of different approaches.
---
name: norm-3c
## Normalisation
**Normalisation by testing**
- A more robust version of normalisation by distribution
- A set of non-DE genes are detected through hypothesis testing
- Tolerates a larger difference in number of over and under expressed genes between conditions
- Methods include PoissonSeq, DEGES
--
**Normalisation using Controls**
- Assumes controls are not affected by experimental condition and technical effects are similar to all other genes
- Useful in conditions with global shift in expression
- Controls could be house-keeping genes or spike-ins
- Methods include RUV, CLS
--
**Stabilizing variance**
- Variance is stabilised across the range of mean values
- Methods include VST, RLOG, VOOM
- For use in exploratory analyses. Not for DE.
- `vst()` and `rlog()` functions from *DESeq2*
- `voom()` function from *Limma* converts data to normal distribution
---
name: norm-4
exclude: true
## Normalisation
| Method | Description | Accounted factors | Recommendation |
| ---- | ---- | ---- | ---- |
| CPM/RPM | counts scaled by total number of reads | sequencing depth | compare counts between replicates of the same group; NOT for within sample comparisons or DE analysis |
| TPM| counts per length of transcript (kb) per million reads mapped | sequencing depth and gene length | gene count comparisons within a sample or between samples of the same sample group; NOT for DE analysis |
| RPKM/FPKM | similar to TPM | sequencing depth and gene length | compare counts between genes within a sample; NOT for between sample comparisons or DE analysis |
| DESeq2 MRN | counts divided by sample-specific size factors determined by median ratio of gene counts relative to geometric mean per gene | sequencing depth and RNA composition | compare counts between samples and for DE analysis; NOT for within sample comparisons |
| edgeR TMM | uses a weighted trimmed mean of the log expression ratios between samples | sequencing depth, RNA composition, and gene length | compare counts between and within samples and for DE analysis |
---
name: norm-5
## Normalisation
**Recommendations**
- Most tools use a mix of many different normalisations
- For DGE using DGE R packages (DESeq2, edgeR, Limma etc), use **raw counts**
- For visualisation (PCA, clustering, heatmaps etc), use VST or RLOG
- For own analysis with gene length correction, use TPM (maybe geTMM?)
- Custom solutions: spike-ins/house-keeping genes
--
.citation[
.cite[<i class="fas fa-link"></i> Dillies, Marie-Agnes, *et al*. "A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis." [Briefings in bioinformatics 14.6 (2013): 671-683](https://www.ncbi.nlm.nih.gov/pubmed/22988256)]
]
---
# Acknowledgements
- [Normalising RNA-seq data](https://www.ebi.ac.uk/sites/ebi.ac.uk/files/content.ebi.ac.uk/materials/2012/121029_HTS/ernest_turro_normalising_rna-seq_data.pdf) by Ernest Turro
- RNA-seq analysis [Bioconductor vignette](http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html)
---
name: end_slide
class: end-slide, middle
count: false
# Thank you. Questions?
```{r,echo=FALSE,child="assets/footer-slide.Rmd"}
```
```{r,include=FALSE,eval=FALSE}
# manually run this to render this document to HTML
#rmarkdown::render("slide_preprocessing.Rmd","xaringan::moon_reader")
# manually run this to convert HTML to PDF
#pagedown::chrome_print("presentation_dge.html",output="presentation_dge.pdf")
```