slide_preprocessing.Rmd

---
title: "Data Preprocessing"
subtitle: "Workshop on RNA-Seq"
author: "`r paste0('<b>Roy Francis and Nima Rafati</b>')`"
institute: NBIS, SciLifeLab
keywords: bioinformatics, course, scilifelab, nbis
output:
  xaringan::moon_reader:
    encoding: 'UTF-8'
    self_contained: false
    chakra: 'assets/remark-latest.min.js'
    css: 'assets/slide.css'
    lib_dir: libs
    nature:
      ratio: '4:3'
      highlightLanguage: r
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
      slideNumberFormat: "%current%/%total%"
    include: NULL
---
exclude: true
count: false

```{r,echo=FALSE,child="assets/header-slide.Rmd"}
```

<!-- ------------ Only edit title, subtitle & author above this ------------ -->

```{r,include=FALSE}
# load the packages you need
library(dplyr)
library(tidyr)
#library(stringr)
library(ggplot2)
#library(plotly)
library(pheatmap)
library(DESeq2)
library(edgeR)
library(gridExtra)
```

---
name: raw

## Raw data

- Raw count table

```{r,eval=TRUE,echo=FALSE}
cr <- read.csv("./data/gene_counts_raw.csv",header=TRUE,stringsAsFactors=FALSE,row.names=1)
cr[1:6,1:6]
```

- Metadata

```{r,eval=TRUE,echo=FALSE}
mr <- read.csv("./data/metadata_raw.csv",header=TRUE,stringsAsFactors=F,row.names=1)
mr[1:6,]
```

???

A glimpse into the count table and metadata table. Before any downstream analyses, it is important to make sure that the count column names match the metadata row names. Make sure the column in both tables are of the correct data type; ie; numbers are numeric and groups are factors.

---
name: pp

## Filtering

- Remove genes and samples with low counts

```{r,echo=TRUE}
cf1 <- cr[rowSums(cr>0) >= 3, ] # Keep rows/genes that have at least one read in +3 samples 
cf2 <- cr[rowSums(cr>3) >= 3, ] # Keep rows/genes that have at least three reads in +3 samples 
cf3 <- cr[rowSums(edgeR::cpm(cr)>5) >= 3, ] # need at least  three samples to have cpm > 5. 
```
_count/read per million (cpm/rpm): a normalized value for sequencing depth._  
- Inspect distribution

```{r,fig.height=2.7,fig.width=9,echo=FALSE}
h0 <- cr %>%
  gather(key="sample",value="value") %>%
  ggplot(aes(x=log2(value+1),group=sample))+
  geom_density()+
  labs(x=expression('Log'[10]~'Read counts'),y="Density",title="Raw")+
  theme_bw()+
  theme(panel.border=element_blank(),
        axis.ticks=element_blank())

h1 <- cf1 %>%
  gather(key="sample",value="value") %>%
  ggplot(aes(x=log2(value+1),group=sample))+
  geom_density()+
  labs(x=expression('Log'[10]~'Read counts'),y="Density",title="Method 1")+
  theme_bw()+
  theme(panel.border=element_blank(),
        axis.ticks=element_blank())

h2 <- cf2 %>%
  gather(key="sample",value="value") %>%
  ggplot(aes(x=log2(value+1),group=sample))+
  geom_density()+
  labs(x=expression('Log'[10]~'Read counts'),y="Density",title="Method 2")+
  theme_bw()+
  theme(panel.border=element_blank(),
        axis.ticks=element_blank())

h3 <- cf3 %>%
  gather(key="sample",value="value") %>%
  ggplot(aes(x=log2(value+1),group=sample))+
  geom_density()+
  labs(x=expression('Log'[10]~'Read counts'),y="Density",title="Method 3")+
  theme_bw()+
  theme(panel.border=element_blank(),
        axis.ticks=element_blank())

gridExtra::grid.arrange(h0,h1,h2,h3,nrow=1,ncol=4)
```

- Inspect the number of rows (genes) available after filtering

```{r,eval=TRUE,echo=FALSE}
cat(paste0("Raw: ",nrow(cr),", Method 1: ",nrow(cf1),", Method 2: ",nrow(cf2),", Method 3: ",nrow(cf3)))
cf <- cf3
rm(cf1,cf2,cf3)
```

???

Lowly expressed genes are removed to improve the signal-to-noise ratio. Genes not expressed in any sample can be removed completely as they negatively affect multiple testing correction. The stringency of low count filtering can be adjusted based on the researcher's preference. It is a trade-off between data quality vs data size.

In the above example, raw data has a huge number of zeros and the distribution of over higher value counts are barely visible. Three different levels of low count filtering are shown.

In method 1, the detection limit is set at 1 count, ie; any value above 0 is considered as an expressed gene. And we aim to have expression in atleast 3 samples.

In method 2, the detection limit is set at 2 counts, ie; any value above 2 is considered an expressed gene. A value below 2 is considered noise and is discarded and we aim to have 2 count expression in at least 3 samples. Note that changing the minimum limit of detection has a dramatic effect on the count distribution. Notice how many rows (genes) are discarded.

In method 3, the limit of detection is a bit less subjective. A count that is greater than 5 count per million reads is considered a positive detection. This is a stringent and fairly robust method at the expense of losing the most number of genes.

---
name: norm-1

## Normalisation

.pull-left-50[
- Removing technical biases in sequencing data (e.g. sequencing depth and gene length)
- Make counts comparable across features (genes). 
- Make counts comparable across samples

<!-- Control for sequencing depth -->

![](data/normalization_methods_depth.png)

```{r,echo=FALSE}
dfr <- data.frame(A=c(20,25,15),B=c(6,6,4))
rownames(dfr) <- c("x","y","z")
dfr$A_tc <- round(dfr$A/sum(dfr$A),2)
dfr$B_tc <- round(dfr$B/sum(dfr$B),2)
dfr
```
]

???

Normalisation of raw count data is necessary to make samples comparable. The comparison may be within sample, within groups, between groups and possibly between studies. 
Also, by sequencing deep we generate more read counts for a given gene and differences in gene length will result in an uneven counts for genes expressed in the same level.  

**Total count normalisation**

Imagine two samples A and B with three genes x, y and z. A has higher counts than B. Are the genes highly expressed in A? Probably not. A was sequenced deeper which resulted in more reads overall. Controlling for sequencing depth is one of the the first steps with normalisation. Total count normalisation controls for sequencing depth. See columns A_tc and B_tc for the count values after total count normalisation. If the total number of starting mRNA is comparable between samples, then this value reflects absolute expression for each gene.

Quantile normalisation, Upper quartile normalisation and Median normalisation all work in a similar way. Rather than the total count, a high quantile value or the median is used.

--

.pull-right-50[
- Control for compositional bias
![](data/normalization_methods_composition.png)

```{r,echo=FALSE}
dfr <- data.frame(A=c(20,25,15,100),B=c(6,6,4,2))
rownames(dfr) <- c("x","y","z","de")
dfr$A_tc <- round(dfr$A/sum(dfr$A),2)
dfr$B_tc <- round(dfr$B/sum(dfr$B),2)
dfr
#dfr$s3_tc <- round(dfr$s3/sum(dfr$s3),2)
#sf <- apply(dfr[,1:3]/(apply(dfr[,1:3],1,mean)),2,median)
#cb <- round(t(t(dfr[,1:3])/sf),2); colnames(cb) <- c("s1_c","s2_c","s3_c")
#cbind(dfr,cb)
```
]

???

**Controlling for compositional bias**

In another scenario, imagine the same two samples and four genes x,y,z and de. This time, both samples are sequenced to the same depth and the gene de is highly overexpressed in A than in B. Now, look at the total count normalised value for gene y. They have different normalised values in A and B although A and B had identical expression for gene y. This effect of few highly overexpressed genes seemingly changing the relative expression of other genes is called compositional bias.

---
name: norm-2

## Normalisation

- Controlling for gene length: It can be useful for gene to gene comparisons.
.size-60[![](data/normalization_methods_length.png)]

```{r,echo=FALSE}
dfr <- data.frame(counts=c(50,25),gene_length=c(10,5))
dfr$norm_counts <- round(dfr$counts/dfr$gene_length,2)
rownames(dfr) <- c("x","y")
dfr
```

--

- Bring counts to a human-friendly scale

???

**Controlling for gene length**

For two genes X and Y within a sample A, the longer gene will produce more reads than the shorter gene. For comparing expression of these two genes to each other, they need to be controlled for gene length.

In this example, gene x has higher counts than y. But when controlled for gene length, they both have the same expression.

**Counts per million reads**

The last point with normalisation is to bring the numbers to a human friendly scale. This is the reason for the per million part of CPM, RPKM etc. 

---
name: norm-3a

## Normalisation

**Normalisation by library size**

- Assumes total expression is the same under different experimental conditions
- Methods include TC, RPKM, FPKM, TPM
- RPKM, FPKM and TPM control for sequencing depth and gene length
- Total number of RPKM/FPKM normalized counts for each sample will be different, therefore, you cannot compare the normalized counts for each gene equally between samples.
- TPM is proportional to RPKM and enables better comparison between samples because total per sample sums to equal value

```{r,echo=FALSE}
dfr <- data.frame(A=c(20,25,4),B=c(6,6,15),len=c(2*10^3,4*10^3,1*10^3))
rownames(dfr) <- c("x","y","z")
dfr$A_rpm <- round(dfr$A/(sum(dfr$A)/(10^6)))
dfr$B_rpm <- round(dfr$B/(sum(dfr$B)/(10^6)))
dfr$A_rpkm <- round(dfr$A_rpm * 10^3/dfr$len,2)
dfr$B_rpkm <- round(dfr$B_rpm * 10^3/dfr$len,2)
dfr$A_rpk <- dfr$A * 10^3/dfr$len
dfr$B_rpk <- dfr$B * 10^3/dfr$len
dfr$A_tpm <- round((dfr$A_rpk*10^6)/sum(dfr$A_rpk))
dfr$B_tpm <- round((dfr$B_rpk*10^6)/sum(dfr$B_rpk))

cs <- as.data.frame(t(as.data.frame(colSums(dfr))))
rownames(cs) <- "sum"
rbind(dfr,cs)
```

rpm = cpm.  

.citation[
.cite[<i class="fas fa-link"></i> Evans, Ciaran, Johanna Hardin, and Daniel M. Stoebel. "Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions." [Briefings in bioinformatics (2017)](https://arxiv.org/abs/1609.00959)]  
]

???

Normalisation strategies can be roughly grouped into four approaches: Normalisation by library size, normalisation by distribution, normalisation by testing and normalisation by using controls.

Normalisation by library size is the most basic. TPM is probably the only method that must be used.

**RPKM**

RPKM (Reads per kilo base of transcript per million mapped reads) in each sample sums up to different total values which make it hard to compare a gene from one sample to another sample.

```
RPM or CPM = number of reads mapped to gene * 10^6 / total number of mapped reads
RPKM = RPM * 10^3/ features length
```
Features length can be transcript or gene. 10^3 is for gene length in Kb and 10^6 is for million reads.  
**TPM**

TPM (Transcripts per million) in every sample sums up to 1 million. This makes it easier to compare a gene from one sample to another.

```
RPK = count * 10^3 / feature length
TPM = (RPK * 10^6)/total_RPK
```
Features length can be transcript or gene. 10^3 is for gene length in Kb and 10^6 is for million reads.  
---
name: norm-3b

## Normalisation

**Normalisation by distribution**

- Assumes technical effects are same for DE and non-DE genes
- Assumes number of over and under-expressed genes are roughly same across conditions
- Corrects for compositional bias
- Methods include Q, UQ, M, RLE, TMM, MRN
- `edgeR::calcNormFactors()` implements TMM, TMMwsp, RLE & UQ
- `DESeq2::estimateSizeFactors()` implements relative log expression (RLE)
- Does not correct for gene length
- **[geTMM](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2246-7)** is gene length corrected TMM


```{r,echo=FALSE}
dfr <- data.frame(A=c(20,25,4),B=c(6,6,15),len=c(2*10^3,4*10^3,1*10^3))
rownames(dfr) <- c("x","y","z")

dfr$ref <- round(sqrt(dfr$A*dfr$B),2)
dfr$A_ratio <- round(dfr$A/dfr$ref,2)
dfr$B_ratio <- round(dfr$B/dfr$ref,2)
noA <- median(dfr$A_ratio)
noB <- median(dfr$B_ratio)
dfr$A_mrn <- dfr$A/noA
dfr$B_mrn <- dfr$B/noB
dfr
```

.citation[
.cite[<i class="fas fa-link"></i> Evans, Ciaran, Johanna Hardin, and Daniel M. Stoebel. "Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions." [Briefings in bioinformatics (2017)](https://arxiv.org/abs/1609.00959)]  
]

???


**Quartile (Q)**  
In this method expression values are scaled in a way to keep the order based on expression values in each sample and expression values are normalized based on mean value of top genes. Then after assigning top genes to calculated mean then this procedure will be repeated for second top genes. It is called Quantile Normalization because the normalized datasets have identical quantile.  

**Upper Quartile (UQ)**  
In Upper Quartile normalization, the expression values are divided by 75th quantile of the counts for each samples.  

**Trimmed Mean of M-values (TMM)**  
The assumption of this method is that, there are many genes which are not deferentially expressed. By considering one of the sample as reference TMM is calculated. First the most expressed genes and the genes with the largest log ratios is excluded and then TMM, as the weighted mean of log ratios between reference and other samples, are calculated. TMM values will be used as correction of the library size and should be close to **one**. TMM is implemented in **edgeR**.   

**TMM with singleton pairing (TMMwsp)**  
It is similar to TMM which recommended for data with a lot of unexpressed genes (zero values).  In TMM, zero counts are ignored while in TMMwsp method the positive counts from such genes are reused to increase the number of features by which the libraries are compared.  

**Relative Log Expression (RLE)**  
Similar to TMM, in RLE the hypothesis is that majority of the genes are not DE. The scaling factor in RLE is calculated as the median of the ratio of its expression level (read count) over its geometric mean across all samples. This method is implemented in **DESeq** and **DESeq2**.  

**MRN**

MRN method assumes most samples are not deferentially expressed

- Find a reference sample for each gene  
- Divide each count by the reference sample to create a ratio
- Compute median of this ratio for each sample
- Divide counts by this median ratio

Normalisation by distribution controls for composition bias. Most software use a mix of different approaches.

---
name: norm-3c

## Normalisation

**Normalisation by testing**

- A more robust version of normalisation by distribution 
- A set of non-DE genes are detected through hypothesis testing
- Tolerates a larger difference in number of over and under expressed genes between conditions
- Methods include PoissonSeq, DEGES

--

**Normalisation using Controls**

- Assumes controls are not affected by experimental condition and technical effects are similar to all other genes
- Useful in conditions with global shift in expression
- Controls could be house-keeping genes or spike-ins
- Methods include RUV, CLS

--

**Stabilizing variance**
- Variance is stabilised across the range of mean values
- Methods include VST, RLOG, VOOM
- For use in exploratory analyses. Not for DE.
- `vst()` and `rlog()` functions from *DESeq2*
- `voom()` function from *Limma* converts data to normal distribution

---
name: norm-4
exclude: true

## Normalisation

| Method |	Description |	Accounted factors	| Recommendation |
| ---- | ---- | ---- | ---- |
| CPM/RPM	| counts scaled by total number of reads 	| sequencing depth |	compare counts between replicates of the same group; NOT for within sample comparisons or DE analysis |
| TPM|	counts per length of transcript (kb) per million reads mapped |	sequencing depth and gene length |	gene count comparisons within a sample or between samples of the same sample group; NOT for DE analysis |
| RPKM/FPKM |	similar to TPM |	sequencing depth and gene length	| compare counts between genes within a sample; NOT for between sample comparisons or DE analysis |
| DESeq2 MRN |	counts divided by sample-specific size factors determined by median ratio of gene counts relative to geometric mean per gene |	sequencing depth and RNA composition | compare counts between samples and for DE analysis; NOT for within sample comparisons |
| edgeR TMM |	uses a weighted trimmed mean of the log expression ratios between samples |	sequencing depth, RNA composition, and gene length |	compare counts between and within samples and for DE analysis |

---
name: norm-5

## Normalisation

**Recommendations**

- Most tools use a mix of many different normalisations
- For DGE using DGE R packages (DESeq2, edgeR, Limma etc), use **raw counts**
- For visualisation (PCA, clustering, heatmaps etc), use VST or RLOG
- For own analysis with gene length correction, use TPM (maybe geTMM?)
- Custom solutions: spike-ins/house-keeping genes

--

.citation[
.cite[<i class="fas fa-link"></i> Dillies, Marie-Agnes, *et al*. "A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis." [Briefings in bioinformatics 14.6 (2013): 671-683](https://www.ncbi.nlm.nih.gov/pubmed/22988256)]
]

---

# Acknowledgements

- [Normalising RNA-seq data](https://www.ebi.ac.uk/sites/ebi.ac.uk/files/content.ebi.ac.uk/materials/2012/121029_HTS/ernest_turro_normalising_rna-seq_data.pdf) by Ernest Turro  
- RNA-seq analysis [Bioconductor vignette](http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html)

---
name: end_slide
class: end-slide, middle
count: false

# Thank you. Questions?

```{r,echo=FALSE,child="assets/footer-slide.Rmd"}
```

```{r,include=FALSE,eval=FALSE}
# manually run this to render this document to HTML
#rmarkdown::render("slide_preprocessing.Rmd","xaringan::moon_reader")
# manually run this to convert HTML to PDF
#pagedown::chrome_print("presentation_dge.html",output="presentation_dge.pdf")
```