forked from hemberg-lab/scRNA.seq.course
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path23-ideal-scrnaseq-pipeline.Rmd
79 lines (60 loc) · 3.58 KB
/
23-ideal-scrnaseq-pipeline.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
---
output: html_document
---
# "Ideal" scRNAseq pipeline (as of Oct 2016)
```{r, echo=FALSE}
library(knitr)
opts_chunk$set(fig.align = "center", echo=FALSE)
```
## Experimental Design
* Avoid confounding biological and batch effects (Figure \@ref(fig:pipeline-batches))
* Multiple conditions should be captured on the same chip if possible
* Perform multiple replicates of each condition where replicates of different conditions should be performed together if possible
* Statistics cannot correct a completely confounded experiment!
* Unique molecular identifiers
* Greatly reduce noise in data
* May reduce gene detection rates (unclear if it is UMIs or other protocol differences)
* Use longer UMIs (~10bp)
* Correct for sequencing errors in UMIs using [UMI-tools](https://github.com/CGATOxford/UMI-tools)
* Spike-ins
* Useful for quality control
* Can be used to approximate cell-size/RNA content (if relevant to biological question)
* Often exhibit higher noise than endogenous genes (pipetting errors, mixture quality)
* Requires more sequencing to get enough endogenous reads per cell
* Cell number vs Read depth
* Gene detection plateaus starting from 1 million reads per cell
* Transcription factor detection (regulatory networks) require high read depth and most sensitive protocols (i.e. Fluidigm C1)
* Cell clustering & cell-type identification benefits from large number of cells and doesn't requireas high sequencing depth (~100,000 reads per cell).
```{r pipeline-batches, out.width = '90%', fig.cap="Appropriate approaches to batch effects in scRNASeq. Red arrows indicate batch effects which are (pale) or are not (vibrant) correctable through batch-correction."}
knitr::include_graphics("figures/Pipeline-batches.png")
```
## Processing Reads
* Read QC & Trimming
* [FASTQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/), [cutadapt](http://cutadapt.readthedocs.io/en/stable/index.html)
* Mapping
* Small datasets or UMI datasets: align to genome/transcriptome using [STAR](https://github.com/alexdobin/STAR)
* Large datasets: pseudo-alignment with [Salmon](http://salmon.readthedocs.io/en/latest/salmon.html)
* Quantification
* Small dataset, no UMIs : [Cufflinks](http://cole-trapnell-lab.github.io/cufflinks/tools/) or [featureCounts](http://subread.sourceforge.net/)
* Large datasets, no UMIs: [Salmon](http://salmon.readthedocs.io/en/latest/salmon.html)
* UMI dataset : [UMI-tools'](https://github.com/CGATOxford/UMI-tools) + [featureCounts](http://subread.sourceforge.net/)
## Preparing Expression Matrix
* Cell QC
* [scater](http://bioconductor.org/packages/scater)
* consider: mtRNA, rRNA, spike-ins (if available), number of detected genes per cell, total reads/molecules per cell
* Library Size Normalization
* [scran](http://bioconductor.org/packages/scran)
* Batch correction (if appropriate)
* [RUVs](http://bioconductor.org/packages/RUVSeq)
## Biological Interpretation
* Feature Selection
* [M3Drop](http://bioconductor.org/packages/M3Drop)
* Clustering and Marker Gene Identification
* [SC3](http://bioconductor.org/packages/SC3)
* Pseudotime
* distinct timepoints: [TSCAN](http://bioconductor.org/packages/TSCAN)
* continuous data: [destiny](http://bioconductor.org/packages/destiny)
* Differential Expression
* Small number of cells and few groups : [scde](http://hms-dbmi.github.io/scde/)
* Replicates with batch effects : mixture/linear models
* Large datasets: Kruskal-Wallis test (all groups at once), or Kolmogorov-Smirnov (KS)-test (compare 2-groups at a time).