forked from hemberg-lab/scRNA.seq.course
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path20-de-intro.Rmd
225 lines (188 loc) · 7.98 KB
/
20-de-intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
---
knit: bookdown::preview_chapter
---
# Differential Expression (DE) analysis
```{r, echo=FALSE}
library(knitr)
opts_chunk$set(fig.align = "center")
```
## Bulk RNA-seq
One of the most common types of analyses when analyzing bulk RNA-seq
data is to identify differentially expressed genes. By comparing the
genes that change between two conditions, e.g. mutant and wild-type or
stimulated and unstimulated, it is possible to characterize the
molecular mechanisms underlying the change.
Several different methods,
e.g. [DESeq2](https://bioconductor.org/packages/DESeq2) and
[edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html),
have been developed for bulk RNA-seq. Moreover, there are also
extensive
[datasets](http://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-9-r95)
available where the RNA-seq data has been validated using
RT-qPCR. These data can be used to benchmark DE finding algorithms.
## Single cell RNA-seq
In contrast to the bulk RNA-seq in scRNA-seq we usually do not have a defined
set of experimental conditions, but instead, as was shown in the previous chapter
(\@ref(clust-methods)) we can identify the cell groups by using the unsupervised
clustering approach. Once the groups have been identified one can find differentially
expressed genes by either looking at the differencies in variance between the groups (like the Kruskal-Wallis test implemented in SC3), or by comparing gene expression between clusters in a pairwise manner. In the following chapter we will mainly consider tools developed for the comparison of the two groups of cells.
## scRNA-seq synthetic data
One advantage of working with single-cell data is
that there is a reliable, analytically tractable mathematical model for
the expression levels, the Poisson-Beta distribution. Importantly, the
Poisson-Beta distribution has strong experimental support, and it
provides a good fit to scRNA-seq data.
In this module, we first discuss the Poisson-Beta distribution. We
then use the model to generate synthetic data which we can use to
compare different DE finding algorithms. In the final section we
investigate a real scRNA-seq dataset.
## Single gene expression
```{r}
library(scRNA.seq.funcs)
set.seed(1)
```
For single-cell data, the analytically tractable stochastic bursting
model provides a good fit. Thus, we can use it to generate realistic
data. We start by generating samples from one gene:
```{r}
s <- scRNA.seq.funcs::PoiBeta(100, 2, 3)
```
We then plot the results as a histogram:
```{r poisson-beta-plot, fig.cap = "Distribution of read counts for a single genes across 100 cells based on the Poisson-Beta model"}
hist(s,
freq = FALSE,
xlab = "# transcripts",
main = "Transcript distribution")
```
The probability mass function of the Poisson-Beta distribution is
challenging to work with since it involves special functions. However,
the mean and variance can be calculated as:
$\mu = k\cdot a/(a+b)$
$\sigma^2 = k^2\cdot a\cdot b/((a+b+1)\cdot(a+b)^2)$
## Poisson-Beta distribution
There are three regimes of the Poisson-Beta distribution and they are
determined by the values of the parameters $a$ and $b$.
When $a<1$ and $b<1$ we have a bimodal distribution with one mode at 0 and
the other at $k$, when $a<1$ and $b>1$ we have a monotonically decreasing
distribution and otherwise we have a unimodal distribution with a mode
at $ka/(a+b)$.
```{r poisson-beta-shapes, fig.height=12, fig.cap = "Different distributions of read counts for a single genes across 100 cells based on the Poisson-Beta model corresponding to different paramete sets"}
par(mfrow=c(3,1))
hist(scRNA.seq.funcs::PoiBeta(100, .2, .3),
freq = FALSE,
xlab = "# transcripts",
col = rgb(0, 0, 1, 1/4),
breaks = seq(0, 120, 10))
hist(scRNA.seq.funcs::PoiBeta(100, .2, 3),
freq = FALSE,
xlab = "# transcripts",
col = rgb(0,1,0,1/4),
breaks = seq(0, 120, 10))
hist(scRNA.seq.funcs::PoiBeta(100, 2, .3),
freq = FALSE,
xlab = "# transcripts",
col = rgb(1, 0, 0, 1/4),
breaks = seq(0, 120, 10))
```
__Exercise 1__: Vary the parameters _a_, _b_ and _k_ to explore how the
location and shape of the distribution changes.
## Sample size
The difficulty in determining the parameters of the distribution also
depends on the sample size. In the example below, it is not clear if
the distribution is bimodal when only 10 samples are drawn.
```{r poisson-beta-sample-size, fig.cap = "Effect of sampling size on the distribution of read counts based on the Poisson-Beta model"}
hist(scRNA.seq.funcs::PoiBeta(10, .6, 1.2, 10),
freq = FALSE,
xlab = "# transcripts",
col = rgb(1, 0, 0, 1/4),
breaks = seq(0, 20, 1))
hist(scRNA.seq.funcs::PoiBeta(10, .6, 1.2, 50),
freq = FALSE,
xlab = "# transcripts",
col = rgb(0, 1, 0, 1/4),
breaks = seq(0, 20, 1),
add = TRUE)
```
__Exercise 2__: Modify the number of samples (i.e. cells) drawn to
explore how difficult it is to correctly infer the correct shape.
## Dropout noise
The stochastic bursting model only captures the biological
variability. In practice there will also be experimental
variability. We model the noise as drop-outs, i.e. we assume that
there is a small probality that each transcript is lost. For gene $i$
and cell $j$ it is assumed that the probability of loosing the
transcript is given by $p_d = \mu/(d+\mu)$, where $\mu$ is
the mean expression level of the gene and $d$ is a drop-out
parameter. Thus, the probability of drop-outs monotonically decreases
as the mean expression level increases.
To visualize the impact of the drop-outs on a sample, we can tune the
drop-out parameter:
```{r poisson-beta-dropout-example, fig.cap = "Effect of dropouts on the distribution of read counts based on the Poisson-Beta model", fig.height=12}
par(mfrow=c(3,1))
hist(scRNA.seq.funcs::PoiBeta(100, 2, 3, 100, 1),
freq = FALSE,
xlab = "# transcripts",
col = rgb(1, 0, 0, 1/4),
ylim = c(0, .08),
xlim = c(0, 120))
hist(scRNA.seq.funcs::PoiBeta(100, 2, 3, 100, 10),
freq = FALSE,
xlab = "# transcripts",
col = rgb(0, 1, 0, 1/4),
ylim = c(0, .08),
xlim = c(0, 120))
hist(scRNA.seq.funcs::PoiBeta(100, 2, 3, 100, 100),
freq = FALSE,
xlab = "# transcripts",
col = rgb(0, 0, 1, 1/4),
ylim = c(0, .08),
xlim = c(0, 120))
```
__Exercise 3__: Explore the different parameter regimes for the same
drop-out rate. Do you think that we are more sensitive to drop-outs in
any specific regime?
## Dispersion noise
Another example of noise is under- or over-dispersion. This can be
modelled using a single parameter multiplying the parameters $a$ and $b$
in the Poisson-Beta distribution by a scalar $s$. We can see that the
mean is left unchanged while the variance is inversely proportional to $s$.
```{r poisson-beta-dispersion-analytical, fig.cap = "The mean and the variance of the distribution of read counts based on the Poisson-Beta model"}
s <- 10^(-3:3)
k <- 100
a <- 2
b <- 3
par(mfrow=c(1,1))
plot(s,
k*a*s/(a*s + b*s),
log = "x",
col = "red",
ylim = c(0, 50),
ylab = "Moments")
points(s,
k*a*s*b*s/((a*s + b*s)^2*(a*s + b*s + 1)),
col = "blue")
```
To illustrate the effect of the dispersion parameter on the distribution consider:
```{r poisson-beta-dispersion-effect, fig.cap = "Effect of dispersion on the distribution of read counts based on the Poisson-Beta model", fig.height=12}
par(mfrow=c(3,1))
hist(scRNA.seq.funcs::PoiBeta(100, 2, 3, 100),
freq = FALSE,
xlab = "# transcripts",
col = rgb(1, 0, 0, 1/4),
ylim = c(0, .05),
xlim = c(0, 120))
hist(scRNA.seq.funcs::PoiBeta(100, 2/10, 3/10, 100),
freq = FALSE,
xlab = "# transcripts",
col = rgb(0, 1, 0, 1/4),
ylim = c(0, .05),
xlim = c(0, 120))
hist(scRNA.seq.funcs::PoiBeta(100, 2*10, 3*10, 100),
freq = FALSE,
xlab = "# transcripts",
col = rgb(0, 0, 1, 1/4),
ylim = c(0, .05),
xlim = c(0, 120))
```
__Exercise 4__: Explore what happens when you have both drop-outs and
under/over-dispersion. Can the effects be deconvoluted?