-
Notifications
You must be signed in to change notification settings - Fork 4
/
scrna-cell-types-2020-09.Rmd
345 lines (240 loc) · 9.32 KB
/
scrna-cell-types-2020-09.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
---
title: "Cell Type Annotation"
output:
html_notebook:
theme: readable
toc: yes
toc_float: yes
code_folding: none
---
## Introduction
This is a brief tutorial on automatic cell type annotation of single-cell RNA sequencing (scRNA-seq) data. The primary dataset used here contains hematopoietic and stromal bone marrow populations ([Baccin et al.](https://doi.org/10.1038/s41556-019-0439-6)). The version used here is a subset of the original dataset to have more similar population sizes and speed up processing.
## Load data
This tutorial includes some Bioconductor dependencies. Before proceeding, confirm that Bioconductor is installed and its version is at least 3.10.
```{r bioc-version}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::version()
```
If this is not the case, remove all versions of BiocVersion with `remove.packages("BiocVersion")`. Then update Bioconductor packages using `BiocManager::install()`.
Since we are using a Seurat object, load Seurat and related packages.
```{r load-libraries, message=FALSE, warning=FALSE}
library(Seurat)
library(ggplot2)
library(cowplot)
library(ggsci)
library(dplyr)
library(stringr)
```
Load the dataset.
```{r load-seurat-object, message=FALSE, warning=FALSE}
so = readRDS(url("https://osf.io/cvnqb/download"))
so
```
Check the experiment labels overlaid onto the UMAP visualization.
```{r umap-exp}
DimPlot(so, reduction = "umap", group.by = "experiment", cells = sample(colnames(so))) +
scale_color_nejm()
```
Check the experiment labels overlaid onto the tSNE visualization, as shown in the original publication ([original figure](https://www.nature.com/articles/s41556-019-0439-6/figures/1)).
```{r tsne-exp}
DimPlot(so, reduction = "tsne", group.by = "experiment", cells = sample(colnames(so))) +
scale_color_nejm()
```
Check the cell type labels overlaid onto the tSNE visualization.
```{r tsne-celltype}
DimPlot(so, reduction = "tsne", group.by = "celltype", cells = sample(colnames(so))) +
scale_color_igv()
```
## Annotation using SingleR
Load SingleR (install with `BiocManager::install("SingleR")`).
```{r load-singler, message=FALSE, warning=FALSE}
library(SingleR)
```
SingleR expects the input as a matrix or a SummarizedExperiment object.
```{r exp-mat}
exp_mat = GetAssayData(so, assay = "RNA", slot = "data")
exp_mat = as.matrix(exp_mat)
dim(exp_mat)
```
### MouseRNAseqData cell types
The SingleR package provides normalized expression values and cell types labels based on bulk RNA-seq, microarray, and single-cell RNA-seq data from several different datasets. See the [SingleR vignette](https://bioconductor.org/packages/3.10/bioc/vignettes/SingleR/inst/doc/SingleR.html#5_available_references) for the description.
We will try the mouse dataset from 358 bulk RNA-seq samples of sorted cell populations as the reference. It provides normalized expression values for samples that have been assigned to one of 18 main cell types and 28 subtypes.
```{r mousernaseq-se-load, include=FALSE}
singler_se = MouseRNAseqData()
```
It is possible that this fails with a `No internet connection using 'localHub=TRUE'` error. This may be resolved by running `ExperimentHub::setExperimentHubOption("PROXY", "http://127.0.0.1:10801")`.
```{r mousernaseq-se-show}
singler_se
```
Restrict to common genes between the test and reference datasets.
```{r mousernaseq-common-genes}
common_genes = intersect(rownames(exp_mat), rownames(singler_se))
common_genes = sort(common_genes)
exp_common_mat = exp_mat[common_genes, ]
singler_se = singler_se[common_genes, ]
length(common_genes)
```
Perform SingleR annotation.
```{r mousernaseq-singler}
singler_pred = SingleR(
test = exp_common_mat,
ref = singler_se,
labels = singler_se$label.main
)
```
Each row of the output data frame contains prediction results for a single cell.
```{r mousernaseq-df}
head(as.data.frame(singler_pred))
```
Add the assigned labels to the Seurat object.
```{r mousernaseq-addmetadata}
so_labeled = AddMetaData(so, as.data.frame(singler_pred))
```
Check the assigned labels to the original labels.
```{r mousernaseq-table}
[email protected] %>% select(labels) %>% table(useNA = "ifany")
```
SingleR also attempts to remove low quality or ambiguous assignments. Ambiguous assignments are based on the difference between the score for the assigned label and the median across all labels for each cell. Tuning parameters can be adjusted with `pruneScores()`. We can check how many cells are considered ambiguous and which populations they potentially belong to.
```{r mousernaseq-pruned-table}
[email protected] %>% select(pruned.labels) %>% table(useNA = "ifany")
```
Visualize MouseRNAseqData cell type labels.
```{r mousernaseq-tsne}
DimPlot(so_labeled, reduction = "tsne", group.by = "labels") +
scale_color_igv()
```
## Annotation using clustermole
SingleR is able to label cells, but it requires a reference dataset.
A more exploratory and unbiased approach is possible with [clustermole](https://cran.r-project.org/package=clustermole), an R package that provides a collection of cell type markers for thousands of human and mouse cell populations sourced from a variety of databases as well as methods to query them.
Load clustermole.
```{r load-clustermole, message=FALSE, warning=FALSE}
library(clustermole)
```
### Available cell types
We can retrieve a list of all markers in the clustermole database to see all the available options. Each row in the returned data frame is a combination of each cell type and its genes.
```{r clustermole-markers-table}
markers_tbl = clustermole_markers()
head(markers_tbl)
```
Check the available cell types (ignore gene columns).
```{r clustermole-markers-summary}
markers_tbl %>% distinct(celltype, organ, db)
```
### Marker gene overlaps
Calculate the average expression levels for the clusters for enrichment analysis using Seurat's `AverageExpression` function, convert to a matrix, and log-transform.
```{r avg-exp}
Idents(so) = "celltype"
avg_exp_mat = AverageExpression(so, assays = "RNA", slot = "data")
avg_exp_mat = as.matrix(avg_exp_mat$RNA)
avg_exp_mat = log1p(avg_exp_mat)
```
Check the average expression matrix.
```{r avg-exp-head}
avg_exp_mat[1:5, 1:5]
```
Check the cell type names.
```{r}
levels(Idents(so))
```
Find markers for the B-cell cluster.
```{r}
b_genes = rownames(avg_exp_mat[avg_exp_mat[, "B-cell"] == rowMaxs(avg_exp_mat), ])
length(b_genes)
```
```{r find-markers-b}
b_markers_df = FindMarkers(so, ident.1 = "B-cell", features = b_genes, verbose = FALSE)
nrow(b_markers_df)
```
Check the markers table.
```{r find-markers-b-head}
head(b_markers_df)
```
With the default cutoffs, this gives us a data frame with hundreds of genes. Let's subset to just the top 20 genes.
```{r markers-top-genes}
b_markers = rownames(b_markers_df)
b_markers = head(b_markers, 20)
b_markers
```
Check overlap of our B-cell markers with all cell type signatures.
```{r clustermole-overlaps}
overlaps_tbl = clustermole_overlaps(genes = b_markers, species = "mm")
```
Check the top scoring cell types for the B-cell cluster.
```{r clustermole-overlaps-result}
head(overlaps_tbl, 10)
```
Find markers for the Adipo-CAR cluster.
```{r}
acar_genes = rownames(avg_exp_mat[avg_exp_mat[, "Adipo-CAR"] == rowMaxs(avg_exp_mat), ])
length(acar_genes)
```
```{r}
acar_markers_df = FindMarkers(so, ident.1 = "Adipo-CAR", features = acar_genes, verbose = FALSE)
nrow(acar_markers_df)
```
Check the markers table.
```{r}
head(acar_markers_df)
```
With the default cutoffs, this gives us a data frame with hundreds of genes. Let's subset to just the top 20 genes.
```{r}
acar_markers = rownames(acar_markers_df)
acar_markers = head(acar_markers, 20)
acar_markers
```
Check overlap of our Adipo-CAR markers with all cell type signatures.
```{r}
overlaps_tbl = clustermole_overlaps(genes = acar_markers, species = "mm")
```
Check the top scoring cell types for the Adipo-CAR cluster.
```{r}
head(overlaps_tbl, 10)
```
Find markers for the Osteoblasts cluster.
```{r}
o_genes = rownames(avg_exp_mat[avg_exp_mat[, "Osteoblasts"] == rowMaxs(avg_exp_mat), ])
length(o_genes)
```
```{r}
o_markers_df = FindMarkers(so, ident.1 = "Osteoblasts", features = o_genes, verbose = FALSE)
nrow(o_markers_df)
```
Check the markers table.
```{r}
head(o_markers_df)
```
With the default cutoffs, this gives us a data frame with hundreds of genes. Let's subset to just the top 20 genes.
```{r}
o_markers = rownames(o_markers_df)
o_markers = head(o_markers, 20)
o_markers
```
Check overlap of our Osteoblasts markers with all cell type signatures.
```{r}
overlaps_tbl = clustermole_overlaps(genes = o_markers, species = "mm")
```
Check the top scoring cell types for the Osteoblasts cluster.
```{r}
head(overlaps_tbl, 10)
```
### Enrichment of markers
Run enrichment of all cell type signatures across all clusters.
```{r}
enrich_tbl = clustermole_enrichment(expr_mat = avg_exp_mat, species = "mm")
```
Most enriched cell types for the B-cell cluster.
```{r}
enrich_tbl %>% filter(cluster == "B-cell") %>% select(-cluster) %>% head(10)
```
Most enriched cell types for the Adipo-CAR cluster.
```{r}
enrich_tbl %>% filter(cluster == "Adipo-CAR") %>% select(-cluster) %>% head(10)
```
Most enriched cell types for the Osteoblasts cluster.
```{r}
enrich_tbl %>% filter(cluster == "Osteoblasts") %>% select(-cluster) %>% head(10)
```
---
[previous tutorials](https://igordot.github.io/tutorials/)