-
Notifications
You must be signed in to change notification settings - Fork 1
/
Single_cell_RNA_seq_practical_partB.Rmd
181 lines (131 loc) · 6.51 KB
/
Single_cell_RNA_seq_practical_partB.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
---
title: "Single Cell RNA-sequencing Practical - Part B"
output: github_document
---
Created by: Ahmed Mahfouz
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(fig.width=9, fig.height=6)
```
# Overview
In this tutorial we will look at different approaches to clustering scRNA-seq datasets in order to characterize the different subgroups of cells. Using unsupervised clustering, we will try to identify groups of cells based on the similarities of the transcriptomes without any prior knowledge of the labels.
Load required packages:
```{r packages}
suppressMessages(require(tidyverse))
suppressMessages(require(Seurat))
suppressMessages(require(cowplot))
suppressMessages(require(scater))
suppressMessages(require(scran))
suppressMessages(require(igraph))
```
## Datasets
In this tutorial, we will use a small dataset of cells from developing mouse embryo [Deng et al. 2014](https://science.sciencemag.org/content/343/6167/193). We have preprocessed the dataset and created a `SingleCellExperiment` object in advance. We have also annotated the cells with the cell types identified in the original publication (it is the `cell_type2` column in the `colData` slot).
```{r load}
#load expression matrix
deng <- readRDS("deng-reads.rds")
deng
#look at the cell type annotation
table(colData(deng)$cell_type2)
```
## Feature selection
The first step is to decide which genes to use in clustering the cells. Single cell RNASeq can profile a huge number of genes in a lot of cells. But most of the genes are not expressed enough to provide a meaningful signal and are often driven by technical noise. Including them could potentially add some unwanted signal that would blur the biological variation. Moreover gene filtering can also speed up the computational time for downstream analysis.
#### Filtering out low abundance genes
Low-abundance genes are mostly non informative and are not representative of the biological variance of the data. They are often driven by technical noise such as dropout event. However, their presence in downstream analysis leads often to a lower accuracy since they can interfere with some statistical model that will be used and also will pointlessly increase the computational time which can be critical when working with very large data.
```{r filt_low_abundance}
#Calculate gene mean across cell
gene_mean <- rowMeans(counts(deng))
#Calculate gene variance across cell
gene_var <- rowVars(counts(deng))
abundant_genes <- gene_mean >= 0.5 #Remove Low abundance genes
# plot low abundance gene filtering
hist(log10(gene_mean), breaks=100, main="", col="grey80",
xlab=expression(Log[10]~"average count"))
abline(v=log10(0.5), col="red", lwd=2, lty=2)
```
```{r remove_low_abund}
#remove low abundance gene in SingleCellExperiment Object
deng <- deng[abundant_genes,]
dim(deng)
```
#### Filtering genes that are expressed in very few cells
We can also filter some genes that are in a small number of cells. This procedure would remove some outlier genes that is highly expressed in one or two cells. These genes are unwanted for further analysis since they mostly comes from an iregular amplification of artifacts. It is important to note that we might not want to filter with this procedure when the aim of the analysis is to detect a very rare subpopulation in the data.
```{r remove_genes_in_few_cells}
#Calculate the number of non zero expression for each genes
numcells <- nexprs(deng, byrow=TRUE)
#Filter genes detected in less than 5 cells
numcells2 <- numcells >= 5
deng <- deng[numcells2,]
dim(deng)
```
#### Detecting Highly Variable Genes
```{r hvg_2}
fit <- trendVar(deng, parametric=TRUE, use.spikes = FALSE)
dec <- decomposeVar(deng, fit)
dec$HVG <- (dec$FDR<0.00001)
hvg_genes <- rownames(dec[dec$FDR < 0.00001, ])
# plot highly variable genes
plot(dec$mean, dec$total, pch=16, cex=0.6, xlab="Mean log-expression",
ylab="Variance of log-expression")
o <- order(dec$mean)
lines(dec$mean[o], dec$tech[o], col="dodgerblue", lwd=2)
points(dec$mean[dec$HVG], dec$total[dec$HVG], col="red", pch=16)
## save the decomposed variance table and hvg_genes into metadata for safekeeping
metadata(deng)$hvg_genes <- hvg_genes
metadata(deng)$dec_var <- dec
```
## Dimensionality reduction
The clustering problem is computationally difficult due to the high level of noise (both technical and biological) and the large number of dimensions (i.e. genes). We can solve these problems by applying dimensionality reduction methods (e.g. PCA, tSNE, and UMAP)
```{r pca}
#PCA (select the number of components to calculate)
deng <- runPCA(deng, method = "irlba",
ncomponents = 30,
feature_set = metadata(deng)$hvg_genes)
#Make a scree plot (percentage variance explained per PC) to determine the number of relevant components
X <- attributes(deng@reducedDims$PCA)
plot(X$percentVar~c(1:30), type="b", lwd=1, ylab="Percentage variance" , xlab="PCs" , bty="l" , pch=1)
```
Make a PCA plot (PC1 vs. PC2)
```{r pca_plot, warning=FALSE}
plotReducedDim(deng, "PCA", colour_by = "cell_type2")
```
Make a tSNE plot
*Note:* tSNE is a stochastic method. Everytime you run it you will get slightly different results. For convinience we can get the same results if we seet the seed.
```{r tSNE, warning=FALSE}
#tSNE
deng <- runTSNE(deng,
perplexity = 30,
feature_set = metadata(deng)$hvg_genes,
set.seed = 1)
plotReducedDim(deng, "TSNE", colour_by = "cell_type2")
```
## Clustering
### TSNE + Kmeans
```{r k-means, warning=FALSE}
# Do kmeans algorithm on TSNE coordinates
deng_kmeans <- kmeans(x = deng@reducedDims$TSNE,centers = 10)
TSNE_kmeans <- factor(deng_kmeans$cluster)
colData(deng)$TSNE_kmeans <- TSNE_kmeans
#Compare with ground truth
plot_grid(plotTSNE(deng, colour_by = "TSNE_kmeans"),
plotTSNE(deng, colour_by = "cell_type2"))
```
### Graph Based Clustering
```{r graph_clust, warning=FALSE}
#k=5
#The k parameter denfines the number of closest cells to look for each cells
SNNgraph_k5 <- buildSNNGraph(x = deng, k=5)
SNNcluster_k5 <- cluster_louvain(SNNgraph_k5)
colData(deng)$SNNcluster_k5 <- factor(SNNcluster_k5$membership)
p5<- plotPCA(deng, colour_by="SNNcluster_k5")+ guides(fill=guide_legend(ncol=2))
# k30
SNNgraph_k30 <- buildSNNGraph(x = deng, k=30)
SNNcluster_k30 <- cluster_louvain(SNNgraph_k30)
colData(deng)$SNNcluster_k30 <- factor(SNNcluster_k30$membership)
p30 <- plotPCA(deng, colour_by="SNNcluster_k30")
#plot the different clustering.
plot_grid(p5+ guides(fill=guide_legend(ncol=1)),p30)
```
### Session info
```{r}
sessionInfo()
```