Module : Clustering with Seurat

This module clusters cells with Seurat, using a selectable graph-based clustering algorithm (Louvain, Louvain.multilevel or SLM).

Internal name : clustering-seurat
Available : local mode
Input Ports :
- matrix : filtered expression matrix (tsv)
- cells : normalized cells metadata (tsv)
- genes : genes metadata (tsv)
Output Ports :
- genesoutput : genes metadata (tsv) (completed with Boolean values indicating whether each gene was used)
- cellsoutput : normalized cells metadata (tsv) (completed with clusters)
Optional parameters :

Parameters for data correction

Parameter	Type	Description	Default Value
normalize	boolean	whether to normalize data or not	True
scale	boolean	whether to scale data or not	True

Parameters for gene filtering on expression

Parameter	Type	Description	Default Value
detection_threshold	int	Minimum number of counts to consider a gene as detected	10
ncells	float	Number of detections needed to keep a gene	0
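
The expression filter above can be sketched as follows. This is a minimal illustration, not the module's code: the function name and the input layout (a dict of per-gene count lists) are hypothetical, and it assumes that a gene counts as detected in a cell when its count is greater than or equal to detection_threshold, and that ncells is a number of cells.

```python
def filter_genes_by_detection(counts, detection_threshold=10, ncells=0):
    """Return the genes detected in at least `ncells` cells.

    counts: dict mapping a gene name to its list of raw counts (one per cell).
    Assumption: "detected" means count >= detection_threshold.
    """
    kept = []
    for gene, values in counts.items():
        n_detected = sum(1 for v in values if v >= detection_threshold)
        if n_detected >= ncells:
            kept.append(gene)
    return kept


counts = {
    "GeneA": [0, 12, 30, 15],
    "GeneB": [3, 9, 0, 11],
    "GeneC": [0, 0, 2, 1],
}

# With the default ncells=0, no gene is removed.
kept_default = filter_genes_by_detection(counts)
# Requiring 2 detections keeps only genes reaching 10 counts in 2 cells or more.
kept_strict = filter_genes_by_detection(counts, detection_threshold=10, ncells=2)
```

With the default value ncells = 0, the filter is effectively disabled, which matches the defaults listed in the table.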

Parameters for gene filtering on variability

Parameter	Type	Description	Default Value
hvg_method	text	Method for Highly Variable Genes detection : must be either scran, seurat or none (no filtering)	none
low_mean	float	Minimum mean log-transformed expression value to keep a gene	0.01
high_mean	float	Maximum mean log-transformed expression value to keep a gene	5
var	float	Minimum variability (as Z-score for Seurat or biological component for scran) to keep a gene	1
use_spike	boolean	whether to use spike-ins or not (only used with the scran algorithm)	True

Parameters for Jackstraw randomizations

Parameter	Type	Description	Default Value
nreplicate	int	Number of jackstraw repeats	100
proportion	float	Proportion of random features at each jackstraw iteration	0.1
significativity	float	Significance threshold for a principal component under jackstraw testing	0.05
score_threhsold	float	Significance threshold for a gene on one component under jackstraw testing	0.00001

Parameters for clustering

Parameter	Type	Description	Default Value
resolution	float	Resolution parameter for clustering (the lower it is the larger clusters are)	0.8
k	int	Number of nearest neighbors to consider (the higher it is the larger clusters are)	30
k_scale	int	Granularity option for k (steps for increasing the number of neighbors)	25
algorithm	text	Which clustering algorithm to use : must be one of Louvain, Louvain.multilevel, or SLM (Smart Local Moving)	SLM
sparse	boolean	Whether to use a sparse matrix to store the graph (recommended for 10X data)	False

Configuration example

<step id="Clustering" skip="false">
	<module>clustering-seurat</module>
	<parameters>
		<parameter>
			<name>hvg_method</name>
			<value>none</value>
		</parameter>
	</parameters>
</step>

Interpreting output files

Mean-dispersion plot (Seurat)

Seurat estimates, by maximum likelihood, the mean of the expressed (non-zero) values of each gene and their standard deviation around that mean. Genes are then grouped according to their mean expression, and a z-score of the standard deviation is calculated for each gene within its group. The z-scores are then plotted against the means.

Mean-disp

Labels indicate kept genes. Here, a group of genes on the left clearly separates from the others. These are probably noisy genes and should be removed. Conversely, some high-mean genes with high dispersion were removed, although they potentially carry interesting information.
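
The binning and z-scoring step, followed by the low_mean, high_mean and var cutoffs, can be sketched as below. This is a simplified illustration under stated assumptions: the function name, the equal-width binning, the number of bins and the toy values are hypothetical, and Seurat's own estimates of mean and dispersion are not reproduced.

```python
from statistics import mean, stdev


def select_variable_genes(stats, n_bins=2, low_mean=0.01, high_mean=5.0, var=1.0):
    """stats: dict mapping a gene to a (mean expression, dispersion) pair.

    Returns (z, kept): the per-gene z-score of dispersion computed within
    equal-width bins of mean expression, and the genes passing the cutoffs.
    """
    means = [m for m, _ in stats.values()]
    lo, hi = min(means), max(means)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width if all means are equal

    # Group genes by bin of mean expression.
    bins = {}
    for gene, (m, _) in stats.items():
        idx = min(int((m - lo) / width), n_bins - 1)
        bins.setdefault(idx, []).append(gene)

    # Z-score the dispersion inside each bin.
    z = {}
    for genes in bins.values():
        disp = [stats[g][1] for g in genes]
        mu = mean(disp)
        sd = stdev(disp) if len(disp) > 1 else 0.0
        for g in genes:
            z[g] = (stats[g][1] - mu) / sd if sd > 0 else 0.0

    kept = [g for g, (m, _) in stats.items()
            if low_mean <= m <= high_mean and z[g] >= var]
    return z, kept


stats = {
    "g1": (0.5, 1.0), "g2": (0.6, 1.0), "g3": (0.7, 4.0), "g4": (0.8, 1.0),
    "g5": (3.0, 2.0), "g6": (3.2, 2.0), "g7": (3.4, 2.0), "g8": (3.6, 5.0),
}
z, kept = select_variable_genes(stats)
```

In this toy example, only g3 and g8 stand out from the other genes of their bin, so they are the only genes kept.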

Mean-dispersion plot (Scran)

Scran fits a nonlinear model across the mean and variance of genes using local regression (LOESS, LOcal regrESSion). If spike-ins are given, this fit reflects the technical variability, and you should keep genes clearly above what would be expected from technical variation (because they hold much of the biological information). If no spike-ins are given, the model will assume that most of the variance is technical. This method should therefore preferably be used when spike-ins are available.

MeanVar

PCs significance plots

Based on conserved genes (annotated as Boolean values in the genes metadata file), Seurat computes a PCA on the data matrix. It then uses jackstraw randomization to estimate which components are important. Briefly, at each jackstraw iteration, a fraction of the genes is randomized. After many repetitions of the process, a null distribution of each gene's weight on the PCA components is obtained. Genes showing weights significantly higher than expected at random may be informative genes. Moreover, components correlated with these genes are strongly descriptive of the data. These components are assumed to reflect latent variables (cell type, differentiation, etc.) and must therefore be conserved. Other components are considered noise.

To discriminate between informative and non-informative components, Seurat tests whether more strongly significant genes (as defined by the significance threshold for genes) are found than would be expected under a uniform distribution of the p-values.

PCs1

The plot above shows the QQ-plot of the expected p-values against the observed p-values. Significant PCs should show a nearly vertical curve. The module always keeps at least 2 PCs, whether or not they are both informative.
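
The idea of testing a component for an excess of significant genes can be sketched as below. This is only an illustration: it uses an exact binomial tail as a stand-in, and the test actually applied by Seurat may differ. The function name and the toy p-values are hypothetical, and the gene threshold is loosened to 0.05 because the example has only 10 genes (the module default for score_threhsold is 0.00001).

```python
from math import comb


def pc_enrichment_pvalue(gene_pvalues, score_threshold=1e-5):
    """Count genes with p <= score_threshold on one component and return
    (count, P(X >= count)) for X ~ Binomial(n, score_threshold).

    Under uniform p-values, each gene falls below the threshold with
    probability score_threshold, hence the binomial null model.
    """
    n = len(gene_pvalues)
    k = sum(1 for p in gene_pvalues if p <= score_threshold)
    tail = sum(
        comb(n, i) * score_threshold ** i * (1 - score_threshold) ** (n - i)
        for i in range(k, n + 1)
    )
    return k, min(tail, 1.0)


# Hypothetical jackstraw p-values for the genes on two components.
informative_pc = [0.001] * 6 + [0.3, 0.5, 0.7, 0.9]
noisy_pc = [0.03, 0.2, 0.4, 0.55, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95]

k_inf, p_inf = pc_enrichment_pvalue(informative_pc, score_threshold=0.05)
k_noise, p_noise = pc_enrichment_pvalue(noisy_pc, score_threshold=0.05)

# A component is kept when its p-value is below the component threshold
# (0.05 by default for the "significativity" parameter).
keep_informative = p_inf < 0.05
keep_noisy = p_noise < 0.05
```

The first component has six genes below the gene threshold where fewer than one would be expected, so it is kept. The second has a single such gene, which is compatible with uniform p-values.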

PCs Heatmap

The module then draws a heatmap of the genes most correlated with each conserved PC, using 200 cells (or all cells if you have fewer than 200).

PCs2

Here the heatmap clearly shows a spurious selection of PCs: PC1 seems noisy and driven by only a few extreme cells, and PC2 shows no informative gene.

Clustering

First, Seurat builds a K-Nearest Neighbor graph and adjusts the edge weights using the Jaccard distance. Let E1 and E2 be the sets of nearest neighbors of cell 1 and cell 2, respectively. Then the Jaccard distance J between these two cells is J = 1 - |E1 ∩ E2|/|E1 ∪ E2|. So, if E1 = E2, J = 0, and if E1 and E2 have no common element, J = 1. Thus 1 - J represents the proportion of shared neighbors between two cells.
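
As a minimal sketch of this distance, computed as one minus the size of the intersection divided by the size of the union of the two neighbor sets (the function name and the cell identifiers are hypothetical):

```python
def jaccard_distance(neighbors_a, neighbors_b):
    """Jaccard distance between two sets of nearest neighbors."""
    a, b = set(neighbors_a), set(neighbors_b)
    union = a | b
    if not union:
        return 0.0  # convention for two empty neighbor sets
    return 1.0 - len(a & b) / len(union)


e1 = {"c2", "c3", "c4"}
e2 = {"c3", "c4", "c5"}

d_same = jaccard_distance(e1, e1)                # identical neighborhoods
d_disjoint = jaccard_distance(e1, {"c7", "c8"})  # no shared neighbor
d_partial = jaccard_distance(e1, e2)             # 2 shared out of 4 distinct
```

Here d_partial equals 0.5, because the two cells share 2 of the 4 distinct neighbors.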

Secondly, the graph is partitioned so as to increase modularity. Modularity is a value between -1 and 1. A pair of cells in distinct clusters contributes 0 to modularity. When two cells share a cluster, modularity increases with the weight between those cells, and decreases when the weights between those cells and other cells increase. The algorithm starts with each cell in its own cluster. It then groups cells in order to increase modularity. Subsequently, each group of cells is considered as a single node, and weights are recalculated. The first phase is then repeated on the new graph, until no further increase in modularity can be achieved. SLM, Louvain and Louvain.multilevel are variations of this algorithm.
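
To make the quantity being optimized concrete, the sketch below computes the standard (Newman) modularity of a given partition on a small weighted graph. It only evaluates a partition and does not implement the Louvain or SLM optimization itself. The function name and the toy graph are hypothetical.

```python
def modularity(edges, clusters):
    """Modularity of a partition of an undirected weighted graph.

    edges: list of (u, v, weight) with each edge listed once.
    clusters: dict mapping each node to its cluster label.
    Uses Q = sum over clusters of (internal weight / m) - (cluster degree / 2m)^2,
    where m is the total edge weight.
    """
    total = sum(w for _, _, w in edges)
    internal = {}  # weight of edges inside each cluster
    degree = {}    # sum of weighted degrees of the nodes of each cluster
    for u, v, w in edges:
        cu, cv = clusters[u], clusters[v]
        degree[cu] = degree.get(cu, 0.0) + w
        degree[cv] = degree.get(cv, 0.0) + w
        if cu == cv:
            internal[cu] = internal.get(cu, 0.0) + w
    return sum(
        internal.get(c, 0.0) / total - (degree[c] / (2 * total)) ** 2
        for c in degree
    )


# Two triangles (0, 1, 2) and (3, 4, 5) joined by a single edge 2-3.
edges = [(0, 1, 1), (0, 2, 1), (1, 2, 1),
         (3, 4, 1), (3, 5, 1), (4, 5, 1),
         (2, 3, 1)]

q_two_groups = modularity(edges, {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"})
q_one_group = modularity(edges, {n: "A" for n in range(6)})
q_singletons = modularity(edges, {n: n for n in range(6)})
```

Grouping each triangle into its own cluster gives the highest modularity of the three partitions, while leaving every cell in its own cluster (the starting point of the algorithm) gives a negative value.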

The module plots a t-SNE projection of the cells, coloring them by cluster.

TSNE

Here, poorly separated clusters are shown. The quality of the clustering can be assessed using the silhouette plot of the data. A silhouette value is calculated for each cell.

Let "in" be the average distance from one cell to all other cells of the same cluster, and "next" be the smallest average distance from that cell to all cells of another cluster. Then the silhouette value "sil" is defined as sil = (next - in) / max(in, next). Thus, for each cell, the silhouette value ranges from -1 to 1. Negative silhouette values indicate badly clustered cells, and positive values indicate well clustered cells. A good clustering should therefore show many positive values close to 1.

sil

Here we clearly see a poor clustering, as one group shows only negative silhouette values.
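
The silhouette definition above can be sketched as follows. This is a minimal illustration that assumes Euclidean distances between cell coordinates. The function name and the toy coordinates are hypothetical, and the distance actually used by the module is not specified here.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_values(points, labels):
    """Return one silhouette value per point: (next - in) / max(in, next)."""
    result = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, grouped by cluster.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            result.append(0.0)  # convention: singleton or single cluster
            continue
        d_in = sum(own) / len(own)
        d_next = min(sum(d) / len(d) for d in by_cluster.values())
        denom = max(d_in, d_next)
        result.append((d_next - d_in) / denom if denom > 0 else 0.0)
    return result


points = [(0, 0), (0, 1), (10, 0), (10, 1)]

# Clusters matching the two visible groups give values close to 1.
good = silhouette_values(points, [0, 0, 1, 1])
# Clusters cutting across the groups give negative values.
bad = silhouette_values(points, [0, 1, 0, 1])
```

A cluster showing only negative values, as in the plot above, corresponds to the second case: its cells are on average closer to another cluster than to their own.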
