Module : Clustering with Seurat

Jaze8 edited this page Jun 7, 2018 · 8 revisions

This module clusters cells using a given algorithm.

  • Internal name : clustering-seurat

  • Available : local mode

  • Input Ports :

    • matrix : filtered expression matrix (tsv)
    • cells : normalized cells metadata (tsv)
    • genes : genes metadata (tsv)
  • Output Ports :

    • genesoutput : genes metadata (tsv), completed with Boolean values indicating whether each gene was used
    • cellsoutput : normalized cells metadata (tsv) (completed with clusters)
  • Optional parameters :

    • Parameters for data correction
Parameter | Type | Description | Default Value
normalize | boolean | Whether to normalize the data | True
scale | boolean | Whether to scale the data | True
    • Parameters for gene filtering on expression
Parameter | Type | Description | Default Value
detection_threshold | int | Minimum number of counts to consider a gene as detected | 10
ncells | float | Number of cells in which a gene must be detected to be kept | 0
    • Parameters for gene filtering on variability
Parameter | Type | Description | Default Value
hvg_method | text | Method for highly variable gene (HVG) detection: must be one of scran, seurat or none (no filtering) | none
low_mean | float | Minimum mean log-transformed expression value to keep a gene | 0.01
high_mean | float | Maximum mean log-transformed expression value to keep a gene | 5
var | float | Minimum variability (z-score for seurat, biological component for scran) to keep a gene | 1
use_spike | boolean | Whether to use spike-ins (only used with the scran method) | True
    • Parameters for jackstraw randomizations
Parameter | Type | Description | Default Value
nreplicate | int | Number of jackstraw repeats | 100
proportion | float | Proportion of features randomized at each jackstraw iteration | 0.1
significativity | float | Significance threshold for a principal component under jackstraw testing | 0.05
score_threhsold | float | Significance threshold for a gene on one component under jackstraw testing | 0.00001
    • Parameters for clustering
Parameter | Type | Description | Default Value
resolution | float | Resolution parameter for clustering (the lower it is, the larger the clusters) | 0.8
k | int | Number of nearest neighbors to consider (the higher it is, the larger the clusters) | 30
k_scale | int | Granularity option for k (step size for increasing the number of neighbors) | 25
algorithm | text | Clustering algorithm to use: must be one of Louvain, Louvain.multilevel or SLM (Smart Local Moving) | SLM
sparse | boolean | Whether to store the graph as a sparse matrix (recommended for 10X data) | False
  • Configuration example
<step id="Clustering" skip="false">
	<module>clustering-seurat</module>
	<parameters>
		<parameter>
			<name>hvg_method</name>
			<value>none</value>
		</parameter>
	</parameters>
</step>
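
As an illustration of how detection_threshold and ncells might combine, here is a minimal Python sketch. It is not the module's code: the function name filter_genes_by_detection and the dictionary-based input are hypothetical, and the inclusive comparisons (>=) are an assumption.

```python
def filter_genes_by_detection(counts, detection_threshold=10, ncells=0):
    """Return the names of genes detected in at least `ncells` cells.

    counts: dict mapping gene name -> list of per-cell counts.
    A gene counts as detected in a cell when its count reaches
    `detection_threshold` (inclusive comparison is an assumption).
    """
    kept = []
    for gene, per_cell in counts.items():
        n_detected = sum(1 for c in per_cell if c >= detection_threshold)
        if n_detected >= ncells:
            kept.append(gene)
    return kept


counts = {
    "geneA": [0, 12, 30, 11],  # detected in 3 cells
    "geneB": [0, 2, 9, 10],    # detected in 1 cell
    "geneC": [0, 0, 1, 3],     # never detected
}
kept_strict = filter_genes_by_detection(counts, detection_threshold=10, ncells=2)
kept_default = filter_genes_by_detection(counts)
print(kept_strict, kept_default)
```

With the default ncells of 0, no gene is removed under this reading, whatever the detection threshold.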

Interpreting output files

Mean-dispersion plot (Seurat)

Seurat estimates, by maximum likelihood, the mean of the expressed (non-zero) values of each gene and the standard deviation around that mean. Genes are then grouped according to their mean expression, and each gene's standard deviation is converted to a z-score within its group. The z-scores are then plotted against the mean.

[Figure: Mean-disp]

Labels indicate the genes that were kept. Here, a group of genes on the left clearly separates from the others. These are probably noisy genes and should be removed. Conversely, some genes with high mean and high dispersion were removed even though they may carry interesting information.
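
The grouping and z-scoring step can be sketched in a few lines of Python. This is only an illustration, not Seurat's code: the function names dispersion_zscores and select_hvg and the n_bins argument are hypothetical, equal-width bins and inclusive bounds are assumptions, and low_mean, high_mean and var simply mirror the module parameters.

```python
import statistics


def dispersion_zscores(means, dispersions, n_bins=3):
    """Group genes by mean expression and z-score dispersion within each group."""
    lo, hi = min(means), max(means)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width when all means are equal
    # Assign each gene to an equal-width bin along the mean axis.
    bins = [min(int((m - lo) / width), n_bins - 1) for m in means]
    z = [0.0] * len(means)
    for b in set(bins):
        idx = [i for i, gb in enumerate(bins) if gb == b]
        vals = [dispersions[i] for i in idx]
        mu = statistics.mean(vals)
        sd = statistics.stdev(vals) if len(vals) > 1 else 0.0
        for i in idx:
            # Genes in a bin with no spread get a z-score of 0.
            z[i] = (dispersions[i] - mu) / sd if sd > 0 else 0.0
    return z


def select_hvg(means, z, low_mean=0.01, high_mean=5, var=1):
    """Indices of genes within the mean window whose z-score reaches `var`."""
    return [i for i, (m, s) in enumerate(zip(means, z))
            if low_mean <= m <= high_mean and s >= var]


means = [0.1, 0.2, 0.3, 2.0, 2.1, 2.2]
dispersions = [1.0, 1.0, 4.0, 2.0, 2.0, 2.0]
z = dispersion_zscores(means, dispersions, n_bins=2)
selected = select_hvg(means, z)
print(selected)
```

In this toy input only the third gene stands out from the genes of similar mean, so it is the only one selected.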

Mean-dispersion plot (Scran)

Scran fits a non-linear model of gene variance against gene mean using local regression (LOESS, LOcal regrESSion). If spike-ins are given, this fit reflects the technical variability, and you should keep the genes that lie clearly above what technical variation alone would predict, since they hold much of the biological information. If no spike-ins are given, the model assumes that most of the variance is technical. This method is therefore best used when spike-ins are available.

[Figure: MeanVar]

PCs significance plots

Based on the retained genes (annotated as Boolean values in the genes metadata file), Seurat computes a PCA on the data matrix. It then uses jackstraw randomization to estimate which components are important. Briefly, at each jackstraw iteration, a fraction of the genes is randomized. After many repetitions of the process, a null distribution of each gene's weight on the PCA components is obtained. Genes whose weights are significantly higher than expected at random may be informative genes. Moreover, components correlated with these genes are strongly descriptive of the data. These components are assumed to reflect latent variables (cell type, differentiation, etc.) and must therefore be retained. Other components are considered noisy.

To discriminate between informative and non-informative components, Seurat tests whether more strongly significant genes (as defined by the significance threshold) are found than would be expected under a uniform distribution of the p-values.

[Figure: PCs1]

The above plot shows the Q-Q plot of the expected p-values against the observed p-values. Significant PCs should show a nearly vertical curve. The module always keeps at least 2 PCs, whether or not they are both informative.
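
The idea of testing for an excess of significant genes can be illustrated with a small Python sketch. It is not Seurat's implementation: it assumes a binomial upper-tail test against uniform p-values, and the names empirical_pvalue, binomial_upper_tail, pc_is_informative, gene_threshold and pc_threshold are hypothetical (the last two stand in for the gene-level and component-level jackstraw thresholds).

```python
from math import comb


def empirical_pvalue(observed, null_values):
    """Share of null statistics at least as large as the observed one.

    The +1 correction (an assumption) keeps the p-value above zero, so its
    resolution is limited by the number of null values.
    """
    n_ge = sum(1 for v in null_values if v >= observed)
    return (n_ge + 1) / (len(null_values) + 1)


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


def pc_is_informative(gene_pvalues, gene_threshold=1e-5, pc_threshold=0.05):
    """True if more genes fall under gene_threshold than uniform p-values predict."""
    n = len(gene_pvalues)
    k = sum(1 for p in gene_pvalues if p <= gene_threshold)
    # Under uniform p-values, k follows Binomial(n, gene_threshold).
    return binomial_upper_tail(k, n, gene_threshold) < pc_threshold


p_example = empirical_pvalue(3.0, [1.0, 2.0, 3.0, 4.0])
enriched = pc_is_informative([1e-6] * 5 + [0.5] * 95)
flat = pc_is_informative([0.5] * 100)
print(p_example, enriched, flat)
```

A component with a handful of extremely small gene p-values is flagged as informative, while one with no gene under the threshold is not.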

PCs Heatmap

The module then computes a heatmap of the genes most correlated with each retained PC, using 200 cells (or all cells if you have fewer than 200).

[Figure: PCs2]

Here the heatmap clearly shows a spurious PC selection: PC1 seems noisy and driven by only a few extreme cells, and PC2 shows no informative gene.

Clustering

First, Seurat builds a K-nearest-neighbor graph and adjusts the edge weights using the Jaccard distance. Let E1 and E2 be the sets of nearest neighbors of cell 1 and cell 2, respectively. The Jaccard distance J between these two cells is J = 1 - |E1 ∩ E2| / |E1 ∪ E2|. So if E1 = E2, J = 0, and if E1 and E2 have no common element, J = 1. Thus 1 - J is the proportion of neighbors shared by the two cells.
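
As a quick check of the two boundary cases, here is a minimal Python sketch of the Jaccard distance between two neighbor sets, computed as 1 minus the size of the intersection over the size of the union. The function name jaccard_distance is hypothetical, and returning 0 for two empty sets is an assumption.

```python
def jaccard_distance(e1, e2):
    """Jaccard distance between two neighbor sets."""
    e1, e2 = set(e1), set(e2)
    union = e1 | e2
    if not union:
        return 0.0  # two empty sets: treated as identical (assumption)
    return 1.0 - len(e1 & e2) / len(union)


same = jaccard_distance({1, 2, 3}, {1, 2, 3})      # identical neighborhoods
disjoint = jaccard_distance({1, 2}, {3, 4})        # no shared neighbor
partial = jaccard_distance({1, 2, 3}, {2, 3, 4})   # 2 shared out of 4 distinct
print(same, disjoint, partial)
```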

Secondly, cells are grouped into clusters so as to increase modularity. Modularity is a value between -1 and 1. A pair of cells in distinct clusters contributes 0 to modularity. When two cells share a cluster, modularity increases with the weight of the edge between them, and decreases as the weights between those cells and other cells increase. The algorithm starts with each cell in its own cluster. It then groups cells in order to increase modularity. Subsequently, each group of cells is treated as a single node, and the weights are recalculated. The first phase is then repeated on the new graph, until no further increase in modularity can be achieved. SLM, Louvain and Louvain.multilevel are variations of this algorithm.
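
To make the quantity being optimized concrete, the following Python sketch computes the modularity of a given partition of a weighted, undirected graph using the standard Newman definition. It only evaluates a partition and does not implement SLM or Louvain; the function name modularity and the edge-dictionary input format are hypothetical.

```python
def modularity(weights, clusters):
    """Modularity of a partition of a weighted, undirected graph.

    weights: dict {(i, j): w} with each undirected edge listed once (i != j).
    clusters: dict mapping every node to its cluster label.
    """
    strength = {node: 0.0 for node in clusters}
    total = 0.0
    for (i, j), w in weights.items():
        strength[i] += w
        strength[j] += w
        total += w
    two_m = 2.0 * total
    if two_m == 0:
        return 0.0
    # Observed within-cluster weight (each edge counts in both directions).
    q = sum(2.0 * w for (i, j), w in weights.items() if clusters[i] == clusters[j])
    # Subtract the weight expected if edges were placed at random.
    by_cluster = {}
    for node, c in clusters.items():
        by_cluster[c] = by_cluster.get(c, 0.0) + strength[node]
    q -= sum(s * s / two_m for s in by_cluster.values())
    return q / two_m


# Two triangles joined by a single edge.
edges = {(0, 1): 1, (1, 2): 1, (0, 2): 1, (3, 4): 1, (4, 5): 1, (3, 5): 1, (2, 3): 1}
q_two = modularity(edges, {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"})
q_one = modularity(edges, {n: "a" for n in range(6)})
q_single = modularity(edges, {n: n for n in range(6)})
print(q_two, q_one, q_single)
```

Splitting the graph into its two triangles scores higher than either putting all cells in one cluster or leaving each cell alone, which is why the grouping phase moves toward that partition.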

The module plots a t-SNE projection of the cells, coloring them by cluster.

[Figure: TSNE]

Here, poorly separated clusters are shown. The quality of the clustering can be assessed using the silhouette plot of the data. A silhouette value is calculated for each cell.

Let "in" be the average distance from one cell to all other cells of the same cluster, and "next" be the smallest average distance from that cell to all cells of any other cluster. The silhouette value "sil" is then defined as sil = (next - in) / max(in, next). For each cell, the silhouette value thus ranges from -1 to 1. Negative silhouette values indicate badly clustered cells, and positive values indicate well clustered cells. A good clustering should therefore show many positive values close to 1.
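
The definition can be tried on a toy example with the Python sketch below. It is an illustration only: the function name silhouette_values and the dist argument are hypothetical, the cell itself is excluded from the "in" average, a cell alone in its cluster is given 0 by convention, and at least two clusters are assumed.

```python
def silhouette_values(points, labels, dist):
    """Silhouette value of each point, given cluster labels and a distance function."""
    values = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Sum and count of distances from point i to every other point, per cluster.
        sums, counts = {}, {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i == j:
                continue
            sums[other] = sums.get(other, 0.0) + dist(p, q)
            counts[other] = counts.get(other, 0) + 1
        if lab not in counts:
            values.append(0.0)  # point alone in its cluster (convention)
            continue
        d_in = sums[lab] / counts[lab]
        d_next = min(sums[c] / counts[c] for c in counts if c != lab)
        denom = max(d_in, d_next)
        values.append((d_next - d_in) / denom if denom > 0 else 0.0)
    return values


def absolute_difference(a, b):
    return abs(a - b)


points = [0.0, 1.0, 10.0, 11.0]
good = silhouette_values(points, ["a", "a", "b", "b"], absolute_difference)
bad = silhouette_values(points, ["a", "b", "b", "b"], absolute_difference)
print(good, bad)
```

With the natural grouping all values are close to 1, whereas assigning the second point to the distant cluster gives it a clearly negative value.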

[Figure: sil]

Here we clearly see a poor clustering, as one group shows only negative silhouette values.