Module : Clustering with Seurat
This module clusters cells using a given algorithm.
- Internal name : clustering-seurat
- Available : local mode
- Input Ports :
  - matrix : filtered expression matrix (tsv)
  - cells : normalized cells metadata (tsv)
  - genes : genes metadata (tsv)
- Output Ports :
  - genesoutput : genes metadata (tsv) (completed with Boolean values indicating whether each gene was used)
  - cellsoutput : normalized cells metadata (tsv) (completed with clusters)
- Optional parameters :
- Parameters for data correction
Parameter | Type | Description | Default Value |
---|---|---|---|
normalize | boolean | whether to normalize data or not | True |
scale | boolean | whether to scale data or not | True |
- Parameters for gene filtering on expression
Parameter | Type | Description | Default Value |
---|---|---|---|
detection_threshold | int | Minimum number of counts to consider a gene as detected | 10 |
ncells | float | Number of detections needed to keep a gene | 0 |
- Parameters for gene filtering on variability
Parameter | Type | Description | Default Value |
---|---|---|---|
hvg_method | text | Method for Highly Variable Genes detection : must be either scran, seurat or none (no filtering) | none |
low_mean | float | Minimum mean log-transformed expression value to keep a gene | 0.01 |
high_mean | float | Maximum mean log-transformed expression value to keep a gene | 5 |
var | float | Minimum variability (as Z-score for Seurat or biological component for scran) to keep a gene | 1 |
use_spike | boolean | whether to use spike-ins or not (only used with the scran algorithm) | True |
- Parameters for Jackstraw randomizations
Parameter | Type | Description | Default Value |
---|---|---|---|
nreplicate | int | Number of jackstraw repeats | 100 |
proportion | float | Proportion of random features at each jackstraw iteration | 0.1 |
significativity | float | Significance threshold for a principal component under jackstraw testing | 0.05 |
score_threhsold | float | Significance threshold for a gene on one component under jackstraw testing | 0.00001 |
- Parameters for clustering
Parameter | Type | Description | Default Value |
---|---|---|---|
resolution | float | Resolution parameter for clustering (the lower it is the larger clusters are) | 0.8 |
k | int | Number of nearest neighbors to consider (the higher it is the larger clusters are) | 30 |
k_scale | int | Granularity option for k (steps for increasing the number of neighbors) | 25 |
algorithm | text | Which clustering algorithm to use : must be one of Louvain, Louvain.multilevel, or SLM (Smart Local Moving) | SLM |
sparse | boolean | Whether to use a sparse matrix to store the graph (recommended for 10X data) | False |
- Configuration example
<step id="Clustering" skip="false">
    <module>clustering-seurat</module>
    <parameters>
        <parameter>
            <name>hvg_method</name>
            <value>none</value>
        </parameter>
    </parameters>
</step>
Seurat estimates, by maximum likelihood, the mean of the expressed (non-zero) values of each gene and the standard deviation around that mean. Genes are then grouped according to their mean expression, and within each group a z-score is calculated for each gene's standard deviation. Z-scores are then plotted against mean expression.
In the plot, labels indicate the genes that were kept. Here, a group of genes on the left clearly separates from the others. These are probably noisy genes and should be removed. Conversely, some genes with a high mean and a high dispersion were removed, although they may carry interesting information.
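The binning and z-scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the module's or Seurat's actual code: the function name `select_variable_genes`, the equal-width binning and the `n_bins` argument are assumptions made for the example, and the variability value of each gene is supplied by the caller. The `low_mean`, `high_mean` and `var` arguments mirror the parameters listed in the table.

```python
from statistics import mean, pstdev


def select_variable_genes(stats, n_bins=20, low_mean=0.01, high_mean=5.0, var=1.0):
    """stats maps a gene name to (mean expression, variability value).

    Hypothetical helper: returns the per-gene z-scores and the sorted list
    of genes that pass the mean and variability cutoffs.
    """
    means = [m for m, _ in stats.values()]
    lo, hi = min(means), max(means)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width when all means are equal

    # Assign each gene to an equal-width bin of mean expression (assumed scheme).
    bins = {}
    for gene, (m, _) in stats.items():
        idx = min(int((m - lo) / width), n_bins - 1)
        bins.setdefault(idx, []).append(gene)

    # Z-score each gene's variability against the genes of its own bin.
    zscores = {}
    for genes in bins.values():
        values = [stats[g][1] for g in genes]
        center, spread = mean(values), pstdev(values)
        for g in genes:
            zscores[g] = (stats[g][1] - center) / spread if spread > 0 else 0.0

    # Keep genes inside the mean window whose z-score reaches the cutoff.
    kept = sorted(
        g for g, (m, _) in stats.items()
        if low_mean <= m <= high_mean and zscores[g] >= var
    )
    return zscores, kept
```

Because the z-score is computed per bin, a gene is only compared with genes of similar mean expression, which is why a highly expressed gene is not automatically flagged as variable.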
Scran fits a non-linear model of gene variance as a function of mean expression, using local regression (LOESS, LOcal regrESSion). If spike-ins are given, this fit reflects the technical variability, and you should keep the genes that lie clearly above what technical variation alone would predict, since they hold much of the biological information. If no spike-ins are given, the model assumes that most of the variance is technical. This method is therefore best used when spike-ins are available.
Based on the retained genes (annotated as Boolean values in the genes metadata file), Seurat computes a PCA on the data matrix. It then uses jackstraw randomization to estimate which components are important. Briefly, at each jackstraw iteration, a fraction of the genes is randomized. After many repetitions of the process, a null distribution of each gene's weight on the PCA components is obtained. Genes whose weights are significantly higher than expected at random may be informative genes. Moreover, the components associated with these genes are strongly descriptive of the data. These components are assumed to reflect latent variables (cell type, differentiation, etc.) and must therefore be kept. The other components are considered noise.
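To discriminate between informative and non-informative components, Seurat tests, for each component, whether more strongly significant genes (genes whose p-value is below the gene-level threshold, score_threhsold) are found than would be expected under a uniform distribution of the p-values. The result of this test is compared with the component-level threshold, significativity.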
The above plot shows the QQ-plot of the expected p-values against the observed p-values. Significant PCs should show a nearly vertical curve. The module always keeps at least 2 PCs, whether or not they are both informative.
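The idea of comparing the number of strongly significant genes with what uniform p-values would give can be illustrated as follows. This sketch swaps in an exact binomial upper-tail test, which is an assumption made for the example and not necessarily the test used by Seurat or by this module. The function names are hypothetical; the two thresholds mirror the `score_threhsold` (gene level) and `significativity` (component level) parameters.

```python
from math import exp, lgamma, log, log1p


def binomial_upper_tail(n, k, prob):
    """P(X >= k) for X ~ Binomial(n, prob), with 0 < prob < 1.

    Terms are computed in log space so that large n does not overflow.
    """
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(prob) + (n - i) * log1p(-prob)
        )
        total += exp(log_term)
    return min(1.0, total)


def pc_is_significant(gene_p_values, score_threshold=1e-5, significativity=0.05):
    """Hypothetical helper: is one component enriched in significant genes?"""
    n = len(gene_p_values)
    # Genes counted as strongly significant on this component.
    observed = sum(p <= score_threshold for p in gene_p_values)
    # Under uniform p-values, each gene falls below the threshold with
    # probability score_threshold, so the count is Binomial(n, score_threshold).
    p_component = binomial_upper_tail(n, observed, score_threshold)
    return p_component < significativity
```

With a gene-level threshold as small as 0.00001, even a handful of genes below it is very unlikely under the uniform null, so a component with a few strongly significant genes is kept.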
The module then draws a heatmap of the genes most correlated with each retained PC, using 200 cells (or all cells if you have fewer than 200).
Here the heatmap clearly shows a spurious selection of PCs, because PC1 seems noisy and driven by only a few extreme cells, and PC2 shows no informative genes.
First, Seurat builds a K-Nearest Neighbor graph and adjusts the edge weights using the Jaccard distance. Let E1 and E2 be the sets of nearest neighbors of cell 1 and cell 2, respectively. Then the Jaccard distance J between these two cells is J = 1 - |E1 ∩ E2| / |E1 ∪ E2|. So, if E1 = E2, J = 0, and if E1 and E2 have no common element, J = 1. In other words, 1 - J is the proportion of neighbors shared by the two cells.
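As a small illustration, the Jaccard distance between two neighbor sets, using the standard definition J = 1 - |E1 ∩ E2| / |E1 ∪ E2|, can be computed as below. The function name is hypothetical and the convention for two empty sets is an assumption of this sketch.

```python
def jaccard_distance(neighbors_a, neighbors_b):
    """Jaccard distance between two collections of neighbor identifiers."""
    a, b = set(neighbors_a), set(neighbors_b)
    union = a | b
    if not union:
        # Assumed convention: two empty neighborhoods are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)
```

For example, cells with neighbor sets {1, 2, 3} and {2, 3, 4} share 2 of 4 distinct neighbors, giving a distance of 0.5.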
Secondly, the graph is partitioned so as to increase modularity. Modularity is a value between -1 and 1. A pair of cells in distinct clusters contributes 0 to modularity. When cells share a cluster, modularity increases with the weight between those cells, and decreases when the weights between those cells and other cells increase. The algorithm starts with each cell in its own cluster. Then it groups cells in order to increase modularity. Subsequently, each group of cells is treated as a single node, and the weights are recalculated. The first phase is then repeated on the new graph, until no further increase in modularity can be achieved. SLM, Louvain and Louvain.multilevel are variations of this algorithm.
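To make the quantity being optimized concrete, here is a minimal sketch of the usual (Newman) modularity of a weighted, undirected graph for a given partition. It only evaluates modularity; it does not implement the Louvain or SLM optimization itself. The function name and the edge-dictionary representation are assumptions of this example, and the graph is assumed to have no self-loops.

```python
from collections import defaultdict


def modularity(edges, cluster_of):
    """Modularity of a partition.

    edges maps an undirected edge (u, v) to its weight, each edge listed once.
    cluster_of maps each node to its cluster label.
    """
    total_weight = sum(edges.values())
    if total_weight == 0:
        return 0.0

    internal = defaultdict(float)  # weight of edges inside each cluster
    degree = defaultdict(float)    # summed weighted degree of each cluster
    for (u, v), w in edges.items():
        degree[cluster_of[u]] += w
        degree[cluster_of[v]] += w
        if cluster_of[u] == cluster_of[v]:
            internal[cluster_of[u]] += w

    # Per cluster: fraction of weight inside it, minus the fraction
    # expected if edges were placed at random with the same degrees.
    return sum(
        internal[c] / total_weight - (degree[c] / (2 * total_weight)) ** 2
        for c in degree
    )
```

On two triangles joined by a single edge, placing each triangle in its own cluster gives a higher modularity than putting all nodes in one cluster (which gives 0) or each node in its own cluster (which gives a negative value).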
The module plots a t-SNE projection of the cells, coloring them by cluster.
Here, poorly separated clusters are shown. The quality of the clustering can be assessed using the silhouette plot of the data. A silhouette value is calculated for each cell.
Let "in" be the average distance of one cell to all other cells from the same cluster, and "next" be the minimal average distance of that cell to all cells from a different cluster. Then the silhouette value "sil" is defined as sil = (next - in) / max(in, next). Thus, for each cell, the silhouette value ranges from -1 to 1. Negative silhouette values indicate poorly clustered cells, and positive values indicate well clustered cells. A good clustering should therefore show many positive values close to 1.
Here we clearly see a poor clustering, as one group shows only negative silhouette values.
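The silhouette definition sil = (next - in) / max(in, next) can be sketched directly from a pairwise distance matrix. This is an illustrative implementation, not the module's code. The function name is hypothetical, `inside` and `nearest` stand for "in" and "next" (`in` is a reserved word in Python), and assigning 0 to cells that are alone in their cluster is an assumed convention.

```python
def silhouette_values(distance, labels):
    """Silhouette value of each cell.

    distance[i][j] is the distance between cells i and j.
    labels[i] is the cluster of cell i.
    """
    members = {}
    for i, lab in enumerate(labels):
        members.setdefault(lab, []).append(i)

    values = []
    for i, lab in enumerate(labels):
        own = [j for j in members[lab] if j != i]
        if not own or len(members) < 2:
            values.append(0.0)  # assumed convention for undefined cases
            continue
        # "in": average distance to the other cells of the same cluster.
        inside = sum(distance[i][j] for j in own) / len(own)
        # "next": smallest average distance to the cells of another cluster.
        nearest = min(
            sum(distance[i][j] for j in idx) / len(idx)
            for other, idx in members.items() if other != lab
        )
        denom = max(inside, nearest)
        values.append((nearest - inside) / denom if denom > 0 else 0.0)
    return values
```

For four cells at positions 0, 1, 10 and 11 on a line, grouping them as {0, 1} and {10, 11} gives silhouette values close to 1, whereas grouping them as {0, 10} and {1, 11} gives only negative values, which is the pattern of a poor clustering.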