Clustering Algorithm
When the number of cells in an expression matrix exceeds 35,000, the computational and memory demands of the pipeline quickly become intractable. To enable analysis of such data, we have developed a pooling algorithm that preserves local structure while greatly reducing the number of representative points. By default, the number of cells retained for a FastProject analysis after clustering is the square root of the number of cells in the input expression matrix. The purpose of this algorithm is not to heuristically select a set of informative cells, but simply to reduce the number of samples so that, when analyzing the data, the user can send interesting subsets of cells off for further analysis. Refer to Using the Output Report for more information.
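As a concrete illustration of the default setting, the target number of representative points is simply the square root of the input size (the function name and the choice to round up are ours, not FastProjectR's API):

```python
import math

def default_num_supercells(n_cells: int) -> int:
    # Default pooling target: the square root of the number of input cells.
    # Rounding up is an assumption; the text only specifies "square root".
    return math.ceil(math.sqrt(n_cells))

# A 35,000-cell matrix is reduced to roughly 188 representative points.
print(default_num_supercells(35000))
```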
To cluster the cells in the expression matrix, we rely on a graph-based clustering method, the Louvain algorithm (Blondel et al., 2008). Because forming a dense graph from the expression matrix is computationally prohibitive, we first apply three preprocessing steps to obtain a sparse K-nearest-neighbors graph:
- Identify the most informative genes by applying a Fano filter
- Reduce dimensionality further by applying PCA and retaining the first 30 principal components
- Find the K=30 nearest neighbors for each cell using a ball-tree-based model (refer to van der Maaten (2014))
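The three preprocessing steps above can be sketched in numpy as follows. The function and parameter names are illustrative, the Fano filter is reduced to a simple top-genes-by-dispersion cut, and brute-force distances stand in for the ball tree that FastProjectR actually uses:

```python
import numpy as np

def preprocess(expr, n_genes=1000, n_components=30, k=30):
    """expr: (cells x genes) expression matrix."""
    # 1. Fano filter: keep genes with the largest Fano factor
    #    (variance / mean), a simple dispersion measure.
    mean = expr.mean(axis=0)
    var = expr.var(axis=0)
    fano = np.divide(var, mean, out=np.zeros_like(var), where=mean > 0)
    filtered = expr[:, np.argsort(fano)[::-1][:n_genes]]

    # 2. PCA: project onto the first principal components via SVD.
    centered = filtered - filtered.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pcs = centered @ vt[:n_components].T

    # 3. K nearest neighbors per cell (brute force here; a ball tree
    #    makes this tractable at scale).
    d2 = ((pcs[:, None, :] - pcs[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # exclude self-neighbors
    neighbors = np.argsort(d2, axis=1)[:, :k]
    return pcs, neighbors
```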
We can then form a sparse graph from the data, involving all samples but with only 30*N edges, as opposed to the N^2 edges of a dense distance-based graph. Currently, edge weights are simply the Euclidean distance between each point and its 30 nearest neighbors, although a potentially more effective measure would be an adaptive Gaussian kernel as in van Dijk et al. (2017). After this preprocessing, we apply the Louvain algorithm, which efficiently finds a set of clusters in the input graph.
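A sketch of the sparse-graph construction, with edge weights given by the Euclidean distances described above (the helper name is ours; the Louvain step itself, available in standard community-detection libraries, is not reproduced here):

```python
import numpy as np

def knn_graph_edges(points, k=30):
    """Weighted edge list of a K-nearest-neighbor graph: k*N directed
    edges instead of the N*(N-1) of a dense distance graph."""
    n = len(points)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # no self-edges
    nbrs = np.argsort(d2, axis=1)[:, :k]
    edges = []
    for i in range(n):
        for j in nbrs[i]:
            # Weight: plain Euclidean distance to the neighbor.
            edges.append((i, int(j), float(np.sqrt(d2[i, j]))))
    return edges
```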
Typically, a relatively small number of clusters is found in the input graph; at this point, however, each cluster contains few enough cells to apply a more traditional clustering method such as K-means. We obtain the remaining clusters (recall that by default the minimum number desired is the square root of the number of cells) by iteratively applying K-means to the clusters already found until a satisfactory number is reached. The cells in each final cluster are then pooled into a "supercell" whose expression for each gene is the average expression of that gene over the cluster. These combined cells are used in the analysis; clusters of cells can be reanalyzed during the visualization process.
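The iterative subdivision and pooling can be sketched as follows. The splitting policy shown (always bisect the largest cluster with K-means until the target is reached) and all names are illustrative assumptions, not FastProjectR's exact procedure:

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    # Minimal Lloyd's K-means, used to subdivide one cluster.
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = points[labels == c].mean(axis=0)
    return labels

def pool_supercells(expr, clusters, target):
    """Split clusters until at least `target` remain, then average each
    cluster's expression into one "supercell"."""
    clusters = [np.asarray(c) for c in clusters]
    while len(clusters) < target:
        clusters.sort(key=len)
        big = clusters.pop()  # bisect the largest cluster
        labels = kmeans(expr[big], k=2)
        left, right = big[labels == 0], big[labels == 1]
        if len(left) == 0 or len(right) == 0:
            # Degenerate K-means split: fall back to halving the indices.
            mid = len(big) // 2
            left, right = big[:mid], big[mid:]
        clusters += [left, right]
    # Each supercell is the mean expression over its member cells.
    return np.stack([expr[c].mean(axis=0) for c in clusters])
```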
We find that our algorithm performs very well on inputs of various sizes.
Number of Cells | K-means | FastProjectR
---|---|---
430 | 0.005 s | 0.118 s
1,000 | 0.0356 s | 0.147 s
5,000 | 0.316 s | 0.789 s
20,000 | 1.84 s | 5.38 s
50,000 | 20.07 s | 17.43 s
100,000 | 86.4 s | 52.69 s
250,000 | 784.809 s | 347.54 s
While K-means is slightly faster on smaller inputs, it is insufficient for expression matrices of the size FastProjectR is built to analyze. Our algorithm scales much better and can handle inputs of up to a million cells in a feasible amount of time.
These plots compare the performance of FastProjectR's clustering algorithm on small and large input sizes. As can be seen on this two-dimensional data, the algorithm cleanly separates neighborhoods within the data. The first plot consists of 1M cells uniformly distributed over the unit square [0,1] x [0,1]; the second shows 430 cells clustered after being projected with tSNE.