
Clustering Algorithm

mattjones315 edited this page Aug 29, 2017 · 8 revisions

When Clustering Is Applied

When the number of cells in an expression matrix exceeds 35,000, the computational and memory demands of the pipeline quickly become intractable. To enable analysis of such data, we have developed a pooling algorithm that preserves local structure while greatly reducing the number of representative points. By default, the number of cells retained for a FastProject analysis after clustering is the square root of the number of cells in the input expression matrix. The purpose of this algorithm is not to heuristically select a set of informative cells, but rather to reduce the number of samples so that, when analyzing the data, the user can send interesting subsets of cells off for further analysis. Refer to Using the Output Report for more information.
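The square-root default above can be sketched as follows. This is an illustrative Python snippet, not FastProjectR's actual (R) implementation; the function name `target_cluster_count` is hypothetical, and rounding up is an assumption.

```python
import math

def target_cluster_count(n_cells: int) -> int:
    """Default number of pooled representative points: the square root
    of the number of cells in the input matrix (rounded up here)."""
    return max(1, math.ceil(math.sqrt(n_cells)))

# e.g. a 40,000-cell matrix is reduced to 200 representative points
print(target_cluster_count(40000))  # → 200
```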

Overview of the Algorithm

To cluster the cells in the expression matrix, we rely on a graph-based clustering method called the Louvain Algorithm (Blondel et al. (2008)). Because forming a dense graph from the expression matrix is computationally prohibitive, we first apply three preprocessing steps to obtain a sparse K-nearest-neighbors graph:

  • Identify the most informative genes by applying a Fano Filter
  • Reduce dimensionality further by applying PCA and preserving the first 30 Principal Components
  • Find the K=30 nearest neighbors for all cells using a ball-tree based model (refer to van der Maaten (2014)).
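To make the first step concrete, the Fano factor of a gene is its variance divided by its mean; genes with high Fano factors show more variation than expected and are kept as informative. This is a minimal, stdlib-only Python sketch (FastProjectR itself is written in R, and the exact cutoff it applies is not specified here); the function name and toy data are illustrative.

```python
def fano_factors(expression_by_gene):
    """Fano factor (variance / mean) per gene; high values flag informative genes."""
    factors = {}
    for gene, values in expression_by_gene.items():
        n = len(values)
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / n
        factors[gene] = var / mean if mean > 0 else 0.0
    return factors

genes = {
    "GeneA": [0, 0, 10, 0, 12],  # bursty expression -> high Fano factor
    "GeneB": [4, 5, 4, 5, 4],    # uniform expression -> low Fano factor
}
factors = fano_factors(genes)
print(factors["GeneA"] > factors["GeneB"])  # → True
```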

We can then form a sparse graph over all samples with only 30*N edges, as opposed to the N^2 edges of a dense, distance-based graph. Currently, edge weights are simply the Euclidean distances between each point and its 30 nearest neighbors, although a more effective measure would likely be an adaptive Gaussian kernel as in van Dijk et al. (2017). After processing the input data, we apply the Louvain Algorithm, which efficiently finds a set of clusters from the input graph.
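The K*N-versus-N^2 edge count can be illustrated with a tiny brute-force K-nearest-neighbors graph. This is an assumption-laden sketch (FastProjectR uses a ball-tree model, not brute force, and is written in R); the function name `knn_edges` is hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+

def knn_edges(points, k):
    """Weighted edge list of a k-nearest-neighbor graph with Euclidean weights."""
    edges = []
    for i, p in enumerate(points):
        # rank every other point by distance and keep the k closest
        neighbors = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )[:k]
        edges.extend((i, j, dist(p, points[j])) for j in neighbors)
    return edges

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
edges = knn_edges(points, k=1)
print(len(edges))  # k*N = 4 edges, versus N^2 = 16 in a dense graph
```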

Typically, a relatively small number of clusters is found from the input graph; at this point, however, each cluster contains few enough cells to apply a more traditional clustering method such as Kmeans. We obtain the remaining clusters (recall that by default the minimum number of clusters desired is the square root of the number of cells) by applying Kmeans iteratively to each cluster already found until a satisfactory number of clusters is reached. The cells in each cluster are then pooled together into a "supercell" whose expression for each gene is the average expression of that gene across the cluster. These combined cells are used in the analysis; clusters of cells can be re-analyzed during the visualization process.
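The pooling step described above can be sketched as a per-gene average over each cluster. This is an illustrative Python sketch of the averaging only (not FastProjectR's R implementation); the function name `pool_supercells` and the toy data are assumptions.

```python
def pool_supercells(expression, clusters):
    """Pool each cluster of cells into one 'supercell' whose expression for
    each gene is the average expression of that gene across the cluster."""
    supercells = {}
    for label, cell_ids in clusters.items():
        n_genes = len(expression[cell_ids[0]])
        supercells[label] = [
            sum(expression[c][g] for c in cell_ids) / len(cell_ids)
            for g in range(n_genes)
        ]
    return supercells

# two genes, four cells, two clusters
expression = {0: [2.0, 0.0], 1: [4.0, 0.0], 2: [0.0, 6.0], 3: [0.0, 10.0]}
clusters = {"A": [0, 1], "B": [2, 3]}
print(pool_supercells(expression, clusters))
# → {'A': [3.0, 0.0], 'B': [0.0, 8.0]}
```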

Benchmarking our Clustering Method

We find that our algorithm performs well across a wide range of input sizes. The runtimes below compare standard Kmeans with our clustering method.

| Number of Cells | Kmeans | FastProjectR |
| ---: | ---: | ---: |
| 430 | 0.005 s | 0.118 s |
| 1,000 | 0.0356 s | 0.147 s |
| 5,000 | 0.316 s | 0.789 s |
| 20,000 | 1.84 s | 5.38 s |
| 50,000 | 20.07 s | 17.43 s |
| 100,000 | 86.4 s | 52.69 s |
| 250,000 | 784.809 s | 347.54 s |

While Kmeans performs slightly better on smaller inputs, it scales poorly to the expression matrix sizes FastProjectR is built to analyze. Our algorithm scales much better and can handle inputs of up to a million cells in a feasible amount of time.

Qualitative Performance of FastProjectR's Clustering Algorithm

These plots compare the performance of FastProjectR's clustering algorithm on small and large inputs. As can be seen on this two-dimensional data, the algorithm cleanly separates neighborhoods within the data. The first plot shows 1 million cells distributed uniformly over the unit square [0,1] x [0,1]; the second shows 430 cells clustered after being projected with tSNE.

Clustering 1 Million Cells

Clustering 430 Cells