A K-Means++ implementation for the .NET platform, including a Silhouette K-estimator and an Anderson-Darling statistical test.
```
Install-Package sharpkmeans
```
Create a dataset of N `IEnumerable<float>` items, where each item represents an embedding in M-dimensional space. For example, these could be ada-002 embeddings of answers to semi-open-ended questions. For large M, consider reducing the dimensions first via UMAP, as SharpKMeans uses Euclidean distance as its distance function, which quickly loses meaning in high-dimensional spaces.
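K-Means assigns each point to its nearest centroid by this Euclidean distance. As a self-contained illustration of the metric itself (the class and method names below are our own, not part of sharpkmeans):

```csharp
using System;

public static class DistanceDemo
{
    // Plain Euclidean distance between two points of equal dimension,
    // the metric sharpkmeans uses for cluster assignment.
    public static double Euclidean(float[] a, float[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++)
        {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.Sqrt(sum);
    }

    public static void Main()
    {
        Console.WriteLine(Euclidean(new[] { 0f, 0f, 0f }, new[] { 3f, 4f, 0f })); // prints 5
    }
}
```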
SharpKMeans allows clustering the dataset when we expect it to contain distinct groups. K-Means works best on roughly spherical clusters; for irregular shapes, consider DBSCAN.
There are two routines available for this:
- `Evaluate(int clustersMin, int clustersMax, IEnumerable<IEnumerable<float>> data)` - if we don't know the exact K but know a range in which K lies.
- `Evaluate(int clusters, IEnumerable<IEnumerable<float>> data)` - if we know the exact K.
Both routines are thread-safe and can take an optional argument with settings of type `KMeansSettings`. The settings available are:

- `Iterations` - increase the value if suboptimal clusters are found.
- `RequiredDifferenceBetweenIterations` - allows skipping slow convergence near the end and stopping the algorithm eagerly once the centroids shift only by a very small amount.
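A sketch of passing settings. This assumes `KMeansSettings` exposes the two options above as settable properties and that the settings object is the trailing optional argument of the routines; both are assumptions about the API shape, not confirmed signatures, and the values shown are arbitrary:

```csharp
// Hypothetical call shape; check the library's own docs for the exact API.
var settings = new KMeansSettings
{
    Iterations = 300,                           // raise if suboptimal clusters are found
    RequiredDifferenceBetweenIterations = 1e-4f // stop early once centroids barely move
};

KMeansResultSilhouette[] result = KMeans.Evaluate(3, 20, data, settings);
```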
An example of usage:
```csharp
float[][] data =
{
    new[] { 0f, 0.2f, 6f },
    new[] { 2.0f, 4f, 1.2f }
    // more data; the data should have at least two dimensions
    // for the Anderson-Darling check, each cluster needs at least 5 datapoints
};

KMeansResultSilhouette[] result = KMeans.Evaluate(3, 20, data);
KMeansResult bestResult = result[0].Result;
```
The output structure contains:
- `Clusters` - an array of clusters, where each cluster is defined by its medoid.
- `Datapoints` - the input datapoints assigned to the clusters.
- `Convergence` - the convergence progression report.
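The Silhouette K-estimator ranks candidate K values by the silhouette coefficient of the resulting clusters. As a self-contained sketch of the per-point coefficient (our own illustrative code, not sharpkmeans internals): s = (b - a) / max(a, b), where a is the mean distance to the point's own cluster and b is the smallest mean distance to any other cluster.

```csharp
using System;
using System.Linq;

public static class SilhouetteDemo
{
    static double Distance(float[] a, float[] b) =>
        Math.Sqrt(a.Zip(b, (x, y) => (double)(x - y) * (x - y)).Sum());

    // Silhouette coefficient for one point: s = (b - a) / max(a, b).
    // a = mean distance to the point's own cluster (excluding the point itself),
    // b = smallest mean distance to any other cluster.
    public static double Silhouette(float[] point, float[][] own, float[][][] others)
    {
        double a = own.Where(p => !ReferenceEquals(p, point))
                      .Average(p => Distance(point, p));
        double b = others.Min(c => c.Average(p => Distance(point, p)));
        return (b - a) / Math.Max(a, b);
    }

    public static void Main()
    {
        float[][] cluster1 = { new[] { 0f, 0f }, new[] { 0f, 1f } };
        float[][] cluster2 = { new[] { 10f, 10f }, new[] { 10f, 11f } };
        // Well-separated clusters give a coefficient close to 1;
        // values near 0 or below suggest the point sits between clusters.
        Console.WriteLine(Silhouette(cluster1[0], cluster1, new[] { cluster2 }));
    }
}
```

Averaging this coefficient over all points gives a score per K, which is why `Evaluate(3, 20, data)` can return its candidates sorted best-first.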
An example of the results plotted via ImageSharp, where K was inferred as 7:
Thanks to Aneta Kahleová for help with implementing the silhouette coefficient.