Automating KMeans clustering #1377

NewsSoup · 2023-06-29T10:02:53Z

Currently, BERTopic runs with default settings, and each modification a user wants to make requires importing a library and passing an object into BERTopic. The relevant example is the hdbscan_model parameter, which can be changed by passing in a different clustering model, such as KMeans from scikit-learn.

Below I am proposing a function that would allow you to set hdbscan_model = "KMeans" (a string) and have it automatically find a suitable value for K. Because the centroids are initialised randomly, it repeats the test and returns the best-performing KMeans model. I have added some redundancy that would make it easier to apply to other clustering methods too. It uses the silhouette score to rank the clustering results.

from copy import deepcopy
from typing import Literal

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_cluster(
    self,                                  # BERTopic
    method: Literal["KMeans", "Other"],    # Switch functionality
    X: list | np.ndarray, y=None,          # Passed into hdbscan_model during clustering
    n_comparisons: int = 5,                # Number of repeats to find best clusters
    min_K: int = 5, max_K: int = 16,       # Default range K=5 to K=15 (max_K is exclusive)
    *args, **kwargs                        # kwargs are forwarded to KMeans
    ):

    #* Generate a valid range of k values to test
    range_ = range(min(min_K, len(X)-1), min(max_K, len(X)))

    #* Generate a list of clusters
    if method.lower() == "kmeans":
        cluster_list = []
        for _ in range(n_comparisons):
            for i in range_:
                #? Test repeats of a range of k-values
                cluster_list.append(KMeans(n_clusters=i, **kwargs).fit(X))

    #* Generalises to other clustering methods
    else:
        # Fit a fresh copy each time; refitting one object would fill the list with the same model
        cluster_list = [deepcopy(self.hdbscan_model).fit(X, y) for _ in range(n_comparisons)]

    #* Calculate silhouette scores
    try:
        scores = []
        for cluster_model in cluster_list:
            # labels_ is set by fit(), so this also works for models without predict()
            scores.append(silhouette_score(X, cluster_model.labels_))

        top_cluster_model = cluster_list[scores.index(max(scores))]

        return top_cluster_model

    #* If silhouette scoring fails, return unaltered input
    except Exception:
        print("Default cluster model used")
        return self.hdbscan_model

NewsSoup · 2023-06-29T10:09:06Z

I repeat the entire experiment 5 times by default. My use case has 1000 to 2000 vectors, and the test takes about 300 ms on my laptop's Core i7 CPU to generate the best cluster. Compared to the time UMAP takes, this is negligible.

Anecdotally, I am finding that collecting articles based on a single topic produces a K value of 8. It happens so regularly that I find it worthy of mention.

3 replies

MaartenGr Jun 29, 2023
Maintainer

This is something I think is best left to the user, as every model is different, multiple versions of single models exist, and the objective measure over which you optimize the model can be different for the same model. For example, there exist 4 different versions of HDBSCAN, each used with its own nuances. To give another example, finding the "right" value for k is highly dependent on the use case and there is no single best way of doing so. You can optimize k for coherence, topic diversity, cluster-related evaluation metrics, the elbow method, etc.

I believe that user-defined code for optimizing clusters might be more suitable. For example, a function that focuses on optimizing the clusters based on what the user wants to optimize it for.
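
To illustrate the point that the "right" k depends on the objective, here is a minimal sketch (not part of BERTopic) that asks three scikit-learn cluster metrics which k they prefer for KMeans. The synthetic blobs stand in for document embeddings, and the helper name best_k_per_metric is hypothetical; the metrics need not agree with each other.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in for document embeddings (an assumption for illustration).
X, _ = make_blobs(n_samples=600, centers=6, cluster_std=2.0, random_state=0)

# Each entry: (metric function, whether a higher value is better).
metrics = {
    "silhouette": (silhouette_score, True),
    "calinski_harabasz": (calinski_harabasz_score, True),
    "davies_bouldin": (davies_bouldin_score, False),
}


def best_k_per_metric(X, k_values, metrics, random_state=0):
    """Return {metric name: the k that this metric prefers} for KMeans."""
    # Fit once per k and reuse the labels for every metric.
    labels_by_k = {
        k: KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X).labels_
        for k in k_values
    }
    chosen = {}
    for name, (score_fn, higher_is_better) in metrics.items():
        scores = {k: score_fn(X, labels) for k, labels in labels_by_k.items()}
        pick = max if higher_is_better else min
        chosen[name] = pick(scores, key=scores.get)
    return chosen


chosen = best_k_per_metric(X, range(2, 12), metrics)
print(chosen)
```

Topic-level objectives such as coherence or diversity would need the topic representations as well, so they are left out of this sketch.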

NewsSoup Jun 30, 2023
Author

I get your point, so I have modified the code to be less specific. It only requires that the input model has a method called fit(X, y), where y can be None, and that models which require the number of clusters to be set manually call that attribute n_clusters. This is a standard API for machine learning models.

The function now only performs the silhouette test, and has been renamed accordingly. Adapting it to other techniques only requires a change in the scoring algorithm. If you want to add a cluster verification suite, this can be done quickly with the aid of the community (if you ask). HDBSCAN does not need verification as all the results/scores will be identical in any way.

from copy import deepcopy

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhoutette_test(
    self,                              # BERTopic
    X: list | np.ndarray, y=None,      # Passed into hdbscan_model during clustering
    n_comparisons: int = 5,            # Number of repeats to find best clusters
    find_K=False,                      # Use when a clustering algorithm needs to determine the number of clusters experimentally.
    min_K: int = 5, max_K: int = 16,   # Default range K=5 to K=15 (max_K is exclusive)
    *args, **kwargs
    ):

    #* Generate a valid range of k values to test
    range_ = range(min(min_K, len(X)-1), min(max_K, len(X)))
    try:
        #* Generate a list of clusters
        if find_K:
            cluster_list = []
            for _ in range(n_comparisons):
                for i in range_:
                    #? Test repeats of a range of k-values
                    # Copy model with settings
                    model = deepcopy(self.hdbscan_model)
                    # Change n_clusters attribute
                    model.n_clusters = i
                    cluster_list.append(model.fit(X, y))

        #* Generalises to other clustering methods
        else:
            # Fit a fresh copy each time; refitting one object would fill the list with the same model
            cluster_list = [deepcopy(self.hdbscan_model).fit(X, y) for _ in range(n_comparisons)]

        #* Calculate silhouette scores
        scores = []
        for cluster_model in cluster_list:
            # labels_ is set by fit(), so this also works for models without predict()
            scores.append(silhouette_score(X, cluster_model.labels_))

        top_cluster_model = cluster_list[scores.index(max(scores))]

        return top_cluster_model

    #* If silhouette scoring fails, return unaltered input
    except Exception:
        print("Default cluster model used")
        return self.hdbscan_model

Not every user will need it for every project, but it is a convenient feature to have built-in. In my use case HDBSCAN produces clusters that are unusable, even though they capture the variation in the data well. There will always be a trade-off between hypothetical perfection and practical approximation, which is why K-Means is still a dominant algorithm. All my filtering happens through HDBSCAN clusters though.

I'm not insisting you use my code, I'm just giving back in my own way by suggesting practical improvements in the form of features that improve user convenience.

MaartenGr Jun 30, 2023
Maintainer

Thanks for the updated code, I think this will help users out looking for something similar! I think if we were to go the route of cluster optimization, I would prefer to have a general function that allows you to optimize any clustering algorithm with any objective function.

"HDBSCAN does not need verification as all the results/scores will be identical in any way."

This is not necessarily true as there are many parameters to tweak for HDBSCAN. Some users might want to optimize for the number of clusters, outliers, cluster distribution, coherence, diversity, etc. So I think a "cluster optimizer" would be an interesting feature of BERTopic. Especially if it also includes the representations as those also impact what some might define as a "good" cluster.
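
As a rough sketch of what a general "cluster optimizer" could look like (this is not an existing BERTopic feature), the function below takes any list of unfitted scikit-learn style clusterers and any objective of the form objective(X, labels), where higher is better. The names optimize_clusterer and silhouette_minus_outliers are hypothetical, DBSCAN is used only as a stand-in for a density-based model that produces -1 outliers, and the objective is one arbitrary example of trading cluster quality against outliers.

```python
from typing import Callable, Iterable

import numpy as np
from sklearn.base import clone
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def optimize_clusterer(candidates: Iterable, X: np.ndarray, objective: Callable):
    """Fit a fresh clone of each candidate; return (best fitted model, best score)."""
    best_model, best_score = None, -np.inf
    for candidate in candidates:
        model = clone(candidate).fit(X)
        try:
            score = objective(X, model.labels_)
        except ValueError:
            # e.g. silhouette_score raises when fewer than two labels are present
            continue
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score


def silhouette_minus_outliers(X, labels):
    """Example objective: silhouette of non-outlier points minus the outlier fraction."""
    labels = np.asarray(labels)
    keep = labels != -1  # -1 marks outliers in density-based models
    outlier_fraction = 1.0 - keep.mean()
    return silhouette_score(X[keep], labels[keep]) - outlier_fraction


# Synthetic stand-in for reduced document embeddings (an assumption for illustration).
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=1.0, random_state=1)

candidates = [KMeans(n_clusters=k, n_init=10, random_state=0) for k in range(2, 9)]
candidates += [DBSCAN(eps=eps, min_samples=5) for eps in (0.5, 1.0, 1.5)]

best_model, best_score = optimize_clusterer(candidates, X, silhouette_minus_outliers)
print(type(best_model).__name__, best_score)
```

An objective that also looks at topic representations would need more than X and the labels, so the signature here is only a starting point.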

NewsSoup · 2023-07-18T10:32:21Z

I'm going to close the discussion; it has served its purpose. Thank you.

1 reply

karwester Jan 15, 2024

@NewsSoup Hi, I was just looking for a way to evaluate different numbers of clusters for a BERTopic model with k-Means. My current model has 50 clusters and looks better than the default model with over 170 clusters and a huge -1 topic. I came across this discussion and would like to try out your code, but I'm not sure what the first argument should be. All my attempts ended up in the except branch.

My current model is:

from bertopic import BERTopic
from sklearn.cluster import KMeans

cluster_model = KMeans(n_clusters=50)
titles = data.textClean.to_list()
topic_model = BERTopic(language="english", calculate_probabilities=True, hdbscan_model=cluster_model, verbose=True)
topics, probs = topic_model.fit_transform(titles)

Could you give me an example of using your function?
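
For reference, a hedged sketch of how a function of this shape might be called. In the posted code the first argument is annotated as the BERTopic instance, and only its hdbscan_model attribute is read, so a SimpleNamespace stands in for it here to keep the example runnable without BERTopic. The function pick_k_by_silhouette is a condensed, hypothetical variant of the find_K branch (not the author's exact code), and the blobs stand in for the vectors that BERTopic clusters, which are its dimensionality-reduced embeddings.

```python
from copy import deepcopy
from types import SimpleNamespace

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def pick_k_by_silhouette(topic_model, X, min_K=5, max_K=16, n_comparisons=5):
    """Condensed, hypothetical variant of the find_K branch posted above."""
    candidates = []
    for _ in range(n_comparisons):
        for k in range(min(min_K, len(X) - 1), min(max_K, len(X))):
            model = deepcopy(topic_model.hdbscan_model)  # keep the user's settings
            model.n_clusters = k
            candidates.append(model.fit(X))
    # Score every fitted candidate and keep the one with the highest silhouette.
    scores = [silhouette_score(X, m.labels_) for m in candidates]
    return candidates[int(np.argmax(scores))]


# Stand-in for a BERTopic instance: only the hdbscan_model attribute is used.
topic_model_like = SimpleNamespace(hdbscan_model=KMeans(n_clusters=50, n_init=10))

# Stand-in for the reduced embeddings (an assumption for illustration).
X, _ = make_blobs(n_samples=500, centers=8, n_features=5, random_state=0)

best = pick_k_by_silhouette(topic_model_like, X)
print(best.n_clusters)
```

With a real model, topic_model_like would be the BERTopic object itself and X the reduced embeddings. Because BERTopic fits whatever is passed as hdbscan_model again during fit_transform, it is probably simplest to take best.n_clusters and pass KMeans(n_clusters=best.n_clusters) to a new BERTopic.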
