-
By default, I repeat the entire experiment 5 times. My use case has 1,000 to 2,000 vectors, and the test takes about 300 ms on a Core i7 laptop CPU to find the best clustering. Compared to the time UMAP takes, this is negligible. Anecdotally, I am finding that a collection of articles on a single topic produces a K value of 8. It happens so regularly that I find it worth mentioning.
-
I'm going to close the discussion; it has served its purpose. Thank you.
-
Currently, BERTopic runs with default settings, and each modification a user wants to make requires importing a library and passing an object into BERTopic. The relevant example is the hdbscan_model parameter, which accepts other clustering models, such as KMeans from scikit-learn.
Below I am proposing a function that would allow you to set hdbscan_model = "KMeans" (a string) and have it automatically find a suitable value for K. Because the centroids are initialized randomly, it repeats the test and returns the best-performing KMeans model. I have added some redundancy that would make it easier to apply to other clustering methods too. It uses the silhouette score to rank the clustering results.
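For readers who want to experiment with the idea, here is a minimal sketch of that kind of search. It is not the function from the original post. The name `best_kmeans`, its parameters, and the default range of K values are all assumptions made for illustration. It fits scikit-learn's `KMeans` for each candidate K, repeats the sweep with different random seeds, and keeps the fit with the highest silhouette score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_kmeans(embeddings, k_range=range(2, 11), n_repeats=5, random_state=0):
    """Hypothetical helper: return (model, k, score) for the KMeans fit
    with the highest silhouette score over all candidate K and repeats."""
    embeddings = np.asarray(embeddings)
    n_samples = len(embeddings)
    rng = np.random.default_rng(random_state)
    best_model, best_k, best_score = None, None, -np.inf

    for _ in range(n_repeats):
        for k in k_range:
            # silhouette_score needs between 2 and n_samples - 1 clusters
            if k < 2 or k >= n_samples:
                continue
            # A fresh seed per fit, because centroid initialization is random
            seed = int(rng.integers(0, 2**31 - 1))
            model = KMeans(n_clusters=k, n_init=1, random_state=seed)
            labels = model.fit_predict(embeddings)
            if len(np.unique(labels)) < 2:
                continue  # degenerate fit, cannot be scored
            score = silhouette_score(embeddings, labels)
            if score > best_score:
                best_model, best_k, best_score = model, k, score

    return best_model, best_k, best_score
```

Assuming the vectors are the reduced embeddings you intend to cluster, the returned K could then be used to build a `KMeans(n_clusters=k)` object to pass as hdbscan_model, as described above. The string shortcut hdbscan_model = "KMeans" remains a proposal in this thread, and the sketch does not assume it exists in BERTopic.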