-
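Applying L2 normalization to the embeddings and then computing the Euclidean distance between them gives a measure that is closely related to cosine similarity: for unit-length vectors, the squared Euclidean distance equals 2 × (1 − cosine similarity), so both produce the same ordering of neighbors.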
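To make the relationship concrete, here is a minimal, dependency-free sketch. The vectors and helper names (`l2_normalize`, `cosine_similarity`, `euclidean_distance`) are made up for illustration and are not taken from any library; the only thing it checks is the identity for unit vectors, `||u - v||^2 = 2 * (1 - cos(u, v))`.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy vectors standing in for two embeddings (illustrative values only).
a = [1.0, 2.0, 3.0]
b = [2.0, 0.5, 1.0]

cos_sim = cosine_similarity(a, b)
dist = euclidean_distance(l2_normalize(a), l2_normalize(b))

# After L2 normalization, squared Euclidean distance is a
# monotone function of cosine similarity.
assert math.isclose(dist ** 2, 2.0 * (1.0 - cos_sim))
```

Because the mapping is monotone, a Euclidean-based clusterer running on normalized embeddings sees the same nearest-neighbor ordering that cosine similarity would give.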
The topic embeddings are generally created by averaging the input embeddings of the documents in each topic, not the reduced embeddings that come out of UMAP.
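As a rough illustration of that averaging step, the sketch below groups document embeddings by their assigned topic and takes the per-dimension mean. The function name `mean_embeddings_by_topic` and the toy data are hypothetical; this is not the library's actual implementation, just the idea in a few lines.

```python
from collections import defaultdict


def mean_embeddings_by_topic(embeddings, topics):
    """Average the original (unreduced) embeddings within each topic."""
    grouped = defaultdict(list)
    for vec, topic in zip(embeddings, topics):
        grouped[topic].append(vec)
    # zip(*vecs) iterates over dimensions, so each column is averaged.
    return {
        topic: [sum(col) / len(vecs) for col in zip(*vecs)]
        for topic, vecs in grouped.items()
    }


# Toy input: three 2-dimensional document embeddings and their topic labels.
doc_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]
doc_topics = [0, 0, 1]

topic_embeddings = mean_embeddings_by_topic(doc_embeddings, doc_topics)
```

The resulting topic vectors live in the same space as the input embeddings, so they can be compared with cosine similarity directly.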
-
Hi all,
I was wondering what exactly the clustering for the automatic topic reduction operates on (looking at the embeddings case, not the c-TF-IDF case).
HDBSCAN uses Euclidean distance rather than cosine similarity, but embeddings are generally designed to be compared with cosine similarity. In 'normal' use, UMAP projects the embeddings into a Euclidean space, and then HDBSCAN is applied.
Thank you!