About topic reduction "auto" #2054

HerrmannM · 2024-06-18T06:41:01Z

Hi all,

I was wondering what exactly the clustering for the automatic topic reduction operates on (looking at the embeddings case, not the c-TF-IDF case).

Topic embeddings are created by taking the average of all document embeddings in the topic.
The embeddings are normalised.
HDBSCAN is applied with some predefined parameters, giving the new clusters.

HDBSCAN uses a Euclidean distance rather than cosine similarity, but embeddings are generally designed to be compared with cosine similarity. In 'normal' use, UMAP projects the embeddings into a Euclidean space, and then HDBSCAN is applied.

Did I miss something? E.g. are the topic embeddings computed by averaging what comes out of UMAP?
If not, how could we evaluate the impact of using the Euclidean distance rather than cosine similarity?

Thank you!

MaartenGr (Maintainer) · 2024-06-18T08:43:39Z

Applying L2 normalization to embeddings and then computing the Euclidean distance between them will result in a measure that is closely related to cosine similarity.
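
To make "closely related" concrete: for unit-length vectors, the squared Euclidean distance equals 2 − 2·cos(a, b), so the distance is a monotone function of the cosine similarity. The snippet below is a minimal numerical check of that identity, not code from BERTopic; the helper names and the example vectors are made up for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two (not necessarily normalized) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Plain Euclidean (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Arbitrary example vectors (illustrative only).
a = [0.3, -1.2, 2.0]
b = [1.5, 0.4, -0.7]

a_hat, b_hat = l2_normalize(a), l2_normalize(b)

# Squared Euclidean distance between the normalized vectors...
squared_distance = euclidean_distance(a_hat, b_hat) ** 2
# ...should match 2 - 2 * cosine similarity of the original vectors.
from_cosine = 2.0 - 2.0 * cosine_similarity(a, b)

print(squared_distance, from_cosine)
```

The two printed values should agree up to floating-point error.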

> Did I miss something? E.g. are the topic embeddings computed by averaging what comes out of UMAP?

The topic embeddings are generally created by taking the average of the input embeddings, so not the reduced embeddings that come out of UMAP.
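
A rough sketch of that averaging step, followed by the L2 normalisation mentioned in the question, might look as follows. This is not BERTopic's actual implementation; `topic_embeddings`, `mean_vector` and the toy data are hypothetical names and values chosen for illustration, and the clustering step that would follow is left out.

```python
import math
from collections import defaultdict


def mean_vector(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def topic_embeddings(doc_embeddings, doc_topics):
    """Average the *input* document embeddings per topic, then normalize.

    doc_embeddings: one embedding per document (before any UMAP reduction).
    doc_topics: the topic id assigned to each document.
    """
    grouped = defaultdict(list)
    for embedding, topic in zip(doc_embeddings, doc_topics):
        grouped[topic].append(embedding)
    return {
        topic: l2_normalize(mean_vector(vectors))
        for topic, vectors in grouped.items()
    }


# Toy data: three 2-dimensional document embeddings assigned to two topics.
doc_embeddings = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]]
doc_topics = [0, 0, 1]

topics = topic_embeddings(doc_embeddings, doc_topics)
print(topics)
```

The resulting unit-length topic vectors are what a Euclidean clustering step would then receive.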

> If not, how could we evaluate the impact of using the euclidean distance rather than the cosine similarity?

See above.

1 reply

HerrmannM Jun 18, 2024
Author

Thank you!

For anyone else wondering how the distances are related: https://stats.stackexchange.com/questions/146221/is-cosine-similarity-identical-to-l2-normalized-euclidean-distance (so both should lead to the same results when used in a 'relative' manner, e.g. ranking).
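
A small sketch of the ranking point, assuming all vectors are L2-normalized first: sorting candidates by descending cosine similarity and by ascending Euclidean distance gives the same order. The query, the candidate vectors and their names are invented for this example.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Plain Euclidean (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Illustrative vectors, normalized before comparison.
query = l2_normalize([1.0, 0.2, 0.0])
candidates = {
    "a": l2_normalize([0.9, 0.1, 0.1]),
    "b": l2_normalize([0.0, 1.0, 0.0]),
    "c": l2_normalize([-1.0, 0.0, 0.5]),
    "d": l2_normalize([0.5, 0.5, 0.5]),
}

# Most similar first: highest cosine similarity, lowest Euclidean distance.
by_cosine = sorted(candidates, key=lambda name: -dot(query, candidates[name]))
by_euclidean = sorted(
    candidates, key=lambda name: euclidean_distance(query, candidates[name])
)

print(by_cosine)
print(by_euclidean)
```

Both lists should come out in the same order. Note that this only demonstrates the equivalence for ranking; it says nothing about how a specific clustering algorithm uses the absolute distance values.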
