Number of Topics vs Probability Threshold #1372

noahberhe · 2023-06-28T10:45:53Z


Hello,

Firstly:
Am I right in thinking that the greater the number of clusters created by the model then the lower the probability needed for a document to be assigned to a cluster?
I can see in my dataset of about 100,000 docs there are 120 clusters created, and docs mapped to a cluster can have probabilities as low as e.g. 0.01.
Is there a way of thinking about setting a threshold here, if I wanted to manually tune the mapping of outlier docs to a cluster?

Secondly:
I was also looking at the probs array for docs that were left as outliers. One doc looks like it could rightly have been mapped to a topic: its prob was calculated as 0.65 (whereas the probs for most of the other 120 topics were 1e-180), yet the doc was still not mapped to that cluster. Why is this so?

Thirdly:
Here's a quick comparison of the max(probability) in clustered vs. outlier docs. Although on balance probabilities in clustered docs are higher, there is a fair bit of overlap in the middle, e.g. probabilities of ~0.3. Why isn't there a clear threshold either side of which a doc will be clustered or not?

Thanks
Noah

MaartenGr · 2023-06-28T17:28:28Z

Maintainer

Am I right in thinking that the greater the number of clusters created by the model then the lower the probability needed for a document to be assigned to a cluster?

It depends on where the probabilities are retrieved from, namely the underlying cluster model. In general, though, with more topics the probabilities are more dispersed across them, which results in lower individual probabilities. That, however, is from an absolute perspective, and you generally want to compare probabilities relative to one another.

I can see in my dataset of about 100,000 docs there are 120 clusters created, and docs mapped to a cluster can have probabilities as low as e.g. 0.01.

As mentioned above, it depends on the underlying cluster model. The probabilities with HDBSCAN, for example, are created after the actual assignment of clusters and therefore do not necessarily represent the training process. They are merely an approximation.

Is there a way of thinking about setting a threshold here, if I wanted to manually tune the mapping of outlier docs to a cluster?

You can use .reduce_outliers for that. Please refer to the documentation.
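
As a rough illustration of what a probability threshold does here, the sketch below reassigns outlier documents (topic -1) to their most probable topic only when that probability clears a cut-off. It is plain Python rather than BERTopic itself, and `reassign_outliers`, the toy `topics`, and the toy `probs` are made up for the example. `.reduce_outliers` also accepts `strategy="probabilities"` together with a `threshold` argument, which should behave along these lines, but check the documentation for the exact behaviour.

```python
def reassign_outliers(topics, probs, threshold):
    """Map outlier docs (topic -1) to their most probable topic
    when that probability is at least `threshold`.

    Hypothetical helper for illustration only, not part of BERTopic.
    """
    new_topics = []
    for topic, row in zip(topics, probs):
        # Index of the highest probability in this document's row.
        best = max(range(len(row)), key=row.__getitem__)
        if topic == -1 and row[best] >= threshold:
            new_topics.append(best)
        else:
            # Already-clustered docs and weak outliers are left untouched.
            new_topics.append(topic)
    return new_topics


# Toy data (assumed): four docs, two topics, -1 marks an outlier.
topics = [0, -1, -1, 1]
probs = [
    [0.90, 0.05],
    [0.65, 1e-180],
    [0.02, 0.03],
    [0.01, 0.40],
]

new_topics = reassign_outliers(topics, probs, threshold=0.3)
print(new_topics)
```

Raising the threshold keeps more docs as outliers; a threshold of 0 reassigns every outlier to its most probable topic.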

I was also looking at the probs array for docs that were left as outliers. One doc looks like it could rightly have been mapped to a topic: its prob was calculated as 0.65 (whereas the probs for most of the other 120 topics were 1e-180), yet the doc was still not mapped to that cluster. Why is this so?

As mentioned above, this is a result of HDBSCAN calculating the probabilities after assigning topics. So the probabilities are merely an approximation and may not match the inherent training process.

Here's a quick comparison of the max(probability) in clustered vs. outlier docs. Although on balance probabilities in clustered docs are higher, there is a fair bit of overlap in the middle, e.g. probabilities of ~0.3. Why isn't there a clear threshold either side of which a doc will be clustered or not?

This is also related to how HDBSCAN calculates its probabilities. I would advise reading through HDBSCAN's documentation here and here.

2 replies

noahberhe Jun 30, 2023
Author

It depends on where the probabilities are retrieved from, namely the underlying cluster model. In general, though, with more topics the probabilities are more dispersed across them, which results in lower individual probabilities. That, however, is from an absolute perspective, and you generally want to compare probabilities relative to one another.

Ok thanks, I will calculate the relative probability in that case, which for each topic will be prob[topic] / sum(probs).
In sum(probs) I will allow the total to be < 1, i.e. I will exclude the probability of the doc being an outlier.
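
The normalisation described above could be sketched like this, assuming one row of per-topic probabilities that does not include an outlier entry and so may sum to less than 1. The function name `relative_probs` and the example values are made up for illustration.

```python
def relative_probs(row):
    """Rescale one doc's per-topic probabilities so they sum to 1.

    Hypothetical helper: the outlier mass is simply left out of the total.
    """
    total = sum(row)
    if total == 0:
        # No probability mass on any topic, so nothing to rescale.
        return [0.0] * len(row)
    return [p / total for p in row]


# Toy row (assumed): low absolute values, but one topic clearly dominates.
row = [0.03, 0.01, 0.01]
rel = relative_probs(row)
print(rel)
```

Here the first topic ends up with roughly 0.6 of the relative mass even though its absolute probability is only 0.03.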

Is this what you meant by Relative Probability?

MaartenGr Jul 2, 2023
Maintainer

What I meant by relative probability is indeed about which probability is the highest compared to all the others, rather than which probability exceeds a certain value.
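
To make the distinction concrete, here is a small sketch contrasting the two views on a single row of topic probabilities. The helper names `most_likely_topic` and `topics_above` and the toy values are assumptions for the example, not BERTopic functions.

```python
def most_likely_topic(row):
    """Relative view: index of the highest probability, whatever its size."""
    return max(range(len(row)), key=row.__getitem__)


def topics_above(row, threshold):
    """Absolute view: indices of all topics whose probability meets a cut-off."""
    return [i for i, p in enumerate(row) if p >= threshold]


# Toy row (assumed): every probability is small, yet topic 0 clearly stands out.
row = [0.01, 0.002, 0.0005]

best = most_likely_topic(row)
above = topics_above(row, 0.3)
print(best, above)
```

With many topics the absolute values can all be small, so a fixed cut-off may select nothing while the relative comparison still singles out one topic.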
