Replies: 1 comment
-
What you can do is calculate the document embeddings and compare them, through cosine similarity, with the topic embeddings. Then, for each document in the unwanted topic, you simply pick the topic with the second-highest similarity and re-assign the document to it. Lastly, you could use manual topic modeling to create a new model if you want to manually assign topics to documents yourself.
I believe you would have to use the similarity matrix as probabilities after updating the topics. |
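The reassignment step described above can be sketched in plain NumPy. This is an illustrative sketch, not BERTopic's own API: the array names, the shape conventions, and the `unwanted_topic` id are all assumptions, and in practice the embeddings would come from the fitted model (e.g. its document and topic embedding arrays).

```python
import numpy as np

def reassign_to_runner_up(doc_embeddings, topic_embeddings, topics, unwanted_topic):
    """For documents currently in `unwanted_topic`, re-assign each one to the
    topic whose embedding has the second-highest cosine similarity to the
    document embedding (i.e. the best topic other than the unwanted one)."""
    # Normalize rows so a plain dot product equals cosine similarity.
    docs_norm = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    tops_norm = topic_embeddings / np.linalg.norm(topic_embeddings, axis=1, keepdims=True)
    sims = docs_norm @ tops_norm.T            # shape: (n_docs, n_topics)

    # Topic ids ordered by similarity, best first, for each document.
    ranked = np.argsort(-sims, axis=1)

    new_topics = np.asarray(topics).copy()
    for i, topic in enumerate(new_topics):
        if topic == unwanted_topic:
            # Take the best-ranked topic that is not the unwanted one.
            new_topics[i] = next(t for t in ranked[i] if t != unwanted_topic)
    return new_topics
```

With the new assignments in hand, the model's topic representations would still need to be refreshed (BERTopic's `update_topics` is the natural place for that) so the topic words reflect the moved documents.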
-
Hi all,
While BERTopic does a fantastic job finding meaningful topics in my data, some of the topics it identifies are substantially uninteresting given my particular use case. I am therefore trying to find a way to reassign documents from these topics to their second highest probability topics, effectively performing outlier reduction on them.
To give some more context, I am analyzing open-ended survey responses containing respondents' arguments for their position on immigration. Among the topics generated is a fairly large one relating to immigration itself, containing arguments like "immigration boosts the economy" and "immigrants enrich our culture". Since the general topic of immigration is given by the context of the survey question, these responses would be better placed in topics related to the economy and culture respectively (which BERTopic does an excellent job identifying). Is there a straightforward way of reassigning the documents placed in the "immigration" topic to their second most likely topic?
A potentially complicating factor is that I'd eventually like to run covariate analyses on the topics as discussed here. It would therefore be ideal if I was able to estimate the probabilities of each document belonging to each of the new, updated topics. Any advice on how this could be incorporated into a solution to the above would be very much appreciated!
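Following the reply's suggestion to use the similarity matrix as probabilities, one simple convention is to turn the document-to-topic cosine similarities into a row-stochastic matrix. This is a hedged sketch under that assumption, not BERTopic's own probability calculation; other normalizations (e.g. a softmax) would also work.

```python
import numpy as np

def similarity_to_probabilities(doc_embeddings, topic_embeddings):
    """Convert cosine similarities between documents and topics into a
    row-stochastic matrix usable as soft topic probabilities."""
    docs_norm = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    tops_norm = topic_embeddings / np.linalg.norm(topic_embeddings, axis=1, keepdims=True)
    sims = docs_norm @ tops_norm.T            # cosine similarities in [-1, 1]

    # Shift into [0, 2] so negative similarities cannot produce negative mass,
    # then normalize each row to sum to 1.
    shifted = sims + 1.0
    return shifted / shifted.sum(axis=1, keepdims=True)
```

The resulting matrix has one row per document and one column per topic, which is the shape the covariate analysis would need.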