About pyLDAvis Visualization in BERTopic #196
Thank you for your kind words! I believe it should be possible to visualize BERTopic using pyLDAvis, although I have not done so myself. The main issue with doing so is that the topic-term distributions will not be entirely accurate, which mostly has to do with how BERTopic creates those representations. There are two steps involved in creating the topic representations. First, we apply c-TF-IDF to the clusters of documents to generate candidate words for each topic; these would be the topic-term distributions that you could use for pyLDAvis. Second, the final topic representations are fine-tuned by selecting from those candidate words. In other words, the topic-term distributions generated in the first step do not perfectly match the topic representations generated in the second. The reason for explaining this is that the visualization you will get in pyLDAvis is an un-optimized view of BERTopic. By no means is it a poor view, just not the entire picture.
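As a rough illustration of that first step, the class-based TF-IDF weighting can be sketched with toy counts. This is a simplification for intuition only, not BERTopic's exact implementation; the matrix and numbers are made up:

```python
import numpy as np

# Toy term counts per topic cluster (rows: 3 topics, cols: 5 terms).
tf = np.array([[5, 0, 1, 0, 2],
               [0, 4, 0, 1, 0],
               [1, 1, 3, 2, 0]], dtype=float)

A = tf.sum() / tf.shape[0]       # average number of words per class
f = tf.sum(axis=0)               # frequency of each term across all classes

# Class-based TF-IDF: frequent within a cluster, rare across clusters.
ctfidf = tf * np.log(1 + A / f)
```

Each row of `ctfidf` then serves as an (unnormalized) topic-term weighting; the top-weighted columns per row are the candidate words for that topic.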
Technically, this is how you would approach using BERTopic with pyLDAvis. However, it does not seem to work as of right now due to a nasty bug:

```python
import pyLDAvis
import numpy as np
from bertopic import BERTopic

# Train model
topic_model = BERTopic(verbose=True, calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)

# Prepare data for pyLDAvis
top_n = 5

# Row 0 of the c-TF-IDF matrix is the outlier topic (-1),
# followed by topics 0 .. top_n - 1
topic_term_dists = topic_model.c_tf_idf.toarray()[:top_n + 1, ]

# probs excludes the outlier topic, so add it back as the first
# column to match the row order of topic_term_dists
new_probs = probs[:, :top_n]
outlier = np.array(1 - new_probs.sum(axis=1)).reshape(-1, 1)
doc_topic_dists = np.hstack((outlier, new_probs))

doc_lengths = [len(doc) for doc in docs]

# vocab must follow the column order of the c-TF-IDF matrix
# (get_feature_names_out in newer scikit-learn versions)
vocab = topic_model.vectorizer_model.get_feature_names()
word_counts = topic_model.vectorizer_model.transform(docs).sum(axis=0)
term_frequency = [word_counts[0, i] for i in range(len(vocab))]

data = {'topic_term_dists': topic_term_dists,
        'doc_topic_dists': doc_topic_dists,
        'doc_lengths': doc_lengths,
        'vocab': vocab,
        'term_frequency': term_frequency}

# Visualize using pyLDAvis
vis_data = pyLDAvis.prepare(**data, mds='mmds')
pyLDAvis.display(vis_data)
```

Having said that, it might work on your dataset.
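As a side note, pyLDAvis.prepare validates its inputs before rendering, so it can help to sanity-check the shapes first. A minimal sketch with dummy stand-in arrays (the sizes here are made up, not from a trained model):

```python
import numpy as np

n_topics, n_terms, n_docs = 6, 100, 50

# Dummy stand-ins for the matrices built above.
rng = np.random.default_rng(0)
topic_term_dists = rng.random((n_topics, n_terms))
doc_topic_dists = rng.random((n_docs, n_topics))

# pyLDAvis requires that the number of rows of topic_term_dists
# equals the number of columns of doc_topic_dists ...
assert topic_term_dists.shape[0] == doc_topic_dists.shape[1]

# ... and that each document-topic row sums to 1.
doc_topic_dists /= doc_topic_dists.sum(axis=1, keepdims=True)
```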
Thanks for the prompt response. I tried out the code on a small sample of data (just to see how it would work out). Besides some deprecation warnings, I also got an error message related to the shapes:

Number of rows of topic_term_dists does not match number of columns of doc_topic_dists
Could you share the entire snippet of code that you tried? It could also be that you simply need more data for this to work.
Below is the code I used, along with the text data. I have attached the snippet of text used.
Sorry it took a while to figure this out, but it seems that the text you used is simply too small to be usable in this case. Since it only generates 2 topics, one of which is the outlier topic, there are issues with slicing the data. If you use a larger dataset that generates multiple topics, such as 20 Newsgroups, it should not give that ValidationError.
Hi Maarten, thanks for your assistance. I ran the model on a larger dataset, but now it appears that some of the topic_term_dists have very small probabilities, so their sum does not equal 1. Below is the message I get. Any suggestion(s) on how to resolve this issue?
That is strange, since the topic_term_dists were never summing to 1; those values do not represent probabilities at all. Have you changed any of the code? You could try normalizing the c-TF-IDF matrix so that each row sums to 1.
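A minimal sketch of that normalization, using scikit-learn's normalize on a dummy sparse matrix standing in for topic_model.c_tf_idf (the shape here is made up):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

# Dummy sparse matrix standing in for topic_model.c_tf_idf.
rng = np.random.default_rng(42)
c_tf_idf = csr_matrix(rng.random((5, 20)))

# L1-normalize each row so the topic-term weights sum to 1,
# turning raw c-TF-IDF weights into distribution-like rows.
topic_term_dists = normalize(c_tf_idf, norm='l1', axis=1).toarray()
```

The resulting rows can then be passed to pyLDAvis as topic-term distributions.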
Due to inactivity, I will be closing this for now. However, if you run into issues please let me know and I'll re-open the issue!
Thanks for the comments, I used the above to visualise the topics and it looks fine.
Hi, I think there has been a version change and probs is no longer a 2d array? I am unable to use the same code by rafaelvalero on a model trained using 0.14.0.
@allanckw The probabilities are either 1d or 2d depending on whether you have set calculate_probabilities=True.
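To illustrate the difference with dummy values (the shapes below are made up; only the 1d-vs-2d distinction matters):

```python
import numpy as np

n_docs, n_topics = 4, 3
rng = np.random.default_rng(7)

# calculate_probabilities=False: one probability per document (1d),
# the probability of its single assigned topic.
probs_1d = rng.random(n_docs)

# calculate_probabilities=True: a full document-topic
# distribution per document (2d), rows summing to 1.
probs_2d = rng.random((n_docs, n_topics))
probs_2d /= probs_2d.sum(axis=1, keepdims=True)
```

The pyLDAvis recipe above needs the 2d form, since it slices probs by topic columns.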
Hi @MaartenGr Thanks for the tip! 👍 Off to retrain my models |
@MaartenGr
I am an R enthusiast who is new to Python. I have read your posts on "Interactive Topic Modeling with BERTopic" and its predecessor, "Topic Modeling with BERT". Thanks for an awesome package. I know BERTopic has a visualization similar to pyLDAvis, but I was wondering if it's possible to extract information from BERTopic that can be used in pyLDAvis.
To visualize BERTopic using pyLDAvis, I would need the topic-term distributions, document-topic distributions, and information about the corpus on which the model was trained.