. . . About pyLDAvis Visualization in BERTopic #196

Closed
gsalfourn opened this issue Aug 8, 2021 · 15 comments

@gsalfourn

@MaartenGr

I am an R enthusiast who is new to Python. I have read your posts on "Interactive Topic Modeling with BERTopic" and its predecessor, "Topic Modeling with BERT". Thanks for an awesome package. I know BERTopic has a visualization similar to pyLDAvis; I was wondering if it's possible to extract information from BERTopic that can be used in pyLDAvis.

To visualize BERTopic using pyLDAvis, I would need the topic-term distributions, the document-topic distributions, and information about the corpus on which the model was trained.

@MaartenGr
Owner

Thank you for your kind words!

I believe it should be possible to visualize BERTopic using pyLDAvis, although I have not done so myself. The main issue with doing so is that the topic-term distributions will not be entirely accurate. This mostly has to do with how BERTopic creates those representations.

There are two steps involved in creating the topic representations. First, we apply c-TF-IDF to the clusters of documents to generate candidate words for each topic. This would be your topic-term distributions that you could use for pyLDAvis.
The second step leverages MMR to make sure that the topic representations are a bit more coherent and stable. However, this does not generate a topic-term distribution but is merely a selection of terms.

In other words, the topic-term distributions generated in the first step do not perfectly match the topic representations generated in the second. I explain this because the visualization you will get in pyLDAvis is an un-optimized view of BERTopic. By no means is it a poor view, but it is not the entire picture.
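To make the first step concrete, here is a minimal sketch of a class-based TF-IDF weighting on a toy count matrix. It assumes the commonly documented form (term frequency within a class, multiplied by log(1 + A / f_t), where A is the average number of words per class and f_t is the frequency of term t across all classes). The function name `class_tf_idf` and the toy data are hypothetical, and this is not BERTopic's actual implementation.

```python
import numpy as np

def class_tf_idf(class_term_counts):
    """Sketch of class-based TF-IDF on a (n_classes, n_terms) count matrix."""
    counts = np.asarray(class_term_counts, dtype=float)

    # Term frequency within each class, L1-normalised per row.
    row_totals = counts.sum(axis=1, keepdims=True)
    tf = counts / np.where(row_totals == 0, 1.0, row_totals)

    # Frequency of each term across all classes.
    term_totals = counts.sum(axis=0)

    # Average number of words per class (assumed form of the "A" term).
    avg_words = counts.sum() / counts.shape[0]
    idf = np.log(1.0 + avg_words / np.where(term_totals == 0, 1.0, term_totals))

    return tf * idf

# Toy example: 2 classes (topics), 3 terms.
toy_counts = [[4, 0, 1],
              [0, 3, 1]]
weights = class_tf_idf(toy_counts)
row_sums = weights.sum(axis=1)
```

In this toy example the rows of `weights` do not sum to 1, which illustrates why such weights are not probability distributions without further normalization.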

@MaartenGr
Owner

Technically, this is how you would approach using BERTopic with pyLDAvis. However, it does not seem to work as of right now due to a nasty Int64Index error which I cannot figure out:

import pyLDAvis
import numpy as np
from bertopic import BERTopic

# Train Model
topic_model = BERTopic(verbose=True, calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)

# Prepare data for PyLDAVis
top_n = 5

topic_term_dists = topic_model.c_tf_idf.toarray()[:top_n+1, ]
new_probs = probs[:, :top_n]
outlier = np.array(1 - new_probs.sum(axis=1)).reshape(-1, 1)
doc_topic_dists = np.hstack((new_probs, outlier))
doc_lengths = [len(doc) for doc in docs]
vocab = [word for word in topic_model.vectorizer_model.vocabulary_.keys()]
term_frequency = [topic_model.vectorizer_model.vocabulary_[word] for word in vocab]

data = {'topic_term_dists': topic_term_dists,
        'doc_topic_dists': doc_topic_dists,
        'doc_lengths': doc_lengths,
        'vocab': vocab,
        'term_frequency': term_frequency}

# Visualize using pyLDAvis
vis_data= pyLDAvis.prepare(**data, mds='mmds')
pyLDAvis.display(vis_data)

Having said that, it might work on your dataset.
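One thing to watch in the snippet above: assuming `vectorizer_model` is a scikit-learn `CountVectorizer`, `vocabulary_` maps each term to its column index rather than to a count, so `term_frequency` as built there would hold indices, and the order of `vocab` may not match the matrix columns. A minimal sketch of one way to derive both, using the hypothetical helpers `ordered_vocab` and `term_frequencies` and toy data in place of a real model:

```python
import numpy as np

def ordered_vocab(vocabulary):
    """Return terms sorted by their column index (hypothetical helper)."""
    return [term for term, _ in sorted(vocabulary.items(), key=lambda item: item[1])]

def term_frequencies(doc_term_counts):
    """Sum a dense (n_docs, n_terms) count matrix over documents (hypothetical helper)."""
    return np.asarray(doc_term_counts).sum(axis=0)

# Toy stand-ins for vectorizer_model.vocabulary_ and the document-term counts.
toy_vocabulary = {"topic": 2, "bert": 0, "model": 1}
toy_counts = [[1, 0, 2],
              [3, 1, 0]]

vocab = ordered_vocab(toy_vocabulary)
term_frequency = term_frequencies(toy_counts)
```

With a fitted model, the counts would presumably come from `vectorizer_model.transform(docs)`; that result is sparse and would need converting (for example with `.toarray()`) before being passed to this helper.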

@gsalfourn
Author

Thanks for the prompt response. I tried out the code on a small sample of data (just to see how it will work out). Besides some deprecation warnings, I also got an error message.

2021-08-09 22:31:22,038 - BERTopic - Transformed documents to Embeddings
2021-08-09 22:31:27,828 - BERTopic - Reduced dimensionality with UMAP
c:\python\python39\lib\site-packages\hdbscan\hdbscan_.py:275: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  alg = KDTreeBoruvkaAlgorithm(tree, min_samples, metric=metric,
c:\python\python39\lib\site-packages\hdbscan\hdbscan_.py:56: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  condensed_tree = condense_tree(single_linkage_tree,
c:\python\python39\lib\site-packages\hdbscan\hdbscan_.py:59: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  labels, probabilities, stabilities = get_clusters(condensed_tree,
2021-08-09 22:31:27,852 - BERTopic - Clustered UMAP embeddings with HDBSCAN
---------------------------------------------------------------------------

The error message says that the number of rows of topic_term_dists does not match the number of columns of doc_topic_dists:

ValidationError                           Traceback (most recent call last)
<ipython-input-2-775585b401be> in <module>
     27 
     28 # Visualize using pyLDAvis
---> 29 vis_data= pyLDAvis.prepare(**data, mds='mmds')
     30 pyLDAvis.display(vis_data)

c:\python\python39\lib\site-packages\pyLDAvis\_prepare.py in prepare(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency, R, lambda_step, mds, n_jobs, plot_opts, sort_topics, start_index)
    413     doc_lengths = _series_with_name(doc_lengths, 'doc_length')
    414     vocab = _series_with_name(vocab, 'vocab')
--> 415     _input_validate(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency)
    416     R = min(R, len(vocab))
    417 

c:\python\python39\lib\site-packages\pyLDAvis\_prepare.py in _input_validate(*args)
     72     res = _input_check(*args)
     73     if res:
---> 74         raise ValidationError('\n' + '\n'.join([' * ' + s for s in res]))
     75 
     76 

ValidationError: 
 * Number of rows of topic_term_dists does not match number of columns of doc_topic_dists; both should be equal to the number of topics in the model.
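The condition in this error can be checked before calling `pyLDAvis.prepare`. The sketch below uses a hypothetical helper, `check_topic_shapes`, that mirrors the rule stated in the message; the additional vocabulary-length check is an assumption about what a consistent input looks like, not a description of pyLDAvis internals.

```python
import numpy as np

def check_topic_shapes(topic_term_dists, doc_topic_dists, vocab):
    """Return a list of shape problems; an empty list means the shapes agree."""
    topic_term = np.asarray(topic_term_dists)
    doc_topic = np.asarray(doc_topic_dists)
    problems = []

    # Rows of topic_term_dists and columns of doc_topic_dists both index topics.
    if topic_term.shape[0] != doc_topic.shape[1]:
        problems.append(
            f"topic_term_dists has {topic_term.shape[0]} rows but "
            f"doc_topic_dists has {doc_topic.shape[1]} columns"
        )

    # Assumed extra check: one column of topic_term_dists per vocabulary term.
    if topic_term.shape[1] != len(vocab):
        problems.append(
            f"topic_term_dists has {topic_term.shape[1]} columns but "
            f"vocab has {len(vocab)} terms"
        )
    return problems

# Toy example: 2 topic rows against 6 document-topic columns.
mismatch = check_topic_shapes(np.ones((2, 3)), np.ones((4, 6)), ["a", "b", "c"])
matching = check_topic_shapes(np.ones((6, 3)), np.ones((4, 6)), ["a", "b", "c"])
```

Printing the returned list before calling `prepare` shows which of the two dimensions is off.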

@MaartenGr
Owner

Could you share the entire snippet of code that you tried it out on? It could also be that you simply need more data for this to work.

@gsalfourn
Author

@MaartenGr

Below is the code I used along with the text data

## import regex module
import re

## path to the data file
path = 'D:/Python/bertopic/fl_data/fl_data_excerpts.txt'

## reading the data
with open(path, 'r', encoding='utf8') as f:
    contents = f.read()
    line_tabs = re.sub('\t', ' ', contents)
    line_spaces = re.sub(' +', ' ', line_tabs)
    text_data = re.split(r"\.|\?|\!", line_spaces)
    
print(text_data[:5])
print(type(text_data))

## pyLDAvis implementation
import pyLDAvis
import numpy as np
from bertopic import BERTopic

# Train Model
topic_model = BERTopic(verbose=True, calculate_probabilities=True)
topics, probs = topic_model.fit_transform(text_data)

# Prepare data for PyLDAVis
top_n = 5

topic_term_dists = topic_model.c_tf_idf.toarray()[:top_n+1, ]
new_probs = probs[:, :top_n]
outlier = np.array(1 - new_probs.sum(axis=1)).reshape(-1, 1)
doc_topic_dists = np.hstack((new_probs, outlier))
doc_lengths = [len(doc) for doc in text_data]
vocab = [word for word in topic_model.vectorizer_model.vocabulary_.keys()]
term_frequency = [topic_model.vectorizer_model.vocabulary_[word] for word in vocab]

data = {'topic_term_dists': topic_term_dists,
        'doc_topic_dists': doc_topic_dists,
        'doc_lengths': doc_lengths,
        'vocab': vocab,
        'term_frequency': term_frequency}

# Visualize using pyLDAvis
vis_data= pyLDAvis.prepare(**data, mds='mmds')
pyLDAvis.display(vis_data)

I have attached the snippet of text used
fl_data_excerpts.txt

@MaartenGr
Owner

Sorry it took a while to figure this out, but it seems that the text you used is simply too small to be usable in this case. Since it generates only 2 topics, one of which is the outlier topic, there are issues with slicing the data.

If you use a larger dataset that generates multiple topics, like 20 Newsgroups, it should not give that ValidationError.

@gsalfourn
Author

gsalfourn commented Aug 15, 2021

Hi Maarten,

Thanks for your assistance. I ran the model on a larger dataset, but now it appears that some of the topic_term_distributions have very small probabilities, so their sum does not equal 1. Below is the message I get. Any suggestion(s) on how to resolve this issue?

---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
<ipython-input-12-f75802444872> in <module>
     29 
     30 # Visualize using pyLDAvis
---> 31 vis_data= pyLDAvis.prepare(**data, mds='mmds')
     32 pyLDAvis.display(vis_data)

c:\python\python39\lib\site-packages\pyLDAvis\_prepare.py in prepare(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency, R, lambda_step, mds, n_jobs, plot_opts, sort_topics, start_index)
    413     doc_lengths = _series_with_name(doc_lengths, 'doc_length')
    414     vocab = _series_with_name(vocab, 'vocab')
--> 415     _input_validate(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency)
    416     R = min(R, len(vocab))
    417 

c:\python\python39\lib\site-packages\pyLDAvis\_prepare.py in _input_validate(*args)
     72     res = _input_check(*args)
     73     if res:
---> 74         raise ValidationError('\n' + '\n'.join([' * ' + s for s in res]))
     75 
     76 

ValidationError: 
 * Not all rows (distributions) in topic_term_dists sum to 1.

@MaartenGr
Owner

That is strange, since the rows of topic_term_dists never summed to 1: those values do not represent probabilities at all. Have you changed any of the code? You could try to normalize the c-TF-IDF matrix so that each row sums to 1.
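A minimal sketch of that normalization with NumPy, assuming a dense, non-negative matrix such as the output of `.toarray()`. The helper name `normalize_rows` is hypothetical, and the result is only a rescaling of the weights, not a true probability estimate.

```python
import numpy as np

def normalize_rows(matrix):
    """Scale each row of a non-negative matrix so that it sums to 1."""
    values = np.asarray(matrix, dtype=float)
    row_sums = values.sum(axis=1, keepdims=True)
    # Leave all-zero rows untouched instead of dividing by zero.
    safe_sums = np.where(row_sums == 0, 1.0, row_sums)
    return values / safe_sums

# Toy weights standing in for a slice of the c-TF-IDF matrix.
toy_weights = [[0.6, 0.0, 0.2],
               [0.0, 0.5, 0.3],
               [0.0, 0.0, 0.0]]
topic_term_dists = normalize_rows(toy_weights)
```

An all-zero row stays all zero here and would still fail a sum-to-1 check, so such rows would need to be dropped or handled separately.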

@MaartenGr
Owner

Due to inactivity, I will be closing this for now. However, if you run into issues please let me know and I'll re-open the issue!

@rafaelvalero
Contributor

Thanks for the comments. I used the code above to visualise the topics and it looks fine. In case it is useful:
https://github.com/rafaelvalero/different_notebooks/blob/master/bertopics_pyldavis.ipynb

@bala1802

bala1802 commented Aug 8, 2022

Hi Maarten,

outlier = np.array(1 - new_probs.sum(axis=1)).reshape(-1, 1)

This code snippet works only when the outlier topic is at the first index. Sometimes BERTopic places the outlier topic -1 at a different position.

image

As you can see above, topic -1 is at index 4.
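One way to avoid depending on the outlier's position is to reorder the topic rows explicitly so that the outlier row lines up with the outlier column, which the original snippet appends last in `doc_topic_dists`. The sketch below assumes a list of topic ids aligned with the matrix rows is available; how to obtain that list from a fitted model is left open here, and `move_outlier_last` is a hypothetical helper.

```python
import numpy as np

def move_outlier_last(topic_rows, topic_ids, outlier_id=-1):
    """Reorder rows so the outlier topic, if present, comes last."""
    rows = np.asarray(topic_rows)
    ids = list(topic_ids)
    if outlier_id not in ids:
        return rows, ids

    # Keep the regular topics in their current order, then append the outlier.
    order = [i for i, topic in enumerate(ids) if topic != outlier_id]
    order.append(ids.index(outlier_id))
    return rows[order], [ids[i] for i in order]

# Toy example: the outlier topic sits at index 4, as in the report above.
toy_rows = np.arange(12).reshape(6, 2)
toy_ids = [0, 1, 2, 3, -1, 4]
reordered_rows, reordered_ids = move_outlier_last(toy_rows, toy_ids)
```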

@spookyuser

Just FYI, it would be c_tf_idf_ now :)
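Code that has to run against both older and newer versions could look the attribute up by name. This is only a sketch: `get_c_tf_idf` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for fitted models, not real BERTopic instances.

```python
from types import SimpleNamespace

def get_c_tf_idf(model):
    """Return the c-TF-IDF matrix under whichever attribute name exists."""
    for name in ("c_tf_idf_", "c_tf_idf"):
        matrix = getattr(model, name, None)
        if matrix is not None:
            return matrix
    raise AttributeError("model has neither 'c_tf_idf_' nor 'c_tf_idf'")

# Stand-in objects mimicking the newer and older attribute names.
newer_model = SimpleNamespace(c_tf_idf_="new matrix")
older_model = SimpleNamespace(c_tf_idf="old matrix")

from_newer = get_c_tf_idf(newer_model)
from_older = get_c_tf_idf(older_model)
```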

@allanckw

Hi, I think there was a version change and probs is no longer a 2D array? I am unable to use rafaelvalero's code on a model trained with 0.14.0.

@MaartenGr
Owner

@allanckw The probabilities are either 1D or 2D, depending on whether you have set calculate_probabilities=True.
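A small guard can make this failure explicit before the slicing code runs. The sketch below uses a hypothetical helper, `require_2d_probs`, and plain lists in place of real model output.

```python
import numpy as np

def require_2d_probs(probs):
    """Return probs as a 2D array, or raise if it is not document-by-topic."""
    arr = np.asarray(probs)
    if arr.ndim != 2:
        raise ValueError(
            f"expected a 2D (n_docs, n_topics) array, got {arr.ndim} dimension(s); "
            "the model may need calculate_probabilities=True"
        )
    return arr

# A 2D array passes through unchanged.
full_probs = require_2d_probs([[0.2, 0.8], [0.6, 0.4]])

# A 1D array (one value per document) is rejected.
try:
    require_2d_probs([0.9, 0.4])
    rejected_1d = False
except ValueError:
    rejected_1d = True
```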

@allanckw

Hi @MaartenGr

Thanks for the tip! 👍

Off to retrain my models
