About pyLDAvis Visualization in BERTopic #196
Thank you for your kind words! I believe it should be possible to visualize BERTopic using pyLDAvis, although I have not done so myself. The main issue with doing so is that the topic-term distributions will not be entirely accurate, which mostly has to do with how BERTopic creates those representations. There are two steps involved in creating the topic representations. First, we apply c-TF-IDF to the clusters of documents to generate candidate words for each topic; these would be the topic-term distributions that you could use for pyLDAvis. Second, the final topic representations are fine-tuned by selecting from those candidate words. In other words, the topic-term distributions generated in the first step do not perfectly match the topic representations generated in the second. The reason for explaining this is that the visualization you will get in pyLDAvis is an un-optimized view of BERTopic. By no means is it a poor view, just not the entire picture.
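As a rough illustration of that first step, the class-based TF-IDF weighting can be sketched with toy counts. This is a simplification for intuition only, not BERTopic's exact implementation; the matrix and numbers are made up:

```python
import numpy as np

# Toy term counts per topic cluster (rows: 3 topics, cols: 5 terms).
tf = np.array([[5, 0, 1, 0, 2],
               [0, 4, 0, 1, 0],
               [1, 1, 3, 2, 0]], dtype=float)

A = tf.sum() / tf.shape[0]       # average number of words per class
f = tf.sum(axis=0)               # frequency of each term across all classes

# Class-based TF-IDF: frequent within a cluster, rare across clusters.
ctfidf = tf * np.log(1 + A / f)
```

Each row of `ctfidf` then serves as an (unnormalized) topic-term weighting; the top-weighted columns per row are the candidate words for that topic.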
Technically, this is how you would approach using BERTopic with pyLDAvis. However, it does not seem to work as of right now due to a nasty bug:

```python
import pyLDAvis
import numpy as np
from bertopic import BERTopic

# Train model
topic_model = BERTopic(verbose=True, calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)

# Prepare data for pyLDAvis
top_n = 5

# Row 0 of the c-TF-IDF matrix is the outlier topic (-1),
# followed by topics 0 .. top_n - 1
topic_term_dists = topic_model.c_tf_idf.toarray()[:top_n + 1, ]

# probs excludes the outlier topic, so add it back as the first
# column to match the row order of topic_term_dists
new_probs = probs[:, :top_n]
outlier = np.array(1 - new_probs.sum(axis=1)).reshape(-1, 1)
doc_topic_dists = np.hstack((outlier, new_probs))

doc_lengths = [len(doc) for doc in docs]

# vocab must follow the column order of the c-TF-IDF matrix
# (get_feature_names_out in newer scikit-learn versions)
vocab = topic_model.vectorizer_model.get_feature_names()
word_counts = topic_model.vectorizer_model.transform(docs).sum(axis=0)
term_frequency = [word_counts[0, i] for i in range(len(vocab))]

data = {'topic_term_dists': topic_term_dists,
        'doc_topic_dists': doc_topic_dists,
        'doc_lengths': doc_lengths,
        'vocab': vocab,
        'term_frequency': term_frequency}

# Visualize using pyLDAvis
vis_data = pyLDAvis.prepare(**data, mds='mmds')
pyLDAvis.display(vis_data)
```

Having said that, it might work on your dataset.
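As a side note, pyLDAvis.prepare validates its inputs before rendering, so it can help to sanity-check the shapes first. A minimal sketch with dummy stand-in arrays (the sizes here are made up, not from a trained model):

```python
import numpy as np

n_topics, n_terms, n_docs = 6, 100, 50

# Dummy stand-ins for the matrices built above.
rng = np.random.default_rng(0)
topic_term_dists = rng.random((n_topics, n_terms))
doc_topic_dists = rng.random((n_docs, n_topics))

# pyLDAvis requires that the number of rows of topic_term_dists
# equals the number of columns of doc_topic_dists ...
assert topic_term_dists.shape[0] == doc_topic_dists.shape[1]

# ... and that each document-topic row sums to 1.
doc_topic_dists /= doc_topic_dists.sum(axis=1, keepdims=True)
```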
Thanks for the prompt response. I tried out the code on a small sample of data (just to see how it would work out). Besides some deprecation warnings, I also got an error message related to the shapes:

Number of rows of topic_term_dists does not match number of columns of doc_topic_dists
Could you share the entire snippet of code that you tried? It could also be that you simply need more data for this to work.
Below is the code I used, along with the text data. I have attached the snippet of text used.
Sorry it took a while to figure this out, but it seems that the text you used is simply too small to be usable in this case. Since it only generates 2 topics, one of which is the outlier topic, there are issues with slicing the data. If you use a larger dataset that generates multiple topics, such as 20 Newsgroups, it should not give that ValidationError.
Hi Maarten, thanks for your assistance. I ran the model on a larger dataset, but now it appears that some of the topic_term_dists have very small probabilities, so their sum does not equal 1. Below is the message I get. Any suggestion(s) on how to resolve this issue?
That is strange, since the topic_term_dists were never summing to 1; those values do not represent probabilities at all. Have you changed any of the code? You could try normalizing the c-TF-IDF matrix so that each row sums to 1.
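A minimal sketch of that normalization, using scikit-learn's normalize on a dummy sparse matrix standing in for topic_model.c_tf_idf (the shape here is made up):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

# Dummy sparse matrix standing in for topic_model.c_tf_idf.
rng = np.random.default_rng(42)
c_tf_idf = csr_matrix(rng.random((5, 20)))

# L1-normalize each row so the topic-term weights sum to 1,
# turning raw c-TF-IDF weights into distribution-like rows.
topic_term_dists = normalize(c_tf_idf, norm='l1', axis=1).toarray()
```

The resulting rows can then be passed to pyLDAvis as topic-term distributions.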
Due to inactivity, I will be closing this for now. However, if you run into issues please let me know and I'll re-open the issue!
Thanks for the comments, I used the above to visualise the topics and it looks fine.
Hi, I think there has been a version change and probs is no longer a 2d array? I am unable to use the same code by rafaelvalero on a model trained using 0.14.0.
@allanckw The probabilities are either 1d or 2d depending on whether you have set calculate_probabilities=True.
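To illustrate the difference with dummy values (the shapes below are made up; only the 1d-vs-2d distinction matters):

```python
import numpy as np

n_docs, n_topics = 4, 3
rng = np.random.default_rng(7)

# calculate_probabilities=False: one probability per document (1d),
# the probability of its single assigned topic.
probs_1d = rng.random(n_docs)

# calculate_probabilities=True: a full document-topic
# distribution per document (2d), rows summing to 1.
probs_2d = rng.random((n_docs, n_topics))
probs_2d /= probs_2d.sum(axis=1, keepdims=True)
```

The pyLDAvis recipe above needs the 2d form, since it slices probs by topic columns.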
Hi @MaartenGr Thanks for the tip! 👍 Off to retrain my models |
@MaartenGr
I am an R enthusiast who is new to Python. I have read your posts on "Interactive Topic Modeling with BERTopic" and its predecessor, "Topic Modeling with BERT". Thanks for an awesome package. I know BERTopic has a visualization similar to pyLDAvis, but I was wondering if it's possible to extract information from BERTopic that can be used in pyLDAvis.
To visualize BERTopic using pyLDAvis, I would need the topic-term distributions, document-topic distributions, and information about the corpus on which the model was trained.