v0.15
Highlights:
- Multimodal Topic Modeling
- Train your topic model on text, images, or images and text!
- Use the bertopic.backend.MultiModalBackend to embed images, text, both or even caption images!
- Multi-Aspect Topic Modeling
- Create multiple topic representations simultaneously
- Improved Serialization options
- Push your model to the HuggingFace Hub with .push_to_hf_hub
- Safer, smaller and more flexible serialization options with safetensors
- Thanks to a great collaboration with HuggingFace and the authors of BERTransfer!
- Added new embedding models
- OpenAI: bertopic.backend.OpenAIBackend
- Cohere: bertopic.backend.CohereBackend
- Added example of summarizing topics with OpenAI's GPT-models
- Added nr_docs and diversity parameters to OpenAI and Cohere representation models
- Use custom_labels="Aspect1" to use the aspect labels for visualizations instead
- Added cuML support for probability calculation in .transform
- Updated topic embeddings
- Centroids by default and c-TF-IDF weighted embeddings for partial_fit and .update_topics
- Added exponential_backoff parameter to OpenAI model
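To illustrate the two kinds of topic embeddings mentioned in the list above, here is a minimal sketch in plain Python. It is not BERTopic's implementation: the helper names centroid and weighted_average and all toy vectors and scores are assumptions made for illustration. A centroid averages the embeddings of the documents in a topic, whereas a c-TF-IDF weighted embedding is assumed here to average word embeddings using each word's c-TF-IDF score as its weight.

```python
def centroid(vectors):
    """Unweighted mean of equally sized vectors."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]


def weighted_average(vectors, weights):
    """Mean of vectors where each vector counts in proportion to its weight."""
    total = sum(weights)
    return [
        sum(w * x for w, x in zip(weights, dim)) / total
        for dim in zip(*vectors)
    ]


# Toy 2-d embeddings of the documents assigned to one topic (made-up values)
doc_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
topic_centroid = centroid(doc_embeddings)

# Toy embeddings of the topic's top words with made-up c-TF-IDF scores
word_embeddings = [[1.0, 0.0], [0.0, 1.0]]
ctfidf_scores = [3.0, 1.0]
topic_weighted = weighted_average(word_embeddings, ctfidf_scores)
```

In this toy setup the weighted embedding is pulled towards the first word because it carries three times the weight of the second.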
Fixes:
- Fixed custom prompt not working in TextGeneration
- Fixed #1142
- Add additional logic to handle cupy arrays by @metasyn in #1179
- Fix hierarchy viz and handle any form of distance matrix by @elashrry in #1173
- Updated languages list by @sam9111 in #1099
- Added level_scale argument to visualize_hierarchical_documents by @zilch42 in #1106
- Fix inconsistent naming by @rolanderdei in #1073
Multimodal Topic Modeling
With v0.15, we can now perform multimodal topic modeling in BERTopic! The most basic example of multimodal topic modeling in BERTopic is when you have images that accompany your documents. This means that each document is expected to have an image and vice versa. Instagram pictures, for example, almost always come with a description.
In this example, we are going to use images from flickr that each have a caption associated with them:
# NOTE: This requires the `datasets` package which you can
# install with `pip install datasets`
from datasets import load_dataset
ds = load_dataset("maderix/flickr_bw_rgb")
images = ds["train"]["image"]
docs = ds["train"]["caption"]
The docs variable contains the captions for each image in images. We can now use these variables to run our multimodal example:
from bertopic import BERTopic
from bertopic.representation import VisualRepresentation
# Additional ways of representing a topic
visual_model = VisualRepresentation()
# Make sure to add the `visual_model` to a dictionary
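representation_model = {
    "Visual_Aspect": visual_model,
}
topic_model = BERTopic(representation_model=representation_model, verbose=True)
# Train on the captions together with their images
topics, probs = topic_model.fit_transform(documents=docs, images=images)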
We can now access our image representations for each topic with topic_model.topic_aspects_["Visual_Aspect"].
If you want an overview of the topic images together with their textual representations in Jupyter, you can run the following:
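import base64
from io import BytesIO
from IPython.display import HTML
from PIL import Image

def get_thumbnail(path, size=(150, 150)):
    # Helper not defined in the original snippet; assumed here to load an
    # image from a file path and shrink it to thumbnail size
    im = Image.open(path)
    im.thumbnail(size)
    return im

def image_base64(im):
    # Accept either a file path or an already loaded PIL image
    if isinstance(im, str):
        im = get_thumbnail(im)
    with BytesIO() as buffer:
        im.save(buffer, 'jpeg')
        return base64.b64encode(buffer.getvalue()).decode()

def image_formatter(im):
    # Embed the image directly in the HTML as a base64 data URI
    return f'<img src="data:image/jpeg;base64,{image_base64(im)}">'

# Extract dataframe
df = topic_model.get_topic_info().drop(columns=["Representative_Docs", "Name"])
# Visualize the images
HTML(df.to_html(formatters={'Visual_Aspect': image_formatter}, escape=False))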
Multi-aspect Topic Modeling
In this new release, we introduce multi-aspect topic modeling! During the .fit or .fit_transform stages, you can now get multiple representations of a single topic. In practice, it works by generating and storing all kinds of different topic representations (see image below).
The approach is rather straightforward. We might want to represent our topics using a PartOfSpeech representation model but we might also want to try out KeyBERTInspired and compare those representation models. We can do this as follows:
from bertopic.representation import KeyBERTInspired
from bertopic.representation import PartOfSpeech
from bertopic.representation import MaximalMarginalRelevance
from sklearn.datasets import fetch_20newsgroups
# Documents to train on
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
# The main representation of a topic
main_representation = KeyBERTInspired()
# Additional ways of representing a topic
aspect_model1 = PartOfSpeech("en_core_web_sm")
aspect_model2 = [KeyBERTInspired(top_n_words=30), MaximalMarginalRelevance(diversity=.5)]
# Add all models together to be run in a single `fit`
representation_model = {
    "Main": main_representation,
    "Aspect1": aspect_model1,
    "Aspect2": aspect_model2
}
topic_model = BERTopic(representation_model=representation_model).fit(docs)
As shown above, to perform multi-aspect topic modeling, we make sure that representation_model is a dictionary where each representation model pipeline is defined.
The main pipeline, which is used in most visualization options, is defined with the "Main" key. All other aspects can be named however you want. In the example above, the two additional aspects that we are interested in are defined as "Aspect1" and "Aspect2".
After we have fitted our model, we can access all representations with topic_model.get_topic_info():
As you can see, there are a number of different representations for our topics that we can inspect. All aspects are found in topic_model.topic_aspects_.
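The idea of a dictionary of representation pipelines can be sketched without BERTopic. The toy code below is an illustration only, not the library's internals: the functions top_words, drop_short_words, and run_aspects are hypothetical, the candidate words and scores are made up, and a list of models is assumed to be applied in sequence, mirroring how aspect_model2 chains two models in the example above.

```python
def top_words(candidates, n=3):
    """Keep the n highest-scoring (word, score) pairs."""
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:n]


def drop_short_words(candidates, min_length=4):
    """Remove words shorter than min_length characters."""
    return [(word, score) for word, score in candidates if len(word) >= min_length]


def run_aspects(representation_model, candidates):
    """Apply every aspect pipeline to the same candidate words."""
    aspects = {}
    for name, pipeline in representation_model.items():
        # A single model is treated as a pipeline of length one
        steps = pipeline if isinstance(pipeline, list) else [pipeline]
        words = candidates
        for step in steps:
            words = step(words)
        aspects[name] = [word for word, _ in words]
    return aspects


# Made-up candidate words with scores for one topic
candidates = [("space", 0.9), ("nasa", 0.8), ("the", 0.7), ("orbit", 0.6), ("of", 0.5)]

representation_model = {
    "Main": top_words,
    "Aspect1": [drop_short_words, top_words],
}
topic_aspects = run_aspects(representation_model, candidates)
```

Each key ends up with its own list of words for the same topic, which is the shape of result that the aspects give you: one topic, several views of it.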
Serialization
Saving, loading, and sharing a BERTopic model can be done in several ways. With this new release, it is now advised to go with .safetensors as that allows for a small, safe, and fast method for saving your BERTopic model. However, other formats, such as .pickle and pytorch .bin, are also possible.
The methods are used as follows:
topic_model = BERTopic().fit(my_docs)
# Method 1 - safetensors
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save("path/to/my/model_dir", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)
# Method 2 - pytorch
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save("path/to/my/model_dir", serialization="pytorch", save_ctfidf=True, save_embedding_model=embedding_model)
# Method 3 - pickle
topic_model.save("my_model", serialization="pickle")
Saving the topic model with .safetensors or pytorch has a number of advantages:
- .safetensors is a relatively safe format
- The resulting model can be very small (often < 20MB) since no sub-models need to be saved
- Although version control is important, there is a bit more flexibility with respect to specific versions of packages
- More easily used in production
- Share models with the HuggingFace Hub
The above image, based on a model trained on 100,000 documents, demonstrates the differences in size between safetensors, pytorch, and pickle. The difference in size can mostly be explained by the efficient saving procedure and by the fact that the clustering and dimensionality reduction models are not saved in safetensors/pytorch, since inference can be done based on the topic embeddings.
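To make the last point concrete, the sketch below shows how a topic could be assigned to a new document using only stored topic embeddings, with no clustering or dimensionality reduction model involved. It is a simplified illustration under the assumption that assignment picks the topic with the highest cosine similarity; the function names and toy vectors are made up, and this is not BERTopic's own code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_topic(doc_embedding, topic_embeddings):
    """Return the id of the topic whose embedding is most similar."""
    return max(
        topic_embeddings,
        key=lambda topic: cosine_similarity(doc_embedding, topic_embeddings[topic]),
    )


# Made-up 2-d topic embeddings, as they might be stored with a saved model
topic_embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0]}

# Made-up embedding of a new document
predicted_topic = assign_topic([0.2, 0.9], topic_embeddings)
```

Because only one small vector per topic is needed for this kind of lookup, the saved model can stay small.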
HuggingFace Hub
When you have created a BERTopic model, you can easily share it with others through the HuggingFace Hub. First, you need to log in to your HuggingFace account:
from huggingface_hub import login
login()
When you have logged in to your HuggingFace account, you can save and upload the model as follows:
from bertopic import BERTopic
# Train model
topic_model = BERTopic().fit(my_docs)
# Push to HuggingFace Hub
topic_model.push_to_hf_hub(
    repo_id="MaartenGr/BERTopic_ArXiv",
    save_ctfidf=True
)
# Load from HuggingFace
loaded_model = BERTopic.load("MaartenGr/BERTopic_ArXiv")