-
Heyo folx, I'm struggling to get BERTopic to work on a relatively large dataset (a few million Reddit posts/comments). I've tried a bunch of things to change how BERTopic works, but it needs to allocate 149 GB for the array, and I can't get that even with a large swap set up on my SSD. Following the "BERTopic with Big Data" notebook helped me produce embeddings for the data efficiently, but I think it's when I pass the embeddings and docs to BERTopic that I run into issues, because I get the same error. Can anyone help me sort this out? As I see it, my options are: get a second job so I can afford a higher-spec PC, spend time tinkering with virtual machines, take a subset of the data to train a model and then feed the majority of the data through the already trained model, or do a better job of breaking the dataset into chunks. Any suggestions welcome. Thanks!
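(For context, a minimal sketch of the embedding-precomputation pattern, assuming `docs` is the list of post/comment strings already in memory; the model name and batch size are placeholders, not values from the notebook:)

```python
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# Pre-compute the embeddings once so BERTopic does not re-embed the corpus.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model name
embeddings = embedding_model.encode(docs, batch_size=64, show_progress_bar=True)

# Pass the pre-computed embeddings to fit_transform.
topic_model = BERTopic(embedding_model=embedding_model, verbose=True)
topics, probs = topic_model.fit_transform(docs, embeddings)
```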
-
To start off, can you share your full code? Which version of BERTopic are you using?
Which error exactly are you getting, and when do you get that error? When you set `verbose=True`, which steps are logged before it appears?
I believe we can get quite far before you would have to buy a higher-spec PC. There are quite a few options that we can go through to optimize your pipeline. First, we would need to find out what exactly is the bottleneck of your setup. Could you provide the specs of your environment? Are you working in a Google Colab session?
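(For illustration, a small sketch of the kind of diagnostic being asked for here, assuming `docs` holds the Reddit texts:)

```python
from bertopic import BERTopic

# verbose=True makes BERTopic log each stage (embedding, dimensionality
# reduction, clustering, topic representation), which helps pin down
# where the large allocation is requested.
topic_model = BERTopic(language="english", verbose=True)
topics, probs = topic_model.fit_transform(docs)
```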
-
Are you sure that is the exact error log you get when following the exact code from the notebook "Topic Modeling on Large Data"? The reason I am asking is that the error log you shared shows that you initialized the topic model as follows:

```python
topic_model = BERTopic(language="english")

# Step 4: Fit the model to your data
topics, probabilities = topic_model.fit_transform(df['text'])
```

which is not according to the instructions of the notebook. Please share the error log that you get when you follow along with the notebook without changing any parameters. If, however, that is the exact error log you get regardless of how you initialized BERTopic, then it seems that it is a result of UMAP.
What you could do is follow along with the "UMAP" section of the "Topic Modeling on Large Data" notebook about pre-calculating the dimensionality reduction. I would advise fitting on a subset of your data and then transforming the entire set in order to prevent memory errors; see the sketch below. Also, which steps are logged when you set `verbose=True`?
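(A minimal sketch of the subset-fit / full-transform pattern, assuming `embeddings` is the pre-computed array and `docs` the matching documents; the subset size and UMAP parameters are placeholders, not values from the notebook:)

```python
import numpy as np
from umap import UMAP
from bertopic import BERTopic
from bertopic.dimensionality import BaseDimensionalityReduction

# Pick a random subset to fit on (placeholder size).
subset = np.random.choice(len(docs), size=500_000, replace=False)

# Pre-calculate the dimensionality reduction: fit UMAP on the subset only,
# then transform the full embedding matrix with the fitted reducer.
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0,
                  metric="cosine", low_memory=True)
umap_model.fit(embeddings[subset])
reduced = umap_model.transform(embeddings)

# Skip BERTopic's own dimensionality reduction and feed it the
# already-reduced embeddings instead.
topic_model = BERTopic(umap_model=BaseDimensionalityReduction(), verbose=True)
docs_subset = [docs[i] for i in subset]
topic_model.fit(docs_subset, reduced[subset])

# Transform the entire corpus with the fitted model.
topics, probs = topic_model.transform(docs, reduced)
```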
-
This solved the issue. I'd made adjustments to try to speed up the training time on my machine (I can't get cuML to work on Windows), and those were causing the memory error. When I defaulted back to the notebook, it ran without issue, using available RAM and part of the swap on the SSD (at least I think so, since the drive reads 100% utilization during the UMAP and HDBSCAN portions).