-
Heyo folx, I'm struggling to get BERTopic to work on a relatively large dataset (a few million Reddit posts/comments). I've tried a bunch of things to change how BERTopic works, but it needs to allocate 149 GB for the array, and I can't get that even with a large swap set up on my SSD. Following the "BERTopic with Big Data" notebook helped me produce embeddings for the data efficiently, but I think it's when I pass the embeddings and docs to BERTopic that I run into issues, because I get the same error. Can anyone help me sort this out? As I see it, my options are: get a second job so I can afford a higher-spec PC, spend time tinkering with virtual machines, take a subset of the data to train a model and then feed the majority of the data through the already trained model, or do a better job of breaking the dataset into chunks. Any suggestions welcome. Thanks!
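(For context, a minimal sketch of the embedding-precomputation pattern, assuming `docs` is the list of post/comment strings already in memory; the model name and batch size are placeholders, not values from the notebook:)

```python
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# Pre-compute the embeddings once so BERTopic does not re-embed the corpus.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model name
embeddings = embedding_model.encode(docs, batch_size=64, show_progress_bar=True)

# Pass the pre-computed embeddings to fit_transform.
topic_model = BERTopic(embedding_model=embedding_model, verbose=True)
topics, probs = topic_model.fit_transform(docs, embeddings)
```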
-
To start off, can you share your full code? Which version of BERTopic are you using?
Which error exactly are you getting, and when do you get that error? When you set `verbose=True`, which steps are logged before it appears?
I believe we can get quite far before you would have to buy a higher-spec PC. There are quite a few options that we can go through to optimize your pipeline. First, we would need to find out what exactly is the bottleneck of your setup. Could you provide the specs of your environment? Are you working in a Google Colab session?
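(For illustration, a small sketch of the kind of diagnostic being asked for here, assuming `docs` holds the Reddit texts:)

```python
from bertopic import BERTopic

# verbose=True makes BERTopic log each stage (embedding, dimensionality
# reduction, clustering, topic representation), which helps pin down
# where the large allocation is requested.
topic_model = BERTopic(language="english", verbose=True)
topics, probs = topic_model.fit_transform(docs)
```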
-
Are you sure that is the exact error log you get when following the exact code from the notebook "Topic Modeling on Large Data"? The reason I am asking is that the error log you shared shows that you initialized the topic model as follows:

```python
topic_model = BERTopic(language="english")

# Step 4: Fit the model to your data
topics, probabilities = topic_model.fit_transform(df['text'])
```

which is not according to the instructions of the notebook. Please share the error log that you get when you follow along with the notebook without changing any parameters. If, however, that is the exact error log you get regardless of how you initialized BERTopic, then it seems that it is a result of UMAP.
What you could do is follow along with the "UMAP" section of the "Topic Modeling on Large Data" notebook about pre-calculating the dimensionality reduction. I would advise fitting on a subset of your data and then transforming the entire set in order to prevent memory errors; see the sketch below. Also, which steps are logged when you set `verbose=True`?
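(A minimal sketch of the subset-fit / full-transform pattern, assuming `embeddings` is the pre-computed array and `docs` the matching documents; the subset size and UMAP parameters are placeholders, not values from the notebook:)

```python
import numpy as np
from umap import UMAP
from bertopic import BERTopic
from bertopic.dimensionality import BaseDimensionalityReduction

# Pick a random subset to fit on (placeholder size).
subset = np.random.choice(len(docs), size=500_000, replace=False)

# Pre-calculate the dimensionality reduction: fit UMAP on the subset only,
# then transform the full embedding matrix with the fitted reducer.
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0,
                  metric="cosine", low_memory=True)
umap_model.fit(embeddings[subset])
reduced = umap_model.transform(embeddings)

# Skip BERTopic's own dimensionality reduction and feed it the
# already-reduced embeddings instead.
topic_model = BERTopic(umap_model=BaseDimensionalityReduction(), verbose=True)
docs_subset = [docs[i] for i in subset]
topic_model.fit(docs_subset, reduced[subset])

# Transform the entire corpus with the fitted model.
topics, probs = topic_model.transform(docs, reduced)
```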
-
This solved the issue. I'd made adjustments to try to speed up the training time on my machine (I can't get cuML to work on Windows), and those were causing the memory error. When I defaulted back to the notebook, it ran without issue, using available RAM and part of the swap on the SSD (at least I think so, since the drive reads 100% utilization during the UMAP and HDBSCAN portions).