IndexError: index -2 is out of bounds for axis 0 with size 1 for the zero shot code. #1749

yml-blog · 2024-01-13T06:32:02Z

I barely changed the zero-shot example code, but I get this error. Could you help me solve it? Thanks.

from datasets import load_dataset

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

# We select a subsample of 5000 abstracts from ArXiv
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
docs = dataset["abstract"][:5_000]

# We define a number of topics that we know are in the documents
zeroshot_topic_list = ["Clustering", "Topic Modeling", "Large Language Models"]

# We fit our model using the zero-shot topics
# and we define a minimum similarity. For each document,
# if the similarity does not exceed that value, it will be used
# for clustering instead.
topic_model = BERTopic(
    embedding_model="thenlper/gte-small",
    min_topic_size=15,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.85,
    representation_model=KeyBERTInspired()
)
topics, _ = topic_model.fit_transform(docs)

MaartenGr · 2024-01-13T06:33:13Z

I believe this is a result of setting zeroshot_min_similarity too high. If you lower the value, the issue might resolve itself.
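
To see how the threshold changes the split, it can help to count how many documents would reach a zero-shot topic at several values of zeroshot_min_similarity. The sketch below is not BERTopic's code. It assumes a document is assigned when its highest cosine similarity to any zero-shot topic embedding is at or above the threshold, and it uses tiny made-up vectors in place of real embeddings. All helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_similarities(doc_vectors, topic_vectors):
    """Highest similarity of each document to any zero-shot topic."""
    return [max(cosine(doc, topic) for topic in topic_vectors) for doc in doc_vectors]


def assigned_counts(doc_vectors, topic_vectors, thresholds):
    """Map each threshold to the number of documents that reach it."""
    best = best_similarities(doc_vectors, topic_vectors)
    return {t: sum(1 for s in best if s >= t) for t in thresholds}


# Made-up 2-d vectors standing in for real embeddings (assumption).
topic_vectors = [[1.0, 0.0], [0.0, 1.0]]
doc_vectors = [[1.0, 0.1], [0.7, 0.7], [0.2, 1.0], [-1.0, -1.0]]

counts = assigned_counts(doc_vectors, topic_vectors, [0.5, 0.85, 0.999])
print(counts)
```

With these toy vectors, three documents reach 0.5, two reach 0.85, and none reach 0.999, so at the highest value every document would fall through to clustering.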

MaartenGr · 2024-01-23T05:05:49Z

Note that there is also a preliminary fix available at #1762 which should resolve the issue entirely.

hubernst · 2024-02-06T13:41:38Z

Hello,

Zero-shot is a perfect extension. Thank you so much.
Unfortunately, I have the same problem as described above. I have already added your fix #1688 to _bertopic.py.
The code runs with zeroshot_min_similarity=0.2 or even 0.8, but with values in between it usually fails. Do you have a solution?

# All steps together
topic_model = BERTopic(
    verbose=True,
    min_topic_size=20,
    # nr_topics=5,
    zeroshot_topic_list=kategorien_1,
    zeroshot_min_similarity=.70,
    embedding_model=embedding_model,           # Step 1 - Extract embeddings
    umap_model=umap_model,                     # Step 2 - Reduce dimensionality
    hdbscan_model=hdbscan_model,               # Step 3 - Cluster reduced embeddings
    vectorizer_model=vectorizer_model,         # Step 4 - Tokenize topics
    ctfidf_model=ctfidf_model,                 # Step 5 - Extract topic words
    representation_model=representation_model  # Step 6 - (Optional) Fine-tune topic representations
)

2024-02-06 13:28:02,189 - BERTopic - Embedding - Transforming documents to embeddings.
100%|██████████| 1341/1341 [02:23<00:00, 9.35it/s]
2024-02-06 13:30:25,711 - BERTopic - Embedding - Completed ✓
2024-02-06 13:30:25,713 - BERTopic - Zeroshot Step 1 - Finding documents that could be assigned to either one of the zero-shot topics
2024-02-06 13:30:26,642 - BERTopic - Zeroshot Step 1 - Completed ✓
2024-02-06 13:30:26,643 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-06 13:30:29,187 - BERTopic - Dimensionality - Completed ✓
2024-02-06 13:30:29,190 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-06 13:30:29,207 - BERTopic - Cluster - Completed ✓
2024-02-06 13:30:29,214 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-06 13:30:38,532 - BERTopic - Representation - Completed ✓
2024-02-06 13:30:38,558 - BERTopic - Zeroshot Step 2 - Clustering documents that were not found in the zero-shot model...
2024-02-06 13:30:38,565 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-06 13:30:38,567 - BERTopic - Dimensionality - Completed ✓
2024-02-06 13:30:38,577 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-06 13:30:38,581 - BERTopic - Cluster - Completed ✓
2024-02-06 13:30:38,587 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-06 13:31:33,230 - BERTopic - Representation - Completed ✓
2024-02-06 13:31:33,298 - BERTopic - Zeroshot Step 2 - Completed ✓
2024-02-06 13:31:33,299 - BERTopic - Zeroshot Step 3 - Combining clustered topics with the zeroshot model

IndexError Traceback (most recent call last)
Input In [67], in <cell line: 2>()
1 #topics, probabilities = topic_model.fit_transform(sentences_nlp)
----> 2 topics, probabilities = topic_model.fit_transform(freitextantwort_list)

File /opt/conda/envs/Python-3.10-Premium/lib/python3.10/site-packages/bertopic/_bertopic.py:448, in fit_transform(self, documents, embeddings, images, y)
446 # Combine Zero-shot with outliers
447 if self._is_zeroshot() and len(documents) != len(doc_ids):
--> 448 predictions = self._combine_zeroshot_topics(documents, assigned_documents, assigned_embeddings)
449
450 return predictions, self.probabilities_

File /opt/conda/envs/Python-3.10-Premium/lib/python3.10/site-packages/bertopic/_bertopic.py:3553, in _combine_zeroshot_topics(self, documents, assigned_documents, embeddings)
3551 return self.topics_, self.probabilities_
3552
-> 3553 # Merge the two topic models
3554 merged_model = BERTopic.merge_models([zeroshot_model, self], min_similarity=1)
3555

File /opt/conda/envs/Python-3.10-Premium/lib/python3.10/site-packages/bertopic/_bertopic.py:3166, in merge_models(cls, models, min_similarity, embedding_model)
3164 merged_topics["topic_aspects"][aspect][str(new_topic_val)] = values[str(new_topic)]
3165
-> 3166 # Add new embeddings
3167 new_tensors = tensors[new_topic - selected_topics["_outliers"]]
3168 merged_tensors = np.vstack([merged_tensors, new_tensors])

IndexError: index -2 is out of bounds for axis 0 with size 1

Thanks a lot
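
The failing line in the traceback indexes `tensors` with `new_topic - selected_topics["_outliers"]`. The sketch below only illustrates the arithmetic behind the message. The concrete values are assumptions picked to reproduce index -2, not values read from the run, and a plain list stands in for the array. NumPy applies the same bounds rule as a Python list: for a sequence of size n, valid indices run from -n to n - 1.

```python
def is_valid_index(index, size):
    """True when `index` is addressable in a sequence of length `size`."""
    return -size <= index < size


# Hypothetical values chosen only to reproduce the reported index.
new_topic = -1  # assumed: the outlier topic id
outliers = 1    # assumed: the offset stored under "_outliers"
tensors = [[0.1, 0.2, 0.3]]  # a single row, matching "size 1"

index = new_topic - outliers
print(index, is_valid_index(index, len(tensors)))

try:
    tensors[index]
except IndexError as err:
    print("IndexError:", err)
```

Under these assumptions the subtraction yields -2, which is outside the valid range for a single-row container, so the lookup fails regardless of what the row contains.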

MaartenGr · 2024-02-06T13:56:20Z

@hubernst You mention using #1688 but the actual fix is found in #1762 which you should install through pip. Have you tried that? Make sure to start from a fresh and empty environment.

hubernst · 2024-02-06T14:12:22Z

Thanks for your really quick response.
Unfortunately, I'm in a network environment without a Git connection. That's why I edited _bertopic.py directly as specified in the fix. And sorry, I of course meant #1762.

MaartenGr · 2024-02-06T17:10:21Z

@hubernst Can you provide a reproducible example? You shared very limited code so it's unclear for example what is in representation_model or which versions you are using. Also, I get no issues using the code from the PR on my end using the examples in the related issues.
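
When putting together a reproducible example, a short version report removes much of the guesswork. The snippet below is one way to collect it with the standard library only. The package list is an assumption based on the components mentioned in this thread and can be adjusted, and `package_versions` is a made-up helper name.

```python
import platform
from importlib import metadata


def package_versions(names):
    """Return {distribution name: installed version or 'not installed'}."""
    report = {}
    for name in names:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


# Distribution names as published on PyPI (assumed to be the relevant ones).
packages = [
    "bertopic",
    "sentence-transformers",
    "umap-learn",
    "hdbscan",
    "scikit-learn",
    "numpy",
]

print("python", platform.python_version())
for name, version in package_versions(packages).items():
    print(name, version)
```

Note that a manually patched _bertopic.py still reports the version of the installed release, so a hand-applied fix is worth mentioning separately.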

hubernst · 2024-02-09T15:49:48Z

Hi, thanks for your answer.
I'm using BERTopic version 0.16.0 and Python 3.10.
My code looks like this:

# Step 1 - Extract embeddings
embedding_model = sentence_transformers.SentenceTransformer('/userfs/assets/data_asset/huggingface/paraphrase-multilingual-MiniLM-L12-v2')
# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=10, n_components=10, min_dist=0.0, metric='cosine', random_state=42)
# Step 3 - Cluster reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=30, metric='euclidean', cluster_selection_method='eom', prediction_data=False)
# Step 4 - Tokenize topics
vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words=stopwords_german)
# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True, bm25_weighting=True)
# Step 6 - (Optional) Fine-tune topic representations with 
# a `bertopic.representation` model
representation_model = KeyBERTInspired()
# All steps together
topic_model = BERTopic(
    verbose=True,
    min_topic_size = 30,
    #nr_topics = 5,
    zeroshot_topic_list=kategorien_1,
    zeroshot_min_similarity=.45,
    embedding_model=embedding_model,          # Step 1 - Extract embeddings
    umap_model=umap_model,                    # Step 2 - Reduce dimensionality
    hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings
    vectorizer_model=vectorizer_model,        # Step 4 - Tokenize topics
    ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words
    representation_model=representation_model # Step 6 - (Optional) Fine-tune topic representations
)
topics = topic_model.fit_transform(freitextantwort_list)

2024-02-09 15:44:24,639 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100% 42/42 [00:18<00:00, 4.04it/s]
2024-02-09 15:44:43,544 - BERTopic - Embedding - Completed ✓
2024-02-09 15:44:43,546 - BERTopic - Zeroshot Step 1 - Finding documents that could be assigned to either one of the zero-shot topics
2024-02-09 15:44:43,747 - BERTopic - Zeroshot Step 1 - Completed ✓
2024-02-09 15:44:43,748 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-09 15:44:56,807 - BERTopic - Dimensionality - Completed ✓
2024-02-09 15:44:56,808 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-09 15:44:56,835 - BERTopic - Cluster - Completed ✓
2024-02-09 15:44:56,841 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-09 15:44:58,442 - BERTopic - Representation - Completed ✓
2024-02-09 15:44:58,469 - BERTopic - Zeroshot Step 2 - Clustering documents that were not found in the zero-shot model...
2024-02-09 15:44:58,475 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-09 15:44:58,477 - BERTopic - Dimensionality - Completed ✓
2024-02-09 15:44:58,481 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-09 15:44:58,484 - BERTopic - Cluster - Completed ✓
2024-02-09 15:44:58,490 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-09 15:45:10,157 - BERTopic - Representation - Completed ✓
2024-02-09 15:45:10,231 - BERTopic - Zeroshot Step 2 - Completed ✓
2024-02-09 15:45:10,232 - BERTopic - Zeroshot Step 3 - Combining clustered topics with the zeroshot model

IndexError Traceback (most recent call last)
Input In [55], in <cell line: 2>()
1 #topics, probabilities = topic_model.fit_transform(sentences_nlp)
----> 2 topics = topic_model.fit_transform(freitextantwort_list)

File /opt/conda/envs/Python-3.10-Premium/lib/python3.10/site-packages/bertopic/_bertopic.py:448, in BERTopic.fit_transform(self, documents, embeddings, images, y)
446 # Combine Zero-shot with outliers
447 if self._is_zeroshot() and len(documents) != len(doc_ids):
--> 448 predictions = self._combine_zeroshot_topics(documents, assigned_documents, assigned_embeddings)
450 return predictions, self.probabilities_

File /opt/conda/envs/Python-3.10-Premium/lib/python3.10/site-packages/bertopic/_bertopic.py:3554, in BERTopic._combine_zeroshot_topics(self, documents, assigned_documents, embeddings)
3551 return self.topics_, self.probabilities_
3553 # Merge the two topic models
-> 3554 merged_model = BERTopic.merge_models([zeroshot_model, self], min_similarity=1)
3556 # Update topic labels and representative docs of the zero-shot model
3557 for topic in range(len(set(y))):

File /opt/conda/envs/Python-3.10-Premium/lib/python3.10/site-packages/bertopic/_bertopic.py:3167, in BERTopic.merge_models(cls, models, min_similarity, embedding_model)
3164 merged_topics["topic_aspects"][aspect][str(new_topic_val)] = values[str(new_topic)]
3166 # Add new embeddings
-> 3167 new_tensors = tensors[new_topic - selected_topics["_outliers"]]
3168 merged_tensors = np.vstack([merged_tensors, new_tensors])
3170 # Topic Mapper

IndexError: index -2 is out of bounds for axis 0 with size 1

It works if I am not using zero-shot topic modeling.

Best regards

MaartenGr · 2024-02-09T17:57:14Z

I think this issue then relates to #1797 which should be relatively straightforward to fix. I would advise keeping an eye on that issue until a fix is released.

MaartenGr · 2024-02-12T12:59:41Z

@hubernst

I created a PR in #1804 that should solve both issues, the ordering of the embeddings as well as moving the outlier class back to the 0th position (which is necessary for many other functions).

Could you test whether it works for you?
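
As a rough illustration of the second point, the sketch below reorders a list of topic ids so that the outlier class (-1) comes first and applies the same permutation to the per-topic embeddings, so each row still belongs to the topic at the same position. This is not the code from the PR. The function name and the toy data are made up for the example.

```python
def outlier_first(topic_ids, topic_embeddings, outlier_id=-1):
    """Reorder topics so the outlier id comes first, keeping rows aligned.

    The relative order of the remaining topics is preserved because
    sorted() is stable.
    """
    if len(topic_ids) != len(topic_embeddings):
        raise ValueError("topic_ids and topic_embeddings must have the same length")
    order = sorted(range(len(topic_ids)), key=lambda i: topic_ids[i] != outlier_id)
    return [topic_ids[i] for i in order], [topic_embeddings[i] for i in order]


# Toy data: the outlier class ended up in the middle after a merge (assumed).
topic_ids = [0, 1, -1, 2]
topic_embeddings = [[0.0, 0.1], [1.0, 1.1], [9.0, 9.1], [2.0, 2.1]]

new_ids, new_embeddings = outlier_first(topic_ids, topic_embeddings)
print(new_ids)
print(new_embeddings)
```

Moving ids and embeddings with one shared permutation is the point of the sketch: reordering only one of the two would leave rows attached to the wrong topic.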

hubernst · 2024-02-12T13:19:53Z

Hello, yes, of course I will check it, thank you for the fix! Hopefully today, tomorrow afternoon at the latest.

hubernst · 2024-02-12T17:04:46Z

Hi, thanks for the quick help. The fix in #1804 works for the problem described here, i.e. I can now specify different values for zeroshot_min_similarity. Unfortunately, the fix does not solve issue #1792; I can comment on that there as well. There is also an error with topics_per_class(). Sorry.

MaartenGr · 2024-02-12T18:15:18Z

Glad to hear that it resolved at least this issue ;-) I added my response to that specific issue there.

James-Leslie · 2025-01-06T01:19:44Z

When running zero-shot topic modelling, I encounter the following error:
IndexError: index 62 is out of bounds for axis 0 with size 62

I had been using this same approach on a weekly basis for a few months with no issues, but I recently changed my embedding model from OpenAI's text-embedding-ada-002 to their newer text-embedding-3-large model.

I cannot share my documents, as they are sensitive for my company, but my code is below. If I change the zeroshot_min_similarity argument to something high, like 0.85, then the code will run, but there will be no zero-shot topics, only new ones.

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from bertopic import BERTopic
from bertopic.backend import OpenAIBackend
from bertopic.representation import BaseRepresentation, OpenAI
from bertopic.vectorizers import ClassTfidfTransformer
from hdbscan import HDBSCAN
from openai import AzureOpenAI
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP


# create Azure OpenAI client
client = AzureOpenAI(
    api_key=...,
    api_version="2024-10-21",  # must be a string; unquoted, 2024-10-21 is integer arithmetic
    azure_endpoint=...,
)

# 1. embeddings
embedding_model = OpenAIBackend(
    client,
    "text-embedding-3-large",
    generator_kwargs={
        "dimensions": 768
    }
)

# 2. dimensionality reduction
umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    min_dist=0.0,
    metric='cosine',
    random_state=42  # prevents stochastic behaviour
)

# 3. clustering
hdbscan_model = HDBSCAN(
    min_cluster_size=10,
    metric='euclidean',
    cluster_selection_method='eom',
    prediction_data=True
)

# 4. bag-of-words
vectorizer_model = CountVectorizer(
    stop_words="english",
    ngram_range=(1, 2)
)

# 5. topic representation
ctfidf_model = ClassTfidfTransformer()

# 6. list of zero-shot topics
zeroshot_topic_list = user_topics["name"].tolist()  # have to keep this secret, but it's just a list of strings


# fit model to data
topic_model = BERTopic(
    # algorithm components
    embedding_model=embedding_model,  # Step 1 - Embedding model backend
    umap_model=umap_model,  # Step 2 - Reduce dimensionality
    hdbscan_model=hdbscan_model,  # Step 3 - Cluster reduced embeddings
    vectorizer_model=vectorizer_model,  # Step 4 - Tokenize topics
    ctfidf_model=ctfidf_model,  # Step 5 - Extract topic words
    # hyperparameters
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=0.75,
    min_topic_size=5,
    nr_topics="auto",
    verbose=True,
)

# Fit BERTopic using pre-computed embeddings
topic_model.fit(docs, embeddings=embeddings)

Here is the output before the error:

2025-01-06 01:15:52,232 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-01-06 01:16:18,496 - BERTopic - Dimensionality - Completed ✓
2025-01-06 01:16:18,498 - BERTopic - Zeroshot Step 1 - Finding documents that could be assigned to either one of the zero-shot topics
2025-01-06 01:16:18,694 - BERTopic - Zeroshot Step 1 - Completed ✓
2025-01-06 01:16:52,137 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-01-06 01:16:52,227 - BERTopic - Cluster - Completed ✓
2025-01-06 01:16:52,228 - BERTopic - Zeroshot Step 2 - Combining topics from zero-shot topic modeling with topics from clustering...
2025-01-06 01:16:52,247 - BERTopic - Zeroshot Step 2 - Completed ✓
2025-01-06 01:16:52,248 - BERTopic - Representation - Extracting topics from clusters using representation models.
2025-01-06 01:16:52,402 - BERTopic - Representation - Completed ✓
2025-01-06 01:16:52,404 - BERTopic - Topic reduction - Reducing number of topics

I am using pre-computed embeddings, which are passed to the fit() method as a numpy array of shape (n_documents x 768).

MaartenGr · 2025-01-08T11:16:25Z

@James-Leslie It might be a result of the updated embedding model (which might change the distribution of similarities) but also a bug that was in earlier versions of BERTopic. Are you using the latest (v0.16.4)?
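
If the new embedding model shifts the similarity distribution, one heuristic for carrying a threshold over is to keep the share of documents that reach it roughly constant. The sketch below is such a heuristic, not something BERTopic provides. It assumes you have, for both models, the best document-to-topic similarity per document in a non-empty list, and the numbers are made up.

```python
def matched_threshold(old_sims, new_sims, old_threshold):
    """Pick a threshold for new_sims that keeps the assigned share similar.

    The share is the fraction of old_sims at or above old_threshold.
    Returns None when no document reached the old threshold.
    """
    share = sum(1 for s in old_sims if s >= old_threshold) / len(old_sims)
    keep = round(share * len(new_sims))
    if keep == 0:
        return None
    # The keep-th largest new similarity admits `keep` documents (ties aside).
    return sorted(new_sims, reverse=True)[keep - 1]


# Made-up best similarities per document for two embedding models.
old_sims = [0.91, 0.88, 0.86, 0.80, 0.72, 0.65]
new_sims = [0.79, 0.77, 0.74, 0.61, 0.55, 0.40]

print(matched_threshold(old_sims, new_sims, 0.85))
```

In this toy case half of the documents reached 0.85 under the old model, so the function returns the third-largest similarity of the new model as the candidate threshold.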

JamesLeslieAT · 2025-01-08T20:15:05Z

Hi @MaartenGr, I have lowered the threshold from 0.85 to 0.75 to account for the new model's distribution. Using version 0.16.4.

I found that the error doesn't happen if I leave the nr_topics parameter out; however, I like the feature of reducing the number of topics automatically.

If I leave nr_topics="auto", it only works if I set zeroshot_min_similarity high (which effectively means that the zero-shot model doesn't match any documents).

MaartenGr · 2025-01-17T11:17:29Z

@JamesLeslieAT @James-Leslie I just created a PR that should have fixed the issue. Could you try it out?

MaartenGr added a commit that referenced this issue on Jan 17, 2025: Fix #1749 (725f7d7)

MaartenGr linked a pull request on Jan 17, 2025 that will close this issue: Fix #1749 (#2267)
