IndexError: index -2 is out of bounds for axis 0 with size 1 for the zero shot code. #1749

Open
yml-blog opened this issue Jan 13, 2024 · 16 comments · May be fixed by #2267

Comments

@yml-blog

I barely changed the zero-shot example code, but I get this error. Could you help me solve it? Thanks.

from datasets import load_dataset

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

# We select a subsample of 5000 abstracts from ArXiv
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
docs = dataset["abstract"][:5_000]

# We define a number of topics that we know are in the documents
zeroshot_topic_list = ["Clustering", "Topic Modeling", "Large Language Models"]

# We fit our model using the zero-shot topics
# and we define a minimum similarity. For each document,
# if the similarity does not exceed that value, it will be used
# for clustering instead.
topic_model = BERTopic(
    embedding_model="thenlper/gte-small",
    min_topic_size=15,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.85,
    representation_model=KeyBERTInspired()
)
topics, _ = topic_model.fit_transform(docs)

@MaartenGr
Owner

I believe this is a result of setting zeroshot_min_similarity too high. If you lower the value, the issue might resolve itself.
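
To see why the threshold matters, here is a minimal sketch of the thresholding idea described in the example's comments: a document goes to its best-matching zero-shot topic only if the cosine similarity reaches the threshold; otherwise it is left for clustering. This is not BERTopic's internal code. `split_by_similarity` and the toy 2-D vectors are hypothetical, and only the standard library is used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def split_by_similarity(doc_vectors, topic_vectors, min_similarity):
    """Return (assigned, leftover).

    assigned maps a document index to the index of its best zero-shot topic;
    leftover lists the documents whose best similarity is below the threshold.
    """
    assigned, leftover = {}, []
    for doc_id, doc in enumerate(doc_vectors):
        sims = [cosine(doc, topic) for topic in topic_vectors]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] >= min_similarity:
            assigned[doc_id] = best
        else:
            leftover.append(doc_id)
    return assigned, leftover


# Toy data (made up): two zero-shot topics and three documents.
topics = [[1.0, 0.0], [0.0, 1.0]]
docs = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]

assigned, leftover = split_by_similarity(docs, topics, min_similarity=0.85)
```

With a threshold close to 1.0, every document in this toy data ends up in the leftover list, so no zero-shot topic receives any documents.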

@MaartenGr
Owner

Note that there is also a preliminary fix available at #1762 which should resolve the issue entirely.

@hubernst

hubernst commented Feb 6, 2024

Hello,

Zero-shot is a perfect extension. Thank you so much.
Unfortunately, I have the same problem as described above. I have already added your fix #1688 to _bertopic.py.
With zeroshot_min_similarity=0.2 or even 0.8 the code runs, but values in between usually fail. Do you have a solution?

# All steps together
topic_model = BERTopic(
    verbose=True,
    min_topic_size=20,
    #nr_topics = 5,
    zeroshot_topic_list=kategorien_1,
    zeroshot_min_similarity=.70,
    embedding_model=embedding_model,          # Step 1 - Extract embeddings
    umap_model=umap_model,                    # Step 2 - Reduce dimensionality
    hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings
    vectorizer_model=vectorizer_model,        # Step 4 - Tokenize topics
    ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words
    representation_model=representation_model # Step 6 - (Optional) Fine-tune topic representations
)

2024-02-06 13:28:02,189 - BERTopic - Embedding - Transforming documents to embeddings.
100%|██████████| 1341/1341 [02:23<00:00, 9.35it/s]
2024-02-06 13:30:25,711 - BERTopic - Embedding - Completed ✓
2024-02-06 13:30:25,713 - BERTopic - Zeroshot Step 1 - Finding documents that could be assigned to either one of the zero-shot topics
2024-02-06 13:30:26,642 - BERTopic - Zeroshot Step 1 - Completed ✓
2024-02-06 13:30:26,643 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-06 13:30:29,187 - BERTopic - Dimensionality - Completed ✓
2024-02-06 13:30:29,190 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-06 13:30:29,207 - BERTopic - Cluster - Completed ✓
2024-02-06 13:30:29,214 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-06 13:30:38,532 - BERTopic - Representation - Completed ✓
2024-02-06 13:30:38,558 - BERTopic - Zeroshot Step 2 - Clustering documents that were not found in the zero-shot model...
2024-02-06 13:30:38,565 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-06 13:30:38,567 - BERTopic - Dimensionality - Completed ✓
2024-02-06 13:30:38,577 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-06 13:30:38,581 - BERTopic - Cluster - Completed ✓
2024-02-06 13:30:38,587 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-06 13:31:33,230 - BERTopic - Representation - Completed ✓
2024-02-06 13:31:33,298 - BERTopic - Zeroshot Step 2 - Completed ✓
2024-02-06 13:31:33,299 - BERTopic - Zeroshot Step 3 - Combining clustered topics with the zeroshot model

IndexError Traceback (most recent call last)
Input In [67], in <cell line: 2>()
1 #topics, probabilities = topic_model.fit_transform(sentences_nlp)
----> 2 topics, probabilities = topic_model.fit_transform(freitextantwort_list)

File /opt/conda/envs/Python-3.10-Premium/lib/python3.10/site-packages/bertopic/_bertopic.py:448, in fit_transform(self, documents, embeddings, images, y)
446 # Combine Zero-shot with outliers
447 if self._is_zeroshot() and len(documents) != len(doc_ids):
--> 448 predictions = self._combine_zeroshot_topics(documents, assigned_documents, assigned_embeddings)
449
450 return predictions, self.probabilities_

File /opt/conda/envs/Python-3.10-Premium/lib/python3.10/site-packages/bertopic/_bertopic.py:3553, in _combine_zeroshot_topics(self, documents, assigned_documents, embeddings)
3551 return self.topics_, self.probabilities_
3552
-> 3553 # Merge the two topic models
3554 merged_model = BERTopic.merge_models([zeroshot_model, self], min_similarity=1)
3555

File /opt/conda/envs/Python-3.10-Premium/lib/python3.10/site-packages/bertopic/_bertopic.py:3166, in merge_models(cls, models, min_similarity, embedding_model)
3164 merged_topics["topic_aspects"][aspect][str(new_topic_val)] = values[str(new_topic)]
3165
-> 3166 # Add new embeddings
3167 new_tensors = tensors[new_topic - selected_topics["_outliers"]]
3168 merged_tensors = np.vstack([merged_tensors, new_tensors])

IndexError: index -2 is out of bounds for axis 0 with size 1

Thanks a lot

@MaartenGr
Owner

@hubernst You mention using #1688 but the actual fix is found in #1762 which you should install through pip. Have you tried that? Make sure to start from a fresh and empty environment.

@hubernst

hubernst commented Feb 6, 2024

Thanks for your really quick response.
Unfortunately, I'm in a network environment without a Git connection. That's why I edited _bertopic.py directly as specified in the fix. And sorry, I meant #1762, of course.

image

@MaartenGr
Owner

@hubernst Can you provide a reproducible example? You shared very limited code so it's unclear for example what is in representation_model or which versions you are using. Also, I get no issues using the code from the PR on my end using the examples in the related issues.

@hubernst

hubernst commented Feb 9, 2024

Hi, thanks for your answer.
I'm using BERTopic version 0.16.0 and Python 3.10.
My code looks like this:

import sentence_transformers
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.vectorizers import ClassTfidfTransformer
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

# stopwords_german, kategorien_1 and freitextantwort_list are the poster's own data (not shown)

# Step 1 - Extract embeddings
embedding_model = sentence_transformers.SentenceTransformer('/userfs/assets/data_asset/huggingface/paraphrase-multilingual-MiniLM-L12-v2')
# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=10, n_components=10, min_dist=0.0, metric='cosine', random_state=42)
# Step 3 - Cluster reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=30, metric='euclidean', cluster_selection_method='eom', prediction_data=False)
# Step 4 - Tokenize topics
vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words=stopwords_german)
# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True, bm25_weighting=True)
# Step 6 - (Optional) Fine-tune topic representations with 
# a `bertopic.representation` model
representation_model = KeyBERTInspired()
# All steps together
topic_model = BERTopic(
    verbose=True,
    min_topic_size=30,
    #nr_topics = 5,
    zeroshot_topic_list=kategorien_1,
    zeroshot_min_similarity=.45,
    embedding_model=embedding_model,          # Step 1 - Extract embeddings
    umap_model=umap_model,                    # Step 2 - Reduce dimensionality
    hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings
    vectorizer_model=vectorizer_model,        # Step 4 - Tokenize topics
    ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words
    representation_model=representation_model # Step 6 - (Optional) Fine-tune topic representations
)
topics = topic_model.fit_transform(freitextantwort_list)

2024-02-09 15:44:24,639 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%
42/42 [00:18<00:00, 4.04it/s]
2024-02-09 15:44:43,544 - BERTopic - Embedding - Completed ✓
2024-02-09 15:44:43,546 - BERTopic - Zeroshot Step 1 - Finding documents that could be assigned to either one of the zero-shot topics
2024-02-09 15:44:43,747 - BERTopic - Zeroshot Step 1 - Completed ✓
2024-02-09 15:44:43,748 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-09 15:44:56,807 - BERTopic - Dimensionality - Completed ✓
2024-02-09 15:44:56,808 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-09 15:44:56,835 - BERTopic - Cluster - Completed ✓
2024-02-09 15:44:56,841 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-09 15:44:58,442 - BERTopic - Representation - Completed ✓
2024-02-09 15:44:58,469 - BERTopic - Zeroshot Step 2 - Clustering documents that were not found in the zero-shot model...
2024-02-09 15:44:58,475 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-09 15:44:58,477 - BERTopic - Dimensionality - Completed ✓
2024-02-09 15:44:58,481 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-09 15:44:58,484 - BERTopic - Cluster - Completed ✓
2024-02-09 15:44:58,490 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-09 15:45:10,157 - BERTopic - Representation - Completed ✓
2024-02-09 15:45:10,231 - BERTopic - Zeroshot Step 2 - Completed ✓
2024-02-09 15:45:10,232 - BERTopic - Zeroshot Step 3 - Combining clustered topics with the zeroshot model

IndexError Traceback (most recent call last)
Input In [55], in <cell line: 2>()
1 #topics, probabilities = topic_model.fit_transform(sentences_nlp)
----> 2 topics = topic_model.fit_transform(freitextantwort_list)

File /opt/conda/envs/Python-3.10-Premium/lib/python3.10/site-packages/bertopic/_bertopic.py:448, in BERTopic.fit_transform(self, documents, embeddings, images, y)
446 # Combine Zero-shot with outliers
447 if self._is_zeroshot() and len(documents) != len(doc_ids):
--> 448 predictions = self._combine_zeroshot_topics(documents, assigned_documents, assigned_embeddings)
450 return predictions, self.probabilities_

File /opt/conda/envs/Python-3.10-Premium/lib/python3.10/site-packages/bertopic/_bertopic.py:3554, in BERTopic._combine_zeroshot_topics(self, documents, assigned_documents, embeddings)
3551 return self.topics_, self.probabilities_
3553 # Merge the two topic models
-> 3554 merged_model = BERTopic.merge_models([zeroshot_model, self], min_similarity=1)
3556 # Update topic labels and representative docs of the zero-shot model
3557 for topic in range(len(set(y))):

File /opt/conda/envs/Python-3.10-Premium/lib/python3.10/site-packages/bertopic/_bertopic.py:3167, in BERTopic.merge_models(cls, models, min_similarity, embedding_model)
3164 merged_topics["topic_aspects"][aspect][str(new_topic_val)] = values[str(new_topic)]
3166 # Add new embeddings
-> 3167 new_tensors = tensors[new_topic - selected_topics["_outliers"]]
3168 merged_tensors = np.vstack([merged_tensors, new_tensors])
3170 # Topic Mapper

IndexError: index -2 is out of bounds for axis 0 with size 1
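
The failing line indexes `tensors` with `new_topic - selected_topics["_outliers"]`. The sketch below only reproduces that arithmetic. The values `new_topic = -1` and `_outliers = 1` are assumptions chosen to yield the reported index, since the actual values from this run are not shown in the thread, and `in_bounds` is a hypothetical helper that mirrors the indexing rule for a sequence of a given size.

```python
def in_bounds(index, size):
    """True if `index` is valid for a sequence with `size` rows.

    Negative indices count from the end, so the valid range is -size..size-1.
    """
    return -size <= index < size


# Assumed values, chosen to reproduce the reported index.
new_topic = -1   # e.g. an outlier topic id
outliers = 1     # stand-in for selected_topics["_outliers"]
index = new_topic - outliers  # -2

rows = [[0.1, 0.2, 0.3]]  # a "tensor" with a single row (size 1)

try:
    rows[index]
    error = None
except IndexError as exc:
    error = exc  # a plain list raises IndexError here, just like the array does
```

The same rule covers the later report in this thread: index 62 is invalid for size 62 because the last valid row is 61.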

It works if I am not using zero-shot topic modeling.

Best regards

@MaartenGr
Owner

I think this issue then relates to #1797 which should be relatively straightforward to fix. I would advise keeping an eye on that issue until a fix is released.

@MaartenGr
Owner

@hubernst

I created a PR in #1804 that should solve both issues, the ordering of the embeddings as well as moving the outlier class back to the 0th position (which is necessary for many other functions).
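
A rough illustration of why the position of the outlier class matters: if topic embeddings are stored as rows sorted by topic id, with the outlier topic -1 first, a row can be looked up as `topic_id + offset`. This convention is an assumption made for the sketch, and the helper names are hypothetical; it is not a copy of the code in #1804.

```python
def order_rows(embeddings_by_topic):
    """Return (topic_ids, rows) sorted by topic id, so -1 (outliers) comes first."""
    topic_ids = sorted(embeddings_by_topic)
    rows = [embeddings_by_topic[t] for t in topic_ids]
    return topic_ids, rows


def row_for(topic_id, rows, has_outliers):
    """Look up a topic's row, shifting by one when row 0 holds the outliers.

    Assumes topic ids are contiguous (-1, 0, 1, ... or 0, 1, ...).
    """
    offset = 1 if has_outliers else 0
    return rows[topic_id + offset]


# Made-up embeddings, deliberately inserted out of order.
embeddings = {1: [0.0, 1.0], -1: [0.5, 0.5], 0: [1.0, 0.0]}
topic_ids, rows = order_rows(embeddings)
has_outliers = -1 in embeddings
```

If the outlier row sat anywhere other than position 0, the same `topic_id + offset` lookup would silently return another topic's embedding or run past the end of the array.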

Could you test whether it works for you?

@hubernst

Hello, yes, of course I will check it, thank you for the fix! Hopefully today, tomorrow afternoon at the latest.

@hubernst

Hi, thanks for the quick help. For the problem described here, the fix #1804 works! I.e. I can now specify different values for zeroshot_min_similarity. Unfortunately, the fix does not solve issue #1792; I can also comment on that there. Furthermore, there is an error with topics_per_class(). Sorry.

@MaartenGr
Owner

Glad to hear that it resolved at least this issue ;-) I added my response to that specific issue there.

@James-Leslie

James-Leslie commented Jan 6, 2025

When running zero-shot topic modelling, I encounter the following error:
IndexError: index 62 is out of bounds for axis 0 with size 62

I had been using this same approach on a weekly basis for a few months with no issues, but have recently changed my embedding model from OpenAI's text-embedding-ada-002 to their newer text-embedding-3-large model.

I cannot share my documents, as they are sensitive for my company, but my code is below. If I change the zeroshot_min_similarity argument to something high, like 0.85, then the code will run, but there will be no zero-shot topics, only new ones.

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from bertopic import BERTopic
from bertopic.backend import OpenAIBackend
from bertopic.representation import BaseRepresentation, OpenAI
from bertopic.vectorizers import ClassTfidfTransformer
from hdbscan import HDBSCAN
from openai import AzureOpenAI
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP


# create Azure OpenAI client
client = AzureOpenAI(
    api_key=...,
    api_version="2024-10-21",
    azure_endpoint=...,
)

# 1. embeddings
embedding_model = OpenAIBackend(
    client,
    "text-embedding-3-large",
    generator_kwargs={
        "dimensions": 768
    }
)

# 2. dimensionality reduction
umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    min_dist=0.0,
    metric='cosine',
    random_state=42  # prevents stochastic behaviour
)

# 3. clustering
hdbscan_model = HDBSCAN(
    min_cluster_size=10,
    metric='euclidean',
    cluster_selection_method='eom',
    prediction_data=True
)

# 4. bag-of-words
vectorizer_model = CountVectorizer(
    stop_words="english",
    ngram_range=(1, 2)
)

# 5. topic representation
ctfidf_model = ClassTfidfTransformer()

# 6. list of zero-shot topics
zeroshot_topic_list = user_topics["name"].tolist()  # have to keep this secret, but it's just a list of strings


# fit model to data
topic_model = BERTopic(
    # algorithm components
    embedding_model=embedding_model,  # Step 1 - Embedding model backend
    umap_model=umap_model,  # Step 2 - Reduce dimensionality
    hdbscan_model=hdbscan_model,  # Step 3 - Cluster reduced embeddings
    vectorizer_model=vectorizer_model,  # Step 4 - Tokenize topics
    ctfidf_model=ctfidf_model,  # Step 5 - Extract topic words
    # hyperparameters
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=0.75,
    min_topic_size=5,
    nr_topics="auto",
    verbose=True,
)

# Fit BERTopic using pre-computed embeddings
topic_model.fit(docs, embeddings=embeddings)

Here is the output before the error:

2025-01-06 01:15:52,232 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-01-06 01:16:18,496 - BERTopic - Dimensionality - Completed ✓
2025-01-06 01:16:18,498 - BERTopic - Zeroshot Step 1 - Finding documents that could be assigned to either one of the zero-shot topics
2025-01-06 01:16:18,694 - BERTopic - Zeroshot Step 1 - Completed ✓
2025-01-06 01:16:52,137 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-01-06 01:16:52,227 - BERTopic - Cluster - Completed ✓
2025-01-06 01:16:52,228 - BERTopic - Zeroshot Step 2 - Combining topics from zero-shot topic modeling with topics from clustering...
2025-01-06 01:16:52,247 - BERTopic - Zeroshot Step 2 - Completed ✓
2025-01-06 01:16:52,248 - BERTopic - Representation - Extracting topics from clusters using representation models.
2025-01-06 01:16:52,402 - BERTopic - Representation - Completed ✓
2025-01-06 01:16:52,404 - BERTopic - Topic reduction - Reducing number of topics

I am using pre-computed embeddings, which are passed to the fit() method as a NumPy array of shape (n_documents, 768).
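
Since the embeddings are pre-computed, one thing that can be ruled out cheaply is a mismatch between the documents and the embedding matrix. A small sanity check, assuming the embeddings can be read as a sequence of equal-length rows (`check_embeddings` is a hypothetical helper, not part of BERTopic):

```python
def check_embeddings(docs, embeddings, expected_dim):
    """Return a list of problems found; an empty list means the shapes agree."""
    problems = []
    if len(embeddings) != len(docs):
        problems.append(
            f"{len(docs)} documents but {len(embeddings)} embedding rows"
        )
    bad_rows = [i for i, row in enumerate(embeddings) if len(row) != expected_dim]
    if bad_rows:
        problems.append(f"rows with unexpected dimension: {bad_rows[:5]}")
    return problems


# Toy example with 3-dimensional vectors instead of 768.
docs = ["first document", "second document"]
good = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
bad = [[0.1, 0.2, 0.3]]

print(check_embeddings(docs, good, expected_dim=3))
print(check_embeddings(docs, bad, expected_dim=3))
```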

@MaartenGr
Owner

@James-Leslie It might be a result of the updated embedding model (which might change the distribution of similarities) but also a bug that was in earlier versions of BERTopic. Are you using the latest (v0.16.4)?
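
To answer the version question, the installed version can be read with the standard library. A small sketch; `installed_version` is a hypothetical wrapper around `importlib.metadata.version`:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


bertopic_version = installed_version("bertopic")
print(f"bertopic: {bertopic_version or 'not installed'}")
```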

@JamesLeslieAT

Hi @MaartenGr, I have lowered the threshold from 0.85 to 0.75 to account for the new model's distribution. Using version 0.16.4.

I found the error doesn't happen if I leave the nr_topics parameter out; however, I like the feature of reducing the number of topics automatically.

If I leave nr_topics="auto", then it only works if I set the min_similarity high (which effectively just means that the zero-shot model doesn't match any documents).
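
One way to pick a threshold after switching embedding models is to look at how many documents would still match a zero-shot topic at each candidate value. The sketch below assumes you already have, for every document, its highest similarity to any zero-shot topic; the numbers are made up and `match_rate` is a hypothetical helper.

```python
def match_rate(max_similarities, thresholds):
    """Map each threshold to the share of documents at or above it."""
    total = len(max_similarities)
    return {
        t: sum(s >= t for s in max_similarities) / total if total else 0.0
        for t in thresholds
    }


# Made-up best-match similarities for eight documents.
max_similarities = [0.42, 0.55, 0.61, 0.68, 0.72, 0.77, 0.79, 0.81]

rates = match_rate(max_similarities, thresholds=[0.65, 0.75, 0.85])
```

In this toy data, nothing reaches 0.85, which corresponds to the situation described above where the code runs but no zero-shot topics are found.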

MaartenGr added a commit that referenced this issue Jan 17, 2025
MaartenGr linked a pull request Jan 17, 2025 that will close this issue
@MaartenGr
Owner

@JamesLeslieAT @James-Leslie I just created a PR that should have fixed the issue. Could you try it out?

5 participants