IndexError: index out of range in self #686

Santhu489 · 2024-03-02T15:27:28Z

I am currently working on an "Autism gene classifier" project, which is a binary classification system. I have a gene dataset with the columns gene-symbol and syndromic (0 or 1).

The model I am using is GPT-2. When I run my code on Google Colab, I get the error IndexError: index out of range in self

This is my code:

# Install required libraries

!pip install torch
!pip install transformers
!pip install pandas
!pip install scikit-learn

# Import libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from transformers import GPT2Tokenizer, GPT2Model
import torch

# Load your gene data (assuming you have a CSV file with 'gene-symbol' and 'syndromic' columns)

data = pd.read_csv('drive/MyDrive/Gene/sfari_genes.csv')

# Split the data into training and testing sets

train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

# Initialize GPT-2 tokenizer and model

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

# Add a new pad token

tokenizer.add_special_tokens({'pad_token': '[PAD]'})

# Tokenize and encode the gene symbols with a maximum length of 1024 tokens

train_tokens = tokenizer(train_data['gene-symbol'].tolist(), padding=True, truncation=True, max_length=1024, return_tensors='pt')
test_tokens = tokenizer(test_data['gene-symbol'].tolist(), padding=True, truncation=True, max_length=1024, return_tensors='pt')

# Extract embeddings from GPT-2 model

model.eval()
with torch.no_grad():
    train_embeddings = model(**train_tokens).last_hidden_state.mean(dim=1)
    test_embeddings = model(**test_tokens).last_hidden_state.mean(dim=1)

# Flatten the embeddings to be used as input to logistic regression

train_embeddings = train_embeddings.view(train_embeddings.size(0), -1)
test_embeddings = test_embeddings.view(test_embeddings.size(0), -1)

# Train logistic regression classifier

clf = LogisticRegression(random_state=42)
clf.fit(train_embeddings, train_data['syndromic'])

# Evaluate logistic regression classifier

train_predictions = clf.predict(train_embeddings)
train_accuracy = accuracy_score(train_data['syndromic'], train_predictions)
print("Training accuracy:", train_accuracy)

The IndexError is raised when I try to extract the embeddings from the GPT-2 model.
I also tried setting max_length to 512 and to 1024, but neither works.
How can I resolve this error?

p0lyMth · 2024-07-10T06:08:17Z

@Santhu489, if you want help, you need to provide a small sample of sfari_genes.csv so the problem can be reproduced. The code snippet

data = pd.read_csv('drive/MyDrive/Gene/sfari_genes.csv')

is not enough.
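That said, one likely cause is visible in the posted code even without the dataset, though it cannot be confirmed without a reproducible sample. The pretrained 'gpt2' checkpoint has 50257 token embeddings (ids 0 to 50256). Calling tokenizer.add_special_tokens({'pad_token': '[PAD]'}) assigns the new token the next free id, 50257, but the model's embedding table is never enlarged, so any padded batch looks up a row that does not exist. The sketch below reproduces this on CPU with a tiny, randomly initialised GPT-2 (built from GPT2Config so nothing is downloaded; the small layer sizes and the example token ids are assumptions for illustration only) and then applies model.resize_token_embeddings.

```python
import torch
from transformers import GPT2Config, GPT2Model

# Tiny random GPT-2 so the sketch runs offline. Only vocab_size mirrors the
# real 'gpt2' checkpoint; n_embd, n_layer and n_head are illustrative values.
config = GPT2Config(vocab_size=50257, n_embd=16, n_layer=1, n_head=2)
model = GPT2Model(config)
model.eval()

# A newly added '[PAD]' token gets the next free id, which is one past the
# last row of the embedding table (valid ids are 0..50256).
pad_id = config.vocab_size
input_ids = torch.tensor([[100, 200, pad_id]])   # hypothetical padded batch
attention_mask = torch.tensor([[1, 1, 0]])

# Before resizing: the embedding lookup fails on the pad id.
error_before = None
try:
    with torch.no_grad():
        model(input_ids=input_ids, attention_mask=attention_mask)
except IndexError as exc:
    error_before = str(exc)

# Grow the embedding table so that id 50257 has a row. In the posted code the
# equivalent call would be model.resize_token_embeddings(len(tokenizer)),
# placed right after tokenizer.add_special_tokens(...).
model.resize_token_embeddings(pad_id + 1)

# After resizing: the same batch runs through the model.
with torch.no_grad():
    hidden = model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state

print("error before resize:", error_before)
print("hidden state shape after resize:", tuple(hidden.shape))
```

An alternative that avoids resizing is to reuse the existing end-of-text token for padding with tokenizer.pad_token = tokenizer.eos_token instead of adding '[PAD]'. Changing max_length would not help in either case, because the failing index is a token id, not a position.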
