IndexError: index out of range in self #686

Open
Santhu489 opened this issue Mar 2, 2024 · 1 comment


Santhu489 commented Mar 2, 2024

I am working on a project called "Autism gene classifier", a binary classification system. My gene dataset has two columns: gene-symbol and syndromic (0 or 1).

The model I am using is GPT-2. When I run my code on Google Colab, I get the error IndexError: index out of range in self

This is my code:

# Install required libraries

!pip install torch
!pip install transformers
!pip install pandas
!pip install scikit-learn

# Import libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from transformers import GPT2Tokenizer, GPT2Model
import torch

# Load your gene data (assuming you have a CSV file with 'gene-symbol' and 'syndromic' columns)

data = pd.read_csv('drive/MyDrive/Gene/sfari_genes.csv')

# Split the data into training and testing sets

train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

# Initialize GPT-2 tokenizer and model

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

# Add a new pad token

tokenizer.add_special_tokens({'pad_token': '[PAD]'})

# Tokenize and encode the gene symbols with a maximum length of 1024 tokens

train_tokens = tokenizer(train_data['gene-symbol'].tolist(), padding=True, truncation=True, max_length=1024, return_tensors='pt')
test_tokens = tokenizer(test_data['gene-symbol'].tolist(), padding=True, truncation=True, max_length=1024, return_tensors='pt')

# Extract embeddings from GPT-2 model

model.eval()
with torch.no_grad():
    train_embeddings = model(**train_tokens).last_hidden_state.mean(dim=1)
    test_embeddings = model(**test_tokens).last_hidden_state.mean(dim=1)

# Flatten the embeddings to be used as input to logistic regression

train_embeddings = train_embeddings.view(train_embeddings.size(0), -1)
test_embeddings = test_embeddings.view(test_embeddings.size(0), -1)

# Train logistic regression classifier

clf = LogisticRegression(random_state=42)
clf.fit(train_embeddings, train_data['syndromic'])

# Evaluate logistic regression classifier

train_predictions = clf.predict(train_embeddings)
train_accuracy = accuracy_score(train_data['syndromic'], train_predictions)
print("Training accuracy:", train_accuracy)

The IndexError is raised when I try to extract embeddings from the GPT-2 model.
I also tried setting max_length to 512 and to 1024, but neither worked.
How can I resolve this error?
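
One plausible cause, offered as an assumption because no traceback is shown: the code adds a '[PAD]' token to the tokenizer but never enlarges the model's embedding table, so the new token's id may fall one past the last valid row. The sketch below reproduces that mechanism on a tiny, randomly initialised GPT-2 (the sizes and the helper `try_forward` are hypothetical, chosen so nothing needs to be downloaded) and shows that `resize_token_embeddings` makes the forward pass succeed.

```python
import torch
from transformers import GPT2Config, GPT2Model

# Tiny, randomly initialised GPT-2 so no weights are downloaded.
# These sizes are hypothetical; the pretrained 'gpt2' checkpoint is much larger.
config = GPT2Config(vocab_size=100, n_positions=32, n_embd=16, n_layer=1, n_head=2)
model = GPT2Model(config)
model.eval()

# Valid token ids are 0..99. Id 100 plays the role of a newly added '[PAD]' token.
new_pad_id = config.vocab_size
input_ids = torch.tensor([[5, 17, new_pad_id]])
attention_mask = torch.tensor([[1, 1, 0]])


def try_forward(m):
    """Run one forward pass; return (output shape, None) or (None, error text)."""
    try:
        with torch.no_grad():
            out = m(input_ids=input_ids, attention_mask=attention_mask)
        return tuple(out.last_hidden_state.shape), None
    except IndexError as exc:
        return None, str(exc)


# Before resizing, the embedding lookup for id 100 fails on CPU.
shape_before, error_before = try_forward(model)
print("before resize:", shape_before, error_before)

# Grow the embedding table by one row so the new id becomes valid.
model.resize_token_embeddings(new_pad_id + 1)
shape_after, error_after = try_forward(model)
print("after resize:", shape_after, error_after)
```

If this is indeed the cause, calling `model.resize_token_embeddings(len(tokenizer))` right after `tokenizer.add_special_tokens(...)` in the code above would be the matching change. A common alternative is to skip the new token and set `tokenizer.pad_token = tokenizer.eos_token`, which reuses an id the model already knows.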


p0lyMth commented Jul 10, 2024

@Santhu489, if you want help, you need to provide a small dataset sample of sfari_genes.csv for reproducibility. The code snippet

data = pd.read_csv('drive/MyDrive/Gene/sfari_genes.csv')

is not enough.
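
A minimal sketch of one way to produce such a sample, assuming the file has the two columns described in the issue. The frame below is a placeholder with made-up values, not data from sfari_genes.csv; with the real file, `data` would come from `pd.read_csv` instead.

```python
import io

import pandas as pd

# Placeholder frame standing in for the real file; all values are hypothetical.
data = pd.DataFrame({
    "gene-symbol": ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E", "GENE_F"],
    "syndromic": [0, 1, 0, 1, 0, 1],
})

# Take the first two rows of each class so both labels appear in the sample.
sample = data.groupby("syndromic").head(2).reset_index(drop=True)

# CSV text that can be pasted directly into an issue comment.
sample_csv = sample.to_csv(index=False)
print(sample_csv)

# Anyone reading the issue can rebuild the frame from the pasted text.
restored = pd.read_csv(io.StringIO(sample_csv))
```

Sampling per class keeps both labels in the excerpt, so the classifier step can still run on it.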
