I am currently working on an "Autism gene classifier" project, which is a binary-classification system. I have a gene dataset with the columns gene-symbol and syndromic (0 and 1).
The model I am using is GPT-2, and when I run my code on Google Colab I get the error IndexError: index out of range in self.
This is my code:
# Install required libraries
!pip install torch
!pip install transformers
!pip install pandas
!pip install scikit-learn
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from transformers import GPT2Tokenizer, GPT2Model
import torch
# Load your gene data (assuming you have a CSV file with 'gene-symbol' and 'syndromic' columns)
data = pd.read_csv('drive/MyDrive/Gene/sfari_genes.csv')
# Split the data into training and testing sets
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)
# Initialize GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
# Add a new pad token
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
# Tokenize and encode the gene symbols with padding and truncation (max_length=1024)
train_tokens = tokenizer(train_data['gene-symbol'].tolist(), padding=True, truncation=True, max_length=1024, return_tensors='pt')
test_tokens = tokenizer(test_data['gene-symbol'].tolist(), padding=True, truncation=True, max_length=1024, return_tensors='pt')
# Extract embeddings from GPT-2 model
model.eval()
with torch.no_grad():
    train_embeddings = model(**train_tokens).last_hidden_state.mean(dim=1)
    test_embeddings = model(**test_tokens).last_hidden_state.mean(dim=1)
# Flatten the embeddings to be used as input to logistic regression
train_embeddings = train_embeddings.view(train_embeddings.size(0), -1)
test_embeddings = test_embeddings.view(test_embeddings.size(0), -1)
# Train logistic regression classifier
clf = LogisticRegression(random_state=42)
clf.fit(train_embeddings, train_data['syndromic'])
# Evaluate logistic regression classifier
train_predictions = clf.predict(train_embeddings)
train_accuracy = accuracy_score(train_data['syndromic'], train_predictions)
print("Training accuracy:", train_accuracy)
The IndexError is raised when I try to extract the embeddings from the GPT-2 model. I also tried setting max_length to 512 and to 1024, but neither works for me. How can I resolve this error?
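For reference, this particular IndexError usually means that a torch.nn.Embedding layer is being indexed with a token id larger than the model's vocabulary: tokenizer.add_special_tokens({'pad_token': '[PAD]'}) assigns the new [PAD] token id 50257, but GPT2Model.from_pretrained('gpt2') still has an embedding table for ids 0-50256, so any padded batch can trigger the error regardless of max_length. Below is a minimal sketch of the two common workarounds; both use standard Hugging Face Transformers calls, and the variable names match the code above.

# Sketch: two ways to keep the padding id inside the model's vocabulary
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

# Option 1: keep the dedicated [PAD] token, then grow the model's
# embedding matrix so the new id (50257) has a row to look up.
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))

# Option 2: reuse the existing end-of-text token as the padding token,
# so no new id is created at all.
# tokenizer.pad_token = tokenizer.eos_token

Either option keeps every id produced by the tokenizer within the range the GPT-2 embedding layer can index, which should remove the error for both max_length settings.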