Hi.
First of all, thank you for making such a model available to us.
I am trying to get vector embeddings for the abstracts of some PubMed articles, but I couldn't get the sentence embeddings to come out as expected. More precisely, I wrote the code below, and the vectors I obtain have dimension 2560. But the Hugging Face page says the sequence length is 1024, so I understood that an embedding vector should have dimension 1024. Am I wrong?
Can you help with getting sentence embeddings?
Best wishes.
Orhan
import json

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

# "BioMedLM" is the local path / Hub id used here; on the Hugging Face Hub
# the model is published as "stanford-crfm/BioMedLM".
tokenizer = AutoTokenizer.from_pretrained("BioMedLM")
model = AutoModel.from_pretrained("BioMedLM")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2-style tokenizers ship without a pad token

with open("articles.json", "r") as f:
    data = json.load(f)
data_abst = [item["abstract"] for item in data]
data_title = [item["title"] for item in data]

def normalizer(x):
    # Scale the vector to unit length.
    return x / np.linalg.norm(x)

class BioMedLM:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def sentence_vectors(self, sentence):
        inputs = self.tokenizer(sentence, padding=True, truncation=True, return_tensors="pt")
        outputs = self.model(**inputs)
        # First element of the model output holds the token-level hidden states,
        # shape (batch, seq_len, hidden_size).
        token_embeddings = outputs[0]
        # Mean-pool over token positions, masking out padding.
        input_mask_expanded = inputs.attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        vec = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        return vec[0]

gpt_class = BioMedLM(model, tokenizer)

def sentence_encoder(data):
    vectors = []
    normalized_vectors = []
    for text in data:
        sentence_vector = gpt_class.sentence_vectors(text).detach().numpy()
        vectors.append(sentence_vector)
        normalized_vectors.append(normalizer(sentence_vector))
    vectors = np.squeeze(np.array(vectors))
    normalized_vectors = np.squeeze(np.array(normalized_vectors))
    return vectors, normalized_vectors

abst_vectors, abst_vectors_norm = sentence_encoder(data_abst)
I'm not super familiar with generating document-level representations from GPT-2 models, but your code looks like it is mean-pooling the hidden states across positions (summing them, weighted by the attention mask, and dividing by the token count) and then normalizing. That would give a vector of size 2560. Another option is to just take the final hidden state, which would also be size 2560. Either way I would expect the document-level vector to be 2560-dimensional, since you are using some pooling scheme to combine the size-2560 per-token vectors into one final vector: 2560 is the model's hidden size, while 1024 is its maximum sequence length.
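A quick way to see that these are two different numbers is to inspect the model config. A minimal sketch, assuming the standard GPT-2-style config fields (n_embd and n_positions) that BioMedLM's GPT2Config exposes:

from transformers import AutoConfig

# Load only the config (no weights) to compare the two values.
config = AutoConfig.from_pretrained("stanford-crfm/BioMedLM")
print(config.n_embd)       # hidden size of each token vector: 2560
print(config.n_positions)  # maximum sequence length in tokens: 1024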
Could you point to a paper, algorithm, or code showing how you want to generate the final abstract-level representations? As I said, your method looks like it averages all of the final-layer states and normalizes; I think typically one would just take the final hidden state of the sequence.
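A minimal sketch of that last-token alternative, assuming a single unpadded input so that position -1 is the last real token (the helper name last_token_vector is mine, not part of any library):

import torch

def last_token_vector(model, tokenizer, text):
    # Encode the text and take the hidden state of the final token, which in a
    # causal model has attended to the entire preceding sequence.
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs)[0]  # shape (1, seq_len, 2560)
    return hidden[0, -1]             # final position's hidden state, size 2560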
What task do you want to use these abstract-level vectors for?