sentence embedding #15

Open
orhansonmeztr opened this issue Apr 17, 2023 · 2 comments

@orhansonmeztr

Hi.
First of all, thank you for making such a model available to us.
I am trying to get vector embeddings of the abstracts of some PubMed articles, but I haven't been able to get sentence embeddings to work. More precisely, with the code below the vectors I obtain have dimension 2560, while the Hugging Face page says the sequence length is 1024, so I expected each embedding vector to have dimension 1024. Am I wrong?
Can you help me get sentence embeddings?
Best wishes.
Orhan

import json
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("BioMedLM")
model = AutoModel.from_pretrained("BioMedLM")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2-style models have no pad token by default

with open('articles.json', "r") as f:
    data = json.loads(f.read())
data_abst = [data[i]['abstract'] for i in range(len(data))]
data_title = [data[i]['title'] for i in range(len(data))]

def normalizer(x):     
    normalized_vector = x / np.linalg.norm(x)
    return normalized_vector

class BioMedLM:    
    def __init__(self, model, tokenizer):
        # self.sentence = sentence
        self.model = model
        self.tokenizer = tokenizer

    def sentence_vectors(self,sentence):
        inputs = self.tokenizer(sentence, padding=True, truncation=True, return_tensors="pt")
        w_vectors = self.model(**inputs)

        # return w_vectors
        token_embeddings = w_vectors[0]  # first element of the model output holds the per-token hidden states
        # Mask out padding positions and mean-pool the token embeddings over the sequence dimension
        input_mask_expanded = inputs.attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        vec = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        return vec[0]

gpt_class = BioMedLM(model, tokenizer)

def sentence_encoder(data):
    vectors = []
    normalized_vectors = []
    for i in range(len(data)):
        sentence_vectors = gpt_class.sentence_vectors(data[i]).detach().numpy()
        vectors.append(sentence_vectors)
        normalized_vectors.append(normalizer(sentence_vectors))

    vectors = np.squeeze(np.array(vectors))
    normalized_vectors = np.squeeze(np.array(normalized_vectors))

    return vectors, normalized_vectors


abst_vectors, abst_vectors_norm = sentence_encoder(data_abst) 
@J38
Contributor

J38 commented Apr 18, 2023

I'm not super familiar with generating document-level representations from GPT-2 models, but your code looks like it is summing the hidden states across positions and normalizing? That would give a 2560-dimensional vector. Another option is to just take the final hidden state, which would also be 2560-dimensional. Either way I would expect the document-level vector to be 2560-dimensional, since whatever algorithm you use combines the size-2560 per-token vectors into one final vector.

Could you point to the paper, algorithm, or code describing how you want to generate the final abstract-level representations? As I said, it looks like your method is to add up all of the final hidden states and normalize. I think typically one would just take the final hidden state of the sequence.

What task do you want to use these abstract-level vectors for?
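
For reference, a minimal sketch of the last-hidden-state pooling described above, reusing the tokenizer and model objects from the original snippet; last_token_vector is a hypothetical helper name, and the index of the last real token is read from the attention mask:

import torch

def last_token_vector(model, tokenizer, sentence):
    # Tokenize a single abstract; the hidden states come back with shape (1, seq_len, n_embd)
    inputs = tokenizer(sentence, truncation=True, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = model(**inputs)[0]
    # Position of the last non-padding token according to the attention mask
    last_idx = int(inputs.attention_mask.sum(dim=1)[0]) - 1
    return token_embeddings[0, last_idx]  # still a 2560-dimensional vector

Whichever pooling is used, the result has the model's hidden size (2560), not the 1024 maximum sequence length.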

@Mentholatum

In the BioMedLM/config.json file on Hugging Face, the settings include "n_embd": 2560 and "n_head": 20.
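
That is the key distinction here: n_embd is the hidden size that every per-token (and therefore every pooled) vector has, while the 1024 mentioned in the original post is the maximum sequence length. A quick way to check, assuming the same local "BioMedLM" path used in the snippet above and standard GPT-2 config field names:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("BioMedLM")
print(config.n_embd)       # 2560 -> dimension of each token vector and of any pooled sentence vector
print(config.n_positions)  # 1024 -> maximum number of input tokens, not the vector size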
