sentence embedding #15

Open
orhansonmeztr opened this issue Apr 17, 2023 · 2 comments

@orhansonmeztr

Hi.
First of all, thank you for making such a model available to us.
I am trying to get vector embeddings of the abstracts of some PubMed articles, but I haven't been able to get sentence embeddings to work. More precisely, with the code below the vectors I obtain have dimension 2560, while the Hugging Face page says the sequence length is 1024, so I expected each embedding vector to have dimension 1024. Am I wrong?
Can you help me get sentence embeddings?
Best wishes.
Orhan

import json
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("BioMedLM")
model = AutoModel.from_pretrained("BioMedLM")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2-style models have no pad token by default

with open('articles.json', "r") as f:
    data = json.loads(f.read())
data_abst = [data[i]['abstract'] for i in range(len(data))]
data_title = [data[i]['title'] for i in range(len(data))]

def normalizer(x):     
    normalized_vector = x / np.linalg.norm(x)
    return normalized_vector

class BioMedLM:    
    def __init__(self, model, tokenizer):
        # self.sentence = sentence
        self.model = model
        self.tokenizer = tokenizer

    def sentence_vectors(self,sentence):
        inputs = self.tokenizer(sentence, padding=True, truncation=True, return_tensors="pt")
        w_vectors = self.model(**inputs)

        # return w_vectors
        token_embeddings = w_vectors[0]  # first element of the model output holds the per-token hidden states
        # Mask out padding positions and mean-pool the token embeddings over the sequence dimension
        input_mask_expanded = inputs.attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        vec = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        return vec[0]

gpt_class = BioMedLM(model, tokenizer)

def sentence_encoder(data):
    vectors = []
    normalized_vectors = []
    for i in range(len(data)):
        sentence_vectors = gpt_class.sentence_vectors(data[i]).detach().numpy()
        vectors.append(sentence_vectors)
        normalized_vectors.append(normalizer(sentence_vectors))

    vectors = np.squeeze(np.array(vectors))
    normalized_vectors = np.squeeze(np.array(normalized_vectors))

    return vectors, normalized_vectors


abst_vectors, abst_vectors_norm = sentence_encoder(data_abst) 
@J38
Contributor

J38 commented Apr 18, 2023

I'm not super familiar with generating document-level representations from GPT-2 models, but your code looks like it is summing the hidden states across positions and normalizing? That would give a 2560-dimensional vector. Another option is to just take the final hidden state, which would also be 2560-dimensional. Either way I would expect the document-level vector to be 2560-dimensional, since whatever algorithm you use combines the size-2560 per-token vectors into one final vector.

Could you point to the paper, algorithm, or code describing how you want to generate the final abstract-level representations? As I said, it looks like your method is to add up all of the final hidden states and normalize. I think typically one would just take the final hidden state of the sequence.

What task do you want to use these abstract-level vectors for?
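
For reference, a minimal sketch of the last-hidden-state pooling described above, reusing the tokenizer and model objects from the original snippet; last_token_vector is a hypothetical helper name, and the index of the last real token is read from the attention mask:

import torch

def last_token_vector(model, tokenizer, sentence):
    # Tokenize a single abstract; the hidden states come back with shape (1, seq_len, n_embd)
    inputs = tokenizer(sentence, truncation=True, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = model(**inputs)[0]
    # Position of the last non-padding token according to the attention mask
    last_idx = int(inputs.attention_mask.sum(dim=1)[0]) - 1
    return token_embeddings[0, last_idx]  # still a 2560-dimensional vector

Whichever pooling is used, the result has the model's hidden size (2560), not the 1024 maximum sequence length.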

@Mentholatum

In the BioMedLM/config.json file on Hugging Face, the settings include "n_embd": 2560 and "n_head": 20.
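
That is the key distinction here: n_embd is the hidden size that every per-token (and therefore every pooled) vector has, while the 1024 mentioned in the original post is the maximum sequence length. A quick way to check, assuming the same local "BioMedLM" path used in the snippet above and standard GPT-2 config field names:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("BioMedLM")
print(config.n_embd)       # 2560 -> dimension of each token vector and of any pooled sentence vector
print(config.n_positions)  # 1024 -> maximum number of input tokens, not the vector size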
