
The problem about the beginning index #2

Open
Loose-Gu opened this issue Aug 7, 2023 · 1 comment

Comments


Loose-Gu commented Aug 7, 2023

The scorer computes the scores with sequence_cross_entropy_with_logits(). I notice that the beginning index of the sliced tensors is different from the implementation in EPR.
in UDR:
loss_list = sequence_cross_entropy_with_logits(logits=output.logits[:, :-1].contiguous(), targets=entry.input_ids[:, 1:].contiguous(), weights=pad_mask, average=None)
in EPR:
loss_list = sequence_cross_entropy_with_logits(logits=output.logits, targets=entry.input_ids[:, 1:], weights=pad_mask, average=None)
So I wondered what the input actually is, and found this in scorer_dsr.py:
tokenized_example = self.tokenizer.encode_plus(enc_text, truncation=True, add_special_tokens=False, return_tensors='pt')
tokenized_labels = self.tokenizer.encode_plus(test_answer, truncation=True, add_special_tokens=False, return_tensors='pt')
Since special tokens aren't added to the inputs, why do we need to exclude the first token of the targets and the last position of the logits?
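For background, the shift in question is the standard causal-LM alignment: the logits at position t score the token at position t+1, which is why the UDR call pairs logits[:, :-1] with input_ids[:, 1:]. A minimal pure-Python sketch with made-up scores (not the repo's actual scorer, which operates on batched tensors):

```python
import math

def sequence_nll(logits, targets):
    """Per-step negative log-likelihood, pairing logits[t] with targets[t].

    logits: list of dicts mapping candidate token -> unnormalized score
    targets: list of tokens each corresponding position should predict
    """
    losses = []
    for step_scores, target in zip(logits, targets):
        # log-sum-exp normalizer over the (toy) vocabulary at this step
        z = math.log(sum(math.exp(s) for s in step_scores.values()))
        losses.append(z - step_scores[target])
    return losses

input_ids = ["The", "cat", "sat"]
# Fake causal-LM logits: the scores at position t are predictions for token t+1.
logits = [
    {"cat": 2.0, "dog": 0.5, "sat": 0.1},  # after reading "The"
    {"cat": 0.2, "dog": 0.3, "sat": 1.8},  # after reading "The cat"
    {"cat": 0.4, "dog": 0.4, "sat": 0.4},  # after the full prefix (predicts an unseen next token)
]

# UDR-style alignment: drop the last logit step and the first target token,
# so each remaining logit step is scored against the token it predicts.
losses = sequence_nll(logits[:-1], input_ids[1:])
```

The last logit step has no target inside the sequence (it predicts a token past the end), and the first token has no preceding logit step, so both are dropped.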

@LeeSureman

In UDR's implementation, we score examples batch by batch, whereas EPR's code scores examples one by one. To implement this batch-parallel example scoring, we adjust the examples' pad positions during tokenization, which leads to the difference you mention.
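A rough sketch of why pad position matters for batch scoring (hypothetical helper names, not the repo's code): variable-length sequences are padded to a common width, and the pad mask (the weights argument above) zeroes the loss at pad positions, so each example's total score is unaffected by where the padding sits:

```python
PAD = 0  # assumed pad token id for this toy example

def pad_batch(sequences, pad_side="right"):
    """Pad variable-length token-id lists into a rectangular batch plus a 0/1 mask."""
    width = max(len(s) for s in sequences)
    batch, mask = [], []
    for s in sequences:
        pads = [PAD] * (width - len(s))
        if pad_side == "right":
            batch.append(s + pads)
            mask.append([1] * len(s) + [0] * len(pads))
        else:  # left padding
            batch.append(pads + s)
            mask.append([0] * len(pads) + [1] * len(s))
    return batch, mask

def masked_sum(per_token_loss, mask):
    """Sum each row's per-token losses, counting only non-pad positions."""
    return [sum(l * m for l, m in zip(row, mrow))
            for row, mrow in zip(per_token_loss, mask)]

seqs = [[5, 6], [7, 8, 9, 3]]          # two examples of different lengths
per_token = [[1.0] * 4] * 2            # pretend every position costs 1.0 nats
for side in ("right", "left"):
    _, mask = pad_batch(seqs, side)
    print(side, masked_sum(per_token, mask))  # same per-example totals either way
```

With the mask applied, both padding sides yield the same per-example totals (2.0 and 4.0 here), matching what one-by-one scoring would give.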
