Mismatch between the CLS token position in the indices array and the inverse mask array #4

Open
nesemenpolkov opened this issue Mar 3, 2023 · 0 comments

Dear Ivan,

Thank you for the great work. While using your IMDBDataset code, I found a few strange things. First, the array of indices and the array of inverse token mask values do not line up from the beginning of the sequence because of the CLS token (see the reference below). Second, I was surprised to find a CLS token in the second sentence as well. I may be wrong, but the original BERT uses only one CLS token, at the beginning of the sequence. Finally, calculating the length of the vocab on every call could be very expensive. I hope these notes help you make the code better and clearer. I would be glad to take part and help fix it.

I am looking forward to your reply,
Nesemenpolkov.

Reference:

def _create_item(self, first: typing.List[str], second: typing.List[str], target: int = 1):
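
For illustration, a minimal sketch of the layout described above: a single leading CLS token, an inverse mask aligned position by position with the indices, and the vocab size computed once. This is not the repository's code. `build_pair`, the special-token strings, the plain `dict` vocab, and the convention that `True` in the inverse mask means "not masked" are all assumptions made for the example.

```python
import typing

# Hypothetical special tokens; the real dataset may name them differently.
CLS, SEP, MASK = "[CLS]", "[SEP]", "[MASK]"


def build_pair(
    first: typing.List[str],
    second: typing.List[str],
    vocab: typing.Dict[str, int],
    masked_positions: typing.List[int],
) -> typing.Tuple[typing.List[int], typing.List[bool]]:
    """Build one sentence-pair item with a single leading [CLS].

    masked_positions index into the full sequence, [CLS] included.
    Assumed convention: inverse_mask[i] is True when token i is NOT masked.
    """
    # One [CLS] at the start only; the second sentence begins right after [SEP].
    tokens = [CLS] + first + [SEP] + second + [SEP]

    # Create the mask from the final token list so both arrays share offsets.
    inverse_mask = [True] * len(tokens)
    for pos in masked_positions:
        if tokens[pos] in (CLS, SEP):
            raise ValueError("special tokens should not be masked")
        tokens[pos] = MASK
        inverse_mask[pos] = False

    indices = [vocab[token] for token in tokens]
    return indices, inverse_mask


vocab = {t: i for i, t in enumerate([CLS, SEP, MASK, "a", "b", "c"])}
vocab_size = len(vocab)  # computed once and reused, not once per item

indices, inverse_mask = build_pair(["a", "b"], ["c"], vocab, masked_positions=[1])
```

Because the mask is built from the same token list that already contains CLS, position `i` in `inverse_mask` always describes position `i` in `indices`.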
