Mismatch between CLS token position in the indices array and the inverse mask array #4

nesemenpolkov · 2023-03-03T06:59:38Z

Dear Ivan,

All of the work that you did is great. But while using your IMDBDataset code I found some strange things. The array of indices and the array of inverse token mask values do not match each other from the beginning of the sequence because of the CLS token (see the reference below). It was also strange to find a CLS token in the second sentence. I may be wrong, but the original BERT uses only one CLS token, at the beginning of the sequence. Finally, calculating the length of the vocab each time could be very expensive. I hope my notes will help you make your code better and clearer. I would be proud if you gave me the opportunity to take part in this and help you fix it.

I am looking forward to your reply,
Nesemenpolkov.

Reference:

pytorch_bert/bert/dataset.py

Line 145 in 2cd4724

    def _create_item(self, first: typing.List[str], second: typing.List[str], target: int = 1):
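
To make the points above concrete, here is a minimal sketch of the layout being asked for: a single CLS token at the start of the pair, one inverse-mask entry per token in the same order, and the vocabulary size read once and reused. It is not the repository's code. The names `build_pair`, `token_to_index`, `first_masked` and `second_masked` are hypothetical, the vocabulary is a plain dict rather than the project's vocab object, and it assumes the inverse token mask is `False` only at positions replaced for the masked-language-model task.

```python
from typing import Dict, List, Tuple

# Special tokens; the exact strings are an assumption for this sketch.
PAD, CLS, SEP, MASK, UNK = "[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"


def build_pair(
    first: List[str],
    second: List[str],
    first_masked: List[bool],
    second_masked: List[bool],
    token_to_index: Dict[str, int],
) -> Tuple[List[int], List[bool]]:
    """Join two tokenized sentences as [CLS] first [SEP] second.

    first_masked[i] is True where first[i] was replaced for the MLM task
    (likewise for second_masked). Returns (indices, inverse_token_mask),
    where inverse_token_mask[i] is False only at masked positions.
    """
    # Only one CLS token, at the very beginning of the pair.
    tokens = [CLS] + first + [SEP] + second

    # One mask entry per token, built in the same order as `tokens`,
    # so the CLS and SEP slots are accounted for explicitly.
    inverse_token_mask = (
        [True]
        + [not m for m in first_masked]
        + [True]
        + [not m for m in second_masked]
    )

    unk = token_to_index[UNK]
    indices = [token_to_index.get(t, unk) for t in tokens]

    # Guard against the two arrays drifting apart.
    assert len(indices) == len(inverse_token_mask)
    return indices, inverse_token_mask


# Toy vocabulary, for illustration only.
vocab = {
    tok: i
    for i, tok in enumerate(
        [PAD, CLS, SEP, MASK, UNK, "good", "movie", "bad", "plot"]
    )
}
vocab_size = len(vocab)  # read once here and reused, not recomputed per item

indices, inverse_token_mask = build_pair(
    first=["good", MASK],
    second=["bad", "plot"],
    first_masked=[False, True],
    second_masked=[False, False],
    token_to_index=vocab,
)
```

With this layout, position `i` of `inverse_token_mask` always describes position `i` of `indices`, including the leading CLS slot, and the second sentence carries no CLS token of its own.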
