Imprecise description about removing token "pu" in section Unigram tokenization #699

yaojingguo · 2024-04-21T15:16:30Z

https://huggingface.co/learn/nlp-course/chapter6/7?fw=pt says:

In this (very) particular case, we had two equivalent tokenizations of all the words: as we saw earlier, for example, "pug" could be tokenized ["p", "ug"] with the same score. Thus, removing the "pu" token from the vocabulary will give the exact same loss.

But the following list from the link shows that "pun" is tokenized as "pu" and "n". If the "pu" token is removed, "pun" has to be tokenized differently, so its score may change. The loss stays the same only if "pun" keeps the same score after "pu" is removed.

"hug": ["hug"] (score 0.071428)
"pug": ["pu", "g"] (score 0.007710)
"pun": ["pu", "n"] (score 0.006168)
"bun": ["bu", "n"] (score 0.001451)
"hugs": ["hug", "s"] (score 0.001701)

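One way to check this condition is to recompute the best score of each word with and without "pu". The sketch below is only illustrative: the token frequencies are assumed from the toy example in the linked chapter (they are not stated in this issue), `best_score` is a hypothetical helper name, and the normalizing total is assumed to stay fixed after the token is removed.

```python
from math import isclose

# Assumed token frequencies from the linked chapter's toy example
# (not given in this issue).
freqs = {
    "h": 15, "u": 36, "g": 20, "hu": 15, "ug": 20,
    "p": 17, "pu": 17, "n": 16, "un": 16, "b": 4,
    "bu": 4, "s": 5, "hug": 15, "gs": 5, "ugs": 5,
}
words = ["hug", "pug", "pun", "bun", "hugs"]


def best_score(word, vocab, total):
    """Highest product of token probabilities over all segmentations of `word`.

    Hypothetical helper: a small dynamic program over prefixes of the word.
    """
    # best[i] holds the best score found for the prefix word[:i].
    best = [1.0] + [0.0] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in vocab:
                candidate = best[start] * vocab[piece] / total
                best[end] = max(best[end], candidate)
    return best[-1]


# Assumption: the normalizer is NOT recomputed after removing a token.
total = sum(freqs.values())
without_pu = {tok: f for tok, f in freqs.items() if tok != "pu"}

scores_before = {w: best_score(w, freqs, total) for w in words}
scores_after = {w: best_score(w, without_pu, total) for w in words}

for w in words:
    same = isclose(scores_before[w], scores_after[w])
    print(f"{w}: before={scores_before[w]:.6f} after={scores_after[w]:.6f} same={same}")
```

With these assumed frequencies, "pun" can still be split as ["p", "un"] once "pu" is gone, so comparing the two score tables tests exactly the condition described above. If the total were renormalized after the removal, every score would shift and the comparison would need to be redone.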