
BPE tokenizer Implementation #2

Open

albertpurnama (Contributor) opened this issue May 5, 2024 · 5 comments

Hi,

I'm interested in contributing to implementing the BPE tokenizer.

Since we're using GPT-2 encoding (as shown in the preprocessors), I think we can use the original tiktoken-go implementation. However, this means we can't yet train our own tokenizer (to produce our own BPE ranks).

This might be important if people want to experiment with changing the vocabulary size and "train" their own tokenizer. However, it may also be out of scope for this repo's (or llm.c's) purpose.
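For reference, wiring in tiktoken-go for GPT-2 encoding could look roughly like the sketch below. The pkoukk/tiktoken-go import path and the GetEncoding/Encode calls are assumptions based on that library's README, not code from this repo, so they're worth double-checking:

```go
package main

import (
	"fmt"

	// Assumed package: github.com/pkoukk/tiktoken-go (check its README for the exact API).
	tiktoken "github.com/pkoukk/tiktoken-go"
)

func main() {
	// "gpt2" is the tiktoken encoding name for the GPT-2 BPE ranks.
	enc, err := tiktoken.GetEncoding("gpt2")
	if err != nil {
		panic(err)
	}
	// Encode takes the text plus allowed/disallowed special-token lists.
	ids := enc.Encode("hello world", nil, nil)
	fmt.Println(ids)
}
```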

Maybe to start, can you elaborate on why the existing trie encoding is incorrect? Is the merging of tokens not done according to the original GPT-2 tokenizer spec? Or is it a performance issue with the trie structure in general?

joshcarp (Owner) commented May 7, 2024

Hey, it would be awesome if you could contribute that.

joshcarp (Owner) commented May 7, 2024

The reason the current implementation is incorrect is that it doesn't respect merge order. In BPE, tokenization should follow the merge order, which corresponds to the most-to-least frequent merge pairs in the training data.

For example, if we take the BPE encoding of the string "00000000000000000" ('0' × 17):

  • split: ('0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0')
  • for each pair, (i += 2), check merge rules -> merge
  • ('00', '00', '00', '00', '00', '00', '00', '00', '0')
  • ('00', '00', '00', '00', '00', '00', '00', '000')
  • ('0000', '0000', '0000', '00', '000')
  • ('00000000', '0000', '00', '000')
  • ('00000000', '000000', '000')

So the final tokenization is [8269, 10535, 830], corresponding to the strings ['00000000', '000000', '000'].
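For concreteness, here's a minimal Go sketch of rank-ordered merging. The `ranks` map below is a toy merge table just to show the mechanism, not the real GPT-2 one; with the real GPT-2 ranks loaded from the released merges file, this procedure should reproduce the [8269, 10535, 830] result above.

```go
package main

import "fmt"

// mergeByRank applies BPE merges in rank order: on every pass it finds the
// adjacent pair with the lowest rank (i.e. the most frequent pair in the
// training data) and merges that occurrence first.
func mergeByRank(parts []string, ranks map[string]int) []string {
	for {
		bestIdx, bestRank := -1, 0
		for i := 0; i+1 < len(parts); i++ {
			if r, ok := ranks[parts[i]+" "+parts[i+1]]; ok && (bestIdx == -1 || r < bestRank) {
				bestIdx, bestRank = i, r
			}
		}
		if bestIdx == -1 {
			return parts // no mergeable pair left
		}
		merged := parts[bestIdx] + parts[bestIdx+1]
		parts = append(parts[:bestIdx], append([]string{merged}, parts[bestIdx+2:]...)...)
	}
}

func main() {
	// Toy merge table (lower rank = learned earlier = merged first).
	ranks := map[string]int{"h e": 0, "l l": 1, "he ll": 2, "hell o": 3}
	fmt.Println(mergeByRank([]string{"h", "e", "l", "l", "o"}, ranks))
	// Output: [hello]
}
```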

The trie implementation only looks for the longest prefix, so it would walk down the path for 0000000000000000, find that there is no longer match, and return 0000000000000000 as the first token and 0 as the second token:
tokens: [25645, 15] strings: ['0000000000000000', '0']
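For comparison, the greedy longest-prefix behaviour looks roughly like this sketch (a plain map stands in for the trie, and the toy vocabulary of zero-runs is an assumption for illustration):

```go
package main

import "fmt"

// greedyLongestPrefix repeatedly takes the longest vocabulary entry that is a
// prefix of the remaining input -- the behaviour of the current trie lookup.
func greedyLongestPrefix(s string, vocab map[string]bool) []string {
	var out []string
	for len(s) > 0 {
		end := len(s)
		for end > 1 && !vocab[s[:end]] {
			end--
		}
		out = append(out, s[:end]) // falls back to a single byte if nothing matches
		s = s[end:]
	}
	return out
}

func main() {
	// Toy vocabulary: assume runs of zeros up to 16 characters are all valid tokens
	// (16 zeros is token 25645 in the example above).
	vocab := map[string]bool{}
	zeros := ""
	for i := 0; i < 16; i++ {
		zeros += "0"
		vocab[zeros] = true
	}
	input := zeros + "0" // 17 zeros
	fmt.Println(greedyLongestPrefix(input, vocab))
	// Output: [0000000000000000 0]
}
```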

The differences between the two approaches are basically:

  • BPE merge rules ensure you get the most frequent tokens
  • Longest prefix ensures you get the longest tokens
    It still works because the output is valid in the vocabulary, but it means you're feeding the model less frequent tokens as inputs, which is suboptimal

Hk669 commented Jun 9, 2024

@joshcarp what do you think about this implementation, BPETokenizer? Could I contribute it here, or is this issue already addressed? Thanks.

joshcarp (Owner) commented Jun 9, 2024

Originally I wanted this to be a no-dependency project, used as an example of every moving part of the transformer without any external dependencies, but I've been too distracted with other things to come back to it.

If you want to raise a PR that removes all of the current tokenizer stuff and puts BPETokenizer in, that would be great, especially just to get the repo into a good working state.

I'm still going to do some BPE implementation in this repo as part of plans for a blog post, similar to the Annotated Transformer, but it doesn't need to be fully feature-complete.

Hk669 commented Jun 9, 2024

That's great @joshcarp, let me raise a PR with the BPETokenizer (I'm thinking of using the same GPT-4 pattern to split the tokens, in case anyone wants to train their own tokenizer). Thoughts on this?
