
BPE tokenizer Implementation #2

Open

albertpurnama (Contributor) opened this issue May 5, 2024 · 5 comments

Hi,

I'm interested in contributing to implementing the BPE tokenizer.

Since we're using GPT-2 encoding (as shown in the preprocessors), I think we can use the original tiktoken-go implementation. However, this means we can't yet train our own tokenizer (to produce our own BPE ranks).

This might be important if people want to experiment with changing the vocabulary size and "train" their own tokenizer. However, it may also be out of scope for this repo's (or llm.c's) purpose.
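For reference, wiring in tiktoken-go for GPT-2 encoding could look roughly like the sketch below. The pkoukk/tiktoken-go import path and the GetEncoding/Encode calls are assumptions based on that library's README, not code from this repo, so they're worth double-checking:

```go
package main

import (
	"fmt"

	// Assumed package: github.com/pkoukk/tiktoken-go (check its README for the exact API).
	tiktoken "github.com/pkoukk/tiktoken-go"
)

func main() {
	// "gpt2" is the tiktoken encoding name for the GPT-2 BPE ranks.
	enc, err := tiktoken.GetEncoding("gpt2")
	if err != nil {
		panic(err)
	}
	// Encode takes the text plus allowed/disallowed special-token lists.
	ids := enc.Encode("hello world", nil, nil)
	fmt.Println(ids)
}
```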

Maybe to start, can you elaborate on why the existing trie encoding is incorrect? Is the merging of tokens not done according to the original GPT-2 tokenizer spec? Or is it a performance issue with the trie structure in general?

joshcarp (Owner) commented May 7, 2024

Hey, it would be awesome if you could contribute that.

joshcarp (Owner) commented May 7, 2024

The reason the current implementation is incorrect is that it doesn't respect merge order. In BPE, tokenization should follow the merge order, which corresponds to the most-to-least frequent merge pairs in the training data.

For example, if we take the BPE encoding of the string "00000000000000000" ('0' × 17):

  • split: ('0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0')
  • for each pair, (i += 2), check merge rules -> merge
  • ('00', '00', '00', '00', '00', '00', '00', '00', '0')
  • ('00', '00', '00', '00', '00', '00', '00', '000')
  • ('0000', '0000', '0000', '00', '000')
  • ('00000000', '0000', '00', '000')
  • ('00000000', '000000', '000')

So the final tokenization is [8269, 10535, 830], corresponding to the strings ['00000000', '000000', '000'].
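For concreteness, here's a minimal Go sketch of rank-ordered merging. The `ranks` map below is a toy merge table just to show the mechanism, not the real GPT-2 one; with the real GPT-2 ranks loaded from the released merges file, this procedure should reproduce the [8269, 10535, 830] result above.

```go
package main

import "fmt"

// mergeByRank applies BPE merges in rank order: on every pass it finds the
// adjacent pair with the lowest rank (i.e. the most frequent pair in the
// training data) and merges that occurrence first.
func mergeByRank(parts []string, ranks map[string]int) []string {
	for {
		bestIdx, bestRank := -1, 0
		for i := 0; i+1 < len(parts); i++ {
			if r, ok := ranks[parts[i]+" "+parts[i+1]]; ok && (bestIdx == -1 || r < bestRank) {
				bestIdx, bestRank = i, r
			}
		}
		if bestIdx == -1 {
			return parts // no mergeable pair left
		}
		merged := parts[bestIdx] + parts[bestIdx+1]
		parts = append(parts[:bestIdx], append([]string{merged}, parts[bestIdx+2:]...)...)
	}
}

func main() {
	// Toy merge table (lower rank = learned earlier = merged first).
	ranks := map[string]int{"h e": 0, "l l": 1, "he ll": 2, "hell o": 3}
	fmt.Println(mergeByRank([]string{"h", "e", "l", "l", "o"}, ranks))
	// Output: [hello]
}
```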

The trie implementation only looks for the longest prefix, so it would walk down the path for 0000000000000000, find that there is no longer match, and return 0000000000000000 as the first token and 0 as the second token:
tokens: [25645, 15] strings: ['0000000000000000', '0']
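For comparison, the greedy longest-prefix behaviour looks roughly like this sketch (a plain map stands in for the trie, and the toy vocabulary of zero-runs is an assumption for illustration):

```go
package main

import "fmt"

// greedyLongestPrefix repeatedly takes the longest vocabulary entry that is a
// prefix of the remaining input -- the behaviour of the current trie lookup.
func greedyLongestPrefix(s string, vocab map[string]bool) []string {
	var out []string
	for len(s) > 0 {
		end := len(s)
		for end > 1 && !vocab[s[:end]] {
			end--
		}
		out = append(out, s[:end]) // falls back to a single byte if nothing matches
		s = s[end:]
	}
	return out
}

func main() {
	// Toy vocabulary: assume runs of zeros up to 16 characters are all valid tokens
	// (16 zeros is token 25645 in the example above).
	vocab := map[string]bool{}
	zeros := ""
	for i := 0; i < 16; i++ {
		zeros += "0"
		vocab[zeros] = true
	}
	input := zeros + "0" // 17 zeros
	fmt.Println(greedyLongestPrefix(input, vocab))
	// Output: [0000000000000000 0]
}
```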

The differences between the two approaches are basically:

  • BPE merge rules ensure you get the most frequent tokens
  • Longest prefix ensures you get the longest tokens
    It still works because the output is valid in the vocabulary, but it means you're feeding the model less frequent tokens as inputs, which is suboptimal

Hk669 commented Jun 9, 2024

@joshcarp what do you think about this implementation, BPETokenizer? Could I contribute it here, or is this issue already addressed? Thanks.

joshcarp (Owner) commented Jun 9, 2024

Originally I wanted this to be a no-dependency project, used as an example of every moving part of the transformer without any external dependencies, but I've been too distracted with other things to come back to it.

If you want to raise a PR that removes all of the current tokenizer stuff and puts BPETokenizer in, that would be great, especially just to get the repo into a good working state.

I'm still going to do some BPE implementation in this repo as part of plans for a blog post, similar to the Annotated Transformer, but it doesn't need to be fully feature-complete.

Hk669 commented Jun 9, 2024

That's great @joshcarp, let me raise a PR with the BPETokenizer (I'm thinking of using the same GPT-4 pattern to split the tokens, in case anyone wants to train their own tokenizer). Thoughts on this?
