
redo: New tokenizer implementation for MPT and GPT-J #765

Closed
wants to merge 1 commit

Conversation

@apage43 (Member) commented May 30, 2023

Re-do of #661 - leaving as Draft until building/linking issues are solved

Improves output quality by making these tokenizers more closely match the behavior of the huggingface `tokenizers`-based BPE tokenizers these models were trained with.

Featuring:

  • Fixed Unicode handling (via ICU)
  • Fixed BPE token merge handling (see the sketch below)
  • Complete added-vocabulary handling
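
For reference, the merge-handling fix concerns applying BPE merges in rank order (lowest-ranked pair first), the way huggingface `tokenizers` does, rather than greedily. A minimal sketch of that loop, with illustrative names rather than this PR's actual code:

```cpp
// Sketch: rank-ordered BPE merging. `ranks` maps a token pair to its merge
// priority (lower = applied earlier), as loaded from a merges.txt-style file.
#include <climits>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using MergeRanks = std::map<std::pair<std::string, std::string>, int>;

std::vector<std::string> bpe(std::vector<std::string> pieces, const MergeRanks &ranks) {
    while (pieces.size() > 1) {
        // Find the adjacent pair with the lowest (best) merge rank.
        int bestRank = INT_MAX;
        std::size_t bestIdx = 0;
        for (std::size_t i = 0; i + 1 < pieces.size(); ++i) {
            auto it = ranks.find({pieces[i], pieces[i + 1]});
            if (it != ranks.end() && it->second < bestRank) {
                bestRank = it->second;
                bestIdx = i;
            }
        }
        if (bestRank == INT_MAX)
            break; // no mergeable pair left
        // Merge the winning pair in place and repeat.
        pieces[bestIdx] += pieces[bestIdx + 1];
        pieces.erase(pieces.begin() + bestIdx + 1);
    }
    return pieces;
}
```

In a real tokenizer this runs once per word after pre-tokenization, with `pieces` initialized to the word's byte-level symbols.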

@apage43 changed the title from "New tokenizer implementation for MPT and GPT-J" to "redo: New tokenizer implementation for MPT and GPT-J" on May 30, 2023
@manyoso (Collaborator) commented Jun 2, 2023

I think we should submodule ICU and statically link it to llmodel if we do this... otherwise every binding author is going to have to worry about ICU dependency bundling/handling.

@apage43 (Member, Author) commented Jun 5, 2023

This got a bit hairy after the multiple-implementation split, since it had to be done without embedding multiple copies of the tokenizer configs. It should be more doable as of the prompt() deduplication, though, since none of the model-specific code should have to call the tokenizers anymore.

@manyoso (Collaborator) commented Jul 6, 2023

Can this be closed, since the tokenizer changes landed upstream?

@apage43 (Member, Author) commented Jul 6, 2023

> Can this be closed, since the tokenizer changes landed upstream?

No, there's still no upstream fix for this. It requires file format changes, so it's not likely to happen upstream until ggerganov/ggml#220 happens.

It's stalled here because it introduces an ICU dependency for Unicode-aware regex, UTF-8 handling, and Unicode character class data ("which code points are 'letters'?").
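
For a concrete sense of what the dependency buys, here's a minimal sketch (assuming ICU's C++ `RegexMatcher` API; not code from this PR) of running the stock GPT-2 word-splitting pattern, whose `\p{L}`/`\p{N}` character classes `std::regex` can't match:

```cpp
// Sketch: GPT-2-style pre-tokenization via ICU regex (link -licuuc -licui18n).
#include <unicode/regex.h>
#include <unicode/unistr.h>
#include <string>
#include <vector>

std::vector<std::string> pretokenize(const std::string &utf8) {
    // The stock GPT-2 word-splitting pattern applied before BPE merging.
    static const char *kPattern =
        R"('s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+)";
    UErrorCode status = U_ZERO_ERROR;
    icu::UnicodeString pattern = icu::UnicodeString::fromUTF8(kPattern);
    icu::RegexMatcher matcher(pattern, 0, status);
    icu::UnicodeString text = icu::UnicodeString::fromUTF8(utf8);
    matcher.reset(text); // `text` must outlive the matcher's use of it
    std::vector<std::string> pieces;
    while (U_SUCCESS(status) && matcher.find()) {
        std::string piece;
        matcher.group(status).toUTF8String(piece);
        pieces.push_back(piece);
    }
    return pieces;
}
```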

ggllm (the llama.cpp Falcon fork) has a non-ICU variant, but doing that required copying the whole Unicode data tables into C++ code, writing a UTF-8 codec, and hand-rewriting the stock GPT-2 regex that most models use for word-splitting before applying BPE (which is where the Unicode character tables are needed) into a C++ function. As an alternative to linking against ICU, though, we could copy that implementation; the idea is sketched below.
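
To make the table-copying approach concrete, a hypothetical sketch (not ggllm's actual code) of answering the character-class question from embedded data instead of ICU:

```cpp
// Sketch: "is this code point a letter?" via binary search over embedded
// ranges. The ranges here are a tiny excerpt for illustration; a real table
// would be generated from the Unicode character database and cover every
// \p{L} range.
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

static const std::vector<std::pair<uint32_t, uint32_t>> kLetterRanges = {
    {0x0041, 0x005A}, // A-Z
    {0x0061, 0x007A}, // a-z
    {0x00C0, 0x00D6}, // Latin-1 letters (partial)
    // ... thousands more ranges in a generated table ...
};

bool isLetter(uint32_t cp) {
    // Find the first range starting after cp, then check the one before it.
    auto it = std::upper_bound(
        kLetterRanges.begin(), kLetterRanges.end(), cp,
        [](uint32_t c, const std::pair<uint32_t, uint32_t> &r) { return c < r.first; });
    return it != kLetterRanges.begin() && cp <= std::prev(it)->second;
}
```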

It's unfortunately not a simple matter of submoduling ICU and adding it to the build, as ICU has a somewhat complex autotools-based build process, not a CMake one.

Or there's always the option to link against the original huggingface `tokenizers` (requiring the Rust toolchain is not great, but I suspect it's less of a headache than building ICU everywhere).

Regardless, we should still do one of these: differences between how we encode input and how the same input would have been encoded during training will make the models seem worse than they actually are.

@niansa (Contributor) commented Aug 8, 2023

Hmm, how about we just expect the user to have libicu74 installed on their system and give an error if they don't?
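
A minimal sketch of that idea (hypothetical, and assuming ICU is loaded at runtime rather than linked at build time, since a normally-linked binary would fail in the dynamic loader before it could print anything useful):

```cpp
// Sketch: probe for the system ICU with dlopen and fail with a clear error
// if it is missing, instead of bundling it.
#include <dlfcn.h>
#include <cstdio>
#include <cstdlib>

void *require_system_icu() {
    // Note: ICU's soname and C symbols are version-suffixed
    // (libicuuc.so.74, u_isalpha_74, ...), so a check like this ties the
    // binary to one specific ICU major version.
    void *icu = dlopen("libicuuc.so.74", RTLD_NOW);
    if (!icu) {
        std::fprintf(stderr,
                     "error: libicu74 is required for tokenization but could not be loaded: %s\n",
                     dlerror());
        std::exit(EXIT_FAILURE);
    }
    return icu;
}
```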

@apage43 closed this on Nov 17, 2023