
redo: New tokenizer implementation for MPT and GPT-J #765

Closed
wants to merge 1 commit

Conversation

@apage43 (Member) commented May 30, 2023

Re-do of #661 - leaving as Draft until building/linking issues are solved

Improves output quality by making these tokenizers more closely match the behavior of the huggingface `tokenizers`-based BPE tokenizers these models were trained with.

Featuring:

  • Fixed Unicode handling (via ICU)
  • Fixed BPE token merge handling (see the sketch below)
  • Complete added-vocabulary handling
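
For reference, the merge-handling fix concerns applying BPE merges in rank order (lowest-ranked pair first), the way huggingface `tokenizers` does, rather than greedily. A minimal sketch of that loop, with illustrative names rather than this PR's actual code:

```cpp
// Sketch: rank-ordered BPE merging. `ranks` maps a token pair to its merge
// priority (lower = applied earlier), as loaded from a merges.txt-style file.
#include <climits>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using MergeRanks = std::map<std::pair<std::string, std::string>, int>;

std::vector<std::string> bpe(std::vector<std::string> pieces, const MergeRanks &ranks) {
    while (pieces.size() > 1) {
        // Find the adjacent pair with the lowest (best) merge rank.
        int bestRank = INT_MAX;
        std::size_t bestIdx = 0;
        for (std::size_t i = 0; i + 1 < pieces.size(); ++i) {
            auto it = ranks.find({pieces[i], pieces[i + 1]});
            if (it != ranks.end() && it->second < bestRank) {
                bestRank = it->second;
                bestIdx = i;
            }
        }
        if (bestRank == INT_MAX)
            break; // no mergeable pair left
        // Merge the winning pair in place and repeat.
        pieces[bestIdx] += pieces[bestIdx + 1];
        pieces.erase(pieces.begin() + bestIdx + 1);
    }
    return pieces;
}
```

In a real tokenizer this runs once per word after pre-tokenization, with `pieces` initialized to the word's byte-level symbols.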

@apage43 changed the title from "New tokenizer implementation for MPT and GPT-J" to "redo: New tokenizer implementation for MPT and GPT-J" on May 30, 2023
@manyoso (Collaborator) commented Jun 2, 2023

I think we should submodule ICU and statically link it to llmodel if we do this... otherwise every binding author is going to have to worry about ICU dependency bundling/handling.

@apage43 (Member, Author) commented Jun 5, 2023

This got a bit hairy after the multiple-implementation split, since it had to be done without embedding multiple copies of the tokenizer configs. It should be more doable as of the prompt() deduplication, though, since none of the model-specific code should have to call the tokenizers anymore.

@manyoso (Collaborator) commented Jul 6, 2023

Can this be closed, since the tokenizer changes landed upstream?

@apage43 (Member, Author) commented Jul 6, 2023

> Can this be closed, since the tokenizer changes landed upstream?

No, there's still no upstream fix for this. It requires file format changes, so it's not likely to happen upstream until ggerganov/ggml#220 happens.

It's stalled here because it introduces an ICU dependency for Unicode-aware regex, UTF-8 handling, and Unicode character class data ("which code points are 'letters'?").
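
For a concrete sense of what the dependency buys, here's a minimal sketch (assuming ICU's C++ `RegexMatcher` API; not code from this PR) of running the stock GPT-2 word-splitting pattern, whose `\p{L}`/`\p{N}` character classes `std::regex` can't match:

```cpp
// Sketch: GPT-2-style pre-tokenization via ICU regex (link -licuuc -licui18n).
#include <unicode/regex.h>
#include <unicode/unistr.h>
#include <string>
#include <vector>

std::vector<std::string> pretokenize(const std::string &utf8) {
    // The stock GPT-2 word-splitting pattern applied before BPE merging.
    static const char *kPattern =
        R"('s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+)";
    UErrorCode status = U_ZERO_ERROR;
    icu::UnicodeString pattern = icu::UnicodeString::fromUTF8(kPattern);
    icu::RegexMatcher matcher(pattern, 0, status);
    icu::UnicodeString text = icu::UnicodeString::fromUTF8(utf8);
    matcher.reset(text); // `text` must outlive the matcher's use of it
    std::vector<std::string> pieces;
    while (U_SUCCESS(status) && matcher.find()) {
        std::string piece;
        matcher.group(status).toUTF8String(piece);
        pieces.push_back(piece);
    }
    return pieces;
}
```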

ggllm (the llama.cpp Falcon fork) has a non-ICU variant, but doing that required copying the whole Unicode data tables into C++ code, writing a UTF-8 codec, and hand-rewriting the stock GPT-2 regex that most models use for word-splitting before applying BPE (which is where the Unicode character tables are needed) into a C++ function. As an alternative to linking against ICU, though, we could copy that implementation; the idea is sketched below.
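
To make the table-copying approach concrete, a hypothetical sketch (not ggllm's actual code) of answering the character-class question from embedded data instead of ICU:

```cpp
// Sketch: "is this code point a letter?" via binary search over embedded
// ranges. The ranges here are a tiny excerpt for illustration; a real table
// would be generated from the Unicode character database and cover every
// \p{L} range.
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

static const std::vector<std::pair<uint32_t, uint32_t>> kLetterRanges = {
    {0x0041, 0x005A}, // A-Z
    {0x0061, 0x007A}, // a-z
    {0x00C0, 0x00D6}, // Latin-1 letters (partial)
    // ... thousands more ranges in a generated table ...
};

bool isLetter(uint32_t cp) {
    // Find the first range starting after cp, then check the one before it.
    auto it = std::upper_bound(
        kLetterRanges.begin(), kLetterRanges.end(), cp,
        [](uint32_t c, const std::pair<uint32_t, uint32_t> &r) { return c < r.first; });
    return it != kLetterRanges.begin() && cp <= std::prev(it)->second;
}
```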

It's unfortunately not a simple matter of submoduling ICU and adding it to the build, as ICU has a somewhat complex autotools-based build process, not a CMake one.

Or there's always the option to link against the original huggingface `tokenizers` (requiring the Rust toolchain is not great, but I suspect it's less of a headache than building ICU everywhere).

Regardless, we should still do one of these: differences between how we encode input and how the same input would have been encoded during training will make the models seem worse than they actually are.

@niansa (Contributor) commented Aug 8, 2023

Hmm, how about we just expect the user to have libicu74 installed on their system and give an error if they don't?
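
A minimal sketch of that idea (hypothetical, and assuming ICU is loaded at runtime rather than linked at build time, since a normally-linked binary would fail in the dynamic loader before it could print anything useful):

```cpp
// Sketch: probe for the system ICU with dlopen and fail with a clear error
// if it is missing, instead of bundling it.
#include <dlfcn.h>
#include <cstdio>
#include <cstdlib>

void *require_system_icu() {
    // Note: ICU's soname and C symbols are version-suffixed
    // (libicuuc.so.74, u_isalpha_74, ...), so a check like this ties the
    // binary to one specific ICU major version.
    void *icu = dlopen("libicuuc.so.74", RTLD_NOW);
    if (!icu) {
        std::fprintf(stderr,
                     "error: libicu74 is required for tokenization but could not be loaded: %s\n",
                     dlerror());
        std::exit(EXIT_FAILURE);
    }
    return icu;
}
```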

@apage43 closed this on Nov 17, 2023