redo: New tokenizer implementation for MPT and GPT-J #765

Closed
wants to merge 1 commit into from

Commits on May 30, 2023

  1. New tokenizer implementation for MPT and GPT-J

    Improves output quality by making these tokenizers more closely
    match the behavior of the Hugging Face `tokenizers`-based BPE
    tokenizers that these models were trained with.
    
    Featuring:
     * Fixed Unicode handling (via ICU)
     * Fixed BPE token merge handling
     * Complete handling of added vocabulary
    apage43 committed May 30, 2023
    18ad68a
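
For readers unfamiliar with why BPE merge handling affects output quality, here is a minimal sketch of rank-ordered merging, written in Python for brevity. It is not the code from this commit. The function name `bpe_merge` and the toy `merge_ranks` tables are hypothetical, and the sketch assumes the usual BPE convention that a lower rank means the merge was learned earlier and so takes priority.

```python
def bpe_merge(symbols, merge_ranks):
    """Apply BPE merges to a sequence of symbols, best-ranked pair first.

    `merge_ranks` maps a pair of adjacent symbols to its priority;
    a lower rank means the merge is applied earlier (assumed convention).
    """
    symbols = list(symbols)
    while len(symbols) > 1:
        best_rank, best_idx = None, None
        # Scan every adjacent pair and remember the leftmost one with the lowest rank.
        for i in range(len(symbols) - 1):
            rank = merge_ranks.get((symbols[i], symbols[i + 1]))
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_idx = rank, i
        if best_idx is None:
            break  # no mergeable pair is left
        # Replace the chosen pair with its concatenation.
        symbols[best_idx:best_idx + 2] = [symbols[best_idx] + symbols[best_idx + 1]]
    return symbols


# Toy merge table (hypothetical, for illustration only).
toy_ranks = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
lower_tokens = bpe_merge("lower", toy_ranks)

# Rank order, not left-to-right position, decides which pair merges first.
order_ranks = {("a", "b"): 1, ("b", "c"): 0}
abc_tokens = bpe_merge("abc", order_ranks)
```

In the second example a tokenizer that merged greedily from left to right would produce `ab` + `c`, whereas rank-ordered merging produces `a` + `bc`. A mismatch of this kind yields token sequences the model did not see during training.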