
[Bug?] Incompatible with Hugging Face Tokenizers #18

Open · EricLBuehler opened this issue Sep 2, 2024 · 4 comments
Labels: bug, documentation

@EricLBuehler

Hi @Dan-wanna-M!

I wanted to integrate your great work here into mistral.rs and Candle. However, when testing with the microsoft/Phi-3.5-mini-instruct model's tokenizer using the code below, I get an error.

// Imports assumed for this snippet; adjust paths to match your crate layout.
use ahash::AHashMap;
use anyhow::Result;
use kbnf::{Engine, Token, Vocabulary};
use tokenizers::Tokenizer;

pub fn new(grammar: &str, tokenizer: &Tokenizer) -> Result<Self> {
    // Pull the full vocabulary (including added/special tokens) as
    // token-string -> id, then build the two id-keyed maps kbnf wants.
    let tokenizer_vocab = tokenizer.get_vocab(true);
    let mut id_to_tok = AHashMap::new();
    let mut id_to_tok_str = AHashMap::new();
    for (tok_str, id) in tokenizer_vocab {
        // Uses the token string's UTF-8 bytes verbatim, which turns out to be
        // what triggers the warning below for byte-level BPE vocabularies.
        id_to_tok.insert(id, Token(tok_str.as_bytes().to_vec().into_boxed_slice()));
        id_to_tok_str.insert(id, tok_str);
    }
    let vocab = Vocabulary::new(id_to_tok, id_to_tok_str)?;
    Ok(Self {
        engine: Engine::new(grammar, vocab)?,
        vocab_size: tokenizer.get_vocab_size(true),
    })
}
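
For context, a hypothetical call site for this constructor might look like the sketch below. KbnfConstraint is a stand-in name (the snippet only shows Self), and Tokenizer::from_pretrained requires the tokenizers crate's http feature:

// Hypothetical usage sketch; assumes a function that returns a Result.
let tokenizer = Tokenizer::from_pretrained("microsoft/Phi-3.5-mini-instruct", None)
    .expect("failed to load tokenizer");
let constraint = KbnfConstraint::new("start ::= 'yes' | 'no';", &tokenizer)?;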

Output from the constructor above:

WARN kbnf::vocabulary: The following bytes are not present in any token: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 127, 192, 193, 221, 222, 238, 241, 242, 243, 244, 245, 246, 247]. This likely indicates that the vocabulary loading code is wrong, the tokenizer is doing some creepy processing or the tokenizer is not UTF-8 compatible. Check the vocabulary loading code and the tokenizer code to fix any bug and/or consider processing the vocab like the tokenizer.    
thread '<unnamed>' panicked at /home/ericbuehler/.cargo/registry/src/index.crates.io-6f17d22bba15001f/kbnf-syntax-0.4.2/src/grammar.rs:36:14:
called `Option::unwrap()` on a `None` value
  • What does being UTF-8 compatible mean?
  • Should we be panicking here? Perhaps some invariant was violated, but this seems concerning, especially since Vocabulary::new already returns a Result, so maybe this could be returned as an error instead.
@Dan-wanna-M (Owner) commented Sep 2, 2024

What does being UTF-8 compatible mean?

The warning indicates that some UTF-8-specific bytes are not present in any token, which can result from any of the following:

  • the vocabulary loading code is wrong
  • the tokenizer is doing some interesting preprocessing on the vocabulary
  • the tokenizer itself just does not support UTF-8

In this case I think it is because the tokenizer is doing some interesting preprocessing on the vocabulary. Hugging Face's byte-level BPE tokenizers encode control bytes and non-ASCII bytes in an interesting way, and we need to decode them before passing the vocabulary into the KBNF Vocabulary. You can check this file to see the heuristic handling I use in Formatron. I should make it clearer in the docs.
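
For anyone hitting the same warning, here is a minimal sketch of that decoding step, assuming a GPT-2-style byte-level BPE table (an illustration, not the exact code Formatron uses):

use ahash::AHashMap;

// Invert GPT-2's bytes_to_unicode table: printable ASCII and most Latin-1
// bytes map to themselves; every other byte is shifted to code point 256 + n,
// where n counts the shifted bytes in order.
fn unicode_to_byte_map() -> AHashMap<char, u8> {
    let mut map = AHashMap::new();
    let mut n = 0u32;
    for b in 0u32..=255 {
        let keeps_identity =
            (33..=126).contains(&b) || (161..=172).contains(&b) || (174..=255).contains(&b);
        let c = if keeps_identity {
            char::from_u32(b).unwrap()
        } else {
            let shifted = char::from_u32(256 + n).unwrap();
            n += 1;
            shifted
        };
        map.insert(c, b as u8);
    }
    map
}

// Decode one byte-level BPE token string back into the raw bytes it encodes.
// Characters outside the table (e.g. from added special tokens) are kept as
// their UTF-8 encoding.
fn decode_bbpe_token(tok_str: &str, map: &AHashMap<char, u8>) -> Vec<u8> {
    let mut bytes = Vec::new();
    for c in tok_str.chars() {
        match map.get(&c) {
            Some(&b) => bytes.push(b),
            None => bytes.extend_from_slice(c.to_string().as_bytes()),
        }
    }
    bytes
}

With such a helper, the loading loop above would insert Token(decode_bbpe_token(&tok_str, &map).into_boxed_slice()) instead of the raw tok_str.as_bytes().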

Should we be panicking here?

No, this indeed looks like a separate bug, since vocabulary loading should not interfere with grammar creation. Specifically, suspicious vocabulary loading should not lead to any panics (hence a warning rather than a hard error). Could you share the KBNF grammar string you used?

@Dan-wanna-M (Owner)

I think I managed to reproduce the bug; I guess the start nonterminal (which defaults to start) is not present in your KBNF grammar, and I somehow forgot to handle that case when validating the grammar. It is fixed in 0.5.2 now.
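
For anyone following along, a minimal KBNF grammar that defines the required start nonterminal would look something like this (the rule body is illustrative; syntax per KBNF's EBNF-like notation):

start ::= 'yes' | 'no';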

Dan-wanna-M added the bug and documentation labels on Sep 2, 2024
@Dan-wanna-M (Owner)

@EricLBuehler Could you elaborate a bit on how you would like to integrate kbnf into candle.rs and candle-vllm? I have more time now and would like to create a PR for the integration.

@EricLBuehler (Author)

Hi @Dan-wanna-M! That sounds great. I have a PR here: EricLBuehler/mistral.rs#815

Perhaps you could take a look?
