[Bug?] Incompatible with Hugging Face Tokenizers
#18
Hi @Dan-wanna-M!

I wanted to integrate your great work here into mistral.rs and Candle. However, when testing with the `microsoft/Phi-3.5-mini-instruct` model's tokenizer using the below code, I get an error.

Output:

`Vocabulary::new` already returns a Result, so maybe we can just return an error for this.
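A minimal sketch of this kind of setup, and of surfacing the failure as an error rather than a panic. The two-map `Vocabulary::new(id_to_token, id_to_token_string)` shape, the `ahash` map type, and the `Token(Box<[u8]>)` newtype are assumptions about the kbnf API, not the exact code from the issue; check the crate docs for your version.

```rust
use kbnf::{Token, Vocabulary};
use tokenizers::Tokenizer;

// Hypothetical reproduction: build a kbnf Vocabulary straight from the
// Hugging Face tokenizer's vocab map. The argument shapes passed to
// `Vocabulary::new` here are assumptions about the kbnf API.
fn build_vocabulary(tokenizer: &Tokenizer) -> Result<Vocabulary, Box<dyn std::error::Error>> {
    let mut id_to_token = ahash::AHashMap::default();
    let mut id_to_token_string = ahash::AHashMap::default();
    for (token, id) in tokenizer.get_vocab(true) {
        // Passing the token strings through unmodified is what triggers the
        // "UTF-8 bytes not present" warning discussed below.
        id_to_token.insert(id, Token(token.as_bytes().to_vec().into_boxed_slice()));
        id_to_token_string.insert(id, token);
    }
    // `Vocabulary::new` returns a Result, so `?` propagates the failure to
    // the caller as an error instead of panicking.
    Ok(Vocabulary::new(id_to_token, id_to_token_string)?)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Requires the `http` feature of the `tokenizers` crate.
    let tokenizer = Tokenizer::from_pretrained("microsoft/Phi-3.5-mini-instruct", None)?;
    let _vocab = build_vocabulary(&tokenizer)?;
    Ok(())
}
```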
Comments

The warning indicates that some UTF-8-specific bytes are not present in the vocabulary, which can result from:
In this case, I think it is because the tokenizer does some interesting preprocessing on the vocabulary. Hugging Face's byte-level BPE (BBPE) tokenizer encodes control bytes and non-ASCII bytes in a special way, and we need to decode them before passing the vocabulary into a KBNF Vocabulary. You can check this file to see the heuristic handling I used.
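For context, the encoding such a heuristic has to undo looks roughly like the GPT-2-style byte-level BPE table below. This is a sketch of the general technique, not necessarily the exact logic in the linked file.

```rust
use std::collections::HashMap;

/// Inverse of the GPT-2 byte-to-unicode table used by byte-level BPE:
/// printable ASCII and two Latin-1 ranges map to themselves, and every other
/// byte is remapped to an unused code point starting at U+0100.
fn unicode_to_bytes() -> HashMap<char, u8> {
    let mut table = HashMap::new();
    let mut n = 0u32;
    for b in 0u32..256 {
        let kept = (0x21..=0x7E).contains(&b)
            || (0xA1..=0xAC).contains(&b)
            || (0xAE..=0xFF).contains(&b);
        let c = if kept {
            // For kept bytes the code point equals the byte value.
            char::from_u32(b).unwrap()
        } else {
            let c = char::from_u32(0x100 + n).unwrap();
            n += 1;
            c
        };
        table.insert(c, b as u8);
    }
    table
}

/// Decode a BBPE-encoded token string back into raw bytes before handing it
/// to the KBNF vocabulary; returns None if a char is outside the table.
fn decode_bbpe_token(token: &str, table: &HashMap<char, u8>) -> Option<Vec<u8>> {
    token.chars().map(|c| table.get(&c).copied()).collect()
}

fn main() {
    let table = unicode_to_bytes();
    // 'Ġ' (U+0120) is the remapped form of the space byte 0x20.
    assert_eq!(decode_bbpe_token("Ġhello", &table), Some(b" hello".to_vec()));
}
```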
No, this indeed looks like a separate bug, since vocabulary loading should not interfere with grammar creation. Specifically, suspicious vocabulary loading should not lead to any panics (hence a warning rather than a hard error). Could you share the KBNF grammar string you use?
I think I managed to reproduce the bug; I guess the start nonterminal (which defaults to `start`) is missing from the grammar.
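To illustrate the guess above, a hypothetical pair of grammars; the grammar strings are mine, and `Engine::new(grammar, vocabulary)` plus `Vocabulary: Clone` are assumptions about the kbnf API rather than code from the issue.

```rust
use kbnf::{Engine, Vocabulary};

fn try_grammars(vocab: Vocabulary) {
    // This grammar defines the default start nonterminal, so engine creation
    // should succeed.
    let _ok = Engine::new(r#"start ::= "yes" | "no";"#, vocab.clone());
    // This grammar only defines `answer`; when the engine looks up the missing
    // default start nonterminal, creation should return an error, but with the
    // suspected bug it panics instead.
    let _missing = Engine::new(r#"answer ::= "yes" | "no";"#, vocab);
}
```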
@EricLBuehler Could you elaborate a bit on how you would like to integrate kbnf into candle.rs and candle-vllm? I have more time now and I would like to create a PR for the integration.
Hi @Dan-wanna-M! That sounds great. I have a PR here: EricLBuehler/mistral.rs#815. Perhaps you could take a look?