Hey, did you make any progress on this?
Is there a documentation of the precise algorithm of the tokenizer in llama.cpp?
While there are plenty of precise descriptions and simple reference implementations of how the various LLM architectures work, I can't find anything similar for the (presumably much simpler) tokenizers. Yet the tokenizer often seems to be the culprit when something breaks during a port, as happened with Llama 3.
So I'm wondering whether there is documentation of exactly what llama.cpp does with tokenizer.ggml.model, tokenizer.ggml.pre, tokenizer.ggml.tokens, tokenizer.ggml.token_type, and tokenizer.ggml.merges (and of what happens when some of these, such as merges, are absent), and whether there are any non-trivial hard-coded processing steps not governed by a parameter in the GGUF file.
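For concreteness, here is a minimal sketch of how rank-based BPE merging generally works in GPT-2-style tokenizers, which is the kind of procedure the merges list feeds into. This is not llama.cpp's exact algorithm; the merge list below is toy data, and real tokenizers add pre-tokenization, byte-level encoding, and special-token handling on top.

```python
# Hypothetical illustration of rank-based BPE merging. The vocab and
# merge list are toy data, not from any real model.

def bpe_merge(word, merge_ranks):
    """Repeatedly merge the adjacent symbol pair with the lowest rank."""
    symbols = list(word)
    while len(symbols) > 1:
        # Rank every adjacent pair; unknown pairs get rank infinity.
        pairs = [(merge_ranks.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        best_rank, i = min(pairs)
        if best_rank == float("inf"):
            break  # no applicable merge left
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Toy merge list, analogous in shape to tokenizer.ggml.merges entries
# such as "l o": position in the list is the merge's priority (rank).
merges = ["l o", "lo w", "e r", "low er"]
merge_ranks = {tuple(m.split()): rank for rank, m in enumerate(merges)}

print(bpe_merge("lower", merge_ranks))  # → ['lower']
```

The subtle ports-breaking details usually live not in this loop but in everything around it: the pre-tokenization regex (which tokenizer.ggml.pre presumably selects), byte fallback, and special tokens.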
Or do I have to read the source code?