Update generative/README.md for tokenizer_to_sentencepiece.py
PiperOrigin-RevId: 684970206
ai-edge-bot authored and copybara-github committed Oct 11, 2024
1 parent 2278e6c commit ddb7bf7
Showing 1 changed file with 21 additions and 0 deletions.
21 changes: 21 additions & 0 deletions ai_edge_torch/generative/README.md
@@ -91,6 +91,27 @@ To deploy using the MP LLM Inference API, you need to
* Bundle the converted TFLite files along with some other configurations such as start/stop tokens, tokenizer model etc. See [here](http://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference#ai_edge_model_conversion)
* Once the bundle is created, you can easily invoke the pipeline using the mobile APIs [here](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/android#create_the_task).

#### Tokenizer

The bundle files used by the MediaPipe LLM Inference API require a SentencePiece
model protobuf file as the tokenizer model. Many PyTorch models that use BPE
tokenization don't provide one. In that case, a SentencePiece model protobuf
file can be built from the tokenizer config JSON files.
`generative/tools/tokenizer_to_sentencepiece.py` may be sufficient for this,
although the generated SentencePiece model is not guaranteed to output the same
token IDs as the original tokenizer for all input strings. For example, the
SentencePiece model of Llama3.2 built by
`generative/tools/tokenizer_to_sentencepiece.py` outputs token IDs that differ
from those of the original BPE tokenizer in around 1% of cases.

```
python tokenizer_to_sentencepiece.py \
--checkpoint=meta-llama/Llama-3.2-3B-Instruct \
--output_path=llama3.spm.model
...
I1011 tokenizer_to_sentencepiece.py:203] Not matched strictly 35/1000 pairs: 3.50%, loosely 9/1000 pairs: 0.90%
I1011 tokenizer_to_sentencepiece.py:274] Writing the SentencePieceModel protobuf file to: llama3.spm.model
```
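
One way to sanity-check a converted tokenizer is to measure how often its token IDs differ from the original tokenizer's on a sample of strings. The following is a minimal sketch of that idea, not the logic used inside `tokenizer_to_sentencepiece.py`: `mismatch_rate` and the two toy encoders are hypothetical names introduced for illustration, and only exact comparison is shown because the script's "loosely" matched criterion is not described here.

```python
from typing import Callable, Iterable, List, Tuple

# An encoder maps a string to a list of token IDs.
Encoder = Callable[[str], List[int]]


def mismatch_rate(
    encode_a: Encoder, encode_b: Encoder, texts: Iterable[str]
) -> Tuple[int, int, float]:
    """Counts texts whose token ID sequences differ between two encoders."""
    total = 0
    mismatched = 0
    for text in texts:
        total += 1
        if encode_a(text) != encode_b(text):
            mismatched += 1
    rate = mismatched / total if total else 0.0
    return mismatched, total, rate


# Toy stand-ins for real tokenizers (hypothetical, for illustration only).
def toy_original(text: str) -> List[int]:
    # One ID per whitespace-separated word: the word's length.
    return [len(word) for word in text.split()]


def toy_converted(text: str) -> List[int]:
    # Same scheme, except words longer than 5 characters become two tokens.
    ids: List[int] = []
    for word in text.split():
        if len(word) > 5:
            ids.extend([5, len(word) - 5])
        else:
            ids.append(len(word))
    return ids


samples = ["hello world", "tokenizer test", "a b c", "sentence piece"]
mismatched, total, rate = mismatch_rate(toy_original, toy_converted, samples)
print(f"Not matched {mismatched}/{total} pairs: {rate:.2%}")
```

With real tokenizers, `encode_a` could plausibly wrap the `encode` method of a Hugging Face tokenizer loaded from the checkpoint, and `encode_b` the `encode` method of a `sentencepiece.SentencePieceProcessor` loaded from the generated model file. That wiring is an assumption and is not exercised by this sketch.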

<br/>

## Model visualization