A C# implementation of the token encoding used by OpenAI language models
- It seems that the BPE file contains a `Map` of `base64` string to `integer` pairs.
- The regex doesn't seem to translate well. Something breaks when trying to encode directly in Rust (see this specific test in the fork).
- Currently the regex neither separates the words nor encodes them as base64 individually, and that seems to be the problem.
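To make the two observations above concrete, here is a minimal Python sketch of the expected flow: decode the `base64` keys of a BPE-style file into raw bytes, split the input into pieces with a regex, and look up each piece individually. The three sample lines and the names `SAMPLE_BPE`, `load_ranks`, `SPLIT` and `split_pieces` are made up for illustration. The split pattern is a deliberately simplified stand-in, not the real `tiktoken` pattern, and the direct dictionary lookup skips the BPE merge step that a real encoder applies to pieces missing from the vocabulary.

```python
import base64
import re

# Sample lines in the "<base64 token> <rank>" shape (illustrative only,
# not taken from a real vocabulary file).
SAMPLE_BPE = """\
SGVsbG8= 0
IHdvcmxk 1
IQ== 2
"""


def load_ranks(contents):
    """Decode each base64 token to raw bytes and map it to its integer rank."""
    ranks = {}
    for line in contents.splitlines():
        if not line:
            continue
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


# Simplified stand-in for the real split pattern (assumption): an optional
# leading space plus a run of word characters, or an optional leading space
# plus a run of punctuation, or a run of whitespace.
SPLIT = re.compile(r" ?\w+| ?[^\s\w]+|\s+")


def split_pieces(text):
    """Split the text into pieces and return each piece as UTF-8 bytes."""
    return [m.group().encode("utf-8") for m in SPLIT.finditer(text)]


ranks = load_ranks(SAMPLE_BPE)
pieces = split_pieces("Hello world!")
# Each piece is looked up on its own; a real encoder would run BPE merges
# for any piece that is not a single vocabulary entry.
ids = [ranks[piece] for piece in pieces]
print(pieces, ids)
```

If the regex fails to split the input, the whole string reaches the lookup as one piece, which matches the symptom described above.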
- Focus on the encoders' regex for `tiktoken`.
- Write the regex correctly in a C#-compliant way.
- Test with an external script. That is, table-test different inputs and assert on the output of `tiktoken` vs `HaToken`.
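The table test in the last item could be sketched roughly as below. The helper `compare`, the `CASES` table, and both stub encoders are hypothetical; the stubs exist only so the sketch runs on its own and do not reflect how `tiktoken` or `HaToken` actually tokenize.

```python
def compare(cases, reference_encode, candidate_encode):
    """Run every input through both encoders and collect the mismatches."""
    failures = []
    for text in cases:
        expected = reference_encode(text)
        actual = candidate_encode(text)
        if expected != actual:
            failures.append((text, expected, actual))
    return failures


# Inputs chosen to exercise whitespace, punctuation, digits and non-ASCII text.
CASES = [
    "Hello world",
    "",
    "  leading spaces",
    "tabs\tand\nnewlines",
    "naïve café",
    "it's 12345",
]


def reference_stub(text):
    """Stand-in for the reference encoder (assumption, not tiktoken)."""
    return list(text.encode("utf-8"))


def broken_stub(text):
    """Stand-in for a candidate that mishandles leading whitespace."""
    return list(text.strip().encode("utf-8"))


failures = compare(CASES, reference_stub, broken_stub)
for text, expected, actual in failures:
    print(f"MISMATCH for {text!r}: expected {expected}, got {actual}")
```

In a real run, `reference_encode` would presumably be `tiktoken.get_encoding("cl100k_base").encode`, and `candidate_encode` a small wrapper that invokes `HaToken` and parses its output; how that wrapper calls the C# code is left open here.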
- Python debugging is easy from VSCode. Just remember to edit the `launch.json` file so that `"justMyCode": false`; then you will be able to step into your local `pip` installation of `tiktoken`.
- To do that, either press F11 a lot, or open the `tiktoken` files in VSCode and set your breakpoints there. Their location should be something like `~/.local/lib/python3.8/site-packages/tiktoken/core.py`.
- You will have to install `tiktoken` with `pip`, like so:
pip3 install tiktoken
- Example `playground.py` file:
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
result = enc.encode("Hello world")
print(result)