
Token counts #88

Open
timsueberkrueb opened this issue Nov 22, 2023 · 2 comments

Comments

@timsueberkrueb

Hey, thank you for making this dataset available to the community.
I'm wondering how you estimated the token counts in the table in the README and the blog post. In particular, do you have the corresponding numbers in bytes or Unicode codepoints?
Thanks a lot in advance.

@mauriceweber (Collaborator) commented Nov 22, 2023

Hi @timsueberkrueb -- we used the mistral-7B tokenizer and tokenized a subset of 100M documents. We then used these token counts to extrapolate to the full dataset. You can check out the code used to count tokens here: https://github.com/togethercomputer/RedPajama-Data/blob/main/app/src/token_count.py.

> In particular, do you have the corresponding numbers in bytes or Unicode codepoints?

What do you mean by this? Are you referring to a specific tokenizer?

@timsueberkrueb (Author) commented Nov 23, 2023

Thank you @mauriceweber!

> What do you mean by this? Are you referring to a specific tokenizer?

I was wondering about the total amount of text data per language (excluding metadata etc), prior to tokenization.
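The distinction matters because byte counts and codepoint counts diverge as soon as the text is non-ASCII, which is common in a multilingual corpus. A minimal illustration (the sample string is arbitrary):

```python
# For non-ASCII text, UTF-8 bytes and Unicode codepoints differ.
text = "Grüße, 世界"

n_codepoints = len(text)             # number of Unicode codepoints
n_bytes = len(text.encode("utf-8"))  # size in UTF-8 bytes

print(n_codepoints)  # 9
print(n_bytes)       # 15 (ü and ß take 2 bytes each, CJK chars 3 bytes each)
```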
