
Token counts #88

Open
timsueberkrueb opened this issue Nov 22, 2023 · 2 comments

Comments

@timsueberkrueb

Hey, thank you for making this dataset available to the community.
I'm wondering how you estimated the token counts in the table in the README and the blog post. In particular, do you have the corresponding numbers in bytes or Unicode codepoints?
Thanks a lot in advance.

@mauriceweber (Collaborator) commented Nov 22, 2023

Hi @timsueberkrueb -- we used the mistral-7B tokenizer and tokenized a subset of 100M documents. We then used these token counts to extrapolate to the full dataset. You can check out the code used to count tokens here: https://github.com/togethercomputer/RedPajama-Data/blob/main/app/src/token_count.py.

> In particular, do you have the corresponding numbers in bytes or Unicode codepoints?

What do you mean by this? Are you referring to a specific tokenizer?

@timsueberkrueb (Author) commented Nov 23, 2023

Thank you @mauriceweber!

> What do you mean by this? Are you referring to a specific tokenizer?

I was wondering about the total amount of text data per language (excluding metadata etc), prior to tokenization.
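The distinction matters because byte counts and codepoint counts diverge as soon as the text is non-ASCII, which is common in a multilingual corpus. A minimal illustration (the sample string is arbitrary):

```python
# For non-ASCII text, UTF-8 bytes and Unicode codepoints differ.
text = "Grüße, 世界"

n_codepoints = len(text)             # number of Unicode codepoints
n_bytes = len(text.encode("utf-8"))  # size in UTF-8 bytes

print(n_codepoints)  # 9
print(n_bytes)       # 15 (ü and ß take 2 bytes each, CJK chars 3 bytes each)
```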
