
Suggestion: Use smollm corpus #695

Open

linux-leo opened this issue Jul 18, 2024 · 3 comments

Comments

@linux-leo

From my understanding, we're always trying to use the best dataset available, which is why I'm suggesting the corpus from the new Hugging Face SmolLM: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus
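If it helps, here's a minimal sketch of how one might peek at the corpus with the `datasets` library. The config name `fineweb-edu-dedup` and the `text` column are taken from the dataset card; treat them as assumptions, not verified API details:

```python
from datasets import load_dataset

# Stream one subset of the corpus rather than downloading it up front.
# The config name and "text" column are assumptions based on the dataset card.
ds = load_dataset(
    "HuggingFaceTB/smollm-corpus",
    "fineweb-edu-dedup",
    split="train",
    streaming=True,
)

# Print the first few hundred characters of the first document.
print(next(iter(ds))["text"][:300])
```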

@gordicaleksa
Contributor

Can you post some (eval) results against edu fineweb?

@linux-leo
Author

I haven't run experiments or trained a model myself with this codebase, but I will if I ever get around to it.

Note that the large majority of SmolLM's corpus is fineweb-edu, augmented only with synthetic data from cosmopedia-v2 and coding data from python-edu. Since both of those sources are small compared to the fineweb-edu data, in my opinion they should have almost no negative impact on any benchmarks relative to pure fineweb-edu models, while possibly achieving higher scores on more academic questions and reasoning tasks.
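For anyone who wants to try a mixture like that, here's a rough sketch using `datasets.interleave_datasets`. The config names come from the dataset card, and the sampling weights are purely illustrative, not the official proportions:

```python
from datasets import load_dataset, interleave_datasets

# Illustrative weights: fineweb-edu dominates, the other two are small.
# These are NOT the official proportions -- adjust to taste.
subsets = {
    "fineweb-edu-dedup": 0.90,
    "cosmopedia-v2": 0.08,
    "python-edu": 0.02,
}

streams = [
    load_dataset("HuggingFaceTB/smollm-corpus", name, split="train", streaming=True)
    for name in subsets
]

# Keep only the shared "text" column so the schemas line up
# (assumption: every subset exposes a "text" field).
streams = [s.select_columns(["text"]) for s in streams]

# Sample from the three streams according to the weights above.
mixed = interleave_datasets(
    streams,
    probabilities=list(subsets.values()),
    seed=42,
)
```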

@linux-leo
Author

linux-leo commented Jul 29, 2024

This isn't a one-to-one comparison, but it is from the official blog post announcing SmolLM (note the comparison to Karpathy's GPT):

[image: evaluation comparison from the SmolLM blog post]

https://huggingface.co/blog/smollm

Note: I don't know which checkpoint they're comparing against, but assuming it's the longest-trained one, SmolLM was still trained on more than twice the amount of tokens. Still, I don't think that by itself explains some of the improvements, especially when taking model saturation into account.
