
Suggestion: Use smollm corpus #695

Open

linux-leo opened this issue Jul 18, 2024 · 3 comments

Comments

@linux-leo

From my understanding, we're always trying to use the best dataset available, which is why I'm suggesting the corpus from the new Hugging Face SmolLM: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus
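If it helps, here's a minimal sketch of how one might peek at the corpus with the `datasets` library. The config name `fineweb-edu-dedup` and the `text` column are taken from the dataset card; treat them as assumptions, not verified API details:

```python
from datasets import load_dataset

# Stream one subset of the corpus rather than downloading it up front.
# The config name and "text" column are assumptions based on the dataset card.
ds = load_dataset(
    "HuggingFaceTB/smollm-corpus",
    "fineweb-edu-dedup",
    split="train",
    streaming=True,
)

# Print the first few hundred characters of the first document.
print(next(iter(ds))["text"][:300])
```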

@gordicaleksa
Contributor

Can you post some (eval) results against edu fineweb?

@linux-leo
Author

I haven't run experiments or trained a model myself with this codebase, but I will if I ever get around to it.

Note that the large majority of SmolLM's corpus is fineweb-edu, augmented only with synthetic data from cosmopedia-v2 and coding data from python-edu. Since both of those sources are small compared to the fineweb-edu data, in my opinion they should have almost no negative impact on any benchmarks relative to pure fineweb-edu models, while possibly achieving higher scores on more academic questions and reasoning tasks.
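For anyone who wants to try a mixture like that, here's a rough sketch using `datasets.interleave_datasets`. The config names come from the dataset card, and the sampling weights are purely illustrative, not the official proportions:

```python
from datasets import load_dataset, interleave_datasets

# Illustrative weights: fineweb-edu dominates, the other two are small.
# These are NOT the official proportions -- adjust to taste.
subsets = {
    "fineweb-edu-dedup": 0.90,
    "cosmopedia-v2": 0.08,
    "python-edu": 0.02,
}

streams = [
    load_dataset("HuggingFaceTB/smollm-corpus", name, split="train", streaming=True)
    for name in subsets
]

# Keep only the shared "text" column so the schemas line up
# (assumption: every subset exposes a "text" field).
streams = [s.select_columns(["text"]) for s in streams]

# Sample from the three streams according to the weights above.
mixed = interleave_datasets(
    streams,
    probabilities=list(subsets.values()),
    seed=42,
)
```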

@linux-leo
Author

linux-leo commented Jul 29, 2024

This isn't a one-to-one comparison, but it is from the official blog post announcing SmolLM (note the comparison to Karpathy's GPT):

[image: evaluation comparison from the SmolLM blog post]

https://huggingface.co/blog/smollm

Note: I don't know which checkpoint they're comparing against, but assuming it's the longest-trained one, SmolLM was still trained on more than twice the amount of tokens. Still, I don't think that by itself explains some of the improvements, especially when taking model saturation into account.
