Is there explicitly instruction-following data in the version of Dolma used to train v1? #658

john-hewitt · 2024-07-15T22:53:49Z

Hi everyone,

I'm working on a research project on instruction following, and it would be amazing to have a language model with a guarantee that no explicitly instruction-following data (e.g., from LIMA or Alpaca) was used during pretraining.

Some thoughts:

I don't have the disk space to build Dolma. Alas!
I realize that in Dolma v1.7, FLAN is explicitly included, so that's out.
I say "explicitly" instruction-following data because lots of naturally occurring web data has instruction-response-like formats (stackoverflow, etc.), and that is fine. I'm just worried about the increasingly common practice of mixing explicit "instruction following SFT data" into the pretraining process.
I know there's an n-gram viewer at https://wimbd.apps.allenai.org/about, and it says the TULU-style <|assistant|> n-gram shows up around 20M times, but it returns the identical count for assistant without the formatting, so I imagine it's stripping the formatting before matching, which makes the result uninformative here.
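
To make the distinction concrete, here is a minimal sketch of the exact-match check that the n-gram viewer does not seem to provide: counting documents that contain a chat-template marker verbatim, with no normalization. It assumes a locally downloaded shard stored as gzipped JSON Lines with the document body in a "text" field. The marker list, the function names, and the shard layout are illustrative assumptions, not something confirmed in this thread.

```python
import gzip
import json
from collections import Counter
from typing import Iterable, Sequence

# Hypothetical marker list: literal strings typical of SFT chat templates.
# Extend it with whatever templates matter for your study.
MARKERS = ("<|assistant|>", "<|user|>", "### Instruction:", "### Response:")


def count_markers(lines: Iterable[str], markers: Sequence[str] = MARKERS) -> Counter:
    """Count documents containing each marker as an exact substring.

    Matching is case- and punctuation-sensitive, so "<|assistant|>" is
    not conflated with the plain word "assistant".
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: each line is a JSON object with a "text" field.
        text = json.loads(line).get("text", "")
        for marker in markers:
            if marker in text:
                counts[marker] += 1
    return counts


def count_markers_in_shard(path: str, markers: Sequence[str] = MARKERS) -> Counter:
    """Stream one gzipped JSONL shard from disk without loading it whole."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_markers(handle, markers)
```

Because it streams one shard at a time, a check like this could run over a sample of the corpus without building all of Dolma, though a zero count on a sample would only suggest, not prove, that such data is absent.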

I realize data can leak in, so the answer is probably not "definitely not", but does anyone know if the answer is at least "not intentionally"?

See corresponding Dolma request; wasn't sure how much information sharing there would be between the two: allenai/dolma#177

Thanks!

soldni · 2024-07-18T14:25:33Z

(responded on ticket in Dolma repository)

john-hewitt added the type/question label Jul 15, 2024

john-hewitt mentioned this issue Jul 15, 2024

Is there explicitly instruction-following data in the version of Dolma used to train Olmo v1? allenai/dolma#177

Open