
Is there explicit instruction-following data in the version of Dolma used to train OLMo v1? #177

Open
john-hewitt opened this issue Jul 15, 2024 · 3 comments

Comments

@john-hewitt

Hi everyone,

I'm working on a research project related to instruction following, and it would be amazing to have a language model with a guarantee that no explicit instruction-following data (e.g., from LIMA, Alpaca, etc.) was used during pretraining.

Some thoughts:

  • I don't have the disk space to build Dolma. Alas!
  • I realize that in Dolma v1.7, FLAN is explicitly included, so that's out.
  • I say "explicit" instruction-following data because lots of naturally occurring web data has instruction-response-like formats (Stack Overflow, etc.); that's fine. I'm just worried about the increasingly common practice of mixing explicit "instruction-following SFT data" into the pretraining process.
  • I know there's an n-gram viewer at https://wimbd.apps.allenai.org/about, and it says the TULU-style <|assistant|> n-gram shows up around 20M times, but it returns the identical count for assistant without the formatting, so I imagine the formatting is being stripped, which makes that number unusable here.

I realize data can leak in, so the answer is probably not "definitely not," but does anyone know if the answer is at least "not intentionally"?

See the corresponding OLMo issue; I wasn't sure how much information sharing there would be between the two repos: allenai/OLMo#658

Thanks!

@soldni
Member

soldni commented Jul 18, 2024

Hi John!

OLMo v1 was not explicitly trained on any instruction data. If any leakage occurred, I suspect it was through the code subset we trained OLMo on.

I would be surprised if WIMBD is stripping that formatting, but I can check. In any case, the vocab for the base OLMo model does not contain an <|assistant|> token, so <|assistant|> would get split into multiple tokens.
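
If you want to verify the splitting yourself, here's a minimal sketch; the allenai/OLMo-7B checkpoint name and the trust_remote_code flag are my assumptions about how the base model's tokenizer loads:

```python
from transformers import AutoTokenizer

# Checkpoint name is an assumption; the original OLMo releases
# needed trust_remote_code to load their custom tokenizer.
tok = AutoTokenizer.from_pretrained("allenai/OLMo-7B", trust_remote_code=True)

# If <|assistant|> were a single special token, this would print one token.
# With the base vocab it should split into several ordinary pieces.
ids = tok.encode("<|assistant|>")
print(tok.convert_ids_to_tokens(ids))
```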

@john-hewitt
Author

Hi Luca!!

Thanks so much for the response. That certainly helps clarify things for me. I'd also be curious about the WIMBD details, but no pressure there.

@yanaiela

Hey @john-hewitt,

With WIMBD, we have the online search index, which relies on Elasticsearch. It applies stop words and tokenization when indexing the data and filtering queries, which is why you got the same numbers for both queries.
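
Roughly what happens: the analyzer lowercases and splits on non-alphanumeric characters, so the <| and |> are discarded before indexing. Here's a minimal sketch approximating that behavior (the regex is my simplification, not Elasticsearch's actual implementation):

```python
import re

def standard_analyze(text: str) -> list[str]:
    # Rough stand-in for Elasticsearch's standard analyzer:
    # lowercase, then split on anything non-alphanumeric.
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

print(standard_analyze("<|assistant|>"))  # ['assistant']
print(standard_analyze("assistant"))      # ['assistant']
```

Both queries reduce to the same single term, hence the identical counts.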

We also have a search tool written in Rust (so it runs pretty fast) that doesn't apply these kinds of filters.
I just ran this query on Dolma v1.5 and found no instances of the <|assistant|> n-gram.

If you're interested in running it yourself (it doesn't require much memory, just storing the data locally), you can check it out here.
But if you're unable to store it, I could run a list of queries for you if you'd like.
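
And if you do end up with some shards locally, a plain substring count sidesteps any analyzer entirely. A minimal sketch; the shard path pattern and the "text" field name are assumptions about how the gzipped JSONL files are laid out:

```python
import glob
import gzip
import json

NEEDLE = "<|assistant|>"  # counted literally, formatting included
total = 0

# Hypothetical local layout: gzipped JSONL shards with a "text" field per doc.
for path in glob.glob("dolma/**/*.json.gz", recursive=True):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            total += doc.get("text", "").count(NEEDLE)

print(f"{NEEDLE!r}: {total} occurrences")
```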
