I'm working on a research project relating to instruction following, and it would be amazing to have a language model with a guarantee that no explicit instruction-following data (e.g., from LIMA, Alpaca, etc.) was used during pretraining.
Some thoughts:
I don't have the disk space to build Dolma. Alas!
I realize that in Dolma v1.7, FLAN is explicitly included, so that's out.
I say "explicitly" instruction-following data because lots of naturally occurring web data has an instruction-response-like format (Stack Overflow, etc.), and that is fine. I'm just worried about the increasingly common practice of mixing explicit "instruction following SFT data" into the pretraining process.
I know there's an n-gram viewer at https://wimbd.apps.allenai.org/about, and it says the TULU-style <|assistant|> n-gram shows up around 20M times. However, it returns the identical count for the bare word assistant without the formatting, so I imagine it's stripping the special characters before matching, which means the count can't tell me whether the chat-template marker itself appears.
I realize data can leak in, so the answer is probably not "definitely not" but does anyone know if the answer is at least "not intentionally"?
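To make the n-gram point concrete, here is a minimal sketch of the kind of check I'd want, assuming I could stream even a small sample of the pretraining documents as plain strings. The marker list and the `count_marker_docs` helper are my own hypothetical choices, not anything from Dolma or WIMBD. The idea is just exact substring matching, so the `<|` and `|>` delimiters are not stripped:

```python
from collections import Counter
from typing import Iterable

# Hypothetical marker list; extend with whatever chat templates matter to you.
MARKERS = ("<|assistant|>", "<|user|>", "### Instruction:", "### Response:")

def count_marker_docs(docs: Iterable[str], markers=MARKERS) -> Counter:
    """Count how many documents contain each marker as an exact substring."""
    counts = Counter({m: 0 for m in markers})
    for text in docs:
        for m in markers:
            if m in text:  # literal match, no tokenization or normalization
                counts[m] += 1
    return counts

# Toy stand-in for a streamed sample of pretraining documents.
sample = [
    "<|user|>\nWhat is 2+2?\n<|assistant|>\n4",
    "The assistant handed me the report.",
    "### Instruction:\nSummarize.\n### Response:\nDone.",
]

counts = count_marker_docs(sample)
# Bare-word count, for comparison with the formatted marker count.
bare_assistant = sum("assistant" in doc for doc in sample)
```

On the toy sample, the formatted marker and the bare word give different counts, which is exactly the distinction the viewer seems to lose.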
See corresponding Dolma request; wasn't sure how much information sharing there would be between the two: allenai/dolma#177
Thanks!