The deduplicated dataset has fewer sequences, and to keep a consistent token count with the non-deduplicated version, the models are trained for ~1.5 epochs (as discussed in the README). Between epochs, is the data reshuffled, or does the dataloader simply start again from the beginning in the same order? If the latter, is there a way to know exactly which checkpoint is the first to see the same data twice? Put differently, is there a way to know which sequences the model sees in the additional ~half epoch?
Hey! I had similar questions a while back for a paper in which we used the Pythia suite. To the best of my understanding, the answer is that it's ~1.5 epochs, with roughly the first half of the data (in the same order) seen twice. The Pythia paper reports the total number of tokens the models see and how many they see in the first pass; based on those numbers, I use the step98000 checkpoint as my full "single pass" checkpoint. I believe the checkpoints after that start "seeing double."
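The step98000 figure follows from a back-of-the-envelope calculation. A sketch of that arithmetic, assuming the approximate figures from the Pythia paper (batch size 1024, sequence length 2048, ~207B tokens in the deduplicated Pile, 143k total training steps for ~300B tokens):

```python
# Rough token accounting for the deduplicated Pythia runs.
# All numbers are approximate and taken from the Pythia paper.
batch_size = 1024                        # sequences per optimizer step
seq_len = 2048                           # tokens per sequence
tokens_per_step = batch_size * seq_len   # ~2.1M tokens consumed per step

dedup_pile_tokens = 207e9    # ~207B tokens in the deduplicated Pile
total_steps = 143_000        # full training run

# Step at which the first full pass over the data ends.
single_pass_steps = dedup_pile_tokens / tokens_per_step
print(f"first pass ends around step {single_pass_steps:,.0f}")

# Total tokens seen over training (~1.45 epochs of the dedup set).
total_tokens = total_steps * tokens_per_step
print(f"total tokens seen: {total_tokens / 1e9:.0f}B")
print(f"epochs: {total_tokens / dedup_pile_tokens:.2f}")
```

With these numbers the first pass ends shortly after step 98,000, which is why step98000 is the last saved checkpoint that has seen each token at most once.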
Between epochs, is the data reshuffled, or does the dataloader simply start again from the beginning in the same order?
The answer seems to be that the dataloader does NOT simply start again from the beginning. The concatenation happens at the document level, i.e., before the "packing" into fixed-length sequences. This means the same document's tokens can appear at different positions within a sequence on the second pass.
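A minimal sketch of why positions shift, using toy "tokenized documents" (this is an illustration of document-level concatenation followed by greedy packing, not the actual GPT-NeoX dataloader):

```python
# Toy tokenized documents; each inner list is one document's tokens.
docs = [[1, 1, 1], [2, 2], [3, 3, 3, 3]]

def pack(token_stream, seq_len):
    """Greedily cut a flat token stream into fixed-length sequences."""
    return [token_stream[i:i + seq_len] for i in range(0, len(token_stream), seq_len)]

# Concatenate one full pass plus a partial second pass at the *document* level,
# and only then pack the combined stream into sequences.
stream = [tok for doc in docs for tok in doc]        # first pass over all docs
stream += [tok for doc in docs[:2] for tok in doc]   # partial second pass

for seq in pack(stream, seq_len=4):
    print(seq)
```

In the first pass, document `[1, 1, 1]` starts at position 0 of the first sequence; in the second pass its tokens land mid-sequence, because the packing boundary no longer coincides with the document boundary. That is exactly why you cannot recover "which sequences are seen twice" from sequence indices alone: it is documents, not packed sequences, that repeat.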
cc @haileyschoelkopf