
Correctly support resuming from checkpoint with a dataset without length #33544

Open · wants to merge 1 commit into base: main

Conversation

@muupan (Author) commented on Sep 17, 2024

What does this PR do?

There is an inconsistency in Trainer's behavior between training from scratch and resuming from a checkpoint when the given dataset has no length, such as datasets.IterableDataset. For a reproducible example, see #26413 (comment). This PR fixes the inconsistency by correctly supporting resuming from a checkpoint with such a dataset.

Fixes #26413

Current behavior

When training starts with a dataset that has no length, Trainer assumes one epoch equals max_steps steps and tries to train for that many steps. There are two possible scenarios:

  • A. If the dataset yields enough samples, the training finishes precisely after one epoch.
  • B. If the dataset raises StopIteration before yielding enough samples for max_steps steps, Trainer increments the current epoch and re-iterates the dataset.

When resuming from a checkpoint, Trainer simply skips the first batches until it reaches the checkpoint's global_step. In scenario A, there is no problem. In scenario B, the dataset raises StopIteration during the skipping, but Trainer does not re-iterate the dataset. Instead, it just finishes training with a warning. This is inconsistent with what happens when training from scratch, and it contradicts what the documentation of max_steps says:

max_steps (`int`, *optional*, defaults to -1):
If set to a positive number, the total number of training steps to perform. Overrides `num_train_epochs`.
For a finite dataset, training is reiterated through the dataset (if all data is exhausted) until
`max_steps` is reached.
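
For illustration, here is a minimal sketch of scenario B. The tiny checkpoint, the generator, and the hyperparameters below are illustrative choices, not the exact reproduction linked in #26413:

```python
from datasets import IterableDataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

def gen():
    # Only 8 samples, fewer than max_steps, so one pass over the data is not
    # enough and the Trainer has to re-iterate the dataset (scenario B).
    for _ in range(8):
        yield {"input_ids": [0, 1, 2], "labels": [0, 1, 2]}

train_dataset = IterableDataset.from_generator(gen)  # has no length

model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")
args = TrainingArguments(
    output_dir="out",
    max_steps=20,
    per_device_train_batch_size=1,
    save_steps=10,
    report_to="none",
)

# Training from scratch re-iterates the dataset and reaches step 20.
Trainer(model=model, args=args, train_dataset=train_dataset).train()

# Resuming from step 10 needs to skip 10 batches, but the dataset yields only
# 8 before StopIteration; currently the run then ends early with a warning.
Trainer(model=model, args=args, train_dataset=train_dataset).train(
    resume_from_checkpoint="out/checkpoint-10"
)
```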

Solution

This PR modifies the skipping behavior so that Trainer now re-iterates the dataset until it catches up with the checkpoint's global_step. A caveat is that it does not support the ignore_data_skip option, since Trainer then does not know which epoch to start from. I am also concerned that the logic is becoming too complicated. A rough sketch of the idea is shown below.
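
The sketch below illustrates the intended behavior in simplified form; it is not the actual diff, and the helper name and signature are made up for illustration:

```python
def skip_to_global_step(train_dataloader, num_batches_to_skip):
    """Skip batches already seen before the checkpoint, re-iterating the
    dataloader whenever it is exhausted. Returns the iterator to resume from
    and the number of extra passes over the data started while skipping."""
    epochs_restarted = 0
    skipped = 0
    epoch_iterator = iter(train_dataloader)
    while skipped < num_batches_to_skip:
        try:
            next(epoch_iterator)
            skipped += 1
        except StopIteration:
            # The data ran out before catching up with the checkpoint's
            # global_step, so start another pass over the dataset, mirroring
            # training from scratch instead of ending with a warning.
            epochs_restarted += 1
            epoch_iterator = iter(train_dataloader)
    return epoch_iterator, epochs_restarted
```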

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker

@LysandreJik (Member) commented:

Very impressive PR @muupan!

I'm pinging @muellerzr and @SunMarc to take a look; Zach is off for a few weeks and will take a look as soon as he's back. Thank you for your patience 🙏

LysandreJik requested review from SunMarc and muellerzr and removed the review request for SunMarc on September 18, 2024.
@SunMarc (Member) commented on Sep 27, 2024

Thanks for the PR @muupan! We will review it shortly. There is a new feature in accelerate that enables you to use a stateful dataloader, so that we don't need to iterate over the data to resume training. Feel free to give it a try; note that the support is very experimental for now.
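
Roughly, using that experimental feature might look like the sketch below. This is an assumption-laden illustration: it presumes a recent accelerate release whose DataLoaderConfiguration exposes use_stateful_dataloader (backed by torchdata's StatefulDataLoader), and the toy dataset is made up:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from accelerate.utils import DataLoaderConfiguration

# Made-up toy data; any dataset would do.
dataset = TensorDataset(torch.arange(1000).float().unsqueeze(1))

accelerator = Accelerator(
    dataloader_config=DataLoaderConfiguration(use_stateful_dataloader=True)
)
dataloader = accelerator.prepare(DataLoader(dataset, batch_size=8))

it = iter(dataloader)
for _ in range(10):              # consume some batches, as a training loop would
    next(it)
state = dataloader.state_dict()  # checkpoint the dataloader's position

# On resume, restore the position instead of manually skipping batches.
dataloader.load_state_dict(state)
```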

muupan force-pushed the feature/resume-training-with-iterable-dataset branch from 6f83505 to bc56f1c on October 31, 2024.
@muupan (Author) commented on Oct 31, 2024

It seems the code got broken after rebasing onto main, where #34198 renamed the epoch_iterator variable. I will fix it.

@SunMarc (Member) commented on Nov 5, 2024

Let us know when it is done!

Successfully merging this pull request may close these issues.

resume_from_checkpoint function fails because "There seems to be not a single sample in your epoch_iterator"