
continue training of already saved model (extending TrainRunner) #1065

Closed
vitkl opened this issue May 16, 2021 · 11 comments · Fixed by #1091
@vitkl
Contributor

vitkl commented May 16, 2021

It would be great to have the option to continue training an already saved model. @adamgayoso said this needs to go into the TrainRunner. One thing you need to do is combine the old and new training history, like this:

# assumes pandas has been imported as pd
if continue_training and self.is_trained_:
    # shift the index of the new ELBO history so it continues from the old one
    index = range(
        len(self.module.history_),
        len(self.module.history_) + len(trainer.logger.history["train_loss_epoch"]),
    )
    trainer.logger.history["train_loss_epoch"].index = index
    # append the new history to the existing one
    self.module.history_ = pd.concat(
        [self.module.history_, trainer.logger.history["train_loss_epoch"]]
    )
else:
    self.module.history_ = trainer.logger.history["train_loss_epoch"]
    self.history_ = self.module.history_
@njbernstein
Contributor

plus 1 for this

@adamgayoso
Member

There is a simple way where we just maintain the old history so it's not overwritten; however, if we also want to keep the whole optimizer state and learning rate scheduler state, we have to do more engineering. Thoughts? In the first (simple) case it would be a fresh optimizer and schedulers.
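
To make the two options concrete, here is a minimal plain-PyTorch sketch (not scvi-tools code; the stand-in model and file names are illustrative) of resuming with a fresh optimizer versus restoring the optimizer state:

import torch

model = torch.nn.Linear(10, 1)  # stand-in for the real module
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# "Simple" resume: restore only the model weights; the fresh optimizer
# loses Adam's running gradient moments and any scheduler progress.
model.load_state_dict(torch.load("model_params.pt"))

# "Complex" resume: restore the optimizer state as well, so training
# continues with the same gradient statistics and learning rate.
checkpoint = torch.load("checkpoint.pt")
model.load_state_dict(checkpoint["model_state"])
optimizer.load_state_dict(checkpoint["optimizer_state"])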

@njbernstein
Contributor

I only want the old history not overwritten, for what it's worth.

@adamgayoso
Member

In the more complicated case, we'd have to

  1. use save_hyperparameters in the training plans and avoid saving the modules (see here)
  2. the save function would have to create a pytorch lightning checkpoint from the trainer attribute of the model (self)

Then we could add a parameter like continue_from_checkpoint: Path to the train methods: you pass the path of the save directory, and the train method loads the training plan from that checkpoint.
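
A rough sketch of that flow using standard PyTorch Lightning calls (the training plan class, the save path, and the continue_from_checkpoint wiring are illustrative here, not existing scvi-tools API):

import pytorch_lightning as pl

class MyTrainingPlan(pl.LightningModule):
    def __init__(self, module, lr=1e-3):
        super().__init__()
        self.module = module
        # record constructor args so Lightning can rebuild the plan from a
        # checkpoint, but skip the module itself (point 1 above)
        self.save_hyperparameters(ignore=["module"])

# point 2: after training, save() would write a Lightning checkpoint
# from the model's trainer attribute:
#     self.trainer.save_checkpoint(save_dir + "/plan.ckpt")

# continue_from_checkpoint would then restore the plan, including the
# optimizer and scheduler states stored in the checkpoint:
#     plan = MyTrainingPlan.load_from_checkpoint(ckpt_path, module=module)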

@adamgayoso
Member

So in either simple or complex case, we can do the following:

Change this line
https://github.com/YosefLab/scvi-tools/blob/a0a608912aff56e94bb89b9e8c4f122a6c776500/scvi/train/_trainrunner.py#L75

to self.model.history = check_and_extend_history(self.trainer.logger.history), where the helper extends the existing history if it is not None.
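
A minimal sketch of what such a check_and_extend_history helper could look like (this helper does not exist in scvi-tools; the two-argument signature and the dict-of-DataFrames assumption are illustrative):

import pandas as pd

def check_and_extend_history(old_history, new_history):
    """Append each new per-metric DataFrame to the old history, if any."""
    if old_history is None:
        return new_history
    extended = {}
    for key, new_df in new_history.items():
        old_df = old_history.get(key)
        if old_df is None:
            extended[key] = new_df
            continue
        # continue the epoch index where the previous run stopped
        new_df = new_df.copy()
        new_df.index = range(len(old_df), len(old_df) + len(new_df))
        extended[key] = pd.concat([old_df, new_df])
    return extended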

@vitkl
Contributor Author

vitkl commented Jun 5, 2021

I think it's ok to create a new optimiser when continuing training (this is what pymc3 does by the way) - just load state param dict and continue history. @adamgayoso is this what you mean by a simple case?
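
For a Pyro-based model, the "load the param dict and keep going" part could look roughly like this, using Pyro's parameter store directly (a sketch, not scvi-tools API; the file name is illustrative):

import pyro

# after the first training run
pyro.get_param_store().save("params.pt")

# in a new session or cluster job: restore the learned parameters,
# then build a fresh optimizer and continue training
pyro.clear_param_store()
pyro.get_param_store().load("params.pt")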

@vitkl
Contributor Author

vitkl commented Jun 5, 2021

My use case for this is 'train -> save -> potentially start a new cluster job -> load -> continue training'. One problem with this, which I see now, is that when the saved model is loaded, one training step is run and the training history is lost - this will solve that issue, right?

@adamgayoso
Member

> I think it's ok to create a new optimiser when continuing training (this is what pymc3 does by the way)

For Pyro-based models we might not have a choice. In general, though, I think it would be nice to maintain the gradient information for optimizers like Adam (this is part of the "complex" solution, though easy with pytorch lightning).

> One problem with this, which I see now, is that when the saved model is loaded, one training step is run and the training history is lost - this will solve that issue, right?

Yes, but again, Pyro models need some special care.

@vitkl
Contributor Author

vitkl commented Jun 10, 2021

I see. In my opinion, a simple solution should just preserve the history, including when the models are loaded.

Is it necessary to train a loaded model for 1 iteration? If this is done just to initialise the guide properly, then maybe it could be done in evaluation mode - for example, by using svi.evaluate_loss rather than svi.step for both training and validation data? #1073
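
To illustrate the distinction, here is a generic Pyro sketch (not the scvi-tools training code): svi.step takes a gradient step, while svi.evaluate_loss only computes the loss, which is enough to trigger lazy initialization of an autoguide without touching the parameters.

import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoNormal
from pyro.optim import Adam

def model(x):
    loc = pyro.sample("loc", dist.Normal(0.0, 1.0))
    with pyro.plate("data", x.shape[0]):
        pyro.sample("obs", dist.Normal(loc, 1.0), obs=x)

guide = AutoNormal(model)
svi = SVI(model, guide, Adam({"lr": 0.01}), loss=Trace_ELBO())

x = torch.randn(100)
svi.evaluate_loss(x)  # initializes guide parameters, no parameter update
# svi.step(x)         # would also take an optimization step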

@vitkl
Contributor Author

vitkl commented Jun 21, 2021

So this commit, f9652f2, solves the issue for loading models, but keeping the history when continuing training remains to be addressed, right?

@adamgayoso
Member

Yes, this issue remains to be addressed (we are getting there). The commit you referenced fixes the loading issue for Pyro models.
