
continue training from checkpoint_best.pth instead of checkpoint_latest.pth #2563

Open
romainVala opened this issue Oct 21, 2024 · 5 comments

romainVala commented Oct 21, 2024

Hi

I do not understand why you chose to continue training from checkpoint_latest.pth instead of checkpoint_best.pth.
checkpoint_latest.pth is saved every 50 epochs, so when we restart we may lose up to 49 epochs.
In my case a single training epoch takes a very long time (up to 2000 s sometimes), so losing 50 epochs is equivalent to losing about 27 hours of computing ...

I found a way around it by simply removing checkpoint_latest.pth from the training log dir.
see

if not isfile(expected_checkpoint_file):

It would be more efficient to compare the dates (or directly the epochs) of checkpoint_best.pth and checkpoint_latest.pth and choose the more recent one.

But maybe I am missing something?
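
To make the suggestion concrete, here is a rough sketch of the comparison I have in mind (a hypothetical helper, not actual nnU-Net code; it assumes each checkpoint dict stores a 'current_epoch' entry):

```python
import torch
from os.path import isfile, join


def pick_resume_checkpoint(fold_output_folder):
    """Hypothetical helper: return whichever of checkpoint_latest.pth /
    checkpoint_best.pth was written at the later epoch.
    Assumes the checkpoint dict contains a 'current_epoch' entry."""
    candidates = []
    for name in ("checkpoint_latest.pth", "checkpoint_best.pth"):
        path = join(fold_output_folder, name)
        if isfile(path):
            ckpt = torch.load(path, map_location="cpu")
            candidates.append((ckpt.get("current_epoch", -1), path))
    if not candidates:
        return None  # nothing to resume from, start from scratch
    # keep the checkpoint with the highest stored epoch number
    return max(candidates)[1]
```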

fitzjalen commented Oct 21, 2024

Yeah - good idea

romainVala (Author) commented

Is there a way to change the parameter that defines the save period of 50 epochs?
In my case of very long training I would like to save every 10 epochs.
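
For what it's worth, a minimal sketch of one way this might be done, assuming nnU-Net v2's nnUNetTrainer exposes a save_every attribute that controls the periodic checkpoint interval (the trainer class name below is made up):

```python
# Sketch only: custom trainer that saves checkpoint_latest.pth every 10 epochs
# instead of the default 50, assuming `save_every` is the attribute that
# controls the periodic save interval in nnUNetTrainer.
from nnunetv2.training.nnUNetTrainer.nnUNetTrainer import nnUNetTrainer


class nnUNetTrainer_saveEvery10(nnUNetTrainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.save_every = 10  # stock trainer uses 50
```

If that attribute is indeed what drives the periodic save, the class could be placed alongside the other trainers and selected with nnUNetv2_train's -tr option (e.g. -tr nnUNetTrainer_saveEvery10).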

gaojh135 commented

> Is there a way to change the parameter that defines the save period of 50 epochs? In my case of very long training I would like to save every 10 epochs.

gaojh135 commented

> I do not understand why you chose to continue training from checkpoint_latest.pth instead of checkpoint_best.pth. checkpoint_latest.pth is saved every 50 epochs, so when we restart we may lose up to 49 epochs. [...]
>
> It would be more efficient to compare the dates (or directly the epochs) of checkpoint_best.pth and checkpoint_latest.pth and choose the more recent one.

nnUNetv2_train
--val_best [OPTIONAL] If set, the validation will be performed with the checkpoint_best instead of checkpoint_final. NOT COMPATIBLE with --disable_checkpointing! WARNING: This will use the same 'validation' folder as the regular validation with no way of distinguishing the two!

fitzjalen commented

> nnUNetv2_train
> --val_best [OPTIONAL] If set, the validation will be performed with the checkpoint_best instead of checkpoint_final. NOT COMPATIBLE with --disable_checkpointing! WARNING: This will use the same 'validation' folder as the regular validation with no way of distinguishing the two!

The question was about how to continue training, not about validation.
