
continue training from checkpoint_best.pth instead of checkpoint_latest.pth #2563

Open
romainVala opened this issue Oct 21, 2024 · 5 comments

romainVala commented Oct 21, 2024

Hi

I do not understand why you chose to continue training from checkpoint_latest.pth instead of checkpoint_best.pth.
checkpoint_latest.pth is saved every 50 epochs, so when we restart we may lose up to 49 epochs.
In my case a single training epoch takes a very long time (up to 2000 s sometimes), so losing 50 epochs is equivalent to losing about 27 hours of computing ...

I found a way around it by simply removing checkpoint_latest.pth from the training log dir.
see

if not isfile(expected_checkpoint_file):

It would be more efficient to compare the dates (or directly the epochs) of checkpoint_best.pth and checkpoint_latest.pth and choose the more recent one.

But maybe I am missing something?
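
To make the suggestion concrete, here is a rough sketch of the comparison I have in mind (a hypothetical helper, not actual nnU-Net code; it assumes each checkpoint dict stores a 'current_epoch' entry):

```python
import torch
from os.path import isfile, join


def pick_resume_checkpoint(fold_output_folder):
    """Hypothetical helper: return whichever of checkpoint_latest.pth /
    checkpoint_best.pth was written at the later epoch.
    Assumes the checkpoint dict contains a 'current_epoch' entry."""
    candidates = []
    for name in ("checkpoint_latest.pth", "checkpoint_best.pth"):
        path = join(fold_output_folder, name)
        if isfile(path):
            ckpt = torch.load(path, map_location="cpu")
            candidates.append((ckpt.get("current_epoch", -1), path))
    if not candidates:
        return None  # nothing to resume from, start from scratch
    # keep the checkpoint with the highest stored epoch number
    return max(candidates)[1]
```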

fitzjalen commented Oct 21, 2024

Yeah - good idea

romainVala (Author) commented

Is there a way to change the parameter that defines the save period of 50 epochs?
In my case of very long training I would like to save every 10 epochs.
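
For what it's worth, a minimal sketch of one way this might be done, assuming nnU-Net v2's nnUNetTrainer exposes a save_every attribute that controls the periodic checkpoint interval (the trainer class name below is made up):

```python
# Sketch only: custom trainer that saves checkpoint_latest.pth every 10 epochs
# instead of the default 50, assuming `save_every` is the attribute that
# controls the periodic save interval in nnUNetTrainer.
from nnunetv2.training.nnUNetTrainer.nnUNetTrainer import nnUNetTrainer


class nnUNetTrainer_saveEvery10(nnUNetTrainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.save_every = 10  # stock trainer uses 50
```

If that attribute is indeed what drives the periodic save, the class could be placed alongside the other trainers and selected with nnUNetv2_train's -tr option (e.g. -tr nnUNetTrainer_saveEvery10).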

gaojh135 commented

> Is there a way to change the parameter that defines the save period of 50 epochs? In my case of very long training I would like to save every 10 epochs.

gaojh135 commented

> I do not understand why you chose to continue training from checkpoint_latest.pth instead of checkpoint_best.pth. checkpoint_latest.pth is saved every 50 epochs, so when we restart we may lose up to 49 epochs. [...]
>
> It would be more efficient to compare the dates (or directly the epochs) of checkpoint_best.pth and checkpoint_latest.pth and choose the more recent one.

nnUNetv2_train
--val_best [OPTIONAL] If set, the validation will be performed with the checkpoint_best instead of checkpoint_final. NOT COMPATIBLE with --disable_checkpointing! WARNING: This will use the same 'validation' folder as the regular validation with no way of distinguishing the two!

fitzjalen commented

> nnUNetv2_train
> --val_best [OPTIONAL] If set, the validation will be performed with the checkpoint_best instead of checkpoint_final. NOT COMPATIBLE with --disable_checkpointing! WARNING: This will use the same 'validation' folder as the regular validation with no way of distinguishing the two!

The question was about how to continue training, not about validation.
