Auto3DSeg + segresnet: code for resuming partially-trained folds #7506

pwrightkcl · 2024-02-29T15:41:53Z

My Auto3DSeg use case is this:

Using segresnet
Running Auto3DSeg on a kubernetes cluster with a job manager
The job manager sometimes evicts jobs and restarts the Docker container. By default, Auto3DSeg then restarts training from the beginning (or, if you are using AutoRunner, you can skip partially trained folds).
I want to be able to resume folds that were evicted part way through training from their last saved checkpoint.

I have separate scripts for data analysis, training, etc. to help with parallelisation, based on this tutorial notebook. Note that I recently learned you can also run selected steps using AutoRunner. Read more in this reply. I think resuming requires a standalone script, but I welcome more info.

The short solution is to add these keyword arguments to the train call on a BundleAlgo object:

{"pretrained_ckpt_name": "/path/to/model_final.pt", "continue": True}
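Because `continue` is a reserved word in Python, these cannot be written as literal keyword arguments (`algo.train(continue=True)` is a syntax error). They go into a dict instead, which is how the script further down passes them to `algo.train`. A minimal sketch, in which the helper name `resume_params` is hypothetical and not part of MONAI:

```python
def resume_params(ckpt_file, extra=None):
    """Build the parameter dict that asks segresnet to resume from a checkpoint.

    `resume_params` is a hypothetical helper for illustration, not a MONAI function.
    """
    params = dict(extra or {})  # copy so the caller's dict is not mutated
    params.update({"pretrained_ckpt_name": str(ckpt_file), "continue": True})
    return params


train_param = resume_params("/path/to/model_final.pt")
# train_param would then be handed over as: algo.train(train_param)
```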

Note again that I am using segresnet, and this solution probably won't work for dints and swinunetr (it might work for segresnet2d but I haven't tested it). The keyword arguments are referenced in the segresnet code here and here. If you dig into the scripts for the other algo types, you may be able to come up with a similar solution.

Also note that when I check whether the checkpoint file has the full number of epochs, I have to adjust the epoch counter, because segresnet adjusts the number of epochs internally using num_crops_per_image. I don't fully understand how that works, so I would appreciate any correction. It looks like the epoch counter in the checkpoint increments using the correction factor as a step size. For example, if you have 2 crops per image and 400 epochs, it will run for 200 epochs and count from 0 to 398 in steps of 2, rather than 0 to 199.
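The counter arithmetic can be checked in isolation. The sketch below only mirrors the formulas used in the script in this post, including its cap of the step size at 3 (`min(3, num_crops_per_image)`); it is not taken from the segresnet source, and the function names are hypothetical. The remaining-epoch figure is an estimate, just as it is in the script.

```python
def epoch_step(num_crops_per_image: int) -> int:
    # Assumption copied from the script in this post: the step size is capped at 3.
    return min(3, num_crops_per_image)


def training_status(epoch: int, num_epochs: int, num_crops_per_image: int):
    """Return (finished, remaining) using the same arithmetic as the script."""
    step = epoch_step(num_crops_per_image)
    finished = epoch + step >= num_epochs
    remaining = 0 if finished else int((num_epochs - epoch) / step)
    return finished, remaining


# The example from the text: 2 crops per image, 400 epochs.
counter_values = list(range(0, 400, epoch_step(2)))
print(len(counter_values), counter_values[-1])  # 200 398
print(training_status(398, 400, 2))  # (True, 0)
print(training_status(198, 400, 2))  # (False, 101)
```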

Below is my code for training with resuming built in. I'm not a MONAI dev, just a user, so it is supplied "as is" in the hope it will save other users time. You will need to test it out in your own environment.

import argparse
from pathlib import Path
from multiprocessing import freeze_support

from monai.apps.auto3dseg import (
    import_bundle_algo_history,
)
from monai.auto3dseg import algo_to_pickle
from monai.utils.enums import AlgoKeys
import torch


def main(work_dir: Path):
    freeze_support()

    print(f"Training models in {work_dir}.")

    if not work_dir.exists():
        raise FileNotFoundError(f"Could not find work directory {work_dir}")

    history = import_bundle_algo_history(str(work_dir), only_trained=False)
    print(f"Found {len(history)} model instances.")

    for algo_dict in history:
        train_param = {}  # Add any extra parameters here

        algo = algo_dict[AlgoKeys.ALGO]

        if algo_dict[AlgoKeys.IS_TRAINED]:
            # Load the most recent checkpoint and retrieve info to determine if training is partial or complete.
            ckpt_file = Path(algo.output_path) / "model" / "model_final.pt"
            checkpoint = torch.load(ckpt_file, map_location="cpu")
            epoch = checkpoint.get("epoch")
            # Compare with None: a checkpoint saved at epoch 0 is valid but falsy.
            if epoch is None:
                raise ValueError(f"Checkpoint has no key 'epoch': {ckpt_file}")
            config = checkpoint.get("config")
            if not config:
                raise ValueError(f"Checkpoint has no key 'config': {ckpt_file}")
            num_epochs = config.get("num_epochs")
            if not num_epochs:
                raise ValueError(f"Checkpoint config has no key 'num_epochs': {ckpt_file}")
            num_crops_per_image = config.get("num_crops_per_image")
            if not num_crops_per_image:
                raise ValueError(f"Checkpoint config has no key 'num_crops_per_image': {ckpt_file}")

            epoch_1ind = epoch + min(3, num_crops_per_image)
            if epoch_1ind >= num_epochs:
                print(f"Skipping training on {algo_dict[AlgoKeys.ID]} because it has already trained for {epoch_1ind} epochs.")
                continue

            print(f"Resuming training on {algo_dict[AlgoKeys.ID]} from epoch {epoch}.")
            remaining_epochs = int((num_epochs - epoch) / min(3, num_crops_per_image))
            print(f"Remaining epochs: {remaining_epochs}.")
            train_param.update(
                {"pretrained_ckpt_name": str(ckpt_file), "continue": True}
            )
        else:
            print(f"Beginning training on {algo_dict[AlgoKeys.ID]}.")

        algo.train(train_param)
        acc = algo.get_score()
        algo_to_pickle(algo, template_path=algo.template_path, best_metric=acc)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("work_dir", type=Path, help="Path to work directory (required)")
    args = parser.parse_args()
    main(work_dir=args.work_dir)
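The metadata checks can also be exercised without a GPU or a real checkpoint by feeding plain dicts to a small helper. This is only a sketch: `read_progress` is a hypothetical name, and the keys (`epoch`, `config`, `num_epochs`, `num_crops_per_image`) are simply the ones the script above reads. It treats an epoch of 0 as valid, since 0 is falsy in Python and a plain truthiness test would report it as missing.

```python
def read_progress(checkpoint: dict):
    """Return (epoch, num_epochs, num_crops_per_image) or raise ValueError.

    Hypothetical helper; key names follow the script above.
    """
    epoch = checkpoint.get("epoch")
    if epoch is None:  # epoch 0 is a legitimate value
        raise ValueError("Checkpoint has no key 'epoch'")
    config = checkpoint.get("config")
    if not config:
        raise ValueError("Checkpoint has no key 'config'")
    values = []
    for key in ("num_epochs", "num_crops_per_image"):
        value = config.get(key)
        if not value:  # missing or zero; zero would break the later division
            raise ValueError(f"Checkpoint config has no usable '{key}'")
        values.append(value)
    return (epoch, *values)


fake_ckpt = {"epoch": 0, "config": {"num_epochs": 400, "num_crops_per_image": 2}}
print(read_progress(fake_ckpt))  # (0, 400, 2)
```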

falqa · 2024-08-12T15:31:04Z

Worked great, was able to continue my training by meshing this into my train command. Saved me days of almost wasted training, thank you!
