
Resume training does not work for multi-GPU training #23

Closed
forever208 opened this issue Jan 21, 2022 · 23 comments

Comments

@forever208

forever208 commented Jan 21, 2022

I add --resume_checkpoint $path_to_checkpoint$ to continue training. It works for a single GPU, but does not work with multiple GPUs.

The code gets stuck here:

Logging to /proj/ihorse_2021/users/x_manni/guided-diffusion/log9
creating model and diffusion...
creating data loader...
start training...
loading model from checkpoint: /proj/ihorse_2021/users/x_manni/guided-diffusion/log9/model200000.pt...
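
For reference, the multi-GPU run looks roughly like this (the GPU count, flag variables, and paths below are placeholders following the usual MPI launch convention, not my exact command):

$ mpiexec -n 4 python scripts/image_train.py --data_dir path/to/images $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS --resume_checkpoint path/to/model200000.pt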

@VigneshSrinivasan10

@forever208 I have the same problem and the code gets stuck forever.

On further investigation, I found that the test script image_sample.py reloads the model.
Here is the difference:
The test script reloads the model before placing it on CUDA.
However, the training script already has the model on CUDA, and this leads to the hang.
Upon debugging, I found that the code gets stuck on this line in dist_util.py:

MPI.COMM_WORLD.bcast(data[i : i + chunk_size])

It is not clear to me why this fails.
Any pointers in fixing this problem would be greatly appreciated.
Thanks in advance.
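
Update: I believe the reason is that MPI broadcasts are collective operations, so every rank in the communicator has to call bcast; if only rank 0 reaches this line, that rank blocks forever waiting for the others. A minimal illustration of the pattern with mpi4py (not the repo's code, just the failure mode):

```python
# Illustration: why a rank-0-only broadcast hangs. Run with e.g.
#   mpiexec -n 2 python bcast_demo.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
data = {"weights": [1.0, 2.0, 3.0]} if comm.Get_rank() == 0 else None

# Correct: every rank participates in the collective call.
data = comm.bcast(data, root=0)
print(f"rank {comm.Get_rank()} got {data}")

# Buggy pattern (what a rank-0-only guard around the checkpoint load does):
# if comm.Get_rank() == 0:
#     comm.bcast(data, root=0)  # rank 0 waits forever for the other ranks
```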

@bahjat-kawar

@VigneshSrinivasan10 Not sure if it's the same issue, but I also found the code to stop at the same line when running classifier_train.py.
I fixed the issue by moving the load_state_dict call out of the if dist.get_rank() == 0 block on line 51.
Hope this helps.
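
If it helps, the change looks roughly like this (paraphrased from memory; the exact lines and surrounding code in your copy of classifier_train.py may differ slightly):

```python
# Before (hangs on multi-GPU): only rank 0 reaches the load, but
# dist_util.load_state_dict broadcasts the checkpoint via MPI, which is a
# collective call that every rank must enter.
if args.resume_checkpoint:
    resume_step = parse_resume_step_from_filename(args.resume_checkpoint)
    if dist.get_rank() == 0:
        logger.log(f"loading model from checkpoint: {args.resume_checkpoint}...")
        model.load_state_dict(
            dist_util.load_state_dict(args.resume_checkpoint, map_location=dist_util.dev())
        )

# After: every rank calls dist_util.load_state_dict; only the logging stays
# inside the rank-0 guard.
if args.resume_checkpoint:
    resume_step = parse_resume_step_from_filename(args.resume_checkpoint)
    if dist.get_rank() == 0:
        logger.log(f"loading model from checkpoint: {args.resume_checkpoint}...")
    model.load_state_dict(
        dist_util.load_state_dict(args.resume_checkpoint, map_location=dist_util.dev())
    )
```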

@VigneshSrinivasan10

@bahjat-kawar Thanks for the tip, and sorry for the delay in my response.
Your suggestion fixed the problem.

@VigneshSrinivasan10

@bahjat-kawar Although the model now reloads successfully, I still see the loss go to NaN after resuming training for a few iterations. All three .pt files were reloaded, but the issue persists. I assumed the opt.pt file holds the optimizer state, which should let training continue.

Did you also face this issue?

@JiamingLiu-Jeremy

@VigneshSrinivasan10 I met a similar problem. Any progress on fixing this issue?

@forever208
Author


Solution: remove the if dist.get_rank() == 0 check in train_util.py when loading checkpoints, because every GPU needs to load the checkpoint.
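
Concretely, in TrainLoop._load_and_sync_parameters the change is roughly the following (a sketch; your copy of train_util.py may differ slightly):

```python
def _load_and_sync_parameters(self):
    resume_checkpoint = find_resume_checkpoint() or self.resume_checkpoint

    if resume_checkpoint:
        self.resume_step = parse_resume_step_from_filename(resume_checkpoint)
        if dist.get_rank() == 0:
            logger.log(f"loading model from checkpoint: {resume_checkpoint}...")
        # No longer guarded by dist.get_rank() == 0: dist_util.load_state_dict
        # broadcasts the checkpoint bytes over MPI, so every rank must call it,
        # otherwise the broadcast blocks forever.
        self.model.load_state_dict(
            dist_util.load_state_dict(resume_checkpoint, map_location=dist_util.dev())
        )

    dist_util.sync_params(self.model.parameters())
```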

@ONobody

ONobody commented Mar 5, 2023

@forever208 Hello, do you use the opt, model, or ema .pt file for resume_checkpoint?
Or do you put all their .pt files in one folder and pass that folder as the resume_checkpoint path?

@forever208
Author

@ONobody I use the model checkpoint (model*.pt) to resume training (the corresponding ema and opt checkpoints will be loaded as well). Use the ema checkpoint for sampling.

@ONobody

ONobody commented Mar 5, 2023

@forever208 When I continue training, is it something like
python image_train.py --resume_checkpoint path/modelXX.pt? Thank you very much.

@forever208
Author

forever208 commented Mar 5, 2023

@ONobody Exactly.
If you run into further trouble, take a look at this pull request: Fix resumed model training for Multi-GPUs

@ONobody

ONobody commented Mar 5, 2023

@forever208 Thank you very much.

@ONobody

ONobody commented Mar 6, 2023

@forever208 Hello, I would like to ask how to train classifier guidance on my own dataset.
Do I need to change any code?
I keep running into errors.

@forever208
Author

@ONobody I have no experience with classifier guidance, so I'm afraid I can't help you in this case.

@ONobody

ONobody commented Mar 6, 2023

@forever208 What about computing FID, IS, and other evaluation metrics?
I don't know how to calculate them.

@forever208
Author

@ONobody the authors provide instructions here: https://github.com/openai/guided-diffusion/tree/main/evaluations

@ONobody

ONobody commented Mar 7, 2023

@forever208
The diffusion model I trained is on my own dataset.
How do I evaluate it? Thank you.

@forever208
Author

@ONobody if your dataset only has one class, you can randomly draw 50k samples from it to form the reference_batch. Then generate 50k samples using your trained model and compute the FID by running the script

$ python evaluator.py reference_batch.npz 50k_samples.npz

If your own dataset has more than 1 class, you'd better use the whole training set as the reference_batch.

remember to convert your data into .npz format
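
In case it helps, packing a folder of images into an .npz batch can look roughly like this (a sketch, assuming the evaluator reads a uint8 array of shape [N, 256, 256, 3] from the archive, like the provided reference batches; the folder path is a placeholder):

```python
# Sketch: pack a folder of 256x256 images into an .npz batch for evaluator.py.
import os
import numpy as np
from PIL import Image

def folder_to_npz(image_dir, out_path="reference_batch.npz"):
    arrays = []
    for name in sorted(os.listdir(image_dir)):
        if name.lower().endswith((".png", ".jpg", ".jpeg")):
            img = Image.open(os.path.join(image_dir, name)).convert("RGB")
            arrays.append(np.asarray(img, dtype=np.uint8))  # (256, 256, 3)
    batch = np.stack(arrays)  # (N, 256, 256, 3), uint8
    np.savez(out_path, batch)  # stored under the default key "arr_0"
    print(f"saved {batch.shape} to {out_path}")

folder_to_npz("path/to/my_dataset_256")
```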

@ONobody

ONobody commented Mar 7, 2023

@forever208
The images in my dataset are not 256,
but the images I generate are 256.
Do I need to resize my dataset images to 256?
Thank you.

@forever208
Author

@ONobody you have to keep them the same size. For example, your training data must be resized to 256 before training. Then your model generates 256x256 samples.
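
A minimal resizing sketch, in case it's useful (paths are placeholders; this uses PIL's default resampling and does no center-cropping, so adjust the preprocessing to match your data):

```python
# Sketch: resize a folder of images to 256x256 before training/evaluation.
import os
from PIL import Image

src, dst = "path/to/raw_images", "path/to/images_256"
os.makedirs(dst, exist_ok=True)
for name in os.listdir(src):
    if name.lower().endswith((".png", ".jpg", ".jpeg")):
        img = Image.open(os.path.join(src, name)).convert("RGB")
        img.resize((256, 256)).save(os.path.join(dst, name))
```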

@ONobody

ONobody commented Mar 7, 2023

@forever208 So for the evaluation,
I resize my own dataset to 256x256
and then convert it to .npz format, right?

@forever208
Author

forever208 commented Mar 7, 2023

@ONobody convert the training data to 256x256 --> train the model --> sample 50k images (256x256) from the model --> convert both the reference batch (256x256) and the 50k samples (256x256) into .npz files --> compute FID

@ONobody

ONobody commented Mar 8, 2023

@forever208 Thank you very much.
When I convert my dataset to .npz format and then calculate FID, I get an error (see the attached screenshot, 图片1).
Is my conversion wrong?

@open11012

@VigneshSrinivasan10 Not sure if it's the same issue, but I also found the code to stop at the same line when running classifier_train.py. I fixed the issue by moving the load_state_dict call out of the if dist.get_rank() == 0 block on line 51. Hope this helps.

Thanks! I met the same problem, and this works in my code!
