
Resume training does not work for multi-GPU training #23

Closed
forever208 opened this issue Jan 21, 2022 · 23 comments

Comments

@forever208

forever208 commented Jan 21, 2022

I add --resume_checkpoint $path_to_checkpoint$ to continue training. It works for a single GPU, but does not work with multiple GPUs.

The code gets stuck here:

Logging to /proj/ihorse_2021/users/x_manni/guided-diffusion/log9
creating model and diffusion...
creating data loader...
start training...
loading model from checkpoint: /proj/ihorse_2021/users/x_manni/guided-diffusion/log9/model200000.pt...
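
For reference, the multi-GPU run looks roughly like this (the GPU count, flag variables, and paths below are placeholders following the usual MPI launch convention, not my exact command):

$ mpiexec -n 4 python scripts/image_train.py --data_dir path/to/images $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS --resume_checkpoint path/to/model200000.pt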

@VigneshSrinivasan10

@forever208 I have the same problem and the code gets stuck forever.

On further investigation, I found that the test script image_sample.py reloads the model.
Here is the difference:
The test script reloads the model before placing it on CUDA.
However, the training script already has the model on CUDA, and this leads to the hang.
Upon debugging, I found that the code gets stuck on this line in dist_util.py:

MPI.COMM_WORLD.bcast(data[i : i + chunk_size])

It is not clear to me why this fails.
Any pointers in fixing this problem would be greatly appreciated.
Thanks in advance.
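
Update: I believe the reason is that MPI broadcasts are collective operations, so every rank in the communicator has to call bcast; if only rank 0 reaches this line, that rank blocks forever waiting for the others. A minimal illustration of the pattern with mpi4py (not the repo's code, just the failure mode):

```python
# Illustration: why a rank-0-only broadcast hangs. Run with e.g.
#   mpiexec -n 2 python bcast_demo.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
data = {"weights": [1.0, 2.0, 3.0]} if comm.Get_rank() == 0 else None

# Correct: every rank participates in the collective call.
data = comm.bcast(data, root=0)
print(f"rank {comm.Get_rank()} got {data}")

# Buggy pattern (what a rank-0-only guard around the checkpoint load does):
# if comm.Get_rank() == 0:
#     comm.bcast(data, root=0)  # rank 0 waits forever for the other ranks
```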

@bahjat-kawar

@VigneshSrinivasan10 Not sure if it's the same issue, but I also found the code to stop at the same line when running classifier_train.py.
I fixed the issue by moving the load_state_dict call out of the if dist.get_rank() == 0 block on line 51.
Hope this helps.
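
If it helps, the change looks roughly like this (paraphrased from memory; the exact lines and surrounding code in your copy of classifier_train.py may differ slightly):

```python
# Before (hangs on multi-GPU): only rank 0 reaches the load, but
# dist_util.load_state_dict broadcasts the checkpoint via MPI, which is a
# collective call that every rank must enter.
if args.resume_checkpoint:
    resume_step = parse_resume_step_from_filename(args.resume_checkpoint)
    if dist.get_rank() == 0:
        logger.log(f"loading model from checkpoint: {args.resume_checkpoint}...")
        model.load_state_dict(
            dist_util.load_state_dict(args.resume_checkpoint, map_location=dist_util.dev())
        )

# After: every rank calls dist_util.load_state_dict; only the logging stays
# inside the rank-0 guard.
if args.resume_checkpoint:
    resume_step = parse_resume_step_from_filename(args.resume_checkpoint)
    if dist.get_rank() == 0:
        logger.log(f"loading model from checkpoint: {args.resume_checkpoint}...")
    model.load_state_dict(
        dist_util.load_state_dict(args.resume_checkpoint, map_location=dist_util.dev())
    )
```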

@VigneshSrinivasan10

@bahjat-kawar Thanks for the tip, and sorry for the delay in my response.
Your suggestion fixed the problem.

@VigneshSrinivasan10

@bahjat-kawar Although the model now reloads successfully, I still see the loss go to NaN after resuming training for a few iterations. All three .pt files were reloaded, but the issue persists. I assumed the opt.pt file holds the optimizer state, which should let training continue.

Did you also face this issue?

@JiamingLiu-Jeremy

@VigneshSrinivasan10 I met a similar problem. Any progress on fixing this issue?

@forever208
Author


Solution: remove the if dist.get_rank() == 0 check in train_util.py when loading checkpoints, because every GPU needs to load the checkpoint.
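
Concretely, in TrainLoop._load_and_sync_parameters the change is roughly the following (a sketch; your copy of train_util.py may differ slightly):

```python
def _load_and_sync_parameters(self):
    resume_checkpoint = find_resume_checkpoint() or self.resume_checkpoint

    if resume_checkpoint:
        self.resume_step = parse_resume_step_from_filename(resume_checkpoint)
        if dist.get_rank() == 0:
            logger.log(f"loading model from checkpoint: {resume_checkpoint}...")
        # No longer guarded by dist.get_rank() == 0: dist_util.load_state_dict
        # broadcasts the checkpoint bytes over MPI, so every rank must call it,
        # otherwise the broadcast blocks forever.
        self.model.load_state_dict(
            dist_util.load_state_dict(resume_checkpoint, map_location=dist_util.dev())
        )

    dist_util.sync_params(self.model.parameters())
```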

@ONobody

ONobody commented Mar 5, 2023

@forever208 Hello, do you use the opt, model, or ema .pt file for resume_checkpoint?
Or do you put all their .pt files in one folder and pass that folder as the resume_checkpoint path?

@forever208
Author

@ONobody I use the model checkpoint (model*.pt) to resume training (the corresponding ema and opt checkpoints will be loaded as well). Use the ema checkpoint for sampling.

@ONobody

ONobody commented Mar 5, 2023

@forever208 When I continue training, is it something like
python image_train.py --resume_checkpoint path/modelXX.pt? Thank you very much.

@forever208
Author

forever208 commented Mar 5, 2023

@ONobody Exactly.
If you run into further trouble, take a look at this pull request: Fix resumed model training for Multi-GPUs

@ONobody

ONobody commented Mar 5, 2023

@forever208 Thank you very much.

@ONobody

ONobody commented Mar 6, 2023

@forever208 Hello, I would like to ask how to train classifier guidance on my own dataset.
Do I need to change any code?
I keep running into errors.

@forever208
Author

@ONobody I have no experience with classifier guidance, so I'm afraid I can't help you in this case.

@ONobody

ONobody commented Mar 6, 2023

@forever208 What about computing FID, IS, and other evaluation metrics?
I don't know how to calculate them.

@forever208
Author

@ONobody the authors provide instructions here: https://github.com/openai/guided-diffusion/tree/main/evaluations

@ONobody

ONobody commented Mar 7, 2023

@forever208
The diffusion model I trained is on my own dataset.
How do I evaluate it? Thank you.

@forever208
Author

@ONobody if your dataset only has one class, you can randomly draw 50k samples from it to form the reference_batch. Then generate 50k samples using your trained model and compute the FID by running the script

$ python evaluator.py reference_batch.npz 50k_samples.npz

If your own dataset has more than 1 class, you'd better use the whole training set as the reference_batch.

remember to convert your data into .npz format
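
In case it helps, packing a folder of images into an .npz batch can look roughly like this (a sketch, assuming the evaluator reads a uint8 array of shape [N, 256, 256, 3] from the archive, like the provided reference batches; the folder path is a placeholder):

```python
# Sketch: pack a folder of 256x256 images into an .npz batch for evaluator.py.
import os
import numpy as np
from PIL import Image

def folder_to_npz(image_dir, out_path="reference_batch.npz"):
    arrays = []
    for name in sorted(os.listdir(image_dir)):
        if name.lower().endswith((".png", ".jpg", ".jpeg")):
            img = Image.open(os.path.join(image_dir, name)).convert("RGB")
            arrays.append(np.asarray(img, dtype=np.uint8))  # (256, 256, 3)
    batch = np.stack(arrays)  # (N, 256, 256, 3), uint8
    np.savez(out_path, batch)  # stored under the default key "arr_0"
    print(f"saved {batch.shape} to {out_path}")

folder_to_npz("path/to/my_dataset_256")
```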

@ONobody

ONobody commented Mar 7, 2023

@forever208
The images in my dataset are not 256,
but the images I generate are 256.
Do I need to resize my dataset images to 256?
Thank you.

@forever208
Author

@ONobody you have to keep them the same size. For example, your training data must be resized to 256 before training. Then your model generates 256x256 samples.
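
A minimal resizing sketch, in case it's useful (paths are placeholders; this uses PIL's default resampling and does no center-cropping, so adjust the preprocessing to match your data):

```python
# Sketch: resize a folder of images to 256x256 before training/evaluation.
import os
from PIL import Image

src, dst = "path/to/raw_images", "path/to/images_256"
os.makedirs(dst, exist_ok=True)
for name in os.listdir(src):
    if name.lower().endswith((".png", ".jpg", ".jpeg")):
        img = Image.open(os.path.join(src, name)).convert("RGB")
        img.resize((256, 256)).save(os.path.join(dst, name))
```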

@ONobody

ONobody commented Mar 7, 2023

@forever208 So for the evaluation,
I resize my own dataset to 256x256
and then convert it to .npz format, right?

@forever208
Author

forever208 commented Mar 7, 2023

@ONobody convert the training data to 256x256 --> train the model --> sample 50k images (256x256) from the model --> convert both the reference batch (256x256) and the 50k samples (256x256) into .npz files --> compute FID

@ONobody

ONobody commented Mar 8, 2023

@forever208 Thank you very much.
When I convert my dataset to .npz format and then calculate FID, I get an error (see the attached screenshot, 图片1).
Is my conversion wrong?

@open11012

@VigneshSrinivasan10 Not sure if it's the same issue, but I also found the code to stop at the same line when running classifier_train.py. I fixed the issue by moving the load_state_dict call out of the if dist.get_rank() == 0 block on line 51. Hope this helps.

Thanks! I met the same problem, and this works in my code!
