
If you have trouble in training in DDP, here is THE solution #91

Open

wanyunfeiAlex opened this issue May 30, 2023 · 6 comments

Comments


wanyunfeiAlex commented May 30, 2023

Note that I have successfully trained on multiple GPUs (8x A100). I would like to upload the code, but my lab's regulations do not allow it, so instead I will outline the steps you need to follow.

Here are the instructions:

  1. I rebuilt the DDP procedure following this reference:
    https://github.com/rentainhe/pytorch-distributed-training/tree/master
    The write-up is in Chinese, so you can follow any other accessible DDP tutorial instead.

  2. As for the dataloader, model construction, and optimizer, you can copy them directly from train_ddp.py (a minimal setup sketch covering steps 1-4 is attached at the end of this post).

  3. In the ModelWithLoss class, the from_logits parameter of TverskyLoss needs to be set to True, otherwise the loss values come out wrong.

  4. Still in the ModelWithLoss class, the mode parameter of both TverskyLoss and FocalLossSeg needs to be set to self.model.seg_mode, otherwise the training process collapses.

  5. During the training iterations there is a loss check that makes sure the loss is not nan/inf. That is reasonable on a single card, but it causes a hang in the multi-card scenario. For example, card A detects a nan and skips the iteration, while card B does not see a nan/inf and marches into the backward pass, where it waits for gradients from card A that will never arrive, so training hangs forever. To fix this, you have to use all_gather (torch.distributed.all_gather): once any card detects a nan/inf, every card has to skip that step together (a sketch of this follows the list).
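
To make the fix in step 5 concrete, here is a minimal sketch of the synchronized skip. It assumes the process group is already initialized and that every rank calls the helper on every iteration; the helper name is mine, not from the repo.

```python
import torch
import torch.distributed as dist

def loss_is_finite_everywhere(loss: torch.Tensor) -> bool:
    """True only if the loss is finite on every rank.

    Every rank must call this on every iteration, otherwise the
    collective call itself will hang.
    """
    # 1.0 means "this rank saw a nan/inf", 0.0 means the loss is fine
    local_flag = torch.tensor(
        [0.0 if torch.isfinite(loss).all() else 1.0], device=loss.device
    )
    flags = [torch.zeros_like(local_flag) for _ in range(dist.get_world_size())]
    dist.all_gather(flags, local_flag)          # exchange flags across all cards
    return all(f.item() == 0.0 for f in flags)

# inside the training loop:
#     loss = criterion(outputs, targets)
#     if not loss_is_finite_everywhere(loss):
#         optimizer.zero_grad(set_to_none=True)
#         continue                              # every rank skips together
#     loss.backward()
#     optimizer.step()
```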

PS: I suspect that quite a bit of valid data may end up being skipped under this fix, so I would train for more epochs to compensate for this drawback.
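
For steps 1-4, the skeleton of the rewritten script might look roughly like the sketch below. TverskyLoss, FocalLossSeg, and seg_mode come from this repo's train_ddp.py; build_model, build_dataset, and num_epochs are placeholders for the code you copy from there, not actual names.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def main():
    # torchrun sets LOCAL_RANK for every spawned process
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # step 2: model / dataset / optimizer construction copied from train_ddp.py
    model = build_model().cuda(local_rank)

    # steps 3-4: from_logits=True and mode=model.seg_mode on the segmentation losses
    tversky_loss = TverskyLoss(mode=model.seg_mode, from_logits=True)
    focal_loss = FocalLossSeg(mode=model.seg_mode)

    model = DDP(model, device_ids=[local_rank])

    train_set = build_dataset()
    sampler = DistributedSampler(train_set, shuffle=True)
    loader = DataLoader(train_set, batch_size=8, sampler=sampler, num_workers=4)

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)   # reshuffle differently on every epoch
        for batch in loader:
            ...                    # forward / nan-inf check / backward as in step 5

if __name__ == "__main__":
    main()
```

Launch with torchrun, e.g. torchrun --nproc_per_node=8 your_ddp_script.py, so that LOCAL_RANK is set for each process.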

@YuMuYe0930

Hello, I ran into problems when using train_ddp.py for multi-GPU training. Is there a good way to solve this?

@wanyunfeiAlex
Author

Hello, I ran into problems when using train_ddp.py for multi-GPU training. Is there a good way to solve this?

The author's version doesn't run as-is; you need to rewrite it following the steps I listed above.

@YuMuYe0930

OK, thanks, I'll give it a try. I'd also like to ask: my custom multi-class detection training never converges and the results are very poor. Do you have any suggestions?

@wanyunfeiAlex
Author

OK, thanks, I'll give it a try. I'd also like to ask: my custom multi-class detection training never converges and the results are very poor. Do you have any suggestions?

Lower the learning rate / freeze part of the training parameters first / first check whether the model can overfit a small batch of data. Look at these in the context of your own network.
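
For example, a quick sanity check along those lines could look like this (model, train_set, and criterion are placeholders for your own network, dataset, and loss, and the "head" name filter is just an illustration):

```python
import torch
from torch.utils.data import DataLoader, Subset

# 1) freeze everything except (for example) the detection head
for name, p in model.named_parameters():
    p.requires_grad = "head" in name

# 2) lower the learning rate
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

# 3) try to overfit ~32 fixed samples; if the loss cannot be driven close to 0
#    here, the problem is in the model / labels, not in the amount of data
tiny_loader = DataLoader(Subset(train_set, range(32)), batch_size=8, shuffle=True)
for epoch in range(200):
    for images, targets in tiny_loader:
        optimizer.zero_grad(set_to_none=True)
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
```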

@YuMuYe0930

OK, I'll try it. I'm just starting to learn this, thank you.

@wanyunfeiAlex
Author

OK, I'll try it. I'm just starting to learn this, thank you.

Good luck!
