
If you have trouble in training in DDP, here is THE solution #91

Open

wanyunfeiAlex opened this issue May 30, 2023 · 6 comments

Comments


wanyunfeiAlex commented May 30, 2023

Note that I have successfully trained on multiple GPUs (8x A100). I would like to upload the code, but my lab's regulations do not allow it, so instead I will outline the steps you need to follow.

Here are the instructions:

  1. I rebuilt the DDP procedure following this reference:
    https://github.com/rentainhe/pytorch-distributed-training/tree/master
    The write-up is in Chinese, so you can follow any other accessible DDP tutorial instead.

  2. As for the dataloader, model construction, and optimizer, you can copy them directly from train_ddp.py (a minimal setup sketch covering steps 1-4 is attached at the end of this post).

  3. In the ModelWithLoss class, the from_logits parameter of TverskyLoss needs to be set to True, otherwise the loss values come out wrong.

  4. Still in the ModelWithLoss class, the mode parameter of both TverskyLoss and FocalLossSeg needs to be set to self.model.seg_mode, otherwise the training process collapses.

  5. During the training iterations there is a loss check that makes sure the loss is not nan/inf. That is reasonable on a single card, but it causes a hang in the multi-card scenario. For example, card A detects a nan and skips the iteration, while card B does not see a nan/inf and marches into the backward pass, where it waits for gradients from card A that will never arrive, so training hangs forever. To fix this, you have to use all_gather (torch.distributed.all_gather): once any card detects a nan/inf, every card has to skip that step together (a sketch of this follows the list).
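
To make the fix in step 5 concrete, here is a minimal sketch of the synchronized skip. It assumes the process group is already initialized and that every rank calls the helper on every iteration; the helper name is mine, not from the repo.

```python
import torch
import torch.distributed as dist

def loss_is_finite_everywhere(loss: torch.Tensor) -> bool:
    """True only if the loss is finite on every rank.

    Every rank must call this on every iteration, otherwise the
    collective call itself will hang.
    """
    # 1.0 means "this rank saw a nan/inf", 0.0 means the loss is fine
    local_flag = torch.tensor(
        [0.0 if torch.isfinite(loss).all() else 1.0], device=loss.device
    )
    flags = [torch.zeros_like(local_flag) for _ in range(dist.get_world_size())]
    dist.all_gather(flags, local_flag)          # exchange flags across all cards
    return all(f.item() == 0.0 for f in flags)

# inside the training loop:
#     loss = criterion(outputs, targets)
#     if not loss_is_finite_everywhere(loss):
#         optimizer.zero_grad(set_to_none=True)
#         continue                              # every rank skips together
#     loss.backward()
#     optimizer.step()
```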

PS: I suspect that quite a bit of valid data may end up being skipped under this fix, so I would train for more epochs to compensate for this drawback.
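
For steps 1-4, the skeleton of the rewritten script might look roughly like the sketch below. TverskyLoss, FocalLossSeg, and seg_mode come from this repo's train_ddp.py; build_model, build_dataset, and num_epochs are placeholders for the code you copy from there, not actual names.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def main():
    # torchrun sets LOCAL_RANK for every spawned process
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # step 2: model / dataset / optimizer construction copied from train_ddp.py
    model = build_model().cuda(local_rank)

    # steps 3-4: from_logits=True and mode=model.seg_mode on the segmentation losses
    tversky_loss = TverskyLoss(mode=model.seg_mode, from_logits=True)
    focal_loss = FocalLossSeg(mode=model.seg_mode)

    model = DDP(model, device_ids=[local_rank])

    train_set = build_dataset()
    sampler = DistributedSampler(train_set, shuffle=True)
    loader = DataLoader(train_set, batch_size=8, sampler=sampler, num_workers=4)

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)   # reshuffle differently on every epoch
        for batch in loader:
            ...                    # forward / nan-inf check / backward as in step 5

if __name__ == "__main__":
    main()
```

Launch with torchrun, e.g. torchrun --nproc_per_node=8 your_ddp_script.py, so that LOCAL_RANK is set for each process.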

@YuMuYe0930

Hello, I ran into problems when using train_ddp.py for multi-GPU training. Is there a good way to solve this?

@wanyunfeiAlex
Author

Hello, I ran into problems when using train_ddp.py for multi-GPU training. Is there a good way to solve this?

The author's version doesn't run as-is; you need to rewrite it following the steps I listed above.

@YuMuYe0930

OK, thanks, I'll give it a try. I'd also like to ask: my custom multi-class detection training never converges and the results are very poor. Do you have any suggestions?

@wanyunfeiAlex
Author

OK, thanks, I'll give it a try. I'd also like to ask: my custom multi-class detection training never converges and the results are very poor. Do you have any suggestions?

Lower the learning rate / freeze part of the training parameters first / first check whether the model can overfit a small batch of data. Look at these in the context of your own network.
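
For example, a quick sanity check along those lines could look like this (model, train_set, and criterion are placeholders for your own network, dataset, and loss, and the "head" name filter is just an illustration):

```python
import torch
from torch.utils.data import DataLoader, Subset

# 1) freeze everything except (for example) the detection head
for name, p in model.named_parameters():
    p.requires_grad = "head" in name

# 2) lower the learning rate
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

# 3) try to overfit ~32 fixed samples; if the loss cannot be driven close to 0
#    here, the problem is in the model / labels, not in the amount of data
tiny_loader = DataLoader(Subset(train_set, range(32)), batch_size=8, shuffle=True)
for epoch in range(200):
    for images, targets in tiny_loader:
        optimizer.zero_grad(set_to_none=True)
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
```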

@YuMuYe0930

OK, I'll try it. I'm just starting to learn this, thank you.

@wanyunfeiAlex
Author

OK, I'll try it. I'm just starting to learn this, thank you.

Good luck!
