
Problem of distributed training #36

Open
eagles1812 opened this issue Sep 22, 2022 · 2 comments

Comments

@eagles1812

Thanks for the great paper, dataset and code!

I tried to train the model with the ready data on a single GPU, and it took roughly half a day. So I added a distributed training component; the training time decreased, but so did the AP/AR/IoU values. Have you tested distributed training? How should the parameters be set to keep training time short while preserving the AP/AR/IoU values?

Thank you!

@jrebut
Contributor

jrebut commented Sep 29, 2022

Hi, can you please explain what you mean by a distributed training component?

Julien

@eagles1812
Author

Thanks for your reply. In your code suite, you use a single GPU for training; for a large dataset such as yours, that takes a very long time. I have multiple GPUs and wanted to decrease the training time, so I modified your training code to add a distributed training component, following articles such as https://towardsdatascience.com/how-to-scale-training-on-multiple-gpus-dae1041f49d2, and then ran into the problem I described in the previous post. Thanks!
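
For reference, a minimal sketch of the kind of modification described above, assuming a standard PyTorch training loop launched with `torchrun`; `build_model`, `build_dataset`, `compute_loss`, and the hyperparameters below are illustrative placeholders, not names from this repository:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def main():
    # One process per GPU, launched with `torchrun --nproc_per_node=<num_gpus> train.py`;
    # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = build_model().cuda(local_rank)        # build_model(): placeholder for the repo's model constructor
    model = DDP(model, device_ids=[local_rank])

    dataset = build_dataset()                     # build_dataset(): placeholder for the repo's dataset
    sampler = DistributedSampler(dataset, shuffle=True)
    loader = DataLoader(dataset, batch_size=4, sampler=sampler,
                        num_workers=4, pin_memory=True)

    # With N GPUs the effective batch size becomes N x the per-GPU batch size.
    # Keeping the single-GPU learning rate is a common cause of degraded metrics;
    # the linear scaling rule (base lr * world size), often with warmup, is the usual fix.
    base_lr = 1e-4                                # illustrative value
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr * dist.get_world_size())

    num_epochs = 100                              # illustrative value
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)                  # reshuffle the data shards differently each epoch
        for batch in loader:
            loss = compute_loss(model, batch)     # compute_loss(): placeholder for the repo's loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

One frequent cause of an AP/AR/IoU drop in this setup is that the effective batch size grows with the number of GPUs while the single-GPU learning rate is kept unchanged, hence the `lr * world_size` scaling shown in the sketch.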
