We do not use Horovod for PyTorch distributed training. We use our own multinode launchers that adhere to PyTorch's distributed training format. A sketch of what such a launcher looks like is below.
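To make the idea concrete, here is a minimal, hypothetical launcher sketch in the spirit of what's described above (and of `torch.distributed.launch`). It is not the repo's actual launcher; the argument names and defaults are assumptions. It spawns one training process per local GPU and sets the environment variables that PyTorch's `env://` initialization expects.

```python
# Hypothetical minimal launcher following PyTorch's env:// distributed format.
# Run it once per node, with --node_rank set differently on each node.
import argparse
import os
import subprocess
import sys

parser = argparse.ArgumentParser()
parser.add_argument("--nnodes", type=int, default=1)
parser.add_argument("--node_rank", type=int, default=0)
parser.add_argument("--nproc_per_node", type=int, default=1)
parser.add_argument("--master_addr", default="127.0.0.1")
parser.add_argument("--master_port", default="29500")
parser.add_argument("training_script")
parser.add_argument("script_args", nargs=argparse.REMAINDER)
args = parser.parse_args()

world_size = args.nnodes * args.nproc_per_node
procs = []
for local_rank in range(args.nproc_per_node):
    env = os.environ.copy()
    # These are the variables torch.distributed's env:// init method reads.
    env["MASTER_ADDR"] = args.master_addr
    env["MASTER_PORT"] = args.master_port
    env["WORLD_SIZE"] = str(world_size)
    env["RANK"] = str(args.node_rank * args.nproc_per_node + local_rank)
    env["LOCAL_RANK"] = str(local_rank)
    cmd = [sys.executable, args.training_script] + args.script_args
    procs.append(subprocess.Popen(cmd, env=env))

for p in procs:
    p.wait()
```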
PyTorch's multinode format is MPI-like and fairly similar to Horovod's, so you may be able to use Horovod alongside it, but I can't say for sure.
Here's a more up-to-date example of PyTorch distributed training for language modeling. You should be able to set similar environment variables to run multinode training.
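For reference, here is a minimal worker-side sketch of PyTorch distributed training using those environment variables. It assumes `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`, and `LOCAL_RANK` have already been set by a launcher (such as the sketch above or `torch.distributed.launch`); the model and training loop are placeholders, not the actual LM code from this repo.

```python
# Worker-side sketch: initialize torch.distributed from environment variables
# and train a placeholder model with DistributedDataParallel.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# env:// init reads MASTER_ADDR / MASTER_PORT / WORLD_SIZE / RANK.
dist.init_process_group(backend="nccl", init_method="env://")

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder for the language model
model = DDP(model, device_ids=[local_rank])

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for _ in range(10):
    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()   # DDP all-reduces gradients across ranks here
    optimizer.step()

dist.destroy_process_group()
```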
Can we have a version of distributed training with Horovod? We want to speed up LM training via a cluster of GPU machines.