We do not use Horovod for PyTorch distributed training. We use our own multinode launchers that adhere to PyTorch's distributed training format. A sketch of what such a launcher looks like is below.
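To make the idea concrete, here is a minimal, hypothetical launcher sketch in the spirit of what's described above (and of `torch.distributed.launch`). It is not the repo's actual launcher; the argument names and defaults are assumptions. It spawns one training process per local GPU and sets the environment variables that PyTorch's `env://` initialization expects.

```python
# Hypothetical minimal launcher following PyTorch's env:// distributed format.
# Run it once per node, with --node_rank set differently on each node.
import argparse
import os
import subprocess
import sys

parser = argparse.ArgumentParser()
parser.add_argument("--nnodes", type=int, default=1)
parser.add_argument("--node_rank", type=int, default=0)
parser.add_argument("--nproc_per_node", type=int, default=1)
parser.add_argument("--master_addr", default="127.0.0.1")
parser.add_argument("--master_port", default="29500")
parser.add_argument("training_script")
parser.add_argument("script_args", nargs=argparse.REMAINDER)
args = parser.parse_args()

world_size = args.nnodes * args.nproc_per_node
procs = []
for local_rank in range(args.nproc_per_node):
    env = os.environ.copy()
    # These are the variables torch.distributed's env:// init method reads.
    env["MASTER_ADDR"] = args.master_addr
    env["MASTER_PORT"] = args.master_port
    env["WORLD_SIZE"] = str(world_size)
    env["RANK"] = str(args.node_rank * args.nproc_per_node + local_rank)
    env["LOCAL_RANK"] = str(local_rank)
    cmd = [sys.executable, args.training_script] + args.script_args
    procs.append(subprocess.Popen(cmd, env=env))

for p in procs:
    p.wait()
```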
PyTorch's multinode format is MPI-like and fairly similar to Horovod's, so you may be able to use Horovod alongside it, but I can't say for sure.
Here's a more up-to-date example of PyTorch distributed training for language modeling. You should be able to set similar environment variables to run multinode training.
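For reference, here is a minimal worker-side sketch of PyTorch distributed training using those environment variables. It assumes `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`, and `LOCAL_RANK` have already been set by a launcher (such as the sketch above or `torch.distributed.launch`); the model and training loop are placeholders, not the actual LM code from this repo.

```python
# Worker-side sketch: initialize torch.distributed from environment variables
# and train a placeholder model with DistributedDataParallel.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# env:// init reads MASTER_ADDR / MASTER_PORT / WORLD_SIZE / RANK.
dist.init_process_group(backend="nccl", init_method="env://")

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder for the language model
model = DDP(model, device_ids=[local_rank])

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for _ in range(10):
    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()   # DDP all-reduces gradients across ranks here
    optimizer.step()

dist.destroy_process_group()
```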
Can we have a version of distributed training with Horovod? We want to speed up LM training via a cluster of GPU machines.