Using RDMA capable nodes #34
Comments
I also noticed that NCCL_IB_DISABLE (an environment variable) is set to 1 by the pretrain AML environment (or maybe by the Docker image). See
https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/env.html. I wonder whether the authors hit any blocking issues using InfiniBand/RDMA. @aashna
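A quick way to confirm what the job actually sees is to print the variable from inside the training process. This is only a minimal sketch: the helper name `ib_disabled` is hypothetical, and it assumes (per the NCCL environment documentation linked above) that a value of `1` is what turns the InfiniBand transport off.

```python
import os


def ib_disabled(env=None):
    """Return True if NCCL_IB_DISABLE is set to "1" in the given mapping.

    `env` defaults to the current process environment. The helper name is
    hypothetical and exists only for illustration.
    """
    env = os.environ if env is None else env
    # Unset or "0" means NCCL is allowed to use the InfiniBand transport.
    return env.get("NCCL_IB_DISABLE", "").strip() == "1"


if __name__ == "__main__":
    print("NCCL_IB_DISABLE =", os.environ.get("NCCL_IB_DISABLE"))
    print("InfiniBand disabled for NCCL:", ib_disabled())
```

Running this at the start of the training script would show whether the value comes from the environment the job runs in, though it cannot tell whether the AML environment or the Docker image set it.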
When I tried the pretraining on ND24rs (RDMA/infiniband), I got the following error:
I think NCCL_IB_DISABLE should be set to 0 (or left unset), but I haven't tried that yet.
After checking with the AzureML folks, it turns out I have to use Intel MPI as the backend when I use nodes without SR-IOV support.
See "Accelerating Distributed Training in Azure Machine Learning service using SR-IOV". If you have access to NCv3 or NDv2, you can take advantage of the faster GPU interconnect. SR-IOV support should come to NCv2 and NDv1 later in 2020. Without SR-IOV, NCCL needs "NCCL_IB_DISABLE": "1" to disable InfiniBand on RDMA-capable VMs (e.g., ND24rs).
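For anyone wiring this into a job config, the rule can be sketched as a small helper. This is an illustration rather than anything from the repo: `nccl_env` and its `sriov_supported` flag are hypothetical names, and it assumes the NCCL convention that `NCCL_IB_DISABLE=1` turns the InfiniBand transport off while `0` leaves it available.

```python
def nccl_env(sriov_supported: bool) -> dict:
    """Build the NCCL-related environment variables for a node type.

    Hypothetical helper: with SR-IOV, NCCL may use InfiniBand, so the
    transport is left enabled ("0"); without SR-IOV it is disabled ("1").
    """
    return {"NCCL_IB_DISABLE": "0" if sriov_supported else "1"}


if __name__ == "__main__":
    # Example: an RDMA-capable VM without SR-IOV support (e.g., ND24rs).
    print(nccl_env(sriov_supported=False))
    # Example: a series with SR-IOV support (e.g., NCv3 or NDv2).
    print(nccl_env(sriov_supported=True))
```

The resulting dictionary would still need to be passed to whatever mechanism the job uses to set environment variables, which is not shown here.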
Is there a reason for using Standard_NC24s_v3 rather than the RDMA capable Standard_NC24rs_v3?