-
Notifications
You must be signed in to change notification settings - Fork 109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add node_rank to torchrun cmd if rdzv_backend is 'static' #761
Conversation
# TODO 'node_rank' is made optional as it currently does not work with the AWS Batch scheduler. | ||
# node_rank is only used when rdzv_backend is 'static' | ||
if rdzv_backend == "static": | ||
cmd += ["--node_rank", f"{macros.replica_id}"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a unit test for rdzv_backend=="static"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change specifically for the AWS Batch use-case? And if so, is the intention to make it easier for users to find rank0's log stream by ensuring that the AWS Batch assigned node0 always runs ddp's rank0?
If so, using static
rdzv_backend on AWS Batch won't work as intended because the static
rdzv_backend was added to ensure backwards compatibility with torch.distributed.launch
in torch versions less than 1.9. What the static
rdzv_backend does NOT do is to "wait" for full rank before forking the actual ddp processes. Since AWS Batch is not a true gang scheduler, the way AWS Batch creates MNP (multi-node-parallel) jobs, is that it first provisions node0
, then starts adding additional nodes one-by-one until the job gets the requested number of nodes. This process can take up to 45minutes (depending on the availability of the instance types, size of the docker image, etc) and therefore it forces you to set the process group timeout when calling dist.init_process_group()
to something larger than 45min. Since the default pg is also used to timeout collective calls, GPU/node failures now take up to 45minutes to detect which is probably what you don't want.
A better solution is to just pin rank0 on AWS Batch's node0 as done in this sample PR and still use rdzv_backend="c10d"
. This however has the drawback that ONLY rank0 is pinned to node0. But in most cases this is what folks want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is specifically not for AWS Batch since this broke on the AWS Batch scheduler due to the macro expansion not being correct. I believe this is being used on the Ray scheduler
This was originally reverted in #759
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @d4l3k added a unit test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And yes @kiukchung, this PR is not intended to fix the AWS Batch issue. I'm working with the Ray Scheduler and have found a need for the static rdzv_backend.
Codecov Report
@@ Coverage Diff @@
## main #761 +/- ##
==========================================
+ Coverage 92.78% 92.81% +0.02%
==========================================
Files 96 96
Lines 6073 6084 +11
==========================================
+ Hits 5635 5647 +12
+ Misses 438 437 -1
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 6 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
3d32e73
to
0a63c84
Compare
* add node_rank to torchrun cmd if rdzv_backend is 'static' * add unit test for static backend option
* add node_rank to torchrun cmd if rdzv_backend is 'static' * add unit test for static backend option
#751
#759
This PR adds the torchrun parameter
node_rank
in the cmd creation ofddp()
. This parameter is needed if a user wants to usestatic
rdzv_backend instead ofc10d
.The additional parameter is only added when the rdzv_backend is set to
static
. For more details on why it has been implemented this way please see #759Test plan:
This has been tested locally against
dist_test.py
with all 10 tests passing.