add node_rank to torchrun cmd if rdzv_backend is 'static' #761

MichaelClifford · 2023-08-17T21:26:10Z

This PR adds the torchrun parameter node_rank in the cmd creation of ddp(). This parameter is needed if a user wants to use static rdzv_backend instead of c10d.

The additional parameter is only added when the rdzv_backend is set to static. For more details on why it has been implemented this way please see #759

Test plan:
This has been tested locally against dist_test.py with all 10 tests passing.

d4l3k · 2023-08-17T21:29:14Z

torchx/components/dist.py

+    # TODO 'node_rank' is made optional as it currently does not work with the AWS Batch scheduler.
+    # node_rank is only used when rdzv_backend is 'static'
+    if rdzv_backend == "static":
+        cmd += ["--node_rank", f"{macros.replica_id}"]


can you add a unit test for rdzv_backend=="static"

Is this change specifically for the AWS Batch use-case? And if so, is the intention to make it easier for users to find rank0's log stream by ensuring that the AWS Batch assigned node0 always runs ddp's rank0?

If so, using static rdzv_backend on AWS Batch won't work as intended because the static rdzv_backend was added to ensure backwards compatibility with torch.distributed.launch in torch versions less than 1.9. What the static rdzv_backend does NOT do is to "wait" for full rank before forking the actual ddp processes. Since AWS Batch is not a true gang scheduler, the way AWS Batch creates MNP (multi-node-parallel) jobs, is that it first provisions node0, then starts adding additional nodes one-by-one until the job gets the requested number of nodes. This process can take up to 45minutes (depending on the availability of the instance types, size of the docker image, etc) and therefore it forces you to set the process group timeout when calling dist.init_process_group() to something larger than 45min. Since the default pg is also used to timeout collective calls, GPU/node failures now take up to 45minutes to detect which is probably what you don't want.

A better solution is to just pin rank0 on AWS Batch's node0 as done in this sample PR and still use rdzv_backend="c10d". This however has the drawback that ONLY rank0 is pinned to node0. But in most cases this is what folks want.

This is specifically not for AWS Batch since this broke on the AWS Batch scheduler due to the macro expansion not being correct. I believe this is being used on the Ray scheduler

This was originally reverted in #759

Thanks @d4l3k added a unit test.

And yes @kiukchung, this PR is not intended to fix the AWS Batch issue. I'm working with the Ray Scheduler and have found a need for the static rdzv_backend.

codecov · 2023-08-17T21:32:09Z

Codecov Report

Merging #761 (0a63c84) into main (3a253be) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #761      +/-   ##
==========================================
+ Coverage   92.78%   92.81%   +0.02%     
==========================================
  Files          96       96              
  Lines        6073     6084      +11     
==========================================
+ Hits         5635     5647      +12     
+ Misses        438      437       -1

Flag	Coverage Δ
unittests	`92.81% <100.00%> (+0.02%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed	Coverage Δ
torchx/components/dist.py	`98.48% <100.00%> (+0.07%)`	⬆️

... and 6 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

* add node_rank to torchrun cmd if rdzv_backend is 'static' * add unit test for static backend option

add node_rank to torchrun cmd if rdzv_backend is 'static'

502300b

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 17, 2023

d4l3k reviewed Aug 17, 2023

View reviewed changes

add unit test for static backend option

0a63c84

MichaelClifford force-pushed the node_rank_option branch from 3d32e73 to 0a63c84 Compare August 21, 2023 19:06

d4l3k approved these changes Aug 21, 2023

View reviewed changes

d4l3k merged commit b3c4133 into pytorch:main Aug 21, 2023
22 checks passed

KPostOffice pushed a commit to KPostOffice/torchx that referenced this pull request Sep 7, 2023

add node_rank to torchrun cmd if rdzv_backend is 'static' (pytorch#761)

9882f12

* add node_rank to torchrun cmd if rdzv_backend is 'static' * add unit test for static backend option

MichaelClifford added a commit to project-codeflare/torchx that referenced this pull request Sep 8, 2023

add node_rank to torchrun cmd if rdzv_backend is 'static' (pytorch#761)

7255b83

* add node_rank to torchrun cmd if rdzv_backend is 'static' * add unit test for static backend option

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add node_rank to torchrun cmd if rdzv_backend is 'static' #761

add node_rank to torchrun cmd if rdzv_backend is 'static' #761

MichaelClifford commented Aug 17, 2023

d4l3k Aug 17, 2023

kiukchung Aug 17, 2023 •

edited

Loading

d4l3k Aug 17, 2023

MichaelClifford Aug 21, 2023

MichaelClifford Aug 21, 2023

codecov bot commented Aug 17, 2023 •

edited

Loading

add node_rank to torchrun cmd if rdzv_backend is 'static' #761

add node_rank to torchrun cmd if rdzv_backend is 'static' #761

Conversation

MichaelClifford commented Aug 17, 2023

d4l3k Aug 17, 2023

Choose a reason for hiding this comment

kiukchung Aug 17, 2023 • edited Loading

Choose a reason for hiding this comment

d4l3k Aug 17, 2023

Choose a reason for hiding this comment

MichaelClifford Aug 21, 2023

Choose a reason for hiding this comment

MichaelClifford Aug 21, 2023

Choose a reason for hiding this comment

codecov bot commented Aug 17, 2023 • edited Loading

Codecov Report

kiukchung Aug 17, 2023 •

edited

Loading

codecov bot commented Aug 17, 2023 •

edited

Loading