add node_rank to torchrun cmd if rdzv_backend is 'static' #761

Merged
2 commits merged into pytorch:main on Aug 21, 2023

Conversation

MichaelClifford (Contributor) commented:

Related: #751, #759

This PR adds the torchrun parameter node_rank to the cmd creation of ddp(). This parameter is needed if a user wants to use the static rdzv_backend instead of c10d.

The additional parameter is only added when rdzv_backend is set to static. For more details on why it has been implemented this way, please see #759.
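
As an illustration (not part of the PR itself), here is a minimal sketch of how a user might exercise this through the component API; the script name and topology are hypothetical, and only the rdzv_backend handling is what this PR changes:

```python
# Hypothetical usage sketch: build the ddp component with the static rendezvous
# backend and inspect the generated torchrun arguments.
from torchx.components.dist import ddp

app = ddp(
    script="train.py",        # hypothetical training script
    j="2x1",                  # 2 nodes x 1 process per node
    rdzv_backend="static",    # with this PR, --node_rank is added to the torchrun cmd
)

# The generated arguments should now include --node_rank bound to the replica-id macro.
print(" ".join(app.roles[0].args))
```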

Test plan:
This has been tested locally against dist_test.py with all 10 tests passing.

@facebook-github-bot added the CLA Signed label on Aug 17, 2023.
# TODO 'node_rank' is made optional as it currently does not work with the AWS Batch scheduler.
# node_rank is only used when rdzv_backend is 'static'
if rdzv_backend == "static":
    cmd += ["--node_rank", f"{macros.replica_id}"]

@d4l3k (Contributor) commented:

can you add a unit test for rdzv_backend=="static"
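
(For context, a rough sketch of what such a test could look like; the actual test added in this PR lives in dist_test.py and may differ in structure and naming.)

```python
# Hypothetical test sketch, not the committed test: assert that --node_rank is
# emitted only when the static rendezvous backend is requested.
from torchx.components.dist import ddp

def test_static_rdzv_backend_adds_node_rank() -> None:
    args = " ".join(ddp(script="foo.py", rdzv_backend="static").roles[0].args)
    assert "--node_rank" in args

def test_c10d_rdzv_backend_omits_node_rank() -> None:
    args = " ".join(ddp(script="foo.py", rdzv_backend="c10d").roles[0].args)
    assert "--node_rank" not in args
```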

@kiukchung (Collaborator) commented on Aug 17, 2023:

Is this change specifically for the AWS Batch use-case? And if so, is the intention to make it easier for users to find rank0's log stream by ensuring that the AWS Batch assigned node0 always runs ddp's rank0?

If so, using the static rdzv_backend on AWS Batch won't work as intended, because the static rdzv_backend was added to ensure backwards compatibility with torch.distributed.launch in torch versions earlier than 1.9. What the static rdzv_backend does NOT do is "wait" for full rank before forking the actual ddp processes. Since AWS Batch is not a true gang scheduler, the way it creates MNP (multi-node-parallel) jobs is to first provision node0, then add nodes one by one until the job has the requested number of nodes. This process can take up to 45 minutes (depending on the availability of the instance types, the size of the docker image, etc.), so it forces you to set the process group timeout in dist.init_process_group() to something larger than 45 minutes. Since the default process group is also used to time out collective calls, GPU/node failures then take up to 45 minutes to detect, which is probably not what you want.
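
(As a concrete illustration of the timeout trade-off described above; the value is illustrative, and the snippet assumes the usual RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT environment variables set by torchrun.)

```python
# Illustrative only: widening the process-group timeout so rendezvous survives
# AWS Batch's slow, one-by-one node provisioning. The same timeout then also
# governs how long failed collective calls take to surface.
from datetime import timedelta
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",                 # or "gloo" on CPU-only nodes
    timeout=timedelta(minutes=60),  # must exceed the worst-case ~45 min provisioning window
)
```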

A better solution is to just pin rank0 to AWS Batch's node0, as done in this sample PR, and still use rdzv_backend="c10d". This, however, has the drawback that ONLY rank0 is pinned to node0, but in most cases that is what folks want.

@d4l3k (Contributor) commented:

This is specifically not for AWS Batch, since this broke on the AWS Batch scheduler due to the macro expansion not being correct. I believe this is being used on the Ray scheduler.

This was originally reverted in #759

MichaelClifford (Contributor, Author) commented:

Thanks @d4l3k, added a unit test.

MichaelClifford (Contributor, Author) commented:

And yes @kiukchung, this PR is not intended to fix the AWS Batch issue. I'm working with the Ray Scheduler and have found a need for the static rdzv_backend.

@codecov bot commented on Aug 17, 2023:

Codecov Report

Merging #761 (0a63c84) into main (3a253be) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #761      +/-   ##
==========================================
+ Coverage   92.78%   92.81%   +0.02%     
==========================================
  Files          96       96              
  Lines        6073     6084      +11     
==========================================
+ Hits         5635     5647      +12     
+ Misses        438      437       -1     
Flag        Coverage Δ
unittests   92.81% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.

Files Changed                Coverage Δ
torchx/components/dist.py    98.48% <100.00%> (+0.07%) ⬆️

... and 6 files with indirect coverage changes


@d4l3k merged commit b3c4133 into pytorch:main on Aug 21, 2023.
22 checks passed
KPostOffice pushed a commit to KPostOffice/torchx that referenced this pull request on Sep 7, 2023:

* add node_rank to torchrun cmd if rdzv_backend is 'static'
* add unit test for static backend option

MichaelClifford added a commit to project-codeflare/torchx that referenced this pull request on Sep 8, 2023:

* add node_rank to torchrun cmd if rdzv_backend is 'static'
* add unit test for static backend option
Labels: CLA Signed
4 participants