Fix: Distributed Training Rendezvous error with MCAD v1.34.1 #793

Merged: 2 commits merged into pytorch:main on Nov 20, 2023

@Sara-KS (Contributor) commented Nov 16, 2023

Some releases of MCAD (starting with v1.34.1) autogenerate pod labels incorrectly, adding the old appwrapper identifier labels. The Kubernetes Service deployed by TorchX-MCAD relied on that label being correct, so distributed jobs failed with rendezvous errors. This PR moves the generated Kubernetes Service selector to the standard app.kubernetes.io/instance label, so that it works with any version of MCAD that may be deployed.
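
For readers unfamiliar with how a Service selector picks up pods, here is a minimal sketch of the idea. It is not the code from this PR: `build_service`, `selector_matches`, the app name, the `example.com/legacy-appwrapper-id` label key, and the port value are all hypothetical and used only for illustration. The only parts taken from the description above are the app.kubernetes.io/instance label and the fact that the Service selects pods by label. Kubernetes equality-based selectors match a pod when every key/value pair in the selector is present on the pod, so extra labels added by MCAD do not break the match.

```python
from typing import Any, Dict

# Standard Kubernetes recommended label, as named in the PR description.
INSTANCE_LABEL = "app.kubernetes.io/instance"


def build_service(app_name: str, namespace: str, port: int) -> Dict[str, Any]:
    """Return a Service manifest (as a plain dict) that selects pods by
    the instance label. Hypothetical helper, not the TorchX implementation."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": app_name, "namespace": namespace},
        "spec": {
            "selector": {INSTANCE_LABEL: app_name},
            "ports": [{"port": port, "targetPort": port, "protocol": "TCP"}],
        },
    }


def selector_matches(selector: Dict[str, str], pod_labels: Dict[str, str]) -> bool:
    """Equality-based matching: every selector pair must appear on the pod."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


# Example values below are assumptions chosen for illustration only.
service = build_service("trainer-abc123", "default", 29500)
pod_labels = {
    INSTANCE_LABEL: "trainer-abc123",
    "example.com/legacy-appwrapper-id": "trainer-abc123",  # hypothetical extra label
}
print(selector_matches(service["spec"]["selector"], pod_labels))  # prints True
```

Because the selector keys on a single standard label, a pod that carries additional autogenerated labels still matches, while a pod missing the instance label does not.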

Test plan:
Updated the corresponding TorchX-MCAD tests.

@facebook-github-bot added the CLA Signed label Nov 16, 2023
@kiukchung (Collaborator) commented

LGTM thanks

@kiukchung merged commit c3868c1 into pytorch:main on Nov 20, 2023
22 checks passed
@Sara-KS deleted the mcad-label-fix branch on November 20, 2023 21:58