fixing k8s node types for trn1 instance types #792

Merged
1 commit merged on Nov 15, 2023

Conversation

clumsy
Contributor

@clumsy clumsy commented Nov 15, 2023

Now that we use K8S_ITYPE as the instance_type of multi-node jobs for AWS resources, its value has to match the official instance type label (e.g. trn1.32xlarge). Otherwise aws_batch_scheduler fails with the following error:

botocore.errorfactory.ClientException: An error occurred (ClientException) when calling the RegisterJobDefinition operation: Error executing request, Exception : Instance type can only be one of [m6g.xlarge, ..., c7g.large, trn1.2xlarge, c7g.12xlarge, ..., trn1.32xlarge, ..., g4dn.4xlarge, trn1n.32xlarge, ..., r7g.medium].

I'm intentionally not keeping aliases for the previous labels, e.g. aws_trn1.32xl, because torchx suggests the new one.
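
For illustration, the constraint can be sketched as a small pre-submission check. This is a minimal sketch, not the torchx implementation: `check_instance_type` and `ALLOWED_ITYPES` are hypothetical names, the value of `K8S_ITYPE` is assumed to be the well-known Kubernetes node label key, and the allowed set contains only the instance types visible in the (truncated) error message above.

```python
# Assumption: K8S_ITYPE holds the well-known Kubernetes node label key.
K8S_ITYPE = "node.kubernetes.io/instance-type"

# Only the instance types visible in the truncated error message above;
# the real list accepted by AWS Batch is longer.
ALLOWED_ITYPES = {
    "m6g.xlarge", "c7g.large", "trn1.2xlarge", "c7g.12xlarge",
    "trn1.32xlarge", "g4dn.4xlarge", "trn1n.32xlarge", "r7g.medium",
}


def check_instance_type(capabilities: dict) -> str:
    """Return the instance type stored under K8S_ITYPE, or raise if it
    is not one of the official names (hypothetical helper)."""
    itype = capabilities.get(K8S_ITYPE)
    if itype not in ALLOWED_ITYPES:
        raise ValueError(f"unsupported instance type: {itype!r}")
    return itype


# The official name passes the check.
print(check_instance_type({K8S_ITYPE: "trn1.32xlarge"}))
```

An abbreviated value such as `aws_trn1.32xl` would raise `ValueError` here, mirroring the `ClientException` that RegisterJobDefinition returns.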

Test plan:
Fixed unit tests

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 15, 2023
@clumsy
Contributor Author

clumsy commented Nov 15, 2023

This is a regression, I'm afraid @kiukchung

@kiukchung
Collaborator

LGTM thanks

@kiukchung kiukchung merged commit 8048ef3 into pytorch:main Nov 15, 2023
22 checks passed
@clumsy clumsy deleted the fix/trn1_k8s_node_type branch November 15, 2023 20:43
4 participants