Add EFA NCCL test case, unmanaged nodegroup template #427

Merged
merged 1 commit into from
Feb 24, 2024

cartermckinnon
Member

Description of changes:

This adds a new option to the eksapi deployer, --efa, which will create an EFA-enabled unmanaged nodegroup.

It also adds a case to the nvidia e2e tests to perform an AllReduce operation using 2 workers on EFA-enabled nodes.
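For readers unfamiliar with the collective, here is a minimal sketch of what a sum AllReduce computes across 2 workers. It is a pure-Python simulation for illustration only: the actual test case runs over NCCL on EFA, and `all_reduce_sum` is a hypothetical name, not something from this PR.

```python
from typing import List


def all_reduce_sum(worker_buffers: List[List[float]]) -> List[List[float]]:
    """Simulate a sum AllReduce: every worker ends up with the element-wise sum.

    Hypothetical helper for illustration; the real test uses NCCL, not this code.
    """
    if not worker_buffers:
        return []
    length = len(worker_buffers[0])
    if any(len(buf) != length for buf in worker_buffers):
        raise ValueError("all workers must contribute buffers of the same length")
    # Reduce: element-wise sum over all workers' buffers.
    reduced = [sum(values) for values in zip(*worker_buffers)]
    # Broadcast: each worker receives its own copy of the reduced buffer.
    return [list(reduced) for _ in worker_buffers]


# Two workers, matching the 2-worker setup described above.
result = all_reduce_sum([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
print(result)
```

After the call, both workers hold the same buffer, which is the property the e2e case exercises over the network.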

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

containers:
- image: public.ecr.aws/w6p6i9i7/aws-efa-nccl-rdma:22.03-pt-py3
- image: TODO
Member Author

Temporary until there's a build pipeline set up for the image added in this PR.

- FI_LOG_LEVEL=warn
- -x
- FI_EFA_USE_DEVICE_RDMA=1
- -x
- OFI_NCCL_DISABLE_GDR_REQUIRED_CHECK=0
Member Author

This is important: it will fail the test case if something goes wrong with EFA RDMA, instead of falling back to the dirt-slow host buffer copy.
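As a rough sketch of how the alternating entries in the snippet above fit together: assuming the launcher is Open MPI's `mpirun`, each `-x NAME=VALUE` pair exports one environment variable to the launched processes. The helper name `mpirun_env_args` is hypothetical and not part of this PR; the variable values are copied from the snippet.

```python
from typing import Dict, List


def mpirun_env_args(env: Dict[str, str]) -> List[str]:
    """Expand environment variables into alternating "-x", "NAME=VALUE" entries.

    Assumes an Open MPI style launcher; hypothetical helper for illustration.
    """
    args: List[str] = []
    for name, value in env.items():
        args.extend(["-x", f"{name}={value}"])
    return args


# Values copied from the snippet above; dict insertion order is preserved.
efa_env = {
    "FI_LOG_LEVEL": "warn",
    "FI_EFA_USE_DEVICE_RDMA": "1",
    "OFI_NCCL_DISABLE_GDR_REQUIRED_CHECK": "0",
}
print(mpirun_env_args(efa_env))
```

Generating the list from a mapping keeps each flag next to its value, which makes a setting like `OFI_NCCL_DISABLE_GDR_REQUIRED_CHECK=0` harder to drop by accident.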

@@ -52,6 +52,7 @@ spec:
- p4de.24xlarge
- trn1.32xlarge
- trn1n.32xlarge
- p5.48xlarge
Contributor

@Issacwww Issacwww Feb 23, 2024

Do we have some control over it so that we can easily tune/stabilize the test?

Member Author

This is adding p5 types to the nodeSelector for the EFA device plugin; it's just a prereq for using p5s.
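To illustrate why the list entry is a prerequisite: a pod restricted to a list of instance types only lands on nodes whose instance-type label is in that list. The sketch below assumes the selector matches on the well-known `node.kubernetes.io/instance-type` label (the hunk above does not show the key), and `is_schedulable` is a hypothetical helper, not a Kubernetes API.

```python
from typing import Dict, Iterable

# Assumption: the selector keys on the well-known instance-type node label.
INSTANCE_TYPE_LABEL = "node.kubernetes.io/instance-type"

# Instance types visible in the hunk above (the full list in the manifest is longer).
ALLOWED_TYPES = ["p4de.24xlarge", "trn1.32xlarge", "trn1n.32xlarge", "p5.48xlarge"]


def is_schedulable(node_labels: Dict[str, str], allowed: Iterable[str]) -> bool:
    """Return True if the node's instance type is in the allowed list ("In" semantics)."""
    return node_labels.get(INSTANCE_TYPE_LABEL) in set(allowed)


p5_node = {INSTANCE_TYPE_LABEL: "p5.48xlarge"}
with_p5 = is_schedulable(p5_node, ALLOWED_TYPES)         # list includes the p5 entry
without_p5 = is_schedulable(p5_node, ALLOWED_TYPES[:-1])  # list before this change
print(with_p5, without_p5)
```

Without the new entry, the device plugin would not run on a p5 node under this assumption, so EFA devices would not be advertised there.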

@cartermckinnon cartermckinnon merged commit a2fee67 into main Feb 24, 2024
2 checks passed
@cartermckinnon cartermckinnon deleted the efa-nccl-test branch February 24, 2024 01:34
2 participants