
Nutanix GPU support implementation #8745

Merged · 3 commits · Oct 10, 2024
Conversation

@adiantum (Contributor)

Description of changes:
Implemented GPU support for Nutanix provider. Both vGPU and Passthrough modes are supported.

Testing (if applicable):

$ eksctl anywhere create cluster -f ./cluster-ntnx-gpu.yaml -v 10 --bundles-override bin/local-bundle-release.yaml
2024-09-11T14:12:05.046Z	V6	Executing command	{"cmd": "/usr/bin/docker version --format {{.Client.Version}}"}
2024-09-11T14:12:05.065Z	V6	Executing command	{"cmd": "/usr/bin/docker info --format '{{json .MemTotal}}'"}
2024-09-11T14:12:05.118Z	V4	Reading bundles manifest	{"url": "bin/local-bundle-release.yaml"}
2024-09-11T14:12:05.138Z	V4	Using CAPI provider versions	{"Core Cluster API": "v1.7.2+7b521fe", "Kubeadm Bootstrap": "v1.7.2+74bd9a3", "Kubeadm Control Plane": "v1.7.2+d29bc82", "External etcd Bootstrap": "v1.0.13+4d890d2", "External etcd Controller": "v1.0.22+a8279bb", "Cluster API Provider Nutanix": "v1.3.5+0f39da7"}
2024-09-11T14:12:05.370Z	V5	Retrier:	{"timeout": "2562047h47m16.854775807s", "backoffFactor": null}
2024-09-11T14:12:05.370Z	V2	Pulling docker image	{"image": "public.ecr.aws/l0g8r8j6/eks-anywhere-cli-tools:v0.20.4-eks-a-v0.21.0-dev-build.110"}
2024-09-11T14:12:05.370Z	V6	Executing command	{"cmd": "/usr/bin/docker pull public.ecr.aws/l0g8r8j6/eks-anywhere-cli-tools:v0.20.4-eks-a-v0.21.0-dev-build.110"}
2024-09-11T14:12:05.953Z	V5	Retry execution successful	{"retries": 1, "duration": "582.779276ms"}
2024-09-11T14:12:05.953Z	V3	Initializing long running container	{"name": "eksa_1726063925370486292", "image": "public.ecr.aws/l0g8r8j6/eks-anywhere-cli-tools:v0.20.4-eks-a-v0.21.0-dev-build.110"}
2024-09-11T14:12:05.953Z	V6	Executing command	{"cmd": "/usr/bin/docker run -d --name eksa_1726063925370486292 --network host -w /home/ubuntu/eksa-tests/gpus-feature -v /var/run/docker.sock:/var/run/docker.sock -v /home/ubuntu/eksa-tests/gpus-feature/eksa-ntnx-gpu:/home/ubuntu/eksa-tests/gpus-feature/eksa-ntnx-gpu -v /home/ubuntu/eksa-tests/gpus-feature:/home/ubuntu/eksa-tests/gpus-feature -v /home/ubuntu/eksa-tests/gpus-feature:/home/ubuntu/eksa-tests/gpus-feature --entrypoint sleep public.ecr.aws/l0g8r8j6/eks-anywhere-cli-tools:v0.20.4-eks-a-v0.21.0-dev-build.110 infinity"}
2024-09-11T14:12:06.119Z	V1	Using the eksa controller to create the management cluster
2024-09-11T14:12:06.119Z	V4	Task start	{"task_name": "setup-validate"}
2024-09-11T14:12:06.119Z	V0	Performing setup and validations
2024-09-11T14:12:06.119Z	V0	ValidateClusterSpec for Nutanix datacenter	{"NutanixDatacenter": "eksa-ntnx-gpu"}
2024-09-11T14:12:15.144Z	V0	✅ Nutanix Provider setup is valid
2024-09-11T14:12:15.144Z	V0	✅ Validate OS is compatible with registry mirror configuration
2024-09-11T14:12:15.144Z	V0	✅ Validate certificate for registry mirror
2024-09-11T14:12:15.144Z	V0	✅ Validate authentication for git provider
2024-09-11T14:12:15.144Z	V0	✅ Validate cluster's eksaVersion matches EKS-A version
2024-09-11T14:12:15.144Z	V4	Task finished	{"task_name": "setup-validate", "duration": "9.025697406s"}
2024-09-11T14:12:15.144Z	V4	----------------------------------
2024-09-11T14:12:15.144Z	V4	Task start	{"task_name": "bootstrap-cluster-init"}
2024-09-11T14:12:15.144Z	V0	Creating new bootstrap cluster
...
2024-09-11T14:23:44.708Z	V6	Executing command	{"cmd": "/usr/bin/docker exec -i eksa_1726063925370486292 kubectl get clusters.cluster.x-k8s.io -o json --kubeconfig eksa-ntnx-gpu/generated/eksa-ntnx-gpu.kind.kubeconfig --namespace eksa-system"}
2024-09-11T14:23:44.867Z	V5	Retry execution successful	{"retries": 1, "duration": "158.96818ms"}
2024-09-11T14:23:44.867Z	V5	Retrier:	{"timeout": "2562047h47m16.854775807s", "backoffFactor": null}
2024-09-11T14:23:44.867Z	V4	Deleting kind cluster	{"name": "eksa-ntnx-gpu-eks-a-cluster"}
2024-09-11T14:23:44.867Z	V6	Executing command	{"cmd": "/usr/bin/docker exec -i eksa_1726063925370486292 kind delete cluster --name eksa-ntnx-gpu-eks-a-cluster"}
2024-09-11T14:23:45.957Z	V5	Retry execution successful	{"retries": 1, "duration": "1.089860832s"}
2024-09-11T14:23:45.957Z	V0	🎉 Cluster created!
2024-09-11T14:23:45.957Z	V4	Task finished	{"task_name": "delete-kind-cluster", "duration": "1.534960627s"}
...

$ kubectl apply -f ./cuda-vectoradd.yaml
pod/cuda-vectoradd created

$ kubectl logs pod/cuda-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
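For reference, the `cuda-vectoradd.yaml` applied above is not included in this PR; the following is a minimal sketch of what such a test pod typically looks like, based on NVIDIA's standard CUDA vector-add sample. The image tag and GPU resource request are assumptions, not the exact file used in this test.

```yaml
# Hypothetical sketch of cuda-vectoradd.yaml (image tag and resource
# request are assumptions based on NVIDIA's public CUDA sample images).
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1   # requires the NVIDIA device plugin on the node
```

The `nvidia.com/gpu` limit is what causes the scheduler to place the pod on a GPU-enabled node; the "Test PASSED" output above indicates the container could reach the device.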
2024-09-11T14:51:49.291Z	V6	Executing command	{"cmd": "/usr/bin/docker version --format {{.Client.Version}}"}
2024-09-11T14:51:49.310Z	V6	Executing command	{"cmd": "/usr/bin/docker info --format '{{json .MemTotal}}'"}
2024-09-11T14:51:49.355Z	V4	Reading bundles manifest	{"url": "bin/local-bundle-release.yaml"}
2024-09-11T14:51:49.373Z	V4	Using CAPI provider versions	{"Core Cluster API": "v1.7.2+7b521fe", "Kubeadm Bootstrap": "v1.7.2+74bd9a3", "Kubeadm Control Plane": "v1.7.2+d29bc82", "External etcd Bootstrap": "v1.0.13+4d890d2", "External etcd Controller": "v1.0.22+a8279bb", "Cluster API Provider Nutanix": "v1.3.5+0f39da7"}
2024-09-11T14:51:49.601Z	V5	Retrier:	{"timeout": "2562047h47m16.854775807s", "backoffFactor": null}
2024-09-11T14:51:49.601Z	V2	Pulling docker image	{"image": "public.ecr.aws/l0g8r8j6/eks-anywhere-cli-tools:v0.20.4-eks-a-v0.21.0-dev-build.110"}
2024-09-11T14:51:49.601Z	V6	Executing command	{"cmd": "/usr/bin/docker pull public.ecr.aws/l0g8r8j6/eks-anywhere-cli-tools:v0.20.4-eks-a-v0.21.0-dev-build.110"}
2024-09-11T14:51:50.292Z	V5	Retry execution successful	{"retries": 1, "duration": "691.054922ms"}
2024-09-11T14:51:50.292Z	V3	Initializing long running container	{"name": "eksa_1726066309601289865", "image": "public.ecr.aws/l0g8r8j6/eks-anywhere-cli-tools:v0.20.4-eks-a-v0.21.0-dev-build.110"}
2024-09-11T14:51:50.292Z	V6	Executing command	{"cmd": "/usr/bin/docker run -d --name eksa_1726066309601289865 --network host -w /home/ubuntu/eksa-tests/gpus-feature -v /var/run/docker.sock:/var/run/docker.sock -v /home/ubuntu/eksa-tests/gpus-feature/eksa-ntnx-gpu:/home/ubuntu/eksa-tests/gpus-feature/eksa-ntnx-gpu -v /home/ubuntu/eksa-tests/gpus-feature:/home/ubuntu/eksa-tests/gpus-feature -v /home/ubuntu/eksa-tests/gpus-feature:/home/ubuntu/eksa-tests/gpus-feature --entrypoint sleep public.ecr.aws/l0g8r8j6/eks-anywhere-cli-tools:v0.20.4-eks-a-v0.21.0-dev-build.110 infinity"}
2024-09-11T14:51:50.468Z	V4	Task start	{"task_name": "setup-validate-create"}
2024-09-11T14:51:50.468Z	V0	ValidateClusterSpec for Nutanix datacenter	{"NutanixDatacenter": "eksa-wrk-ntnx-gpu"}
2024-09-11T14:51:59.497Z	V0	✅ Workload cluster's nutanix Provider setup is valid
2024-09-11T14:51:59.497Z	V0	✅ Validate OS is compatible with registry mirror configuration
2024-09-11T14:51:59.497Z	V0	✅ Validate certificate for registry mirror
2024-09-11T14:51:59.497Z	V0	✅ Validate authentication for git provider
2024-09-11T14:51:59.497Z	V0	✅ Validate cluster's eksaVersion matches EKS-A version
2024-09-11T14:51:59.497Z	V6	Executing command	{"cmd": "/usr/bin/docker exec -i eksa_1726066309601289865 kubectl get clusters.cluster.x-k8s.io -o json --kubeconfig ./eksa-ntnx-gpu/eksa-ntnx-gpu-eks-a-cluster.kubeconfig --namespace eksa-system"}
2024-09-11T14:51:59.637Z	V0	✅ Validate cluster name
2024-09-11T14:51:59.637Z	V0	✅ Validate gitops
2024-09-11T14:51:59.637Z	V5	skipping ValidateIdentityProviderNameIsUnique
2024-09-11T14:51:59.637Z	V0	✅ Validate identity providers' name
2024-09-11T14:51:59.637Z	V6	Executing command	{"cmd": "/usr/bin/docker exec -i eksa_1726066309601289865 kubectl get customresourcedefinition clusters.cluster.x-k8s.io --kubeconfig ./eksa-ntnx-gpu/eksa-ntnx-gpu-eks-a-cluster.kubeconfig"}
2024-09-11T14:51:59.763Z	V6	Executing command	{"cmd": "/usr/bin/docker exec -i eksa_1726066309601289865 kubectl get customresourcedefinition clusters.anywhere.eks.amazonaws.com --kubeconfig ./eksa-ntnx-gpu/eksa-ntnx-gpu-eks-a-cluster.kubeconfig"}
2024-09-11T14:51:59.908Z	V0	✅ Validate management cluster has eksa crds
2024-09-11T14:51:59.908Z	V6	Executing command	{"cmd": "/usr/bin/docker exec -i eksa_1726066309601289865 kubectl get clusters.anywhere.eks.amazonaws.com -A -o jsonpath={.items[0]} --kubeconfig ./eksa-ntnx-gpu/eksa-ntnx-gpu-eks-a-cluster.kubeconfig --field-selector=metadata.name=eksa-ntnx-gpu"}
2024-09-11T14:52:00.075Z	V0	✅ Validate management cluster name is valid
2024-09-11T14:52:00.075Z	V6	Executing command	{"cmd": "/usr/bin/docker exec -i eksa_1726066309601289865 kubectl get clusters.anywhere.eks.amazonaws.com -A -o jsonpath={.items[0]} --kubeconfig ./eksa-ntnx-gpu/eksa-ntnx-gpu-eks-a-cluster.kubeconfig --field-selector=metadata.name=eksa-ntnx-gpu"}
2024-09-11T14:52:00.214Z	V0	✅ Validate management cluster eksaVersion compatibility
2024-09-11T14:52:00.214Z	V6	Executing command	{"cmd": "/usr/bin/docker exec -i eksa_1726066309601289865 kubectl get --ignore-not-found -o json --kubeconfig ./eksa-ntnx-gpu/eksa-ntnx-gpu-eks-a-cluster.kubeconfig EKSARelease.v1alpha1.anywhere.eks.amazonaws.com --namespace eksa-system eksa-v0-0-0"}
2024-09-11T14:52:00.339Z	V0	✅ Validate eksa release components exist on management cluster
2024-09-11T14:52:00.339Z	V4	Task finished	{"task_name": "setup-validate-create", "duration": "9.87119141s"}
2024-09-11T14:52:00.339Z	V4	----------------------------------
2024-09-11T14:52:00.339Z	V4	Task start	{"task_name": "create-workload-cluster"}
2024-09-11T14:52:00.339Z	V0	Creating workload cluster
2024-09-11T14:52:00.339Z	V3	Applying cluster spec
...
2024-09-11T14:59:31.213Z	V4	----------------------------------
2024-09-11T14:59:31.213Z	V4	Task start	{"task_name": "write-cluster-config"}
2024-09-11T14:59:31.213Z	V0	Writing cluster config file
2024-09-11T14:59:31.216Z	V0	🎉 Cluster created!
2024-09-11T14:59:31.216Z	V4	Task finished	{"task_name": "write-cluster-config", "duration": "2.839671ms"}
2024-09-11T14:59:31.216Z	V4	----------------------------------
2024-09-11T14:59:31.216Z	V4	Tasks completed	{"duration": "7m40.748367154s"}
2024-09-11T14:59:31.216Z	V3	Cleaning up long running container	{"name": "eksa_1726066309601289865"}
2024-09-11T14:59:31.216Z	V6	Executing command	{"cmd": "/usr/bin/docker rm -f -v eksa_1726066309601289865"}

$ kubectl apply -f ./cuda-vectoradd.yaml
pod/cuda-vectoradd configured

$ kubectl logs pod/cuda-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Documentation added/planned (if applicable):
Planned docs: GPU support for Nutanix clusters

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@eks-distro-bot (Collaborator)

Hi @adiantum. Thanks for your PR.

I'm waiting for an aws member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@eks-distro-bot added the needs-ok-to-test and size/XXL (denotes a PR that changes 1000+ lines, ignoring generated files) labels on Sep 11, 2024
@abhinavmpandey08 (Member)

/ok-to-test

@abhinavmpandey08 (Member)

Can you add an example of what the cluster config will look like with the GPUs configured?


codecov bot commented Sep 11, 2024

Codecov Report

Attention: Patch coverage is 92.73743% with 13 lines in your changes missing coverage. Please review.

Project coverage is 73.74%. Comparing base (cf665ed) to head (d4bcd55).
Report is 3 commits behind head on main.

Files with missing lines             | Patch % | Lines
pkg/providers/nutanix/validator.go   | 93.71%  | 7 Missing and 4 partials ⚠️
pkg/api/v1alpha1/nutanixmachineconfig.go | 0.00% | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8745      +/-   ##
==========================================
+ Coverage   73.66%   73.74%   +0.08%     
==========================================
  Files         578      578              
  Lines       36618    36788     +170     
==========================================
+ Hits        26973    27130     +157     
- Misses       7919     7928       +9     
- Partials     1726     1730       +4     


@adiantum (Contributor, Author)

Can you add an example of what the cluster config will look like with the GPUs configured?

Sure, I have it in tests:
https://github.com/aws/eks-anywhere/pull/8745/files#diff-8837b0b2c467097c587ca47d1d535adfec94fc5c306a3526a9acfccd210bba9eR64-R68

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: NutanixMachineConfig
metadata:
  name: eksa-unit-test
  namespace: default
spec:
  vcpusPerSocket: 1
  vcpuSockets: 4
  memorySize: 8Gi
  ...
  gpus:
  - type:     deviceID
    deviceID: 8757
  - type:     name
    name:     "Ampere 40"
  systemDiskSize: 40Gi
  osFamily: "ubuntu"
...
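Once a node built from a machine config with a `gpus` entry joins the cluster, a quick way to confirm the device is visible to Kubernetes is to inspect the node's allocatable resources. This assumes the NVIDIA device plugin is deployed; `nvidia.com/gpu` is the standard resource name it registers.

```shell
# List each node's allocatable GPU count. Dots in the resource name must be
# escaped inside the custom-columns expression.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'

# Or inspect a single node directly (substitute the actual node name):
kubectl describe node <node-name> | grep 'nvidia.com/gpu'
```

A non-empty GPU column confirms that passthrough or vGPU assignment reached the guest and that the device plugin registered it with the kubelet.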

@abhinavmpandey08 (Member)

/lgtm
/approve

@eks-distro-bot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavmpandey08

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@eks-distro-bot eks-distro-bot merged commit 4d1408c into aws:main Oct 10, 2024
12 checks passed
Labels
approved lgtm ok-to-test size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
3 participants