✨ vGPU implementation #2272

puneetkatyal · 2023-08-22T22:51:04Z

Builds on the changes in [WIP] VGPU implementation #1579

What this PR does / why we need it:
Support adding vGPUs to VMs

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1972

k8s-ci-robot · 2023-08-22T22:51:13Z

Hi @puneetkatyal. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2023-08-22T22:51:19Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign neolit123 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

sbueringer · 2023-08-23T04:41:54Z

/ok-to-test

puneetkatyal · 2023-08-23T20:07:46Z

/retest

- Builds on the changes in kubernetes-sigs#1579

Co-authored-by: Geetika Batra <[email protected]>
Signed-off-by: Puneet Katyal <[email protected]>

chrischdi

I have some general questions:

- I think we should keep reconcilePCIDevices as the only path for normal PCI passthrough, especially because it also takes care of some condition handling.
- Question: instead of adding vGPU devices directly at clone time, should this also be done inside reconcilePCIDevices, since they are just slightly more special PCI devices?
- API conversion

Note: I'm not able to verify that all of this works because I currently don't have an environment with vGPUs available.

chrischdi · 2023-08-28T07:17:08Z

apis/v1alpha3/zz_generated.conversion.go

@@ -1685,6 +1685,7 @@ func autoConvert_v1beta1_VirtualMachineCloneSpec_To_v1alpha3_VirtualMachineClone
 	out.CustomVMXKeys = *(*map[string]string)(unsafe.Pointer(&in.CustomVMXKeys))
 	// WARNING: in.TagIDs requires manual conversion: does not exist in peer-type
 	// WARNING: in.PciDevices requires manual conversion: does not exist in peer-type
+	// WARNING: in.VGPUDevices requires manual conversion: does not exist in peer-type


I think we have to implement conversion for this.
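
One common way to deal with a "does not exist in peer-type" warning is an annotation round trip: serialize the newer object when converting down, then restore the newer-only fields when converting back up. The sketch below only illustrates that idea with simplified stand-in types. `HubSpec`, `SpokeSpec`, `VGPUSpec`, and the annotation key are all hypothetical and are not this provider's real types or helpers; the actual change would hook into the hand-written conversion functions next to the generated ones.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VGPUSpec is a hypothetical stand-in for a single vGPU device entry.
type VGPUSpec struct {
	ProfileName string `json:"profileName"`
}

// HubSpec models the newer API version, which has the vGPU field.
type HubSpec struct {
	Template    string     `json:"template"`
	VGPUDevices []VGPUSpec `json:"vgpuDevices,omitempty"`
}

// SpokeSpec models the older API version, which has no vGPU field.
// Annotations stands in for object metadata annotations.
type SpokeSpec struct {
	Template    string
	Annotations map[string]string
}

// annotationKey is a made-up key used to carry newer-only data.
const annotationKey = "example.invalid/conversion-data"

// ConvertDown copies the shared fields and stashes the full newer spec
// as JSON in an annotation so nothing is lost.
func ConvertDown(in HubSpec) (SpokeSpec, error) {
	raw, err := json.Marshal(in)
	if err != nil {
		return SpokeSpec{}, err
	}
	return SpokeSpec{
		Template:    in.Template,
		Annotations: map[string]string{annotationKey: string(raw)},
	}, nil
}

// ConvertUp copies the shared fields and, if the annotation is present,
// restores the fields that only exist in the newer version.
func ConvertUp(in SpokeSpec) (HubSpec, error) {
	out := HubSpec{Template: in.Template}
	raw, ok := in.Annotations[annotationKey]
	if !ok {
		return out, nil // nothing stashed: object originated in the old version
	}
	var saved HubSpec
	if err := json.Unmarshal([]byte(raw), &saved); err != nil {
		return HubSpec{}, err
	}
	out.VGPUDevices = saved.VGPUDevices
	return out, nil
}

func main() {
	hub := HubSpec{
		Template:    "ubuntu",
		VGPUDevices: []VGPUSpec{{ProfileName: "grid_v100d-4c"}},
	}
	spoke, err := ConvertDown(hub)
	if err != nil {
		panic(err)
	}
	back, err := ConvertUp(spoke)
	if err != nil {
		panic(err)
	}
	fmt.Println(back.VGPUDevices[0].ProfileName)
}
```

A round-trip test along these lines (newer to older and back) would help confirm that the vGPU entries survive conversion.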

chrischdi · 2023-08-28T09:59:22Z

docs/gpu-vgpu.md

+      template: '${VSPHERE_TEMPLATE}'
+      thumbprint: '${VSPHERE_TLS_THUMBPRINT}'
+      vgpuDevices:
+        - profileName: "grid_v100d-4c"    <============ value from above


To make it valid YAML:

Suggested change

-        - profileName: "grid_v100d-4c"    <============ value from above
+        - profileName: "grid_v100d-4c" # value from above!

chrischdi · 2023-08-28T10:01:56Z

docs/gpu-vgpu.md

+/Applications/Xcode.app/Contents/Developer/usr/bin/make generate-flavors FLAVOR_DIR=/Users/pkatyal/.cluster-api/overrides/infrastructure-vsphere/v0.0.0
+go run ./packaging/flavorgen --output-dir /Users/pkatyal/.cluster-api/overrides/infrastructure-vsphere/v0.0.0


Suggested change

/Applications/Xcode.app/Contents/Developer/usr/bin/make generate-flavors FLAVOR_DIR=/Users/pkatyal/.cluster-api/overrides/infrastructure-vsphere/v0.0.0

go run ./packaging/flavorgen --output-dir /Users/pkatyal/.cluster-api/overrides/infrastructure-vsphere/v0.0.0

Let's omit the stdout output.

chrischdi · 2023-08-28T10:04:42Z

docs/gpu-vgpu.md

+      template: '${VSPHERE_TEMPLATE}'
+      thumbprint: '${VSPHERE_TLS_THUMBPRINT}'
+      vgpuDevices:
+        - profileName: "grid_v100d-4c"    <============ value from above


Would it be worth having this as an envsubst parameter in some way?

WDYT: would it be worth having a separate flavor for vGPU?

The same NVIDIA GPU supports multiple vGPU profiles, and the matrix expands when you add more GPU varieties to the mix. For example, in my testing, I use different profiles for different worker nodes for the same workload cluster. I don't think it's useful to have this as an envsubst parameter.

chrischdi · 2023-08-28T10:07:27Z