-
Notifications
You must be signed in to change notification settings - Fork 292
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
✨ vGPU implementation #2272
✨ vGPU implementation #2272
Conversation
Hi @puneetkatyal. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/ok-to-test |
9dc76f8
to
5fabdad
Compare
/retest |
- Builds on the changes in kubernetes-sigs#1579 Co-authored-by: Geetika Batra <[email protected]> Signed-off-by: Puneet Katyal <[email protected]>
5fabdad
to
d3615dd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some general questions for me:
- I think we should prefer only keeping
reconcilePCIDevices
as the way for the normal PCI Passthrough. Especially because this also handles some condition handling. - Question: instead of adding vgpu devices directly at clone: should this maybe also be done inside
reconcilePCIDevices
? (because they are just a bit more special PCI devices?) - API conversion
Note: I'm not able to verify that all of this works because I currently don't have a environment including vgpu's available.
@@ -1685,6 +1685,7 @@ func autoConvert_v1beta1_VirtualMachineCloneSpec_To_v1alpha3_VirtualMachineClone | |||
out.CustomVMXKeys = *(*map[string]string)(unsafe.Pointer(&in.CustomVMXKeys)) | |||
// WARNING: in.TagIDs requires manual conversion: does not exist in peer-type | |||
// WARNING: in.PciDevices requires manual conversion: does not exist in peer-type | |||
// WARNING: in.VGPUDevices requires manual conversion: does not exist in peer-type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have to implement conversion for this.
template: '${VSPHERE_TEMPLATE}' | ||
thumbprint: '${VSPHERE_TLS_THUMBPRINT}' | ||
vgpuDevices: | ||
- profileName: "grid_v100d-4c" <============ value from above |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make it a valid yaml
- profileName: "grid_v100d-4c" <============ value from above |
- profileName: "grid_v100d-4c" # value from above!
/Applications/Xcode.app/Contents/Developer/usr/bin/make generate-flavors FLAVOR_DIR=/Users/pkatyal/.cluster-api/overrides/infrastructure-vsphere/v0.0.0 | ||
go run ./packaging/flavorgen --output-dir /Users/pkatyal/.cluster-api/overrides/infrastructure-vsphere/v0.0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/Applications/Xcode.app/Contents/Developer/usr/bin/make generate-flavors FLAVOR_DIR=/Users/pkatyal/.cluster-api/overrides/infrastructure-vsphere/v0.0.0 | |
go run ./packaging/flavorgen --output-dir /Users/pkatyal/.cluster-api/overrides/infrastructure-vsphere/v0.0.0 |
Let's omit the stdout output
template: '${VSPHERE_TEMPLATE}' | ||
thumbprint: '${VSPHERE_TLS_THUMBPRINT}' | ||
vgpuDevices: | ||
- profileName: "grid_v100d-4c" <============ value from above |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be worth having this as envsubst parameter in some way?
WDYT: would it be worth having a separate flavor for vgpu?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same NVIDIA GPU supports multiple vGPU profiles, and the matrix expands when you add more GPU varieties to the mix. For example, in my testing, I use different profiles for different worker nodes for the same workload cluster. I don't think it's useful to have this as an envsubst parameter.
@@ -453,3 +461,55 @@ func getNetworkSpecs(ctx *context.VMContext, devices object.VirtualDeviceList) ( | |||
|
|||
return deviceSpecs, nil | |||
} | |||
|
|||
func createPCIPassThroughDevice(deviceKey int32, backingInfo types.BaseVirtualDeviceBackingInfo) types.BaseVirtualDevice { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move this func to the very bottom of the file, because its a helper for the ones below.
Device: dynamicDirectPathDevice, | ||
Operation: types.VirtualDeviceConfigSpecOperationAdd, | ||
}) | ||
ctx.Logger.V(4).Info("created vGPU device", "vgpu-profile", vGPUDevice.ProfileName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have a smilar message for the gpu case?
return device | ||
} | ||
|
||
func getGpuSpecs(ctx *context.VMContext) []types.BaseVirtualDeviceConfigSpec { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we try to not use the *context.VMContext
parameter here and use more specific parameters instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also: this function seems to be a duplicate of https://github.com/chrischdi/cluster-api-provider-vsphere/blob/234e7751f7bb09bb438eed6b4feabbc274cae842/pkg/services/govmomi/pci/device.go#L59 ?
@@ -151,8 +151,16 @@ func Clone(ctx *context.VMContext, bootstrapData []byte, format bootstrapv1.Form | |||
|
|||
deviceSpecs = append(deviceSpecs, networkSpecs...) | |||
|
|||
if err != nil { | |||
return errors.Wrapf(err, "error getting network specs for %q", ctx) | |||
if len(ctx.VSphereVM.Spec.VirtualMachineCloneSpec.PciDevices) != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This adds a second codepath which adds PciDevices to a VM. The already existing code is at https://github.com/chrischdi/cluster-api-provider-vsphere/blob/234e7751f7bb09bb438eed6b4feabbc274cae842/pkg/services/govmomi/service.go#L478 .
I'd prefer only having one place to do this so either this here should get removed or the other place. And we have to take care to not break existing use cases.
Co-authored-by: Christian Schlotter <[email protected]>
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/close Given no changes in the last ~ 5 months. Feel free to reopen if you want to continue working on it. |
@sbueringer: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
What this PR does / why we need it:
Support adding vGPUs to VMs
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #1972