
WIP: ✨ Flexible Nova microversions #1567

Draft: lentzi90 wants to merge 1 commit into main from lentzi90/flexible-nova-version
Conversation

@lentzi90 (Contributor) commented May 29, 2023

This is related to the proposal in #1565 and is just meant as a discussion starter for now. If you have any comments, ideas or suggestions, please feel free to comment here or in #1565.

What this PR does / why we need it:

Use a list of supported versions instead of a single hard coded one. The list can be filtered based on specific feature requirements (e.g. usage of server tags). The version to use is then picked from the intersection of this list and what the server supports.
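
As a rough illustration of the intersection idea (a minimal sketch, not the PR's actual code; the type and helper names are made up):

package main

import "fmt"

// Version mirrors the candidate-list shape discussed in this PR: an ID like
// "2.60", a Priority (higher wins), and an endpoint Suffix.
type Version struct {
	ID       string
	Priority int
	Suffix   string
}

// Candidate versions CAPO can work with, in order of preference.
var novaSupportedVersions = []*Version{
	{ID: "2.1", Priority: 10, Suffix: "/v2.1/"},
	{ID: "2.53", Priority: 20, Suffix: "/v2.53/"},
	{ID: "2.60", Priority: 30, Suffix: "/v2.60/"},
}

// pickVersion returns the highest-priority candidate that the server also
// advertises, or an error if the two sets do not intersect.
func pickVersion(candidates []*Version, serverSupports map[string]bool) (*Version, error) {
	var best *Version
	for _, v := range candidates {
		if serverSupports[v.ID] && (best == nil || v.Priority > best.Priority) {
			best = v
		}
	}
	if best == nil {
		return nil, fmt.Errorf("no mutually supported compute API version")
	}
	return best, nil
}

func main() {
	// A server that only supports up to 2.53 negotiates down gracefully.
	v, err := pickVersion(novaSupportedVersions, map[string]bool{"2.1": true, "2.53": true})
	if err != nil {
		panic(err)
	}
	fmt.Println("negotiated", v.ID) // prints: negotiated 2.53
}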

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1448

Special notes for your reviewer:

  1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • if necessary:
    • includes documentation
    • adds unit tests

/hold

@k8s-ci-robot (Contributor) commented:

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot added the do-not-merge/work-in-progress, do-not-merge/hold, and cncf-cla: yes labels on May 29, 2023
@netlify bot commented May 29, 2023

Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!

🔨 Latest commit: 3ab8a84
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-cluster-api-openstack/deploys/671b608060318d000834abd1
😎 Deploy Preview: https://deploy-preview-1567--kubernetes-sigs-cluster-api-openstack.netlify.app

@k8s-ci-robot added the size/M and approved labels on May 29, 2023
for i := range recognizedMicroversions {
	if recognizedMicroversions[i].ID == "2.1" {
		continue
	}
Contributor:

What is the special handling of 2.1 here? I think we have a default (2.1) and then choose a best fit (2.53, 2.60, ...). I just don't understand the special case.

lentzi90 (author):

Good question! I'm not sure if this is how we should do it, but I can explain what I'm trying to do here.
Before we set the minimum version, we didn't set any version at all. If I understand correctly, that means the same thing as setting 2.1.

If no version is specified then the API will behave as if a version request of v2.1 was requested.

From https://docs.openstack.org/nova/latest/reference/api-microversion-history.html

So what I'm trying to do here is to allow fallback to the default when possible. If tags are used, we require at least 2.53, so I remove 2.1 from the versions it can pick from. A better way would probably be to convert the versions to numbers and remove anything below 2.53, but for now I'm just experimenting.
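
For example, a sketch of that filtering idea, comparing the minor component as an integer instead of special-casing "2.1" (helper names are hypothetical; Version is the same shape as in the sketch near the top of this page):

import (
	"fmt"
	"strconv"
	"strings"
)

// minorOf extracts the minor component of a microversion, e.g. "2.53" -> 53.
func minorOf(id string) (int, error) {
	parts := strings.SplitN(id, ".", 2)
	if len(parts) != 2 {
		return 0, fmt.Errorf("unexpected microversion %q", id)
	}
	return strconv.Atoi(parts[1])
}

// atLeast drops candidates below the required minimum, e.g. atLeast(vs, 53)
// when server tags are requested.
func atLeast(candidates []*Version, min int) []*Version {
	var out []*Version
	for _, v := range candidates {
		if minor, err := minorOf(v.ID); err == nil && minor >= min {
			out = append(out, v)
		}
	}
	return out
}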

Contributor:

Incidentally, however we do it, I'd probably go with ints rather than floats and drop the leading '2.'.

I wonder if this could be something like:

compute := s.getComputeClient()
if len(instanceSpec.Tags) > 0 {
  err := compute.RequireMicroVersion(NovaTagging)
  if err != nil {
    ...tagging not supported...
  }
}

We could have similar for neutron extensions.
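
For what it's worth, such a RequireMicroVersion helper might look roughly like this, assuming the client caches the server's advertised maximum microversion when it is created (all names here are hypothetical):

import "fmt"

// NovaTagging is the minimum microversion minor needed for server tags.
const NovaTagging = 53

type computeClient struct {
	// minor is the microversion this client currently sends with requests.
	minor int
	// maxMinor is the server's advertised maximum microversion minor
	// component, fetched once when the client is created.
	maxMinor int
}

// RequireMicroVersion returns an error if the server cannot satisfy the
// requested microversion, so callers can degrade gracefully or fail early.
func (c *computeClient) RequireMicroVersion(minor int) error {
	if minor > c.maxMinor {
		return fmt.Errorf("compute API microversion 2.%d required, but server supports only up to 2.%d", minor, c.maxMinor)
	}
	return nil
}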

Thoughts:

  • how can we inform the user up front, if their configuration requires more than their cloud supports?
  • If we can't do it up front, how do we make the best UX we can in the failure case?
  • How do we do the 'we need the multi-attach microversion' thing, because we can't detect that from k8s objects?

On the latter point: we could pre-fetch the specified volume type, or infer the default volume type (I don't know if this is possible) and check whether multi-attach is set. Or we could react to a failure by retrying with a higher microversion, but the granularity of OpenStack error codes would likely make this messy.
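
For the pre-fetch variant, something like the following could work with gophercloud v1's blockstorage v3 volumetypes package, assuming the deployment exposes multi-attach via the volume type's extra specs (the exact spec key and value format are assumptions here):

import (
	"strings"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/blockstorage/v3/volumetypes"
)

// volumeTypeNeedsMultiattach fetches the given volume type and checks its
// extra specs for the multiattach flag (commonly "<is> True").
func volumeTypeNeedsMultiattach(client *gophercloud.ServiceClient, typeID string) (bool, error) {
	vt, err := volumetypes.Get(client, typeID).Extract()
	if err != nil {
		return false, err
	}
	return strings.Contains(vt.ExtraSpecs["multiattach"], "True"), nil
}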

lentzi90 (author):

how can we inform the user up front, if their configuration requires more than their cloud supports?

I think up front is impossible here. We don't do any API calls in the webhooks, and for good reason, so unfortunately I don't see how we could do it.

If we can't do it up front, how do we make the best UX we can in the failure case?

Probably the best we can do here is to bubble up the error to the CAPI level where it will be visible on the Machine or the Cluster.

How do we do the 'we need the multi-attach microversion' thing, because we can't detect that from k8s objects?
On the latter point: we could pre-fetch the specified volume-type, or infer the default volume type (don't know if this is possible) and check if multi-attach is set. Or we could react to a failure by trying again with a higher microversion, but the granularity of OpenStack error code likely makes this messy.

Now we are getting to the core of the issue! I think the reasonable thing to do is to keep a list of supported versions with higher priority for higher versions (as seen in this PR). That way the highest supported version would be picked (I hope), which should support as many features as possible. My thinking is that this would solve the issue without complicating the code with extra checks, errors and retries.

However, it builds on the assumption that picking the highest supported version is always best and I'm not completely sure if this is true. Could there be a case similar to the multi-attach volume situation, where OpenStack is configured in a way that a microversion is in theory supported, but in practice a lower version is needed because of specific-volume-type-or-similar set up by the admins? I want to think the answer is no and that it would be safe to always pick the highest available versions.

var NovaSupportedVersions = []*utils.Version{
	{ID: "2.1", Priority: 10, Suffix: "/v2.1/"},   // Initial microversion release, same as specifying no version
	{ID: "2.53", Priority: 20, Suffix: "/v2.53/"}, // Maximum in Pike
	{ID: "2.60", Priority: 30, Suffix: "/v2.60/"}, // Maximum in Queens
Contributor:

We forced 2.60 in commit 2df3778, so I think some specific calls should use that version, but I don't see such a negotiation in the following code. Can you point me to the logic that chooses 2.60 after this change? Thanks.

lentzi90 (author):

The version is chosen here. It will check which versions the server supports and then pick a supported version from this list. I also gave 2.60 the highest priority, so it should be preferred whenever it is available.

CAPO supports multiattach volume types, which were added in microversion 2.60.
*/
const NovaMinimumMicroversion = "2.60"
var NovaSupportedVersions = []*utils.Version{
	{ID: "2.1", Priority: 10, Suffix: "/v2.1/"}, // Initial microversion release, same as specifying no version
Contributor:

I'd put the bar higher here; it's hard to correlate, but we might be implicitly depending on some features that only became available later. Starting from Newton [1] would make sense to me.

Footnotes

[1] https://docs.openstack.org/nova/latest/reference/api-microversion-history.html#maximum-in-newton

lentzi90 (author):

Fair point! I picked 2.1 because that seems to be what is used if no version is specified, so this would have been the situation before CAPO specified a version. Basically this list then represents the history of CAPO:

  • No version (=2.1)
  • 2.53 (added as requirement for tags)
  • 2.60 (added as requirement for multi-attach volumes)

That said, I agree that it could be a good idea to set the bar higher and 2.38 sounds like a good starting place.

@jichenjc (Contributor) commented May 31, 2023:

Newton is old enough, so we should be fine setting the bar there. But I'd like to make it a configuration variable so someone can change it if they really want to.

lentzi90 (author):

but I'd like to make it a configuration variable so someone can change it if they really want to

This was also my original suggestion. I wanted to make CAPO/Gophercloud pick up the microversion from clouds.yaml. This way the user could set the version in there if needed and it would work the same way as for the OpenStack CLI. However, there were objections since it would be very easy to break things by setting versions that CAPO cannot work with. So now I'm trying to propose a way where CAPO still has control but it is a bit more flexible.

@lentzi90 force-pushed the lentzi90/flexible-nova-version branch from c38aa7f to cd74f5a on June 14, 2023 09:35
@lentzi90 (author) commented:

Updated! I added 2 more commits: one to "undo" the previous changes so it is easier to compare different approaches, and one with new changes. What I have done is basically this:

  1. Use integers instead of strings and floats when comparing versions.
  2. Use a separate RequireMicroversion method instead of passing the requirements to getComputeClient. I'm a bit torn on this: it is nice not to have to pass the required version around, but it also makes it easier to forget to set any requirements. I still added a version negotiation without specific requirements to the NewComputeClient function to avoid this "no version set" issue.
  3. Raised the minimum version. As suggested, I put the bar at 2.38 instead of 2.1. I still think there is a valid argument for 2.1 (since that is the default when nothing is specified), but it is also so old that we should not have to support it.

pkg/clients/compute.go (outdated review thread, resolved)
if err != nil {
	return nil, fmt.Errorf("failed to negotiate compute service version: %w", err)
}
compute.Microversion = version.ID
Contributor:

This method needs to return a deterministic microversion. It can't be a negotiated version: it needs to be success or failure. I want my code to run with microversion X or nothing (ok, we're going to have to support 2 for volume creation/attachment, but... details).

How about:

  • NewComputeClient always returns microversion 38.
  • A new method WithMicroversion(microversion int) (ComputeClient, error) returns a copy of itself with the target microversion if it is supported?

So most code is unchanged, and the only effect is that it's now running against an older microversion.

Code which requires a newer microversion does something like:

{ // Lexical scope limits new compute client
  compute, err := compute.WithMicroversion(TAGS)
  if err ...
}
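
A minimal sketch of how that copy-with-check could be implemented, reusing the hypothetical computeClient from the RequireMicroVersion sketch earlier in this thread (an illustration under the same cached-maximum assumption, not the PR's code):

// WithMicroversion returns a copy of the client pinned to the target
// microversion, or an error if the server's cached maximum is too low.
func (c *computeClient) WithMicroversion(minor int) (*computeClient, error) {
	if minor > c.maxMinor {
		return nil, fmt.Errorf("microversion 2.%d not supported: server maximum is 2.%d", minor, c.maxMinor)
	}
	copied := *c
	copied.minor = minor
	return &copied, nil
}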

lentzi90 (author):

Hmm, I'm not sure I understand. If we have a deterministic microversion and no negotiation, then how would this return failure? We would end up with what we already have, where CAPO says the microversion is X and we only find out whether that works when making the first API call. With negotiation we learn this already here.
I can accept negotiating with a single microversion (so it is that microversion or nothing), but even then I think it is better to negotiate than to just set the version without checking that it is supported.

@mdbooth (Contributor) commented Jun 20, 2023:

I anticipate that we'd fetch the maximum supported microversion from Nova when initially creating the service client and store it in the service client. We would then use this every time we needed to check that the required microversion was supported. i.e.

On instantiating the service client we check that microversion 38 is supported because that's the minimum. This is the microversion that all code without a higher requirement will run against.

When adding tags to a server before creation we'd do something like:

  createClient, err := compute.WithMicroversion(TAGS)
  if err != nil {
    createClient = compute
  } else {
    ... add tags to server create opts ...
  }

@mdbooth (Contributor) left a review comment:

I think I can see a way forward with this!

@lentzi90 force-pushed the lentzi90/flexible-nova-version branch from cd74f5a to ab8d3bc on September 29, 2023 09:49
@k8s-ci-robot added the size/L label and removed the size/M label on Sep 29, 2023
@lentzi90 force-pushed the lentzi90/flexible-nova-version branch from ab8d3bc to 178d4d0 on September 29, 2023 09:54
@lentzi90 (author) commented:

/test pull-cluster-api-provider-openstack-e2e-test

1 similar comment
@lentzi90 (author) commented:

/test pull-cluster-api-provider-openstack-e2e-test

@lentzi90 force-pushed the lentzi90/flexible-nova-version branch from 178d4d0 to 746cf67 on September 29, 2023 12:22
@lentzi90 (author) commented:

/test pull-cluster-api-provider-openstack-e2e-test

@lentzi90 (author) commented Oct 4, 2023:

/test pull-cluster-api-provider-openstack-e2e-test
/test pull-cluster-api-provider-openstack-test
/test pull-cluster-api-provider-openstack-build

@lentzi90 (author) commented Oct 4, 2023:

It probably makes sense to move some of this to gophercloud. See gophercloud/gophercloud#2791

@lentzi90 (author) commented Oct 4, 2023:

/test pull-cluster-api-provider-openstack-build
/test pull-cluster-api-provider-openstack-test

@lentzi90 force-pushed the lentzi90/flexible-nova-version branch from bc97d3e to f931f03 on October 5, 2023 06:08
@k8s-ci-robot added the needs-rebase label on Oct 28, 2023
@k8s-triage-robot commented:

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 26, 2024
@lentzi90 force-pushed the lentzi90/flexible-nova-version branch from 21df213 to b584752 on January 30, 2024 09:28
@k8s-ci-robot removed the needs-rebase label on Jan 30, 2024
@lentzi90 (author) commented:

/test pull-cluster-api-provider-openstack-build
/test pull-cluster-api-provider-openstack-test
/test pull-cluster-api-provider-openstack-e2e-test

@k8s-ci-robot added the needs-rebase label on Feb 28, 2024
@k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Mar 29, 2024
@lentzi90 (author) commented Apr 2, 2024:

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label on Apr 2, 2024
@k8s-triage-robot commented:

(Same lifecycle notice as above.)

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jul 1, 2024
@lentzi90 (author) commented Jul 7, 2024:

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Jul 7, 2024
@k8s-triage-robot commented:

(Same lifecycle notice as above.)

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Oct 5, 2024
@lentzi90 (author) commented:

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Oct 24, 2024
@lentzi90 force-pushed the lentzi90/flexible-nova-version branch from b584752 to 05bbd05 on October 25, 2024 06:53
@k8s-ci-robot removed the needs-rebase label on Oct 25, 2024
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from lentzi90. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot removed the approved label on Oct 25, 2024
@lentzi90 (author) commented:

/test pull-cluster-api-provider-openstack-build
/test pull-cluster-api-provider-openstack-test
/test pull-cluster-api-provider-openstack-e2e-test

This is an attempt to set the microversion based on what features are
needed. For example, if tags are defined for the server, then we need a
microversion that supports tags. If no special features are used or needed,
then we use the fixed minimum version.
@lentzi90 force-pushed the lentzi90/flexible-nova-version branch from 05bbd05 to 3ab8a84 on October 25, 2024 09:10
@lentzi90 (author) commented:

/test pull-cluster-api-provider-openstack-build
/test pull-cluster-api-provider-openstack-test

@lentzi90 (author) commented:

/test pull-cluster-api-provider-openstack-e2e-test

Labels: cncf-cla: yes, do-not-merge/hold, do-not-merge/work-in-progress, size/L

Development: successfully merging this pull request may close the issue "Support configurable microversions for optional features like multi-attachable volumes".

7 participants