Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Graduate MachinePools from experimental to stable API #9005

Open
mboersma opened this issue Jul 17, 2023 · 19 comments
Open

Graduate MachinePools from experimental to stable API #9005

mboersma opened this issue Jul 17, 2023 · 19 comments
Assignees
Labels
area/machinepool Issues or PRs related to machinepools kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@mboersma
Copy link
Contributor

mboersma commented Jul 17, 2023

What would you like to be added (User Story)?

As a production user of CAPI, I'm not comfortable working with "experimental" parts of the product, but I would like to take advantage of the native VM-grouping construct at my cloud provider.

Detailed Description

MachinePools have been part of the experimental API for years, but their API has been overall stable (recent backward-compatible changes for MachinePool Machines notwithstanding). Moving them out of exp/ will declare that MachinePools are stable, supported, and safe to use.

Anything else you would like to add?

The graduation criteria in the MachinePools proposal remains TBD. Informally, when the question "when can MPs graduate?" has been asked, the answer has generally been "after MachinePool Machines land."

MachinePool Machines have begun to land and should soon be largely complete. This seems like the time to consider graduating MachinePools, although there may be more context or specific // TODO items I'm not aware of. I'm hoping to identify those in this issue.

Label(s) to be applied

/kind feature

/cc @dtzar @Jont828

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 17, 2023
@killianmuldoon
Copy link
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 17, 2023
@jayesh-srivastava
Copy link
Member

@mboersma I would like to take this one up. Can we have discussions over subtaks/TODOs which can help me moving forward.

@fabriziopandini
Copy link
Member

@mboersma
AFAIK as far as I remember we have still some to do left from @Jont828 PR in order to complete the implementation of the proposal and get feature parity with MD (this is at least what I remember from some thread on the PR and the discussion we had around it and the - divide and conquer from the original PR- )
e.g. label propagation, CC support, but also machines that can't be operated, probably more

It will be great to use this issue to track all this work

@vincepri
Copy link
Member

It'd be also good to have a featureset comparison in the book to guide users when they should be using MachineDeployment or MachinePool

@Jont828
Copy link
Contributor

Jont828 commented Jul 24, 2023

Sounds good. #8842 is the last PR I currently have open related to MachinePool Machines. I think it makes sense to put a list of follow up tasks here once that gets wrapped up.

@vincepri We have a section in the CAPI book about differences in features but I suppose we can expand it.

@CecileRobertMichon
Copy link
Contributor

#8858 should probably get resolved as well before we do this

@AndiDog
Copy link
Contributor

AndiDog commented Jan 23, 2024

Apart from some notes in #8858, here are some comments about the API.

  • spec.replicas feels weird. Its meaning depends on what the infra provider (e.g. CAPA) and cloud provider (e.g. AWS) offer. For AWS ASGs, it's min/max/desired, where min/max are in AWSMachinePool.spec.{minSize,maxSize} and desired is in MachinePool.spec.replicas. What if a cloud provider doesn't have a "desired number of replicas" setting at all? I rather see these all in the <Infra>MachinePool type.
  • spec.minReadySeconds is documented as "Minimum number of seconds for which a newly created machine instances should be ready." – As a user, I don't understand what the effect of this field is. Ready for that interval, and then? What does "ready" mean? Which component reports it as ready? If we don't improve the description, the value could be mistaken for the infra provider specific one (e.g. AWS ASG "instance warmup" in seconds). And since I wanted to understand this, I quickly searched in source code and didn't find this field used for machine pools – am I mistaken?
  • https://cluster-api.sigs.k8s.io/tasks/experimental-features/machine-pools should document a "contract" – which fields should infra providers use and how? Which are required/optional to implement?

Other than that, I think stabilizing the API is a good idea – even a stable API can be evolved, for instance by adding new fields if we really run into major missing features.

@sbueringer
Copy link
Member

sbueringer commented Jan 24, 2024

And since I wanted to understand this, I quickly searched in source code and didn't find this field used for machine pools – am I mistaken?

This should be fixed via: #9837

Sounds like there's potential to improve the godoc for MD & MachinePool, but apart from that it seems to have a clear meaning

@dtzar
Copy link
Contributor

dtzar commented Feb 7, 2024

Per the CAPI community call today as I understand it, the next steps are:

  1. Immediately have someone PR to have Machinepool get deployed by default (without requiring feature flag being enabled). Push it also from alpha to beta status at this time.
  2. Gather additional feedback after this change and consider setting a timeline with updated note(s) in the spec

@mboersma
Copy link
Contributor Author

/assign

@fabriziopandini
Copy link
Member

/priority important-long term

@mboersma is there some work left for this issue or can we close it

@k8s-ci-robot
Copy link
Contributor

@fabriziopandini: The label(s) priority/important-long, priority/term cannot be applied, because the repository doesn't have them.

In response to this:

/priority important-long term

@mboersma is there some work left for this issue or can we close it

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fabriziopandini
Copy link
Member

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Apr 11, 2024
@mboersma
Copy link
Contributor Author

@fabriziopandini MachinePools are "beta" now, but I think we need to move the code out of exp somehow. I haven't thought about how to do that yet, but it's on my plate to do.

@enxebre
Copy link
Member

enxebre commented Jun 19, 2024

Thanks @mboersma is there any report that can be shared of the unit and e2e test coverage for MachinePools?

@sbueringer
Copy link
Member

sbueringer commented Jun 20, 2024

I'll try to find some time next week to write a summary of the current state of MPs regarding open issues and features we were planning to implement at some point but didn't until today.

Then we can discuss how they relate to the remaining steps to get MPs out of experimental. If I understand correctly, the remaining steps are moving the implementation out of "exp" and eventually dropping the feature gate.

In any case, from my side no objection against moving the code out of "exp". I see no reason why it should stay there. In my opinion for our users it should not be relevant in which package we implement features. (I mean which end user even knows that MachinePools are below "exp" in our repo, and why should they have to care?)

@sbueringer sbueringer added the area/machinepool Issues or PRs related to machinepools label Jul 10, 2024
@sbueringer
Copy link
Member

sbueringer commented Aug 1, 2024

Sorry for the delay. Here's my summary and my first try at a categorization.

Flaky tests with MachinePools:

Fundamental issues / questions about the MachinePool API / contract:

MachinePool Machines: (not clear to me, should be double checked by folks more familiar with MP Machines)

  • Metadata propagation like for MD => MS => Machine
  • from the proposal: "Allow both Cluster API-driven and externally driven interactions with individual Machines in a MachinePool"
    • MachineHealthChecks for MP Machines
    • Can MachinePool Machines today be deleted via delete machine? e.g. "Cluster Autoscaler needs to delete individual Machines" (based on the proposal only some implementations support that, how would a user know which?)
  • It's not entirely clear to me how much of the proposal is implemented
  • Foreground deletion for Machines (see 🌱 Foreground deletion for MachineDeployments and MachineSets #11174) (not sure how this works with MachinePools today)
  • In general either feature parity between MD and MP Machines or a clear documentation about the differences would be good

Work related to the status improvements for v1beta2:

(Potential) Bugs:

Misc / TBD:

Minor improvements:

As far as I can tell regarding graduation - apart from the issues listed here - the following is missing:

  • Move code out of exp
  • Remove the feature gate

We can/should discuss which parts of the "graduation" we want to block until which issues are resolved. I"m personally fine with moving the code out of exp, but we should then probably treat removing the feature gate as the "graduation".

In general, it would be great to see some folks signing up to be maintainers of the MachinePool feature/code. To be honest, at the moment, it's mostly up to folks which are neither very familiar with MachinePools nor are they using the feature. This is not a great state if we're planning to maintain this feature going forward, which should be the minimum bar for graduation.

cc @fabriziopandini @vincepri @enxebre @killianmuldoon @chrischdi @mboersma

@fabriziopandini
Copy link
Member

Great work @sbueringer! 🥳
Having a list of all the open issues + all those valuable notes helps me to understand better where we are, and I guess it helps others too.

Up to @mboersma if we want to move this to an Umbrella issue for graduation.

If I can give my two cents, I consider ownership + open issues the key points to be addressed before graduation; I also agree that moving the code out of exp is sort of orthogonal to graduation given that we are already shipping MachinePool APIs together with everything else.

@fabriziopandini
Copy link
Member

Adding on top of the note above, If we cannot find an answer to the lack of maintainers for this feature, we should probably start thinking about a split of MP E2E tests from other tests, so we can have a cleaner release signal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/machinepool Issues or PRs related to machinepools kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging a pull request may close this issue.