Supporting an Inplace Update Rollout Strategy for upgrading Workload Clusters #9489

Open · dharmjit opened this issue Sep 25, 2023 · 8 comments · May be fixed by #11029
Labels: kind/feature · priority/backlog · triage/accepted

@dharmjit commented Sep 25, 2023

User Story

As a Platform Operator managing Kubernetes clusters in resource-constrained (non-HA) and/or specialized, customized environments, I want to upgrade Kubernetes clusters without rolling out new nodes.

Detailed Description

For use cases such as single-node clusters with no spare capacity, or multi-node clusters with VM/OS customizations for high-performance/low-latency workloads or a dependency on local persistent storage, upgrading a workload cluster via the RollingUpdate rollout strategy can be either infeasible or costly, since the customizations must be re-applied on the new nodes, resulting in more downtime.

CAPI uses and promotes immutable infrastructure principles for a range of advantages. With the emergence of image-based OS upgrade techniques, such as A/B partition OS upgrades or OSTree filesystem OS upgrades, which provide immutable OS characteristics, we could rethink CAPI providing another rollout strategy to update Kubernetes and the OS on workload clusters.

At a high level, below could be some of the requirements (a hypothetical API sketch follows the note below):

  • Introduce a new rollout strategy that allows upgrading workload clusters without rolling out new nodes.
  • Support this new rollout strategy for both ClusterClass and non-ClusterClass clusters.
  • Support this new rollout strategy for both the control plane and the worker nodes of a workload cluster.
  • Ensure this new rollout strategy is agnostic of the underlying implementation of image-based OS upgrades (OSTree upgrades, A/B partition upgrades, etc.).

Note: For highly available clusters in resource-constrained environments, CAPI already provides strategies like ScaleIn (KCP) and OnDelete (MD) for upgrades that do not require additional infra capacity.
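
To make the first requirement a bit more concrete, here is a minimal, purely illustrative Go sketch of what such an API surface could look like if an InPlace strategy type were added next to the existing rollout strategies. The names are loosely modeled on the shape of the current MachineDeployment API; the InPlace variant and its semantics are hypothetical, not something CAPI provides today.

```go
// Hypothetical sketch only: the InPlace strategy type below does not exist in
// Cluster API; it is loosely modeled on the shape of the existing
// MachineDeployment rollout strategy API to illustrate the requirement.
package v1beta1

// MachineDeploymentStrategyType describes how Machines are rolled out.
type MachineDeploymentStrategyType string

const (
	// RollingUpdate replaces old Machines by creating new ones (current behavior).
	RollingUpdateMachineDeploymentStrategyType MachineDeploymentStrategyType = "RollingUpdate"

	// OnDelete replaces old Machines only when they are deleted by the user.
	OnDeleteMachineDeploymentStrategyType MachineDeploymentStrategyType = "OnDelete"

	// InPlace (hypothetical) would update the Kubernetes/OS bits on existing
	// Machines without creating new ones, delegating the actual mechanism
	// (OSTree, A/B partitions, ...) to a pluggable, OS-specific updater.
	InPlaceMachineDeploymentStrategyType MachineDeploymentStrategyType = "InPlace"
)

// MachineDeploymentStrategy selects how a rollout is performed.
type MachineDeploymentStrategy struct {
	// Type of rollout; hypothetically RollingUpdate, OnDelete, or InPlace.
	Type MachineDeploymentStrategyType `json:"type,omitempty"`
}
```

Where exactly such a knob would live (MachineDeployment, KCP, ClusterClass topology, or a separate extension mechanism) is one of the questions a proposal or working group would need to settle.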

Anything else you would like to add?

There are already some CAPI Slack discussions and GitHub issues discussing in-place upgrade needs, and folks probably already have ideas or additional use cases around this. It would be great to hear and discuss those in the comments, and it may be beneficial to create a working group around this feature.

There are a few existing GitHub issues around in-place upgrades/mutability in CAPI; tagging folks who were part of those discussions.

cc: @furkatgofurov7 @pacoxu @fabriziopandini @sbueringer @shivi28

Please feel free to add more folks interested in this feature.

/kind feature
/area upgrades

@k8s-ci-robot added the kind/feature and needs-triage labels on Sep 25, 2023
@fabriziopandini (Member)

/triage accepted
I personally think this is a great discussion to have. IMO the project is now at a stage where we have all the required tools and conditions to approach this topic with confidence.

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Sep 25, 2023
@g-gaston (Contributor)

/assign
The working group will collaborate on a design for this.

@fabriziopandini (Member)

/priority backlog

@k8s-ci-robot added the priority/backlog label on Apr 11, 2024
@nickperry commented Apr 17, 2024

As an operator of CAPI clusters at scale in regulated physical locations with bandwidth and compute hardware constraints, I would very much welcome this capability.

@ahrtr (Member) commented May 13, 2024

Thanks @fabriziopandini for pointing me to this issue (I was going to raise the same issue).

One of the problems of creating & removing nodes one by one is that you have to sync etcd's data from the leader each time you upgrade or update the cluster. That is definitely unnecessary; it would be great if we could avoid it with in-place rolling upgrades & updates.

@guettli (Contributor) commented Jul 11, 2024

@ahrtr please elaborate on why it is a problem for you that etcd data needs to be synced again. I understand that it is network traffic which could be avoided, but please explain the pain of the current "delete and recreate" approach. etcd now has learner mode, so the new etcd node only becomes a voting member after it has synced.
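
For context on the learner-mode flow mentioned above, a minimal sketch with the etcd clientv3 Go API looks roughly like this; the endpoints and peer URLs are placeholders.

```go
// Sketch of "add as learner, then promote": the new member receives the
// snapshot from the leader before it starts counting towards quorum.
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://10.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Add the replacement node as a non-voting learner.
	resp, err := cli.MemberAddAsLearner(ctx, []string{"https://10.0.0.4:2380"}) // placeholder peer URL
	if err != nil {
		log.Fatal(err)
	}

	// ... wait for the learner to catch up with the leader ...

	// Promote the learner to a voting member once it is in sync.
	if _, err := cli.MemberPromote(ctx, resp.Member.ID); err != nil {
		log.Fatal(err)
	}
}
```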

@ahrtr (Member) commented Jul 11, 2024

  • It's a waste of network bandwidth: the leader needs to send a snapshot to each of the followers when you delete & recreate each of them. Obviously it's unnecessary from etcd's perspective.
  • It creates a window of reduced failure tolerance. A 3-member cluster can tolerate one member failure; when you delete & recreate one member, until its data is in sync with the leader and it is promoted to a voting member, the cluster can tolerate 0 member failures (the arithmetic is spelled out below).
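
A quick sketch of the quorum arithmetic behind that window: with n voting members, quorum is n/2 + 1 (integer division), so the cluster tolerates n minus quorum failures.

```go
// Quorum arithmetic for the failure-tolerance window described above.
package main

import "fmt"

// tolerance returns how many voting members can fail while quorum is kept.
func tolerance(votingMembers int) int {
	quorum := votingMembers/2 + 1
	return votingMembers - quorum
}

func main() {
	fmt.Println(tolerance(3)) // healthy 3-member cluster: tolerates 1 failure
	fmt.Println(tolerance(2)) // one member replaced, learner not yet promoted: tolerates 0
}
```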

@guettli (Contributor) commented Jul 11, 2024

  • It's a waste of network bandwidth: the leader needs to send a snapshot to each of the followers when you delete & recreate each of them. Obviously it's unnecessary from etcd's perspective.

Do you have numbers? How much data needs to be synced? (In my current context I have a lot of smaller clusters, so it does not matter.)

  • It creates a window of reduced failure tolerance. A 3-member cluster can tolerate one member failure; when you delete & recreate one member, until its data is in sync with the leader and it is promoted to a voting member, the cluster can tolerate 0 member failures.

Wait a second. I thought Cluster API does a scale-out during an upgrade: if you have 3 CP nodes, a 4th node gets added, then an old node gets deleted. But maybe I am missing something.

@g-gaston linked a pull request on Aug 7, 2024 that will close this issue