In-Place Node OS Updates #120

Open
9 tasks
vlerenc opened this issue Aug 8, 2023 · 1 comment
Labels
area/os Operation system related kind/enhancement Enhancement, improvement, extension lifecycle/rotten Nobody worked on this for 12 months (final aging stage) os/garden-linux Related to Garden Linux OS topology/shoot Affects Shoot clusters

Comments

@vlerenc
Member

vlerenc commented Aug 8, 2023

Story

  • As a user, I want to update my node operating system in place with only a restart, without requiring a rolling update

Motivation

We were approached by parties that, for performance reasons, keep data on locally attached disks. While the data is replicated, rolling a node and rebuilding/syncing its local data may take hours. Doing that for each node during a cluster rolling update may therefore take many days, which is difficult for them.

This feature is also useful when the machine type in use is scarce (very special machine types) and getting new machines is not easy or guaranteed (no reserved instances).

GardenLinux is currently developing the capability to do that. Reminiscent of CoreOS' FastPatch updates, it will have two partitions: it runs on one, prepares the other, and then reboots into the other. Persistent data is stored on yet another partition and is preserved. This may not work with every update, but with many. The GardenLinux developers expect a full rolling update to be necessary only every 1-2 years; all other updates could be handled in place once they and we are done.
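The two-partition scheme described above can be pictured with a small toy model. This is only a sketch of the general A/B idea, not GardenLinux's actual mechanism: the `Node` class, the slot names `A`/`B`, the version strings, and the data entry are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy model of a node with two OS slots and a separate data partition."""

    # Hypothetical slot layout: slot A holds the running image, slot B is empty.
    slots: dict = field(default_factory=lambda: {"A": "1.0.0", "B": None})
    active: str = "A"
    # Stands in for the persistent partition; no method below modifies it.
    data: list = field(default_factory=list)

    @property
    def inactive(self) -> str:
        return "B" if self.active == "A" else "A"

    @property
    def running_version(self) -> str:
        return self.slots[self.active]

    def stage(self, version: str) -> None:
        """Write the new image to the inactive slot while the active one keeps running."""
        self.slots[self.inactive] = version

    def reboot_into_staged(self) -> None:
        """Switch the boot target to the staged slot."""
        if self.slots[self.inactive] is None:
            raise RuntimeError("no image staged on the inactive slot")
        self.active = self.inactive


node = Node(data=["replica-shard-7"])  # illustrative local data
node.stage("1.0.1")
node.reboot_into_staged()
print(node.running_version, node.active, node.data)
```

In this model the previous image stays on the old slot after the switch, which is what would make a fallback possible in such a scheme.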

This ticket is about Gardener's part: we do not support in-place OS updates as of now, so we need to think it through and then implement it, if feasible. For historic reference, please see one of our very first Gardener tickets, from when we implemented fully automated cluster updates (no. 14, for K8s v1.5 -> v1.6, time flies) and decided at first against FastPatch (gardener/gardener#14).

Labels

/area os
/kind enhancement
/os garden-linux
/topology shoot

Acceptance Criteria

  • Node OS updates (probably of something like patch versions, to also fit our Kubernetes versioning concept) are done without rolling the nodes
  • Ideally, the "dead time" between the kubelet's last status post and its next one (99th percentile) is a.) shorter than the default machineHealthTimeout of 10m or, even better, b.) shorter than the default KCM nodeMonitorGracePeriod of 40s. Cluster admins can tweak these settings (including pod tolerations) if that is not sufficient. Still, it would be great to achieve a.) if not b.), since "it was said" that rebooting shall take place in seconds
  • ...
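As a rough illustration of the dead-time criterion, the following sketch compares the 99th percentile of measured kubelet gaps against the two default thresholds named above. The function names and the sample values are made up for illustration, and the nearest-rank percentile is just one possible definition.

```python
import math

MACHINE_HEALTH_TIMEOUT_S = 10 * 60  # default machineHealthTimeout from the criteria
NODE_MONITOR_GRACE_PERIOD_S = 40    # default KCM nodeMonitorGracePeriod from the criteria


def percentile_nearest_rank(samples, p):
    """Return the p-th percentile (integer p, 0 < p <= 100) by the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def classify_dead_time(samples_s):
    """Classify the 99th-percentile dead time (seconds) against both thresholds."""
    p99 = percentile_nearest_rank(samples_s, 99)
    if p99 < NODE_MONITOR_GRACE_PERIOD_S:
        return p99, "below nodeMonitorGracePeriod"
    if p99 < MACHINE_HEALTH_TIMEOUT_S:
        return p99, "below machineHealthTimeout only"
    return p99, "exceeds machineHealthTimeout"


# Made-up reboot gaps in seconds, purely for illustration.
example = [12, 15, 18, 22, 35, 95]
print(classify_dead_time(example))
```

With few samples, the nearest-rank 99th percentile is simply the slowest reboot, so a single outlier decides the classification.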

Enhancement/Implementation Proposal (optional)

This will require a GEP (https://github.com/gardener/gardener/tree/master/docs/proposals), as conceptual and core changes will be necessary, along with everything else up to and including the update of the versioning guide/docs. The question is also what the main actor is: will we handle this use case like we handle Kubernetes patch updates, i.e. carried out by the maintenance controller? That is probably preferred over the OS doing it itself, for multiple reasons (means to opt out, shoot spec lists the exact version, time scatter/jitter or coordinated update, etc.).
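To make the "handled like Kubernetes patch updates" idea more concrete, here is a minimal decision sketch. It assumes three-part `major.minor.patch` OS image versions, an opt-out flag, and the rule that only patch-level changes qualify for in-place updates; none of this is existing Gardener behaviour, and `plan_os_update` is a hypothetical name.

```python
def parse_version(version):
    """Split an assumed 'major.minor.patch' string into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def plan_os_update(current, target, auto_update_enabled=True):
    """Return 'none', 'in-place', or 'rolling' for an OS image version change."""
    cur, tgt = parse_version(current), parse_version(target)
    # Opt-out, or nothing newer to move to.
    if not auto_update_enabled or tgt <= cur:
        return "none"
    # Assumption: same major.minor means a patch update that can be done in place.
    if tgt[:2] == cur[:2]:
        return "in-place"
    return "rolling"


print(plan_os_update("1.2.3", "1.2.4"))
print(plan_os_update("1.2.3", "1.3.0"))
print(plan_os_update("1.2.3", "1.2.4", auto_update_enabled=False))
```

Keeping the decision in one place outside the OS is what would allow the opt-out and the exact version in the shoot spec to stay authoritative.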

Further Considerations

  • Rolling updates, as a side effect, help with some security obligations (regular fresh start) and with building robust solutions (avoiding pet VMs). The rolling update also acts as a safety net: only when the new node is registered and ready is the old node drained and subsequently terminated. In-place updates obviously do not offer this.
  • Because this is not generally desirable (only in certain cases, e.g. for nodes with local disks or of scarce machine types), it would be best to make the update policy (rolling or in-place) configurable per worker pool, which would require more changes. The maintenance section as of today applies to the entire cluster.
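A per-worker-pool policy could be modelled roughly as follows. This is a sketch under assumptions: the `UpdatePolicy` enum, the `update_policy` field, and the pool and machine type names are hypothetical and do not reflect the actual shoot API.

```python
from dataclasses import dataclass
from enum import Enum


class UpdatePolicy(Enum):
    ROLLING = "rolling"
    IN_PLACE = "in-place"


@dataclass(frozen=True)
class WorkerPool:
    name: str
    machine_type: str
    # Rolling stays the default, so pools must opt in to in-place updates.
    update_policy: UpdatePolicy = UpdatePolicy.ROLLING


def pools_by_policy(pools):
    """Group worker pool names by their configured update policy."""
    grouped = {policy: [] for policy in UpdatePolicy}
    for pool in pools:
        grouped[pool.update_policy].append(pool.name)
    return grouped


pools = [
    WorkerPool("general", "standard-machine"),
    WorkerPool("local-disk", "storage-machine", UpdatePolicy.IN_PLACE),
]
print(pools_by_policy(pools))
```

Defaulting to rolling keeps today's behaviour for every pool that does not explicitly choose otherwise.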

Resources (optional)

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit tests are provided: Have you written automated unit tests?
  • Integration tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added/changed public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide about ops-relevant changes?
  • User documentation: Have you updated the READMEs/docs/how-tos about user-relevant changes?
@gardener-robot gardener-robot added area/os Operation system related kind/enhancement Enhancement, improvement, extension os/garden-linux Related to Garden Linux OS topology/shoot Affects Shoot clusters labels Aug 8, 2023
@vlerenc
Member Author

vlerenc commented Aug 10, 2023

@unmarshall had two very valuable comments I will incorporate above:

Rolling updates provide some sort of safety net. Only when the new node is running will the old node be drained and subsequently deleted. In-place updates obviously do not offer this level of availability.

It would be nice to have this option at machine-deployment level. Machine deployments with more expensive machines, or with machines that have a lower quota (either due to specialised extensions or high demand), could then be marked for in-place updates. Other machine deployments could keep a standard rolling update.

@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Apr 18, 2024
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Dec 26, 2024

2 participants