In-Place Node OS Updates #120

vlerenc · 2023-08-08T05:04:10Z

Story

As a user, I want to update my node operating system in-place with only a restart, without requiring a rolling update.

Motivation

We were approached by parties that, for performance reasons, use locally attached disks with data. While the data is replicated, rolling a node and re-building/syncing the local data may take hours. Doing that during a cluster rolling update may therefore take many days, which is difficult for them.

This feature is also useful in cases where the machine type in use is scarce (very special machine types) and it isn't easy or guaranteed to get new machines (no reserved instances).

GardenLinux is currently developing the capability to do that. Reminiscent of CoreOS' FastPatch updates, it will have 2 partitions: the system runs on one, prepares the other one, and then reboots into it. Persistent data is stored on yet another partition and preserved. This may not work with every update, but with many. The GardenLinux developers expect a full rolling update to be necessary only every 1-2 years; all other updates could be handled in-place once they and we are done.
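
The two-partition scheme described above can be pictured with a small toy model. This is only a sketch under assumptions: the class `ABNode`, the slot names `A` and `B`, the version strings, and the method names are all hypothetical and are not GardenLinux APIs; the real mechanism may well differ.

```python
from dataclasses import dataclass, field


@dataclass
class ABNode:
    """Toy model: two OS slots plus one persistent data partition (hypothetical)."""
    slots: dict = field(default_factory=lambda: {"A": "1.0.0", "B": None})
    active: str = "A"
    data: list = field(default_factory=list)  # persistent partition, never rewritten here

    @property
    def inactive(self) -> str:
        return "B" if self.active == "A" else "A"

    @property
    def running_version(self) -> str:
        return self.slots[self.active]

    def prepare(self, version: str) -> None:
        # Write the new OS image to the slot the node is NOT running from.
        self.slots[self.inactive] = version

    def reboot(self) -> None:
        # Switch to the prepared slot; refuse if nothing was prepared.
        if self.slots[self.inactive] is None:
            raise RuntimeError("inactive slot has no image")
        self.active = self.inactive


node = ABNode(data=["replica-shard-7"])
node.prepare("1.0.1")
node.reboot()
print(node.running_version, node.active, node.data)
```

The point of the model is that `prepare` never touches the running slot or the data partition, so only the `reboot` step interrupts the node.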

This ticket is about Gardener's part: we do not support in-place OS updates as of now and need to think this through and then implement it, if feasible. Just for historic reference, please see one of our very first Gardener tickets, from when we implemented fully automated cluster updates (no. 14, for K8s v1.5 -> v1.6; time flies) and at first decided against FastPatch (gardener/gardener#14).

Labels

/area os
/kind enhancement
/os garden-linux
/topology shoot

Acceptance Criteria

Node OS updates (probably of something like patch versions, to also fit our Kubernetes versioning concept) are done without rolling the nodes
Ideally, the "dead time" between the kubelet's last status post and its first post after the reboot (99th percentile) is a.) shorter than the default machineHealthTimeout of 10m or, even better, b.) shorter than the default KCM nodeMonitorGracePeriod of 40s. Cluster admins can tweak these settings (including pod tolerations) if that is not sufficient; still, it would be great to achieve a.) if not b.), since "it was said" that the reboot shall take place in seconds
...
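
The dead-time criterion above could be checked against measured reboot times roughly as follows. This is a minimal sketch, not a Gardener tool: the function names and the sample values are made up for illustration, the two thresholds are the defaults quoted in the criteria, and the percentile uses the simple nearest-rank method.

```python
import math

MACHINE_HEALTH_TIMEOUT_S = 10 * 60  # default machineHealthTimeout quoted above
NODE_MONITOR_GRACE_PERIOD_S = 40    # default KCM nodeMonitorGracePeriod quoted above


def p99(samples):
    """99th percentile of the samples using the nearest-rank method."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(0.99 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def classify(samples):
    """Report which of the two default thresholds the p99 dead time stays below."""
    value = p99(samples)
    if value < NODE_MONITOR_GRACE_PERIOD_S:
        return "within_node_monitor_grace_period"
    if value < MACHINE_HEALTH_TIMEOUT_S:
        return "within_machine_health_timeout"
    return "exceeds_both"


# Hypothetical dead times in seconds, one per observed reboot.
dead_times = [12, 15, 18, 22, 35, 95]
print(p99(dead_times), classify(dead_times))
```

With so few samples the 99th percentile is simply the slowest reboot, which is the conservative reading of the criterion.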

Enhancement/Implementation Proposal (optional)

This will require a GEP (https://github.com/gardener/gardener/tree/master/docs/proposals), as conceptual and core changes will be necessary, along with everything else up to and including the update of the versioning guide/docs. The question is also who the main actor is: will we handle this use case like we handle Kubernetes patch updates, i.e. carried out by the maintenance controller? That is probably preferred for multiple reasons (means to opt out, shoot spec lists the exact version, time scatter/jitter resp. coordinated update, etc.) over the OS doing it itself.
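
If the maintenance controller were the main actor, it would need some rule for deciding whether a version change may be applied in-place. The sketch below assumes, purely for illustration, three-part `major.minor.patch` OS versions and that only patch-level bumps qualify, as the acceptance criteria tentatively suggest; `update_kind` is a hypothetical helper, not an existing Gardener function, and the real decision would also depend on what the OS reports as feasible.

```python
def parse(version):
    """Split 'major.minor.patch' into a tuple of ints (assumed version format)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def update_kind(current, target):
    """Decide how to get from the current to the target OS version (hypothetical rule)."""
    cur, tgt = parse(current), parse(target)
    if tgt <= cur:
        return "none"      # nothing to do, no downgrades
    if tgt[:2] == cur[:2]:
        return "in-place"  # same major.minor, higher patch
    return "rolling"       # anything bigger falls back to a rolling update


print(update_kind("1.2.3", "1.2.5"))
print(update_kind("1.2.3", "1.3.0"))
```

Keeping the exact target version in the shoot spec, as with Kubernetes patch updates, is what makes such a comparison possible in the first place.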

Further Considerations

Rolling updates, as side effects, help with some security obligations (regular fresh start) and help build robust solutions (avoiding pet VMs). The rolling update also acts as a sort of safety net: only when the new node is registered and ready will the old node be drained and subsequently terminated. In-place updates obviously do not offer this.
Because in-place updates are not generally desirable (only in certain cases, e.g. for nodes with local disks or of scarce machine types), it would be best to make the update policy (rolling or in-place) configurable per worker pool, which would require more changes. The maintenance section as of today applies to the entire cluster.
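
A per-worker-pool update policy could be modelled along these lines. This is a sketch under assumptions: the `WorkerPool` type, the `update_policy` field, the enum values, and the pool and machine type names are all hypothetical and do not describe the existing shoot API.

```python
from dataclasses import dataclass
from enum import Enum


class UpdatePolicy(Enum):
    ROLLING = "rolling"
    IN_PLACE = "in-place"


@dataclass
class WorkerPool:
    name: str
    machine_type: str
    # Defaulting to rolling keeps today's behaviour for pools that do not opt in.
    update_policy: UpdatePolicy = UpdatePolicy.ROLLING


def pools_by_policy(pools):
    """Group worker pool names by their configured update policy."""
    grouped = {policy: [] for policy in UpdatePolicy}
    for pool in pools:
        grouped[pool.update_policy].append(pool.name)
    return grouped


pools = [
    WorkerPool("general", "standard-type"),
    WorkerPool("local-disk", "scarce-type", UpdatePolicy.IN_PLACE),
]
print(pools_by_policy(pools))
```

Making rolling the default means only pools with local disks or scarce machine types need to be touched, and all other pools keep the safety net of a rolling update.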

Resources (optional)

Contacts: @MalteJ, @gehoern, @MrBatschner, @danielfoehrKn

Definition of Done

Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
Unit tests are provided: Have you written automated unit tests?
Integration tests are provided: Have you written automated integration tests?
Minimum API exposure: If you have added/changed public API, was it really necessary/is it minimal?
Operations guide: Have you updated the operations guide about ops-relevant changes?
User documentation: Have you updated the READMEs/docs/how-tos about user-relevant changes?


vlerenc · 2023-08-10T15:14:16Z

@unmarshall had two very valuable comments I will incorporate above:

Rolling updates provide some sort of safety net. Only when the new node is running will the old node be drained and subsequently deleted. In-place updates obviously do not offer this level of availability.

If we could have this option at machine-deployment level, that would be nice. More expensive machines, or machines with less quota (either due to specialised extensions or high demand), could then be marked for in-place updates. For other machine-deployments it could be a standard rolling update.

gardener-robot added the labels area/os "Operation system related", kind/enhancement "Enhancement, improvement, extension", os/garden-linux "Related to Garden Linux OS", and topology/shoot "Affects Shoot clusters" on Aug 8, 2023

gardener-robot added the label lifecycle/stale "Nobody worked on this for 6 months (will further age)" on Apr 18, 2024

gardener-robot added the label lifecycle/rotten "Nobody worked on this for 12 months (final aging stage)" and removed the label lifecycle/stale "Nobody worked on this for 6 months (will further age)" on Dec 26, 2024
