Consistently propagate down timeouts from MD => MS => Machines #10753
Comments
This issue is currently awaiting triage. If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@enxebre @fabriziopandini @chrischdi @vincepri
Isn't this a good default behavior? Ideally, if a MachineSet is stuck in deleting, I should be able to change only that specific one without impacting the new one?
Good question. Timeouts today don't trigger a rollout (i.e. a new MS) and also don't trigger a separate revision. Might be annoying to have to manually go to the MS and update the timeouts there when you otherwise primarily/only interact with the MD.
Maybe one important point to consider: going forward we won't have old MachineSets that stay around forever. All of them will be in "scaling down" until they are gone (once we've dropped revision management). So I think our only goal for them is to properly scale them down. There is probably no reason that they have to preserve their previous timeout configuration.
The proposed steps sgtm Stefan.
I think it would be suboptimal UX for the use cases where the MachineDeployment is the consumer-facing API to force user interaction with the MachineSets that are still managed by it.
I missed this part, what are we doing for rollbacks?
Found #10479 — which I wasn't aware of, and I guess now this issue makes a bit more sense to me on the reasoning 😄
Goal
The goal of this issue is to consistently propagate down timeouts (NodeDrainTimeout, NodeDeletionTimeout, ...) from MDs to MSs to Machines. This is desirable so that users can still change timeouts even if a Machine is, e.g., stuck in draining.
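For illustration, here is a minimal sketch of what "propagating down" means for these fields, assuming the v1beta1 API types where the MD template, the MS template, and the Machine spec all carry the same timeout fields. This is not CAPI's actual implementation; the helper names are made up, and NodeVolumeDetachTimeout is used as one example of the "..." above.

```go
// Hedged sketch, not CAPI's actual code: in-place copies of the timeout fields
// from an MD's machine template to an MS's machine template and on to a Machine.
package propagation

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// copyTimeoutsToMachineSet copies the MD template timeouts onto the MS template.
func copyTimeoutsToMachineSet(md *clusterv1.MachineDeployment, ms *clusterv1.MachineSet) {
	ms.Spec.Template.Spec.NodeDrainTimeout = md.Spec.Template.Spec.NodeDrainTimeout
	ms.Spec.Template.Spec.NodeDeletionTimeout = md.Spec.Template.Spec.NodeDeletionTimeout
	ms.Spec.Template.Spec.NodeVolumeDetachTimeout = md.Spec.Template.Spec.NodeVolumeDetachTimeout
}

// copyTimeoutsToMachine copies the MS template timeouts onto an existing Machine.
func copyTimeoutsToMachine(ms *clusterv1.MachineSet, m *clusterv1.Machine) {
	m.Spec.NodeDrainTimeout = ms.Spec.Template.Spec.NodeDrainTimeout
	m.Spec.NodeDeletionTimeout = ms.Spec.Template.Spec.NodeDeletionTimeout
	m.Spec.NodeVolumeDetachTimeout = ms.Spec.Template.Spec.NodeVolumeDetachTimeout
}
```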
We had a first PR which ensures a MachineSet propagates down the timeouts to Machines which are in deleting: #10589
But there are a few other cases, as described here: #10589 (inlining below for convenience)
The following specifically focuses on cases where Machines are deleted by the MS controller.
Case 1. MD is deleted
The following happens:
=> The MS will already be gone when the deletionTimestamp is set on the Machines. In this case folks would have to modify the timeouts on each Machine individually; because the MS doesn't exist anymore, it's not possible to propagate timeouts down from the MS to the Machines.
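A hedged sketch of that manual workaround (patching the deleting Machines directly once the MS is gone), using controller-runtime's client. The namespace, label selector, MD name, and chosen timeout value are placeholders/assumptions, not something prescribed by the issue.

```go
// Hedged sketch of the manual workaround: with the MS already gone, patch the
// timeout on each (deleting) Machine directly.
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

func main() {
	ctx := context.Background()

	scheme := runtime.NewScheme()
	_ = clusterv1.AddToScheme(scheme)
	c, err := client.New(config.GetConfigOrDie(), client.Options{Scheme: scheme})
	if err != nil {
		panic(err)
	}

	// Select the Machines that belonged to the deleted MD (label key/value assumed).
	machines := &clusterv1.MachineList{}
	if err := c.List(ctx, machines,
		client.InNamespace("default"),
		client.MatchingLabels{"cluster.x-k8s.io/deployment-name": "my-md"},
	); err != nil {
		panic(err)
	}

	// Lower the drain timeout on each Machine so stuck drains give up sooner.
	for i := range machines.Items {
		m := &machines.Items[i]
		patch := client.MergeFrom(m.DeepCopy())
		m.Spec.NodeDrainTimeout = &metav1.Duration{Duration: 1 * time.Minute}
		if err := c.Patch(ctx, m, patch); err != nil {
			panic(err)
		}
	}
}
```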
Case 2. MD is scaled down to 0
The following happens:
This use case was addressed by: #10589
Case 3. MD rollout
The following happens:
=> In this scenario, today the MD controller does not propagate the timeouts from the MD to all MSs (only to the new/current one, not to the old ones). So the Machines of the old MSs won't get new timeouts that are set on the MD.
Implementation
To address all scenarios I would propose to always propagate timeouts from MD => MS => Machine; making that happen requires changes in both the MD controller and the MS controller.
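Not the actual implementation, but a hedged sketch of the overall shape this could take in the reconcilers, reusing the copyTimeouts* helpers from the sketch under "Goal" (the sync* helper names are made up):

```go
// Hedged sketch (same package as the copyTimeouts* helpers above), not the
// actual CAPI implementation: sync timeouts in place on every reconcile so a
// timeout change never requires a rollout or a new revision.
package propagation

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// syncTimeoutsToMachineSets propagates the MD timeouts to *all* owned
// MachineSets, old and new, which covers case 3 above.
func syncTimeoutsToMachineSets(ctx context.Context, c client.Client, md *clusterv1.MachineDeployment, machineSets []*clusterv1.MachineSet) error {
	for _, ms := range machineSets {
		patch := client.MergeFrom(ms.DeepCopy())
		copyTimeoutsToMachineSet(md, ms)
		if err := c.Patch(ctx, ms, patch); err != nil {
			return err
		}
	}
	return nil
}

// syncTimeoutsToMachines propagates the MS timeouts to its Machines; whether
// this targets all Machines or only the deleting ones is a design decision.
func syncTimeoutsToMachines(ctx context.Context, c client.Client, ms *clusterv1.MachineSet, machines []*clusterv1.Machine) error {
	for _, m := range machines {
		patch := client.MergeFrom(m.DeepCopy())
		copyTimeoutsToMachine(ms, m)
		if err := c.Patch(ctx, m, patch); err != nil {
			return err
		}
	}
	return nil
}
```

Doing this as an in-place patch on every reconcile, rather than only at rollout time, is what keeps timeout changes from creating a new MS or a new revision, in line with the discussion above.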
Follow-up: