
Provide a better solution than standard kubectl drain #11024

Closed

Kun483 opened this issue Aug 6, 2024 · 42 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@Kun483

Kun483 commented Aug 6, 2024

What would you like to be added (User Story)?

As a developer, I would like to be able to filter out certain pods (so that they are not deleted) during the node draining process.

Detailed Description

I used Portworx as the distributed storage for container storage interface (CSI) drivers. When I upgraded my cluster, it caused “repaving”: replacing the old nodes in the cluster one by one with new nodes that have the new desired state in place. While worker machines were being repaved, CAPI continuously killed the px-<cluster-name> pods, which prevented the CSI plugin from ever re-registering and left the node hanging forever. As a result, cluster upgrades got stuck.

In this case, if we could have a field in the machine spec to filter out the pods that we don't want CAPI to delete during the node draining process, then pods like px-<cluster-name> could be re-registered and repaving would complete successfully. As discussed, we may need to not delete pods that have the following toleration:

  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
var (
	// Toleration matching the node.kubernetes.io/unreachable taint; pods
	// carrying it should be skipped during drain.
	unreachableToleration = corev1.Toleration{
		Key:      nodeUnreachableKey,
		Effect:   corev1.TaintEffectNoSchedule,
		Operator: corev1.TolerationOpExists,
	}
)

// Additional pod filters passed to the drain helper: one generated from the
// proposed NodeDrainPodFilters label selector, and one based on tolerations.
additionalFilters := []kubedrain.PodFilter{
	SkipFuncGenerator(m.Spec.NodeDrainPodFilters),
	skipUnreachableTolerationPods,
}

func skipUnreachableTolerationPods(pod corev1.Pod) kubedrain.PodDeleteStatus {
	if pod.Spec.Tolerations == nil {
		return kubedrain.MakePodDeleteStatusOkay()
	}
	if HasTolerations(&pod, &unreachableToleration) {
		return kubedrain.MakePodDeleteStatusSkip()
	}
	return kubedrain.MakePodDeleteStatusOkay()
}

With the helper function:

func HasTolerations(pod *corev1.Pod, toleration *corev1.Toleration) bool {
	for _, t := range pod.Spec.Tolerations {
		if t.MatchToleration(toleration) {
			return true
		}
	}
	return false
}

We may add a field called NodeDrainPodFilters in MachineSpec. We could also add this field to the KubeadmControlPlaneTemplateMachineTemplate struct:

NodeDrainPodFilters *metav1.LabelSelector `json:"nodeDrainPodFilters,omitempty"`
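For illustration, the proposed field could sit in the Machine API types roughly like this (sketch only; the doc comment and optional marker are assumptions, and the field does not exist today):

// Sketch of the proposed (not yet existing) field in MachineSpec.
// metav1 is k8s.io/apimachinery/pkg/apis/meta/v1.
type MachineSpec struct {
	// ... existing MachineSpec fields ...

	// NodeDrainPodFilters selects pods that should be skipped (left running)
	// when Cluster API drains the Node during Machine deletion.
	// +optional
	NodeDrainPodFilters *metav1.LabelSelector `json:"nodeDrainPodFilters,omitempty"`
}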

Anything else you would like to add?

No response

Label(s) to be applied

/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 6, 2024
@sbueringer
Member

Q: Shouldn't pods like this be deployed via a DaemonSet?

@kreeuwijk

kreeuwijk commented Aug 6, 2024

Q: Shouldn't pods like this be deployed via a DaemonSet?

A: That's not how Portworx does it, I'm afraid. We have no control over that. Portworx is deployed through an operator and there is no DaemonSet for the main Portworx worker pods. Instead, the operator creates these pods itself on nodes that it deems need to participate in the Portworx StorageCluster. The portworx-api pods are deployed as a DaemonSet by the operator though.

@sbueringer
Member

So that means that portworx clusters are in general not compatible with kubectl drain? (because we are just doing the same)

@sadysnaat
Contributor

I fear that's true @sbueringer, Portworx will fail a regular drain.

@kreeuwijk is there any guide around node drain options? Maybe Portworx has some advanced options.

But regardless of how Portworx works: @sbueringer, would it make sense to add pod filter functionality to Cluster API, since it is part of the kubectl drain CLI?

@kreeuwijk

kreeuwijk commented Aug 6, 2024

No no, Portworx is perfectly fine with normal node drains. It's just the CAPI-based node drain that is a problem, as capi-controller will keep killing any and all non-daemonset workloads that get started on the drained node.

Portworx expects to be able to restart the Portworx worker pod on a drained node. That pod has tolerations to allow it to come back up on an already drained node. It also just connects to the "real" systemd Portworx service that runs on the host OS, so loss of the worker pod does not automatically mean loss of the storage service. However, when capi-controller keeps killing the Portworx worker pod, it causes the CSI registration for that node to go into a bad state, which then causes PVC attachment issues for specific workloads, most notably KubeVirt VMs that are trying to live-migrate away to a different node.

Hence, we need to be able to inform the CAPI node drainer to purposely ignore pods with the name: portworx label, so that the pod is left alone and can restart on the node a few moments after the node drain.
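For illustration, with the nodeDrainPodFilters field proposed in the issue description (a proposal, not an existing API), that could look roughly like:

  # Hypothetical Machine spec excerpt; nodeDrainPodFilters is the proposed field, not an existing one.
  spec:
    nodeDrainPodFilters:
      matchLabels:
        name: portworx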

Portworx is not going to change how their product works for this, in their eyes the problem is with CAPI, not with them.

Kun483 added a commit to Kun483/cluster-api that referenced this issue Aug 7, 2024
**What this PR does / why we need it**:
The PR adds support to filter out certain pods during the node draining process.

**Which issue(s) this PR fixes**:
Fixes kubernetes-sigs#11024
@sbueringer
Member

Wondering if we should just skip evicting Pods that have that toleration

Heads up. I'm planning to refactor the whole drain logic in the next few weeks. Basically it's just a bad idea to use the drain helper in a controller

@kreeuwijk

Wondering if we should just skip evicting Pods that have that toleration

That's not a bad idea. For reference, these are the tolerations on the pod in question:

  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/network-unavailable
    operator: Exists

@sbueringer
Member

Yeah that should be the toleration for the taint that is set by cordon

  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists

The reasoning would be something like: why should we drain/evict Pods that tolerate running on cordoned Nodes and thus can and will simply be re-scheduled on the same Node?

This would also solve other problems if folks are adding this toleration to their Pods and it wouldn't require any configuration / API change.
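A rough sketch of such a filter, reusing the kubedrain helpers and the HasTolerations function from the issue description (just to illustrate the idea, not an actual implementation):

// Sketch: skip eviction for Pods that tolerate the cordon taint, since they
// can and will simply be re-scheduled onto the same (cordoned) Node anyway.
var unschedulableToleration = corev1.Toleration{
	Key:      "node.kubernetes.io/unschedulable",
	Effect:   corev1.TaintEffectNoSchedule,
	Operator: corev1.TolerationOpExists,
}

func skipPodsToleratingUnschedulable(pod corev1.Pod) kubedrain.PodDeleteStatus {
	if HasTolerations(&pod, &unschedulableToleration) {
		return kubedrain.MakePodDeleteStatusSkip()
	}
	return kubedrain.MakePodDeleteStatusOkay()
}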

@sadysnaat
Contributor

Thanks @kreeuwijk @sbueringer

Yeah, that would be a neat idea.

@Kun483
Author

Kun483 commented Aug 9, 2024

Based on the discussion, I updated the issue description to introduce AdditionalFilters that filter out pods with the following toleration:

  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists

@sbueringer
Member

Just a heads up, I'm currently refactoring the drain implementation here: #11074

@fabriziopandini
Member

What about introducing an annotation and making the drain code ignore pods with this annotation?

It is opt-in and explicit; also, ideally it should be a core k8s annotation, respected both by kubectl drain and by the CAPI drain (which is upstream aligned).

But I'm also open to considering a CAPI-specific annotation if it turns out that introducing a core k8s annotation for this is not viable.

@enxebre
Member

enxebre commented Aug 29, 2024

FWIW, in practice it should already be technically possible by using the existing mirrorPodFilter and the mirror annotation (linking the refactor, as we have no vendor directory where this lives today: https://github.com/kubernetes-sigs/cluster-api/pull/11074/files#diff-b537db2a9a72904f2287ac4f091d98f1afd781f62e3064df6babaee73f287a13R188-R193). Though I'm not sure; it might cause some confusion for the kube-apiserver/kubelet, and I don't think we should recommend it.

The cluster-autoscaler uses the safe-to-evict annotation for this: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node

For CAPI it might make sense to use the existing "machine.cluster.x-k8s.io/exclude-node-draining" annotation and let it apply to pods as well.
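To make the comparison concrete, on a Pod this would look roughly like the following (the safe-to-evict annotation exists in cluster-autoscaler today; honoring exclude-node-draining on Pods is only the suggestion above and is not implemented anywhere):

  # Pod metadata excerpt.
  metadata:
    annotations:
      # Existing cluster-autoscaler annotation preventing CA from evicting the pod.
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
      # Suggested (not implemented) reuse of the CAPI machine annotation on pods.
      machine.cluster.x-k8s.io/exclude-node-draining: ""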

In general I'd be interested in finding a way to stop fragmentation on how this problem is solved.

@sbueringer
Member

sbueringer commented Aug 29, 2024

In general I'd be interested in finding a way to stop fragmentation on how this problem is solved.

Yup, that would be nice. It would also allow everyone to start using that one standardized annotation instead of having to use whatever Cluster API, cluster-autoscaler, maybe Karpenter, or maybe managed services like GKE support.

I think the way to get this done is to standardize the annotation in k/k and use kubectl drain as a reference implementation.

At the very least we should open an issue in kubernetes/kubernetes (if none exists) and get feedback from maintainers there before we implement a new annotation in Cluster API.

@kreeuwijk

kreeuwijk commented Sep 2, 2024

There is one more thing to consider, which is the protection of PDBs. When we fully ignore a pod based on its tolerations, this also bypasses the protection that PDBs provide, if they are configured for that workload. For example, for Portworx there is a PDB on the px-storage pods, ensuring no more than one of these pods goes offline in a planned manner. This is useful during repaves, as the PDB ensures the draining node is unable to drain the px-storage pod if a different node that was just added to the cluster has not completed its installation of Portworx yet (this takes 3-7 minutes).

If we ignore those pods completely, the node would not have waited until it was safe to shut down the Portworx pod.

So I think in order to be 100% safe, there should be 2 node drain actions:

  1. An initial node drain where pods with specific tolerations are ignored, allowing other workloads that depend on those (storage) pods to reliably terminate or live-migrate (in the case of Kubevirt VMs).
  2. A final node drain where pods with specific tolerations are NOT ignored, ensuring PDBs that are there to prevent certain workloads from terminating until it's safe to do so can have their intended effect.

While our own platform has the ability to configure arbitrary time intervals between node drains to work around the problem noted above, this is likely not the case for all platforms and hence we should make sure the logic is safe implicitly.

@sadysnaat FYI

@sbueringer
Member

sbueringer commented Sep 2, 2024

The more I think about it, the more I'm against ignoring Pods based on their tolerations. It's absolutely unclear whether the intent of the toleration is that a Pod should not be evicted (tolerating a taint on a Node is not the same as opting into not being evicted during Node drain).

So it's much better to have some explicit way to mark precisely which Pods should not be evicted (e.g. via the annotation suggested above)

@kreeuwijk

kreeuwijk commented Sep 2, 2024

The standard (non-CAPI) node drain actually does what is desired:

  • It tries to evict all workloads except daemonsets etc.
  • This also causes the Portworx pod to get evicted IF it's safe to do so from the PDB
  • While this initially causes a delay in some pods to terminate, the Portworx pod gets rescheduled back onto the draining node by the Portworx operator and at that point the node drain process no longer kills this pod.
  • When the Portworx pod is back up again, any lingering pods in a Terminating state can terminate normally. Same goes for any VMs stuck in live migration, those complete normally as well at that point.

The issue with the CAPI drain process is that it keeps killing the Portworx pod over and over, instead of only just once.

It would still be an improvement over the non-CAPI drain process if the CAPI node drain process had 2 steps though.

@sbueringer
Member

sbueringer commented Sep 2, 2024

Yup and the reason for that is that Cluster API drains the Node until all Pods that should be evicted are evicted (should be == that fall under the eviction criteria (i.e. everything except DaemonSet & mirror Pods)).

Not sure if this is a flaw in Cluster API or kubectl drain. I think the correct way to handle this is to exclude Pods that should not be evicted from the drain (like it's done for DaemonSet & mirror Pods) instead of relying on the fact that drain is only done once.

@kreeuwijk

kreeuwijk commented Sep 2, 2024

@sbueringer the problem is that after the drain completes, CAPI deletes the node (as part of the repave process). So if the excluded Pod was unsafe to exit at that point, this would still create a service interruption. We need to make sure the PDB can do its job.

@sbueringer
Member

sbueringer commented Sep 2, 2024

I think now you lost me.

I was suggesting to introduce a way that allows to exclude Pods entirely from drain / eviction.

Are you saying that these Portworx Pods have to be evicted exactly once? (and not 0 or many times)

@kreeuwijk

Before the node is told to shut down, the excluded pods should be evicted, yes. That's the only way to ensure that, before a node is removed from the cluster, any PDBs that are there to prevent this from happening in unsafe situations have the ability to do so.

@kreeuwijk

kreeuwijk commented Sep 2, 2024

Hence a 2-step drain process would be a nice solution. It can help evict/migrate virtually all the workloads away from the draining node, as preparation for its imminent removal from the cluster. But a secondary run, only to happen after the first drain run has successfully completed, should run without such custom exclusions, in order to ensure that PDBs for the pods in question can have their intended effect. If the PDB for Portworx allows the pod to evict, it is safe to do so and the second drain will complete as well (but this time without any service disruptions, as all apps have already migrated away from this node and Portworx can handle the service shutting down on this node).

@fabriziopandini
Member

fabriziopandini commented Sep 3, 2024

my 2 cents:

I like the current CAPI approach because it ultimately protects workloads by trying to drain everything before removing the node (and drain respects PDBs).

Users might argue this is slightly different from running kubectl drain once, but it is consistent with a user running it continuously until the node is empty or the timeout is reached (which can happen with or without CAPI).

Looking at this from another angle, draining continuously is required to unblock a machine deletion when a pod eviction previously blocked by a PDB becomes possible.

What we can offer to address the specific Portworx use case / similar use cases are extensibility points:

  • If users want to first drain all the pods except Portworx, that is already possible today by using pre-drain hooks (it is true this requires a controller, but with great powers...)
  • If users want to exclude a pod from the CAPI drain, we can add support for an annotation that allows users to opt out from drain. However, as suggested by @enxebre, we should avoid fragmentation on how this problem is solved.

Users can eventually combine both of the above by having a pre-drain hook that selectively adds annotations to pods.
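As a rough sketch of that combination (the hook name and owner value below are just examples; only the pre-drain hook annotation prefix is an existing CAPI mechanism):

  # Machine excerpt: the pre-drain lifecycle hook blocks drain until the owning
  # controller removes this annotation; that controller can annotate the pods it
  # wants the drain to skip before doing so.
  apiVersion: cluster.x-k8s.io/v1beta1
  kind: Machine
  metadata:
    annotations:
      pre-drain.delete.hook.machine.cluster.x-k8s.io/portworx-prep: example-controller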

Considering this, I'm personally -1 to a 2-step drain process.

@kreeuwijk

kreeuwijk commented Sep 3, 2024

@fabriziopandini is it possible to do this in phases?

  • pre-drain hook: add annotation for Portworx pods
  • < main draining process runs >
  • < main draining process completes >
  • post-drain hook: remove annotation for Portworx pods
  • < one final drain runs to also evict the Portworx pods >

@sbueringer
Member

I'm really really hesitant to build a bespoke drain solution for this Portworx case.

I think in general you can do everything you want in pre-drain and then you can decide if the regular drain should include draining the Portworx Pod or not.

Or you can skip the Cluster API drain entirely (via machine.cluster.x-k8s.io/exclude-node-draining) and implement the entire drain yourself.
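For reference, that opt-out is just an annotation on the Machine (as far as I remember the value is ignored; the presence of the key is what matters):

  # Machine metadata excerpt: with this annotation Cluster API skips Node draining entirely.
  metadata:
    annotations:
      machine.cluster.x-k8s.io/exclude-node-draining: ""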

@kreeuwijk

Left or right, this results in the same solution: 2 drain steps. Why force everyone to build their own implementation?
Wouldn't that cause more fragmentation compared to doing this natively in the CAPI drain?

@kreeuwijk

@sadysnaat if we can't get any help from the SIG, we will have to do it ourselves in the product by leveraging the pre-drain hook. It's unfortunate that nobody else would benefit from this, but at least our customers would be happy.

We can start by draining pods with either the kubevirt.io=virt-launcher or kubevirt.io=hotplug-disk label in the pre-drain hook. That should take care of the biggest issue: failing Kubevirt live migrations.

No DrainPodFilter should be configured in this case.

@sbueringer
Member

sbueringer commented Sep 3, 2024

Why force every CAPI user to deal with the complexity of a two-phase drain if, based on current data, Portworx is the only thing that cannot deal with a simple "either drain the Pod or not" semantic? (Plus the corresponding maintenance effort to keep this working for the community.)

(also when it's clearly possible to build a bespoke solution on top of CAPI)

But maybe it's just me. Let's see what other maintainers think (@fabriziopandini @enxebre @vincepri @killianmuldoon @chrischdi)

@kreeuwijk

For most users it wouldn't look like a two-phase drain, since the CAPI drain process already performs continuous draining. The only difference is that if you configured a PodDrainFilter, the pods initially filtered out would drain last. If no filters are set, there would be no difference compared to the current behavior.

@chrischdi
Member

@kreeuwijk : I'd like to understand your use-case some more:

So in your case you'd like to:

  • drain all pods except e.g. your CSI related ones (Portworx)
  • after all is done, drain the CSI related ones (Portworx)

Question: Why is it important to you to do the second round of draining/evicting the Portworx pods?

@kreeuwijk

kreeuwijk commented Sep 3, 2024

@chrischdi

Question: Why is it important to you to do the second round of draining/evicting the Portworx pods?

This is because of the CAPI repave process. Typically, most software-defined storage solutions (Rook-Ceph, Portworx) can tolerate 1 node failure at a time. So when CAPI repaves a node, that is fine as long as the next node in the process isn't removed from the cluster before the newest node has fully initialized and rejoined the storage cluster.

It's worth noting that in these repaves (of bare metal clusters), the data disks are preserved and so when a node is repaved, it will come back after having the OS & K8S reinstalled with its existing additional disks containing all the Portworx data. But it takes 4-10 minutes for Portworx to install & initialize on a fresh node.

To prevent external systems from successfully draining & deleting a node while the storage cluster is already in a degraded state, Portworx defines a PDB. This prevents the px-storage pod from exiting until the degraded state is resolved (by the freshly repaved node successfully rejoining the storage cluster).

Just filtering out the CSI pods from draining entirely would break the protection the PDB provides. Our platform has a "node repave interval" that sets the spec.minReadySeconds of the MachineDeployment object as a workaround to this problem; this way CAPI waits for an arbitrary amount of time before moving on with repaving the next node. But this would not be required if the CAPI drain process concludes with draining the filtered out (storage) pods at the end as well, allowing the PDB to be that final safety barrier.
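For reference, the protection described above is an ordinary PodDisruptionBudget; simplified and with placeholder names and labels (not the actual Portworx manifest), it looks roughly like this:

  apiVersion: policy/v1
  kind: PodDisruptionBudget
  metadata:
    name: px-storage-pdb      # illustrative name
    namespace: portworx       # illustrative namespace
  spec:
    # Allow at most one px-storage pod to be disrupted at a time, so a drain
    # blocks while the storage cluster is still degraded.
    maxUnavailable: 1
    selector:
      matchLabels:
        name: px-storage      # placeholder label; the real selector may differ

The minReadySeconds workaround mentioned above simply sets spec.minReadySeconds on the MachineDeployment to a value large enough to cover the Portworx re-initialization window.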

@fabriziopandini
Member

fabriziopandini commented Sep 3, 2024

Why force everyone to build their own implementation? Wouldn't that cause more fragmentation compared to doing this natively in the CAPI drain?

The CAPI approach to new feature requests is described in https://cluster-api.sigs.k8s.io/user/manifesto#the-complexity-budget. TL;DR: we have to make choices, choices are always the result of an open discussion, and in some cases it is fair to say no (or not now).

Also, let me take the chance to thank everyone investing time and energy in these discussions; we greatly appreciate and respect all the opinions.

@fabriziopandini is it possible to do this in phases? (and following comments about two phase drain)

It is probably technically possible, but the question is whether CAPI is the right place to address this use case, whether the solution is going to fix the entire problem, and whether we have to do this now.

If I look at the Portworx docs, the documented procedure for decommissioning a node does not rely on drain. It suggests cordoning the node and then deleting pods with Portworx volumes one by one. The procedure also requires applying custom labels to nodes.

I'm now wondering if Portworx actually supports drain at all.

The more I think about it, the more it seems to me that the fact that Portworx reportedly works with plain kubectl drain isn't a great argument either, because that statement relies on the assumption that the Portworx pod will come back and will remain there afterwards, which is a weak assumption.

Even if I assume that kubectl drain will be called only once and the Portworx pod that came back will still be there when the node finally gets deleted, draining only once implies that this Portworx pod will go away without checking PDBs again, which does not comply with the hard requirement of tolerating only one px-storage failure at a time.

And if all of this works, according to the documentation it is still required to figure out when/how to apply custom labels and wait for Portworx services to shut down properly.

All considered, I'm still not fully convinced about the 2-step drain proposal + some sort of PodDrainFilter, and a bespoke solution is probably a better fit for this procedure.

I'm also not 100% sure that we are looking at the right layer to fix this issue.

Is this something to fix outside Portworx, with workarounds/ad hoc operational procedures implemented in every lifecycle management tool like CAPI, or should Portworx itself become more drain-friendly?

Maybe what Portworx really needs is an improvement to eviction in core K8s, so it becomes possible to declare that the Portworx pod should be evicted only after the other pods are gone, and a single eviction pass just works without having to consider edge cases caused by Portworx pods coming back...

But of course, I'm looking forward to hearing others' opinions about this.

@kreeuwijk

@fabriziopandini the procedure from Portworx on node decommissioning is not relevant here, as that procedure is for permanent removal of a node, which is not the case here. All we're doing is repaving the nodes, causing each node to temporarily leave the cluster and rejoin 10-20 minutes later (running a newer K8S version) with its original data disks. So during the repave, the cluster just runs with one node fewer for short intervals and then comes back to full availability. The decommissioning procedure would cause Portworx to start resilvering data from lost replicas, which would be highly undesirable during cluster repaves.

I agree that the standard kubectl drain for Portworx isn't great either. While it allows PDB protection, if there is no degraded situation the Portworx pod will exit on the node drain, causing some storage operations on that node to be delayed until the pod gets recreated on this node by the operator. That adds a couple of minutes to the draining procedure that could have been avoided.

Again, I think CAPI drain has an opportunity to provide a better solution than standard kubectl drain, by allowing drains to happen in 2 phases: one drain that evicts all workloads except critical ones, such as software-defined storage pods, and one drain that evicts everything else. This would also benefit solutions such as Rook-Ceph, not only Portworx.

@sbueringer
Member

Thx for the additional context, got it now.

I'll look into it. Just recently refactored the entire drain code. I have some ideas about how to implement this

/assign

@fabriziopandini
Member

fabriziopandini commented Sep 6, 2024

Thinking about this a little bit, there might be a way to generalize this problem without putting us in a corner with a two-phase approach.

What if we think about this problem in terms of drain priority?

If we assume all Pods without a custom drain priority get drain priority 0 by default, and we then apply a custom drain priority of -1 to the Portworx pod, the drain process can work this way:

  • All pods will be evicted first (priority 0)
  • The Portworx pod will be evicted only after all the other Pods (priority -1)

Potentially, drain priority can be used to solve other use cases as well (not sure if the examples actually make sense, but math.MaxInt32 is the limit):

  • drain a pod (priority 1) first, then all the other pods >> drain before instead of drain after like in the Portworx example above
  • drain front-ends (priority 3) first, then microservices (priority 2), then DBs (priority 1), then all the other pods etc. >> many ordered "drain groups"

This approach will allow us to preserve the current approach, with a drain that spans many reconciles until all pods in scope are gone, which is simpler for users to reason about, simpler to represent in status via conditions, simpler to monitor, and also simpler to implement and test.

Only the "list pods" inside the above drain process must be enhanced in order to group pods by "drain priority", and to return only the "drain group" with highest priority (with the other "drain groups" deferred to next reconcile/when the former "drain group" with highest priority is empty)

@kreeuwijk

kreeuwijk commented Sep 9, 2024

@fabriziopandini yes that works very well, assuming the user would have a straightforward way to configure drain priorities. Note that a new drain priority label/annotation on pods would not be very helpful, as the user doesn't always have that much control over the pod's configuration. E.g. for anything that works with an operator (such as Portworx but also many other apps), the operator creates the pods in question and the user has very limited control over those pods.

A ConfigMap resource in the CAPI namespace that provides a custom map of pod selectors to drain priorities could work though.
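Purely as an illustration of the idea, such a ConfigMap could look like this (name, namespace, and data format are all made up; nothing like this exists in CAPI today):

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: drain-priorities      # hypothetical name
    namespace: capi-system      # hypothetical namespace
  data:
    # Hypothetical format: each entry maps a pod label selector to a drain
    # priority; pods that match nothing keep the default priority 0.
    portworx: |
      selector:
        matchLabels:
          name: portworx
      priority: -1
    kubevirt-vms: |
      selector:
        matchLabels:
          kubevirt.io: virt-launcher
      priority: 1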

@fabriziopandini
Member

Renamed the issue to reflect the fact that we are now looking at this problem from a different PoV; also, given that we are looking into this, there are a few other use cases:

  • Exclude pods entirely from drain
  • Force drain for pods usually not drained (E.g. Daemonsets)
  • Better handling of volume detach (mostly for pods which are not drained, but TBD)

@fabriziopandini fabriziopandini changed the title Add NodeDrainPodFilters to filter out certain pods when draining nodes Provide a better solution than standard kubectl drain Sep 10, 2024
@fabriziopandini fabriziopandini added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Sep 11, 2024
@k8s-ci-robot k8s-ci-robot removed needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 11, 2024
@sftim
Contributor

sftim commented Sep 26, 2024

Partly, I think we're asking node.kubernetes.io/unschedulable to carry a lot of load and meaning.

One part of how I'd approach this is to define more taints:

  • domain.example/under-maintenance (there is maintenance active)
  • domain.example/degraded (something, possibly maintenance, is harming normal operation; stronger than “don't schedule new things here“)
  • domain.example/terminating (not only is this node shutting down, it's not coming back afterwards)
    For example, if we delete a MachinePool, the InfrastructureMachinePool might have a hook that marks the machines in that pool as being due for removal. In this case, maybe there's no need to delete some daemon-like Pods. Deleting the machine and garbage collecting the Pods on its associated node might be enough.

BTW we can use Kubernetes domain names for the real taints. I didn't use official ones to emphasize that these are suggestions at this point.
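Expressed as node taints, that might look like this (the keys are the placeholder names above; the effects are just one possible choice):

  # Node spec excerpt with the suggested (placeholder) taints.
  spec:
    taints:
    - key: domain.example/under-maintenance
      effect: NoSchedule
    - key: domain.example/degraded
      effect: NoSchedule
    - key: domain.example/terminating
      effect: NoExecute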

Other parts of Kubernetes have similar needs. For example, when Karpenter consolidates several nodes onto one cheaper node, the new NodeClaim gets created and brought up, and then Karpenter triggers draining on the other nodes. Ideally, Karpenter would have a way to signal its intent declaratively, but right now Karpenter needs to be allowed to delete arbitrary Pods and uses its own drain logic.

@sftim
Contributor

sftim commented Sep 26, 2024

ℹ️ this is painting a picture of the future, aiming to influence what we eventually design

Rather than label Pods with a drain priority, we could allow a workload controller to take that responsibility. Two options here:

  1. No special label on a StatefulSet; statefulset controller labels its owned pods (eg as example.com/managed-drain-behavior: "false")
    When one of these Pods ends up on a node that needs draining, kube-controller-manager's drain controller notices the label, skips selecting the Pod, and allows the StatefulSet to handle the drain directly. The statefulset controller takes whatever action is appropriate to gracefully delete the pod that's on the node being drained, and from there on it takes its normal behavior to achieve desired state (and the scheduler avoids placing the Pod on the node being drained).
  2. StatefulSet itself is labelled as example.com/managed-drain-behavior: "false"
    In this story, the kube-controller-manager doesn't do any draining. Nor does the StatefulSet / Job / DaemonSet. Some other controller spots that the Node is tainted / draining / whatever and triggers Pod termination however it wants to. Maybe the custom controller signals the Pod to fail its readiness checks and otherwise carry on as normal. Maybe the custom controller instructs a Job worker not to fetch any more tasks from an internal pool. Maybe the custom controller deletes the Pods for a StatefulSet based on their ordinal. Whatever.

To make this work, SIG Apps would need to update the implementations for StatefulSet and friends so that they are more aware of node drain markers.

You could also do a drain for a node where the taint is PreferNoSchedule for the first 28 minutes and then NoExecute + NoSchedule for the final 120 seconds. Controllers, including custom code, should be able to cope - even if they end up having to handle more than one drain event for the same app.

@fabriziopandini
Member

Those are interesting ideas, even if most of them aren't in the CAPI realm, so it might be more appropriate to discuss them somewhere else.

Overall, I think we all agree that Cluster API can really benefit from improved drain/node maintenance primitives in K8s.
But at the same time there will always be a gap between K8s, which focuses on a single cluster, and CAPI's multi-cluster management nature.

In other words, there are two sides to this problem. In this issue we are looking at the side that is in the scope of Cluster API, but in parallel we are trying to stay in the loop on the discussion in k/k and possibly bring valuable contributions to it.

@sbueringer
Member

Hey folks,
I have opened a new issue & proposal PR to address this issue (among others).
Please take a look at the proposal: #11241

I think this issue will be addressed by the proposal, so I'll close this issue.

Please let us know on the proposal PR if you have any feedback or if you think this issue would not be covered by the proposal.

/close

@k8s-ci-robot
Contributor

@sbueringer: Closing this issue.

In response to this:

Hey folks,
I have opened a new issue & proposal PR to address this issue (among others).
Please take a look at the proposal: #11241

I think this issue will be addressed by the proposal, so I'll close this issue.

Please let us know on the proposal PR if you have any feedback or if you think this issue would not be covered by the proposal.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
