Provide a better solution than standard kubectl drain #11024
Comments
Q: Shouldn't pods like this be deployed via a DaemonSet?
A: That's not how Portworx does it, I'm afraid. We have no control over that. Portworx is deployed through an operator and there is no DaemonSet for the main Portworx worker pods. Instead, the operator creates these pods itself on the nodes it deems need to participate in the Portworx storage cluster.
So that means that Portworx clusters are in general not compatible with …
I fear that's true @sbueringer, Portworx will fail a regular drain. @kreeuwijk, is there any guide around node drain options? Maybe Portworx has some advanced options. But regardless of how Portworx works, @sbueringer, would it make sense to have pod filter functionality added to Cluster API, as it is part of …
No no, Portworx is perfectly fine with normal node drains. It's just the CAPI-based node drain that is a problem, as Portworx expects to be able to restart the Portworx worker pod on a drained node. That pod has tolerations that allow it to come back up on an already drained node. It also just connects to the "real" systemd … Hence, we need to be able to inform the CAPI node drainer to purposely ignore pods with the … Portworx is not going to change how their product works for this; in their eyes the problem is with CAPI, not with them.
**What this PR does / why we need it**: The PR adds support to filter out certain pods during the node draining process.

**Which issue(s) this PR fixes**: Fixes kubernetes-sigs#11024
Wondering if we should just skip evicting Pods that have that toleration.

Heads up: I'm planning to refactor the whole drain logic in the next few weeks. Basically, it's just a bad idea to use the drain helper in a controller.
That's not a bad idea. For reference, these are the tolerations on the pod in question:
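For illustration, a minimal sketch (in Go, using the typed Kubernetes API) of a toleration for the cordon taint `node.kubernetes.io/unschedulable`; the actual Portworx pod carries additional tolerations that are not reproduced here.

```go
package drainexample

import corev1 "k8s.io/api/core/v1"

// cordonToleration lets a pod be (re)scheduled onto a cordoned node, i.e. one
// carrying the node.kubernetes.io/unschedulable:NoSchedule taint that is added
// when a node is marked unschedulable.
var cordonToleration = corev1.Toleration{
	Key:      "node.kubernetes.io/unschedulable",
	Operator: corev1.TolerationOpExists,
	Effect:   corev1.TaintEffectNoSchedule,
}
```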
Yeah that should be the toleration for the taint that is set by cordon
The reasoning would be something like: why should we drain/evict Pods that tolerate running on cordoned Nodes and thus can and will simply be re-scheduled on the same Node? This would also solve other problems if folks add this toleration to their Pods, and it wouldn't require any configuration / API change.
Thanks @kreeuwijk @sbueringer. Yeah, that would be a neat idea.
Based on the discussion, I made a change to the issue description to introduce `AdditionalFilters` to filter out pods that have such a toleration.
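A minimal sketch of how such a filter could plug into the kubectl drain helper's `AdditionalFilters` hook; the toleration check and the decision to skip are assumptions based on the discussion above, not merged Cluster API behavior.

```go
package drainexample

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/kubectl/pkg/drain"
)

// skipCordonTolerating skips pods that tolerate the cordon taint, since they
// would simply be re-scheduled onto the drained node anyway.
func skipCordonTolerating(pod corev1.Pod) drain.PodDeleteStatus {
	for _, t := range pod.Spec.Tolerations {
		if t.Key == "node.kubernetes.io/unschedulable" && t.Effect == corev1.TaintEffectNoSchedule {
			return drain.MakePodDeleteStatusSkip()
		}
	}
	return drain.MakePodDeleteStatusOkay()
}

func newDrainHelper() *drain.Helper {
	return &drain.Helper{
		// ... client, timeouts and other drain options go here ...
		AdditionalFilters: []drain.PodFilter{skipCordonTolerating},
	}
}
```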
Just a heads up, I'm currently refactoring the drain implementation here: #11074
What about introducing an annotation and making the drain code ignore pods with this annotation? It is opt-in and explicit; also, ideally it should be a core k8s annotation, and it should be respected both by kubectl drain and by CAPI drain (which is upstream-aligned). But I'm also open to considering a CAPI-specific annotation if it turns out that introducing a core k8s annotation for this is not viable.
FWIW, in practice it should already be technically possible by using the existing mirrorPodFilter and the mirror annotation (linking the refactor, as we have no vendor directory where this lives today in git: https://github.com/kubernetes-sigs/cluster-api/pull/11074/files#diff-b537db2a9a72904f2287ac4f091d98f1afd781f62e3064df6babaee73f287a13R188-R193). Though I'm not sure whether that would cause confusion for kas/kubelet, and I don't think we should recommend it. The autoscaler uses a safe-to-evict annotation for this: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node. For CAPI it might make sense to use the existing "machine.cluster.x-k8s.io/exclude-node-draining" annotation and let it apply to pods. In general, I'd be interested in finding a way to stop fragmentation in how this problem is solved.
Yup, that would be nice. It would also allow everyone to start using that one standardized annotation instead of having to use whatever (Cluster API, cluster-autoscaler, maybe Karpenter, maybe managed services like GKE, ...) supports. I think the way to get this done is to standardize the annotation in k/k and use … At the very least we should open an issue in kubernetes/kubernetes (if none exists) and get feedback from maintainers there before we implement a new annotation in Cluster API.
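As a sketch of what such an opt-out check could look like: the cluster-autoscaler annotation below exists today, while the "standardized" key is purely hypothetical, since no core Kubernetes annotation for this exists at this point in the discussion.

```go
package drainexample

import corev1 "k8s.io/api/core/v1"

const (
	// Existing, tool-specific opt-out honoured by cluster-autoscaler.
	safeToEvictAnnotation = "cluster-autoscaler.kubernetes.io/safe-to-evict"
	// Hypothetical name for a standardized core annotation; it does not exist in k/k.
	skipEvictionAnnotation = "node.kubernetes.io/exclude-from-drain"
)

// excludedFromDrain reports whether the pod opts out of eviction during drain.
func excludedFromDrain(pod corev1.Pod) bool {
	if pod.Annotations[safeToEvictAnnotation] == "false" {
		return true
	}
	_, ok := pod.Annotations[skipEvictionAnnotation]
	return ok
}
```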
There is one more thing to consider, which is the protection of PDBs. When we fully ignore a pod based on its tolerations, this also bypasses the protection that PDBs provide, if they are configured for that workload. For example, for Portworx there is a PDB on the px-storage pods (a sketch follows below), ensuring no more than 1 of these pods goes offline in a planned manner. This is useful during repaves, as the PDB ensures the draining node cannot evict its px-storage pod if a different node that was just added to the cluster has not yet completed its installation of Portworx (this takes 3-7 minutes). If we ignore those pods completely, the node would not wait until it was safe to shut down the Portworx pod. So I think, in order to be 100% safe, there should be 2 node drain actions: …
While our own platform has the ability to configure arbitrary time intervals between node drains to work around the problem noted above, this is likely not the case for all platforms, and hence we should make sure the logic is implicitly safe. @sadysnaat FYI
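For reference, a minimal sketch of the kind of PDB described above; the namespace and label selector are assumptions, not taken from the Portworx operator.

```go
package drainexample

import (
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// pxStoragePDB allows at most one px-storage pod to be disrupted at a time,
// so a drain blocks while the storage cluster is still degraded.
func pxStoragePDB() *policyv1.PodDisruptionBudget {
	maxUnavailable := intstr.FromInt32(1)
	return &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "px-storage", Namespace: "portworx"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MaxUnavailable: &maxUnavailable,
			Selector:       &metav1.LabelSelector{MatchLabels: map[string]string{"name": "px-storage"}},
		},
	}
}
```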
The more I think about it, the more I'm against ignoring Pods based on their tolerations. It's absolutely unclear whether the intent of the toleration is that a Pod should not be evicted (tolerating a taint on a Node is not the same as opting out of being evicted during Node drain). So it's much better to have some explicit way to mark precisely which Pods should not be evicted (e.g. via the annotation suggested above).
The standard (non-CAPI) node drain actually does what is desired: …
The issue with the CAPI drain process is that it keeps killing the Portworx pod over and over, instead of just once. It would still be an improvement over the non-CAPI drain process if the CAPI node drain process had 2 steps, though.
Yup, and the reason for that is that Cluster API drains the Node until all Pods that should be evicted are evicted ("should be" == those that fall under the eviction criteria, i.e. everything except DaemonSet & mirror Pods). Not sure if this is a flaw in Cluster API or …
@sbueringer the problem is that after the drain completes, CAPI deletes the node (as part of the repave process). So if the excluded Pod was unsafe to exit at that point, this would still create a service interruption. We need to make sure the PDB can do its job.
I think now you lost me. I was suggesting to introduce a way to exclude Pods entirely from drain / eviction. Are you saying that these Portworx Pods have to be evicted exactly once (and not 0 or many times)?
Before the node is told to shut down, the excluded pods should be evicted, yes. That's the only way to ensure that before a node is removed from the cluster, any PDBs that are there to prevent this from happening in unsafe situations have the ability to do so.
Hence a 2-step drain process would be a nice solution. It can help evict/migrate virtually all the workloads away from the draining node, as preparation for its imminent removal from the cluster. But a secondary run, only to happen after the first drain run has successfully completed, should run without such custom exclusions, in order to ensure that PDBs for the pods in question can have their intended effect. If the PDB for Portworx allows the pod to evict, it is safe to do so and the second drain will complete as well (but this time without any service disruptions, as all apps have already migrated away from this node and Portworx can handle the service shutting down on this node).
My 2 cents: I like the current CAPI approach because it ultimately protects workloads by trying to drain everything before removing the node (and drain respects PDBs). Users might argue this is slightly different from running kubectl drain once, but it is consistent with a user running it continuously until the node is empty or the timeout is reached (which might happen with or without CAPI). Looking at this from another angle, draining continuously is required to unblock a machine deletion when a pod eviction previously blocked by a PDB becomes possible. What we can offer to address the specific Portworx use case / similar use cases are extensibility points: …
Users can eventually combine both of the above, by having a pre-drain hook that selectively adds annotations to pods. Considering this, I'm personally -1 on a 2-step drain process.
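A sketch of how the two Machine-level extensibility points mentioned in this thread could be combined by an external controller. The `exclude-node-draining` annotation is the one referenced earlier in the thread; the pre-drain hook key follows CAPI's documented machine deletion hook convention, with a made-up hook name and owner value.

```go
package drainexample

const (
	// Skips CAPI's node drain for this Machine entirely.
	excludeNodeDrainingAnnotation = "machine.cluster.x-k8s.io/exclude-node-draining"
	// A pre-drain deletion hook: CAPI waits for the owning controller to remove
	// this annotation before starting the drain. "storage-migration" is illustrative.
	preDrainHookAnnotation = "pre-drain.delete.hook.machine.cluster.x-k8s.io/storage-migration"
)

// annotateMachine registers a pre-drain hook on the Machine's annotations and,
// if desired, opts the Machine out of CAPI's drain altogether.
func annotateMachine(annotations map[string]string, skipCAPIDrain bool) {
	annotations[preDrainHookAnnotation] = "storage-operator"
	if skipCAPIDrain {
		annotations[excludeNodeDrainingAnnotation] = "true"
	}
}
```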
@fabriziopandini is it possible to do this in phases?
I'm really, really hesitant to build a bespoke drain solution for this Portworx case. I think in general you can do everything you want in pre-drain, and then you can decide if the regular drain should include draining the Portworx Pod or not. Or you can skip the Cluster API drain entirely (via …)
Left or right, this results in the same solution: 2 drain steps. Why force everyone to build their own implementation?
@sadysnaat if we can't get any help from the SIG, we will have to do it ourselves in the product by leveraging the … We can start by draining pods with either the … No DrainPodFilter should be configured in this case.
Why force every CAPI user to deal with the complexity of a two-phase drain if, based on current data, Portworx is the only thing that cannot deal with a simple "either drain the Pod or not" semantic? (+ the corresponding maintenance effort to keep this working for the community) (also when it's clearly possible to build a bespoke solution on top of CAPI) But maybe it's just me. Let's see what other maintainers think (@fabriziopandini @enxebre @vincepri @killianmuldoon @chrischdi)
For most users it wouldn't look like a two-phase drain, since the CAPI drain process already performs continuous draining. The only difference is that if you configured a PodDrainFilter, the pods initially filtered out would drain last. If no filters are set, there would be no difference compared to the current behavior.
@kreeuwijk: I'd like to understand your use case some more. So in your case you'd like to: …
Question: Why is it important to you to do the second round of draining/evicting the Portworx pods?
This is because of the CAPI repave process. Typically, most software-defined storage solutions (Rook-Ceph, Portworx) can tolerate 1 node failure at a time. So when CAPI repaves a node, that is fine as long as the next node in the process isn't removed from the cluster before the newest node has fully initialized and rejoined the storage cluster. It's worth noting that in these repaves (of bare metal clusters), the data disks are preserved, so when a node is repaved it comes back, after having the OS & K8s reinstalled, with its existing additional disks containing all the Portworx data. But it takes 4-10 minutes for Portworx to install & initialize on a fresh node. To prevent external systems from successfully draining & deleting a node while the storage cluster is already in a degraded state, Portworx defines a PDB. This prevents the px-storage pod from exiting until the degraded state is resolved (by the freshly repaved node successfully rejoining the storage cluster). Just filtering out the CSI pods from draining entirely would break the protection the PDB provides. Our platform has a "node repave interval" that sets the …
The CAPI approach to new feature requests is described in https://cluster-api.sigs.k8s.io/user/manifesto#the-complexity-budget. TL;DR: we have to make choices, choices are always the result of an open discussion, and in some cases it is fair to say no (or not now). Also, let me take the chance to thank everyone investing time and energy in these discussions; we greatly appreciate and respect all the opinions.
It is probably technically possible, but the questions are whether CAPI is the right place to address this use case, whether the solution is going to fix the entire problem, and whether we have to do this now.

If I look at the Portworx docs, the documented procedure for decommissioning a node does not rely on drain. It suggests cordoning the node and then deleting pods with Portworx volumes one by one. The procedure also requires applying custom labels to nodes. I'm now wondering if Portworx actually supports drain at all.

The more I think about it, the more it seems to me that the fact that Portworx reportedly works with plain kubectl drain isn't a great argument either, because that statement relies on the assumption that the Portworx pod will come back and will remain there afterwards, which is a weak assumption. Even if I assume that kubectl drain will be called only once and the Portworx pod that came back will still be there when the node gets finally deleted, draining only once implies that this Portworx pod will go away without checking PDBs (again), which does not comply with the hard requirement of tolerating only 1 failure of px-storage at a time. And even if all of this works, according to the documentation it is still required to figure out when/how to apply custom labels and to wait for Portworx services to shut down properly.

All considered, I'm still not fully convinced about the 2-step drain proposal + some sort of PodDrainFilter, and probably a bespoke solution is a better fit for this procedure. I'm also not 100% sure that we are looking at the right layer to fix this issue. Is this something to fix outside Portworx, with workarounds / ad hoc operational procedures implemented in any lifecycle management tool like CAPI, or should Portworx itself become more drain-friendly? Maybe what Portworx really needs is an improvement to eviction in core K8s, so it would be possible to declare that the Portworx pod should only be evicted after the other pods are gone, so a single eviction just works without the need to consider edge cases due to Portworx pods coming back... But of course, looking forward to hearing others' opinions about this.
@fabriziopandini the procedure from Portworx on node decommissioning is not relevant here, as that procedure is for permanent removal of a node, which is not the case here. All we're doing is repaving the nodes, causing each node to temporarily leave the cluster and rejoin 10-20 minutes later (running a newer K8s version) with the original data disks. So during the repave, the cluster just runs with 1 node less for short intervals and then comes back to full availability. The decommissioning procedure would cause Portworx to start resilvering data from lost replicas, which would be highly undesirable during cluster repaves. I agree that the standard … Again, I think CAPI drain has an opportunity to provide a better solution than standard …
Thx for the additional context, got it now. I'll look into it. I just recently refactored the entire drain code and have some ideas about how to implement this. /assign
Thinking about it a little, maybe there is a way to generalise this problem without putting us in a corner with a two-phase approach. What if we think about this problem in terms of drain priority? If we assume all Pods without a custom drain priority specified get drain priority 0 by default, and we then apply a custom drain priority of -1 to the Portworx pod, the drain process can work this way: …
Potentially, drain priority can be used to solve other use cases as well (not sure if the examples actually make sense, but math.MaxInt32 is the limit): …
This approach would allow us to preserve the current approach, with a drain that spans many reconciles until all pods in scope are gone, which is simpler for users to reason about, simpler to represent in status via conditions, simpler to monitor, and also simpler to implement and test. Only the "list pods" step inside the above drain process must be enhanced to group pods by "drain priority" and to return only the "drain group" with the highest priority (with the other "drain groups" deferred to the next reconcile, once the former highest-priority "drain group" is empty); see the sketch below.
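A minimal sketch of that "list pods grouped by drain priority" step; the annotation used to look up a pod's priority is hypothetical, since how priorities would actually be assigned is still an open question at this point in the thread.

```go
package drainexample

import (
	"sort"
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// drainPriority returns the pod's drain priority, defaulting to 0.
// The annotation key is a placeholder, not an agreed-upon API.
func drainPriority(pod corev1.Pod) int {
	if v, ok := pod.Annotations["drain.example.io/priority"]; ok {
		if p, err := strconv.Atoi(v); err == nil {
			return p
		}
	}
	return 0
}

// nextDrainGroup returns only the pods in the highest drain-priority group;
// lower-priority groups are deferred to later reconciles, once this group is empty.
func nextDrainGroup(pods []corev1.Pod) []corev1.Pod {
	groups := map[int][]corev1.Pod{}
	for _, p := range pods {
		prio := drainPriority(p)
		groups[prio] = append(groups[prio], p)
	}
	prios := make([]int, 0, len(groups))
	for prio := range groups {
		prios = append(prios, prio)
	}
	if len(prios) == 0 {
		return nil
	}
	sort.Sort(sort.Reverse(sort.IntSlice(prios)))
	return groups[prios[0]]
}
```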
@fabriziopandini yes, that works very well, assuming the user has a straightforward way to configure drain priorities. Note that a new drain-priority label/annotation on pods would not be very helpful, as the user doesn't always have that much control over the pod's configuration. E.g. for anything that works with an operator (such as Portworx, but also many other apps), the operator creates the pods in question and the user has very limited control over those pods. A ConfigMap resource in the CAPI namespace that provides a custom map of pod selectors to drain priorities could work, though.
Renamed the issue to reflect the fact that we are now looking at this problem from a different PoV; also, given that we are looking into this, there are a few other use cases: …
Partly, I think we're asking … One part of how I'd approach this is to define more taints: …
BTW, we can use Kubernetes domain names for the real taints. I didn't use official ones, to emphasize that these are suggestions at this point. Other parts of Kubernetes have similar needs. For example, when Karpenter consolidates several nodes onto one cheaper node, the new NodeClaim gets created and brought up, and then Karpenter triggers draining on the other nodes. Ideally, Karpenter would have a way to signal its intent declaratively, but right now Karpenter needs to be allowed to delete arbitrary Pods and uses its own drain logic.
ℹ️ This is painting a picture of the future, aiming to influence what we eventually design. Rather than label Pods with a drain priority, we could allow a workload controller to take that responsibility. Two options here: …
To make this work, SIG Apps would need to update the implementations for StatefulSet and friends so that they are more aware of node drain markers. You could also do a drain for a node where the taint is PreferNoSchedule for the first 28 minutes and then NoExecute + NoSchedule for the final 120 seconds (see the sketch below). Controllers, including custom code, should be able to cope - even if they end up having to handle more than one drain event for the same app.
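A sketch of that escalating-taint idea, using a placeholder taint key since the comment above deliberately avoids proposing official names.

```go
package drainexample

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// drainTaints returns the taints to apply to a draining node, escalating from a
// soft "prefer not to schedule here" taint to hard NoSchedule/NoExecute taints
// for the final 120 seconds before the drain deadline. The key is a placeholder,
// not an official Kubernetes taint.
func drainTaints(deadline, now time.Time) []corev1.Taint {
	const key = "example.com/draining"
	if now.Before(deadline.Add(-120 * time.Second)) {
		return []corev1.Taint{{Key: key, Effect: corev1.TaintEffectPreferNoSchedule}}
	}
	return []corev1.Taint{
		{Key: key, Effect: corev1.TaintEffectNoSchedule},
		{Key: key, Effect: corev1.TaintEffectNoExecute},
	}
}
```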
Those are interesting ideas, even if most of them aren't in the CAPI realm, so it might be more appropriate to discuss them somewhere else. Overall, I think we all agree that Cluster API can really benefit from improved drain / node maintenance primitives in K8s. In other words, there are two sides to this problem: in this issue we are looking at the side in scope for Cluster API, but in parallel we are trying to stay in the loop on the discussion in k/k and possibly bring valuable contributions to it.
Hey folks, I think this issue will be addressed by the proposal, so I'll close this issue. Please let us know on the proposal PR if you have any feedback or if you think this issue would not be covered by the proposal. /close
@sbueringer: Closing this issue.
What would you like to be added (User Story)?
As a developer, I would like to be able to filter out certain pods (i.e. certain pods won't be deleted) during the node draining process.
Detailed Description
I used Portworx as a distributed storage solution with container storage interface (CSI) drivers. When I upgraded my cluster, it caused "repaving" — replacing old nodes in the cluster one by one with new nodes that have the new desired state in place. While worker machines were repaving, CAPI continuously killed the `px-<cluster-name>` pods, which caused the CSI plugin to never get re-registered and the node to hang forever. Therefore, cluster upgrades got stuck.

In this case, if we can have a field in the machine spec to filter out the pods that we don't want CAPI to delete during the node draining process, then pods like `px-<cluster-name>` can be re-registered and repaving will complete successfully. As discussed, we may also need to not delete pods that have such a toleration, using a helper function.

We may add a field called `NodeDrainPodFilters` in `MachineSpec`. We can also add this field in the `KubeadmControlPlaneTemplateMachineTemplate` struct:

```go
NodeDrainPodFilters *metav1.LabelSelector `json:"nodeDrainPodFilters,omitempty"`
```
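A sketch of how such a selector could be applied when building the list of pods to evict; the helper below is hypothetical and only illustrates the proposed semantics of `NodeDrainPodFilters`, which is not an accepted CAPI API.

```go
package drainexample

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// filterPodsForEviction drops pods whose labels match the NodeDrainPodFilters
// selector, so they are left alone during node draining.
func filterPodsForEviction(pods []corev1.Pod, filters *metav1.LabelSelector) ([]corev1.Pod, error) {
	if filters == nil {
		return pods, nil
	}
	sel, err := metav1.LabelSelectorAsSelector(filters)
	if err != nil {
		return nil, err
	}
	evict := make([]corev1.Pod, 0, len(pods))
	for _, p := range pods {
		// Skip (do not evict) pods selected by the filter.
		if sel.Matches(labels.Set(p.Labels)) {
			continue
		}
		evict = append(evict, p)
	}
	return evict, nil
}
```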
Anything else you would like to add?
No response
Label(s) to be applied
/kind feature