PersistentVolumes stuck after node consolidation / termination #944

Closed
jmdeal opened this issue Jan 17, 2024 · 14 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@jmdeal
Member

jmdeal commented Jan 17, 2024

Description

Observed Behavior:
After Karpenter performs a scale-down event, PersistentVolumes may fail to detach from the deleted node. As a result, pods that depend on these PersistentVolumes are left in a Pending state until the Attach/Detach controller force-detaches the volumes after 6 minutes.

This is related to the following upstream Kubernetes issue:

While this is an upstream issue, it has had real impact on Karpenter users and could potentially be addressed as part of Karpenter's node drain logic.

Expected Behavior:
All persistent volumes should be detached from the node as part of the termination process.

Additional References:

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@jmdeal jmdeal added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 17, 2024
@jmdeal
Member Author

jmdeal commented Jan 17, 2024

On the AWS side this has been fixed via this PR to the aws-ebs-csi-driver, at least for graceful terminations: kubernetes-sigs/aws-ebs-csi-driver#1736. However, this doesn't address the issue for CSI drivers at large. Curious if any of the Azure folks have run into similar issues (cc @tallaxes @Bryce-Soghigian @jackfrancis)?
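
If you want to check whether your cluster already carries that driver-side fix, one way is to look for the preStop hook the PR added to the node DaemonSet. A quick sketch (the DaemonSet and container names ebs-csi-node / ebs-plugin are assumptions based on a standard EKS add-on or Helm install):

# Print the preStop lifecycle hook (if any) on the EBS CSI node plugin container.
# An empty result suggests the installed driver predates kubernetes-sigs/aws-ebs-csi-driver#1736.
kubectl -n kube-system get daemonset ebs-csi-node \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="ebs-plugin")].lifecycle.preStop}'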

@jmdeal
Member Author

jmdeal commented Jan 17, 2024

/remove-label needs-triage

@k8s-ci-robot
Contributor

@jmdeal: The label(s) /remove-label needs-triage cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda, refactor. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/remove-label needs-triage

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jonathan-innis jonathan-innis removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jan 18, 2024
@jmdeal
Member Author

jmdeal commented Feb 13, 2024

We've got some updates to add here, at least for AWS users. This problem can be largely mitigated by ensuring your nodes are configured for graceful node shutdown. Currently this isn't enabled by default on AL2; the following UserData should be sufficient to enable it:

#!/bin/bash
# Allow systemd-logind to delay shutdown long enough for kubelet's graceful shutdown
echo -e "InhibitDelayMaxSec=45\n" >> /etc/systemd/logind.conf
systemctl restart systemd-logind

# Enable kubelet graceful node shutdown in the AL2 kubelet config
echo "$(jq ".shutdownGracePeriod=\"45s\"" /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json
echo "$(jq ".shutdownGracePeriodCriticalPods=\"15s\"" /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json

Note: This is specific to the AL2 EKS optimized AMI; enabling graceful shutdown will differ depending on your AMI.
There may still be some delay in EBS detachment, but this should eliminate the 6-minute detach delay caused by the Kubernetes side of things.
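
For anyone applying this, a quick way to confirm the settings took effect is to check the same AL2 paths on the node itself (e.g. via SSM, or kubectl debug plus chroot /host); this is just a sanity-check sketch:

# The appended logind override should be present
grep InhibitDelayMaxSec /etc/systemd/logind.conf

# kubelet should now report the graceful shutdown settings
jq '.shutdownGracePeriod, .shutdownGracePeriodCriticalPods' /etc/kubernetes/kubelet/kubelet-config.json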

@levanlongktmt

@jmdeal I tried using the userdata on AL2023 and the 6-minute hang still happens 😢

@johnjeffers

Any suggestions on the correct userdata for EKS Bottlerocket AMIs? Can't use what's suggested above because Bottlerocket doesn't have the systemd-logind service.

@jmdeal
Member Author

jmdeal commented Apr 9, 2024

The AWS provider has supported configuring graceful shutdown for Bottlerocket since v0.31.0 (aws/karpenter-provider-aws#4571), and as far as I can tell Bottlerocket should have systemd-logind (bottlerocket-os/bottlerocket#3308). I'm not sure if any additional configuration for systemd-logind is needed for Bottlerocket; I'm going to wager no based on what I've briefly read.

@jmdeal
Member Author

jmdeal commented Apr 10, 2024

@levanlongktmt apologies for missing your question; you likely need to handle the kubelet configuration through the nodeadm config rather than bash. This should work:

    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="//"

    --//
    Content-Type: application/node.eks.awsContent-Type: application/node.eks.aws

    apiVersion: node.eks.aws/v1alpha1
    kind: NodeConfig
    spec:
      kubelet:
        config:
          shutdownGracePeriod: 45s
          shutdownGracePeriodCriticalPods: 15s
    --//
    Content-Type: text/x-shellscript; charset="us-ascii"

    #!/bin/bash
    echo -e "InhibitDelayMaxSec=45\n" >> /etc/systemd/logind.conf
    systemctl restart systemd-logind
    --//--

@toVersus

@jmdeal
Thanks for sharing the example userdata! However, it contains some typos in the configuration. I can confirm that the following userdata works perfectly.

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2023
  (...)
  userData: |
    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="//"

    --//
    Content-Type: application/node.eks.aws

    apiVersion: node.eks.aws/v1alpha1
    kind: NodeConfig
    spec:
      kubelet:
        config:
          shutdownGracePeriod: 45s
          shutdownGracePeriodCriticalPods: 15s
    --//
    Content-Type: text/x-shellscript; charset="us-ascii"

    #!/bin/bash
    echo -e "InhibitDelayMaxSec=45\n" >> /etc/systemd/logind.conf
    systemctl restart systemd-logind
    --//

A Karpenter node is configured as expected:

kubectl debug -it node/ip-172-30-20-187.ap-northeast-1.compute.internal --image=alpine

/ # chroot /host
[root@ip-172-30-20-187 /]# cat /etc/systemd/logind.conf | grep InhibitDelayMaxSec
#InhibitDelayMaxSec=5
InhibitDelayMaxSec=45
[root@ip-172-30-20-187 /]# cat /etc/kubernetes/kubelet/config.json | grep shutdownGracePeriod
    "shutdownGracePeriod": "45s",
    "shutdownGracePeriodCriticalPods": "15s",

@jonathan-innis
Member

After some discussion with the EBS CSI Driver team, I think the real fix here is for the client-side drain behavior to actually wait on volume detachment, since that happens asynchronously. Realistically, this would also be a problem for Cluster Autoscaler if it doesn't implement similar logic, so we should think about how to solve this generally across SIG Autoscaling if we can.

We're receptive to someone taking this change on and implementing it. The change would be to wait after we drain here: https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/node/termination/controller.go#L86. Realistically, for this to work, we need to make sure to avoid draining the EBS CSI driver DaemonSet, which means that EBS shouldn't tolerate the Karpenter taint so that it sticks around for the entire lifetime of the node. Either that, or it gets drained as part of the termination flow and keeps its pre-stop logic like it has today.
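
To make the "wait on volume detachment" idea concrete: the condition the termination flow would wait on is that no VolumeAttachment objects still reference the node being deleted. A rough manual equivalent of that check (the node name below is a placeholder; the real change would perform this lookup inside the termination controller):

#!/bin/bash
# Placeholder node name, for illustration only
NODE="ip-10-0-0-1.example.internal"

# VolumeAttachments still bound to the node; termination would wait (with some timeout)
# until this list is empty before deleting the instance.
kubectl get volumeattachments -o json \
  | jq -r --arg node "$NODE" '.items[] | select(.spec.nodeName == $node) | .metadata.name'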

@youwalther65

youwalther65 commented Apr 12, 2024

@jonathan-innis That's interesting and the opposite of what I thought. So the EBS CSI node pod should not tolerate all taints (which is the current default) in order to get drained, which runs the preStop hook as described here.

@toVersus

I have been thinking about how to determine if a volume has been unmounted. In the current implementation, we wait for the evicted Pods on the node to go into the Succeeded / Failed phase, but this is not sufficient.

A similar discussion can be seen upstream regarding the PVC deletion protection feature at kubernetes/kubernetes#123320 (comment), where it's pointed out that even if the Pod phase becomes Succeeded / Failed, that does not guarantee the volume has been detached. There is therefore an effort to add a new Pod condition, PodTerminated, to indicate that the volume has been unmounted, and KEP-4569: Conditions for terminated pod was just created.

Since I think this feature would be useful for Karpenter / Cluster Autoscaler as well, how about mentioning it as a use case?

@levanlongktmt

@jonathan-innis @jmdeal do you have any plans for this issue?

@jmdeal
Member Author

jmdeal commented Aug 15, 2024

Closing, solved by #1294.

@jmdeal jmdeal closed this as completed Aug 15, 2024