PersistentVolumes stuck after node consolidation / termination #944

Closed
jmdeal opened this issue Jan 17, 2024 · 14 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@jmdeal
Member

jmdeal commented Jan 17, 2024

Description

Observed Behavior:
After Karpenter performs a scale-down event, PersistentVolumes may fail to detach from the deleted node. As a result, pods that depend on these PersistentVolumes are left in a Pending state until the Attach/Detach controller force-detaches the volumes after 6 minutes.

This is related to the following upstream Kubernetes issue:

While this is an upstream issue, it has had real impact on Karpenter users and could potentially be addressed as part of Karpenter's node drain logic.

Expected Behavior:
All persistent volumes should be detached from the node as part of the termination process.

Additional References:

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@jmdeal jmdeal added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 17, 2024
@jmdeal
Member Author

jmdeal commented Jan 17, 2024

On the AWS side this has been fixed via this PR to the aws-ebs-csi-driver, at least for graceful terminations: kubernetes-sigs/aws-ebs-csi-driver#1736. However, this doesn't address the issue for CSI drivers at large. Curious if any of the Azure folks have run into similar issues (cc @tallaxes @Bryce-Soghigian @jackfrancis)?
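
If you want to check whether your cluster already carries that driver-side fix, one way is to look for the preStop hook the PR added to the node DaemonSet. A quick sketch (the DaemonSet and container names ebs-csi-node / ebs-plugin are assumptions based on a standard EKS add-on or Helm install):

# Print the preStop lifecycle hook (if any) on the EBS CSI node plugin container.
# An empty result suggests the installed driver predates kubernetes-sigs/aws-ebs-csi-driver#1736.
kubectl -n kube-system get daemonset ebs-csi-node \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="ebs-plugin")].lifecycle.preStop}'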

@jmdeal
Member Author

jmdeal commented Jan 17, 2024

/remove-label needs-triage

@k8s-ci-robot
Contributor

@jmdeal: The label(s) /remove-label needs-triage cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda, refactor. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/remove-label needs-triage

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jonathan-innis jonathan-innis removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jan 18, 2024
@jmdeal
Member Author

jmdeal commented Feb 13, 2024

We've got some updates to add here, at least for AWS users. This problem can be largely mitigated by ensuring your nodes are configured for graceful node shutdown. Currently this isn't enabled by default on AL2; the following UserData should be sufficient to enable it:

#!/bin/bash
# Allow systemd-logind to delay shutdown long enough for kubelet's graceful shutdown
echo -e "InhibitDelayMaxSec=45\n" >> /etc/systemd/logind.conf
systemctl restart systemd-logind

# Enable kubelet graceful node shutdown in the AL2 kubelet config
echo "$(jq ".shutdownGracePeriod=\"45s\"" /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json
echo "$(jq ".shutdownGracePeriodCriticalPods=\"15s\"" /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json

Note: This is specific to the AL2 EKS optimized AMI; enabling graceful shutdown will differ depending on your AMI.
There may still be some delay in EBS detachment, but this should eliminate the 6-minute detach delay caused by the Kubernetes side of things.
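
For anyone applying this, a quick way to confirm the settings took effect is to check the same AL2 paths on the node itself (e.g. via SSM, or kubectl debug plus chroot /host); this is just a sanity-check sketch:

# The appended logind override should be present
grep InhibitDelayMaxSec /etc/systemd/logind.conf

# kubelet should now report the graceful shutdown settings
jq '.shutdownGracePeriod, .shutdownGracePeriodCriticalPods' /etc/kubernetes/kubelet/kubelet-config.json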

@levanlongktmt

@jmdeal I tried using the userdata on AL2023 and the 6-minute hang still happens 😢

@johnjeffers

Any suggestions on the correct userdata for EKS Bottlerocket AMIs? Can't use what's suggested above because Bottlerocket doesn't have the systemd-logind service.

@jmdeal
Member Author

jmdeal commented Apr 9, 2024

The AWS provider has supported configuring graceful shutdown for Bottlerocket since v0.31.0 (aws/karpenter-provider-aws#4571), and as far as I can tell Bottlerocket should have systemd-logind (bottlerocket-os/bottlerocket#3308). I'm not sure if any additional configuration for systemd-logind is needed for Bottlerocket; I'm going to wager no based on what I've briefly read.

@jmdeal
Member Author

jmdeal commented Apr 10, 2024

@levanlongktmt apologies for missing your question; you likely need to handle the kubelet configuration through the nodeadm config rather than bash. This should work:

    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="//"

    --//
    Content-Type: application/node.eks.awsContent-Type: application/node.eks.aws

    apiVersion: node.eks.aws/v1alpha1
    kind: NodeConfig
    spec:
      kubelet:
        config:
          shutdownGracePeriod: 45s
          shutdownGracePeriodCriticalPods: 15s
    --//
    Content-Type: text/x-shellscript; charset="us-ascii"

    #!/bin/bash
    echo -e "InhibitDelayMaxSec=45\n" >> /etc/systemd/logind.conf
    systemctl restart systemd-logind
    --//--

@toVersus

@jmdeal
Thanks for sharing the example userdata! However, it contains some typos in the configuration. I can confirm that the following userdata works perfectly.

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2023
  (...)
  userData: |
    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="//"

    --//
    Content-Type: application/node.eks.aws

    apiVersion: node.eks.aws/v1alpha1
    kind: NodeConfig
    spec:
      kubelet:
        config:
          shutdownGracePeriod: 45s
          shutdownGracePeriodCriticalPods: 15s
    --//
    Content-Type: text/x-shellscript; charset="us-ascii"

    #!/bin/bash
    echo -e "InhibitDelayMaxSec=45\n" >> /etc/systemd/logind.conf
    systemctl restart systemd-logind
    --//

A Karpenter node is configured as expected:

kubectl debug -it node/ip-172-30-20-187.ap-northeast-1.compute.internal --image=alpine

/ # chroot /host
[root@ip-172-30-20-187 /]# cat /etc/systemd/logind.conf | grep InhibitDelayMaxSec
#InhibitDelayMaxSec=5
InhibitDelayMaxSec=45
[root@ip-172-30-20-187 /]# cat /etc/kubernetes/kubelet/config.json | grep shutdownGracePeriod
    "shutdownGracePeriod": "45s",
    "shutdownGracePeriodCriticalPods": "15s",

@jonathan-innis
Member

After some discussion with the EBS CSI Driver team, I think the real fix here is for the client-side drain behavior to actually wait on volume detachment, since that happens asynchronously. Realistically, this would also be a problem for Cluster Autoscaler if it doesn't implement similar logic, so we should think about how to solve this generally across SIG Autoscaling if we can.

We're receptive to someone taking this change on and implementing it. The change would be to wait after we drain here: https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/node/termination/controller.go#L86. Realistically, for this to work, we need to make sure to avoid draining the EBS CSI driver DaemonSet, which means that EBS shouldn't tolerate the Karpenter taint so that it sticks around for the entire lifetime of the node. Either that, or it gets drained as part of the termination flow and keeps its pre-stop logic like it has today.
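
To make the "wait on volume detachment" idea concrete: the condition the termination flow would wait on is that no VolumeAttachment objects still reference the node being deleted. A rough manual equivalent of that check (the node name below is a placeholder; the real change would perform this lookup inside the termination controller):

#!/bin/bash
# Placeholder node name, for illustration only
NODE="ip-10-0-0-1.example.internal"

# VolumeAttachments still bound to the node; termination would wait (with some timeout)
# until this list is empty before deleting the instance.
kubectl get volumeattachments -o json \
  | jq -r --arg node "$NODE" '.items[] | select(.spec.nodeName == $node) | .metadata.name'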

@youwalther65

youwalther65 commented Apr 12, 2024

@jonathan-innis That's interesting and the opposite of what I thought. So the EBS CSI node pod should not tolerate all taints (which is the current default) in order to get drained, which runs the preStop hook as described here.

@toVersus

I have been thinking about how to determine if a volume has been unmounted. In the current implementation, we wait for the evicted Pods on the node to go into the Succeeded / Failed phase, but this is not sufficient.

A similar discussion can be seen upstream regarding the PVC deletion protection feature at kubernetes/kubernetes#123320 (comment), where it's pointed out that even if the Pod phase becomes Succeeded / Failed, that does not guarantee the volume has been detached. There is therefore an effort to add a new Pod condition, PodTerminated, to indicate that the volume has been unmounted, and KEP-4569: Conditions for terminated pod was just created.

Since I think this feature would be useful for Karpenter / Cluster Autoscaler as well, how about mentioning it as a use case?

@levanlongktmt

@jonathan-innis @jmdeal do you have any plans for this issue?

@jmdeal
Member Author

jmdeal commented Aug 15, 2024

Closing, solved by #1294.

@jmdeal jmdeal closed this as completed Aug 15, 2024