CrashloopBackoff after upgrade to 1.17.1 #258
Comments
@James-Quigley It looks like you have stale Network Policy and/or PolicyEndpoint resources in the cluster.
@achevuru what makes them "stale"? And how can that be resolved? Or are you saying that we can't have any NetworkPolicies in the cluster if we're going to have netpols disabled on the CNI daemonset?
When you enable Network Policy in VPC CNI, the Network Policy controller will resolve the selectors in the NP spec and create an intermediate custom resource called PolicyEndpoint. You can have Network Policy resources in your cluster with NP disabled in VPC CNI. But if you enabled it in VPC CNI and are now trying to disable it, we need to clear out those resources first. So, delete your Network Policies before disabling the feature, so the controller can clean up the derived PolicyEndpoint objects.
You can then reconfigure your NPs (if you want another NP solution to act on them).
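For anyone following along, a minimal sketch of how to inspect and clean up the derived resources, assuming the CRD is registered as `policyendpoints.networking.k8s.aws` (the exact group/name may differ by controller version):

```sh
# List the PolicyEndpoint objects the controller derived from your NetworkPolicies.
kubectl get policyendpoints.networking.k8s.aws --all-namespaces

# Deleting a NetworkPolicy should let the controller garbage-collect the
# corresponding PolicyEndpoint objects on its own.
kubectl delete networkpolicy <policy-name> -n <namespace>
```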
Would deleting the PolicyEndpoints resources fix the problem? Also, why does this only seem to happen sometimes, for some specific pods?
+1, deleting the PolicyEndpoints fixed it. Are there any plans to resolve the nil pointer dereference / delete PolicyEndpoints when NP is disabled?
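For reference, the workaround described here amounts to something like the following sketch; the resource name is an assumption, so check what `kubectl api-resources | grep -i policyendpoint` reports first:

```sh
# Force-remove the stale derived resources that the agent is tripping over.
kubectl delete policyendpoints --all --all-namespaces
```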
+1, stumbled on this while trying to disable NetworkPolicies during a failed rollout. Secret behaviors like this are not fun, and a hard panic seems like a poor way for the CNI agent to handle it.
As I called out above, we took this approach to prevent stale resources in the cluster. The Network Policy controller and agent should be allowed to clear out the eBPF probes configured against individual pods to enforce the NPs. Hard failures alert users to stale firewall rules that are still active against running pods. So, the recommended way to disable the NP feature is to follow the sequence above (delete the NPs, then disable the NP feature in the configMap and the NP agent).
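A sketch of that sequence, assuming the feature toggle is the `enable-network-policy-controller` key in the `amazon-vpc-cni` ConfigMap plus an `--enable-network-policy` argument on the node agent container (names may differ between releases):

```sh
# 1. Delete the NetworkPolicy objects so the controller and agent can tear
#    down the eBPF probes and the derived PolicyEndpoint resources.
kubectl delete networkpolicies --all --all-namespaces

# 2. Confirm no derived resources remain before flipping the switch.
kubectl get policyendpoints --all-namespaces

# 3. Only then disable the feature in the ConfigMap and on the agent.
kubectl -n kube-system edit configmap amazon-vpc-cni   # enable-network-policy-controller: "false"
kubectl -n kube-system edit daemonset aws-node         # --enable-network-policy=false
```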
Ok, so it's a choice. But maybe a log stating that, instead of just a nil pointer dereference, would have helped. There's no mention of this in your README, nor in the AWS docs at https://docs.aws.amazon.com/eks/latest/userguide/cni-network-policy.html. It's not obvious that you have to go delete a bunch of resources in order to change an enforcement flag from enabled to disabled.
Fair enough. We will call this out clearly in the documentation.
We also hit this issue when attempting to disable enforcement via VPC CNI, leading to a failed apply of the configuration update via EKS, and crashlooping aws-node pods. Even AWS support were unaware of this limitation and it took a significant amount of time to track down the fault.
Also hit this bug in the latest version (v1.18.2) while trying to disable network policies. The reason I had to disable network policies was that I hit another issue where the policy behavior in the AWS addon doesn't seem to match the Kubernetes docs or the previous Calico install we were using: aws/amazon-network-policy-controller-k8s#121
Same experience here last week. Spent nearly 2 hours on a call with AWS support just to diagnose.
What happened:
Certain pods are crashlooping after upgrading aws-vpc-cni versions.
Attach logs
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
I don't have a specific reproduction. It seems to only happen occasionally and only on certain pods/nodes. I haven't been able to determine yet exactly what the cause is.
Anything else we need to know?:
Might be related to aws/amazon-vpc-cni-k8s#2562
We have network policies currently disabled, both in the configmap and the command line flags.
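For completeness, a rough way to verify how the feature is currently configured; the ConfigMap key and flag name below are assumptions based on the default VPC CNI manifests:

```sh
# Value of the feature key in the amazon-vpc-cni ConfigMap (assumed key name).
kubectl -n kube-system get configmap amazon-vpc-cni \
  -o jsonpath='{.data.enable-network-policy-controller}{"\n"}'

# Arguments of each container in the aws-node DaemonSet, to verify the
# network policy agent is started with --enable-network-policy=false.
kubectl -n kube-system get daemonset aws-node \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{": "}{.args}{"\n"}{end}'
```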
Environment:
- Kubernetes version (`kubectl version`): 1.26
- OS (`cat /etc/os-release`): Bottlerocket
- Kernel (`uname -a`): Linux <name> 5.15.148 #1 SMP Fri Feb 23 02:47:29 UTC 2024 x86_64 GNU/Linux