
CrashloopBackoff after upgrade to 1.17.1 #258

Open
James-Quigley opened this issue Apr 24, 2024 · 12 comments
Labels
documentation (Improvements or additions to documentation)

Comments

@James-Quigley

What happened:
Certain pods are crashlooping after upgrading aws-vpc-cni versions.

Attach logs

2024-04-24 11:12:32.290147858 +0000 UTC Logger.check error: failed to get caller
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xaaaaea2a319c]
goroutine 99 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:116 +0x1a4
panic({0xaaaaeb14afa0?, 0xaaaaec6eade0?})
	/root/sdk/go1.21.7/src/runtime/panic.go:914 +0x218
github.com/aws/aws-network-policy-agent/controllers.(*PolicyEndpointsReconciler).configureeBPFProbes(0x40004de000, {0xaaaaeb51dff8, 0x4000794b10}, {0x40006027d0, 0x44}, {0x4000644d40?, 0x1, 0x0?}, {0x40006bd100, 0x2, ...}, ...)
	/workspace/controllers/policyendpoints_controller.go:292 +0x34c
github.com/aws/aws-network-policy-agent/controllers.(*PolicyEndpointsReconciler).reconcilePolicyEndpoint(0x40004de000, {0xaaaaeb51dff8, 0x4000794b10}, 0x4000655520)
	/workspace/controllers/policyendpoints_controller.go:266 +0x58c
github.com/aws/aws-network-policy-agent/controllers.(*PolicyEndpointsReconciler).reconcile(0x40004de000, {0xaaaaeb51dff8, 0x4000794b10}, {{{0x4000489f80, 0x1c}, {0x400005bdc0, 0x32}}})
	/workspace/controllers/policyendpoints_controller.go:145 +0x1a4
github.com/aws/aws-network-policy-agent/controllers.(*PolicyEndpointsReconciler).Reconcile(0x40004de000, {0xaaaaeb51dff8, 0x4000794b10}, {{{0x4000489f80, 0x1c}, {0x400005bdc0, 0x32}}})
	/workspace/controllers/policyendpoints_controller.go:126 +0xe4
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xaaaaeb520850?, {0xaaaaeb51dff8?, 0x4000794b10?}, {{{0x4000489f80?, 0xb?}, {0x400005bdc0?, 0x0?}}})
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119 +0x8c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0x40000e68c0, {0xaaaaeb51e030, 0x400001f450}, {0xaaaaeb255960?, 0x40004e6a80?})
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316 +0x2e8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0x40000e68c0, {0xaaaaeb51e030, 0x400001f450})
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266 +0x16c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227 +0x74
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 78
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:223 +0x43c

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):
I don't have a specific reproduction. It seems to happen only occasionally and only on certain pods/nodes. I haven't yet been able to determine the exact cause.

Anything else we need to know?:
Might be related to aws/amazon-vpc-cni-k8s#2562

We have network policies currently disabled, both in the configmap and the command line flags.
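For reference, a quick way to confirm how the feature is configured (a sketch; the configmap and daemonset names below assume the standard EKS install of the VPC CNI):

```bash
# Assumes the standard EKS names: configmap "amazon-vpc-cni" and
# daemonset "aws-node" in kube-system; adjust for your install.
kubectl -n kube-system get configmap amazon-vpc-cni -o yaml | grep -i network-policy
kubectl -n kube-system get daemonset aws-node -o yaml | grep -i enable-network-policy
```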

Environment:

  • Kubernetes version (use kubectl version): 1.26
  • CNI Version: 1.17.1
  • OS (e.g: cat /etc/os-release): Bottlerocket
  • Kernel (e.g. uname -a): Linux <name> 5.15.148 #1 SMP Fri Feb 23 02:47:29 UTC 2024 x86_64 GNU/Linux
@James-Quigley added the bug (Something isn't working) label Apr 24, 2024
@achevuru transferred this issue from aws/amazon-vpc-cni-k8s Apr 24, 2024
@achevuru
Contributor

@James-Quigley It looks like you have stale Network Policy and/or PolicyEndpoint resources in your cluster even though NP is disabled (kubectl get networkpolicies -A / kubectl get policyendpoints -A).
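For example, listing both resource types across all namespaces (standard kubectl commands; PolicyEndpoints is the CNI's custom resource):

```bash
# Any entries left over from when NP was enabled are candidates for cleanup:
kubectl get networkpolicies -A
kubectl get policyendpoints -A
```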

@James-Quigley
Author

@achevuru what makes them "stale"? And how can that be resolved? Or are you saying that we can't have any NetworkPolicies in the cluster if we're going to have netpols disabled on the cni daemonset?

@achevuru
Copy link
Contributor

achevuru commented Apr 24, 2024

When you enable Network Policy in VPC CNI, the Network Policy controller resolves the selectors in the NP spec and creates an intermediate custom resource called PolicyEndpoints. This resource is specific to the VPC CNI's Network Policy implementation. If you then disable NP in VPC CNI, we need to make sure these resources are cleaned up; otherwise, stale firewall rules can remain enforced on pod interfaces, resulting in unexpected behavior. Deleting the NP resources will clear out the corresponding PolicyEndpoint resources. These resources are stale because they reflect the state of the endpoints at the time the feature was enabled in the cluster.

You can have Network Policy resources in your cluster with NP disabled in VPC CNI. But if you enabled it in VPC CNI and are now trying to disable it, these resources need to be cleared out first. So:

  • Delete NPs
  • Disable NP feature in configMap and NP agent

You can then recreate your NPs (if you want another NP solution to act on them). A minimal sketch of the sequence is below.
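(Sketch only; resource names are illustrative, and the configmap key shown is an assumption, so verify it against the EKS docs for your version.)

```bash
# Optional: back up the NetworkPolicy specs so they can be recreated later.
kubectl get networkpolicies -A -o yaml > networkpolicies-backup.yaml

# 1. Delete the NPs first, so the controller removes the corresponding
#    PolicyEndpoints and the agent tears down the eBPF probes.
kubectl delete networkpolicies --all -A

# 2. Then disable the feature, e.g. by setting
#    enable-network-policy-controller to "false" in the configmap
#    (assumed key name) and the matching flag on the NP agent.
kubectl -n kube-system edit configmap amazon-vpc-cni
```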

@James-Quigley
Author

Would deleting the PolicyEndpoints resources fix the problem? Also why does this only seem to happen sometimes, for some specific pods?

@Zygimantass

+1, deleting the PolicyEndpoints fixed it. Are there any plans to resolve the nil pointer dereference / delete PolicyEndpoints when NP is disabled?
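For anyone else stuck in the crashloop, the cleanup that worked here amounts to the following (assuming the policyendpoints CRD is still installed):

```bash
# Remove the stale intermediate resources so the agent stops
# panicking while reconciling them:
kubectl delete policyendpoints --all -A
```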

@paul-yolabs

+1, stumbled on this while trying to disable NetworkPolicies during a failed rollout. Secret behaviors like this are not fun, and a hard panic seems like a poor way for the CNI agent to handle it.

@achevuru
Contributor

As I called out above, we took this approach to prevent stale resources in the cluster. The Network Policy controller and agent should be allowed to clear out the eBPF probes configured against individual pods to enforce the NPs. Hard failures alert users to stale firewall rules that are still active against running pods. So, the recommended way to disable the NP feature is to follow the above sequence (delete NPs, then disable the NP feature in the configMap and NP agent).

@paul-yolabs

Ok, so it's a choice. But perhaps a log message stating that, rather than just a panic, would help the user, since this behavior appears to be documented only in this thread?

There's no mention of this in your README, nor in the AWS docs at https://docs.aws.amazon.com/eks/latest/userguide/cni-network-policy.html

It's not obvious to go delete a bunch of resources in order to change an enforcement flag from true to false, so a little help here would be appreciated.

@achevuru
Contributor

Fair enough. We will call this out clearly in the documentation.

@andrewjmurphy

We also hit this issue when attempting to disable enforcement via VPC CNI, leading to a failed apply of the configuration update via EKS, and crashlooping aws-node pods. Even AWS support were unaware of this limitation and it took a significant amount of time to track down the fault.
Both explicit mention of the limitation in the documentation and some more useful log messages would have saved a lot of wasted time.

@jayanthvn added the documentation (Improvements or additions to documentation) label and removed the bug (Something isn't working) label Jun 11, 2024
@pgier

pgier commented Jul 17, 2024

Also hit this bug in the latest version (v1.18.2) while trying to disable network policies. I had to disable network policies because I hit another issue where the policy behavior in the AWS addon doesn't seem to match the Kubernetes docs and doesn't match the previous Calico install we were using: aws/amazon-network-policy-controller-k8s#121

@andreyastashov

andreyastashov commented Sep 3, 2024

> We also hit this issue when attempting to disable enforcement via VPC CNI, leading to a failed apply of the configuration update via EKS, and crashlooping aws-node pods. Even AWS support were unaware of this limitation and it took a significant amount of time to track down the fault. Both explicit mention of the limitation in the documentation and some more useful log messages would have saved a lot of wasted time.

Same experience here last week. Spent nearly 2 hours on a call with AWS support just to diagnose.
I found that removing and re-adding the CNI add-on (which defaults to an empty extra config) eventually resolved the problem. It's also worth noting that you can remove the add-on without needing to wait for an update status, which might save some time if anyone else runs into this.
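Roughly, with the AWS CLI (a sketch; <cluster> is a placeholder, and whether to use --preserve depends on how disruptive you can afford to be):

```bash
# Remove and re-create the managed add-on; create-addon defaults to
# empty configuration values. The optional --preserve flag keeps the
# add-on software running on the cluster while the managed add-on
# record is deleted.
aws eks delete-addon --cluster-name <cluster> --addon-name vpc-cni --preserve
aws eks create-addon --cluster-name <cluster> --addon-name vpc-cni
```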
