
CrashloopBackoff after upgrade to 1.17.1 #258

Open
James-Quigley opened this issue Apr 24, 2024 · 12 comments
Labels
documentation (Improvements or additions to documentation)

Comments

@James-Quigley

What happened:
Certain pods are crashlooping after upgrading aws-vpc-cni versions.

Attach logs

2024-04-24 11:12:32.290147858 +0000 UTC Logger.check error: failed to get caller
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xaaaaea2a319c]
goroutine 99 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:116 +0x1a4
panic({0xaaaaeb14afa0?, 0xaaaaec6eade0?})
	/root/sdk/go1.21.7/src/runtime/panic.go:914 +0x218
github.com/aws/aws-network-policy-agent/controllers.(*PolicyEndpointsReconciler).configureeBPFProbes(0x40004de000, {0xaaaaeb51dff8, 0x4000794b10}, {0x40006027d0, 0x44}, {0x4000644d40?, 0x1, 0x0?}, {0x40006bd100, 0x2, ...}, ...)
	/workspace/controllers/policyendpoints_controller.go:292 +0x34c
github.com/aws/aws-network-policy-agent/controllers.(*PolicyEndpointsReconciler).reconcilePolicyEndpoint(0x40004de000, {0xaaaaeb51dff8, 0x4000794b10}, 0x4000655520)
	/workspace/controllers/policyendpoints_controller.go:266 +0x58c
github.com/aws/aws-network-policy-agent/controllers.(*PolicyEndpointsReconciler).reconcile(0x40004de000, {0xaaaaeb51dff8, 0x4000794b10}, {{{0x4000489f80, 0x1c}, {0x400005bdc0, 0x32}}})
	/workspace/controllers/policyendpoints_controller.go:145 +0x1a4
github.com/aws/aws-network-policy-agent/controllers.(*PolicyEndpointsReconciler).Reconcile(0x40004de000, {0xaaaaeb51dff8, 0x4000794b10}, {{{0x4000489f80, 0x1c}, {0x400005bdc0, 0x32}}})
	/workspace/controllers/policyendpoints_controller.go:126 +0xe4
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xaaaaeb520850?, {0xaaaaeb51dff8?, 0x4000794b10?}, {{{0x4000489f80?, 0xb?}, {0x400005bdc0?, 0x0?}}})
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119 +0x8c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0x40000e68c0, {0xaaaaeb51e030, 0x400001f450}, {0xaaaaeb255960?, 0x40004e6a80?})
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316 +0x2e8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0x40000e68c0, {0xaaaaeb51e030, 0x400001f450})
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266 +0x16c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227 +0x74
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 78
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:223 +0x43c

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):
I don't have a specific reproduction. It seems to happen only occasionally and only on certain pods/nodes. I haven't yet been able to determine the exact cause.

Anything else we need to know?:
Might be related to aws/amazon-vpc-cni-k8s#2562

We have network policies currently disabled, both in the configmap and the command line flags.
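For reference, a quick way to confirm how the feature is configured (a sketch; the configmap and daemonset names below assume the standard EKS install of the VPC CNI):

```bash
# Assumes the standard EKS names: configmap "amazon-vpc-cni" and
# daemonset "aws-node" in kube-system; adjust for your install.
kubectl -n kube-system get configmap amazon-vpc-cni -o yaml | grep -i network-policy
kubectl -n kube-system get daemonset aws-node -o yaml | grep -i enable-network-policy
```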

Environment:

  • Kubernetes version (use kubectl version): 1.26
  • CNI Version: 1.17.1
  • OS (e.g: cat /etc/os-release): Bottlerocket
  • Kernel (e.g. uname -a): Linux <name> 5.15.148 #1 SMP Fri Feb 23 02:47:29 UTC 2024 x86_64 GNU/Linux
@James-Quigley added the bug (Something isn't working) label Apr 24, 2024
@achevuru transferred this issue from aws/amazon-vpc-cni-k8s Apr 24, 2024
@achevuru
Contributor

@James-Quigley It looks like you have stale Network Policy and/or PolicyEndpoint resources in your cluster even though NP is disabled (kubectl get networkpolicies -A / kubectl get policyendpoints -A).
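For example, listing both resource types across all namespaces (standard kubectl commands; PolicyEndpoints is the CNI's custom resource):

```bash
# Any entries left over from when NP was enabled are candidates for cleanup:
kubectl get networkpolicies -A
kubectl get policyendpoints -A
```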

@James-Quigley
Author

@achevuru what makes them "stale"? And how can that be resolved? Or are you saying that we can't have any NetworkPolicies in the cluster if we're going to have netpols disabled on the cni daemonset?

@achevuru
Copy link
Contributor

achevuru commented Apr 24, 2024

When you enable Network Policy in VPC CNI, the Network Policy controller resolves the selectors in the NP spec and creates an intermediate custom resource called PolicyEndpoints. This resource is specific to the VPC CNI's Network Policy implementation. If you then disable NP in VPC CNI, we need to make sure these resources are cleaned up; otherwise, stale firewall rules can remain enforced on pod interfaces, resulting in unexpected behavior. Deleting the NP resources will clear out the corresponding PolicyEndpoint resources. These resources are stale because they reflect the state of the endpoints at the time the feature was enabled in the cluster.

You can have Network Policy resources in your cluster with NP disabled in VPC CNI. But if you enabled it in VPC CNI and are now trying to disable it, these resources need to be cleared out first. So:

  • Delete NPs
  • Disable NP feature in configMap and NP agent

You can then recreate your NPs (if you want another NP solution to act on them). A minimal sketch of the sequence is below.
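(Sketch only; resource names are illustrative, and the configmap key shown is an assumption, so verify it against the EKS docs for your version.)

```bash
# Optional: back up the NetworkPolicy specs so they can be recreated later.
kubectl get networkpolicies -A -o yaml > networkpolicies-backup.yaml

# 1. Delete the NPs first, so the controller removes the corresponding
#    PolicyEndpoints and the agent tears down the eBPF probes.
kubectl delete networkpolicies --all -A

# 2. Then disable the feature, e.g. by setting
#    enable-network-policy-controller to "false" in the configmap
#    (assumed key name) and the matching flag on the NP agent.
kubectl -n kube-system edit configmap amazon-vpc-cni
```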

@James-Quigley
Author

Would deleting the PolicyEndpoints resources fix the problem? Also why does this only seem to happen sometimes, for some specific pods?

@Zygimantass

+1, deleting the PolicyEndpoints fixed it. Are there any plans to resolve the nil pointer dereference / delete PolicyEndpoints when NP is disabled?
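For anyone else stuck in the crashloop, the cleanup that worked here amounts to the following (assuming the policyendpoints CRD is still installed):

```bash
# Remove the stale intermediate resources so the agent stops
# panicking while reconciling them:
kubectl delete policyendpoints --all -A
```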

@paul-yolabs

+1, stumbled on this while trying to disable NetworkPolicies during a failed rollout. Secret behaviors like this are not fun, and a hard panic seems like a poor way for the CNI agent to handle it.

@achevuru
Contributor

As I called out above, we took this approach to prevent stale resources in the cluster. The Network Policy controller and agent should be allowed to clear out the eBPF probes configured against individual pods to enforce the NPs. Hard failures alert users to stale firewall rules that are still active against running pods. So, the recommended way to disable the NP feature is to follow the above sequence (delete NPs, then disable the NP feature in the configMap and NP agent).

@paul-yolabs

Ok, so it's a choice. But perhaps a log message stating that, rather than just a panic, would help the user, since this behavior appears to be documented only in this thread?

There's no mention of this in your README, nor in the AWS docs at https://docs.aws.amazon.com/eks/latest/userguide/cni-network-policy.html

It's not obvious to go delete a bunch of resources in order to change an enforcement flag from true to false, so a little help here would be appreciated.

@achevuru
Contributor

Fair enough. We will call this out clearly in the documentation.

@andrewjmurphy

We also hit this issue when attempting to disable enforcement via VPC CNI, leading to a failed apply of the configuration update via EKS, and crashlooping aws-node pods. Even AWS support were unaware of this limitation and it took a significant amount of time to track down the fault.
Both explicit mention of the limitation in the documentation and some more useful log messages would have saved a lot of wasted time.

@jayanthvn added the documentation (Improvements or additions to documentation) label and removed the bug (Something isn't working) label Jun 11, 2024
@pgier

pgier commented Jul 17, 2024

Also hit this bug in the latest version (v1.18.2) while trying to disable network policies. I had to disable network policies because I hit another issue where the policy behavior in the AWS addon doesn't seem to match the Kubernetes docs and doesn't match the previous Calico install we were using: aws/amazon-network-policy-controller-k8s#121

@andreyastashov

andreyastashov commented Sep 3, 2024

> We also hit this issue when attempting to disable enforcement via VPC CNI, leading to a failed apply of the configuration update via EKS, and crashlooping aws-node pods. Even AWS support were unaware of this limitation and it took a significant amount of time to track down the fault. Both explicit mention of the limitation in the documentation and some more useful log messages would have saved a lot of wasted time.

Same experience here last week. Spent nearly 2 hours on a call with AWS support just to diagnose.
I found that removing and re-adding the CNI add-on (which defaults to an empty extra config) eventually resolved the problem. It's also worth noting that you can remove the add-on without needing to wait for an update status, which might save some time if anyone else runs into this.
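Roughly, with the AWS CLI (a sketch; <cluster> is a placeholder, and whether to use --preserve depends on how disruptive you can afford to be):

```bash
# Remove and re-create the managed add-on; create-addon defaults to
# empty configuration values. The optional --preserve flag keeps the
# add-on software running on the cluster while the managed add-on
# record is deleted.
aws eks delete-addon --cluster-name <cluster> --addon-name vpc-cni --preserve
aws eks create-addon --cluster-name <cluster> --addon-name vpc-cni
```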
