Network policy blocks established connections to STS. #73
So it looks like the network-policy-agent missed the initial connection, which was already established; from the firewall's perspective the return traffic is new ingress traffic, which it is successfully blocking. The log file has records dated from Sep 19th:
and the pod has been in the running state for only 22 hours:
ipamd.log
plugin.log
network-policy-agent.log:
Pod info:
So the pod was created at ... and I've added a sleep to the container startup as a workaround (see the sketch below).
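For context, the workaround referenced here is a short delay before the application starts, so that the network policy agent has time to attach its probes before the first egress call. A minimal sketch, with hypothetical pod name, image, and script path:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cli-script              # hypothetical name
spec:
  serviceAccountName: app-sa    # hypothetical; the SA that assumes the IAM role
  containers:
    - name: app
      image: my-php-app:latest  # hypothetical image
      # Delay startup so the first egress call (DNS/STS) happens after
      # the network policy agent has enforced policies for this pod.
      command: ["sh", "-c", "sleep 5 && exec php /app/script.php"]
```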
Any ideas how to fix it without adding a sleep command?
Same issue with the latest 1.0.5.
1.0.6 still has the same issue. All established connections are being denied after the network policy is applied:
@wiseelf - Will you be able to try this image - ... Please make sure you replace the account number and region.
@jayanthvn I have the same issue:
Actually I missed your comment - 1.0.7 doesn't fix it. I had assumed it was a long-standing connection. The agent's local conntrack cache is only populated if the network policy allows the connection, i.e., on the first packet; future packets of that connection then go through the local conntrack. Here the STS connection seems to be established between 14:00:02 and 14:00:03, and once the deny-all policy is applied everything gets blocked, because the connection needs to be explicitly allowed. So the 5s sleep is for NP to enforce?
Well, adding the sleep is not a problem, but it is also not a solution at all. I believe cilium or calico do not have such an issue. I'll definitely try them when I have time.
Tried cilium and calico; neither of them has this issue.
Tracking this as another issue that should be addressed by the strict mode implementation.
I also see this issue. For example:
Here the pod attempted to start a connection before NP enforcement, and hence the response packet is dropped. Please refer to #189 (comment) for a detailed explanation. Our recommended solution for this is strict mode, which will gate pod launch until policies are configured for the newly launched pod - https://github.com/aws/amazon-vpc-cni-k8s?tab=readme-ov-file#network_policy_enforcing_mode-v1171
@jayanthvn how exactly will that help? That means the initial connection to STS will be denied.
@sknmi - In the original issue, adding a 5s delay mitigated the problem. As explained in #189 (comment), when the initial connection was made the network policy was not yet enforced: the egress connection happens right after pod startup, before the policies are enforced, and there is no conntrack entry since no probes are attached yet. By the time the return traffic arrives, the network policy has been enforced, there is still no conntrack entry, and the ingress rules in the configured policy do not allow the traffic, resulting in a drop. Hence adding a few seconds of delay helped. With strict mode, pod launch is blocked until policies are reconciled, as sketched below.
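For anyone looking for the concrete knob: strict mode is controlled by the NETWORK_POLICY_ENFORCING_MODE variable documented in the README linked above (VPC CNI v1.17.1+). A sketch of the relevant env entry; how you set it depends on whether you manage the addon configuration or the aws-node daemonset directly:

```yaml
# Excerpt only, not a full manifest: env for the aws-node daemonset
env:
  - name: NETWORK_POLICY_ENFORCING_MODE
    value: "strict"   # default is "standard"
```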
@jayanthvn strict mode requires network policies for the kube-system namespace, for example for coredns and others. Do you have any template for that case, or maybe some best practices? What will happen if we have coredns on Fargate nodes?
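No official template is given in this thread; a minimal sketch of an allow policy for CoreDNS in kube-system could look like the following, assuming the standard k8s-app: kube-dns label:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-coredns-ingress
  namespace: kube-system
spec:
  podSelector:
    matchLabels:
      k8s-app: kube-dns   # standard CoreDNS label
  policyTypes:
    - Ingress
  ingress:
    # Allow DNS queries from anywhere in the cluster
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```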
Hi. After days of investigation and wasted time, we found this thread. We are facing the same issue with ... Switching to ... @jayanthvn, are there any concrete plans to address this issue? The team is currently losing trust in the AWS solution for network policies.
@creinheimer - Sorry to hear that; we would be happy to help debug the issue. Just to clarify, when you mention random failures: the initial connection is allowed before network policy enforcement and the return packet is dropped, correct? (#189 (comment)) Regarding this issue, we are thinking about a few alternatives. Are you on the latest version (v1.1.2) of the NP agent? It won't fix this issue, but we have fixed a few other timeout-related issues.
@wiseelf Were you able to try strict mode, and did that help with fixing the issue?
@Pavani-Panakanti nope, I had the same issue as creinheimer here: #288 |
@wiseelf - As called out here - #288 (comment) - we made a fix which would impact ...
@jayanthvn thank you. Will test.
@jayanthvn tested; the problem still exists.
What happened:
I have a CLI script in one namespace, and I applied this network policy:
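(The actual manifest was not preserved in this thread; later comments describe it as a deny-all policy, so a representative sketch, with a hypothetical namespace, would be:)

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: my-namespace   # hypothetical namespace
spec:
  podSelector: {}           # applies to every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```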
The script is a PHP application that uses aws-php-sdk. It uses a service account to assume a role to access an S3 bucket. We also have an interface endpoint for STS. After I applied that policy, I saw that my container was stuck and could not assume the role.
strace:
lsof:
As you can see, the connection is in the ESTABLISHED state.
Here are the logs from the instance:
But from inside the container, a new command that tests the connection to S3 works well:
Basically 4 requests: DNS -> STS -> DNS -> S3
This is container IP: 10.1.201.3
And this is STS interface IP: 10.1.12.220
And if I remove that network policy, everything works well again. Any ideas?
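For reference, an egress policy covering the DNS -> STS flow above might look like the following sketch, using the STS interface endpoint IP shown (10.1.12.220); allowing S3 would additionally require the S3 CIDRs or a gateway endpoint, which are not shown here:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-and-sts
  namespace: my-namespace   # hypothetical namespace
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    # DNS lookups (cluster resolvers)
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # HTTPS to the STS interface endpoint
    - to:
        - ipBlock:
            cidr: 10.1.12.220/32
      ports:
        - protocol: TCP
          port: 443
```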
Environment:
Kubernetes version (use kubectl version):
Server Version: v1.27.4-eks-2d98532
CNI Version:
v1.15.0-eksbuild.2
OS (e.g: cat /etc/os-release):
bottlerocket

```
$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
```
Kernel (e.g. uname -a):
Linux ip-10-1-193-17.ec2.internal 5.15.128 #1 SMP Thu Sep 14 21:42:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux