-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-3866: kube-proxy nftables mode #3824
Conversation
Skipping CI for Draft Pull Request. |
/cc |
backends of kube-proxy, and does not feel like it can ditch `ipvs` | ||
because of performance issues with `iptables`. If we create a new | ||
backend which is as functional and non-buggy as `iptables` but as | ||
performant as `ipvs`, then we could (eventually) deprecate both of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think we should validate this assumption
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which assumption? That we could ditch ipvs if we had something better?
Also as mentioned elsewhere, "ditch" may mean "push out of tree a la cri-dockerd"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that ipvs proxy is more performant than the iptables ones,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to be careful what we promise - removing iptables and IPVS modes may take a VERY long time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you setup a service with 10,000 endpoints, each SYN will have to be compared with ~5,000 probability rules. Ipvs uses a hash. So for TCP connect there is a difference. Once the connection is setup conntrack takes over.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is undeniable, let me explain myself better , there are multiple dimensions in scalability, it is clear and acknowledged what are the weakness of iptables, but all the reports and blogs comparing iptables proxy with proxy-foo were focused on stress the iptables weakness, but kube-proxy iptables was running and being tested in very large clusters (5000 nodes) for a lot of years , and tested daily.
What I mean here is that we should do this exercise, I'd like to see similar tests that the ones we are running so we can compare apples and apples, also analysis of the performance vs scale.
It doesn't matter if it can send 1Mpps in a single node if, per example, it can not run reliable in 1000 nodes and keep the same performance ... size matters (and I've seen reports that iptables beats other implementations at scale, yeah, also eBPF ones) ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reluncantly , ACK. I didn't get into all the details of nft (I need to learn some stuff) but overall the idea seems sound.
Can we fix some other long-lingering issues with iptables mode, too?
- when do we drop packets vs when do we ICMP-REJECT them?
- where should 3rd parties (e.g. sysadmins, debug docs, etc) drop custom rules that need to not get nuked upon reconcile?
- default-deny for protocols/ports on service IPs that are not explicitly accepted (e.g. no nodeports on service VIPs)
Also...should we aim to make this the vehicle for ClusterIPs as Gateways? (and NodePorts?) It's an expansion but I really want us to do it eventually...
backends of kube-proxy, and does not feel like it can ditch `ipvs` | ||
because of performance issues with `iptables`. If we create a new | ||
backend which is as functional and non-buggy as `iptables` but as | ||
performant as `ipvs`, then we could (eventually) deprecate both of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to be careful what we promise - removing iptables and IPVS modes may take a VERY long time
I was going to suggest that and I think that this KEP is perfect for that, instead of doing a s/iptables/nftables/ we fix the known problems by leveraging Gateway API ... and we can stop overloading Services |
(I had been feeling like "nft" was kind of ruined as an acronym so I've mostly been saying "nftables" when talking about this... Although NFTs are less in the news now than they were a few months ago so maybe we'll get that acronym back? At any rate, although the command-line client is
I had thought the current rules for this were intentional: if a service has no endpoints at all, we reject connections to it, because we want the client to get "connection refused". If the service has Local traffic policy, and no local endpoints, we drop connections, because we assume the LB is just transiently out-of-date, and if the client tries again, hopefully it will get routed to a correct node the next time (and the client is more likely to try again if it looks like its first connection attempt got lost in the internet, whereas it is not likely to retry if it gets back "connection refused"). (I say "connection refused", but in most cases the client actually gets "no route to host", which isn't really what we want, but IIRC we can't actually fix that if we're using ICMP to reject; we'd have to arrange to redirect the packet to a port that actually isn't listening so the kernel will send back a TCP RST or whatever.)
Yeah... so as I was explaining to Antonio, the normal way to do things in nftables is that everyone creates their own tables and doesn't interfere with anyone else's. In your table you create your own chains that connect to the standard hooks ( If people want to stick rules into particular service chains or whatever, then that will likely run into the same problems you get now with the rules getting overwritten by resyncs... But if that's a use case we want to support, we can definitely think about how to make it work.
Right. I was mentally including that along with localhost nodeports: we should be a lot more precise about exactly where nodeports work (and maybe try to default to "they only work on one interface" rather than "they work everywhere".)
I don't have a good sense of what that would mean... |
and write rules, much like how we use command-line tools in the | ||
`iptables` and `ipvs` backends. However, the `nft` tool is mostly just | ||
a thin wrapper around `libnftables`, and it would be possible to use | ||
that directly instead in the future, given a cgo wrapper. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that I read somewhere some problems about using CGO ... we should check if this is a problem for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that's just a bad meme. Someone wrote a blog post about how cgo is bad, and now everyone knows that, but the original author was just pointing out that a cgo function call has like 10x-100x the overhead of a go function call, so trying to optimize your performance-critical code by calling into C is often a lose. But the overhead of go→C is much less than the overhead of userspace→kernel so if you're calling into a C function that makes a syscall then the cgo overhead is just noise anyway. (And the cgo overhead is way less than the overhead of spawning a binary to make the syscall for you.)
ACK :)
We also have the case of addressing a service IP on a port that isn't declared - I think we DROP that, but I think it should be a fast error.
Yeah, I guess what I want is docs and some sort of contract - we expect X and Y and as long as you don't break that, you can attach Here, Here, and Here to you r heart's content.
If we have a "ClusterIP" gateway class, this would require a per-node agent to watch gateways (or something else of lower cardinality, derived from gateways of this class) and implement the moral equivalent of a ClusterIP service. But it needs some design work. But nobody is doing the design work because we don't have any impl to prove it out on... |
ah, yeah, we don't actively This is one of several things that would be easier to deal with if kube-proxy knew the service CIDR(s). Then we could just add a rule at the end saying "all other traffic to the service CIDR gets rejected". (But even without that, we could implement this now in iptables kube-proxy at the cost of an extra iptables rule per clusterIP, and we could implement it in nftables with just one rule and a set containing all clusterIPs.)
That doesn't seem that backend dependent though... does implementing Gateway require any backend-level functionality beyond what implementing Service requires? It seems like the changes would mostly be in the |
f97e5e7
to
52097eb
Compare
with ipvs). Eventually, if nftables becomes the default, then this | ||
would be flipped around to having a legacy "iptables" job. | ||
|
||
The handful of e2e tests that specifically examine iptables rules will |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if this still exist, but is not great for e2e test to have this implementation dependencies, I rather rewrite them to focus on the expected behavior or drop them entirely, I think you already did that for some of them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting... I had thought there were more of these. I guess they mostly got fixed. (You're right, I did just remove one of them.) There is only one remaining test that parses iptables rules, "It should recreate its iptables rules if they are deleted [Disruptive]"
, which tests 2 things:
- kubelet recreates
KUBELET-IPTABLES-HINT
if it is deleted (which is iptables-specific and should stick around regardless of kube-proxy, for as long as we expect third-party components to use iptables). - Services eventually start working again even if you delete all
KUBE-*
iptables chains. It doesn't actually test that they stop working after you delete those chains, so it would still pass in nftables mode, it just wouldn't test anything interesting. (And as discussed in the KEP text, we probably don't need to do monitoring for nftables.)
There are also a few tests that manually invoke iptables to break the network via TestUnderTemporaryNetworkFailure
(plus one that doesn't use that function but should). This will eventually need to be updated to have an nftables-based version, but that's independent of kube-proxy.
I think that's it? Yay us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
filed kubernetes/kubernetes#119047 about TestUnderTemporaryNetworkFailure
|
||
The actual feature gate enablement/disablement itself is not | ||
interesting, since it only controls whether `--proxy-mode nftables` | ||
can be selected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the ideal world I would expect the clearing to happen out of the box. But here we have a tradeoff with enabling some logic even if the feature is off by-default.
My above thinking was a bit of a mental shortcut, so let me clarify that.
I think that we should split our thinking into two separate flows:
(1) allowing users to choose nftables as a backend
(2) making nftables as a default backend for their clusters
If we would be talking about (2), then I'm definitely with you - that path show have cleanup enabled by default in the version that is previous that enabled nftables as a default backend.
That said, this KEP is purely about (1). So an operator has to make an explicit action to enable the feature.
In such case, I expect that an operator is actually aware of consequences and know that some additional actions may need to be taken on rollback.
Because you don't need these additional actions by default - you only need them if you explicitly opted-in for using this feature for your cluster. Especially, that would really be the case for enabling a Beta feature.
I expect that together with graduating this KEP to beta, we will enable the cleanup logic by default. so once the feature reaches GA and someone opts-in for that when it's in GA, they will get cleanup for free on rollback [because it will be rolling back to state with feature already enabled].
Does that make sense to you?
performance as good as iptables and ipvs. | ||
|
||
- Documentation describes any changes in behavior between the | ||
`iptables` and `ipvs` modes and the `nftables` mode. Any warnings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - this is a perfect example of thing that we should have (for Beta).
implementations, etc, may need updates in order to work with it. (This | ||
may further limit the amount of testing the new mode can get during | ||
the Alpha phase, if it is not yet compatible with popular network | ||
plugins at that point.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this - do we have some mitigation here that we want to mention?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not really; we can try to avoid adding making it gratuitously different, but we can't avoid the fact that it's going to be different.
(This is a separate "don't be too different" concern from the user-level behavioral differences.)
presented: it allows you to run a docker registry in a pod and then | ||
have nodes use a NodePort service via `127.0.0.1` to access that | ||
registry. Docker treats `127.0.0.1` as an "insecure registry" by | ||
default (though containerd and cri-o do not) and so does not require |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you're all better positioned to make this decision. My only request for Beta will be that will not work the same way in nftables mode is easily to discover for operators of existing clusters (metrics).
Anyway sig network can decide wether the code will go in tree or not ? Seemed like there was openness to considering different options ( out vs in tree ).... |
|
||
[KEP-3786]: https://github.com/kubernetes/enhancements/issues/3786 | ||
|
||
### Goals |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a goal like "Design and document where and how external actors can plug into the packet processing logic safely". We never made this easy in iptables mode, e.g. if I wanted to -j LOG
for a particular service.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this below, but maybe a nod here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's there in Goals:
- Document specific details of the nftables implementation that we
want to consider as "API". In particular, document the
high-level behavior that authors of network plugins can rely
on. We may also document ways that third parties or
administrators can integrate with kube-proxy's rules at a lower
level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, wonderful KEP. Very comprehensive and clear. Thanks.
Second, I did not get to the PRR part yet, but here are my comments.
Third, this is getting ahead of ourselves, but.... In the future, I hope to see Service LBs be replaced by Gateway and maybe even Servicer ClusterIPs. This would enable things like all-ports and all-protocols. As we go about buildingin this it would be great to keep an eye on "if we were watching Gateways, that logic would go here"?
presented: it allows you to run a docker registry in a pod and then | ||
have nodes use a NodePort service via `127.0.0.1` to access that | ||
registry. Docker treats `127.0.0.1` as an "insecure registry" by | ||
default (though containerd and cri-o do not) and so does not require |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we can, cautiously, kill this. We should do some outreach through usual channels and see if anyone screams.
``` | ||
<<[UNRESOLVED dnat-but-no-route_localnet ]>> | ||
|
||
As a possible compromise, we could make the `nftables` backend create |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's keep this in our pocket, in case a real use-case emerges, but not implement until then?
The nftables proxy should default to only opening NodePorts on a | ||
single interface, probably the interface with the default route by | ||
default. (Ideally, you really want it to accept NodePorts on the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are revisiting this from scratch, let's dial it ALL the way back. I like starting with the IP with default route. If that is not sufficient, IPs matching a flag is pretty powerful. If that is not sufficient, we can look at other options (same as detect-local-mode)
``` | ||
<<[UNRESOLVED unused service IP ports ]>> | ||
|
||
@thockin has suggested that service IPs should reject connections on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should reject or drop, but almost certainly not ignore such connections. I lean towards reject, but I can see arguments in both directions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reject improves supportability, the error is more meaningful "connection refused" indicates someone does not want you to go through , drop is "connection timeout"
I've seen thousands of those red herrings in customers tickets at SUSE, at RedHat and at Google now, it is easy to write a playbook as "connect to a service and connection refused check if there are endpoints", however, "connect timeout check if there are endpoints and if everything in the path is not filtering the packet"
|
||
@thockin has suggested that service IPs should reject connections on | ||
ports they aren't using. (This would most easily be implemented by | ||
adding a `--service-cidr` flag to kube-proxy so we could just "reject |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the less things that we teach about things like service-cidr the better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with KEP-1880 kube-proxy could just watch the ServiceCIDR objects, and no one would need to explicitly tell it anything
<<[UNRESOLVED service IP pings ]>> | ||
|
||
Users sometimes get confused by the fact that service IPs do not | ||
respond to ICMP pings, and perhaps this is something we could change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Supporting ping is nice and friendly, and I don't see any reason NOT to really, but I also don't see a strong reason in favor. One could argue that it's an info-leak but DNS is already in play :)
|
||
[KEP-3786]: https://github.com/kubernetes/enhancements/issues/3786 | ||
|
||
### Goals |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this below, but maybe a nod here?
<<[/UNRESOLVED]>> | ||
``` | ||
|
||
#### Defining an API for integration with admin/debug/third-party rules |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to this!
f448bf6
to
9aa7e13
Compare
Hopefully everyone should be happy with the latest updates:
|
9aa7e13
to
1be36b7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have just one minor (non-blocking) comment at this point - it looks good enough for Alpha to me.
@danwinship - I'm waiting for SIG-level approval and then ping me on slack for PRR approval.
|
||
The actual feature gate enablement/disablement itself is not | ||
interesting, since it only controls whether `--proxy-mode nftables` | ||
can be selected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it would make sense to summarize this comment in the KEP for the future (or maybe just link this comment thread)?
1be36b7
to
b276ce5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 options here IMO.
- mark it
provisional
, merge it as is. Iterate. - leave it
implementable
, do the tidy up to be ready for alpha and trackable as beta, merge that
Honestly I recommend the provisional
and iterate approach. Not my call though.
this LGTM for alpha from PRR pov /approve PRR |
This is pretty much ready for alpha - it's fine to have unresolved items, these are blocker for beta, not for alpha. |
Just my 2cents, Is there a reason we are continuing to bloat in tree vs advancing the out of tree options like is being done extensively throughout the k8s codebase. I get the need for nftables but would think pushing forwards an agreed upon out of tree model would make more sense than bloating up in tree logic. |
The move out of tree might be a different enhancement. This one is a time-sensitive piece specifically about nftables. |
/lgtm |
Hi @kubernetes/sig-network-leads, 👋 1.29 Enhancement Lead here. Just wanted to confirm- is this waiting for an approval label or is more discussion required? |
Sorry, I thought it was approved! /approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: danwinship, soltysh, thockin, wojtek-t The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
I would like to give this a go-ahead for 1.29 please |
One-line PR description:
Initial KEP proposing a kube-proxy nftables mode.
Issue link:
nftables kube-proxy backend #3866
Other comments: