
Contour leader doesn't update endpoints in xDS cache after upstream pods recreation #6743

Open
philimonoff opened this issue Oct 28, 2024 · 13 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/needs-triage Indicates that an issue needs to be triaged by a project contributor.

Comments

@philimonoff

What steps did you take and what happened:

  1. There are about 8,000 HTTPProxy objects with the same ingress class.
  2. There are two Contour pods (leader and replica) and four Envoy pods (DaemonSet).
  3. We recreate the pods of an application that serve as upstreams of the corresponding Envoy cluster.
  4. After these pods are recreated, the Contour replica has the IP addresses of the new pods as endpoints for this Envoy cluster in EDS (checked via contour cli).
  5. The Contour leader still has the IP addresses of the old (deleted) pods as endpoints for this Envoy cluster in EDS (via contour cli).
  6. Envoy pods connected to the Contour leader return 503 errors for requests to the corresponding hosts.
  7. Envoy pods connected to the Contour replica serve requests correctly.
  8. Recreating the Contour pods fixes the problem for a while.

What did you expect to happen:

The leader pod updates its state after the application's pods are recreated.

Anything else you would like to add:

Environment:

  • Contour version: 1.29.2
  • Kubernetes version: (use kubectl version): 1.25.16, 1.27.16
  • Kubernetes installer & version: kops 1.26.5, kubeadm 1.27.16
  • Cloud provider or hardware configuration: AWS, Openstack
  • OS (e.g. from /etc/os-release): Ubuntu 24.04 LTS
@philimonoff philimonoff added kind/bug Categorizes issue or PR as related to a bug. lifecycle/needs-triage Indicates that an issue needs to be triaged by a project contributor. labels Oct 28, 2024

Hey @philimonoff! Thanks for opening your first issue. We appreciate your contribution and welcome you to our community! We are glad to have you here and to have your input on Contour. You can also join us on our mailing list and in our channel in the Kubernetes Slack Workspace

@tsaarni
Member

tsaarni commented Oct 28, 2024

Hi @philimonoff, I haven’t tried to reproduce this yet, but I wanted to ask - does the issue depend on having a large number of HTTPProxies, or have you observed it occurring with fewer (or even a single) HTTPProxy as well?

@philimonoff
Author

@tsaarni thank you for the quick response. We don't see this on small installations. I can't say exactly how many proxies trigger it, but the situation occurs occasionally on an installation with 5,000 proxies; with 8,000 or more it is a consistent pattern.

@tsaarni
Member

tsaarni commented Oct 28, 2024

@philimonoff Could this be due to rate limiting? The API server client library limits its request rate, which can cause significant delays when a large number of resources change simultaneously. You could try adjusting these parameters in the Contour deployment for the contour serve command to see if it helps: --kubernetes-client-qps=<qps> and --kubernetes-client-burst=<burst>. Use large values, such as 100 or higher, to see whether it makes a difference. For details, check out this article.
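For reference, roughly what those two flags tune under the hood: client-go's client-side rate limiter, driven by the QPS and Burst fields on rest.Config. This is a minimal sketch, not Contour's actual code, and the host value is just a placeholder.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/rest"
)

func main() {
	// Equivalent intent of --kubernetes-client-qps=100 --kubernetes-client-burst=150:
	// the rate limiter allows roughly 100 sustained requests/s with bursts up to 150.
	cfg := &rest.Config{
		Host:  "https://kubernetes.default.svc", // placeholder API server address
		QPS:   100,
		Burst: 150,
	}
	fmt.Printf("client rate limits: qps=%v burst=%v\n", cfg.QPS, cfg.Burst)
}
```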

@philimonoff
Author

@tsaarni I tried 100 QPS and 150 burst, and it didn't help. Worse, all hosts started returning 503s, so I removed these flags. I don't know what happened, because it's a production environment and I can't let it stay broken.

@tsaarni
Member

tsaarni commented Oct 30, 2024

@philimonoff Unfortunately, at this point I don't have any other ideas about what could be causing the issue. I assume you've already checked the leader's logs for errors? It's possible the Contour pod is under heavy resource constraints (such as CPU), but if that were the case, I'd expect it to affect contour cli responses as well, which didn't seem to be the case.

@philimonoff
Author

@tsaarni before opening this issue, I had already tried reading Contour's debug logs (they are emitted at a very high rate), recording pprof sessions and traces (nothing suspicious), and watching every metric Contour exposes. My next idea is to add my own logging at each step of an EndpointSlice's path from the API server to the xDS cache. Right now I can't even imagine what the cause is.

@philimonoff
Author

@tsaarni hello again.
I've spent the last month trying to debug this problem in Contour, and I now suspect the problem isn't even in the Contour code itself. Once the onUpdate method is called, I can see the changes in the xDS delta stream immediately. As I understand it, onUpdate is called by this informer, which is already part of the controller manager, not Contour.
The problem is that EndpointSlice changes reach the Contour replica pod faster than the leader pod. Sometimes it takes tens of minutes for them to reach the leader.
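For context, this is the kind of path I mean. A minimal sketch assuming a plain client-go shared informer for EndpointSlices; Contour's real wiring goes through its own handlers, so the names below are illustrative only:

```go
package main

import (
	"fmt"
	"time"

	discoveryv1 "k8s.io/api/discovery/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes the process runs inside the cluster
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	informer := factory.Discovery().V1().EndpointSlices().Informer()

	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			slice := newObj.(*discoveryv1.EndpointSlice)
			// Roughly where a translator's onUpdate would be invoked; logging the
			// timestamp here on leader vs. replica would show whether the delay
			// is already present at the watch/informer layer.
			fmt.Printf("EndpointSlice updated: %s/%s\n", slice.Namespace, slice.Name)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // keep the process alive; events arrive on the informer's goroutines
}
```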

One more question: what amount of resources would you consider sufficient for a setup with 8,000 proxies? In our case we request 3 cores and set no limits. Each pod runs on a node with 8 cores, which means GOMAXPROCS=8, but the leader pod consumes only ~2500m of CPU. Could GOMAXPROCS be too low for this case?

@tsaarni
Member

tsaarni commented Dec 6, 2024

What amount of resources would you consider sufficient for a setup with 8,000 proxies?

Hi @philimonoff, sorry, I don't know. Unfortunately I have no experience running that kind of workload. I would have expected pprof to reveal something suspicious if there was contention, but you already checked that with no findings...

@tsaarni
Member

tsaarni commented Dec 6, 2024

I still wonder if the issue could be related to Kubernetes API rate limiting, as discussed previously. From what I understand, a large number of resources need to be updated at the same time. If rate limiting delays the propagation of status updates, I wonder whether that could negatively impact configuration updates as well.

@philimonoff
Author

Hi @tsaarni!

We conducted additional tests and found that the increased load on Contour pods is not directly related to whether a pod is the leader but instead depends on the number of Envoy pods connected to a Contour pod.

Our rolling update pattern for Contour pods had resulted in a situation where all Envoy pods were connected only to the leader pod. Based on this, our current understanding is that the load on Contour is directly proportional to the number of EDS subscriptions.

Here’s the scenario we analyzed:

  • Each Envoy creates a separate EDS watch for every cluster.
  • In our example, the Kubernetes cluster contains 8,000 HTTPProxy objects, and 4 Envoy pods are connected to a single Contour pod.
  • This results in 32,000 subscriptions to the Contour pod (8,000 ClusterLoadAssignments × 4 Envoys).

The main issue is that whenever the Contour pod receives an update for a single EndpointSlice from the Kubernetes API, it triggers updates across all subscriptions. Specifically, updates for all ClusterLoadAssignments are sent to all subscriptions each time an event for an associated EndpointSlice is processed.

Although there is no explicit CPU throttling during these moments, we observed significant CPU pressure stalls (as described in the PSI documentation: link).

Here’s the root cause of the issue:

  1. The endpointSliceTranslator.onUpdate() method is called.
  2. Inside this method, e.Observer.Refresh() is invoked.
  3. During the Refresh() call, snapshotCache.SetSnapshot is triggered.

The default cache used by go-control-plane updates the entire snapshot for every SetSnapshot call, which is why all subscribers receive new data.
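To illustrate the fan-out, here is a sketch against go-control-plane's v3 snapshot cache (not Contour's code; exact signatures depend on the go-control-plane version):

```go
package main

import (
	"context"
	"fmt"

	endpointv3 "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
	"github.com/envoyproxy/go-control-plane/pkg/cache/types"
	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	resourcev3 "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
)

func main() {
	// ADS disabled, node IDs used directly as cache keys, no logger.
	snapshotCache := cachev3.NewSnapshotCache(false, cachev3.IDHash{}, nil)

	// Suppose only "cluster-42" changed: the snapshot must still carry every
	// ClusterLoadAssignment we want Envoy to keep (all ~8,000 of them).
	clas := []types.Resource{
		&endpointv3.ClusterLoadAssignment{ClusterName: "cluster-42"},
		// ... plus all the unchanged assignments ...
	}

	snap, err := cachev3.NewSnapshot("v2", map[resourcev3.Type][]types.Resource{
		resourcev3.EndpointType: clas,
	})
	if err != nil {
		panic(err)
	}

	// Installing the snapshot replaces the whole state for this node, so every
	// open EDS watch for the node is re-answered; that is the fan-out described above.
	if err := snapshotCache.SetSnapshot(context.Background(), "envoy-node-1", snap); err != nil {
		panic(err)
	}
	fmt.Println("full snapshot installed for envoy-node-1")
}
```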

Proposed Solution:
We suggest replacing the default cache for edsCache in the SnapshotHandler with LinearCache.

With LinearCache, we can update only the specific ClusterLoadAssignment that has changed. Since the onUpdate method already identifies the object being updated, we can perform targeted updates in LinearCache, reducing unnecessary updates and load.
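And the targeted update we have in mind, sketched with go-control-plane's LinearCache (names are illustrative; the actual wiring inside Contour's SnapshotHandler would differ):

```go
package main

import (
	"fmt"

	endpointv3 "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	resourcev3 "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
)

func main() {
	// One LinearCache dedicated to EDS (ClusterLoadAssignment) resources.
	edsCache := cachev3.NewLinearCache(resourcev3.EndpointType)

	// When an EndpointSlice event identifies the affected cluster, update just
	// that one named resource instead of rebuilding a full snapshot.
	cla := &endpointv3.ClusterLoadAssignment{ClusterName: "cluster-42"}
	if err := edsCache.UpdateResource("cluster-42", cla); err != nil {
		panic(err)
	}
	fmt.Println("updated a single ClusterLoadAssignment")
}
```

Since Envoy opens EDS subscriptions per cluster name, only the watches for the updated name should need to be re-answered, rather than all 32,000 subscriptions.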

Our Experience:
We implemented these changes in our fork and tested them. The results show that this approach resolves the issue effectively.

However, we would like to hear your thoughts on this proposal and discuss the potential for incorporating these changes into the main branch of Contour.

@saley89

saley89 commented Dec 12, 2024

Hi @philimonoff, you may be interested in reading our recently added PR #6806. Your issue reads just like what we were facing: at high numbers of HTTPProxies (in our case 5,000+), the state-of-the-world endpoint updates become so large that they overwhelm the system, which can't keep up in a timely manner.

Our PR switches the communication method from the default GRPC to ADS Delta_GRPC, and we are seeing the dramatic improvements discussed in the PR.

Let us know your thoughts/findings on that idea; so far it is working extremely well for us, from a starting point where things were essentially broken and unusable.

@VVvKamper

Just a brief update: we’ve addressed this issue with the following patch:
fix-eds-perfomance-issues.patch

It has been running smoothly on our production clusters for several days without any problems. Here are a few images illustrating the impact on a cluster with approximately 7k httpproxies.


This could be a lower-impact alternative compared to migrating to the ADS Delta gRPC xDS variant.
