
Lighthouse agent is slow to process current EndpointSlice update events from the broker during resync after restart #1706

Open
t0lya opened this issue Jan 16, 2025 · 3 comments
Labels
bug Something isn't working

Comments

t0lya commented Jan 16, 2025

What happened:
We have a fleet of 40 clusters with service discovery enabled using Submariner. We rely on Submariner to sync EndpointSlices across the fleet, and we currently have over 1500 EndpointSlices in our broker cluster.

We have noticed an issue where a cluster in our fleet can stop syncing EndpointSlices from the broker for up to 10 minutes. This happens when the submariner-lighthouse-agent pod restarts on the cluster. We have seen pod restarts due to preemption by the Kubernetes scheduler and due to nodes getting tainted and drained for maintenance. We fixed the former by increasing the submariner-lighthouse-agent scheduling priority, but node draining still affects submariner-lighthouse-agent availability.

When the agent pod restarts, we see the agent resync all EndpointSlices from the broker to the cluster (the processing appears to be done in EndpointSlice name/namespace alphabetical order). Even if EndpointSlices are updated on the broker during this resync, those update events do not get processed until the processing loop reaches them in alphabetical order. Due to the volume of EndpointSlices in our broker (> 1500), EndpointSlices near the end of the resync list get processed only after about 10 minutes.

What you expected to happen:

Is it possible to improve the availability of submariner-lighthouse-agent so it can tolerate occasional pod restarts? One solution is to increase the agent replica count and add leader election so that the agent keeps running even when one of the pods goes down (a rough sketch is included below). We can discuss better alternative solutions.
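
For reference, here is a minimal sketch of what the proposed leader election could look like using client-go's leaderelection package. The lease name, namespace, POD_NAME environment variable, and the startSyncers hook are assumptions for illustration, not how the agent is wired today.

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// One Lease shared by all agent replicas; only the holder runs the sync
	// loops, the others stay warm and take over if the leader goes away.
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "submariner-lighthouse-agent", // hypothetical lease name
			Namespace: "submariner-operator",         // hypothetical namespace
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// startSyncers(ctx) // hypothetical: where the existing controllers would run
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				// Exit so the pod restarts and rejoins the election cleanly.
				os.Exit(0)
			},
		},
	})
}
```

A standby replica would take over within roughly LeaseDuration, although the new leader would still have to warm its own caches before it can resume syncing.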

How to reproduce it (as minimally and precisely as possible):

Create 2 clusters and join them to the broker.
Create 1500 headless Services/ServiceExports in cluster A (a sketch is shown after these steps). This should create 1500 EndpointSlices in the broker.
Restart the lighthouse agent in cluster B.
Restart a deployment backing one of the headless Services in cluster A to trigger a pod IP change in its EndpointSlice.
Observe that the EndpointSlice in the broker cluster gets updated immediately, but the EndpointSlice in cluster B will take some time to receive the latest changes.
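
For step 2, a rough sketch (hypothetical names throughout) of generating the 1500 headless Services and matching ServiceExports with client-go. It assumes the MCS ServiceExport CRD (multicluster.x-k8s.io/v1alpha1) is installed, as it is in a Submariner-joined cluster, and that a deployment labeled app=repro exists to back the Services.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	core := kubernetes.NewForConfigOrDie(cfg)
	dyn := dynamic.NewForConfigOrDie(cfg)

	exportGVR := schema.GroupVersionResource{
		Group: "multicluster.x-k8s.io", Version: "v1alpha1", Resource: "serviceexports",
	}
	ns := "default" // hypothetical namespace

	for i := 0; i < 1500; i++ {
		name := fmt.Sprintf("repro-svc-%04d", i)

		// Headless Service (ClusterIP: None) selecting the test deployment's pods.
		svc := &corev1.Service{
			ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns},
			Spec: corev1.ServiceSpec{
				ClusterIP: corev1.ClusterIPNone,
				Selector:  map[string]string{"app": "repro"},
				Ports:     []corev1.ServicePort{{Name: "http", Port: 80}},
			},
		}
		if _, err := core.CoreV1().Services(ns).Create(context.TODO(), svc, metav1.CreateOptions{}); err != nil {
			panic(err)
		}

		// Matching ServiceExport so Lighthouse syncs the EndpointSlice to the broker.
		export := &unstructured.Unstructured{Object: map[string]interface{}{
			"apiVersion": "multicluster.x-k8s.io/v1alpha1",
			"kind":       "ServiceExport",
			"metadata":   map[string]interface{}{"name": name, "namespace": ns},
		}}
		if _, err := dyn.Resource(exportGVR).Namespace(ns).Create(context.TODO(), export, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}
```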

Anything else we need to know?:

Environment: Linux

  • Diagnose information (use subctl diagnose all):
  • Gather information (use subctl gather):
  • Cloud provider or hardware configuration: Azure Kubernetes
  • Install tools: Submariner Helm chart
  • Others:
t0lya added the bug (Something isn't working) label on Jan 16, 2025
tpantelis (Contributor) commented

What version of Submariner are you using?

leasonliu commented

> What version of Submariner are you using?

0.17.3

tpantelis (Contributor) commented

This looks similar to #1623, which eliminated the periodic resync, but all EndpointSlices are still processed on startup, and they're all added to the queue quickly enough that the rate limiter kicks in.
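
For illustration, here is roughly how client-go's default controller rate limiter behaves when ~1500 keys are enqueued in a burst. This is a standalone sketch, not the agent's actual queue wiring, and the agent may use different limits, but it shows how a startup-time flood of keys translates into minutes of delay for the keys scheduled last.

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// client-go's default controller rate limiter is the max of a per-item
	// exponential backoff (5ms base) and an overall token bucket
	// (10 events/sec with a burst of 100).
	limiter := workqueue.NewMaxOfRateLimiter(
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)

	// Enqueue 1500 distinct keys back to back, as happens when every broker
	// EndpointSlice is listed and queued at startup.
	var lastDelay time.Duration
	for i := 0; i < 1500; i++ {
		lastDelay = limiter.When(fmt.Sprintf("submariner-ns/endpointslice-%04d", i))
	}

	// Once the 100-token burst is used up, keys are released at 10/sec, so the
	// last key is scheduled roughly (1500-100)/10 = 140 seconds out.
	fmt.Printf("delay assigned to the last key: %s\n", lastDelay)
}
```

Running this prints a delay of about 2m20s for the last key; with a lower effective rate or more per-key work, that could plausibly stretch toward the 10 minutes reported above.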
