
Lighthouse agent is slow to process current EndpointSlice update events from the broker during resync after restart #1706

Open
t0lya opened this issue Jan 16, 2025 · 3 comments
Labels
bug Something isn't working

Comments

t0lya commented Jan 16, 2025

What happened:
We have a fleet of 40 clusters with service discovery enabled using Submariner. We rely on Submariner to sync EndpointSlices across the fleet, and we currently have over 1500 EndpointSlices in our broker cluster.

We have noticed an issue where a cluster in our fleet can stop syncing EndpointSlices from the broker for up to 10 minutes. This happens when the submariner-lighthouse-agent pod restarts on the cluster. We have seen pod restarts due to preemption by the Kubernetes scheduler and due to nodes getting tainted and drained for maintenance. We fixed the former by increasing the submariner-lighthouse-agent scheduling priority, but node draining still affects submariner-lighthouse-agent availability.

When the agent pod restarts, we see the agent resync all EndpointSlices from the broker to the cluster (the processing appears to be done in EndpointSlice name/namespace alphabetical order). Even if EndpointSlices are updated on the broker during this resync, those update events do not get processed until the processing loop reaches them in alphabetical order. Due to the volume of EndpointSlices in our broker (> 1500), EndpointSlices near the end of the resync list get processed only after about 10 minutes.

What you expected to happen:

Is it possible to improve the availability of submariner-lighthouse-agent so it can tolerate occasional pod restarts? One solution is to increase the agent replica count and add leader election so that the agent keeps running even when one of the pods goes down (a rough sketch is included below). We can discuss better alternative solutions.
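
For reference, here is a minimal sketch of what the proposed leader election could look like using client-go's leaderelection package. The lease name, namespace, POD_NAME environment variable, and the startSyncers hook are assumptions for illustration, not how the agent is wired today.

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// One Lease shared by all agent replicas; only the holder runs the sync
	// loops, the others stay warm and take over if the leader goes away.
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "submariner-lighthouse-agent", // hypothetical lease name
			Namespace: "submariner-operator",         // hypothetical namespace
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// startSyncers(ctx) // hypothetical: where the existing controllers would run
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				// Exit so the pod restarts and rejoins the election cleanly.
				os.Exit(0)
			},
		},
	})
}
```

A standby replica would take over within roughly LeaseDuration, although the new leader would still have to warm its own caches before it can resume syncing.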

How to reproduce it (as minimally and precisely as possible):

Create 2 clusters and join them to the broker.
Create 1500 headless Services/ServiceExports in cluster A (a sketch is shown after these steps). This should create 1500 EndpointSlices in the broker.
Restart the lighthouse agent in cluster B.
Restart a deployment backing one of the headless Services in cluster A to trigger a pod IP change in its EndpointSlice.
Observe that the EndpointSlice in the broker cluster gets updated immediately, but the EndpointSlice in cluster B will take some time to receive the latest changes.
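
For step 2, a rough sketch (hypothetical names throughout) of generating the 1500 headless Services and matching ServiceExports with client-go. It assumes the MCS ServiceExport CRD (multicluster.x-k8s.io/v1alpha1) is installed, as it is in a Submariner-joined cluster, and that a deployment labeled app=repro exists to back the Services.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	core := kubernetes.NewForConfigOrDie(cfg)
	dyn := dynamic.NewForConfigOrDie(cfg)

	exportGVR := schema.GroupVersionResource{
		Group: "multicluster.x-k8s.io", Version: "v1alpha1", Resource: "serviceexports",
	}
	ns := "default" // hypothetical namespace

	for i := 0; i < 1500; i++ {
		name := fmt.Sprintf("repro-svc-%04d", i)

		// Headless Service (ClusterIP: None) selecting the test deployment's pods.
		svc := &corev1.Service{
			ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns},
			Spec: corev1.ServiceSpec{
				ClusterIP: corev1.ClusterIPNone,
				Selector:  map[string]string{"app": "repro"},
				Ports:     []corev1.ServicePort{{Name: "http", Port: 80}},
			},
		}
		if _, err := core.CoreV1().Services(ns).Create(context.TODO(), svc, metav1.CreateOptions{}); err != nil {
			panic(err)
		}

		// Matching ServiceExport so Lighthouse syncs the EndpointSlice to the broker.
		export := &unstructured.Unstructured{Object: map[string]interface{}{
			"apiVersion": "multicluster.x-k8s.io/v1alpha1",
			"kind":       "ServiceExport",
			"metadata":   map[string]interface{}{"name": name, "namespace": ns},
		}}
		if _, err := dyn.Resource(exportGVR).Namespace(ns).Create(context.TODO(), export, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}
```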

Anything else we need to know?:

Environment: Linux

  • Diagnose information (use subctl diagnose all):
  • Gather information (use subctl gather):
  • Cloud provider or hardware configuration: Azure Kubernetes
  • Install tools: Submariner Helm chart
  • Others:
t0lya added the bug (Something isn't working) label on Jan 16, 2025
tpantelis (Contributor) commented

What version of Submariner are you using?

leasonliu commented

> What version of Submariner are you using?

0.17.3

tpantelis (Contributor) commented

This looks similar to #1623, which eliminated the periodic resync, but all EndpointSlices are still processed on startup, and they're all added to the queue quickly enough that the rate limiter kicks in.
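
For illustration, here is roughly how client-go's default controller rate limiter behaves when ~1500 keys are enqueued in a burst. This is a standalone sketch, not the agent's actual queue wiring, and the agent may use different limits, but it shows how a startup-time flood of keys translates into minutes of delay for the keys scheduled last.

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// client-go's default controller rate limiter is the max of a per-item
	// exponential backoff (5ms base) and an overall token bucket
	// (10 events/sec with a burst of 100).
	limiter := workqueue.NewMaxOfRateLimiter(
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)

	// Enqueue 1500 distinct keys back to back, as happens when every broker
	// EndpointSlice is listed and queued at startup.
	var lastDelay time.Duration
	for i := 0; i < 1500; i++ {
		lastDelay = limiter.When(fmt.Sprintf("submariner-ns/endpointslice-%04d", i))
	}

	// Once the 100-token burst is used up, keys are released at 10/sec, so the
	// last key is scheduled roughly (1500-100)/10 = 140 seconds out.
	fmt.Printf("delay assigned to the last key: %s\n", lastDelay)
}
```

Running this prints a delay of about 2m20s for the last key; with a lower effective rate or more per-key work, that could plausibly stretch toward the 10 minutes reported above.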
