What happened:
We have a fleet of 40 clusters with service discovery enabled via Submariner. We rely on Submariner to sync EndpointSlices across the fleet, and we currently have over 1,500 EndpointSlices in our broker cluster.
We have noticed that a cluster in our fleet can stop syncing EndpointSlices from the broker for up to 10 minutes. This happens when the submariner-lighthouse-agent pod restarts on the cluster. We have seen pod restarts due to preemption by the Kubernetes scheduler and due to nodes being tainted and drained for maintenance. We fixed the former by increasing the submariner-lighthouse-agent scheduling priority, but node draining still affects submariner-lighthouse-agent availability.
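For reference, a minimal sketch of the scheduling-priority mitigation mentioned above, using client-go. The PriorityClass name and value and the submariner-operator namespace are assumptions to adjust for your installation (the same effect may be achievable through Helm values if the chart exposes priorityClassName).

```go
// Sketch: give the lighthouse agent a higher scheduling priority so it is
// not preempted. Names, namespace, and priority value are assumptions.
package main

import (
	"context"

	schedulingv1 "k8s.io/api/scheduling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Create a dedicated PriorityClass for the agent.
	pc := &schedulingv1.PriorityClass{
		ObjectMeta: metav1.ObjectMeta{Name: "submariner-lighthouse-critical"}, // hypothetical name
		Value:      1000000,                                                   // above regular workloads, below system-critical
	}
	if _, err := client.SchedulingV1().PriorityClasses().Create(ctx, pc, metav1.CreateOptions{}); err != nil {
		panic(err)
	}

	// Point the agent Deployment at the new PriorityClass
	// (namespace "submariner-operator" is an assumption).
	patch := []byte(`{"spec":{"template":{"spec":{"priorityClassName":"submariner-lighthouse-critical"}}}}`)
	if _, err := client.AppsV1().Deployments("submariner-operator").Patch(
		ctx, "submariner-lighthouse-agent", types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}
}
```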
When the agent pod restarts, the agent resyncs all EndpointSlices from the broker to the cluster; the processing appears to be done in alphabetical order by EndpointSlice name/namespace. Even if EndpointSlices are updated on the broker during this resync, those update events are not processed until the loop reaches them in alphabetical order. Because of the volume of EndpointSlices in our broker (over 1,500), EndpointSlices near the end of the resync list are processed only after about 10 minutes.
What you expected to happen:
Is it possible to improve the availability of the submariner-lighthouse-agent so that it can tolerate occasional pod restarts? One solution would be to increase the agent replica count and add leader election, so the agent keeps running even when one of the pods goes down. We are open to discussing better alternatives.
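For illustration, a minimal sketch of the leader-election idea using the stock client-go Lease-based election, not anything Submariner-specific. The lock name, namespace, and POD_NAME environment variable (set via the downward API) are assumptions; a warm standby replica would take over when the active pod is drained.

```go
// Sketch: run multiple agent replicas and elect one active syncer.
// NOT Submariner code; lock name/namespace below are assumptions.
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "submariner-lighthouse-agent", // hypothetical lock name
			Namespace: "submariner-operator",         // adjust to your install namespace
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Start the EndpointSlice sync loop here. Informer caches on the
				// standby replica could already be warm, shortening failover.
			},
			OnStoppedLeading: func() {
				// Exit so the standby replica can take over.
				os.Exit(0)
			},
		},
	})
}
```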
How to reproduce it (as minimally and precisely as possible):
Create two clusters and join them to the broker.
Create 1,500 headless Services with matching ServiceExports in cluster A. This should create 1,500 EndpointSlices in the broker (a helper sketch for this step follows the steps below).
Restart the lighthouse agent in cluster B.
Restart a Deployment backing one of the headless Services in cluster A to trigger a pod IP change in its EndpointSlice.
Observe that the EndpointSlice in the broker cluster is updated immediately, but the EndpointSlice in cluster B takes some time to pick up the latest changes.
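A minimal sketch of the bulk-creation step above, assuming a Go environment with client-go available. It uses the dynamic client for the multicluster.x-k8s.io ServiceExport resource; the namespace, naming scheme, and port are placeholders, and the backing Deployments/pods are assumed to exist separately.

```go
// Sketch: bulk-create 1,500 headless Services and matching ServiceExports
// on cluster A so Lighthouse exports them to the broker.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	dyn := dynamic.NewForConfigOrDie(cfg)
	exportGVR := schema.GroupVersionResource{
		Group: "multicluster.x-k8s.io", Version: "v1alpha1", Resource: "serviceexports",
	}
	ctx := context.Background()
	ns := "scale-test" // assumed namespace, create it beforehand

	for i := 0; i < 1500; i++ {
		name := fmt.Sprintf("hs-%04d", i)

		// Headless Service (ClusterIP: None) selecting some backing pods.
		svc := &corev1.Service{
			ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns},
			Spec: corev1.ServiceSpec{
				ClusterIP: corev1.ClusterIPNone,
				Selector:  map[string]string{"app": name},
				Ports:     []corev1.ServicePort{{Name: "http", Port: 8080}},
			},
		}
		if _, err := client.CoreV1().Services(ns).Create(ctx, svc, metav1.CreateOptions{}); err != nil {
			panic(err)
		}

		// Matching ServiceExport so Lighthouse syncs the service to the broker.
		export := &unstructured.Unstructured{Object: map[string]interface{}{
			"apiVersion": "multicluster.x-k8s.io/v1alpha1",
			"kind":       "ServiceExport",
			"metadata":   map[string]interface{}{"name": name, "namespace": ns},
		}}
		if _, err := dyn.Resource(exportGVR).Namespace(ns).Create(ctx, export, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}
```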
Anything else we need to know?:
Environment: Linux
Diagnose information (use subctl diagnose all):
Gather information (use subctl gather):
Cloud provider or hardware configuration: Azure Kubernetes
Install tools: Submariner Helm chart
Others:
This looks similar to #1623, which eliminated the periodic resync, but all EndpointSlices are still processed on startup, and they're all added to the queue quickly, which kicks in the rate limiter.
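To illustrate the effect, here is a minimal sketch assuming the agent funnels broker events through a client-go style rate-limited workqueue; the actual syncer code and limiter settings in Submariner/admiral may differ. Enqueueing every EndpointSlice at startup drains the queue at the limiter's steady rate, so a large broker inventory delays the later items by minutes even before any per-item sync work:

```go
// Sketch: how a burst of adds drains through a client-go rate-limited
// workqueue. Illustrative only; not Submariner's actual syncer code.
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Default controller rate limiter: per-item exponential backoff
	// combined with an overall 10 qps / burst-100 token bucket.
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	start := time.Now()
	for i := 0; i < 1500; i++ {
		// Simulates a startup listing that enqueues every EndpointSlice at once.
		q.AddRateLimited(fmt.Sprintf("endpointslice-%04d", i))
	}

	for i := 0; i < 1500; i++ {
		item, _ := q.Get()
		q.Forget(item)
		q.Done(item)
	}
	// After the 100-item burst, items become available at ~10/s, so 1,500
	// items take on the order of minutes to drain.
	fmt.Printf("drained 1500 items in %s\n", time.Since(start))
}
```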