Large LIST calls being made to Kube API server #10931

Closed
prateekgogia opened this issue Apr 17, 2023 · 12 comments
Labels
area/controller Controller issues, panics P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important type/bug

Comments

@prateekgogia

Summary

While debugging the load on our API server and etcd instances, I found that the Argo workflow controller makes a LIST call every minute that lists all the workflow objects in the cluster.

[Screenshot attached: Screen Shot 2023-03-27 at 9 53 46 PM]

What change needs making?
Can this implementation be switched to a WATCH call instead of using a List call?


@prateekgogia prateekgogia added the type/feature Feature request label Apr 17, 2023
@terrytangyuan
Member

Could you paste one of the complete API endpoint URLs? Do you have user agent information for those list calls?

@prateekgogia
Author

RequestURI - /apis/argoproj.io/v1alpha1/workflows?labelSelector=%21workflows.argoproj.io%2Fphase%2C%21workflows.argoproj.io%2Fcontroller-instanceid&limit=200

user agent - workflow-controller/v0.0.0 (linux/amd64) kubernetes/$Format/argo-workflows/v3.4.6 argo-controller
username - system:serviceaccount:argo:argo
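
For readers who want to see what that request is actually asking for, here is a minimal sketch that decodes the percent-encoded query string using only Go's standard library. The helper name `decodeListQuery` is made up for illustration and is not part of Argo or Kubernetes.

```go
package main

import (
	"fmt"
	"net/url"
)

// decodeListQuery is a hypothetical helper: it parses a request URI and
// returns its path plus the decoded labelSelector and limit parameters.
func decodeListQuery(requestURI string) (path, selector, limit string, err error) {
	u, err := url.ParseRequestURI(requestURI)
	if err != nil {
		return "", "", "", err
	}
	q := u.Query() // Query() percent-decodes each parameter value
	return u.Path, q.Get("labelSelector"), q.Get("limit"), nil
}

func main() {
	// The RequestURI quoted in the comment above.
	uri := "/apis/argoproj.io/v1alpha1/workflows" +
		"?labelSelector=%21workflows.argoproj.io%2Fphase%2C%21workflows.argoproj.io%2Fcontroller-instanceid" +
		"&limit=200"

	path, selector, limit, err := decodeListQuery(uri)
	if err != nil {
		panic(err)
	}
	fmt.Println("path:         ", path)
	fmt.Println("labelSelector:", selector)
	fmt.Println("limit:        ", limit)
}
```

In Kubernetes label selector syntax a leading `!` means "this label key is not set", so the decoded selector matches workflows that have neither a phase label nor a controller-instanceid label, fetched in pages of up to 200 objects.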

@tooptoop4
Contributor

I guess it's the liveness probe: https://github.com/argoproj/argo-workflows/blob/v3.4.7/workflow/controller/healthz.go#L34

@terrytangyuan
Member

The workaround is to remove your liveness probe or reduce the limit via HEALTHZ_LIST_LIMIT. The list call should then be very minimal.

@prateekgogia
Author

Thanks. So if I understand correctly, HEALTHZ_LIST_LIMIT can limit the number of workflow objects requested in the List API call?

@terrytangyuan
Member

Yes. It should reduce the load.

@prateekgogia
Author

Thanks for confirming. I am double-checking with our etcd team because there has been some discussion about how the limit param can also cause excessive load. I will get back once I have an answer from the etcd team.

@andrewsykim

andrewsykim commented Apr 20, 2023

I experienced a similar issue with the workflow controller, except it was doing large LIST requests for Pods. It seems the workflow controller is issuing periodic list requests without setting resourceVersion and with a labelSelector, which requires the apiserver to fetch objects directly from etcd instead of using its in-memory cache, generating heavy load on both the apiserver and etcd.

Ideally this sort of request would use the controller list/watch pattern instead of periodic lists. Google has some documentation on this here: https://cloud.google.com/kubernetes-engine/docs/concepts/planning-scalability#use_list_and_watch_pattern_instead_of_periodic_listing

Here's an example LIST request that was generating a lot of load. I redacted some fields that aren't relevant.

"HTTP" verb="LIST" URI="/api/v1/namespaces/<namespace>/pods?labelSelector=workflows.argoproj.io%2Fworkflow%3D<workflow>" latency="5.232824206s" userAgent="workflow-controller/v0.0.0 (linux/amd64) kubernetes/$Format"
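
As a small illustration of the resourceVersion point, the sketch below builds two variants of a Pod LIST URI using only Go's standard library. Per the Kubernetes API documentation, a LIST with resourceVersion unset asks for the most recent data, while `resourceVersion=0` accepts any version and may be answered from the apiserver's cache (possibly stale). The helper `podListURI`, the namespace `argo`, and the workflow name `my-wf` are hypothetical; a real controller would normally use a client-go informer (list once, then watch) rather than hand-built URLs.

```go
package main

import (
	"fmt"
	"net/url"
)

// podListURI is a hypothetical helper that builds a Pod LIST URI for a
// namespace and label selector. A non-empty resourceVersion is added as a
// query parameter; an empty one is omitted entirely.
func podListURI(namespace, selector, resourceVersion string) string {
	q := url.Values{}
	q.Set("labelSelector", selector)
	if resourceVersion != "" {
		q.Set("resourceVersion", resourceVersion)
	}
	return "/api/v1/namespaces/" + url.PathEscape(namespace) + "/pods?" + q.Encode()
}

func main() {
	// Placeholder values standing in for the redacted fields in the log line.
	selector := "workflows.argoproj.io/workflow=my-wf"

	// No resourceVersion: the shape of the request in the log line above.
	fmt.Println(podListURI("argo", selector, ""))

	// resourceVersion=0: the apiserver may serve this from its cache.
	fmt.Println(podListURI("argo", selector, "0"))
}
```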

@andrewsykim

andrewsykim commented Apr 20, 2023

Dug around and see that this issue has been fixed already! #4024

I believe the version of the workflow controller used in this cluster did not include this performance improvement.

@terrytangyuan
Member

terrytangyuan commented Apr 20, 2023

Would you like to try a potential fix in this new image tag, argoproj/workflow-controller:dev-fix-list-load? It will be ready once all builds finish: https://github.com/argoproj/argo-workflows/actions/runs/4758024718/jobs/8455548828

@tooptoop4
Contributor

@prateekgogia did you retest?

@terrytangyuan
Member

I think this was fixed in one of these PRs: #11722, #9700, #12133, #11375

Feel free to re-open if you are still having issues.

@agilgur5 agilgur5 added type/bug P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important area/controller Controller issues, panics and removed type/feature Feature request labels May 12, 2024
@argoproj argoproj locked as resolved and limited conversation to collaborators Jun 27, 2024