Controller: High RAM Usage and Frequent Crashes on Large Scale Workflows #11948
Comments
What kind of workflows do you run? Have you load tested it, maybe with 5000 simple cowsay workflows running concurrently? I think this may help you with that: https://pipekit.io/blog/upgrade-to-kubernetes-127-take-advantage-of-performance-improvements
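For reference, a load test along those lines could be sketched with the Argo Workflows Go client roughly as follows; the namespace, client QPS/burst values, and the whalesay workflow spec are illustrative assumptions, not anyone's actual setup in this thread.

```go
// Rough sketch: submit many concurrent whalesay workflows and watch the
// controller's memory / restart count while they run.
package main

import (
	"context"
	"fmt"
	"sync"

	wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
	wfclientset "github.com/argoproj/argo-workflows/v3/pkg/client/clientset/versioned"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	const n = 5000           // number of concurrent workflows (assumption)
	const namespace = "argo" // assumed namespace

	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	// Raise client-side rate limits so the test itself isn't the bottleneck.
	config.QPS = 50
	config.Burst = 100
	client := wfclientset.NewForConfigOrDie(config)

	// The classic whalesay example from the Argo docs.
	wf := wfv1.Workflow{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "loadtest-"},
		Spec: wfv1.WorkflowSpec{
			Entrypoint: "whalesay",
			Templates: []wfv1.Template{{
				Name: "whalesay",
				Container: &corev1.Container{
					Image:   "docker/whalesay:latest",
					Command: []string{"cowsay", "hello"},
				},
			}},
		},
	}

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, err := client.ArgoprojV1alpha1().Workflows(namespace).
				Create(context.Background(), wf.DeepCopy(), metav1.CreateOptions{})
			if err != nil {
				fmt.Println("create failed:", err)
			}
		}()
	}
	wg.Wait()
}
```

Watching the controller's memory and restart count while a batch like this runs is a quick way to tell whether the problem scales with the number of concurrent workflows.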
Thank you for your prompt response! We run a variety of workflows, encompassing various tasks and complexities. We will proceed with updating to Kubernetes 1.27 and see if it resolves the high RAM consumption and controller crash issues. However, I'm skeptical that this update will address the problem of archived workflows not being deleted. Can you provide more insights or potential solutions regarding that? Thank you in advance for your support!
I think the root cause may be GC not clearing some data. Maybe this will help you: https://argoproj.github.io/argo-workflows/running-at-massive-scale/ In my use case we are trying to migrate from Azkaban to Argo Workflows. Before that, we ran concurrent workflows just like the Pipekit article describes, and it managed the resources pretty well (using GKE 1.24). You can try running a test scenario to load test it first and track down where the issue comes from.
Yea GC/TTL sounds like what you'd want to tune for deleting archived workflows from your cluster. There are more detailed docs in the Operator Guide, such as for GC: https://argoproj.github.io/argo-workflows/cost-optimisation/#limit-the-total-number-of-workflows-and-pods I'm also working on horizontally scaling the Controller in #9990 |
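As a concrete illustration of the GC/TTL knobs those docs describe, here is a minimal sketch using the Argo Go API types; the specific values are assumptions for illustration only, and the same defaults can alternatively be applied cluster-wide via workflowDefaults in the workflow-controller ConfigMap.

```go
// Minimal sketch of spec-level garbage-collection settings so completed
// Workflow objects and pods are removed instead of accumulating in etcd.
package wfgc

import (
	wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
)

func int32Ptr(i int32) *int32 { return &i }

// withGC enables TTL-based deletion of finished Workflow objects and
// per-pod GC on the given spec. The durations below are illustrative.
func withGC(spec wfv1.WorkflowSpec) wfv1.WorkflowSpec {
	spec.TTLStrategy = &wfv1.TTLStrategy{
		SecondsAfterCompletion: int32Ptr(300),  // delete the Workflow 5 minutes after it finishes
		SecondsAfterFailure:    int32Ptr(3600), // keep failed runs a bit longer for debugging
	}
	spec.PodGC = &wfv1.PodGC{
		Strategy: wfv1.PodGCOnPodCompletion, // delete pods as soon as they complete
	}
	return spec
}
```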
There must be errors when deleting the live workflows in the cluster after archiving. Could you paste relevant logs from controller and server? |
Thanks for your support. These are the only error messages we receive in the controller logs:
Adding the following settings to the controller config fixed the cleanup of the already-archived workflows and the RAM consumption:
Anyway, the controller is still repeatedly restarting. I will update k8s to
Updating K8s to
The workflow server started to display warnings in the logs regarding the TokenRequest API. These warnings were not present in version
Awesome! Curious, for reference, what did your memory consumption drop to?
Hmm this doesn't seem like the same error message, but #11657 is perhaps related. The ConfigMap watcher should be more resilient after #11855. This is functionality that lets you change the ConfigMap without needing to restart the Controller. Does it crash after this error? Or was that just another error you found in the logs?
There's a section in the docs about k8s Secrets for SAs, but that should have already been in effect since k8s 1.24 🤔 Not sure what that warning traces back to |
Yes, it seems to be crashing now after this log message. |
Can you check |
Is it potentially running out of CPU, since you had to increase ...? You've shown that there is a correlation between memory usage and lack of GC, but now it would be important to figure out what the precursor / root cause of the lack of GC is. We'd need more information to try to ascertain that.
We analyzed the issue further after applying your suggested changes. Now we see frequent restarts of the controller and found out that it is restarted due to failing health checks, which are caused by the same issue described here: Liveness probe fails with 500 and "workflow never reconciled" |
@jkldrr @sgutwein it looks like your k8s API server and etcd are heavily loaded by the large number of workflow and pod objects. All API requests are timing out or being rate limited (if you are using a vendor k8s solution).
Here are a few suggestions for when you are running workflows at high scale:
Overall it seems that we need to tweak our settings in order to find a stable configuration. |
@sarabala1979 (workflows.argoproj.io/workflow-archiving-status=Pending) |
I observed a similar issue at our site after upgrading from 3.3.6 to 3.4.11:
Based on the above observations, I have a hypothesis:
I made some changes and it looks normal now:
Hope these changes help in your case. Good luck. ---- Update ---- Here is an example of the workflow controller repeatedly deleting an already-deleted workflow: workflow_controller_logs_delete_workflows.txt
@krumlovchiu I guess your hypothesis is about right; we observe the following log messages: It takes a very long time until a workflow gets garbage collected after its TTL has run out. We tried to compensate by setting
This issue becomes worse when the number of workflow resources keeps increasing while the archiving can't keep up. At a certain point, the k8s API is overwhelmed and calling
Because the request to the k8s API times out, the workflow controller crashes and tries to restart. The same problem then occurs again, and the workflow controller ends up in a restart loop. The only way to recover from this is to manually delete workflow resources until the k8s API response times for
I discovered a discrepancy in the watch requests generated by wfInformer in workflow/controller/controller compared to workflow/cron/controller when communicating with the k8s API server. Specifically, the resourceVersion parameter is missing in the requests from workflow/controller/controller. This omission seems to prompt the k8s API server to return WatchEvents from the latest version, potentially leading to missed WatchEvents by wfInformer, such as delete events. As a result, succeeded workflows may remain in wfInformer indefinitely. workflow/controller/controller missing resourceVersion:
workflow/cron/controller:
This behavior appears to have been introduced in PR #11343 (v3.4.9) where tweakListOptions is utilized for both list and watch requests. To validate this, we reverted to v3.4.8 and observed that the watch requests now include the resourceVersion parameter:
We are currently monitoring to see if the count of succeeded workflows continues to rise over time. |
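To make the suspected mechanism concrete, here is a minimal client-go sketch, using the built-in Pods informer as a stand-in for Argo's generated Workflow informer (this is not the actual Argo code): the factory applies the same tweakListOptions callback to both list and watch requests, so a tweak that clobbers resourceVersion prevents the watch from resuming where the list left off, and intervening delete events can be missed.

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// The tweak function below is applied to BOTH the initial LIST and every
	// WATCH the reflector issues afterwards. The reflector sets
	// options.ResourceVersion on its watch so it resumes exactly where the
	// list left off; a tweak that overwrites that field makes the watch start
	// from "now", and events in between (e.g. deletes) are never delivered,
	// so deleted objects can linger in the informer cache.
	factory := informers.NewSharedInformerFactoryWithOptions(client, 10*time.Minute,
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.LabelSelector = "workflows.argoproj.io/completed" // narrowing the selection is safe
			// opts.ResourceVersion = ""                           // the problematic pattern: breaks watch resumption
		}),
	)

	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) { fmt.Println("observed a delete event") },
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop
}
```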
Thank you @krumlovchiu! I was just looking at this exact code yesterday and realized the same thing. It looks ... Are you also seeing errors like ...?
Please keep us posted here. |
…ests. Fixes #11948 Signed-off-by: Yuan Tang <[email protected]>
@carolkao Good catch. It looks like not all changes were included when cherry-picking. I am building
@terrytangyuan sounds good! After the build is ready, I can arrange the version upgrade, maybe tomorrow (11/7). With this new version, I will check whether the k8s API calls and the function work as expected and get back to you.
Hi @terrytangyuan , Looks like build |
@carolkao I guess you forgot to update the image for argo-server. Here's the only difference from v3.4.13 v3.4.13...dev-fix-informer-3.4.13 |
Let me check it |
Hi @terrytangyuan, here are some updates: In addition to this weird behavior, we noticed another UI bug in this new version. Some nodes of the workflows are missing in the Argo UI graph view, but the workflows actually appear to run correctly and we can see the pods in the timeline view. Since the above concern would impact our service users, I'm afraid I cannot arrange the version upgrade to our production environment at the moment to monitor whether it fixes the issue of succeeded workflows continuously increasing. But with build
Thank you! We can track the node missing issue separately as I don't think that's related to this issue. #12165 |
Thanks for pointing this out! |
We can revert that in future 3.4 patches. It doesn't affect usage. Tracking in #11851. |
Please try out v3.4.14 which fixes both this issue and the node missing issue. |
does the fix reduce |
The fix only affects workflow objects, so I don't think so. |
Does the fix solve the issue of succeeded workflows increasing continuously?
…ests. Fixes argoproj#11948 (argoproj#12133) Signed-off-by: Dillen Padhiar <[email protected]>
Hi @tooptoop4 , |
After cherry-picking the PR to v3.4.9, the problem still exists.
@zhucan As I've told you repeatedly elsewhere, older patches are not supported by definition. Reporting something on an older patch is not helpful, counter-productive, and against our issue templates.
That is a fork. Forks are, by definition, not supported. If you're running a custom version of Argo, that's on you to maintain; please do not expect official support for something the project did not release.
You did not provide any details to support that your issue is due to the same problem or that you have the same issue even. This is similarly not helpful and counter-productive.
Your upload did not complete, so there is nothing here. You seem to have uploaded a similar file, a ...
Given that this specific issue has received a resolution and that further comments have not been productive, I will be locking this issue. If you have a similar issue of unbounded growth of memory or lack of GC, please try the latest patch and then, if you still suspect a problem, open a separate issue with as many details as possible. For instance, OP and team provided many graphs of metrics, configurations, etc. Without substantive details, it is impossible to trace an issue.
Pre-requisites
What happened/what you expected to happen?
Description:
We are observing significant issues with the Argo Workflows Controller while handling a large number of workflows in parallel.
Environment:
Argo Workflows version: 3.4.11
Nodes count: 300+
Parallel workflows: 5000+
What happened:
The Argo Workflows Controller's memory consumption increases exponentially, sometimes surpassing 100GB. Despite this excessive memory usage, the controller crashes frequently. Notably, despite workflows being archived, they aren't deleted post-archiving, possibly contributing to the memory usage. It does not log any specific error messages prior to these crashes, making it challenging to pinpoint the cause or underlying issue.
What you expected to happen:
We expected the Argo Workflows Controller to handle the parallel execution of 5000+ workflows across 300+ nodes without such a drastic increase in RAM consumption. We also expected a more resilient behavior, not prone to unexpected crashes, and better error logging for troubleshooting.
How to reproduce it (as minimally and precisely as possible):
Set up an environment with 300+ nodes.
Launch 5000+ workflows in parallel.
Monitor the RAM usage of the Argo Workflows Controller and note any unexpected crashes.
Additional context:
Given the scale at which we are operating, it's critical for our operations that Argo can handle such workloads efficiently. Any assistance in resolving this issue or guidance on potential optimizations would be greatly appreciated.
Version
v3.4.11
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container