Controller: High RAM Usage and Frequent Crashes on Large Scale Workflows #11948
Comments
What kind of workflows do you run? Have you load tested it, maybe with 5000 simple cowsay workflows running concurrently? I think this may help you with that: https://pipekit.io/blog/upgrade-to-kubernetes-127-take-advantage-of-performance-improvements
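For reference, a load test along those lines could be sketched with the Argo Workflows Go client roughly as follows; the namespace, client QPS/burst values, and the whalesay workflow spec are illustrative assumptions, not anyone's actual setup in this thread.

```go
// Rough sketch: submit many concurrent whalesay workflows and watch the
// controller's memory / restart count while they run.
package main

import (
	"context"
	"fmt"
	"sync"

	wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
	wfclientset "github.com/argoproj/argo-workflows/v3/pkg/client/clientset/versioned"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	const n = 5000           // number of concurrent workflows (assumption)
	const namespace = "argo" // assumed namespace

	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	// Raise client-side rate limits so the test itself isn't the bottleneck.
	config.QPS = 50
	config.Burst = 100
	client := wfclientset.NewForConfigOrDie(config)

	// The classic whalesay example from the Argo docs.
	wf := wfv1.Workflow{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "loadtest-"},
		Spec: wfv1.WorkflowSpec{
			Entrypoint: "whalesay",
			Templates: []wfv1.Template{{
				Name: "whalesay",
				Container: &corev1.Container{
					Image:   "docker/whalesay:latest",
					Command: []string{"cowsay", "hello"},
				},
			}},
		},
	}

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, err := client.ArgoprojV1alpha1().Workflows(namespace).
				Create(context.Background(), wf.DeepCopy(), metav1.CreateOptions{})
			if err != nil {
				fmt.Println("create failed:", err)
			}
		}()
	}
	wg.Wait()
}
```

Watching the controller's memory and restart count while a batch like this runs is a quick way to tell whether the problem scales with the number of concurrent workflows.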
Thank you for your prompt response! We run a variety of workflows, encompassing various tasks and complexities. We will proceed with updating to Kubernetes 1.27 and see if it resolves the high RAM consumption and controller crash issues. However, I'm skeptical that this update will address the problem of archived workflows not being deleted. Can you provide more insights or potential solutions regarding that? Thank you in advance for your support!
I think the root cause may be GC not clearing some data. Maybe this will help you: https://argoproj.github.io/argo-workflows/running-at-massive-scale/ In my use case we are trying to migrate from Azkaban to Argo Workflows. Before that, we ran concurrent workflows just like the Pipekit article describes, and it managed the resources pretty well (using GKE 1.24). You can try running a test scenario to load test it first and track down where the issue comes from.
Yea GC/TTL sounds like what you'd want to tune for deleting archived workflows from your cluster. There are more detailed docs in the Operator Guide, such as for GC: https://argoproj.github.io/argo-workflows/cost-optimisation/#limit-the-total-number-of-workflows-and-pods I'm also working on horizontally scaling the Controller in #9990 |
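As a concrete illustration of the GC/TTL knobs those docs describe, here is a minimal sketch using the Argo Go API types; the specific values are assumptions for illustration only, and the same defaults can alternatively be applied cluster-wide via workflowDefaults in the workflow-controller ConfigMap.

```go
// Minimal sketch of spec-level garbage-collection settings so completed
// Workflow objects and pods are removed instead of accumulating in etcd.
package wfgc

import (
	wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
)

func int32Ptr(i int32) *int32 { return &i }

// withGC enables TTL-based deletion of finished Workflow objects and
// per-pod GC on the given spec. The durations below are illustrative.
func withGC(spec wfv1.WorkflowSpec) wfv1.WorkflowSpec {
	spec.TTLStrategy = &wfv1.TTLStrategy{
		SecondsAfterCompletion: int32Ptr(300),  // delete the Workflow 5 minutes after it finishes
		SecondsAfterFailure:    int32Ptr(3600), // keep failed runs a bit longer for debugging
	}
	spec.PodGC = &wfv1.PodGC{
		Strategy: wfv1.PodGCOnPodCompletion, // delete pods as soon as they complete
	}
	return spec
}
```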
There must be errors when deleting the live workflows in the cluster after archiving. Could you paste relevant logs from controller and server? |
Thanks for your support. These are the only error messages we receive in the controller logs:
Adding the following settings to the controller config fixed the cleanup of the already-archived workflows and the RAM consumption:
Anyway, the controller is still repeatedly restarting. I will update k8s to
Updating K8s to
The workflow server started to display warnings in the logs regarding the TokenRequest API. These warnings were not present in version
Awesome! Curious, for reference, what did your memory consumption drop to?
Hmm this doesn't seem like the same error message, but #11657 is perhaps related. The ConfigMap watcher should be more resilient after #11855. This is functionality that lets you change the ConfigMap without needing to restart the Controller. Does it crash after this error? Or was that just another error you found in the logs?
There's a section in the docs about k8s Secrets for SAs, but that should have already been in effect since k8s 1.24 🤔 Not sure what that warning traces back to |
Yes, it seems to be crashing now after this log message. |
Can you check |
Is it potentially running out of CPU, since you had to increase ...? You've shown that there is a correlation between memory usage and lack of GC, but now it would be important to figure out what the precursor / root cause of the lack of GC is. We'd need more information to try to ascertain that.
We analyzed the issue further after applying your suggested changes. Now we see frequent restarts of the controller and found out that it is restarted due to failing health checks, which are caused by the same issue described here: Liveness probe fails with 500 and "workflow never reconciled" |
@jkldrr @sgutwein it looks like your k8s API server and etcd are heavily loaded by the large number of workflow and pod objects. All API requests are timing out or being rate limited (if you are using a vendor k8s solution).
Here are a few suggestions for when you are running workflows at high scale:
Overall it seems that we need to tweak our settings in order to find a stable configuration. |
@sarabala1979 (workflows.argoproj.io/workflow-archiving-status=Pending) |
I observed a similar issue at our site after upgrading from 3.3.6 to 3.4.11:
Based on the above observations, I have a hypothesis:
I made some changes and it looks normal now:
Hope these changes help in your case. Good luck. ---- Update ---- Here is an example of the workflow controller repeatedly deleting an already-deleted workflow: workflow_controller_logs_delete_workflows.txt
@krumlovchiu I guess your hypothesis is about right; we observe the following log messages: It takes a very long time until a workflow gets garbage collected after its TTL has run out. We tried to compensate by setting
This issue becomes worse when the number of workflow resources keeps increasing while the archiving can't keep up. At a certain point, the k8s API is overwhelmed and calling
Because the request to the k8s API times out, the workflow controller crashes and tries to restart. The same problem then occurs again, and the workflow controller ends up in a restart loop. The only way to recover from this is to manually delete workflow resources until the k8s API response times for
I discovered a discrepancy in the watch requests generated by wfInformer in workflow/controller/controller compared to workflow/cron/controller when communicating with the k8s API server. Specifically, the resourceVersion parameter is missing in the requests from workflow/controller/controller. This omission seems to prompt the k8s API server to return WatchEvents from the latest version, potentially leading to missed WatchEvents by wfInformer, such as delete events. As a result, succeeded workflows may remain in wfInformer indefinitely. workflow/controller/controller missing resourceVersion:
workflow/cron/controller:
This behavior appears to have been introduced in PR #11343 (v3.4.9) where tweakListOptions is utilized for both list and watch requests. To validate this, we reverted to v3.4.8 and observed that the watch requests now include the resourceVersion parameter:
We are currently monitoring to see if the count of succeeded workflows continues to rise over time. |
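To make the suspected mechanism concrete, here is a minimal client-go sketch, using the built-in Pods informer as a stand-in for Argo's generated Workflow informer (this is not the actual Argo code): the factory applies the same tweakListOptions callback to both list and watch requests, so a tweak that clobbers resourceVersion prevents the watch from resuming where the list left off, and intervening delete events can be missed.

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// The tweak function below is applied to BOTH the initial LIST and every
	// WATCH the reflector issues afterwards. The reflector sets
	// options.ResourceVersion on its watch so it resumes exactly where the
	// list left off; a tweak that overwrites that field makes the watch start
	// from "now", and events in between (e.g. deletes) are never delivered,
	// so deleted objects can linger in the informer cache.
	factory := informers.NewSharedInformerFactoryWithOptions(client, 10*time.Minute,
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.LabelSelector = "workflows.argoproj.io/completed" // narrowing the selection is safe
			// opts.ResourceVersion = ""                           // the problematic pattern: breaks watch resumption
		}),
	)

	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) { fmt.Println("observed a delete event") },
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop
}
```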
Thank you @krumlovchiu! I was just looking at this exact code yesterday and realized the same thing. It looks ... Are you also seeing errors like ...?
Please keep us posted here. |
…ests. Fixes #11948 Signed-off-by: Yuan Tang <[email protected]>
@carolkao Good catch. It looks like not all changes were included when cherry-picking. I am building
@terrytangyuan sounds good! After the build is ready, I can arrange the version upgrade, maybe tomorrow (11/7). With this new version, I will check whether the k8s API calls and the function work as expected and get back to you.
Hi @terrytangyuan , Looks like build |
@carolkao I guess you forgot to update the image for argo-server. Here's the only difference from v3.4.13 v3.4.13...dev-fix-informer-3.4.13 |
Let me check it |
Hi @terrytangyuan, here are some updates: In addition to this weird behavior, we noticed another UI bug in this new version. Some nodes of the workflows are missing in the Argo UI graph view, but the workflows actually appear to run correctly and we can see the pods in the timeline view. Since the above concern would impact our service users, I'm afraid I cannot arrange the version upgrade to our production environment at the moment to monitor whether it fixes the issue of succeeded workflows continuously increasing. But with build
Thank you! We can track the node missing issue separately as I don't think that's related to this issue. #12165 |
Thanks for pointing this out! |
We can revert that in future 3.4 patches. It doesn't affect usage. Tracking in #11851. |
Please try out v3.4.14 which fixes both this issue and the node missing issue. |
does the fix reduce |
The fix only affects workflow objects, so I don't think so. |
Does the fix solve the issue of succeeded workflows increasing continuously?
…ests. Fixes argoproj#11948 (argoproj#12133) Signed-off-by: Dillen Padhiar <[email protected]>
Hi @tooptoop4 , |
After cherry-picking the PR to v3.4.9, the problem still exists.
@zhucan As I've told you repeatedly elsewhere, older patches are not supported by definition. Reporting something on an older patch is not helpful, counter-productive, and against our issue templates.
That is a fork. Forks are, by definition, not supported. If you're running a custom version of Argo, that's on you to maintain; please do not expect official support for something the project did not release.
You did not provide any details to support that your issue is due to the same problem or that you have the same issue even. This is similarly not helpful and counter-productive.
Your upload did not complete, so there is nothing here. You seem to have uploaded a similar file, a ...
Given that this specific issue has received a resolution and that further comments have not been productive, I will be locking this issue. If you have a similar issue of unbounded growth of memory or lack of GC, please try the latest patch and then, if you still suspect a problem, open a separate issue with as many details as possible. For instance, OP and team provided many graphs of metrics, configurations, etc. Without substantive details, it is impossible to trace an issue.
Pre-requisites
What happened/what you expected to happen?
Description:
We are observing significant issues with the Argo Workflows Controller while handling a large number of workflows in parallel.
Environment:
Argo Workflows version: 3.4.11
Nodes count: 300+
Parallel workflows: 5000+
What happened:
The Argo Workflows Controller's memory consumption increases exponentially, sometimes surpassing 100GB. Despite this excessive memory usage, the controller crashes frequently. Notably, despite workflows being archived, they aren't deleted post-archiving, possibly contributing to the memory usage. It does not log any specific error messages prior to these crashes, making it challenging to pinpoint the cause or underlying issue.
What you expected to happen:
We expected the Argo Workflows Controller to handle the parallel execution of 5000+ workflows across 300+ nodes without such a drastic increase in RAM consumption. We also expected a more resilient behavior, not prone to unexpected crashes, and better error logging for troubleshooting.
How to reproduce it (as minimally and precisely as possible):
Set up an environment with 300+ nodes.
Launch 5000+ workflows in parallel.
Monitor the RAM usage of the Argo Workflows Controller and note any unexpected crashes.
Additional context:
Given the scale at which we are operating, it's critical for our operations that Argo can handle such workloads efficiently. Any assistance in resolving this issue or guidance on potential optimizations would be greatly appreciated.
Version
v3.4.11
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container