Difficulty scaling when running many workflows and/or steps #4634
Indeed we have read the thread, and we followed all of the workarounds suggested there. We have been keeping up to date with the latest builds.
We haven't seen a controller crash on the most recent builds.
In our first test, we submitted three workflows per second of two steps each.
The highest number of pods we've seen run concurrently in our steady-state tests has been around 200. In those tests, the number of workflows in the
In the case of the few workflows with many steps each, the number of pods in the
What TZ are you in? Can we get on a Zoom in the next few days?
Your actions:
My actions:
@alexec Thanks again for your time today. We're looking forward to getting to the bottom of this :)
We ran this test for about 30 mins using the script in Fig 2 (submits 3 wf/sec, two steps total, about 60-70 seconds expected total time per workflow). Workflows seemed to start and complete smoothly with a steady accumulation of "succeeded" workflows. No workflow GC ever appeared to run, which is unexpected and different from our prior tests. Workflow throughput dropped by 50% about 20 minutes into the run, which is what we've normally seen when the GC operation starts, but the number of succeeded workflows never decreased. Pending workflows hovered around 300 throughout the test. When we submitted a couple of workflows in the middle of the test using the CLI tool, it took around 3.5 minutes for the first step to run. We're not sure what to conclude here, but running without
We think this was a red herring. In later tests, memory usage seemed to increase in proportion to the number of workflows that were being tracked, which seems reasonable. Our current manifest reserves 1GiB of memory for the Argo controller but doesn't set a limit. The node has significant memory available, so we shouldn't be bumping into any physical or OS limits.
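For reference, the relevant part of such a controller manifest is just a memory request with no limit, along these lines (a sketch; the container name matches a stock install):

```yaml
containers:
  - name: workflow-controller
    resources:
      requests:
        memory: 1Gi
      # no "limits" stanza, so the controller can grow with the number of tracked workflows
```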
We tried using semaphores while
Only when we canceled the script and cleaned up the workflows (
This is our configmap and the relevant part of the workflow spec, which is adapted from the synchronization-wf-level example:
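A minimal sketch of that setup, modeled on the upstream synchronization-wf-level example (the ConfigMap name, key, and limit below are placeholders, not our production values):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-config
data:
  workflow: "5"          # max workflows allowed to run concurrently
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: synchronization-wf-level-
spec:
  entrypoint: main
  synchronization:
    semaphore:
      configMapKeyRef:   # workflow-level semaphore backed by the ConfigMap above
        name: my-config
        key: workflow
  templates:
    - name: main
      container:
        image: alpine:3.12
        command: [sh, -c, "sleep 1"]
```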
We'll give these a try tomorrow and will report back.
BTW, we ran pprof against the controller a few weeks ago and didn't see any interesting CPU hot spots (aside from
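For anyone repeating that, grabbing a CPU profile looks roughly like this, assuming the controller build exposes Go's standard net/http/pprof handlers (the port is a guess):

```sh
# Forward the controller's pprof port locally, then take a 30-second CPU profile.
kubectl -n argo port-forward deploy/workflow-controller 6060:6060 &
go tool pprof -seconds 30 http://localhost:6060/debug/pprof/profile
```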
Forgot to mention that we've been seeing a lot of "Deadline exceeded" warnings in our controller logs during recent runs (not just today). I've added a metric to our controller build to count them, so we should have hard numbers to share next time if that would be a useful signal.
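The counter itself isn't shown here; a minimal sketch of one with prometheus/client_golang (the metric and function names are hypothetical) would be:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical counter, bumped wherever the controller logs a
// "Deadline exceeded" warning so the rate can be graphed over a run.
var deadlineExceeded = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "argo_deadline_exceeded_warnings_total",
	Help: "Number of 'Deadline exceeded' warnings logged by the controller.",
})

func init() {
	prometheus.MustRegister(deadlineExceeded)
}

// RecordDeadlineExceeded is called from the code path that emits the warning.
func RecordDeadlineExceeded() {
	deadlineExceeded.Inc()
}
```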
I've done some analysis and can't see any memory leaks. That ties in with your analysis. I've created a dev build intended to reduce the number of Kubernetes API requests the controller makes by limiting each workflow to one reconciliation every 15s (configurable). The details are in the linked PR. Would you like to try that out?
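For context, a sketch of how per-workflow reconciliation can be capped with client-go's delaying workqueue; the env var name is a guess, and this is not necessarily how the dev build implements it:

```go
package controller

import (
	"os"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// requeueTime caps how often a single workflow is reconciled (15s by default,
// overridable via an env var whose name is assumed here).
var requeueTime = 15 * time.Second

func init() {
	if v := os.Getenv("DEFAULT_REQUEUE_TIME"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			requeueTime = d
		}
	}
}

// requeueWorkflow re-admits a workflow key no sooner than requeueTime from now,
// so a busy workflow triggers at most one reconciliation per period instead of
// one per informer event.
func requeueWorkflow(q workqueue.DelayingInterface, key string) {
	q.AddAfter(key, requeueTime)
}
```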
Okay, we have a few more test results. Short version: no significant improvement.
The close timing between succeeded workflows getting GC'd and performance becoming unstable seems interesting. Would it be worthwhile to try disabling GC or making the GC period very long to see if there's actually a connection? Anything else we should try?
Do you mean you had zombies?
Completed and GC pods are bounded to 512 before blocking:
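Roughly, as a sketch of the shape of that code rather than the actual controller source:

```go
package main

import "fmt"

func main() {
	// The completed-pod and GC queues are plain buffered channels: sends cost
	// nothing until 512 entries are queued, after which the sender (the
	// reconciliation loop) blocks until a worker drains an entry.
	completedPods := make(chan string, 512)

	completedPods <- "argo/example-workflow-pod" // blocks only once the buffer is full

	// A pod GC worker ranges over the channel and deletes or labels each pod.
	fmt.Println(<-completedPods)
}
```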
I don't think so, at least by the definition given in #4560. The workflows seemed to reliably complete and get GC'd, but we had a lot of completed pods hanging around (visible using
We noticed this a while back and wondered if it was contributing to our perf issues. What controller behavior would you expect to see when those channels are full?
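For reference, the lingering completed pods can be listed with something like this (the namespace is an assumption; the label is the one Argo puts on finished pods):

```sh
kubectl -n argo get pods \
  --field-selector=status.phase=Succeeded \
  -l workflows.argoproj.io/completed=true
```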
I didn't pay close attention to CPU usage yesterday, but I can confirm that it was about 6x lower compared to other recent runs. Nice!
Good news about the CPU. I think there are some design issues in pod GC (see #4693).
When the channel gets full, adding new entries to it blocks. This means that reconciliation will take longer. I'm creating a new build with a configurable fix for #4693.
@acj I've created a new dev build. Can you first run this without any env var changes, as a baseline? I expect you to see an improvement with the default settings. Then can you try the new env vars as listed in the PR, please? Thank you again.
Will do. We ran the latter test (today's
Do you think we should also include the zombie env vars and
I think it is best to test one thing at a time :)
Napkin math:
Can you try:
I think you should get more workflow throughput (i.e. the queue should not grow).
Completely agree. I was trying to get clarity on what you meant here:
Whether we should apply all of those new env vars at once, or by group, or one at a time, etc.
I've updated the PR description to make this clearer.
We see a pretty big difference between
The customized env vars from the previous test seemed to give us the best results, so we carried those into the next tests. Changing
On a whim, we reran that last test (requeue time at 60s) with a modified script that submits 2 WF/sec instead of 3. The queue depth, time in queue, and pending workflow count all stayed near zero. It took around 30 seconds to submit a workflow (sleeps for 1s and exits) using the CLI and see it fully complete, which is a little sluggish, but much better than we saw in our pre-
We also tried making the script submit 5 WF/sec. Notably, the k8s API request rate seemed to plateau at the same point whether we submitted 3/sec or 5/sec, which makes me think we're being throttled either by the API server or by the rate limiters in the controller (e.g. workqueue or similar). Maybe that's our next bottleneck?
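For context, the controller's own client-side limiter is one plausible cause of such a plateau; it is configured roughly like this sketch (values are illustrative), and once it saturates, the API request rate tops out no matter how many workflows are submitted:

```go
package controller

import "k8s.io/client-go/rest"

// configureClient shows where the client-side rate limits live. Requests beyond
// QPS are smoothed and beyond Burst are queued, so the controller's request
// rate plateaus even if more work arrives.
func configureClient(cfg *rest.Config) {
	cfg.QPS = 20.0 // sustained requests per second allowed by the client
	cfg.Burst = 30 // short-term burst allowance above QPS
}
```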
Can I ask you to execute another test?
@acj I think we're getting closer to a solution, so I've created a series of PRs with one fix (
Assumptions:
Rejected hypothesis:
Accepted hypothesis:
I've created dev builds specifically to address these items:
Would you be able to get onto a Zoom early next week, please?
Yep, we'll give this a try tomorrow
Given the plateauing behavior we were seeing last week (bottom of my last comment), is it safe to reject this hypothesis? I'm still wondering about a possible bottleneck there
Sure. We'll ping you in slack once we get our schedule sorted |
We ran a test with
When we ran with
Stack trace from the controller crash:
fig. 1, fig. 2
Test results from using the
I'm pretty sure we have a new bug in TTL. |
Testing with
Testing with
Testing with
fig. 1, fig. 2, fig. 3
I've been doing some exploratory testing, and it is clear that TTL often just does not happen. This is a functional bug, not in fact a scaling issue. I'll fix this and get back to you, as right now I'm trying to improve the performance of something fundamentally broken.
Accidentally closed.
I should note I've tried submitting 600 workflows at once on my MacBook. That peaked at 60 concurrent a second. I'm going to launch this on a test cluster tomorrow.
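For reference, a burst like that can be produced with a simple loop (the manifest path is a placeholder):

```sh
for i in $(seq 1 600); do
  argo submit -n argo examples/hello-world.yaml &
done
wait
```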
With
As per our conversation, we are going to run a similar test, this time skipping Argo to verify that the cluster can sustain the desired capacity. |
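A rough sketch of such an Argo-free baseline: create short-lived bare pods at about the workflow submission rate and watch how the API server and nodes hold up (image, namespace, counts, and rate are placeholders):

```sh
for i in $(seq 1 180); do
  kubectl -n loadtest run "bare-load-$i" --image=alpine:3.12 --restart=Never -- sleep 60
  sleep 0.3   # roughly 3 pods per second
done
```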
Using the latest images and recommendations:
We observed the following: we are seeing far more consistently running workflows, but now we are also seeing a lot of failed workflows. Unfortunately, the failed pods were reaped immediately, so we have only the following information:
Thank you!
Good news! I have a branch named
Running on
Adding the environment variables as recommended:
We'll likely try out v2.12.3 as well and report back. |
After a bit of testing, I've found that "running workflows" != "workflows actually doing work". You can have a lot of running workflows while the pods are actually pending. The charts you shared don't have legends on them, so I don't know what metrics they are, but there is a new metric called (I think)
Alex
@tomgoren we should get on a call again. When are you free? |
This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further. |
Did you solve it, @tomgoren?
I believe that ultimately the "Emissary" executor solved the bulk of this problem. It's been a few years since this has been relevant. |
Problem statement
Environment details
- master at commit 5c538d7a918e41029d3911a92c6ac615f04d3b80
- parallelism: 800, otherwise we observed the EKS control plane becoming unresponsive
- containerRuntimeExecutor: kubelet on AWS Bottlerocket instances
Case 1 (many workflows, few steps)
The relevant queues are wfc.wfQueue and wfc.podQueue in controller/controller.go. The workflow queue oscillates between 1000 and 1500 items during our test. However, the pod queue consistently stays at 0.
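For reference, client-go's workqueue exposes Len(), so both depths can be sampled with a small helper like this (illustrative, not the controller's own instrumentation):

```go
package diagnostics

import (
	"log"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// watchQueueDepths logs the depth of both controller queues every few seconds;
// the parameter names mirror the fields referenced above.
func watchQueueDepths(wfQueue, podQueue workqueue.Interface) {
	for range time.Tick(5 * time.Second) {
		log.Printf("wfQueue depth=%d podQueue depth=%d", wfQueue.Len(), podQueue.Len())
	}
}
```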
Case 2 (few workflows, many steps)
Even after all of its pods are Completed, the workflow lingers in the Running state (fig. 5).
Things we tried
In trying to address these issues, we changed the values of the following parameters without much success (a sketch of where these are set follows the list):
- pod-workers
- workflow-workers (the default of 32 was a bottleneck, but anything over 128 didn’t make a difference)
- INFORMER_WRITE_BACK=false
- --qps, --burst
- workflowResyncPeriod and podResyncPeriod
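For reference, the flags and env var above are set on the workflow-controller Deployment in a stock install; a trimmed, illustrative sketch (values are examples, and the resync periods are code-level constants rather than flags):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-controller
  namespace: argo
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          args:
            - --pod-workers=64
            - --workflow-workers=128
            - --qps=30
            - --burst=60
          env:
            - name: INFORMER_WRITE_BACK
              value: "false"
```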
figs. 1–7 (charts attached to the original issue)
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.