Web replicas CPU consumption issue. Jobs start with delay #14395
@lioramazor Thank you for your time!
@djyasin We have a persistent project set up. We have been using AWX for years to deploy our product's infrastructure, and sometimes hundreds of workflows run simultaneously. We have many different AWX projects and types of workflows. We have a production environment with version 21.10.2 where the issue doesn't happen, and everything is stable. We have another environment for developers with version 22.5, upgraded from 22.3, that started having issues with the web pods' CPU consumption and long locks after the upgrade. Please see these logs about job 8206451.
Please see the web pods' resource consumption after the upgrade (the upgrade happened on the first of August). One of the lines is the resource request (we changed it from time to time while trying to handle this issue), and the other is the actual CPU consumption, which became much higher after the upgrade. This problem is blocking us from upgrading our PROD environment to 22.5 and aligning our environments. We will be glad to provide any info that is needed.
Hi @djyasin and @TheRealHaoLiu, I want to say that I am from @lioramazor's team, and we are struggling with this issue right now. We will gladly provide additional info and arrange a session if needed. There seems to be some CPU consumption issue introduced between 22.3 and 22.5; we will be glad to contribute to troubleshooting and resolving it.
An update. This issue seems to be related to RECEPTOR_KUBE_SUPPORT_RECONNECT, which was enabled. We redeployed the development AWX with RECEPTOR_KUBE_SUPPORT_RECONNECT disabled, and for now everything looks good.
I fail to see how RECEPTOR_KUBE_SUPPORT_RECONNECT could possibly cause increased CPU utilization on the web pod. RECEPTOR_KUBE_SUPPORT_RECONNECT is only set as an env var on the controlplane-ee container that's deployed in the task pod; the rest of AWX is neither aware of nor cares about that env var. Let's do some digging on the DB side and see if the database is struggling somehow.
btw when you set/remove RECEPTOR_KUBE_SUPPORT_RECONNECT it causes a deployment change, which causes the pod to be restarted; that's probably the drop that you see there.
Me too. But that is what is happening for now. We will keep watching, but the CPU looks OK for now. I wonder if there could be some indirect relation.
Yes, sure thing. We will see whether there are any spikes after turning off the reconnect support. By the way, do you think some changes between 22.3 and 22.5 could cause the CPU spikes?
Hi @TheRealHaoLiu, you were right; the problem is not related to RECEPTOR_KUBE_SUPPORT_RECONNECT. Let me describe in detail what has been happening since the upgrade.

We upgraded from 22.3 to 22.5 on August 2 (we even did it together with you when we had the DB migration issue =) ). Immediately after the upgrade, we noticed that the web replicas consume much more CPU than before, and the task replicas consume more memory. We are not sure if it is related, but we also saw an increase in CPU and average active sessions on our RDS.

We had the RDS CPU hitting 100% and slowness in AWX, so on August 24 we decided to do a cleanup (we deleted all jobs older than 45 days, plus all notifications and notification templates that we don't need). According to the RDS performance screenshot above, this seems to have remediated the RDS CPU issue; at least it hasn't hit 100% since then.

During the last week, we ran tests requiring about 50 workflows running simultaneously. Our users started complaining about the slowness of the workflows; processes that took 1 hour before can now take up to 3 hours. It is not a permanent issue, and it looks like it depends on the AWX load, which it can no longer handle in 22.5.

I'll describe what we see in detail. We run some workflows from the API and see that some get stuck on one of the jobs. This job normally takes about 2 minutes to finish, but because of some problem, it took 45 minutes. The job number is 8229831. Here are the logs related to this job:
From there, we started getting a message that the job is blocked by project update 8229780. We kept getting this message for about 3.5 minutes, and then the job started running.
And from this point on, something very strange happens. From the last log lines, we see that they stopped at 31/Aug/2023:09:33:43, and between 31/Aug/2023:09:33:43 and 31/Aug/2023:09:45:30 nothing happens in the logs (at least, there are no records that contain the job number). After that, we can see the following.
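As an aside, silent windows like this one are easy to spot programmatically by computing the gaps between consecutive log timestamps for a given job. The sketch below is illustrative: only the timestamp format is taken from the logs above, and the sample values are made up around the window described here.

```python
from datetime import datetime

# Timestamp format as it appears in the log excerpts above.
TS_FORMAT = "%d/%b/%Y:%H:%M:%S"

def largest_gap(timestamps):
    """Return the largest gap, in seconds, between consecutive timestamps."""
    parsed = sorted(datetime.strptime(ts, TS_FORMAT) for ts in timestamps)
    gaps = [(b - a).total_seconds() for a, b in zip(parsed, parsed[1:])]
    return max(gaps) if gaps else 0.0

# Illustrative timestamps around the silent window described above.
stamps = [
    "31/Aug/2023:09:30:12",
    "31/Aug/2023:09:33:43",  # last record before the gap
    "31/Aug/2023:09:45:30",  # first record after the gap
    "31/Aug/2023:09:45:31",
]
print(largest_gap(stamps))  # 707.0 seconds, i.e. roughly 11.8 minutes
```

Running a helper like this over the per-job log lines gives a quick ranking of which jobs sat idle the longest.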
Here comes this message that seems to be important:
And then preparing the playbook:
And the job runs after it. We mentioned project update 8229780 above. Here are the logs related to it:
Then we can see that many other jobs were blocked by this project update. We see hundreds of records like this, about different jobs:
In the middle of the messages about the jobs blocked by the project update, we get this:
Then the logs about jobs blocked by the project update appear again, and finally:
So, to sum up.
This is what happened with the web CPU during that time. After the upgrade to 22.5, we also see the RDS CPU reaching 100%. Here is what we can see in the logs during this time:
This query seems to be the heaviest (it takes 60% of the CPU). We will be glad to share any info and logs needed to troubleshoot this issue. Please tell us if more info is needed.
We found the root cause of this issue and opened some related bugs. Please see them here:
Please confirm the following
Bug Summary
We are running AWX 22.5.0, deployed on Kubernetes using awx-operator 2.4.0.
We run 2 awx-task pods and 2 awx-web pods.
We recently upgraded to version 22.5.0 from 22.3.0.
Since the upgrade, when running jobs, some jobs take a while to start the actual run; as seen in the log below, we see the following message:
The wait time of course varies from job to job, sometimes 10 seconds and sometimes up to 15-20 minutes.
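These waits can be quantified from the AWX REST API: job records expose `created` and `started` timestamps, so the delay is simply their difference. A minimal sketch against made-up payloads (only the `created`/`started` field names come from the AWX job serializer; note that `started` is null while a job is still pending):

```python
from datetime import datetime, timezone

# Timestamp format used in AWX API responses.
API_FMT = "%Y-%m-%dT%H:%M:%S.%fZ"

def start_delay_seconds(job):
    """Seconds a job spent waiting between creation and actual start."""
    created = datetime.strptime(job["created"], API_FMT).replace(tzinfo=timezone.utc)
    started = datetime.strptime(job["started"], API_FMT).replace(tzinfo=timezone.utc)
    return (started - created).total_seconds()

# Illustrative payloads shaped like /api/v2/jobs/ results.
jobs = [
    {"id": 1, "created": "2023-08-31T09:00:00.000000Z", "started": "2023-08-31T09:00:10.000000Z"},
    {"id": 2, "created": "2023-08-31T09:00:00.000000Z", "started": "2023-08-31T09:18:00.000000Z"},
]
for job in jobs:
    print(job["id"], start_delay_seconds(job))  # 10.0 s vs. 1080.0 s
```

Collecting these deltas over a day of jobs would show whether the delay distribution really shifted after the upgrade.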
Also since the upgrade, we started to notice that the AWX web pod is consuming high CPU, between 2 and 3 cores, while its request is 1.5 CPUs. The way we use AWX did not change, yet the CPU is now higher than ever (before the upgrade, we used no more than 0.5 CPUs for each web pod).
The locking behavior happened to us in 22.3.0 as well, but never for longer than 70 seconds, even at peak hours. Now we reach as much as 15 minutes of lock time, and not even during peak hours.
We are not sure the problems stated above are related, but we suspect they are: when the CPU is higher, the lock wait times are generally higher as well.
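That suspicion can be checked numerically by sampling web-pod CPU and job lock wait time at matching intervals and computing their correlation. A sketch with invented sample points (the numbers below are not real measurements):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative samples: web pod CPU (cores) vs. lock wait time (seconds),
# taken at the same points in time.
cpu = [0.5, 1.0, 1.8, 2.5, 3.0]
lock_wait = [15, 60, 300, 600, 900]
print(round(pearson(cpu, lock_wait), 3))  # ~0.976: strongly correlated
```

A coefficient close to 1 on real samples would support the hypothesis that the high web CPU and the long lock waits share a cause; a value near 0 would suggest they are independent problems.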
AWX version
22.5.0
Select the relevant components
Installation method
kubernetes
Modifications
no
Ansible version
No response
Operating system
No response
Web browser
No response
Steps to reproduce
Deploying AWX 22.5 on EKS 1.24 -> running jobs
Expected results
The jobs should start running after a few seconds, or 2-3 minutes if a project update is performed.
Actual results
The AWX web CPU is high, and the jobs wait between 4 and 10 minutes to start running.
Additional information
No response