Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Web replicas CPU consumption issue. Jobs start with delay #14395

Open
5 of 11 tasks
lioramazor opened this issue Aug 30, 2023 · 9 comments
Open
5 of 11 tasks

Web replicas CPU consumption issue. Jobs start with delay #14395

lioramazor opened this issue Aug 30, 2023 · 9 comments

Comments

@lioramazor
Copy link

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)

Bug Summary

We are running AWX 22.5.0, deployed on Kubernetes using awx-operator 2.4.0.

We run 2 awx-task pods and 2 awx-web pods.

We recently upgraded to version 22.5.0 from 22.3.0.

Since the upgrade, when running jobs, some jobs take a while to start the actual run, as seen on the log below, we see the following message:

waited 252.4184970855713 to acquire lock for local source tree for path

The time of course changes from job to job, sometimes waiting 10 seconds and sometimes up to 15-20 minutes

Also since the upgrade, We started to notice that the AWX web pod is consuming high CPU, between 2-3 cores, while it's request is 1.5 cpus. the way we use AWX did not change, yet the CPU is now higher than ever (before the upgrade, we used no more than 0.5 cpus for each web pod).

The locking behavior used to happen to us in 22.3.0 as well, but not for longer than 70 seconds at peak hours. and now we reach as long as 15 minutes of lock time, not even during peak hours.

We are not sure the problems stated above are related, but we suspect they do, as when the CPU is higher, generally the lock wait times are higher.



Date	Host	Service	Message
2023-08-30T09:12:05.407Z			2023-08-30 09:12:04,194 INFO [0253207eebc648e9bffea0707239c9bd] awx.analytics.job_lifecycle job-8206451 notifications sent
2023-08-30T09:12:04.762Z			[pid: 3280|app: 0|req: 208/647118] REDACTED () {64 vars in 3450 bytes} [Wed Aug 30 09:12:03 2023] GET /api/v2/unified_jobs/?id__in=8206451&not__launch_type=sync&or__adhoccommand__inventory=48587&or__inventoryupdate__inventory_source__inventory=48587&or__job__inventory=48587&or__workflowjob__inventory=48587&order_by=-finished&page=1&page_size=20 => generated 52 bytes in 194 msecs (HTTP/1.1 200) 14 headers in 576 bytes (1 switches on core 0)
2023-08-30T09:12:04.762Z			REDACTED - - [30/Aug/2023:09:12:04 +0000] "GET /api/v2/unified_jobs/?id__in=8206451&not__launch_type=sync&or__adhoccommand__inventory=48587&or__inventoryupdate__inventory_source__inventory=48587&or__job__inventory=48587&or__workflowjob__inventory=48587&order_by=-finished&page=1&page_size=20 HTTP/1.1" 200 52 "https://REDACTED.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/REDACTED Safari/537.36" "REDACTED"
2023-08-30T09:12:04.762Z			[pid: 3285|app: 0|req: 341/647117] REDACTED () {64 vars in 3207 bytes} [Wed Aug 30 09:12:03 2023] GET /api/v2/unified_jobs/?id__in=8206451&name__icontains=REDACTEDe&not__launch_type=sync&order_by=-finished&page=1&page_size=20 => generated 52 bytes in 184 msecs (HTTP/1.1 200) 14 headers in 576 bytes (1 switches on core 0)
2023-08-30T09:12:04.762Z			REDACTED - - [30/Aug/2023:09:12:04 +0000] "GET /api/v2/unified_jobs/?id__in=8206451&name__icontains=REDACTEDe&not__launch_type=sync&order_by=-finished&page=1&page_size=20 HTTP/1.1" 200 52 "https://REDACTED.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/REDACTED Safari/537.36" "REDACTED"
2023-08-30T09:12:04.762Z			[pid: 3290|app: 0|req: 289/647116] REDACTED () {64 vars in 2466 bytes} [Wed Aug 30 09:12:03 2023] GET /api/v2/unified_jobs/?id__in=8206451&not__launch_type=sync&or__adhoccommand__inventory=50719&or__inventoryupdate__inventory_source__inventory=50719&or__job__inventory=50719&or__workflowjob__inventory=50719&order_by=-finished&page=1&page_size=20 => generated 52 bytes in 188 msecs (HTTP/1.1 200) 14 headers in 576 bytes (1 switches on core 0)
2023-08-30T09:12:04.762Z			REDACTED - - [30/Aug/2023:09:12:04 +0000] "GET /api/v2/unified_jobs/?id__in=8206451&not__launch_type=sync&or__adhoccommand__inventory=50719&or__inventoryupdate__inventory_source__inventory=50719&or__job__inventory=50719&or__workflowjob__inventory=50719&order_by=-finished&page=1&page_size=20 HTTP/1.1" 200 52 "https://REDACTED.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/REDACTED Safari/537.36" "REDACTED"
2023-08-30T09:12:04.487Z			[pid: 3281|app: 0|req: 338/647114] REDACTED () {64 vars in 2837 bytes} [Wed Aug 30 09:12:03 2023] GET /api/v2/unified_jobs/?id=8201333&id__in=8206451&not__launch_type=sync&order_by=-finished&page=1&page_size=20 => generated 52 bytes in 200 msecs (HTTP/1.1 200) 14 headers in 579 bytes (1 switches on core 0)
2023-08-30T09:12:04.487Z			REDACTED - - [30/Aug/2023:09:12:04 +0000] "GET /api/v2/unified_jobs/?id=8201333&id__in=8206451&not__launch_type=sync&order_by=-finished&page=1&page_size=20 HTTP/1.1" 200 52 "https://REDACTED.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/REDACTED Safari/537.36" "REDACTED"
2023-08-30T09:12:04.406Z			2023-08-30 09:12:03,778 INFO [0253207eebc648e9bffea0707239c9bd] awx.analytics.job_lifecycle job-8206451 finalize run
2023-08-30T09:12:04.406Z			2023-08-30 09:12:03,494 INFO [0253207eebc648e9bffea0707239c9bd] awx.analytics.job_lifecycle job-8206451 post run
2023-08-30T09:12:04.406Z			2023-08-30 09:12:03,484 INFO [0253207eebc648e9bffea0707239c9bd] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 8206451
2023-08-30T09:12:04.406Z			2023-08-30 09:12:03,433 INFO [-] awx.analytics.job_lifecycle job-8206451 stats wrapup finished
2023-08-30T09:12:04.397Z			ERROR 2023/08/30 09:12:03 [pV9H7SHu] Error reading from pod awx-workers/automation-job-8206451-zgf46: context canceled
2023-08-30T09:11:28.380Z			2023-08-30 09:11:27,141 INFO [0253207eebc648e9bffea0707239c9bd] awx.analytics.job_lifecycle job-8206451 work unit id assigned
2023-08-30T09:11:27.380Z			2023-08-30 09:11:27,097 INFO [0253207eebc648e9bffea0707239c9bd] awx.analytics.job_lifecycle job-8206451 work unit id received
2023-08-30T09:11:27.380Z			2023-08-30 09:11:26,873 INFO [0253207eebc648e9bffea0707239c9bd] awx.analytics.job_lifecycle job-8206451 running playbook
2023-08-30T09:11:27.380Z			2023-08-30 09:11:26,801 INFO [0253207eebc648e9bffea0707239c9bd] awx.analytics.job_lifecycle job-8206451 preparing playbook
2023-08-30T09:11:27.380Z			2023-08-30 09:11:26,666 INFO [0253207eebc648e9bffea0707239c9bd] awx.main.tasks.jobs Job 8206451 waited 252.4184970855713 to acquire lock for local source tree for path /var/lib/awx/projects/_36525__REDACTED.lock.
2023-08-30T09:07:15.310Z			[pid: 3286|app: 0|req: 68/646493] REDACTED () {64 vars in 3206 bytes} [Wed Aug 30 09:07:14 2023] GET /api/v2/unified_jobs/?id__in=8206451&name__icontains=REDACTEDe&not__launch_type=sync&order_by=-finished&page=1&page_size=20 => generated 52 bytes in 303 msecs (HTTP/1.1 200) 14 headers in 576 bytes (1 switches on core 0)
2023-08-30T09:07:15.310Z			REDACTED - - [30/Aug/2023:09:07:14 +0000] "GET /api/v2/unified_jobs/?id__in=8206451&name__icontains=REDACTEDe&not__launch_type=sync&order_by=-finished&page=1&page_size=20 HTTP/1.1" 200 52 "https://REDACTED.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/REDACTED Safari/537.36" "REDACTED"
2023-08-30T09:07:15.222Z			2023-08-30 09:07:14,745 INFO [0253207eebc648e9bffea0707239c9bd] awx.analytics.job_lifecycle job-8206451 notifications sent
2023-08-30T09:07:15.222Z			2023-08-30 09:07:14,241 INFO [0253207eebc648e9bffea0707239c9bd] awx.analytics.job_lifecycle job-8206451 pre run
2023-08-30T09:07:14.577Z			[pid: 3285|app: 0|req: 157/646482] REDACTED () {64 vars in 3227 bytes} [Wed Aug 30 09:07:14 2023] GET /api/v2/unified_jobs/?id__in=8206451%2C8206414&name__icontains=REDACTEDe&not__launch_type=sync&order_by=-finished&page=1&page_size=20 => generated 52 bytes in 204 msecs (HTTP/1.1 200) 14 headers in 576 bytes (1 switches on core 0)
2023-08-30T09:07:14.577Z			REDACTED - - [30/Aug/2023:09:07:14 +0000] "GET /api/v2/unified_jobs/?id__in=8206451%2C8206414&name__icontains=REDACTEDe&not__launch_type=sync&order_by=-finished&page=1&page_size=20 HTTP/1.1" 200 52 "https://REDACTED.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/REDACTED Safari/537.36" "REDACTED"
2023-08-30T09:07:14.577Z			[pid: 3210|app: 0|req: 420/646481] REDACTED () {64 vars in 3450 bytes} [Wed Aug 30 09:07:14 2023] GET /api/v2/unified_jobs/?id__in=8206451&not__launch_type=sync&or__adhoccommand__inventory=48587&or__inventoryupdate__inventory_source__inventory=48587&or__job__inventory=48587&or__workflowjob__inventory=48587&order_by=-finished&page=1&page_size=20 => generated 52 bytes in 234 msecs (HTTP/1.1 200) 14 headers in 576 bytes (1 switches on core 0)
2023-08-30T09:07:14.577Z			REDACTED - - [30/Aug/2023:09:07:14 +0000] "GET /api/v2/unified_jobs/?id__in=8206451&not__launch_type=sync&or__adhoccommand__inventory=48587&or__inventoryupdate__inventory_source__inventory=48587&or__job__inventory=48587&or__workflowjob__inventory=48587&order_by=-finished&page=1&page_size=20 HTTP/1.1" 200 52 "https://REDACTED.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/REDACTED Safari/537.36" "REDACTED"
2023-08-30T09:07:14.577Z			[pid: 3285|app: 0|req: 156/646477] REDACTED () {64 vars in 2838 bytes} [Wed Aug 30 09:07:14 2023] GET /api/v2/unified_jobs/?id=8201333&id__in=8206451&not__launch_type=sync&order_by=-finished&page=1&page_size=20 => generated 52 bytes in 208 msecs (HTTP/1.1 200) 14 headers in 579 bytes (1 switches on core 0)
2023-08-30T09:07:14.577Z			REDACTED - - [30/Aug/2023:09:07:14 +0000] "GET /api/v2/unified_jobs/?id=8201333&id__in=8206451&not__launch_type=sync&order_by=-finished&page=1&page_size=20 HTTP/1.1" 200 52 "https://REDACTED.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/REDACTED Safari/537.36" "REDACTED"
2023-08-30T09:07:14.091Z			2023-08-30 09:07:13,734 INFO [0253207eebc648e9bffea0707239c9bd] awx.analytics.job_lifecycle job-8206451 waiting
2023-08-30T09:07:13.090Z			2023-08-30 09:07:12,944 INFO [5f7742a679c14b728208ad5b6ace9f8b] awx.analytics.job_lifecycle job-8206451 blocked by projectupdate-8206410
2023-08-30T09:07:11.088Z			2023-08-30 09:07:10,114 INFO [77253084ad67431185bf6be7e13c1fb5] awx.analytics.job_lifecycle job-8206451 blocked by projectupdate-8206410
2023-08-30T09:06:58.207Z			2023-08-30 09:06:57,980 INFO [9a52ca1a2d004e099492068a170fd779] awx.analytics.job_lifecycle job-8206451 blocked by projectupdate-8206410
2023-08-30T09:06:54.081Z			2023-08-30 09:06:52,661 INFO [48d11e37e7f24fbf9c51f5517da7ee4c] awx.analytics.job_lifecycle job-8206451 blocked by projectupdate-8206410
2023-08-30T09:06:46.075Z			2023-08-30 09:06:45,548 INFO [15d85dc4273d40b3bd95a4126920ff58] awx.analytics.job_lifecycle job-8206451 blocked by projectupdate-8206410
2023-08-30T09:06:39.198Z			2023-08-30 09:06:37,780 INFO [27f4cc4b0e80479ba58e847eeaadb230] awx.analytics.job_lifecycle job-8206451 blocked by projectupdate-8206410
2023-08-30T09:06:39.071Z			2023-08-30 09:06:38,324 INFO [15d85dc4273d40b3bd95a4126920ff58] awx.analytics.job_lifecycle job-8206451 blocked by projectupdate-8206410
2023-08-30T09:06:34.068Z			2023-08-30 09:06:32,624 INFO [eadd7b7b13b24d6ab4c5dd0d0238b90f] awx.analytics.job_lifecycle job-8206451 blocked by projectupdate-8206410
2023-08-30T09:06:19.186Z			2023-08-30 09:06:17,761 INFO [1370871e537e45a3b368e2da0a4fd70a] awx.analytics.job_lifecycle job-8206451 blocked by projectupdate-8206410
2023-08-30T09:06:14.055Z			2023-08-30 09:06:12,747 INFO [d95ea34f91394953bf23d5cb84e603dd] awx.analytics.job_lifecycle job-8206451 blocked by projectupdate-8206410
2023-08-30T09:06:03.049Z			2023-08-30 09:06:01,882 INFO [230a326fbb584ad98df083be1b381de1] awx.analytics.job_lifecycle job-8206451 blocked by projectupdate-8206410
2023-08-30T09:06:02.049Z			2023-08-30 09:06:01,195 INFO [230a326fbb584ad98df083be1b381de1] awx.analytics.job_lifecycle job-8206451 blocked by projectupdate-8206410
2023-08-30T09:05:59.174Z			2023-08-30 09:05:57,827 INFO [d6938ca05b6a4e5dbfa3472e657d0557] awx.analytics.job_lifecycle job-8206451 blocked by projectupdate-8206410
2023-08-30T09:05:54.045Z			2023-08-30 09:05:52,591 INFO [e625cac429894d33a92e4fdb3b5ea7f0] awx.analytics.job_lifecycle job-8206451 blocked by projectupdate-8206410
2023-08-30T09:05:51.526Z			[pid: 3285|app: 0|req: 88/646360] REDACTED () {64 vars in 2838 bytes} [Wed Aug 30 09:05:50 2023] GET /api/v2/unified_jobs/?id=8201333&id__in=8206451&not__launch_type=sync&order_by=-finished&page=1&page_size=20 => generated 52 bytes in 168 msecs (HTTP/1.1 200) 14 headers in 579 bytes (1 switches on core 0)
2023-08-30T09:05:51.526Z			REDACTED - - [30/Aug/2023:09:05:50 +0000] "GET /api/v2/unified_jobs/?id=8201333&id__in=8206451&not__launch_type=sync&order_by=-finished&page=1&page_size=20 HTTP/1.1" 200 52 "https://REDACTED.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/REDACTED Safari/537.36" "REDACTED"
2023-08-30T09:05:51.262Z			[pid: 3286|app: 0|req: 1/646370] REDACTED () {64 vars in 3449 bytes} [Wed Aug 30 09:05:50 2023] GET /api/v2/unified_jobs/?id__in=8206451&not__launch_type=sync&or__adhoccommand__inventory=48587&or__inventoryupdate__inventory_source__inventory=48587&or__job__inventory=48587&or__workflowjob__inventory=48587&order_by=-finished&page=1&page_size=20 => generated 52 bytes in 485 msecs (HTTP/1.1 200) 14 headers in 576 bytes (1 switches on core 0)
2023-08-30T09:05:51.262Z			REDACTED - - [30/Aug/2023:09:05:50 +0000] "GET /api/v2/unified_jobs/?id__in=8206451&not__launch_type=sync&or__adhoccommand__inventory=48587&or__inventoryupdate__inventory_source__inventory=48587&or__job__inventory=48587&or__workflowjob__inventory=48587&order_by=-finished&page=1&page_size=20 HTTP/1.1" 200 52 "https://REDACTED.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/REDACTED Safari/537.36" "REDACTED"
2023-08-30T09:05:51.043Z			2023-08-30 09:05:50,841 INFO [230a326fbb584ad98df083be1b381de1] awx.analytics.job_lifecycle job-8206451 blocked by projectupdate-8206410
2023-08-30T09:05:51.043Z			2023-08-30 09:05:50,284 INFO [230a326fbb584ad98df083be1b381de1] awx.analytics.job_lifecycle job-8206451 acknowledged
2023-08-30T09:05:51.043Z			2023-08-30 09:05:50,163 INFO [230a326fbb584ad98df083be1b381de1] awx.analytics.job_lifecycle job-8206451 created



AWX version

22.5.0

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

Deploying AWX 22.5 on EKS 1.24 -> running jobs

Expected results

The jobs should start running after a few seconds, or 2-3 minutes if a project update is performed.

Actual results

The AWX web CPU is high and the jobs are waiting between 4-10 minutes to start running.

Additional information

No response

@djyasin
Copy link
Member

djyasin commented Aug 30, 2023

@lioramazor
We would like to gather a bit more information from you. Do you have a persistent project set up? Could you please provide us with additional details regarding the workload? Is there a single project?

Thank you for your time!

@lioramazor
Copy link
Author

@djyasin
Thanks for the fast response.

We have a persistent project set up. We have been using AWX for years to deploy our product's infra, and sometimes hundreds of workflows run simultaneously. We have many different AWX projects and types of workflows.

We have a production environment with version 21.10.2 where the issue doesn't happen, and everything is stable.

We have another environment for developers with version 22.5, upgraded from 22.3, that started having issues with the web pods' CPU consumption and long locks after the upgrade. Please see these logs about the job 8206451.
Pay attention that nothing was happening for 4 minutes between 2023-08-30T09:07:15.310Z and 2023-08-30T09:11:27.380Z, so it seems like AWX was in idle mode because of the web pods CPU consumption.

2023-08-30T09:11:27.380Z            2023-08-30 09:11:26,801 INFO [0253207eebc648e9bffea0707239c9bd] awx.analytics.job_lifecycle job-8206451 preparing playbook
2023-08-30T09:11:27.380Z            2023-08-30 09:11:26,666 INFO [0253207eebc648e9bffea0707239c9bd] awx.main.tasks.jobs Job 8206451 waited 252.4184970855713 to acquire the lock for local source tree for path /var/lib/awx/projects/_36525__REDACTED.lock.
2023-08-30T09:07:15.310Z            [pid: 3286|app: 0|req: 68/646493] REDACTED () {64 vars in 3206 bytes} [Wed Aug 30 09:07:14 2023] GET /api/v2/unified_jobs/?id__in=8206451&name__icontains=REDACTEDe&not__launch_type=sync&order_by=-finished&page=1&page_size=20 => generated 52 bytes in 303 msecs (HTTP/1.1 200) 14 headers in 576 bytes (1 switches on core 0)
2023-08-30T09:07:15.310Z            REDACTED - - [30/Aug/2023:09:07:14 +0000] "GET /api/v2/unified_jobs/?id__in=8206451&name__icontains=REDACTEDe&not__launch_type=sync&order_by=-finished&page=1&page_size=20 HTTP/1.1" 200 52 "[https://REDACTED.com/"](https://redacted.com/%22) "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/REDACTED Safari/537.36" "REDACTED"
2023-08-30T09:07:15.222Z            2023-08-30 09:07:14,745 INFO [0253207eebc648e9bffea0707239c9bd] awx.analytics.job_lifecycle job-8206451 notifications sent




Please see the web task resource consumption after the upgrade (the upgrade happened on the first of August). One of the lines is resource request (we were changing it from time to time when we were trying to handle this issue), and another is the actual CPU consumption that became much higher after the upgrade.
image

This problem is blocking us from upgrading our PROD environment to 22.5 and aligning our environments.

We will be glad to provide any info that is needed.

@elibogomolnyi
Copy link

Hi @djyasin and @TheRealHaoLiu,

I want to say that I am from the @lioramazor team, and we are struggling with this issue right now. We will gladly provide additional info and arrange a session if needed. There seems to be some issue with the CPU consumption inserted between 22.3 and 22.5; we will be glad to contribute to troubleshooting and resolving it.

@lioramazor lioramazor changed the title Long wait times to acquire lock for local source tree for path Web replicas CPU consumption issue. Jobs start with dela Aug 31, 2023
@lioramazor lioramazor changed the title Web replicas CPU consumption issue. Jobs start with dela Web replicas CPU consumption issue. Jobs start with delay Aug 31, 2023
@elibogomolnyi
Copy link

elibogomolnyi commented Aug 31, 2023

@djyasin @TheRealHaoLiu

An update. This issue seems to be related to the RECEPTOR_KUBE_SUPPORT_RECONNECT, which was enabled.

We had to redeploy the development AWX with RECEPTOR_KUBE_SUPPORT_RECONNECT disabled, and for now, everything looks good.
We also removed the RECEPTOR_LOG DEBUG level, but it doesn’t look like it is related to performance, so we think the main problem is the reconnect support.

We will be back on Sunday and ready to say for sure.
image

@TheRealHaoLiu
Copy link
Member

I fail to see how RECEPTOR_KUBE_SUPPORT_RECONNECT could possibly cause increase CPU utilization on web pod

RECEPTOR_KUBE_SUPPORT_RECONNECT is only set on the env var of the controlplane-ee container that's deployed in task pod and rest of AWX is not aware nor care about the env var.

lets do some digging on the DB side and see if the database is struggling somehow.

@TheRealHaoLiu
Copy link
Member

btw when u set/remove RECEPTOR_KUBE_SUPPORT_RECONNECT it cause a deployment change which cause pod to be restarted that's probably the drop that you see there.

@elibogomolnyi
Copy link

elibogomolnyi commented Aug 31, 2023

I fail to see how RECEPTOR_KUBE_SUPPORT_RECONNECT could possibly cause increase CPU utilization on web pod

Me too. But it is what is happening for now. We will keep watching, but the CPU looks ok for now. I wonder if there can be some indirect relation.

btw when u set/remove RECEPTOR_KUBE_SUPPORT_RECONNECT it cause a deployment change which cause pod to be restarted that's probably the drop that you see there.

Yes, sure thing. Will see if there won't be any spikes after turning off the reconnect support. By the way, do you think some changes between 22.3 and 22.5 could cause the CPU spikes?

@elibogomolnyi
Copy link

elibogomolnyi commented Sep 3, 2023

btw when u set/remove RECEPTOR_KUBE_SUPPORT_RECONNECT it cause a deployment change which cause pod to be restarted that's probably the drop that you see there.

Hi @TheRealHaoLiu, you were right; the problem is not related to RECEPTOR_KUBE_SUPPORT_RECONNECT

Let me describe, in detail, what is happening after an upgrade. We upgraded from 22.3 to 22.5 on August 2. We even did it with you when we had a DB migration issue. =).

Immediately after an upgrade, we noticed that the web replicas consume much more CPU than before:
image

And task replicas consume more memory:
image

We are not sure, if it is related, but we also saw an increase in CPU and Average Active sessions of our RDS:
image

We had RDS CPU hitting 100% and slowness in AWX, so on August 24, we decided to make a cleanup (we deleted all jobs older than 45 days and all notification and notification templates, that we don't need). According to the screenshot of the RDS performance above, it seems it helped to remediate the RDS CPU issue. At least, it didn't hit 100% since then.

During the last week, we ran tests requiring about 50 workflows simultaneously. Our users started complaining about the slowness of the workflows; the processes that took 1 hour before, now can take up to 3 hours. It is not a permanent issue, and it looks like it depends on the AWX load, which it can handle anymore in the 22.5.

I'll describe, what we see in detail.

We are running some workflows from API and see that some are stuck on one of the jobs. This job takes about 2 minutes to finish, but because of some problem, it took 45 minutes. The job number is 8229831. Here are the logs, related to this job:

2023-08-31 09:29:46,082 INFO     [20c1a5a028e648fdb097f1d781643c2e] awx.analytics.job_lifecycle job-8229831 created
2023-08-31 09:29:49,124 INFO     [20c1a5a028e648fdb097f1d781643c2e] awx.analytics.job_lifecycle job-8229831 acknowledged
172.27.4.149 - - [31/Aug/2023:09:29:49 +0000] "GET /api/v2/unified_jobs/?id__in=8229831&name__icontains=console_&not__launch_type=sync&order_by=-finished&page=1&page_size=20 HTTP/1.1" 200 52 "https://.../" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36" "..."
[pid: 2043|app: 0|req: 549/398386] 172.27.4.149 () {64 vars in 3267 bytes} [Thu Aug 31 09:29:49 2023] GET /api/v2/unified_jobs/?id__in=8229831&name__icontains=console_&not__launch_type=sync&order_by=-finished&page=1&page_size=20 => generated 52 bytes in 151 msecs (HTTP/1.1 200) 14 headers in 577 bytes (1 switches on core 0)
172.27.5.142 - - [31/Aug/2023:09:29:49 +0000] "GET /api/v2/unified_jobs/?id__in=8229831&name__icontains=_&not__launch_type=sync&or__status=running&order_by=-finished&page=1&page_size=20 HTTP/1.1" 200 52 "https://.../" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/... Safari/537.36" "..."
[pid: 2033|app: 0|req: 746/398387] 172.27.5.142 () {64 vars in 2246 bytes} [Thu Aug 31 09:29:49 2023] GET /api/v2/unified_jobs/?id__in=8229831&name__icontains=_&not__launch_type=sync&or__status=running&order_by=-finished&page=1&page_size=20 => generated 52 bytes in 166 msecs (HTTP/1.1 200) 14 headers in 577 bytes (1 switches on core 0)
2023-08-31 09:29:49,746 INFO     [20c1a5a028e648fdb097f1d781643c2e] awx.analytics.job_lifecycle job-8229831 blocked by projectupdate-8229780

From here, we started getting a message, that the job is blocked by project update 8229780.

We get this message for about 3.5 minutes and then the jobs starts running.

2023-08-31 09:33:41,799 INFO     [4cd7ed05812e42b19aee4fa445925d1c] awx.analytics.job_lifecycle job-8229831 waiting
2023-08-31 09:33:43,394 INFO     [4cd7ed05812e42b19aee4fa445925d1c] awx.analytics.job_lifecycle job-8229831 pre run
2023-08-31 09:33:44,005 INFO     [4cd7ed05812e42b19aee4fa445925d1c] awx.analytics.job_lifecycle job-8229831 notifications sent
172.27.4.149 - - [31/Aug/2023:09:33:43 +0000] "GET /api/v2/unified_jobs/?id__in=8229831&name__icontains=console_&not__launch_type=sync&order_by=-finished&page=1&page_size=20 HTTP/1.1" 200 52 "https://.../" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36" "..."
[pid: 2051|app: 0|req: 226/401014] ... () {64 vars in 3268 bytes} [Thu Aug 31 09:33:43 2023] GET /api/v2/unified_jobs/?id__in=8229831&name__icontains=console_&not__launch_type=sync&order_by=-finished&page=1&page_size=20 => generated 52 bytes in 174 msecs (HTTP/1.1 200) 14 headers in 577 bytes (1 switches on core 0)
172.27.5.142 - - [**31/Aug/2023:09:33:43** +0000] "GET /api/v2/unified_jobs/?id__in=8229831&name__icontains=_&not__launch_type=sync&or__status=running&order_by=-finished&page=1&page_size=20 HTTP/1.1" 200 52 "https://.../" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/... Safari/537.36" "..."
[pid: 2046|app: 0|req: 657/401017] ... () {64 vars in 2245 bytes} [Thu Aug 31 09:33:43 2023] GET /api/v2/unified_jobs/?id__in=8229831&on =_&not__launch_type=sync&or__status=running&order_by=-finished&page=1&page_size=20 => generated 52 bytes in 194 msecs (HTTP/1.1 200) 14 headers in 577 bytes (1 switches on core 0)

And from now on, something very strange will happen. From the last logs, we see, that they stopped at 31/Aug/2023:09:33:43. And between 31/Aug/2023:09:33:43 and 31/Aug/2023:09:45:30, nothing happens in logs (at least, there are no records, that contain the job number)

So after that, we can see

.... - - [31/Aug/2023:09:45:30 +0000] "GET /api/v2/unified_jobs/?id=8229831 HTTP/1.1" 200 26413 "https://.../" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36" "..."
[pid: 2086|app: 0|req: 615/409219] ..... () {64 vars in 2042 bytes} [Thu Aug 31 09:45:30 2023] GET /api/v2/unified_jobs/?id=8229831 => generated 26413 bytes in 231 msecs (HTTP/1.1 200) 14 headers in 580 bytes (1 switches on core 0)
... - - [31/Aug/2023:09:45:31 +0000] "OPTIONS /api/v2/jobs/8229831/job_events/ HTTP/1.1" 200 11910 "https://.../" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36" "..."
[pid: 2091|app: 0|req: 297/409226] ... () {66 vars in 2090 bytes} [Thu Aug 31 09:45:31 2023] OPTIONS /api/v2/jobs/8229831/job_events/ => generated 11910 bytes in 217 msecs (HTTP/1.1 200) 15 headers in 603 bytes (1 switches on core 0)
... - - [31/Aug/2023:09:45:31 +0000] "GET /api/v2/jobs/8229831/ HTTP/1.1" 200 26691 "https://.../" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36" "..."
[pid: 2108|app: 0|req: 32/409311] ... () {64 vars in 2021 bytes} [Thu Aug 31 09:45:30 2023] GET /api/v2/jobs/8229831/ => generated 26691 bytes in 270 msecs (HTTP/1.1 200) 14 headers in 588 bytes (1 switches on core 0)
... - - [31/Aug/2023:09:45:34 +0000] "GET /api/v2/jobs/8229831/job_events/?not__stdout=&order_by=counter&page=1&page_size=50 HTTP/1.1" 200 52 ".../" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36" "..."
[pid: 2093|app: 0|req: 302/409358] ... () {64 vars in 2142 bytes} [Thu Aug 31 09:45:34 2023] GET /api/v2/jobs/8229831/job_events/?not__stdout=&order_by=counter&page=1&page_size=50 => generated 52 bytes in 189 msecs (HTTP/1.1 200) 15 headers in 600 bytes (1 switches on core 0)
2023-08-31 09:45:34,889 INFO     [-] awx.main.wsrelay Starting producer for 100.64.66.2-job_events-8229831
2023-08-31 09:45:34,889 INFO     [-] awx.main.wsrelay Starting producer for 100.64.66.2-job_events-8229831
2023-08-31 09:45:44,892 INFO     [-] awx.main.wsrelay Producer 100.64.66.2-job_events-8229831 has no subscribers, shutting down.
2023-08-31 09:45:44,892 INFO     [-] awx.main.wsrelay Producer 100.64.66.2-job_events-8229831 has no subscribers, shutting down.

Here comes this message that seems to be important:

2023-08-31 10:21:08,609 INFO     [4cd7ed05812e42b19aee4fa445925d1c] awx.main.tasks.jobs Job 8229831 waited 2845.2125771045685 to acquire lock for local source tree for path /var/lib/awx/projects/_36525__pgsuspegasus_wf_pcloud.lock.

And then preparing the playbook:

2023-08-31 10:22:04,314 INFO     [4cd7ed05812e42b19aee4fa445925d1c] awx.analytics.job_lifecycle job-8229831 preparing playbook
2023-08-31 10:22:04,438 INFO     [4cd7ed05812e42b19aee4fa445925d1c] awx.analytics.job_lifecycle job-8229831 running playbook
2023-08-31 10:22:04,660 INFO     [4cd7ed05812e42b19aee4fa445925d1c] awx.analytics.job_lifecycle job-8229831 work unit id received
2023-08-31 10:22:04,703 INFO     [4cd7ed05812e42b19aee4fa445925d1c] awx.analytics.job_lifecycle job-8229831 work unit id assigned

And the job is running after it.

We mentioned a project update 8229780. Here are the logs, related to it:

2023-08-31 09:27:04,257 INFO     [4cd7ed05812e42b19aee4fa445925d1c] awx.analytics.job_lifecycle projectupdate-8229780 created
2023-08-31 09:27:04,446 INFO     [4cd7ed05812e42b19aee4fa445925d1c] awx.analytics.job_lifecycle projectupdate-8229780 acknowledged
2023-08-31 09:27:04,850 INFO     [4cd7ed05812e42b19aee4fa445925d1c] awx.analytics.job_lifecycle projectupdate-8229780 controller node chosen
2023-08-31 09:27:04,850 INFO     [4cd7ed05812e42b19aee4fa445925d1c] awx.analytics.job_lifecycle projectupdate-8229780 execution node chosen
2023-08-31 09:27:04,895 INFO     [4cd7ed05812e42b19aee4fa445925d1c] awx.analytics.job_lifecycle projectupdate-8229780 waiting
2023-08-31 09:27:04,895 INFO     [4cd7ed05812e42b19aee4fa445925d1c] awx.analytics.job_lifecycle job-8229779 blocked by projectupdate-8229780
2023-08-31 09:27:05,033 INFO     [4cd7ed05812e42b19aee4fa445925d1c] awx.analytics.job_lifecycle projectupdate-8229780 pre run

Then we can see, that many other jobs were blocked by this project update. We see hundreds of records like this, about the different jobs:

2023-08-31 09:27:12,454 INFO     [4cd7ed05812e42b19aee4fa445925d1c] awx.analytics.job_lifecycle job-8229779 blocked by projectupdate-8229780
2023-08-31 09:27:19,681 INFO     [4cd7ed05812e42b19aee4fa445925d1c] awx.analytics.job_lifecycle job-8229783 blocked by projectupdate-8229780
2023-08-31 09:27:34,965 INFO     [20c1a5a028e648fdb097f1d781643c2e] awx.analytics.job_lifecycle job-8229779 blocked by projectupdate-8229780

In the middle of the messages about the jobs, blocked by project update we get this:

2023-08-31 09:31:35,388 INFO     [4cd7ed05812e42b19aee4fa445925d1c] awx.main.tasks.jobs Job 8229780 waited 270.35366439819336 to acquire lock for local source tree for path /var/lib/awx/projects/_36525__pgsuspegasus_wf_pcloud.lock.
2023-08-31 09:31:35,391 INFO     [4cd7ed05812e42b19aee4fa445925d1c] awx.analytics.job_lifecycle projectupdate-8229780 preparing playbook
2023-08-31 09:31:35,633 INFO     [4cd7ed05812e42b19aee4fa445925d1c] awx.analytics.job_lifecycle projectupdate-8229780 running playbook
2023-08-31 09:31:35,651 INFO     [4cd7ed05812e42b19aee4fa445925d1c] awx.analytics.job_lifecycle projectupdate-8229780 work unit id received

The logs about jobs, that are blocked by project update again and finally:

2023-08-31 09:33:41,188 INFO     [4cd7ed05812e42b19aee4fa445925d1c] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 8229780
2023-08-31 09:33:41,190 INFO     [4cd7ed05812e42b19aee4fa445925d1c] awx.analytics.job_lifecycle projectupdate-8229780 post run
2023-08-31 09:33:41,335 INFO     [4cd7ed05812e42b19aee4fa445925d1c] awx.analytics.job_lifecycle projectupdate-8229780 finalize run
2023-08-31 09:33:41,408 INFO     [-] awx.analytics.job_lifecycle projectupdate-8229780 stats wrapup finished
172.27.5.142 - - [31/Aug/2023:09:33:41 +0000] "GET /api/v2/unified_jobs/?id__in=8229780&name__icontains=_&not__launch_type=sync&or__status=running&order_by=-finished&page=1&page_size=20 HTTP/1.1" 200 52 "https:.../" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36" "..."
[pid: 2051|app: 0|req: 218/400985] 172.27.5.142 () {64 vars in 2246 bytes} [Thu Aug 31 09:33:41 2023] GET /api/v2/unified_jobs/?id__in=8229780&name__icontains=_&not__launch_type=sync&or__status=running&order_by=-finished&page=1&page_size=20 => generated 52 bytes in 169 msecs (HTTP/1.1 200) 14 headers in 577 bytes (1 switches on core 0)
172.27.4.149 - - [31/Aug/2023:09:33:41 +0000] "GET /api/v2/unified_jobs/?id__in=8229780&name__icontains=console_&not__launch_type=sync&order_by=-finished&page=1&page_size=20 HTTP/1.1" 200 52 "https://.../" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36" "...."
[pid: 2048|app: 0|req: 363/401072] 172.27.4.149 () {64 vars in 3268 bytes} [Thu Aug 31 09:33:41 2023] GET /api/v2/unified_jobs/?id__in=8229780&name__icontains=console_&not__launch_type=sync&order_by=-finished&page=1&page_size=20 => generated 52 bytes in 164 msecs (HTTP/1.1 200) 14 headers in 577 bytes (1 switches on core 0)
2023-08-31 09:37:52,078 WARNING  [4cd7ed05812e42b19aee4fa445925d1c] awx.main.tasks.jobs Rewriting cache at /var/lib/awx/projects/.__awx_cache/_36525__pgsuspegasus_wf_pcloud/8229780, performance may suffer

So, to sum up.

  • The job that usually takes 2 minutes to finish, was running from 8/31/2023, 12:33:42 PM till 8/31/2023, 1:22:23 PM (49 minutes)
  • The project update, that was triggered by this job, started running at 8/31/2023, 12:27:04 PM and finished at 8/31/2023, 12:33:41 PM (6 minutes, when it usually takes less than 1 minute for it to finish).
  • The job started running for real at 13:22:11 (I can see it in jobs output). So looks like nothing was happening between 12:27:04 and 13:22:11.

This is what happened with the web CPU during that time:
image

And the task RAM:
image

After the upgrade to 22.5 we also experience that RDS CPU reaches 100%. Here is what we can see in logs during this time

2023-09-03 07:00:39 UTC:100.64.166.106(56430):prod-awx@unicorn:[1712]:ERROR:  relation "main_inventoryupdateevent_20230903_07" already exists
2023-09-03 07:00:39 UTC:100.64.166.106(56430):prod-awx@unicorn:[1712]:STATEMENT:  CREATE TABLE IF NOT EXISTS main_inventoryupdateevent_20230903_07 PARTITION OF main_inventoryupdateevent FOR VALUES FROM ('2023-09-03 07:00:00+00:00') to ('2023-09-03 08:00:00+00:00');
2023-09-03 07:03:13 UTC::@:[394]:LOG:  checkpoint complete: wrote 17470 buffers (1.7%); 0 WAL file(s) added, 0 removed, 2 recycled; write=269.809 s, sync=0.005 s, total=269.852 s; sync files=465, longest=0.005 s, average=0.001 s; distance=150052 kB, estimate=174163 kB
2023-09-03 07:03:43 UTC::@:[394]:LOG:  checkpoint starting: time
2023-09-03 07:08:13 UTC::@:[394]:LOG:  checkpoint complete: wrote 10285 buffers (1.0%); 0 WAL file(s) added, 1 removed, 1 recycled; write=269.921 s, sync=0.005 s, total=269.968 s; sync files=441, longest=0.004 s, average=0.001 s; distance=89066 kB, estimate=165653 kB
2023-09-03 07:08:43 UTC::@:[394]:LOG:  checkpoint starting: time
2023-09-03 07:13:13 UTC::@:[394]:LOG:  checkpoint complete: wrote 18623 buffers (1.8%); 0 WAL file(s) added, 0 removed, 1 recycled; write=269.883 s, sync=0.005 s, total=269.915 s; sync files=393, longest=0.004 s, average=0.001 s; distance=147548 kB, estimate=163843 kB
2023-09-03 07:13:43 UTC::@:[394]:LOG:  checkpoint starting: time
2023-09-03 07:18:13 UTC::@:[394]:LOG:  checkpoint complete: wrote 8650 buffers (0.9%); 0 WAL file(s) added, 0 removed, 1 recycled; write=269.836 s, sync=0.006 s, total=269.871 s; sync files=420, longest=0.005 s, average=0.001 s; distance=75038 kB, estimate=154962 kB
2023-09-03 07:18:43 UTC::@:[394]:LOG:  checkpoint starting: time
2023-09-03 07:23:13 UTC::@:[394]:LOG:  checkpoint complete: wrote 10028 buffers (1.0%); 0 WAL file(s) added, 0 removed, 2 recycled; write=269.874 s, sync=0.006 s, total=269.922 s; sync files=389, longest=0.005 s, average=0.001 s; distance=88567 kB, estimate=148323 kB
2023-09-03 07:23:43 UTC::@:[394]:LOG:  checkpoint starting: time
2023-09-03 07:28:13 UTC::@:[394]:LOG:  checkpoint complete: wrote 17043 buffers (1.7%); 0 WAL file(s) added, 0 removed, 1 recycled; write=269.793 s, sync=0.005 s, total=269.827 s; sync files=421, longest=0.004 s, average=0.001 s; distance=113852 kB, estimate=144876 kB
2023-09-03 07:28:43 UTC::@:[394]:LOG:  checkpoint starting: time
2023-09-03 07:33:13 UTC::@:[394]:LOG:  checkpoint complete: wrote 9085 buffers (0.9%); 0 WAL file(s) added, 0 removed, 2 recycled; write=269.918 s, sync=0.005 s, total=269.950 s; sync files=356, longest=0.004 s, average=0.001 s; distance=78571 kB, estimate=138245 kB
2023-09-03 07:33:43 UTC::@:[394]:LOG:  checkpoint starting: time
2023-09-03 07:38:13 UTC::@:[394]:LOG:  checkpoint complete: wrote 9247 buffers (0.9%); 0 WAL file(s) added, 0 removed, 1 recycled; write=269.854 s, sync=0.004 s, total=269.890 s; sync files=329, longest=0.003 s, average=0.001 s; distance=79221 kB, estimate=132343 kB
2023-09-03 07:38:43 UTC::@:[394]:LOG:  checkpoint starting: time
2023-09-03 07:43:13 UTC::@:[394]:LOG:  checkpoint complete: wrote 8891 buffers (0.9%); 0 WAL file(s) added, 0 removed, 1 recycled; write=269.918 s, sync=0.005 s, total=269.950 s; sync files=376, longest=0.004 s, average=0.001 s; distance=72492 kB, estimate=126358 kB
2023-09-03 07:43:43 UTC::@:[394]:LOG:  checkpoint starting: time
2023-09-03 07:48:13 UTC::@:[394]:LOG:  checkpoint complete: wrote 9187 buffers (0.9%); 0 WAL file(s) added, 0 removed, 1 recycled; write=269.804 s, sync=0.020 s, total=269.854 s; sync files=339, longest=0.008 s, average=0.001 s; distance=78017 kB, estimate=121524 kB
2023-09-03 07:48:43 UTC::@:[394]:LOG:  checkpoint starting: time
2023-09-03 07:53:13 UTC::@:[394]:LOG:  checkpoint complete: wrote 11035 buffers (1.1%); 0 WAL file(s) added, 1 removed, 1 recycled; write=269.825 s, sync=0.007 s, total=269.950 s; sync files=321, longest=0.006 s, average=0.001 s; distance=100094 kB, estimate=119381 kB
2023-09-03 07:53:43 UTC::@:[394]:LOG:  checkpoint starting: time
2023-09-03 07:58:13 UTC::@:[394]:LOG:  checkpoint complete: wrote 14278 buffers (1.4%); 0 WAL file(s) added, 7 removed, 0 recycled; write=269.832 s, sync=0.011 s, total=270.132 s; sync files=376, longest=0.010 s, average=0.001 s; distance=125681 kB, estimate=125681 kB
2023-09-03 07:58:43 UTC::@:[394]:LOG:  checkpoint starting: time

image image image

This query seems to be the heavies (takes 60% of CPU):
image

We will be glad to share all needed info and logs with you to troubleshoot this issue. Please tell me if any more info is needed.

@elibogomolnyi
Copy link

We found the root cause of this issue and opened some related bugs. Please see them here:
#14432

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants