AWX not able to delete the worker pods after finished running #15247
Comments
I observe the same issue on Kubernetes 1.27 with AWX 23.0.0. The pods that are not deleted are those whose AWX jobs were deleted immediately after the pod failed. It looks like AWX only knows about existing pods through the jobs inside AWX. If this is the case, the pod should be actively removed from Kubernetes when the job is deleted, OR the API output of the job should give a hint on whether the pod has already been deleted inside Kubernetes. |
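To illustrate the mismatch described in this comment (pods that outlive their deleted AWX jobs), here is a minimal sketch that flags pods whose job ID is no longer known to AWX. The naming pattern is only inferred from the example pod name automation-job-462026-6zf7c in this issue, and the helper names job_id_from_pod_name and find_orphaned_pods are hypothetical; how the pod names and job IDs are collected is left to the caller.

```python
import re

# Assumption: job pods are named "automation-job-<job id>-<random suffix>",
# as inferred from the example "automation-job-462026-6zf7c" in this issue.
POD_NAME_RE = re.compile(r"^automation-job-(\d+)-[a-z0-9]+$")


def job_id_from_pod_name(pod_name):
    """Return the job ID embedded in a pod name, or None if it does not match."""
    match = POD_NAME_RE.match(pod_name)
    return int(match.group(1)) if match else None


def find_orphaned_pods(pod_names, known_job_ids):
    """Return the pods whose job ID is not in the set of job IDs AWX still knows."""
    orphans = []
    for name in pod_names:
        job_id = job_id_from_pod_name(name)
        if job_id is not None and job_id not in known_job_ids:
            orphans.append(name)
    return orphans
```

Pods reported this way are only candidates for manual cleanup; the sketch does not delete anything.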
Can you give us the output of |
@TheRealHaoLiu below is the output.
|
@chinna44 that does not look like the output from the api endpoint... that looks like the stdout of the job |
@TheRealHaoLiu yes, you are correct, I'm sorry about that. Below is the output for the endpoint /api/v2/jobs/462026, but I could not see the details of this job or of any other Inventory Sync jobs of this kind. Please let me know if you need the details in any other form. HTTP 404 Not Found { |
@TheRealHaoLiu please let me know if you need any other details |
Hi. We face the same issue: job pods are sometimes not removed in k8s. These are mostly job pods that ended with an error rather than OK, but some successful pods are left hanging as well. We first observed these problems after updating AWX from 24.5.0 to 24.6.1 and upgrading k8s to 1.30.3, which sadly both took place on the same day. Before the upgrade we did not observe this kind of problem, and we were already running at least k8s 1.28 (I cannot confirm the precise version currently). Update: OK, I take back everything above. It turned out that during troubleshooting one of our admins had set
This explains the behaviour we had. After reverting those back to the defaults, all leftover pods were removed immediately by AWX and no new pods are left behind. So everything works as expected, at least with the mentioned versions. |
Bug Summary
We recently upgraded AWX from 22.5.0 to 23.9.0, deployed on EKS 1.28.
After the AWX upgrade, we observed that the worker pods of a few jobs (not all jobs), specifically inventory sync jobs, are not deleted even after the job workflow has completed. The pods stay around for hours or days until we delete them manually. I don't see any other errors.
The worker pods status is shown below
NAME READY STATUS RESTARTS AGE
automation-job-462026-6zf7c 1/2 NotReady 0 3m23s
The errors captured from the AWX control plane ee logs for the worker pods that are not deleted:
Error deleting pod automation-job-462026-6zf7c: client rate limiter Wait returned an error: context canceled
Context was canceled while reading logs for pod awx-workers/automation-job-462026-6zf7c. Assuming pod has finished
The pod status description shows the following (confidential data is omitted):
Containers:
  worker:
    State: Terminated
      Reason: Completed
      Exit Code: 0
    Ready: False
    Restart Count: 0
  authenticator:
    State: Running
    Ready: True
    Restart Count: 0
The automation-job-462026-6zf7c pod contains two containers: worker and authenticator.
When the pod is stuck, the worker container is terminated while the authenticator container keeps running. The output of the worker and authenticator containers is attached:
worker-container.txt
authenticator-container.txt
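The stuck state described above (worker terminated, authenticator still running) can be detected from the pod's status object. The following is a rough sketch, assuming the pod is supplied as the parsed JSON of a Kubernetes Pod object (for example from kubectl get pod -o json); the function name is_stuck_job_pod is hypothetical, and the container name worker is taken from this report.

```python
def is_stuck_job_pod(pod, main_container="worker"):
    """Return True if the main container has terminated while another
    container in the same pod is still running.

    `pod` is a dict shaped like a Kubernetes Pod object; only
    status.containerStatuses[].name and .state are inspected.
    """
    statuses = pod.get("status", {}).get("containerStatuses", [])
    main_done = False
    other_running = False
    for container in statuses:
        state = container.get("state", {})
        if container.get("name") == main_container:
            # The state dict holds exactly one of: waiting, running, terminated.
            main_done = "terminated" in state
        elif "running" in state:
            other_running = True
    return main_done and other_running
```

A pod matching this check corresponds to the 1/2 NotReady status shown earlier; the check only identifies such pods and makes no claim about why AWX did not delete them.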
For now we are testing this in a non-production environment; currently it is a blocker for upgrading production. Please have a look and provide a fix, or suggest the best AWX version if this is a known issue.
AWX version
23.9.0
Select the relevant components
Installation method
kubernetes
Modifications
no
Ansible version
No response
Operating system
No response
Web browser
No response
Steps to reproduce
Run many AWX jobs whose pod contains the worker and authenticator images (we observed this mainly with Inventory Sync jobs).
Expected results
AWX deletes all the pods that finished running.
Actual results
AWX Worker pods got stuck
Additional information
No response