
AWX not able to delete the worker pods after finished running #15247

Open
chinna44 opened this issue Jun 3, 2024 · 7 comments

chinna44 commented Jun 3, 2024

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)

Bug Summary

We recently upgraded AWX from 22.5.0 to 23.9.0, deployed on EKS 1.28.

After the upgrade, we observed that some worker pods (not all), mainly those running inventory sync jobs, are not deleted even after the job workflow completes. The pods linger for hours or days until we delete them manually. I don't see any other errors.

The worker pod status is shown below:
NAME READY STATUS RESTARTS AGE
automation-job-462026-6zf7c 1/2 NotReady 0 3m23s

The following errors were captured from the AWX control plane EE logs for the worker pods that are not deleted:
Error deleting pod automation-job-462026-6zf7c: client rate limiter Wait returned an error: context canceled
Context was canceled while reading logs for pod awx-workers/automation-job-462026-6zf7c. Assuming pod has finished
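As a stopgap, stuck pods like the one above can be identified mechanically before deleting them with kubectl. A minimal sketch, assuming `kubectl get pods`-style columns; the sample listing below is illustrative, modeled on the status shown above, not pulled from a live cluster:

```python
# Sketch: filter `kubectl get pods` style output for stuck automation-job pods.
# SAMPLE mirrors the pod status quoted above and is illustrative only.
SAMPLE = """\
NAME                          READY   STATUS     RESTARTS   AGE
automation-job-462026-6zf7c   1/2     NotReady   0          3m23s
awx-web-c8bc64f45-h7xwt       3/3     Running    0          5d
"""

def stuck_job_pods(listing: str) -> list[str]:
    """Return names of automation-job pods reported NotReady."""
    stuck = []
    for line in listing.splitlines()[1:]:   # skip the header row
        fields = line.split()
        if len(fields) < 3:                 # tolerate blank lines
            continue
        name, status = fields[0], fields[2]
        if name.startswith("automation-job-") and status == "NotReady":
            stuck.append(name)
    return stuck

print(stuck_job_pods(SAMPLE))   # ['automation-job-462026-6zf7c']
```

The resulting names could then be fed to `kubectl delete pod` as the manual cleanup the report describes.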

The pod status description shows (confidential data omitted):
Containers:
worker:
State: Terminated
Reason: Completed
Exit Code: 0
Ready: False
Restart Count: 0
authenticator:
State: Running
Ready: True
Restart Count: 0

The automation-job-462026-6zf7c pod contains two containers: worker and authenticator.

When a pod is stuck, we can see that the worker container has terminated while the authenticator container keeps running. Logs from the two containers are attached:
worker-container.txt
authenticator-container.txt
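The stuck condition described above (worker Terminated/Completed, authenticator still Running) can be expressed as a simple predicate. A minimal sketch over plain dicts standing in for parsed pod status; the field names loosely follow `kubectl describe pod` output and are assumptions, not the Kubernetes API schema:

```python
# Sketch of the stuck-pod condition reported above: the worker container has
# terminated successfully while the authenticator sidecar keeps running.
def is_sidecar_stuck(containers: dict[str, dict]) -> bool:
    """True if worker exited cleanly but the authenticator sidecar lives on."""
    worker = containers.get("worker", {})
    sidecar = containers.get("authenticator", {})
    return (
        worker.get("state") == "Terminated"
        and worker.get("exit_code") == 0
        and sidecar.get("state") == "Running"
    )

# Illustrative status matching the `kubectl describe` excerpt in this report.
pod = {
    "worker": {"state": "Terminated", "reason": "Completed", "exit_code": 0},
    "authenticator": {"state": "Running"},
}
print(is_sidecar_stuck(pod))  # True
```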

For now we are testing this in a non-production environment, but it is currently a blocker for upgrading production. Please take a look and provide a fix, or suggest the best AWX version if this is a known issue.

AWX version

23.9.0

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

Run many AWX jobs on pods containing the worker and authenticator containers (we observed this mainly with inventory sync jobs).

Expected results

AWX deletes all the pods that finished running.

Actual results

AWX worker pods get stuck and are never deleted.

Additional information

No response

chronicc commented Jun 5, 2024

I observe the same issues on Kubernetes 1.27 with AWX 23.0.0.

The pods that are not deleted are those whose AWX jobs were deleted immediately after the pod failed. It looks like AWX only knows about existing pods through the jobs inside AWX.

If that is the case, the pod should be actively removed from Kubernetes when the job is deleted, OR the API output for the job should indicate whether the pod has already been deleted in Kubernetes.

TheRealHaoLiu (Member) commented

Can you give us the output of /api/v2/jobs/462026?

chinna44 (Author) commented Jun 7, 2024

@TheRealHaoLiu below is the output.
I want to highlight again that pods are left behind only for a few inventory sync jobs, all of which completed successfully.

ansible-inventory [core 2.15.5]
config file = /ansible.cfg
configured module search path = ['/cyberark-ansible-modules/lib/ansible/modules', '/runner/project']
ansible python module location = /usr/local/lib/python3.9/site-packages/ansible
ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections:/usr/share/automation-controller/collections
executable location = /usr/local/bin/ansible-inventory
python version = 3.9.18 (main, Jan 24 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] (/usr/bin/python3)
jinja version = 3.0.0
libyaml = False
Using /ansible.cfg as config file
[DEPRECATION WARNING]: DEFAULT_GATHER_TIMEOUT option, the module_defaults
keyword is a more generic version and can apply to all calls to the
M(ansible.builtin.gather_facts) or M(ansible.builtin.setup) actions, use
module_defaults instead. This feature will be removed from ansible-core in
version 2.18. Deprecation warnings can be disabled by setting
deprecation_warnings=False in ansible.cfg.
redirecting (type: inventory) ansible.builtin.aws_ec2 to amazon.aws.aws_ec2
Using inventory plugin 'ansible_collections.amazon.aws.plugins.inventory.aws_ec2' to process inventory source '/runner/inventory/aws_ec2.yml'
Parsed /runner/inventory/aws_ec2.yml inventory source with auto plugin
8.867 INFO Processing JSON output...
8.868 INFO Loaded 1 groups, 0 hosts
8.898 INFO Inventory import completed for AWS-sandbox-Windows in 0.0s

TheRealHaoLiu (Member) commented
@chinna44 that does not look like the output from the API endpoint... that looks like the stdout of the job

chinna44 (Author) commented
@TheRealHaoLiu yes.. you are correct, I'm sorry for that

Below is the output for the endpoint /api/v2/jobs/462026, but I cannot see job details for this or any other inventory sync job of this kind. Please let me know if you need the details some other way.

HTTP 404 Not Found
Allow: GET, DELETE, HEAD, OPTIONS
Content-Type: application/json
Vary: Accept
X-API-Node: awx-web-c8bc64f45-h7xwt
X-API-Product-Name: AWX
X-API-Product-Version: 23.9.0
X-API-Time: 0.057s

{
"detail": "Not found."
}

chinna44 (Author) commented
@TheRealHaoLiu please let me know if you need any other details

BartOpitz commented Sep 17, 2024

Hi. We face the same issue: job pods are sometimes not removed in k8s. Mostly these are job pods that end with an error rather than OK, but some successful pods hang as well.

We first observed this type of problem after updating AWX from 24.5.0 to 24.6.1 and upgrading k8s to 1.30.3, both of which sadly took place on the same day. Before the upgrade we did not observe these problems, and we were already running at least k8s 1.28 (cannot confirm the precise version currently).

Update: OK, I take everything above back. It turned out that during troubleshooting one of our admins had set

RECEPTOR_RELEASE_WORK = False        # Default True
RECEPTOR_KEEP_WORK_ON_ERROR = True   # Default False

This explains the behaviour we had. After reverting those back to the defaults, all leftover pods were removed immediately by AWX and no new pods are left behind. So everything works as expected, at least with the versions mentioned.
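A minimal sketch of the sanity check BartOpitz's update suggests: before suspecting a pod-cleanup bug, confirm the two receptor settings are still at their defaults. The settings dict below is an illustrative payload; fetching it from AWX's settings API is omitted here and the exact endpoint is an assumption:

```python
# Defaults as stated in the comment above for the two receptor cleanup settings.
EXPECTED_DEFAULTS = {
    "RECEPTOR_RELEASE_WORK": True,        # release receptor work units (and pods)
    "RECEPTOR_KEEP_WORK_ON_ERROR": False, # do not keep work units after errors
}

def non_default_receptor_settings(settings: dict) -> dict:
    """Return any receptor cleanup settings that differ from their defaults."""
    return {
        key: settings[key]
        for key, default in EXPECTED_DEFAULTS.items()
        if settings.get(key, default) != default
    }

# Illustrative payload matching the misconfiguration described above.
misconfigured = {
    "RECEPTOR_RELEASE_WORK": False,
    "RECEPTOR_KEEP_WORK_ON_ERROR": True,
}
print(non_default_receptor_settings(misconfigured))
```

An empty result means the settings are at their defaults and the lingering pods likely have another cause.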
