Automation Jobs Killed #12549

Closed
3 tasks done
antwacky opened this issue Jul 4, 2022 · 7 comments

antwacky commented Jul 4, 2022

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that the AWX Operator is open source software provided for free and that I might not receive a timely response.

Bug Summary

After I migrated the AWX Operator to a new cluster, jobs fail at random with the following error:

Error opening pod stream: Get "https://awx.k3s.net:10250/containerLogs/awx/automation-job-12053-blqxd/worker?follow=true": EOF

I can't see anything beyond this error. The cluster has plenty of CPU and memory headroom, and I can't see any OOM events or similar.

AWX Operator version

0.21.0

AWX version

21.0.0

Kubernetes platform

kubernetes

Kubernetes/Platform version

v1.23.7+k3s1

Modifications

yes

Steps to reproduce

---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  service_type: ClusterIP
  ingress_type: ingress
  hostname: awx.k3s.net
  ingress_annotations: |
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    cert-manager.io/cluster-issuer: cert-manager-acme-issuer
    traefik.ingress.kubernetes.io/router.middlewares: auth-sysadmin@kubernetescrd
  ingress_tls_secret: awx.k3s.net
  postgres_storage_class: longhorn

Expected results

I expect the jobs to complete successfully.

Actual results

The pods sometimes fail with the error:

Error opening pod stream: Get "https://awx.k3s.net:10250/containerLogs/awx/automation-job-12053-blqxd/worker?follow=true": EOF
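
For anyone triaging the same message: the URL in the error is a request to the kubelet's container log endpoint (10250 is the kubelet's default port), so the EOF occurs while the log stream is being opened from the node, not inside the playbook. Below is a minimal Python sketch, assuming the message always has the shape shown above, that splits it into node host, namespace, pod, and container so the affected pod can be looked up. The function name `parse_pod_stream_error` is hypothetical and is not part of AWX or Kubernetes.

```python
import re
from urllib.parse import urlsplit, parse_qs

# The error line exactly as reported in this issue.
ERROR = (
    'Error opening pod stream: Get "https://awx.k3s.net:10250/containerLogs/'
    'awx/automation-job-12053-blqxd/worker?follow=true": EOF'
)


def parse_pod_stream_error(message):
    """Split the kubelet log URL embedded in the error into its parts.

    Hypothetical helper for illustration only; assumes the path layout
    /containerLogs/<namespace>/<pod>/<container>.
    """
    match = re.search(r'Get "([^"]+)"', message)
    if match is None:
        raise ValueError("no quoted URL found in message")

    url = urlsplit(match.group(1))
    parts = url.path.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "containerLogs":
        raise ValueError("unexpected path: " + url.path)

    _, namespace, pod, container = parts
    return {
        "host": url.hostname,      # node address the request was sent to
        "port": url.port,          # 10250 is the kubelet default
        "namespace": namespace,
        "pod": pod,
        "container": container,
        "follow": parse_qs(url.query).get("follow", [""])[0] == "true",
    }


if __name__ == "__main__":
    for key, value in parse_pod_stream_error(ERROR).items():
        print(f"{key}: {value}")
```

The extracted namespace and pod name can then be used to inspect the pod's events and the kubelet logs on that node around the time of the failure.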

Additional information

No response

Operator Logs

No response

@akus062381
Member

Hello @pagey101, could you please ask this on our mailing list? See https://github.com/ansible/awx/#get-involved for ways to connect with us.

@antwacky
Author

antwacky commented Jul 7, 2022

Hello, I have posted this to the group.

A similar issue happens in which the job status is failed and the job output is "No output found for this job".

Apologies if I am misunderstanding, but isn't it a bug that the jobs fail without giving a specific reason?

Thanks

@d-rupp

d-rupp commented Jul 14, 2022

Could this be related to #11805 ?

@antwacky
Author

Unfortunately not, I have seen that issue.

This is not a long-running playbook, and the job sometimes fails before the playbook even runs (during inventory updates).

Sometimes the inventory updates fail; sometimes they complete but the playbook fails. This is due to the pods being killed in quick succession.

As mentioned, there is no CPU or memory pressure when this happens. I have had a response from the AWX team on the mailing list, so I will continue the discussion there.

@jbradberry jbradberry transferred this issue from ansible/awx-operator Jul 20, 2022
@adpavlov

Could you please share a link to the group?

@antwacky
Author

Group

@akus062381
Member

Closing this in favor of the mailing list discussion.
