AWX Not starting up properly after Server Reboot #393

Closed
WayneVDM09 opened this issue Sep 25, 2024 · 19 comments
Labels
question Further information is requested

Comments

@WayneVDM09

Environment

  • OS: Rocky Linux 9.4
  • Kubernetes/K3s: 5.0.4
  • AWX Operator: 2.19.0

Description

The AWX Operator pod crashes after rebooting the Rocky Linux server. Once the server reboots, the awx-operator-controller-manager pod is in a CreateContainerError status. To temporarily fix the issue (only until the next reboot), I 'reboot' all the pods by deleting them; restarting only the awx-operator-controller-manager pod does not always work. The Rocky Linux server is a virtual machine running on an oVirt instance.

Steps to Reproduce

  1. Reboot the Rocky Linux 9.4 server
  2. kubectl get pods -n awx

Logs

$ kubectl get pods -n awx
NAME                                               READY   STATUS                 RESTARTS            AGE
awx-web-ffc6d8bbc-plswc                            3/3     Running                12 (59m ago)        6d17h
awx-postgres-15-0                                  1/1     Running                4 (59m ago)         6d17h
awx-task-57ff45ddbf-jnhsq                          4/4     Running                16 (59m ago)        6d17h
awx-operator-controller-manager-7b87fbf9f6-mlfk8   1/2     CreateContainerError   6 (<invalid> ago)   61m

Below is only a recap of the operator logs after a server reboot:
$ kubectl -n awx logs -f awx-operator-controller-manager-7b87fbf9f6-mlfk8
PLAY RECAP *********************************************************************
 \"ansible-operator\", \"operation\": \"Update\", \"subresource\": \"status\", \"time\": \"2024-09-25T09:18:54Z\"}], \"name\": \"awx\", \"namespace\": \"awx\", \"resourceVersion\": \"5927387\", \"uid\": \"fa5b04f5-820d-4575-9a52-220b86112a3a\"}, \"spec\": {\"admin_password_secret\": \"awx-admin-password\", \"admin_user\": \"admin\", \"auto_upgrade\": true, \"bundle_cacert_secret\": \"awx-custom-certs\", \"create_preload_data\": true, \"ee_resource_requirements\": {}, \"extra_settings\": [{\"setting\": \"TOWER_URL_BASE\", \"value\": \"\\\"https://blue-awx.platsoft.lab\\\"\"}], \"garbage_collect_secrets\": false, \"image_pull_policy\": \"IfNotPresent\", \"ingress_hosts\": [{\"hostname\": \"blue-awx.platsoft.lab\", \"tls_secret\": \"awx-secret-tls\"}], \"ingress_type\": \"ingress\", \"init_container_resource_requirements\": {}, \"ipv6_disabled\": false, \"loadbalancer_class\": \"\", \"loadbalancer_ip\": \"\", \"loadbalancer_port\": 80, \"loadbalancer_protocol\": \"http\", \"metrics_utility_console_enabled\": false, \"metrics_utility_cronjob_gather_schedule\": \"@hourly\", \"metrics_utility_cronjob_report_schedule\": \"@monthly\", \"metrics_utility_enabled\": false, \"metrics_utility_pvc_claim_size\": \"5Gi\", \"no_log\": true, \"postgres_configuration_secret\": \"awx-postgres-configuration\", \"postgres_data_volume_init\": true, \"postgres_keepalives\": true, \"postgres_keepalives_count\": 5, \"postgres_keepalives_idle\": 5, \"postgres_keepalives_interval\": 5, \"postgres_resource_requirements\": {}, \"postgres_storage_class\": \"awx-postgres-volume\", \"postgres_storage_requirements\": {\"requests\": {\"storage\": \"8Gi\"}}, \"projects_existing_claim\": \"awx-projects-claim\", \"projects_persistence\": true, \"projects_storage_access_mode\": \"ReadWriteMany\", \"projects_storage_size\": \"8Gi\", \"redis_resource_requirements\": {}, \"replicas\": 1, \"route_tls_termination_mechanism\": \"Edge\", \"rsyslog_resource_requirements\": {}, \"set_self_labels\": true, \"task_liveness_failure_threshold\": 3, \"task_liveness_initial_delay\": 5, \"task_liveness_period\": 0, \"task_liveness_timeout\": 1, \"task_manage_replicas\": true, \"task_privileged\": false, \"task_readiness_failure_threshold\": 3, \"task_readiness_initial_delay\": 20, \"task_readiness_period\": 0, \"task_readiness_timeout\": 1, \"task_replicas\": 1, \"task_resource_requirements\": {}, \"web_liveness_failure_threshold\": 3, \"web_liveness_initial_delay\": 5, \"web_liveness_period\": 0, \"web_liveness_timeout\": 1, \"web_manage_replicas\": true, \"web_readiness_failure_threshold\": 3, \"web_readiness_initial_delay\": 20, \"web_readiness_period\": 0, \"web_readiness_timeout\": 1, \"web_replicas\": 1, \"web_resource_requirements\": {}}, \"status\": {\"adminPasswordSecret\": \"awx-admin-password\", \"adminUser\": \"admin\", \"broadcastWebsocketSecret\": \"awx-broadcast-websocket\", \"conditions\": [{\"lastTransitionTime\": \"2024-07-02T20:12:59Z\", \"reason\": \"Failed\", \"status\": \"True\", \"type\": \"Failure\"}, {\"lastTransitionTime\": \"2024-07-02T19:46:21Z\", \"reason\": \"\", \"status\": \"False\", \"type\": \"Successful\"}, {\"lastTransitionTime\": \"2024-09-25T09:18:54Z\", \"reason\": \"Running\", \"status\": \"True\", \"type\": \"Running\"}], \"image\": \"quay.io/ansible/awx:24.6.0\", \"postgresConfigurationSecret\": \"awx-postgres-configuration\", \"secretKeySecret\": \"awx-secret-key\", \"version\": \"24.6.0\"}}}\n\r\nTASK [installer : Look up details for this deployment] *************************\r\ntask path: 
/opt/ansible/roles/installer/tasks/install.yml:26\nskipping: [localhost] => {\"changed\": false, \"false_condition\": \"additional_labels | length\", \"skip_reason\": \"Conditional result was False\"}\n\r\nTASK [installer : Select resource labels which are in `additional_labels`] *****\r\ntask path: /opt/ansible/roles/installer/tasks/install.yml:34\nskipping: [localhost] => {\"changed\": false, \"false_condition\": \"additional_labels | length\", \"skip_reason\": \"Conditional result was False\"}\n\r\nTASK [installer : Include secret key configuration tasks] **********************\r\ntask path: /opt/ansible/roles/installer/tasks/install.yml:44\nincluded: /opt/ansible/roles/installer/tasks/secret_key_configuration.yml for localhost\n\r\nTASK [installer : Check for specified secret key configuration] ****************\r\ntask path: /opt/ansible/roles/installer/tasks/secret_key_configuration.yml:2\nskipping: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Check for default secret key configuration] ******************\r\ntask path: /opt/ansible/roles/installer/tasks/secret_key_configuration.yml:11\nok: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Set secret key secret] ***************************************\r\ntask path: /opt/ansible/roles/installer/tasks/secret_key_configuration.yml:19\nok: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Create secret key secret] ************************************\r\ntask path: /opt/ansible/roles/installer/tasks/secret_key_configuration.yml:25\nskipping: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Read secret key secret] **************************************\r\ntask path: /opt/ansible/roles/installer/tasks/secret_key_configuration.yml:31\nskipping: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Set secret key secret] ***************************************\r\ntask path: /opt/ansible/roles/installer/tasks/secret_key_configuration.yml:41\nok: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Store secret key secret name] ********************************\r\ntask path: /opt/ansible/roles/installer/tasks/secret_key_configuration.yml:46\nok: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Load LDAP CAcert certificate] ********************************\r\ntask path: /opt/ansible/roles/installer/tasks/install.yml:47\nskipping: [localhost] => {\"changed\": false, \"false_condition\": \"ldap_cacert_secret != ''\", \"skip_reason\": \"Conditional result was False\"}\n\r\nTASK [installer : Load ldap bind password] *************************************\r\ntask path: /opt/ansible/roles/installer/tasks/install.yml:52\nskipping: [localhost] => {\"changed\": false, \"false_condition\": \"ldap_password_secret != ''\", 
\"skip_reason\": \"Conditional result was False\"}\n\r\nTASK [installer : Load bundle certificate authority certificate] ***************\r\ntask path: /opt/ansible/roles/installer/tasks/install.yml:57\nincluded: /opt/ansible/roles/installer/tasks/load_bundle_cacert_secret.yml for localhost\n\r\nTASK [installer : Retrieve bundle Certificate Authority Secret] ****************\r\ntask path: /opt/ansible/roles/installer/tasks/load_bundle_cacert_secret.yml:2\nok: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Load bundle Certificate Authority Secret content] ************\r\ntask path: /opt/ansible/roles/installer/tasks/load_bundle_cacert_secret.yml:10\nok: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Include admin password configuration tasks] ******************\r\ntask path: /opt/ansible/roles/installer/tasks/install.yml:62\nincluded: /opt/ansible/roles/installer/tasks/admin_password_configuration.yml for localhost\n\r\nTASK [installer : Check for specified admin password configuration] ************\r\ntask path: /opt/ansible/roles/installer/tasks/admin_password_configuration.yml:2\nok: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Check for default admin password configuration] **************\r\ntask path: /opt/ansible/roles/installer/tasks/admin_password_configuration.yml:11\nok: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Set admin password secret] ***********************************\r\ntask path: /opt/ansible/roles/installer/tasks/admin_password_configuration.yml:19\nok: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Create admin password secret] ********************************\r\ntask path: /opt/ansible/roles/installer/tasks/admin_password_configuration.yml:25\nskipping: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Read admin password secret] **********************************\r\ntask path: /opt/ansible/roles/installer/tasks/admin_password_configuration.yml:31\nskipping: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Set admin password secret] ***********************************\r\ntask path: /opt/ansible/roles/installer/tasks/admin_password_configuration.yml:41\nok: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Store admin password] ****************************************\r\ntask path: /opt/ansible/roles/installer/tasks/admin_password_configuration.yml:46\nok: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Include broadcast websocket configuration tasks] *************\r\ntask path: 
/opt/ansible/roles/installer/tasks/install.yml:65\nincluded: /opt/ansible/roles/installer/tasks/broadcast_websocket_configuration.yml for localhost\n\r\nTASK [installer : Check for specified broadcast websocket secret configuration] ***\r\ntask path: /opt/ansible/roles/installer/tasks/broadcast_websocket_configuration.yml:2\nskipping: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Check for default broadcast websocket secret configuration] ***\r\ntask path: /opt/ansible/roles/installer/tasks/broadcast_websocket_configuration.yml:11\nok: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Set broadcast websocket secret] ******************************\r\ntask path: /opt/ansible/roles/installer/tasks/broadcast_websocket_configuration.yml:19\nok: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Create broadcast websocket secret] ***************************\r\ntask path: /opt/ansible/roles/installer/tasks/broadcast_websocket_configuration.yml:26\nskipping: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Read broadcast websocket secret] *****************************\r\ntask path: /opt/ansible/roles/installer/tasks/broadcast_websocket_configuration.yml:32\nskipping: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Set broadcast websocket secret] ******************************\r\ntask path: /opt/ansible/roles/installer/tasks/broadcast_websocket_configuration.yml:42\nok: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Store broadcast websocket secret name] ***********************\r\ntask path: /opt/ansible/roles/installer/tasks/broadcast_websocket_configuration.yml:48\nok: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Include set_images tasks] ************************************\r\ntask path: /opt/ansible/roles/installer/tasks/install.yml:68\nincluded: /opt/ansible/roles/installer/tasks/set_images.yml for localhost\n\r\nTASK [installer : Set default awx init container image] ************************\r\ntask path: /opt/ansible/roles/installer/tasks/set_images.yml:3\nok: [localhost] => {\"ansible_facts\": {\"_default_init_container_image\": \"quay.io/ansible/awx-ee:24.6.0\"}, \"changed\": false}\n\r\nTASK [installer : Set user provided awx init image] ****************************\r\ntask path: /opt/ansible/roles/installer/tasks/set_images.yml:7\nskipping: [localhost] => {\"changed\": false, \"false_condition\": \"init_container_image | default('_undefined',true) != '_undefined'\", \"skip_reason\": \"Conditional result was False\"}\n\r\nTASK [installer : Set Init image URL] ******************************************\r\ntask path: /opt/ansible/roles/installer/tasks/set_images.yml:14\nok: [localhost] => {\"ansible_facts\": {\"_init_container_image\": 
\"quay.io/ansible/awx-ee:24.6.0\"}, \"changed\": false}\n\r\nTASK [installer : Set default awx init projects container image] ***************\r\ntask path: /opt/ansible/roles/installer/tasks/set_images.yml:21\nok: [localhost] => {\"ansible_facts\": {\"_default_init_projects_container_image\": \"quay.io/centos/centos:stream9\"}, \"changed\": false}\n\r\nTASK [installer : Set user provided awx init projects image] *******************\r\ntask path: /opt/ansible/roles/installer/tasks/set_images.yml:25\nskipping: [localhost] => {\"changed\": false, \"false_condition\": \"init_projects_container_image | default([]) | length\", \"skip_reason\": \"Conditional result was False\"}\n\r\nTASK [installer : Set Init projects image URL] *********************************\r\ntask path: /opt/ansible/roles/installer/tasks/set_images.yml:31\nok: [localhost] => {\"ansible_facts\": {\"_init_projects_container_image\": \"quay.io/centos/centos:stream9\"}, \"changed\": false}\n\r\nTASK [installer : Include database configuration tasks] ************************\r\ntask path: /opt/ansible/roles/installer/tasks/install.yml:71\nstatically imported: /opt/ansible/roles/installer/tasks/migrate_data.yml\nincluded: /opt/ansible/roles/installer/tasks/database_configuration.yml for localhost\n\r\nTASK [installer : Check for specified PostgreSQL configuration] ****************\r\ntask path: /opt/ansible/roles/installer/tasks/database_configuration.yml:2\nok: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Check for default PostgreSQL configuration] ******************\r\ntask path: /opt/ansible/roles/installer/tasks/database_configuration.yml:11\nok: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Check for specified old PostgreSQL configuration secret] *****\r\ntask path: /opt/ansible/roles/installer/tasks/database_configuration.yml:19\nskipping: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Check for default old PostgreSQL configuration] **************\r\ntask path: /opt/ansible/roles/installer/tasks/database_configuration.yml:28\nok: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Set old PostgreSQL configuration] ****************************\r\ntask path: /opt/ansible/roles/installer/tasks/database_configuration.yml:36\nok: [localhost] => {\"ansible_facts\": {\"old_pg_config\": {\"api_found\": true, \"changed\": false, \"failed\": false, \"resources\": []}}, \"changed\": false}\n\r\nTASK [installer : Set proper database name when migrating from old deployment] ***\r\ntask path: /opt/ansible/roles/installer/tasks/database_configuration.yml:41\nskipping: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Set default postgres image] **********************************\r\ntask path: /opt/ansible/roles/installer/tasks/database_configuration.yml:50\nok: [localhost] => {\"ansible_facts\": {\"_default_postgres_image\": \"quay.io/sclorg/postgresql-15-c9s:latest\"}, \"changed\": false}\n\r\nTASK [installer : Set PostgreSQL 
configuration] ********************************\r\ntask path: /opt/ansible/roles/installer/tasks/database_configuration.yml:54\nok: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Set user provided postgres image] ****************************\r\ntask path: /opt/ansible/roles/installer/tasks/database_configuration.yml:59\nskipping: [localhost] => {\"changed\": false, \"false_condition\": \"postgres_image | default([]) | length\", \"skip_reason\": \"Conditional result was False\"}\n\r\nTASK [installer : Set Postgres image URL] **************************************\r\ntask path: /opt/ansible/roles/installer/tasks/database_configuration.yml:66\nok: [localhost] => {\"ansible_facts\": {\"_postgres_image\": \"quay.io/sclorg/postgresql-15-c9s:latest\"}, \"changed\": false}\n\r\nTASK [installer : Create Database configuration] *******************************\r\ntask path: /opt/ansible/roles/installer/tasks/database_configuration.yml:71\nskipping: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Read Database Configuration] *********************************\r\ntask path: /opt/ansible/roles/installer/tasks/database_configuration.yml:77\nskipping: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Set PostgreSQL Configuration] ********************************\r\ntask path: /opt/ansible/roles/installer/tasks/database_configuration.yml:86\nok: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Set actual postgres configuration secret used] ***************\r\ntask path: /opt/ansible/roles/installer/tasks/database_configuration.yml:91\nok: [localhost] => {\"ansible_facts\": {\"__postgres_configuration_secret\": \"awx-postgres-configuration\"}, \"changed\": false}\n\r\nTASK [installer : Store Database Configuration] ********************************\r\ntask path: /opt/ansible/roles/installer/tasks/database_configuration.yml:95\nok: [localhost] => {\"censored\": \"the output has been hidden due to the fact that 'no_log: true' was specified for this result\", \"changed\": false}\n\r\nTASK [installer : Set database as managed] *************************************\r\ntask path: /opt/ansible/roles/installer/tasks/database_configuration.yml:106\nok: [localhost] => {\"ansible_facts\": {\"managed_database\": true}, \"changed\": false}\n\r\nTASK [installer : Get the old postgres pod (N-1)] ******************************\r\ntask path: /opt/ansible/roles/installer/tasks/database_configuration.yml:112\nAn exception occurred during task execution. To see the full traceback, use -vvv. The error was: ' raised while trying to get resource using (name=None, namespace=awx, label_selectors=[], field_selectors=['status.phase=Running'])\r\nfatal: [localhost]: FAILED! 
=> {\"changed\": false, \"msg\": \"Exception '401\\nReason: Unauthorized\\nHTTP response headers: HTTPHeaderDict({'Audit-Id': '50ef502e-c4c6-4c2d-8763-9fb6f1d1ce5a', 'Cache-Control': 'no-cache, private', 'Content-Length': '129', 'Content-Type': 'application/json', 'Date': 'Wed, 25 Sep 2024 07:19:08 GMT'})\\nHTTP response body: b'{\\\"kind\\\":\\\"Status\\\",\\\"apiVersion\\\":\\\"v1\\\",\\\"metadata\\\":{},\\\"status\\\":\\\"Failure\\\",\\\"message\\\":\\\"Unauthorized\\\",\\\"reason\\\":\\\"Unauthorized\\\",\\\"code\\\":401}\\\\n'\\nOriginal traceback: \\n  File \\\"/usr/local/lib/python3.9/site-packages/kubernetes/dynamic/client.py\\\", line 55, in inner\\n    resp = func(self, *args, **kwargs)\\n\\n  File \\\"/usr/local/lib/python3.9/site-packages/kubernetes/dynamic/client.py\\\", line 270, in request\\n    api_response = self.client.call_api(\\n\\n  File \\\"/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py\\\", line 348, in call_api\\n    return self.__call_api(resource_path, method,\\n\\n  File \\\"/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py\\\", line 180, in __call_api\\n    response_data = self.request(\\n\\n  File \\\"/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py\\\", line 373, in request\\n    return self.rest_client.GET(url,\\n\\n  File \\\"/usr/local/lib/python3.9/site-packages/kubernetes/client/rest.py\\\", line 244, in GET\\n    return self.request(\\\"GET\\\", url,\\n\\n  File \\\"/usr/local/lib/python3.9/site-packages/kubernetes/client/rest.py\\\", line 238, in request\\n    raise ApiException(http_resp=r)\\n' raised while trying to get resource using (name=None, namespace=awx, label_selectors=[], field_selectors=['status.phase=Running'])\"}\n\r\nPLAY RECAP *********************************************************************\r\nlocalhost                  : ok=46   changed=0    unreachable=0    failed=1    skipped=20   rescued=0    ignored=0   \n","job":"5687128873702928656","name":"awx","namespace":"awx","error":"exit status 2","stacktrace":"github.com/operator-framework/ansible-operator-plugins/internal/ansible/runner.(*runner).Run.func1\n\tansible-operator-plugins/internal/ansible/runner/runner.go:269"}
localhost                  : ok=46   changed=0    unreachable=0    failed=1    skipped=20   rescued=0    ignored=0   

----------
{"level":"error","ts":"2024-09-25T07:19:08Z","msg":"Reconciler error","controller":"awx-controller","object":{"name":"awx","namespace":"awx"},"namespace":"awx","name":"awx","reconcileID":"13685a1b-0cd9-4042-8207-c87af7befba5","error":"Unauthorized","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-09-25T07:19:08Z","msg":"Reconciler error","controller":"awx-controller","object":{"name":"awx","namespace":"awx"},"namespace":"awx","name":"awx","reconcileID":"d10f9fbe-2eb1-4d3a-a7a7-8385d87a84d0","error":"Unauthorized","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-09-25T07:19:08Z","msg":"Reconciler error","controller":"awx-controller","object":{"name":"awx","namespace":"awx"},"namespace":"awx","name":"awx","reconcileID":"e84fe192-44cd-4c8b-b851-0bccefa00b39","error":"Unauthorized","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-09-25T07:19:08Z","msg":"Reconciler error","controller":"awx-controller","object":{"name":"awx","namespace":"awx"},"namespace":"awx","name":"awx","reconcileID":"dece1cd8-b073-4ea4-aa00-ba8152b62707","error":"Unauthorized","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-09-25T07:19:08Z","msg":"Reconciler error","controller":"awx-controller","object":{"name":"awx","namespace":"awx"},"namespace":"awx","name":"awx","reconcileID":"f0559322-bf44-401d-b740-45d00bcb2d11","error":"Unauthorized","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-09-25T07:19:09Z","msg":"Reconciler error","controller":"awx-controller","object":{"name":"awx","namespace":"awx"},"namespace":"awx","name":"awx","reconcileID":"e339ebd0-949f-4140-900e-2a5bdde7a49d","error":"Unauthorized","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-09-25T07:19:09Z","msg":"Reconciler error","controller":"awx-controller","object":{"name":"awx","namespace":"awx"},"namespace":"awx","name":"awx","reconcileID":"969ea078-e7b6-4e5e-aa85-3d6df2dc8c93","error":"Unauthorized","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-09-25T07:19:10Z","msg":"Reconciler error","controller":"awx-controller","object":{"name":"awx","namespace":"awx"},"namespace":"awx","name":"awx","reconcileID":"51c3e535-41eb-43d7-9c4c-23715dc6b8ae","error":"Unauthorized","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
E0925 07:19:10.134245       7 leaderelection.go:332] error retrieving resource lock awx/awx-operator: Unauthorized
{"level":"error","ts":"2024-09-25T07:19:11Z","msg":"Reconciler error","controller":"awx-controller","object":{"name":"awx","namespace":"awx"},"namespace":"awx","name":"awx","reconcileID":"fc7f1cd9-293e-4f7b-9850-7afa67b7a2fc","error":"Unauthorized","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
E0925 07:19:12.133995       7 leaderelection.go:332] error retrieving resource lock awx/awx-operator: Unauthorized
{"level":"error","ts":"2024-09-25T07:19:13Z","msg":"Reconciler error","controller":"awx-controller","object":{"name":"awx","namespace":"awx"},"namespace":"awx","name":"awx","reconcileID":"e57788d9-6590-4c37-8960-86e2dafbaa14","error":"Unauthorized","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
E0925 07:19:14.134049       7 leaderelection.go:332] error retrieving resource lock awx/awx-operator: Unauthorized
E0925 07:19:16.134018       7 leaderelection.go:332] error retrieving resource lock awx/awx-operator: Unauthorized
I0925 07:19:18.128886       7 leaderelection.go:285] failed to renew lease awx/awx-operator: timed out waiting for the condition
{"level":"info","ts":"2024-09-25T07:19:18Z","msg":"Stopping and waiting for non leader election runnables"}
{"level":"info","ts":"2024-09-25T07:19:18Z","msg":"Stopping and waiting for leader election runnables"}
{"level":"info","ts":"2024-09-25T07:19:18Z","msg":"Stopping and waiting for caches"}
{"level":"info","ts":"2024-09-25T07:19:18Z","msg":"Shutdown signal received, waiting for all workers to finish","controller":"awxmeshingress-controller"}
{"level":"error","ts":"2024-09-25T07:19:18Z","logger":"cmd","msg":"Proxy or operator exited with error.","error":"leader election lost","stacktrace":"github.com/operator-framework/ansible-operator-plugins/internal/cmd/ansible-operator/run.run\n\tansible-operator-plugins/internal/cmd/ansible-operator/run/cmd.go:261\ngithub.com/operator-framework/ansible-operator-plugins/internal/cmd/ansible-operator/run.NewCmd.func1\n\tansible-operator-plugins/internal/cmd/ansible-operator/run/cmd.go:81\ngithub.com/spf13/cobra.(*Command).execute\n\t/home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:987\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\t/home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:1039\nmain.main\n\tansible-operator-plugins/cmd/ansible-operator/main.go:40\nruntime.main\n\t/opt/hostedtoolcache/go/1.20.12/x64/src/runtime/proc.go:250"}
{"level":"info","ts":"2024-09-25T07:19:18Z","msg":"Stopping and waiting for webhooks"}
{"level":"info","ts":"2024-09-25T07:19:18Z","msg":"Stopping and waiting for HTTP servers"}

Files

---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  # These parameters are designed for use with:
  # - AWX Operator: 2.19.0
  #   https://github.com/ansible/awx-operator/blob/2.19.0/README.md

  admin_user: admin
  admin_password_secret: awx-admin-password

  ingress_type: ingress
  ingress_hosts:
    - hostname: blue-awx.platsoft.lab
      tls_secret: awx-secret-tls

  postgres_configuration_secret: awx-postgres-configuration

  postgres_data_volume_init: true
  postgres_storage_class: awx-postgres-volume
  postgres_storage_requirements:
    requests:
      storage: 8Gi

  projects_persistence: true
  projects_existing_claim: awx-projects-claim

  web_replicas: 1
  task_replicas: 1

  web_resource_requirements: {}
  task_resource_requirements: {}
  ee_resource_requirements: {}
  init_container_resource_requirements: {}
  postgres_resource_requirements: {}
  redis_resource_requirements: {}
  rsyslog_resource_requirements: {}

  # Uncomment to reveal "censored" logs
  #no_log: false

  #Custom Changes:
  bundle_cacert_secret: awx-custom-certs

  extra_settings:
    - setting: TOWER_URL_BASE
      value: '"https://awx.example.com"'
---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: awx

generatorOptions:
  disableNameSuffixHash: true

secretGenerator:
  - name: awx-secret-tls
    type: kubernetes.io/tls
    files:
      - tls.crt
      - tls.key

  - name: awx-postgres-configuration
    type: Opaque
    literals:
      - host=awx-postgres-15
      - port=5432
      - database=awx
      - username=awx
      - password=xxxxxxxxxxxxxxxxx
      - type=managed

  - name: awx-admin-password
    type: Opaque
    literals:
      - password=xxxxxxxxxxxxxxxx

  # If you want to specify SECRET_KEY for your AWX manually, uncomment following lines and change the value.
  # Refer AAC documentation for detail about SECRET_KEY.
  # https://docs.ansible.com/automation-controller/latest/html/administration/secret_handling.html
  #- name: awx-secret-key
  #  type: Opaque
  #  literals:
  #    - secret_key=MySuperSecureSecretKey123!

  - name: awx-custom-certs
    type: Opaque
    files:
      - bundle-ca.crt=example.crt

resources:
  - pv.yaml
  - pvc.yaml
  - awx.yaml
@WayneVDM09 WayneVDM09 added the question Further information is requested label Sep 25, 2024
@kurokobo
Owner

kurokobo commented Sep 25, 2024

@WayneVDM09
Hi, I've never encountered that issue when restarting my CentOS 9 host 🤔
Are there any suspicious logs from K3s when you check journalctl -u k3s?
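For example, to narrow things down, something like this could be run on the K3s host (a rough sketch; adjust the timestamp to your reboot):

# Warnings and errors from the K3s unit for the current boot
$ journalctl -u k3s -b 0 -p warning --no-pager
# Or everything from K3s around the time of the reboot
$ journalctl -u k3s --since "2024-09-25 07:00" --no-pager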

@Akshay-Hegde

Akshay-Hegde commented Sep 30, 2024

Same issue on the OSes below as well (k3s version v1.30.4+k3s1 (98262b5d)):
AlmaLinux release 9.4 (Seafoam Ocelot)
Ubuntu 24.04

Nothing suspicious in journalctl.

@WayneVDM09
Author

Hi.
Sorry for the delayed reply. I'm looking at the logs from when I opened this issue. I'm not entirely sure what I should be looking for, but I do see a decent number of errors. We ran into this problem on the 25th of September and I've attached the logs available for that day.

We are basically running a zero-trust network, and I'm not sure whether that might be causing issues like this. We allow the server to reach quay.io, cdn01.quay.io, cdn02.quay.io, and cdn03.quay.io so that it can pull the latest EE and run jobs with it.

Let me know if there is anything else I can do to help troubleshoot this issue.

k3s_2024-09-25.log

@kurokobo
Owner

@WayneVDM09
Thanks for the update.

  • What does the output of kubectl get all -A look like?
  • Are there any OOM Killer logs (keywords: Out of memory, Killed process) in /var/log/messages on the K3s host? (See the quick check below.)
  • How many vCPU cores and how much RAM does your VM have?
  • Does increasing the vCPU cores and RAM solve the issue?
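A quick way to look for OOM kills might be something like this (a sketch; it assumes kernel messages land in /var/log/messages, as on a default Rocky/RHEL setup):

$ sudo grep -iE "out of memory|killed process" /var/log/messages
$ sudo dmesg -T | grep -iE "out of memory|oom"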

@WayneVDM09
Author

WayneVDM09 commented Oct 3, 2024

Hi

No OOM messages from what I've seen.

We currently have 3 vCPU cores assigned to the VM that AWX is running on. Increasing or decreasing this does not make a difference.

We've needed to shut down and start the server back up again since the last reply. Here is the output of kubectl get pods -n awx:

Note that only the awx-operator pod has been restarted in the output below.

$ kubectl get pods -n awx
NAME                                               READY   STATUS    RESTARTS             AGE
awx-task-57ff45ddbf-tgvqc                          4/4     Running   4 (<invalid> ago)    8d
awx-web-ffc6d8bbc-kwx6x                            3/3     Running   31 (<invalid> ago)   8d
awx-postgres-15-0                                  1/1     Running   2 (<invalid> ago)    8d
awx-operator-controller-manager-7b87fbf9f6-c2mzx   2/2     Running   4 (3m54s ago)        50m
[awx@blue-awx ~]$ kubectl get all -A
NAMESPACE     NAME                                                   READY   STATUS      RESTARTS             AGE
kube-system   pod/helm-install-traefik-d4h9j                         0/1     Completed   1                    93d
kube-system   pod/helm-install-traefik-crd-c2lbb                     0/1     Completed   0                    93d
awx           pod/awx-task-57ff45ddbf-tgvqc                          4/4     Running     4 (<invalid> ago)    8d
kube-system   pod/coredns-6799fbcd5-x9jvn                            1/1     Running     25 (<invalid> ago)   93d
kube-system   pod/svclb-traefik-df5791ac-h7q2t                       2/2     Running     50 (<invalid> ago)   93d
kube-system   pod/local-path-provisioner-6c86858495-2s2pr            1/1     Running     24 (<invalid> ago)   93d
kube-system   pod/traefik-7d5f6474df-gg28m                           1/1     Running     25 (<invalid> ago)   93d
awx           pod/awx-web-ffc6d8bbc-kwx6x                            3/3     Running     31 (<invalid> ago)   8d
kube-system   pod/metrics-server-54fd9b65b-66fv2                     1/1     Running     26 (<invalid> ago)   93d
awx           pod/awx-postgres-15-0                                  1/1     Running     2 (<invalid> ago)    8d
awx           pod/awx-operator-controller-manager-7b87fbf9f6-c2mzx   2/2     Running     4 (4m43s ago)        50m

NAMESPACE     NAME                                                      TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)                      AGE
default       service/kubernetes                                        ClusterIP      10.43.0.1       <none>          443/TCP                      93d
kube-system   service/kube-dns                                          ClusterIP      10.43.0.10      <none>          53/UDP,53/TCP,9153/TCP       93d
kube-system   service/metrics-server                                    ClusterIP      10.43.94.155    <none>          443/TCP                      93d
awx           service/awx-operator-controller-manager-metrics-service   ClusterIP      10.43.92.238    <none>          8443/TCP                     93d
awx           service/awx-postgres-15                                   ClusterIP      None            <none>          5432/TCP                     93d
awx           service/awx-service                                       ClusterIP      10.43.134.45    <none>          80/TCP                       93d
kube-system   service/traefik                                           LoadBalancer   10.43.216.168   192.168.45.60   80:30165/TCP,443:32714/TCP   93d

NAMESPACE     NAME                                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
kube-system   daemonset.apps/svclb-traefik-df5791ac   1         1         1       1            1           <none>          93d

NAMESPACE     NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
awx           deployment.apps/awx-task                          1/1     1            1           93d
kube-system   deployment.apps/coredns                           1/1     1            1           93d
kube-system   deployment.apps/local-path-provisioner            1/1     1            1           93d
kube-system   deployment.apps/traefik                           1/1     1            1           93d
awx           deployment.apps/awx-web                           1/1     1            1           93d
kube-system   deployment.apps/metrics-server                    1/1     1            1           93d
awx           deployment.apps/awx-operator-controller-manager   1/1     1            1           93d

NAMESPACE     NAME                                                         DESIRED   CURRENT   READY   AGE
awx           replicaset.apps/awx-operator-controller-manager-7875f768df   0         0         0       93d
awx           replicaset.apps/awx-operator-controller-manager-66f45fd876   0         0         0       92d
awx           replicaset.apps/awx-operator-controller-manager-57d5bffcc5   0         0         0       92d
awx           replicaset.apps/awx-web-5c86686646                           0         0         0       93d
awx           replicaset.apps/awx-task-67c688c7dc                          0         0         0       93d
awx           replicaset.apps/awx-web-5884c744f5                           0         0         0       92d
awx           replicaset.apps/awx-task-79b98fdfc4                          0         0         0       92d
awx           replicaset.apps/awx-operator-controller-manager-98fc897d5    0         0         0       92d
awx           replicaset.apps/awx-operator-controller-manager-6f84dd9f54   0         0         0       30d
awx           replicaset.apps/awx-web-6fd75655cf                           0         0         0       92d
awx           replicaset.apps/awx-task-7d67fc7dbc                          0         0         0       92d
awx           replicaset.apps/awx-task-8665f69c77                          0         0         0       20d
awx           replicaset.apps/awx-web-58897b5b6d                           0         0         0       20d
awx           replicaset.apps/awx-operator-controller-manager-6885c467b    0         0         0       28d
awx           replicaset.apps/awx-web-66745bc85                            0         0         0       28d
awx           replicaset.apps/awx-task-74bd5bf7f9                          0         0         0       28d
awx           replicaset.apps/awx-operator-controller-manager-867b86c4f    0         0         0       16d
awx           replicaset.apps/awx-web-847f7d8669                           0         0         0       16d
awx           replicaset.apps/awx-task-84f7dc87c7                          0         0         0       16d
awx           replicaset.apps/awx-operator-controller-manager-bdd8dd5b     0         0         0       15d
awx           replicaset.apps/awx-web-6d4b478589                           0         0         0       15d
awx           replicaset.apps/awx-task-7f9fdd976                           0         0         0       15d
awx           replicaset.apps/awx-operator-controller-manager-589c886644   0         0         0       14d
awx           replicaset.apps/awx-task-57ff45ddbf                          1         1         1       14d
kube-system   replicaset.apps/coredns-6799fbcd5                            1         1         1       93d
kube-system   replicaset.apps/local-path-provisioner-6c86858495            1         1         1       93d
kube-system   replicaset.apps/traefik-7d5f6474df                           1         1         1       93d
awx           replicaset.apps/awx-web-ffc6d8bbc                            1         1         1       14d
kube-system   replicaset.apps/metrics-server-54fd9b65b                     1         1         1       93d
awx           replicaset.apps/awx-operator-controller-manager-7b87fbf9f6   1         1         1       8d

NAMESPACE   NAME                               READY   AGE
awx         statefulset.apps/awx-postgres-15   1/1     93d

NAMESPACE     NAME                                 COMPLETIONS   DURATION   AGE
kube-system   job.batch/helm-install-traefik-crd   1/1           31s        93d
kube-system   job.batch/helm-install-traefik       1/1           35s        93d
awx           job.batch/awx-migration-24.6.0       1/1           4m32s      93d

When the server came back online, the awx-operator-controller-manager pod had a CreateContainerError with the description below:

$ kubectl describe pod awx-operator-controller-manager-7b87fbf9f6-7bhwb -n awx
Name:             awx-operator-controller-manager-7b87fbf9f6-7bhwb
Namespace:        awx
Priority:         0
Service Account:  awx-operator-controller-manager
Node:             blue-awx.platsoft.lab/192.168.45.60
Start Time:       Wed, 25 Sep 2024 09:31:48 +0200
Labels:           control-plane=controller-manager
                  pod-template-hash=7b87fbf9f6
Annotations:      kubectl.kubernetes.io/default-container: awx-manager
                  kubectl.kubernetes.io/restartedAt: 2024-09-25T08:24:59+02:00
Status:           Running
IP:               10.42.0.83
IPs:
  IP:           10.42.0.83
Controlled By:  ReplicaSet/awx-operator-controller-manager-7b87fbf9f6
Containers:
  kube-rbac-proxy:
    Container ID:  containerd://513c0ccff5f950bcf567d04a4c1e0e7793e8ea25c887ed6840f9bf8029530989
    Image:         gcr.io/kubebuilder/kube-rbac-proxy:v0.15.0
    Image ID:      gcr.io/kubebuilder/kube-rbac-proxy@sha256:d8cc6ffb98190e8dd403bfe67ddcb454e6127d32b87acc237b3e5240f70a20fb
    Port:          8443/TCP
    Host Port:     0/TCP
    Args:
      --secure-listen-address=0.0.0.0:8443
      --upstream=http://127.0.0.1:8080/
      --logtostderr=true
      --v=0
    State:          Running
      Started:      Thu, 03 Oct 2024 10:20:19 +0200
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Wed, 25 Sep 2024 11:01:17 +0200
      Finished:     Thu, 03 Oct 2024 10:19:47 +0200
    Ready:          True
    Restart Count:  2
    Limits:
      cpu:     500m
      memory:  128Mi
    Requests:
      cpu:        5m
      memory:     64Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dtmbt (ro)
  awx-manager:
    Container ID:  containerd://f491b8b6b9ea6d753b37745cd25f743c21aa8f83f30759243b81babf1cf229c8
    Image:         quay.io/ansible/awx-operator:2.19.0
    Image ID:      quay.io/ansible/awx-operator@sha256:0a1c9e43b4f82b7d46a1ade77429c24875f22e36bed35cdc6445a1f01ec4c8b3
    Port:          <none>
    Host Port:     <none>
    Args:
      --health-probe-bind-address=:6789
      --metrics-bind-address=127.0.0.1:8080
      --leader-elect
      --leader-election-id=awx-operator
    State:          Waiting
      Reason:       CreateContainerError
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 03 Oct 2024 10:20:20 +0200
      Finished:     Thu, 03 Oct 2024 08:21:38 +0200
    Ready:          False
    Restart Count:  81
    Limits:
      cpu:     1500m
      memory:  960Mi
    Requests:
      cpu:      50m
      memory:   32Mi
    Liveness:   http-get http://:6789/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:  http-get http://:6789/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:
      ANSIBLE_GATHERING:   explicit
      ANSIBLE_DEBUG_LOGS:  false
      WATCH_NAMESPACE:     awx (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dtmbt (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  kube-api-access-dtmbt:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                        From     Message
  ----     ------   ----                       ----     -------
  Normal   Created  59m (x2 over <invalid>)    kubelet  Created container awx-manager
  Normal   Started  59m (x2 over <invalid>)    kubelet  Started container awx-manager
  Warning  Failed   58m (x3 over 59m)          kubelet  Error: failed to reserve container name "awx-manager_awx-operator-controller-manager-7b87fbf9f6-7bhwb_awx_adfc415f-2fb8-4351-9b4e-f23f888766cd_82": name "awx-manager_awx-operator-controller-manager-7b87fbf9f6-7bhwb_awx_adfc415f-2fb8-4351-9b4e-f23f888766cd_82" is reserved for "8849dea31832b115a049cf63f4e17da6dbb82e89d5cb7ee1b77be3b97ea7114d"
  Warning  BackOff  56m (x13 over 59m)         kubelet  Back-off restarting failed container awx-manager in pod awx-operator-controller-manager-7b87fbf9f6-7bhwb_awx(adfc415f-2fb8-4351-9b4e-f23f888766cd)
  Normal   Pulled   76s (x253 over <invalid>)  kubelet  Container image "quay.io/ansible/awx-operator:2.19.0" already present on machine
  Normal   Created  <invalid>                  kubelet  Created container kube-rbac-proxy
  Normal   Started  <invalid>                  kubelet  Started container kube-rbac-proxy

After restarting the pod, all looked well, but when I try to run a job now I get the following error within the job:
[screenshot of the job error omitted]

To restart the pod I ran kubectl delete pod awx-operator-controller-manager-7b87fbf9f6-7bhwb -n awx.

@kurokobo
Owner

kurokobo commented Oct 4, 2024

Hmm, it seems that the issue might not be with the Operator itself, but rather that K3s is becoming unstable. Is there enough memory? Do you have enough free space on / and /var/lib/rancher? Is there a time synchronization issue?
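For a quick look at those three points, something like this could be run on the K3s host (a rough sketch):

$ free -h
$ df -h / /var/lib/rancher
$ timedatectl status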

Anyway, first, let's remove the unnecessary resources. Execute kubectl -n awx get replicaset, then delete all of the ReplicaSets whose DESIRED is 0 with kubectl -n awx delete replicaset <replicaset name>.
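If you prefer a one-liner, something like the following could work (a sketch; review the output of the get command before deleting anything):

$ kubectl -n awx get replicaset --no-headers | awk '$2 == 0 {print $1}' | xargs -r kubectl -n awx delete replicaset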

After that, we will stop the pods in the order of Operator, Web, Task, and PSQL.

kubectl -n awx scale deployment/awx-operator-controller-manager --replicas=0
kubectl -n awx scale deployment/awx-web --replicas=0
kubectl -n awx scale deployment/awx-task --replicas=0
kubectl -n awx scale statefulset/awx-postgres-15 --replicas=0

At this stage, the result of kubectl -n awx get all should be very simple.

$ kubectl -n awx get all
NAME                                                      TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/awx-operator-controller-manager-metrics-service   ClusterIP   10.43.193.44   <none>        8443/TCP   10d
service/awx-postgres-15                                   ClusterIP   None           <none>        5432/TCP   10d
service/awx-service                                       ClusterIP   10.43.157.88   <none>        80/TCP     10d

NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/awx-operator-controller-manager   0/0     0            0           10d
deployment.apps/awx-task                          0/0     0            0           10d
deployment.apps/awx-web                           0/0     0            0           10d

NAME                                                         DESIRED   CURRENT   READY   AGE
replicaset.apps/awx-operator-controller-manager-745b55d94b   0         0         0       10d
replicaset.apps/awx-task-dcf4bc947                           0         0         0       10d
replicaset.apps/awx-web-74cbdd58db                           0         0         0       10d

NAME                               READY   AGE
statefulset.apps/awx-postgres-15   0/0     10d

NAME                             COMPLETIONS   DURATION   AGE
job.batch/awx-migration-24.6.1   1/1           2m13s      10d

Then, if it's okay for you, delete the Operator deployment and re-deploy it. Your data is not affected by doing this, but if you're feeling anxious, be sure to make a backup first.
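(If you want that backup, the AWX Operator provides an AWXBackup custom resource; a minimal sketch, assuming your AWX resource is named awx and a storage class is available for the backup PVC:)

---
apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
  name: awxbackup-before-redeploy
  namespace: awx
spec:
  deployment_name: awx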

# Delete Operator deployment
kubectl -n awx delete deployment awx-operator-controller-manager

# Your Operator will be started by doing this, then Operator starts PSQL, Task, and Web
cd awx-on-k3s
kubectl -n awx apply -k operator

Does your Operator still reproduce the issue? If so, it might be a good idea to take a backup of AWX and consider reinstalling the whole K3s.

Additionally, if AWX is running stably, you don't need the Operator, so it shouldn't be a problem to leave the Operator stopped with the following command: kubectl -n awx scale deployment/awx-operator-controller-manager --replicas=0

If you want to reboot the K3s host more safely, you can set the replicas of all deployments and statefulsets to 0 before the reboot and then change them back to 1 after the reboot.

kubectl -n awx scale deployment/awx-operator-controller-manager --replicas=0
kubectl -n awx scale deployment/awx-web --replicas=0
kubectl -n awx scale deployment/awx-task --replicas=0
kubectl -n awx scale statefulset/awx-postgres-15 --replicas=0
sudo reboot
kubectl -n awx scale statefulset/awx-postgres-15 --replicas=1
kubectl -n awx scale deployment/awx-task --replicas=1
kubectl -n awx scale deployment/awx-web --replicas=1
kubectl -n awx scale deployment/awx-operator-controller-manager --replicas=1

@WayneVDM09
Author

Hi @kurokobo

I think the issue has been fixed. During the last week we added resource limits to our base/awx.yml file, as shown below:

  web_resource_requirements:
    limits:
      cpu: "2"
      memory: "8Gi"
    requests:
      cpu: "1"
      memory: "4Gi"
  task_resource_requirements:
    limits:
      cpu: "2"
      memory: "8Gi"
    requests:
      cpu: "1"
      memory: "4Gi"
  ee_resource_requirements:
    limits:
      cpu: "2"
      memory: "8Gi"
    requests:
      cpu: "1"
      memory: "4Gi"
  init_container_resource_requirements: {}
  postgres_resource_requirements: {}
  redis_resource_requirements: {}
  rsyslog_resource_requirements: {}

After adding this and rebooting, AWX seemed to work fine. Just in case, I also did what you mentioned above by scaling the replicas down, deleting the deployment, and re-deploying. After a system reboot, things came up fine and working. The installation seems a lot more stable. Thank you for your help.

@WayneVDM09
Author

A new issue has appeared after the redeployment done above. We are running an oVirt environment with version 4.5 installed. Before the changes we could run playbooks using the ovirt.ovirt collection; after the re-deployment, we're getting:

FAILED! => {"changed": false, "msg": "ovirtsdk4 version 4.4.0 or higher is required for this module"}

We also had a separate issue before the redeployment where modules such as community.general.pkgng were not detected (basically anything from the community.general collection). We have our playbooks stored in a git repository and collections 'installed' via collections/requirements.yml.

We have the same installation of AWX on a different network, using the same git repository, with more resources, and it is not experiencing any of the issues we are seeing on this installation. I'm not sure what is different, since both installations followed the installation steps thoroughly. What would cause collections not to be detected, or different versions to be used?

Below are the contents of our collections/requirements.yml file:

---
collections:
- community.crypto
- ovirt.ovirt
- community.general
- awx.awx
- pfsensible.core

@kurokobo
Owner

kurokobo commented Oct 7, 2024

Which EE image are you using for the job? Is ovirtsdk4 installed in the EE?

@WayneVDM09
Author

I'm not sure. How do I check?

We've set each playbook to use the latest EE within its job settings in the AWX GUI.

@kurokobo
Owner

kurokobo commented Oct 7, 2024

@WayneVDM09
[screenshots omitted]

Do you mean quay.io/ansible/awx-ee:latest?

Try:

  • Explicitly specify quay.io/ansible/awx-ee:24.6.0
  • Remove ovirt.ovirt from your collections/requirements.yml, sync the project, and re-try the job, since quay.io/ansible/awx-ee contains ovirt.ovirt by default (a way to check what is inside the EE is sketched below)
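To see whether (and which) ovirtsdk4 is bundled in a given EE image, something like this could be run on any host with podman or docker (a sketch; it assumes pip3 is available in the image, and uses the pip package name ovirt-engine-sdk-python):

$ podman run --rm quay.io/ansible/awx-ee:24.6.0 pip3 show ovirt-engine-sdk-python
$ podman run --rm quay.io/ansible/awx-ee:24.6.0 python3 -c "import ovirtsdk4; print(ovirtsdk4.__file__)"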

@WayneVDM09
Author

Our EEs on this AWX instance:
[screenshots omitted]

I've tried combinations of both methods you've given me, and all attempts led to the same error:

  1. Tried using 24.6.0 with ovirt.ovirt in collections/requirements.yml
  2. Removed ovirt.ovirt from collections/requirements.yml and tried ee:latest
  3. Kept it removed from the file using ee:24.6.0
  4. Using ee:latest with ovirt.ovirt in the file still gives the same error (just clarifying that all of these combinations give the same error)

To make sure, I did try the above multiple times.
[screenshot of the error omitted]

@kurokobo
Owner

kurokobo commented Oct 9, 2024

@WayneVDM09
Hmm, is the task that causes the error correctly targeting localhost with ansible_connection: local?

@WayneVDM09
Author

Sorry, I'm not too sure what you mean by this.

@kurokobo
Owner

Oh, sorry for being unclear.

I mean that ovirt-engine-sdk-python (ovirtsdk4) is required on the node where the ovirt.ovirt.* modules in your playbook are executed.

If the target node for the tasks using ovirt.ovirt.* is localhost, the ovirtsdk4 inside the awx-ee image will be used, but if the tasks are executed on a different target node, ovirtsdk4 needs to be present on that node.
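As an illustration only (a hypothetical play; the host group and the ovirt_auth variable are placeholders), the two patterns look like this:

---
# Pattern 1: run the ovirt.ovirt modules inside the EE itself
- hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: List oVirt VMs using the SDK bundled in the EE
      ovirt.ovirt.ovirt_vm_info:
        auth: "{{ ovirt_auth }}"

# Pattern 2: run them on a remote host, which then needs ovirtsdk4 installed locally
- hosts: ovirt_engine_host
  tasks:
    - name: List oVirt VMs using the SDK installed on the target host
      ovirt.ovirt.ovirt_vm_info:
        auth: "{{ ovirt_auth }}"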

@WayneVDM09
Author

The job runs on our ovirt-engine VM, and the package below was found on that VM:

$ rpm -qa | grep sdk4
python3.11-ovirt-engine-sdk4-4.6.3-0.1.master.20230324091708.el9.x86_64

@kurokobo
Owner

I recently destroyed my oVirt environment in my lab, so I can't do any testing just now, but it may be related to the ansible_python_interpreter issue.

You should set the verbosity of the job to 3 or higher, or add a task with the debug module to display the ansible_python_interpreter variable, in order to identify the path of the Python executable used on the oVirt Engine, and then verify that ovirtsdk4 can be imported by that Python executable.

If ovirtsdk4 can't be imported by that Python but can be imported by a different one, you can avoid the issue by explicitly specifying ansible_python_interpreter in the host variables.
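A minimal sketch of such a check (hypothetical tasks, run against the same host as the failing ovirt.ovirt tasks):

- name: Show which Python interpreter Ansible is using on the target
  ansible.builtin.debug:
    var: ansible_python_interpreter

- name: Check that ovirtsdk4 can be imported by that interpreter
  ansible.builtin.command: "{{ ansible_python_interpreter | default('/usr/bin/python3') }} -c 'import ovirtsdk4'"
  changed_when: false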

@WayneVDM09
Author

Hi. It seems there wasn't any path set for the variable, though I'm not sure I looked in the right place; however, setting the variable as below in the playbook fixed the issue.

  vars:
    ansible_python_interpreter: /usr/bin/python3.11

Thank you for all of your assistance! I think we can finally close off this issue.

@kurokobo
Owner

You can also specify ansible_python_interpreter as host_vars instead of vars in the playbook, but anyway, I'm glad to hear that you've solved the issue!
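For illustration, that could look like this (a sketch with a hypothetical host name, e.g. in host_vars/ovirt-engine.example.com.yml, or in the host's variables in the AWX inventory):

ansible_python_interpreter: /usr/bin/python3.11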
I'm closing this issue.
