AWX Not starting up properly after Server Reboot #393
Comments
@WayneVDM09
Same issue even on this OS (k3s version v1.30.4+k3s1 (98262b5d)); nothing suspicious in the logs.
Hi. We are running a zero trust network, which I'm not too sure might be causing issues like this. We are allowing the server to reach quay.io, cdn03.quay.io, cdn01.quay.io, and cdn02.quay.io so that it can pull the latest EE and run jobs with it. Let me know if there is anything else I can do to help troubleshoot this issue.
@WayneVDM09
Hmm, it seems that the issue might not be with the Operator itself, but rather that K3s is becoming unstable. Is there enough memory? Do you have enough free space on the disk?

Anyway, first, let's remove the unnecessary resources. After that, we will stop the pods in the order of Operator, Web, Task, and PSQL:

```bash
kubectl -n awx scale deployment/awx-operator-controller-manager --replicas=0
kubectl -n awx scale deployment/awx-web --replicas=0
kubectl -n awx scale deployment/awx-task --replicas=0
kubectl -n awx scale statefulset/awx-postgres-15 --replicas=0
```

At this stage, the result of `kubectl -n awx get all` should look like this:
```text
NAME                                                       TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/awx-operator-controller-manager-metrics-service   ClusterIP   10.43.193.44   <none>        8443/TCP   10d
service/awx-postgres-15                                    ClusterIP   None           <none>        5432/TCP   10d
service/awx-service                                        ClusterIP   10.43.157.88   <none>        80/TCP     10d

NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/awx-operator-controller-manager   0/0     0            0           10d
deployment.apps/awx-task                          0/0     0            0           10d
deployment.apps/awx-web                           0/0     0            0           10d

NAME                                                         DESIRED   CURRENT   READY   AGE
replicaset.apps/awx-operator-controller-manager-745b55d94b   0         0         0       10d
replicaset.apps/awx-task-dcf4bc947                           0         0         0       10d
replicaset.apps/awx-web-74cbdd58db                           0         0         0       10d

NAME                               READY   AGE
statefulset.apps/awx-postgres-15   0/0     10d

NAME                             COMPLETIONS   DURATION   AGE
job.batch/awx-migration-24.6.1   1/1           2m13s      10d
```

Then, if it's okay with you, delete the Operator deployment and re-deploy it. Your data is not affected by doing this, but if you're feeling anxious, be sure to take a backup first.
```bash
# Delete the Operator deployment
kubectl -n awx delete deployment awx-operator-controller-manager

# Re-deploy the Operator; once it starts, it brings up PSQL, Task, and Web again
cd awx-on-k3s
kubectl -n awx apply -k operator
```

Does your Operator still reproduce the issue? If so, it might be a good idea to take a backup of AWX and consider reinstalling the whole K3s. Additionally, if AWX is running stably you don't strictly need the Operator, so it shouldn't be a problem to leave the Operator scaled down to zero.

If you want to reboot the K3s host more safely, you can set the replicas of all deployments and the statefulset to 0 before rebooting, and scale them back to 1 afterwards:

```bash
# Before rebooting: scale everything down
kubectl -n awx scale deployment/awx-operator-controller-manager --replicas=0
kubectl -n awx scale deployment/awx-web --replicas=0
kubectl -n awx scale deployment/awx-task --replicas=0
kubectl -n awx scale statefulset/awx-postgres-15 --replicas=0

sudo reboot

# After rebooting: scale everything back up
kubectl -n awx scale statefulset/awx-postgres-15 --replicas=1
kubectl -n awx scale deployment/awx-task --replicas=1
kubectl -n awx scale deployment/awx-web --replicas=1
kubectl -n awx scale deployment/awx-operator-controller-manager --replicas=1
```
Hi @kurokobo, I think the issue has been fixed. Over the last week we've added resource limits to our base/awx.yml file.

After adding this and rebooting, AWX seemed to work fine. Just in case, I also did what you mentioned above: scaling the replicas to zero, deleting the deployment, and re-deploying. After a system reboot everything came up fine and working. The installation seems a lot more stable. Thank you for your help.
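The exact values added to base/awx.yml weren't captured in this thread. As a rough illustration only, resource requests and limits for AWX are typically set through the awx-operator's `task_resource_requirements` and `web_resource_requirements` spec fields, something like the sketch below (placeholder numbers, not the values actually used here):

```yaml
# Illustrative sketch only: the real values from this thread were not captured.
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  task_resource_requirements:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: "1"
      memory: 2Gi
  web_resource_requirements:
    requests:
      cpu: 250m
      memory: 1Gi
    limits:
      cpu: "1"
      memory: 2Gi
```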
A new issue has appeared after the redeployment described above. We are running an oVirt environment, version 4.5. Before the changes we could run playbooks using the ovirt.ovirt collection; after the re-deployment, those jobs now fail with an error.

We also had a separate issue before the redeployment where modules such as community.general.pkgng were not detected (basically anything from the community.general collection). Our playbooks are stored in a Git repository and collections are installed via collections/requirements.yml. We have the same installation of AWX on a different network, using the same Git repository and more resources, and it isn't experiencing any of the issues we see on this installation. I'm not sure what is different, since both installations followed the installation steps thoroughly. What would be causing collections not being detected, or different versions being used? The collections are listed in our collections/requirements.yml file.
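The actual requirements.yml from this environment wasn't captured in the thread; a hypothetical example of such a file, assuming it lists the collections discussed above, would look like:

```yaml
# Hypothetical collections/requirements.yml; the real file from this thread
# was not captured. The collection names are the ones mentioned in the discussion.
collections:
  - name: ovirt.ovirt
  - name: community.general
```

Pinning explicit versions (with a `version:` key per entry) can help keep two AWX installations that share the same repository in sync.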
Which EE image are you using for the job? Is ovirtsdk4 installed in that EE?
I'm not sure. How do I check? We've set each playbook to use the latest EE within its job settings in the AWX GUI.
Do you mean the `latest` EE selected in the job settings? Try checking whether ovirtsdk4 is actually available in that EE, for example with the check below.
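The exact command from the original comment wasn't captured; as a sketch, a small playbook run as a job with the same Execution Environment can confirm whether ovirtsdk4 is importable by the EE's own Python (`ansible_playbook_python` is Ansible's magic variable for the controller-side interpreter):

```yaml
# Sketch: run as a job using the same EE to check whether ovirtsdk4
# is importable by the Python interpreter inside the EE.
- hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Try to import ovirtsdk4 with the EE's Python
      ansible.builtin.command: "{{ ansible_playbook_python }} -c 'import ovirtsdk4; print(ovirtsdk4.__file__)'"
      changed_when: false
      register: sdk_check

    - name: Show where ovirtsdk4 was found (the task above fails if it is missing)
      ansible.builtin.debug:
        var: sdk_check.stdout
```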
@WayneVDM09
Sorry, I'm not too sure what you mean by this?
Oh, sorry for my lack of words. I mean the Execution Environment: the container image your job actually runs in. If the target node for the tasks using the ovirt.ovirt modules is a remote host rather than the EE itself (for example your ovirt-engine VM), then ovirtsdk4 has to be importable by the Python interpreter on that host.
The job runs on our ovirt-engine VM, and the oVirt SDK package was found installed on that VM.
I recently destroyed my oVirt environment in my lab, so I can't do any testing just now, but it may be related to the Python interpreter being used on the target node (ansible_python_interpreter). You should set the verbosity of the job to 3 or higher, or add a task that prints the interpreter actually in use (see the sketch below). If you can't import ovirtsdk4 with that interpreter, either install the SDK for it or point ansible_python_interpreter at a Python that has it.
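A sketch of the kind of check meant here, assuming the variable in question is ansible_python_interpreter (the original snippet wasn't captured); add it to the failing playbook and run the job at verbosity 3 or higher:

```yaml
# Sketch: show which Python interpreter Ansible is using on the target host.
# If interpreter discovery ran, the result is in ansible_facts.discovered_interpreter_python.
- name: Show the Python interpreter in use on the target
  ansible.builtin.debug:
    msg: "{{ ansible_python_interpreter | default(ansible_facts.discovered_interpreter_python | default('not discovered yet')) }}"
```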
Hi. It seems the variable didn't have any path set (though I'm not sure I was looking in the right place); however, explicitly setting the variable in the playbook fixed the issue.
Thank you for all of your assistance! I think we can finally close off this issue.
You can also specify it outside the playbook, for example in the inventory or host variables on AWX (see the sketch below).
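A minimal sketch of a YAML inventory entry pinning the interpreter per host instead of in the playbook; the host name and interpreter path are placeholders, not values from this thread:

```yaml
# Hypothetical inventory snippet: set the interpreter as a host variable.
all:
  hosts:
    ovirt-engine.example.com:
      ansible_python_interpreter: /usr/bin/python3
```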
Environment
Description
The AWX Operator pod crashes after rebooting the Rocky Linux server. Once the server reboots, the awx-operator-controller-manager pod is stuck in a CreateContainerError status. To temporarily fix the issue, only until the next reboot, I 'restart' all the pods by deleting them. Restarting only the awx-operator-controller-manager pod does not always work. The Rocky Linux server is a virtual machine running on an oVirt instance.
Step to Reproduce
Logs
Files