Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWX Task Pod - Worker failed to run task awx.main.tasks.system.awx_periodic_scheduler #14394

Closed
5 of 11 tasks
Daniel-Vaz opened this issue Aug 30, 2023 · 5 comments
Closed
5 of 11 tasks

Comments

@Daniel-Vaz
Copy link

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)

Bug Summary

On the AWX Task Pod I see the following error message being logged on a loop constantly:

2023-08-30 07:38:38,795 ERROR    [-] awx.main.dispatch Worker failed to run task awx.main.tasks.system.awx_periodic_scheduler(*[], **{}
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 103, in perform_work
    result = self.run_callable(body)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 78, in run_callable
    return _call(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/system.py", line 745, in awx_periodic_scheduler
    state = TowerScheduleState.get_solo()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/solo/models.py", line 55, in get_solo
    obj, created = cls.objects.get_or_create(pk=cls.singleton_instance_id)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/manager.py", line 87, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/query.py", line 916, in get_or_create
    return self.get(**kwargs), False
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/query.py", line 640, in get
    raise self.model.MultipleObjectsReturned(
awx.main.models.ha.TowerScheduleState.MultipleObjectsReturned: get() returned more than one TowerScheduleState -- it returned 2!

Pod Startup seems normal:

[wait-for-migrations] Waiting for database migrations...
[wait-for-migrations] Attempt 1 of 30
Instance Group already registered controlplane
Instance Group already registered default
Successfully registered instance None
(changed: True)
2023-08-30 07:38:03,862 INFO RPC interface 'supervisor' initialized
2023-08-30 07:38:03,862 INFO RPC interface 'supervisor' initialized
2023-08-30 07:38:03,862 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2023-08-30 07:38:03,862 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2023-08-30 07:38:03,862 INFO supervisord started with pid 7
2023-08-30 07:38:03,862 INFO supervisord started with pid 7
2023-08-30 07:38:04,865 INFO spawned: 'superwatcher' with pid 34
2023-08-30 07:38:04,865 INFO spawned: 'superwatcher' with pid 34
2023-08-30 07:38:04,866 INFO spawned: 'dispatcher' with pid 35
2023-08-30 07:38:04,866 INFO spawned: 'dispatcher' with pid 35
2023-08-30 07:38:04,868 INFO spawned: 'callback-receiver' with pid 36
2023-08-30 07:38:04,868 INFO spawned: 'callback-receiver' with pid 36
2023-08-30 07:38:04,870 INFO spawned: 'wsrelay' with pid 37
2023-08-30 07:38:04,870 INFO spawned: 'wsrelay' with pid 37
READY
2023-08-30 07:38:05,872 INFO success: superwatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-08-30 07:38:05,872 INFO success: superwatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-08-30 07:38:06,759 INFO     [-] awx.main.commands.run_callback_receiver Callback receiver started with pid=36
2023-08-30 07:38:06,760 INFO     [-] awx.main.dispatch Running worker dispatcher listening to queues ['tower_broadcast_all', 'tower_settings_change', 'awx-task-ffcbcf59f-7896t']
2023-08-30 07:38:06,862 WARNING  [-] awx.main.tasks.system Rejoining the cluster as instance awx-task-ffcbcf59f-7896t. Prior last_seen None
2023-08-30 07:38:06,873 WARNING  [-] awx.main.dispatch.periodic Scheduler next run unexpectedly over 20 seconds in future: 25.88771653175354
2023-08-30 07:38:06,913 INFO     [-] awx.main.wsrelay Active instance with hostname awx-task-ffcbcf59f-7896t is registered.
2023-08-30 07:38:16,923 INFO     [-] awx.main.wsrelay Adding {'awx-web-67dd69cb85-wxzmh'} to websocket broadcast list
2023-08-30 07:38:16,935 INFO     [-] awx.main.wsrelay Connection from awx-task-ffcbcf59f-7896t to 172.17.163.3 established.
2023-08-30 07:38:16,935 INFO     [-] awx.main.wsrelay Starting producer for metrics


2023-08-30 07:38:34,867 INFO success: dispatcher entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2023-08-30 07:38:34,867 INFO success: dispatcher entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2023-08-30 07:38:34,867 INFO success: callback-receiver entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2023-08-30 07:38:34,867 INFO success: callback-receiver entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2023-08-30 07:38:34,867 INFO     [-] awx.main.wsrelay Starting producer for 172.17.163.3-jobs-status_changed
2023-08-30 07:38:34,867 INFO     [-] awx.main.wsrelay Starting producer for 172.17.163.3-control-limit_reached_1
2023-08-30 07:38:35,868 INFO success: wsrelay entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2023-08-30 07:38:35,868 INFO success: wsrelay entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)

AWX version

23.0.0

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

ansible [core 2.15.0]

Operating system

Fedora CoreOS 38

Web browser

Firefox

Steps to reproduce

We are using AWX for a while now and only recently did we notice this, and it seems to only have started happening after the upgrade from version 22.5.0 to 22.7.0.

Currently on version 23.0.0 using the AWX Operator on version 2.5.2 we still face this issue.

Expected results

No errors being outputted to the logs.

Actual results

The following log message keeps repeating in the awx-task Pod every 30s +/-:

2023-08-30 07:44:38,784 ERROR    [-] awx.main.dispatch Worker failed to run task awx.main.tasks.system.awx_periodic_scheduler(*[], **{}
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 103, in perform_work
    result = self.run_callable(body)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 78, in run_callable
    return _call(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/system.py", line 745, in awx_periodic_scheduler
    state = TowerScheduleState.get_solo()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/solo/models.py", line 55, in get_solo
    obj, created = cls.objects.get_or_create(pk=cls.singleton_instance_id)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/manager.py", line 87, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/query.py", line 916, in get_or_create
    return self.get(**kwargs), False
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/query.py", line 640, in get
    raise self.model.MultipleObjectsReturned(
awx.main.models.ha.TowerScheduleState.MultipleObjectsReturned: get() returned more than one TowerScheduleState -- it returned 2!

Additional information

No response

@AlanCoding
Copy link
Member

This was reported in #8214 and #4419 and in both cases there was a simple and obvious remediation.

The mechanic here is that we're using the django-solo library to track the scheduler state https://github.com/lazybird/django-solo/ and the entire point of the solo library is that it only maintains 1 record in a table.

The simple fact that you hit this error isn't very interesting. If you know of something that an internal task did, or that the awx-operator did, which put it into this state, that would be very interesting. You might make progress by looking at the table contents and getting logs from around the time of the recorded timestamps to see what might have ran and caused this.

@Daniel-Vaz
Copy link
Author

Hi @AlanCoding 😄

Thank you for the explanation, I wasn't aware of the mentioned issues.
Indeed in my situation I had a duplicate entries in the public.main_towerschedulestate table:

awx=# SELECT * FROM public.main_towerschedulestate;
 id |       schedule_last_run
----+-------------------------------
  1 | 2023-06-26 12:04:17.06672+00
  1 | 2023-06-26 10:37:49.014928+00

Scaling AWX down and deleting these entries resulted in no longer these logs being shown.
Will keep a eye out in the future in order to better understand if there is any specific root cause associated with this behaviour.

@AlanCoding
Copy link
Member

I hate to claim "this shouldn't happen"... but it shouldn't happen. There is only 1 case I know of where we access the model:

state = TowerScheduleState.get_solo()

That is done under an advisory lock which prevents more than 1 process doing this operation at a given time. That's a pretty strong level of assurance against this kind of bug. This reeks of someone doing cowboy coding and running scripts outside of the app logic itself. Still, the fact that it has been reported 3 times suggests there's something systemic going on.

@keys1234249
Copy link

Hi @AlanCoding 😄

Thank you for the explanation, I wasn't aware of the mentioned issues. Indeed in my situation I had a duplicate entries in the public.main_towerschedulestate table:

awx=# SELECT * FROM public.main_towerschedulestate;
 id |       schedule_last_run
----+-------------------------------
  1 | 2023-06-26 12:04:17.06672+00
  1 | 2023-06-26 10:37:49.014928+00

Scaling AWX down and deleting these entries resulted in no longer these logs being shown. Will keep a eye out in the future in order to better understand if there is any specific root cause associated with this behaviour.

Hey Daniel can you share some information on how to clear the duplicate entries from the AWX DB? When you mention scale down i assume you mean the replicas of AWX? I only have a single replica so i would assume i can run while a single instance is running?

@keys1234249
Copy link

Hi @AlanCoding 😄
Thank you for the explanation, I wasn't aware of the mentioned issues. Indeed in my situation I had a duplicate entries in the public.main_towerschedulestate table:

awx=# SELECT * FROM public.main_towerschedulestate;
 id |       schedule_last_run
----+-------------------------------
  1 | 2023-06-26 12:04:17.06672+00
  1 | 2023-06-26 10:37:49.014928+00

Scaling AWX down and deleting these entries resulted in no longer these logs being shown. Will keep a eye out in the future in order to better understand if there is any specific root cause associated with this behaviour.

Hey Daniel can you share some information on how to clear the duplicate entries from the AWX DB? When you mention scale down i assume you mean the replicas of AWX? I only have a single replica so i would assume i can run while a single instance is running?

FYI - For anyone trying to resolve this issue steps i performed to clear double entry below. Not sure if its the correct way but it seemed to resolve my issue.

Confirm if you have two entries:
oc exec awx-prod-postgres-13-0 -- psql --username awx --dbname awx --command "SELECT * FROM public.main_towerschedulestate;"

Delete both entries:
oc exec awx-prod-postgres-13-0 -- psql --username awx --dbname awx --command "DELETE FROM public.main_towerschedulestate WHERE id IN (1);"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants