Relaunch jobs do not get queued when instances disabled #14365

Open
5 of 11 tasks
2and3makes23 opened this issue Aug 22, 2023 · 8 comments

Comments

@2and3makes23

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)

Bug Summary

When all AWX instances are disabled and a previously run job is relaunched, the following happens:

  • an internal server error is returned
  • the job appears in the jobs list as "New" but is stuck there indefinitely

AWX version

22.5.0

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

openshift

Modifications

no

Ansible version

2.12.10

Operating system

CentOS, RHEL

Web browser

Firefox

Steps to reproduce

  • Disable all AWX instances
  • Attempt to relaunch a formerly successful job (or any job, for that matter)
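
The two steps above can also be driven through the REST API instead of the UI. The following is a minimal sketch, not a verified reproduction script: it only builds the requests and leaves sending them to the reader. The host name, token, and instance IDs are placeholders, and it assumes the instances endpoint accepts a PATCH of the `enabled` field. The relaunch path and job ID 43006 are taken from the log output later in this thread.

```python
import json
import urllib.request

AWX_URL = "https://awx.example.com"  # placeholder host, not from the report
TOKEN = "REPLACE_ME"                 # placeholder OAuth2 token


def build_request(method, path, payload=None):
    """Build (but do not send) an authenticated JSON request to the AWX API."""
    data = json.dumps(payload or {}).encode()
    return urllib.request.Request(
        AWX_URL + path,
        data=data,
        method=method,
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
    )


def reproduce_requests(instance_ids, job_id):
    """Return the requests for both steps: disable every instance, then relaunch."""
    reqs = [
        # Assumption: PATCHing `enabled` on an instance disables it.
        build_request("PATCH", f"/api/v2/instances/{i}/", {"enabled": False})
        for i in instance_ids
    ]
    reqs.append(build_request("POST", f"/api/v2/jobs/{job_id}/relaunch/"))
    return reqs


if __name__ == "__main__":
    # Instance IDs 1 and 2 are hypothetical.
    for req in reproduce_requests([1, 2], 43006):
        print(req.get_method(), req.full_url)
        # urllib.request.urlopen(req)  # uncomment to actually send the request
```

With all instances disabled, the final POST is the request that is reported to return HTTP 500.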

Expected results

The relaunched job appears under jobs as "Pending" and starts as soon as an AWX instance is re-enabled and picks it up.

Actual results

An internal server error is shown:
(screenshot)

The job appears in jobs with status "New" and is stuck there indefinitely:
(screenshot)

Reenabling instances does not change the state of the relaunched job.

Additional information

After reenabling AWX instances everything works fine again, including job relaunch.
Only the "New" job stays stuck.

@djyasin
Member

djyasin commented Aug 23, 2023

Hello,
When jobs are relaunched, they still have to go through the task manager processing. They can get assigned to different nodes within the same instance group, and the expectation is that instances will need to be enabled and available so that the job can run.

May we ask why you are disabling the instances for a given job template? We would like to gain a better understanding of this particular use case.

Thank you for your time!

@2and3makes23
Author

Hi, thanks for your quick response!

When we disable instances

We disable all instances (not just for a particular job template, but in general) when updating to a more recent AWX version, so that we do not interrupt our customers' jobs in the process.

Benefits for us

Jobs that are triggered during our update process are enqueued ("Pending" state) and executed after reenabling the instances (in our case: complete AWX redeployment).

Only relaunched jobs run into the error described above while instances are disabled.

More on why we disable instances

For us, updating means updating the AWX operator and redeploying AWX with that newer operator, which we trigger explicitly because of staging.

Of course we would much rather update in a more Kubernetes-like way and use a rolling update strategy (replacing old pods one by one) instead of disabling and redeploying, but as far as we know that is not yet possible: awx-operator/issues/1275 and awx-operator/issues/1362

But maybe you have some helpful input on that for us, too? :)
Thank you for your time :)

@djyasin
Member

djyasin commented Aug 30, 2023

@2and3makes23 Thank you so much for providing this additional information! This is extremely helpful. Could you please also provide the traceback logs that are generated when this occurs?

Thank you again for taking the time to provide all of this information!

@AlanCoding
Member

I had a look and I was not able to reproduce this issue. Jobs can be relaunched even when all instances are disabled and the relaunch job goes into "pending" as expected.

@2and3makes23
Author

Sorry for the delay

@AlanCoding thanks for checking on your side

@djyasin please find log output below that is produced for one event of a user clicking job relaunch while all (two) instances are disabled

September 8th 2023, 15:37:33.11<some_ip> - - [08/Sep/2023 13:37:33] "GET /probe?seconds=1&livereadistart=readi HTTP/1.1" 200 -
September 8th 2023, 15:37:33.11<some_ip> - - [08/Sep/2023:13:37:33 +0000] "GET / HTTP/1.1" 200 8 "-" "python-requests/2.31.0"
September 8th 2023, 15:37:33.11<some_ip> - - [08/Sep/2023:13:37:33 +0000] "GET / HTTP/1.1" 200 8 "-" "python-requests/2.31.0"
September 8th 2023, 15:37:31.232[pid: 40|app: 0|req: 144/280] <some_ip> () {70 vars in 1261 bytes} [Fri Sep  8 13:37:30 2023] POST /api/v2/jobs/43006/relaunch/ => generated 41 bytes in 301 msecs (HTTP/1.1 500) 8 headers in 309 bytes (1 switches on core 0)
September 8th 2023, 15:37:31.23<some_ip> - - [08/Sep/2023:13:37:31 +0000] "POST /api/v2/jobs/43006/relaunch/ HTTP/1.1" 500 41 "https://awx.domain.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/111.0" "<some_ip>, <some_ip>"
September 8th 2023, 15:37:31.229 2023-09-08 13:37:31,226 ERROR    [7c40e4bfc1ea472aa957f6662601b473] django.request Internal Server Error: /api/v2/jobs/43006/relaunch/
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/handlers/base.py", line 197, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/views/decorators/csrf.py", line 56, in wrapper_view
    return view_func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/views/generic/base.py", line 104, in view
    return self.dispatch(request, *args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/api/views/__init__.py", line 3377, in dispatch
    return super(JobRelaunch, self).dispatch(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/api/generics.py", line 332, in dispatch
    return super(APIView, self).dispatch(request, *args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/rest_framework/views.py", line 509, in dispatch
    response = self.handle_exception(exc)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/rest_framework/views.py", line 469, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/rest_framework/views.py", line 480, in raise_uncaught_exception
    raise exc
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/rest_framework/views.py", line 506, in dispatch
    response = handler(request, *args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/api/views/__init__.py", line 3424, in post
    new_job = obj.copy_unified_job(**copy_kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/models/jobs.py", line 655, in copy_unified_job
    return super(Job, self).copy_unified_job(**new_prompts)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/models/unified_jobs.py", line 940, in copy_unified_job
    unified_job = self.unified_job_template.create_unified_job(**prompts)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/models/jobs.py", line 393, in create_unified_job
    job = super(JobTemplate, self).create_unified_job(**kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/models/unified_jobs.py", line 400, in create_unified_job
    unified_job.save()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/models/unified_jobs.py", line 906, in save
    result = super(UnifiedJob, self).save(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/polymorphic/models.py", line 87, in save
    return super().save(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/models/base.py", line 325, in save
    super(PrimordialModel, self).save(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/models/base.py", line 207, in save
    super(PasswordFieldsModel, self).save(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/models/base.py", line 173, in save
    super(CreatedModifiedModel, self).save(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/base.py", line 814, in save
    self.save_base(
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/base.py", line 892, in save_base
    post_save.send(
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/dispatch/dispatcher.py", line 176, in send
    return [
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/dispatch/dispatcher.py", line 177, in <listcomp>
    (receiver, receiver(signal=self, sender=sender, **named))
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/signals.py", line 109, in emit_update_inventory_on_created_or_deleted
    connection.on_commit(lambda: update_inventory_computed_fields.delay(inventory.id))
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 760, in on_commit
    func()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/signals.py", line 109, in <lambda>
    connection.on_commit(lambda: update_inventory_computed_fields.delay(inventory.id))
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/publish.py", line 73, in delay
    return cls.apply_async(args, kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/publish.py", line 93, in apply_async
    queue = queue()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/__init__.py", line 37, in get_task_queuename
    raise ValueError('No task instances are READY and Enabled.')
ValueError: No task instances are READY and Enabled.
September 8th 2023, 15:37:30.898[pid: 38|app: 0|req: 68/279] <some_ip> () {64 vars in 1140 bytes} [Fri Sep  8 13:37:30 2023] GET /api/v2/jobs/43006/relaunch/ => generated 68 bytes in 266 msecs (HTTP/1.1 200) 14 headers in 583 bytes (1 switches on core 0)
September 8th 2023, 15:37:30.89<some_ip> - - [08/Sep/2023:13:37:30 +0000] "GET /api/v2/jobs/43006/relaunch/ HTTP/1.1" 200 68 "https://awx.domain.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/111.0" "<some_ip>, <some_ip>"

@AlanCoding
Member

Thanks, that points to some relatively recent code so this is good information.

raise ValueError('No task instances are READY and Enabled.')

@AlanCoding
Member

I didn't give enough information in my last comment - the ValueError is hit because we have enabled=True as a part of the instance filter, so the queryset returns no instances, and raises that error. The obvious and simple fix is to either remove that from the filter, or add a last-ditch query to get disabled instances when no enabled instances are present.

I did not hit this bug in my replication attempt because I was using a hybrid node, which submits tasks locally. Only web pods use this code.

This is obviously valid and should get worked on.
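
To make the second option concrete (a last-ditch query for disabled instances), here is a minimal, self-contained sketch of the selection logic. It is not the AWX implementation: the `Instance` dataclass, its field names, and `pick_queue` are hypothetical stand-ins for the real model and for `get_task_queuename`, and plain lists take the place of the Django queryset.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Instance:
    """Hypothetical stand-in for an AWX instance row; field names are illustrative."""
    hostname: str
    enabled: bool
    ready: bool = True


def pick_queue(instances: Iterable[Instance]) -> str:
    """Prefer an enabled, ready instance; fall back to a disabled but ready one."""
    ready = [i for i in instances if i.ready]
    enabled = [i for i in ready if i.enabled]
    if enabled:
        return enabled[0].hostname
    if ready:
        # Last-ditch fallback: submit to a disabled instance so that the
        # task can still be queued instead of raising during the request.
        return ready[0].hostname
    raise ValueError("No task instances are READY.")


if __name__ == "__main__":
    # All instances disabled, as in the report: the fallback still returns a queue.
    print(pick_queue([Instance("awx-task-1", enabled=False),
                      Instance("awx-task-2", enabled=False)]))
```

Whether a task submitted to a disabled instance's queue is actually picked up once the instance is re-enabled depends on the dispatcher, which this sketch does not model.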

@2and3makes23
Author

Thanks for looking into this, we really appreciate it ❤️
