Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

awx.main.wsrelay records repeated errors due to 403 with 'Invalid response status' over '/websocket/relay/' #14876

Closed
5 of 11 tasks
kurokobo opened this issue Feb 14, 2024 · 7 comments
Assignees

Comments

@kurokobo
Copy link
Contributor

kurokobo commented Feb 14, 2024

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)

Bug Summary

There are repeated errors recorded in task pod. Seems adding web instance to websocket broadcast list is failing due to wsrelay failure.

Confirmed that this occurs too on minimal fresh deployment in addition to upgraded instance. Operator completes reconciliation without any errors.

2024-02-14 05:02:32,059 INFO spawned: 'wsrelay' with pid 240
2024-02-14 05:02:32,059 INFO spawned: 'wsrelay' with pid 240
2024-02-14 05:02:33,938 INFO     [-] awx.main.wsrelay Active instance with hostname awx-task-5bdffffcd-nczk8 is registered.
2024-02-14 05:02:50,048 WARNING  [-] awx.main.tasks.system Heartbeat skew - interval=62.8776, expected=60
2024-02-14 05:02:50,071 INFO     [-] awx.main.tasks.system Host awx-task-6f8689c7fc-ghjcb Automatically Deprovisioned.
2024-02-14 05:02:53,961 INFO     [-] awx.main.wsrelay Adding {'awx-web-b5468fdfb-hm625'} to websocket broadcast list
2024-02-14 05:02:53,977 WARNING  [-] awx.main.wsrelay Connection from awx-task-5bdffffcd-nczk8 to 10.42.0.34 failed for unknown reason: '403, message='Invalid response status', url=URL('http://10.42.0.34:8052/websocket/relay/')'.
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/wsrelay.py", line 89, in connect
    async with session.ws_connect(uri, ssl=self.verify_ssl, heartbeat=20) as websocket:
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/aiohttp/client.py", line 1141, in __aenter__
    self._resp = await self._coro
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/aiohttp/client.py", line 795, in _ws_connect
    raise WSServerHandshakeError(
aiohttp.client_exceptions.WSServerHandshakeError: 403, message='Invalid response status', url=URL('http://10.42.0.34:8052/websocket/relay/')
2024-02-14 05:03:02,987 INFO success: wsrelay entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-02-14 05:03:02,987 INFO success: wsrelay entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-02-14 05:03:03,965 INFO     [-] awx.main.wsrelay Removing {'awx-web-b5468fdfb-hm625'} from websocket broadcast list
/usr/lib64/python3.9/asyncio/events.py:80: RuntimeWarning: coroutine 'WebSocketRelayManager.cleanup_offline_host' was never awaited
  self._context.run(self._callback, *self._args)
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Traceback (most recent call last):
  File "/usr/bin/awx-manage", line 8, in <module>
    sys.exit(manage())
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/__init__.py", line 175, in manage
    execute_from_command_line(sys.argv)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
    utility.execute()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/__init__.py", line 436, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/base.py", line 412, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/base.py", line 458, in execute
    output = self.handle(*args, **options)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/management/commands/run_wsrelay.py", line 171, in handle
    asyncio.run(websocket_relay_manager.run())
  File "/usr/lib64/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/wsrelay.py", line 342, in run
    await asyncio.gather(self.cleanup_offline_host(h) for h in deleted_remote_hosts)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/wsrelay.py", line 342, in <genexpr>
    await asyncio.gather(self.cleanup_offline_host(h) for h in deleted_remote_hosts)
RuntimeError: Task got bad yield: <coroutine object WebSocketRelayManager.cleanup_offline_host at 0x7f2b5c853640>
2024-02-14 05:03:04,279 WARN exited: wsrelay (exit status 1; not expected)
2024-02-14 05:03:04,279 WARN exited: wsrelay (exit status 1; not expected)

In the web pod, there also are logs around authentication, Websocket connection does not provide valid authentication:

2024-02-14 05:01:58,691 INFO success: nginx entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-02-14 05:01:58,691 INFO success: nginx entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-02-14 05:01:58,691 INFO success: uwsgi entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-02-14 05:01:58,691 INFO success: uwsgi entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-02-14 05:01:58,691 INFO success: daphne entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-02-14 05:01:58,691 INFO success: daphne entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-02-14 05:01:58,691 INFO success: awx-cache-clear entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-02-14 05:01:58,691 INFO success: awx-cache-clear entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-02-14 05:01:58,691 INFO success: ws-heartbeat entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-02-14 05:01:58,691 INFO success: ws-heartbeat entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-02-14 05:02:19,768 ERROR    Websocket connection does not provide valid authentication
10.42.0.33 - - [14/Feb/2024:05:02:19 +0000] "GET /websocket/relay/ HTTP/1.1" 403 5 "-" "Python/3.9 aiohttp/3.8.3" "-"
2024-02-14 05:02:53,973 ERROR    Task was destroyed but it is pending!
task: <Task pending name='Task-7' coro=<RedisConnection._read_data() done, defined at /var/lib/awx/venv/awx/lib64/python3.9/site-packages/aioredis/connection.py:180> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f2f94cf4400>()]> cb=[RedisConnection.__init__.<locals>.<lambda>() at /var/lib/awx/venv/awx/lib64/python3.9/site-packages/aioredis/connection.py:168]>
2024-02-14 05:02:53,974 ERROR    Task was destroyed but it is pending!
task: <Task pending name='Task-14' coro=<RedisConnection._read_data() done, defined at /var/lib/awx/venv/awx/lib64/python3.9/site-packages/aioredis/connection.py:180> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f2f94c2d910>()]> cb=[RedisConnection.__init__.<locals>.<lambda>() at /var/lib/awx/venv/awx/lib64/python3.9/site-packages/aioredis/connection.py:168]>
10.42.0.33 - - [14/Feb/2024:05:02:53 +0000] "GET /websocket/relay/ HTTP/1.1" 403 5 "-" "Python/3.9 aiohttp/3.8.3" "-"
2024-02-14 05:02:53,976 ERROR    Websocket connection does not provide valid authentication
10.42.0.33 - - [14/Feb/2024:05:03:26 +0000] "GET /websocket/relay/ HTTP/1.1" 403 5 "-" "Python/3.9 aiohttp/3.8.3" "-"
2024-02-14 05:03:26,986 ERROR    Websocket connection does not provide valid authentication
2024-02-14 05:03:50,097 ERROR    Websocket connection does not provide valid authentication
10.42.0.33 - - [14/Feb/2024:05:03:50 +0000] "GET /websocket/relay/ HTTP/1.1" 403 5 "-" "Python/3.9 aiohttp/3.8.3" "-"
2024-02-14 05:04:23,139 ERROR    Websocket connection does not provide valid authentication
2024-02-14 05:04:23,140 ERROR    Task was destroyed but it is pending!
task: <Task pending name='Task-21' coro=<RedisConnection._read_data() done, defined at /var/lib/awx/venv/awx/lib64/python3.9/site-packages/aioredis/connection.py:180> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f2f94c32490>()]> cb=[RedisConnection.__init__.<locals>.<lambda>() at /var/lib/awx/venv/awx/lib64/python3.9/site-packages/aioredis/connection.py:168]>
2024-02-14 05:04:23,140 ERROR    Task was destroyed but it is pending!
task: <Task pending name='Task-28' coro=<RedisConnection._read_data() done, defined at /var/lib/awx/venv/awx/lib64/python3.9/site-packages/aioredis/connection.py:180> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f2f94c32a00>()]> cb=[RedisConnection.__init__.<locals>.<lambda>() at /var/lib/awx/venv/awx/lib64/python3.9/site-packages/aioredis/connection.py:168]>
2024-02-14 05:04:23,140 ERROR    Task was destroyed but it is pending!
task: <Task pending name='Task-35' coro=<RedisConnection._read_data() done, defined at /var/lib/awx/venv/awx/lib64/python3.9/site-packages/aioredis/connection.py:180> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f2f955ba280>()]> cb=[RedisConnection.__init__.<locals>.<lambda>() at /var/lib/awx/venv/awx/lib64/python3.9/site-packages/aioredis/connection.py:168]>
10.42.0.33 - - [14/Feb/2024:05:04:23 +0000] "GET /websocket/relay/ HTTP/1.1" 403 5 "-" "Python/3.9 aiohttp/3.8.3" "-"

AWX version

23.8.0

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

N/A

Operating system

CentOS Stream 8

Web browser

Chrome

Steps to reproduce

  1. Deploy AWX 23.8.0 using AWX Operator 2.12.0 with following YAML:
    ---
    apiVersion: awx.ansible.com/v1beta1
    kind: AWX
    metadata:
      name: awx-demo
    spec:
      service_type: nodeport

Or,

  1. Upgrade existing AWX to 23.8.0 from 23.7.0 using AWX Operator 2.12.0

Expected results

There is no error in task and web pod.

Actual results

There are repeated errors both task and web pod.

Additional information

Not sure but maybe related to: #14855 since the error is from DrfAuthMiddleware: https://github.com/ansible/django-ansible-base/blob/8cfce0d19f593c61d0704db527e89b0c6f300db4/ansible_base/lib/channels/middleware.py#L35-L38

Also there is a doc for this topic on django-ansible-base:

If the user can be retrieved from the stored session or by any backend in settings.AUTHENTICATION_BACKEND, the user is stored in scope["user"]. Othwerwise the websocket connection is denied and closed with return code 403.
https://github.com/ansible/django-ansible-base/blob/devel/docs/lib/channels_authentication.md

@chrismeyersfsu
I've not digged into deeper this issue yet, but any ideas on this?

@kurokobo
Copy link
Contributor Author

Confirmed that this occurs too on minimal fresh deployment in addition to upgraded instance. I've edited my OP.

@willthames
Copy link
Contributor

We have also encountered this problem following our 23.8.0 update.

Reverting awx/main/routing.py back to how it was before #14855 on both awx-web and awx-task got the cluster back up and running again, so I would say that #14855 is definitely the cause.

@TheRealHaoLiu
Copy link
Member

cc @chrismeyersfsu

@kurokobo
Copy link
Contributor Author

I've already mentioned him in my OP😃

@chrismeyersfsu
Copy link
Member

chrismeyersfsu commented Feb 14, 2024

Got a fix up if someone could try it out #14879

Going to try and code some tests up for this.

edit: Oh, the PR should also fix the RuntimeError: Task got bad yield: <coroutine object WebSocketRelayManager.cleanup_offline_host at 0x7f2b5c853640> error.

Thanks for the bug report!

@kurokobo
Copy link
Contributor Author

@chrismeyersfsu @TheRealHaoLiu @fosterseth
Thanks for working on this.

I made quick test for the PR #14879 and I can confirm that I no longer get repeated errors just by deploying new instance.
It would be great if you trigger new Z release, AWX 23.8.1 and Operator 2.12.1, out of the normal release cadence.

@ckozler
Copy link

ckozler commented Jun 8, 2024

I reproduced this error using HEAD and 25.4.0 as documented here #15262

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants