Remote work unit is gone #15665

Open
7 of 11 tasks
xibriz opened this issue Nov 26, 2024 · 3 comments

Comments

xibriz commented Nov 26, 2024

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)

Bug Summary

When I try to add a new instance with the same IP/hostname after reinstalling the remote host, I get the following error in AWX:

Receptor error from 10.131.254.154, detail:
Remote work unit is gone

The logs show:

2024-11-26 06:56:00,714 WARNING  [d4328228e83442d4b8d0100c1fcbf803] awx.main.tasks.receptor While releasing work: WorkUnitCancelError 'error cancelling remote unit:  unknown work unit ' on node '10.131.254.154' with state 'Failed' work unit id '10'
2024-11-26 06:56:00,718 INFO     [d4328228e83442d4b8d0100c1fcbf803] awx.main.tasks.system Failed to find capacity of new or lost execution node 10.131.254.154, errors:
Receptor error from 10.131.254.154, detail:
Remote work unit is gone
2024-11-26 06:56:05,014 INFO     [d4328228e83442d4b8d0100c1fcbf803] awx.main.tasks.receptor Reaping orphaned work unit G9gDUxVA with params --worker-info
2024-11-26 06:56:05,681 ERROR    [d4328228e83442d4b8d0100c1fcbf803] awx.main.dispatch Worker failed to run task awx.main.tasks.system.awx_receptor_workunit_reaper(*[], **{}
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/worker/task.py", line 103, in perform_work
    result = self.run_callable(body)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/worker/task.py", line 78, in run_callable
    return _call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/tasks/system.py", line 693, in awx_receptor_workunit_reaper
    administrative_workunit_reaper(receptor_work_list)
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/tasks/receptor.py", line 222, in administrative_workunit_reaper
    receptor_ctl.simple_command(f"work release {unit_id}")
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/receptorctl/socket_interface.py", line 83, in simple_command
    return self.read_and_parse_json()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/receptorctl/socket_interface.py", line 60, in read_and_parse_json
    raise RuntimeError(text[7:])
RuntimeError: error cancelling remote unit:  unknown work unit 10

The receptor logs show:

ERROR 2024/11/26 06:57:05 Error locating unit: 10
ERROR 2024/11/26 06:57:05 : unknown work unit 10

How do I resolve this? I can't change the IP of the receptor.

AWX version

24.6.1

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

2.15.9

Operating system

Ubuntu

Web browser

No response

Steps to reproduce

Reinstall an instance bundle.

Expected results

The receptor will join the mesh.

Actual results

Receptor can't join the mesh.

Additional information

No response

xibriz commented Nov 26, 2024

Running /var/lib/awx/venv/awx/bin/receptorctl --socket=/var/run/receptor/receptor.sock work list shows:

{
    "z5CKnIs0": {
        "Detail": "Remote work unit is gone",
        "ExtraData": {
            "Expiration": "2024-11-26T06:55:18.115587049Z",
            "LocalCancelled": true,
            "LocalReleased": true,
            "RemoteNode": "10.131.254.154",
            "RemoteParams": {
                "params": "--worker-info"
            },
            "RemoteStarted": true,
            "RemoteUnitID": "10",
            "RemoteWorkType": "ansible-runner",
            "SignWork": true,
            "TLSClient": "tlsclient"
        },
        "State": 3,
        "StateName": "Failed",
        "StdoutSize": 0,
        "WorkType": "remote"
    }
}
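
Entries like this one can be picked out of the `work list` JSON by filtering on the fields shown above. This is a minimal sketch, not AWX or receptorctl code: `find_stale_remote_units` is a hypothetical helper, and it assumes only the `WorkType`, `StateName`, `Detail`, and `ExtraData` keys that appear in this output.

```python
import json


def find_stale_remote_units(work_list_json: str) -> dict[str, dict]:
    """Return failed remote units whose detail says the remote unit is gone.

    Hypothetical helper: relies only on keys visible in the pasted output.
    """
    units = json.loads(work_list_json)
    stale = {}
    for unit_id, info in units.items():
        if (
            info.get("WorkType") == "remote"
            and info.get("StateName") == "Failed"
            and info.get("Detail") == "Remote work unit is gone"
        ):
            # ExtraData can be null for some units, so fall back to {}.
            extra = info.get("ExtraData") or {}
            stale[unit_id] = {
                "remote_node": extra.get("RemoteNode"),
                "remote_unit_id": extra.get("RemoteUnitID"),
            }
    return stale


# Trimmed copy of the entry above, used as sample input.
sample = """
{
    "z5CKnIs0": {
        "Detail": "Remote work unit is gone",
        "ExtraData": {"RemoteNode": "10.131.254.154", "RemoteUnitID": "10"},
        "StateName": "Failed",
        "WorkType": "remote"
    }
}
"""
print(find_stale_remote_units(sample))
```

For the sample, the result maps the local unit ID `z5CKnIs0` to its remote node and the remote unit ID `10`.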

On the receptor:

receptorctl --socket=/var/run/receptor/receptor.sock work list --unit_id z5CKnIs0
Warning: receptorctl and receptor are different versions, they may not be compatible
ERROR: unknown work unit z5CKnIs0
receptorctl --socket=/var/run/receptor/receptor.sock work list
Warning: receptorctl and receptor are different versions, they may not be compatible
{
    "10.131.254.154119IMiKm": {
        "Detail": "exit status 0",
        "ExtraData": null,
        "State": 2,
        "StateName": "Succeeded",
        "StdoutSize": 121,
        "WorkType": "ansible-runner"
    },
   ....
}
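
The two listings can be compared directly: the controller's entry records `RemoteUnitID` `10`, so the question is whether the node's own `work list` contains a unit with that ID. A minimal sketch, assuming both outputs have already been parsed from JSON into dictionaries. `remote_unit_exists` is a hypothetical name, and the node listing below contains only the one entry shown above, since the rest was elided.

```python
def remote_unit_exists(controller_unit: dict, node_work_list: dict) -> bool:
    """Check whether the remote unit ID recorded on the controller
    is present in the execution node's work list (hypothetical helper)."""
    extra = controller_unit.get("ExtraData") or {}
    remote_id = extra.get("RemoteUnitID")
    return remote_id is not None and remote_id in node_work_list


# Controller-side entry, trimmed to the relevant fields.
controller_unit = {
    "ExtraData": {"RemoteNode": "10.131.254.154", "RemoteUnitID": "10"},
    "StateName": "Failed",
}

# Node-side listing, limited to the single entry visible in the output.
node_work_list = {
    "10.131.254.154119IMiKm": {"StateName": "Succeeded", "WorkType": "ansible-runner"},
}

print(remote_unit_exists(controller_unit, node_work_list))  # False
```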

xibriz commented Nov 26, 2024

More logs:

2024-11-26 07:03:58,012 WARNING  [d4328228e83442d4b8d0100c1fcbf803] awx.main.tasks.system Execution node attempting to rejoin as instance 10.131.254.154.
min_value in DecimalField should be Decimal type.
min_value in DecimalField should be Decimal type.
min_value in DecimalField should be Decimal type.
min_value in DecimalField should be Decimal type.
min_value in DecimalField should be Decimal type.
min_value in DecimalField should be Decimal type.
min_value in DecimalField should be Decimal type.
min_value in DecimalField should be Decimal type.
min_value in DecimalField should be Decimal type.
2024-11-26 07:04:00,615 WARNING  [d4328228e83442d4b8d0100c1fcbf803] awx.main.tasks.receptor While releasing work: WorkUnitCancelError 'error cancelling remote unit:  unknown work unit ' on node '10.131.254.154' with state 'Failed' work unit id '10'
2024-11-26 07:04:00,621 INFO     [d4328228e83442d4b8d0100c1fcbf803] awx.main.tasks.system Failed to find capacity of new or lost execution node 10.131.254.154, errors:
Receptor error from 10.131.254.154, detail:
Remote work unit is gone
2024-11-26 07:04:05,027 INFO     [d4328228e83442d4b8d0100c1fcbf803] awx.main.tasks.receptor Reaping orphaned work unit G9gDUxVA with params --worker-info
2024-11-26 07:04:05,703 ERROR    [d4328228e83442d4b8d0100c1fcbf803] awx.main.dispatch Worker failed to run task awx.main.tasks.system.awx_receptor_workunit_reaper(*[], **{}
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/worker/task.py", line 103, in perform_work
    result = self.run_callable(body)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/worker/task.py", line 78, in run_callable
    return _call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/tasks/system.py", line 693, in awx_receptor_workunit_reaper
    administrative_workunit_reaper(receptor_work_list)
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/tasks/receptor.py", line 222, in administrative_workunit_reaper
    receptor_ctl.simple_command(f"work release {unit_id}")
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/receptorctl/socket_interface.py", line 83, in simple_command
    return self.read_and_parse_json()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/receptorctl/socket_interface.py", line 60, in read_and_parse_json
    raise RuntimeError(text[7:])
RuntimeError: error cancelling remote unit:  unknown work unit 10
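
The traceback shows `simple_command` raising a plain `RuntimeError` when the remote node no longer knows the unit, and that exception ends the reaper task. One way a caller could tolerate this case is sketched below. It is an illustration only, not the AWX implementation or a tested patch: `release_ignoring_unknown` and `FakeControl` are hypothetical, and matching on the text `unknown work unit` is an assumption based on the message shown here.

```python
def release_ignoring_unknown(ctl, unit_id: str) -> bool:
    """Release a work unit, treating 'unknown work unit' as non-fatal.

    `ctl` is any object with a `simple_command(str)` method, as seen in
    the traceback. Returns True if the release succeeded, False if the
    unit was reported as unknown. Other errors are re-raised.
    """
    try:
        ctl.simple_command(f"work release {unit_id}")
        return True
    except RuntimeError as exc:
        # Assumption: the message text is the only signal available.
        if "unknown work unit" in str(exc):
            return False
        raise


class FakeControl:
    """Stand-in for a receptor control object (hypothetical, for the demo)."""

    def __init__(self, known):
        self.known = set(known)

    def simple_command(self, command: str):
        unit_id = command.split()[-1]
        if unit_id not in self.known:
            raise RuntimeError(
                f"error cancelling remote unit:  unknown work unit {unit_id}"
            )
        self.known.discard(unit_id)
        return {"released": unit_id}


ctl = FakeControl(known=["abc"])
print(release_ignoring_unknown(ctl, "abc"))  # True
print(release_ignoring_unknown(ctl, "10"))   # False
```

Swallowing the error only keeps the task from crashing. It does not by itself remove the stale unit from the controller's work list.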

manoharpalemfortestonly commented Dec 2, 2024

Hi,

It looks like I am also facing the same kind of issue in the AWX UI.

Receptor error from , detail:
Remote work unit is gone

[Screenshot: work unit error on AWX UI, 2024-12-02 164140]

==============
awx version: 24.5.0
awx system os version:
NAME="CentOS Stream"
VERSION="9"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="9"
PLATFORM_ID="platform:el9"
PRETTY_NAME="CentOS Stream 9"
ANSI_COLOR="0;31"

execution node OS version:
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS

tail -f /var/log/receptor/receptor.log
ERROR 2024/12/02 11:09:19 Error locating unit:
ERROR 2024/12/02 11:09:19 : unknown work unit
ERROR 2024/12/02 11:11:19 Error locating unit:
ERROR 2024/12/02 11:11:19 : unknown work unit
ERROR 2024/12/02 11:12:19 Error locating unit:
ERROR 2024/12/02 11:12:19 : unknown work unit
ERROR 2024/12/02 11:13:19 Error locating unit: 1
ERROR 2024/12/02 11:13:19 : unknown work unit
ERROR 2024/12/02 11:14:19 Error locating unit:
ERROR 2024/12/02 11:14:19 : unknown work unit
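
Most of the lines above carry no unit ID after `Error locating unit:`. A small sketch for tallying which IDs do appear, assuming the log format is exactly as pasted. `unit_ids_from_log` is a hypothetical helper, not part of receptor.

```python
import re
from collections import Counter

# Matches "ERROR <date> <time> Error locating unit: <id>", where <id> may be empty.
LOCATE_RE = re.compile(r"^ERROR \S+ \S+ Error locating unit:\s*(\S*)\s*$")


def unit_ids_from_log(lines):
    """Count the unit IDs named in 'Error locating unit' lines.

    Lines with no ID are counted under the key "<empty>".
    """
    counts = Counter()
    for line in lines:
        match = LOCATE_RE.match(line)
        if match:
            counts[match.group(1) or "<empty>"] += 1
    return counts


# A few lines copied from the log above.
sample = [
    "ERROR 2024/12/02 11:12:19 Error locating unit:",
    "ERROR 2024/12/02 11:12:19 : unknown work unit",
    "ERROR 2024/12/02 11:13:19 Error locating unit: 1",
    "ERROR 2024/12/02 11:13:19 : unknown work unit",
]
print(unit_ids_from_log(sample))
```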

===================

Receptor is running fine:

systemctl status receptor
● receptor.service - Ansible Receptor for Linux
Loaded: loaded (/lib/systemd/system/receptor.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2024-12-02 08:14:12 UTC; 3h 2min ago
Main PID: 38110 (receptor)
Tasks: 6 (limit: 4676)
Memory: 50.8M
CPU: 56.661s
CGroup: /system.slice/receptor.service
└─38110 /usr/bin/receptor --config /etc/receptor/receptor.conf

Dec 02 08:14:12 pzsafnBASTaec01 systemd[1]: Started Ansible Receptor for Linux.

Could anyone please assist with this issue as soon as possible? Thanks in advance!
