New way to determine hop node health #14414

fosterseth · 2023-09-05T19:05:37Z

SUMMARY

Old way of detecting health depended on Receptor Service advertisements. However, hop nodes need not run any services, thus they won't show up in the Advertisement lists.

We recently removed work-commands from remote hop node configurations. The presence of work commands also triggered a service Ad. Now hop nodes will stay in Install state indefinitely.

Instead, we can detect health of hop nodes based on whether they show up in the Known Connection Costs list.

ISSUE TYPE

Bug, Docs Fix or other nominal change

COMPONENT NAME

API

Old way of detecting health depended on Receptor Service advertisements. However, hop nodes need not run any services, thus they won't show up in the Advertisement lists. Instead, we can detect health of hop nodes based on whether they show up in the Known Connection Costs list.

AlanCoding · 2023-09-05T20:33:16Z

awx/main/tasks/system.py

+            if instance.node_type == Instance.Types.HOP and mesh_status['KnownConnectionCosts'].get(instance.hostname):
+                if instance.node_state in (Instance.States.UNAVAILABLE, Instance.States.INSTALLED):
+                    logger.warning(f'Hop node {instance.hostname}, has joined the receptor mesh')
+                instance.save_health_data(errors='', update_last_seen=True)


Later on in this method we do

instance.save(update_fields=['last_seen'])

This is the biggest problem I have right now about how this is evolving. The precedent leans towards the notion that health checks and updating last seen are separate processes. The update_last_seen kwarg may make sense for instance.mark_offline but it doesn't seem to make sense as something tacked-on to instance.save_health_data. The implication following from that is the save_health_data saves the data from a health check. If a health check (in the formally proper sense of running ansible-runner worker --worker-info) is not run, I'd rather not call it at all.

I also definitely don't want it get set to now() for hop nodes if it is set to parse_date(ad['Time']) for execution nodes. Does the "KnownConnectionCosts" not report a timestamp? Do nodes not report a timestamp if they're not running a control service? That may be a larger problem with the philosophy.

fosterseth · 2023-09-06T01:18:19Z

Upon further investigation, hop nodes do show up in advertisements. Currently hop nodes will be set up with a control service, so it is included in the Advertisements list for this reason, even if no work commands are configured.

"Advertisements": [
        {
            "ConnType": 2,
            "NodeID": "hop1",
            "Service": "control",
            "Tags": {
                "type": "Control Service"
            },
            "Time": "2023-09-06T01:07:36.423976905Z",
            "WorkCommands": null
        }
]

But a control service isn't needed for a hop node to function, we could turn it off. Maybe we just leave it in, since it serves a purpose of obtaining a "last seen".

Sadly Receptor doesn't publish a timestamp for when it receives routing information, so although Known Connection Costs reflects "real time", we don't have timestamp information associated with it.

AlanCoding · 2023-09-06T13:20:12Z

@fosterseth to start with requirements, I can establish that current behavior is that last_seen of hop nodes is updated right now. This can be tested easily with the development environment.

What would be the code consequence of not tracking last_seen, and what are the higher level consequences for requirements? If I shut down tools_receptor_hop in docker, then it stops updating its last seen field. That causes the main heartbeat method to hit:

awx/awx/main/tasks/system.py

Line 635 in d03a6a8

    
           other_inst.mark_offline(errors=_('Another cluster node has determined this instance to be unresponsive'))

Can confirm in the dev environment

tools_awx_1     | awx-dispatcher stderr | 2023-09-06 13:08:53,667 ERROR    [-] awx.main.tasks.system Host receptor-hop last checked in at 2023-09-06 13:05:10.771394+00:00, marked as lost.

And the instance reflects

            "errors": "Another cluster node has determined this instance to be unresponsive",

And that is for the hop node.

So if you're considering losing the hop node control service, so that hop nodes do not have a timestamp reported, then we're talking about losing the notion of node health for hop nodes. I'm not saying I'm against that, but that's what this discussion should pivot around.

For a set of static instances, it seems fine if hop nodes don't have health? Zombie nodes would appear healthy but are not connected. But If someone is using the receptor collection to set up a hop node, this seems like it would be much more problematic. What are the expectations for the user experience as someone tries to connect a new hop node, and it is available, then not-available, and then available?

What I'm currently leaning towards: If a hop node does not show up under Advertisements but does show up under the other list for connection cost, maybe we attach a unique error message suggesting to the user that maybe the control service isn't running.

fosterseth · 2023-09-06T15:00:00Z

upon further, further thought - hop nodes will need a control service so that it can run a "reload" command, which is part of the receptor collection now.

As you said, we could warn a user if a hop node is detected on the mesh that doesn't have a control service set up.

I will just close this PR for now, as it isn't a real problem for us at the moment.

@AlanCoding thanks for thinking about this

github-actions bot added the component:api label Sep 5, 2023

fosterseth requested a review from AlanCoding September 5, 2023 19:06

AlanCoding reviewed Sep 5, 2023

View reviewed changes

fosterseth closed this Sep 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New way to determine hop node health #14414

New way to determine hop node health #14414

fosterseth commented Sep 5, 2023 •

edited

Loading

AlanCoding Sep 5, 2023

fosterseth commented Sep 6, 2023 •

edited

Loading

AlanCoding commented Sep 6, 2023

fosterseth commented Sep 6, 2023

New way to determine hop node health #14414

New way to determine hop node health #14414

Conversation

fosterseth commented Sep 5, 2023 • edited Loading

SUMMARY

ISSUE TYPE

COMPONENT NAME

AlanCoding Sep 5, 2023

Choose a reason for hiding this comment

fosterseth commented Sep 6, 2023 • edited Loading

AlanCoding commented Sep 6, 2023

fosterseth commented Sep 6, 2023

fosterseth commented Sep 5, 2023 •

edited

Loading

fosterseth commented Sep 6, 2023 •

edited

Loading