Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New way to determine hop node health #14414

Closed
wants to merge 1 commit into from

Conversation

fosterseth
Copy link
Member

@fosterseth fosterseth commented Sep 5, 2023

SUMMARY

Old way of detecting health depended on Receptor Service advertisements. However, hop nodes need not run any services, thus they won't show up in the Advertisement lists.

We recently removed work-commands from remote hop node configurations. The presence of work commands also triggered a service Ad. Now hop nodes will stay in Install state indefinitely.

Instead, we can detect health of hop nodes based on whether they show up in the Known Connection Costs list.

ISSUE TYPE
  • Bug, Docs Fix or other nominal change
COMPONENT NAME
  • API

Old way of detecting health depended on
Receptor Service advertisements. However, hop
nodes need not run any services, thus they won't
show up in the Advertisement lists.

Instead, we can detect health of hop nodes based
on whether they show up in the Known Connection Costs
list.
if instance.node_type == Instance.Types.HOP and mesh_status['KnownConnectionCosts'].get(instance.hostname):
if instance.node_state in (Instance.States.UNAVAILABLE, Instance.States.INSTALLED):
logger.warning(f'Hop node {instance.hostname}, has joined the receptor mesh')
instance.save_health_data(errors='', update_last_seen=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Later on in this method we do

instance.save(update_fields=['last_seen'])

This is the biggest problem I have right now about how this is evolving. The precedent leans towards the notion that health checks and updating last seen are separate processes. The update_last_seen kwarg may make sense for instance.mark_offline but it doesn't seem to make sense as something tacked-on to instance.save_health_data. The implication following from that is the save_health_data saves the data from a health check. If a health check (in the formally proper sense of running ansible-runner worker --worker-info) is not run, I'd rather not call it at all.

I also definitely don't want it get set to now() for hop nodes if it is set to parse_date(ad['Time']) for execution nodes. Does the "KnownConnectionCosts" not report a timestamp? Do nodes not report a timestamp if they're not running a control service? That may be a larger problem with the philosophy.

@fosterseth
Copy link
Member Author

fosterseth commented Sep 6, 2023

Upon further investigation, hop nodes do show up in advertisements. Currently hop nodes will be set up with a control service, so it is included in the Advertisements list for this reason, even if no work commands are configured.

"Advertisements": [
        {
            "ConnType": 2,
            "NodeID": "hop1",
            "Service": "control",
            "Tags": {
                "type": "Control Service"
            },
            "Time": "2023-09-06T01:07:36.423976905Z",
            "WorkCommands": null
        }
]

But a control service isn't needed for a hop node to function, we could turn it off. Maybe we just leave it in, since it serves a purpose of obtaining a "last seen".

Sadly Receptor doesn't publish a timestamp for when it receives routing information, so although Known Connection Costs reflects "real time", we don't have timestamp information associated with it.

@AlanCoding
Copy link
Member

@fosterseth to start with requirements, I can establish that current behavior is that last_seen of hop nodes is updated right now. This can be tested easily with the development environment.

What would be the code consequence of not tracking last_seen, and what are the higher level consequences for requirements? If I shut down tools_receptor_hop in docker, then it stops updating its last seen field. That causes the main heartbeat method to hit:

other_inst.mark_offline(errors=_('Another cluster node has determined this instance to be unresponsive'))

Can confirm in the dev environment

tools_awx_1     | awx-dispatcher stderr | 2023-09-06 13:08:53,667 ERROR    [-] awx.main.tasks.system Host receptor-hop last checked in at 2023-09-06 13:05:10.771394+00:00, marked as lost.

And the instance reflects

            "errors": "Another cluster node has determined this instance to be unresponsive",

And that is for the hop node.

So if you're considering losing the hop node control service, so that hop nodes do not have a timestamp reported, then we're talking about losing the notion of node health for hop nodes. I'm not saying I'm against that, but that's what this discussion should pivot around.

For a set of static instances, it seems fine if hop nodes don't have health? Zombie nodes would appear healthy but are not connected. But If someone is using the receptor collection to set up a hop node, this seems like it would be much more problematic. What are the expectations for the user experience as someone tries to connect a new hop node, and it is available, then not-available, and then available?

What I'm currently leaning towards: If a hop node does not show up under Advertisements but does show up under the other list for connection cost, maybe we attach a unique error message suggesting to the user that maybe the control service isn't running.

@fosterseth
Copy link
Member Author

upon further, further thought - hop nodes will need a control service so that it can run a "reload" command, which is part of the receptor collection now.

As you said, we could warn a user if a hop node is detected on the mesh that doesn't have a control service set up.

I will just close this PR for now, as it isn't a real problem for us at the moment.

@AlanCoding thanks for thinking about this

@fosterseth fosterseth closed this Sep 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants