-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New way to determine hop node health #14414
Conversation
Old way of detecting health depended on Receptor Service advertisements. However, hop nodes need not run any services, thus they won't show up in the Advertisement lists. Instead, we can detect health of hop nodes based on whether they show up in the Known Connection Costs list.
if instance.node_type == Instance.Types.HOP and mesh_status['KnownConnectionCosts'].get(instance.hostname): | ||
if instance.node_state in (Instance.States.UNAVAILABLE, Instance.States.INSTALLED): | ||
logger.warning(f'Hop node {instance.hostname}, has joined the receptor mesh') | ||
instance.save_health_data(errors='', update_last_seen=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Later on in this method we do
instance.save(update_fields=['last_seen'])
This is the biggest problem I have right now about how this is evolving. The precedent leans towards the notion that health checks and updating last seen are separate processes. The update_last_seen
kwarg may make sense for instance.mark_offline
but it doesn't seem to make sense as something tacked-on to instance.save_health_data
. The implication following from that is the save_health_data
saves the data from a health check. If a health check (in the formally proper sense of running ansible-runner worker --worker-info
) is not run, I'd rather not call it at all.
I also definitely don't want it get set to now()
for hop nodes if it is set to parse_date(ad['Time'])
for execution nodes. Does the "KnownConnectionCosts" not report a timestamp? Do nodes not report a timestamp if they're not running a control service? That may be a larger problem with the philosophy.
Upon further investigation, hop nodes do show up in advertisements. Currently hop nodes will be set up with a control service, so it is included in the Advertisements list for this reason, even if no work commands are configured. "Advertisements": [
{
"ConnType": 2,
"NodeID": "hop1",
"Service": "control",
"Tags": {
"type": "Control Service"
},
"Time": "2023-09-06T01:07:36.423976905Z",
"WorkCommands": null
}
] But a control service isn't needed for a hop node to function, we could turn it off. Maybe we just leave it in, since it serves a purpose of obtaining a "last seen". Sadly Receptor doesn't publish a timestamp for when it receives routing information, so although Known Connection Costs reflects "real time", we don't have timestamp information associated with it. |
@fosterseth to start with requirements, I can establish that current behavior is that What would be the code consequence of not tracking Line 635 in d03a6a8
Can confirm in the dev environment
And the instance reflects "errors": "Another cluster node has determined this instance to be unresponsive", And that is for the hop node. So if you're considering losing the hop node control service, so that hop nodes do not have a timestamp reported, then we're talking about losing the notion of node health for hop nodes. I'm not saying I'm against that, but that's what this discussion should pivot around. For a set of static instances, it seems fine if hop nodes don't have health? Zombie nodes would appear healthy but are not connected. But If someone is using the receptor collection to set up a hop node, this seems like it would be much more problematic. What are the expectations for the user experience as someone tries to connect a new hop node, and it is available, then not-available, and then available? What I'm currently leaning towards: If a hop node does not show up under Advertisements but does show up under the other list for connection cost, maybe we attach a unique error message suggesting to the user that maybe the control service isn't running. |
upon further, further thought - hop nodes will need a control service so that it can run a "reload" command, which is part of the receptor collection now. As you said, we could warn a user if a hop node is detected on the mesh that doesn't have a control service set up. I will just close this PR for now, as it isn't a real problem for us at the moment. @AlanCoding thanks for thinking about this |
SUMMARY
Old way of detecting health depended on Receptor Service advertisements. However, hop nodes need not run any services, thus they won't show up in the Advertisement lists.
We recently removed work-commands from remote hop node configurations. The presence of work commands also triggered a service Ad. Now hop nodes will stay in Install state indefinitely.
Instead, we can detect health of hop nodes based on whether they show up in the Known Connection Costs list.
ISSUE TYPE
COMPONENT NAME