Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New way to determine hop node health #14414

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 5 additions & 2 deletions awx/main/models/ha.py
Original file line number Diff line number Diff line change
Expand Up @@ -315,12 +315,15 @@ def refresh_capacity_fields(self):

def save_health_data(self, version=None, cpu=0, memory=0, uuid=None, update_last_seen=False, errors=''):
update_fields = ['errors']
if self.node_type != 'hop':
if self.node_type != Instance.Types.HOP:
self.last_health_check = now()
update_fields.append('last_health_check')

if update_last_seen:
self.last_seen = self.last_health_check
if self.node_type == Instance.Types.HOP:
self.last_seen = now()
else:
self.last_seen = self.last_health_check
update_fields.append('last_seen')

if uuid is not None and self.uuid != uuid:
Expand Down
17 changes: 10 additions & 7 deletions awx/main/tasks/system.py
Original file line number Diff line number Diff line change
Expand Up @@ -518,6 +518,12 @@ def inspect_execution_and_hop_nodes(instance_list):

inspect_established_receptor_connections(mesh_status)

for instance in instance_list:
if instance.node_type == Instance.Types.HOP and mesh_status['KnownConnectionCosts'].get(instance.hostname):
if instance.node_state in (Instance.States.UNAVAILABLE, Instance.States.INSTALLED):
logger.warning(f'Hop node {instance.hostname}, has joined the receptor mesh')
instance.save_health_data(errors='', update_last_seen=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Later on in this method we do

instance.save(update_fields=['last_seen'])

This is the biggest problem I have right now about how this is evolving. The precedent leans towards the notion that health checks and updating last seen are separate processes. The update_last_seen kwarg may make sense for instance.mark_offline but it doesn't seem to make sense as something tacked-on to instance.save_health_data. The implication following from that is the save_health_data saves the data from a health check. If a health check (in the formally proper sense of running ansible-runner worker --worker-info) is not run, I'd rather not call it at all.

I also definitely don't want it get set to now() for hop nodes if it is set to parse_date(ad['Time']) for execution nodes. Does the "KnownConnectionCosts" not report a timestamp? Do nodes not report a timestamp if they're not running a control service? That may be a larger problem with the philosophy.


nowtime = now()
workers = mesh_status['Advertisements']

Expand All @@ -530,6 +536,10 @@ def inspect_execution_and_hop_nodes(instance_list):
logger.warning(f"Unrecognized node advertising on mesh: {hostname}")
continue

# Only execution nodes should be dealt with by execution_node_health_check
if instance.node_type == Instance.Types.HOP:
continue

# Control-plane nodes are dealt with via local_health_check instead.
if instance.node_type in (Instance.Types.CONTROL, Instance.Types.HYBRID):
continue
Expand All @@ -540,13 +550,6 @@ def inspect_execution_and_hop_nodes(instance_list):
instance.last_seen = last_seen
instance.save(update_fields=['last_seen'])

# Only execution nodes should be dealt with by execution_node_health_check
if instance.node_type == Instance.Types.HOP:
if instance.node_state in (Instance.States.UNAVAILABLE, Instance.States.INSTALLED):
logger.warning(f'Hop node {hostname}, has rejoined the receptor mesh')
instance.save_health_data(errors='')
continue

if instance.node_state in (Instance.States.UNAVAILABLE, Instance.States.INSTALLED):
# if the instance *was* lost, but has appeared again,
# attempt to re-establish the initial capacity and version
Expand Down
Loading