Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix [#847]: Send a recovery notification if the object recovered while flapping #971

Conversation

tsadpbb
Copy link
Contributor

@tsadpbb tsadpbb commented Jun 25, 2024

To Test

define host {
    host_name               dummy_host
    address                 localhost
    max_check_attempts      2
    check_period            24x7
    notification_period     24x7
    contact_groups          admins
    check_command           check_dummy!0!Dummy
    check_interval          2
}

define service {
    host_name               dummy_host
    service_description     test_service
    check_command           check_dummy!0!Dummy
    max_check_attempts      2
    check_interval          2
    check_period            24x7
    contact_groups          admins
    notification_period     24x7
}

You can switch the check_dummy command between 2 and 3 until the host/service is flapping. Then make sure that you set it back to returning 0. When the host/service is no longer flapping, it should send a recovery notification like one would expect.

@tsadpbb
Copy link
Contributor Author

tsadpbb commented Jun 25, 2024

Related issues and commits
#557
e08f910

#572
f86598e

@sawolf
Copy link
Contributor

sawolf commented Jun 25, 2024

Hey Dylan - from looking at the "related patches", I would be fairly surprised if either of those introduced the bug you're seeing. Are you telling me that the issue wasn't present in nagios-4.4.1 and is present starting in nagios-4.4.2? If so, did you test that we're not reintroducing the bugs in #557 and #572?

@tsadpbb
Copy link
Contributor Author

tsadpbb commented Jun 25, 2024

The issue isn't present in 4.4.2 but is introduced in 4.4.3. I tested this and verified this is true.

If you turn on debugging for notifications, you'll notice the message We shouldn't notify about this recovery when clear_service_flap tries to send a notification.
The conditional to trigger that is at line 540 in notifications.c

	if(svc->current_state == STATE_OK && svc->notified_on == 0) {
		log_debug_info(DEBUGL_NOTIFICATIONS, 1, "We shouldn't notify about this recovery.\n");
		return ERROR;
		}

The main thing to look for is svc->notified_on == 0

When a hard state change happens and the status is ok while the host is flapping, the following occurs

if(svc->current_state == STATE_OK && svc->state_type == HARD_STATE && hard_state_change == TRUE) {
		svc->current_notification_number = 0;
		svc->notified_on = 0;
		}

This is the change from f86598e

svc->notified_on gets set to 0, and when the service notification command is called on line 377 in function clear_service_flap in flapping.c, it fails the service notification viability check and the notification is not made.

My fix for this is to not clear out the notified_on value when the service is flapping. This way, the service_notification command that occurs in clear_service_flap will actually succeed.

That is also why I clear out those values when the clear_service_flap sends the recovery notification, otherwise those values won't be reset on a "recovery" like it seems they should be.

Another issue that I'm fixing starts on line 327 in flapping.c

	/* see if we should check to send a recovery notification out when flapping stops */
	if(svc->current_state != STATE_OK && svc->current_notification_number > 0)
		svc->check_flapping_recovery_notification = TRUE;
	else
		svc->check_flapping_recovery_notification = FALSE;

The goal of this is to determine whether or not we should send a recovery notification when flapping stops. I think the assumption was made that a notification would get sent out if the state was OK, but perhaps the ordering of setting is_flapping to true and when a recovery notification is made was changed at some point. So now, if the state changes to OK, but that state change also makes the service turn to flapping, then when the flapping resolves, check_flapping_recovery_notification is false and a recovery notification is not made.

I'm not sure when this bug was introduced or if it always existed. What I know is that I can replicate the bug and my changes resolve it. I can try to do some code archeology if you want.

The same logic applies to the host equivalent things.

As for whether or not these changes reintroduce the bugs, I will have to verify.

@tsadpbb
Copy link
Contributor Author

tsadpbb commented Jul 15, 2024

I confirmed that the bugs in #557 and #572 were not re introduced

@sawolf sawolf self-assigned this Jul 24, 2024
@sawolf
Copy link
Contributor

sawolf commented Aug 6, 2024

I did not get around to testing this, but the write-up seems reasonable. Feel free to merge this

@tsadpbb tsadpbb closed this Sep 9, 2024
@tsadpbb tsadpbb deleted the 847-flapping-recovery-notifications branch September 9, 2024 20:40
@tsadpbb tsadpbb restored the 847-flapping-recovery-notifications branch September 9, 2024 20:48
@tsadpbb tsadpbb reopened this Sep 9, 2024
@aaronagios
Copy link
Contributor

@tsadpbb Sebastian was satisfied with this a while back. Are we good to go?

@dylan-at-nagios dylan-at-nagios merged commit 53ad3df into NagiosEnterprises:master Sep 19, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants