Fix [#847]: Send a recovery notification if the object recovered while flapping #971
Hey Dylan - from looking at the "related patches", I would be fairly surprised if either of those introduced the bug you're seeing. Are you telling me that the issue wasn't present in nagios-4.4.1 and is present starting in nagios-4.4.2? If so, did you test that we're not reintroducing the bugs in #557 and #572?
The issue isn't present in 4.4.2 but is introduced in 4.4.3. I tested this and verified this is true. If you turn on debugging for notifications, you'll notice the message
The main thing to look for is When a hard state change happens and the status is ok while the host is flapping, the following occurs
This is the change from f86598e
My fix for this is to not clear out the notified_on value when the service is flapping. This way, the service_notification command that occurs in clear_service_flap will actually succeed. That is also why I clear out those values when the clear_service_flap sends the recovery notification, otherwise those values won't be reset on a "recovery" like it seems they should be. Another issue that I'm fixing starts on line 327 in flapping.c
The goal of this is to determine whether or not we should send a recovery notification when flapping stops. I think the assumption was made that a notification would get sent out if the state was OK, but perhaps the ordering of setting I'm not sure when this bug was introduced or if it always existed. What I know is that I can replicate the bug and my changes resolve it. I can try to do some code archeology if you want. The same logic applies to the host equivalent things. As for whether or not these changes reintroduce the bugs, I will have to verify.
I did not get around to testing this, but the write-up seems reasonable. Feel free to merge this.
@tsadpbb Sebastian was satisfied with this a while back. Are we good to go?
To Test
Switch the check_dummy command between returning 2 (CRITICAL) and 3 (UNKNOWN) until the host/service is flapping, then set it back to returning 0 (OK). When the host/service is no longer flapping, it should send a recovery notification, as one would expect.