Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove immediate flush on reload/restart #3419

Conversation

grobinson-grafana
Copy link
Contributor

@grobinson-grafana grobinson-grafana commented Jul 5, 2023

What this PR does

This pull request changes Alertmanager so it no longer flushes aggregation groups on configuration reload or restart of Alertmanager as this behavior causes a number of issues:

  1. Alertmanager will send notifications for inhibited alerts if the inhibited alert is sent to Alertmanager before the inhibiting alert following a restart (https://www.grobinson.net/best-practices-for-avoiding-race-conditions-in-inhibition-rules.html)

  2. Reloading Alertmanager via /-/reload can cause incomplete flushes of aggregation groups (Reloading the config leads to incorrect notifications being sent due to a race condition #3407)

  3. Reloading or restarting an Alertmanager while sending a notification can cause a race between the reloaded/restarted Alertmanager and the next peer in the Alertmanager cluster. This can, in some cases, cause firing and resolved notifications to be sent out of order. For example, resolved, firing, resolved.

A potential issue with this change is that following a reload or restart of Alertmanager, alerts that were waiting for group_wait will have to wait from the beginning of group_wait again. If group_wait is large then notifications could take longer to send then expected. Frequent reloads in combination with a large group_wait could even prevent alerts from being flushed at all.

@grobinson-grafana grobinson-grafana force-pushed the grobinson/remove-immediate-flush branch from bd8ae85 to 9c0547b Compare July 5, 2023 15:53
@grobinson-grafana grobinson-grafana marked this pull request as draft July 5, 2023 15:54
@grobinson-grafana grobinson-grafana force-pushed the grobinson/remove-immediate-flush branch from f037397 to 1f113be Compare July 6, 2023 12:51
This commit changes Alertmanager so it no longer flushes aggregation
groups on configuration reload or restart of Alertmanager as this
behavior causes a number of issues:

1. Alertmanager will send notifications for inhibited alerts if the
   inhibited alert is sent to Alertmanager before the inhibiting alert
   following a restart

2. Reloading Alertmanager via /-/reload can cause incomplete flushes
   of aggregation groups (prometheus#3407)

A potential issue with this change is that following a reload or restart
of Alertmanager, alerts that were waiting for group_wait will have to
wait from the beginning of group_wait again. If group_wait is large
then notifications could take longer to send then expected.

Signed-off-by: George Robinson <[email protected]>
Signed-off-by: George Robinson <[email protected]>
Signed-off-by: George Robinson <[email protected]>
Signed-off-by: George Robinson <[email protected]>
@grobinson-grafana grobinson-grafana marked this pull request as ready for review October 12, 2023 20:28
@grobinson-grafana
Copy link
Contributor Author

I've just read a comment on another issue that suggests this might be causing issues for users:

Any progress on this? We're running with a patch right now that just delays the start of the dispatcher because we were getting lots of false alarms for alerts that should be inhibited when we reloaded configs.
#3167 (comment)

@verejoel
Copy link

verejoel commented Dec 1, 2023

We've been running in production with a patched version of AlertManager that includes this PR since a week. Is there any chance we can get some momentum behind merging this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants