Remove immediate flush on reload/restart #3419

grobinson-grafana · 2023-07-05T15:49:58Z

What this PR does

This pull request changes Alertmanager so that it no longer flushes aggregation groups immediately on a configuration reload or restart, because this behavior causes a number of issues:

Alertmanager will send notifications for inhibited alerts if, following a restart, the inhibited alert is sent to Alertmanager before the inhibiting alert (https://www.grobinson.net/best-practices-for-avoiding-race-conditions-in-inhibition-rules.html)
Reloading Alertmanager via /-/reload can cause incomplete flushes of aggregation groups (Reloading the config leads to incorrect notifications being sent due to a race condition #3407)
Reloading or restarting an Alertmanager while it is sending a notification can cause a race between the reloaded/restarted Alertmanager and the next peer in the Alertmanager cluster. In some cases this can cause firing and resolved notifications to be sent out of order, for example: resolved, firing, resolved.
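The first issue above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not Alertmanager's dispatcher or inhibitor code: the function `notifiedAtFlush` and the arrival times are hypothetical, times are plain seconds since the restart, and an alert is treated as inhibited only if the inhibiting alert has already arrived when the group is flushed.

```go
package main

import "fmt"

// notifiedAtFlush reports whether a notification would be sent for the
// target (to-be-inhibited) alert when its group is flushed at flushAt.
// Hypothetical model: the target is inhibited only if the inhibiting
// (source) alert has already arrived by the time of the flush.
// All times are seconds since the restart.
func notifiedAtFlush(targetArrival, sourceArrival, flushAt int) bool {
	targetPresent := targetArrival <= flushAt
	sourcePresent := sourceArrival <= flushAt
	return targetPresent && !sourcePresent
}

func main() {
	const (
		targetArrival = 1  // inhibited alert is re-sent first
		sourceArrival = 5  // inhibiting alert is re-sent a little later
		groupWait     = 30 // assumed group_wait, in seconds
	)

	// Immediate flush: the group is flushed as soon as the target arrives.
	immediate := notifiedAtFlush(targetArrival, sourceArrival, targetArrival)

	// Without the immediate flush: the group waits for group_wait first.
	delayed := notifiedAtFlush(targetArrival, sourceArrival, targetArrival+groupWait)

	fmt.Println("immediate flush notifies:", immediate)
	fmt.Println("flush after group_wait notifies:", delayed)
}
```

In this model the immediate flush notifies for an alert that should have been inhibited, while waiting for group_wait gives the inhibiting alert time to arrive.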

A potential issue with this change is that following a reload or restart of Alertmanager, alerts that were waiting for group_wait will have to wait from the beginning of group_wait again. If group_wait is large, notifications could take longer to send than expected. Frequent reloads in combination with a large group_wait could even prevent alerts from being flushed at all.

This commit changes Alertmanager so it no longer flushes aggregation groups on configuration reload or restart of Alertmanager as this behavior causes a number of issues:

1. Alertmanager will send notifications for inhibited alerts if the inhibited alert is sent to Alertmanager before the inhibiting alert following a restart
2. Reloading Alertmanager via /-/reload can cause incomplete flushes of aggregation groups (prometheus#3407)

A potential issue with this change is that following a reload or restart of Alertmanager, alerts that were waiting for group_wait will have to wait from the beginning of group_wait again. If group_wait is large then notifications could take longer to send than expected.

Signed-off-by: George Robinson <[email protected]>

grobinson-grafana · 2023-10-29T21:02:06Z

I've just read a comment on another issue that suggests the immediate flush might be causing problems for users:

Any progress on this? We're running with a patch right now that just delays the start of the dispatcher because we were getting lots of false alarms for alerts that should be inhibited when we reloaded configs.
#3167 (comment)

verejoel · 2023-12-01T18:16:16Z

We've been running a patched version of Alertmanager that includes this PR in production for a week. Is there any chance we can get some momentum behind merging this?

grobinson-grafana force-pushed the grobinson/remove-immediate-flush branch from bd8ae85 to 9c0547b Compare July 5, 2023 15:53

grobinson-grafana marked this pull request as draft July 5, 2023 15:54

grobinson-grafana force-pushed the grobinson/remove-immediate-flush branch from f037397 to 1f113be Compare July 6, 2023 12:51

grobinson-grafana added 4 commits October 12, 2023 20:54

Fix lint

22e2390

Signed-off-by: George Robinson <[email protected]>

Fix lint again

a1b4eee

Signed-off-by: George Robinson <[email protected]>

Fix tests

5d2478b

Signed-off-by: George Robinson <[email protected]>

grobinson-grafana force-pushed the grobinson/remove-immediate-flush branch from 1f113be to 5d2478b Compare October 12, 2023 20:01

grobinson-grafana added 2 commits October 12, 2023 21:05

Remove mutex as timer cannot be reset from multiple goroutines

9849fe5

Signed-off-by: George Robinson <[email protected]>

Update comments in tests

0266390

Signed-off-by: George Robinson <[email protected]>

grobinson-grafana marked this pull request as ready for review October 12, 2023 20:28

grobinson-grafana closed this Apr 16, 2024

grobinson-grafana deleted the grobinson/remove-immediate-flush branch April 16, 2024 14:44

grobinson-grafana mentioned this pull request Sep 18, 2024

Alertmanager sending duplicate notifications after 'resolved' notification when running with multiple replicas #4008

Open

grobinson-grafana mentioned this pull request Oct 15, 2024

Alerts that should be inhibited fire on Alertmanager reload/restart #4064

Open
