-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove immediate flush on reload/restart #3419
Remove immediate flush on reload/restart #3419
Conversation
bd8ae85
to
9c0547b
Compare
f037397
to
1f113be
Compare
This commit changes Alertmanager so it no longer flushes aggregation groups on configuration reload or restart of Alertmanager as this behavior causes a number of issues: 1. Alertmanager will send notifications for inhibited alerts if the inhibited alert is sent to Alertmanager before the inhibiting alert following a restart 2. Reloading Alertmanager via /-/reload can cause incomplete flushes of aggregation groups (prometheus#3407) A potential issue with this change is that following a reload or restart of Alertmanager, alerts that were waiting for group_wait will have to wait from the beginning of group_wait again. If group_wait is large then notifications could take longer to send then expected. Signed-off-by: George Robinson <[email protected]>
Signed-off-by: George Robinson <[email protected]>
Signed-off-by: George Robinson <[email protected]>
Signed-off-by: George Robinson <[email protected]>
1f113be
to
5d2478b
Compare
Signed-off-by: George Robinson <[email protected]>
Signed-off-by: George Robinson <[email protected]>
I've just read a comment on another issue that suggests this might be causing issues for users:
|
We've been running in production with a patched version of AlertManager that includes this PR since a week. Is there any chance we can get some momentum behind merging this? |
What this PR does
This pull request changes Alertmanager so it no longer flushes aggregation groups on configuration reload or restart of Alertmanager as this behavior causes a number of issues:
Alertmanager will send notifications for inhibited alerts if the inhibited alert is sent to Alertmanager before the inhibiting alert following a restart (https://www.grobinson.net/best-practices-for-avoiding-race-conditions-in-inhibition-rules.html)
Reloading Alertmanager via /-/reload can cause incomplete flushes of aggregation groups (Reloading the config leads to incorrect notifications being sent due to a race condition #3407)
Reloading or restarting an Alertmanager while sending a notification can cause a race between the reloaded/restarted Alertmanager and the next peer in the Alertmanager cluster. This can, in some cases, cause firing and resolved notifications to be sent out of order. For example, resolved, firing, resolved.
A potential issue with this change is that following a reload or restart of Alertmanager, alerts that were waiting for group_wait will have to wait from the beginning of group_wait again. If group_wait is large then notifications could take longer to send then expected. Frequent reloads in combination with a large group_wait could even prevent alerts from being flushed at all.