
Deduplicate alerts #11

Open
SerialVelocity opened this issue Sep 6, 2020 · 11 comments

@SerialVelocity

Hey,

Currently, each alert (or set of alerts) creates a new notification. It would be nice if duplicates didn't create a new notification every time. Maybe a UUID hash of the groupKey?
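
For reference, the Alertmanager webhook payload already carries the group key, so repeated notifications for the same group arrive with an identical value. A minimal sketch (plain Python for illustration, not the actual xMatters trigger script) of pulling it out as a dedup key:

```python
import json

def extract_group_key(webhook_body: str) -> str:
    """Return the Alertmanager groupKey from a webhook notification.

    The payload shape follows the documented Alertmanager webhook format,
    which includes "groupKey", "status", "groupLabels", and "alerts".
    """
    payload = json.loads(webhook_body)
    return payload["groupKey"]  # e.g. '{}:{alertname="InstanceDown"}'
```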

@xMTinkerer (Contributor)

Ah, yeah, that makes sense. No one wants duplicated alerts.
Would you be averse to an MD5 hash? Something like this one. Or is a full UUID the way to go? Or can we just match on the groupKey itself and skip the hash? These guys seem to be talking about keeping the groupKey in plaintext.
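
For comparison, here is a hedged sketch of the three options on the table: matching on the raw groupKey, an MD5 digest of it, and a deterministic UUID derived from it. The example value is made up.

```python
import hashlib
import uuid

group_key = '{}:{alertname="InstanceDown"}'  # illustrative groupKey value

plain = group_key                                            # match on the groupKey itself
md5_id = hashlib.md5(group_key.encode("utf-8")).hexdigest()  # 32-char hex digest
uuid_id = str(uuid.uuid5(uuid.NAMESPACE_URL, group_key))     # name-based UUID, stable per groupKey

print(plain, md5_id, uuid_id, sep="\n")
```

Any of the three is stable across repeats of the same group, so the choice mostly comes down to whether a fixed-length identifier is easier to store and compare.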

@xMTinkerer (Contributor) commented Sep 9, 2020

EDIT: Actually, a better way to do this is with Event Flood Control. I've updated the below to use that method instead.

I can work on re-releasing the workflow, but that would require you to re-upload the zip file and then reconfigure your Alertmanager URLs. If you'd like to add this to your existing workflow:

  1. Add the groupKey as an output to the HTTP trigger:
    [screenshot]

  2. Add a groupKey property to the event form. This is done on the Form > Layout page. After creating the property, drag it onto the layout:
    [screenshot]

  3. Navigate to Flood Control and select the rule with Source=Prometheus. Then drag the new groupKey property into Selected Properties and set the trigger conditions in the section below:
    [screenshot]


Option 2

  1. Add a Get Events and a switch step before the create event step:
    [screenshot]

  2. Edit the Get Events step like so:
    Status: ACTIVE
    PropertyName: groupKey#en
    PropertyValue: groupKey
    [screenshot]

  3. Update the switch step to inspect the number of events found by the Get Events step (see the sketch after this list):
    [screenshot]
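
Outside of Flow Designer, the same check can be sketched against the xMatters REST API. This is a hedged illustration: the base URL and credentials are placeholders, the `groupKey#en` property name comes from step 2 above, and the response field used for the count is an assumption about the paginated events response.

```python
import requests

# Placeholder instance URL and basic-auth credentials; substitute your own.
BASE = "https://example.xmatters.com/api/xm/1"
AUTH = ("rest-user", "rest-password")

def has_active_event(group_key: str) -> bool:
    """Sketch of the Get Events check: look for ACTIVE events whose
    groupKey#en property matches the incoming Alertmanager groupKey."""
    resp = requests.get(
        f"{BASE}/events",
        params={
            "status": "ACTIVE",
            "propertyName": "groupKey#en",  # property name plus language suffix, as in step 2
            "propertyValue": group_key,
        },
        auth=AUTH,
        timeout=10,
    )
    resp.raise_for_status()
    # Assumes the events endpoint returns a paginated object with a "total" field.
    return resp.json().get("total", 0) > 0

# Switch-step equivalent: only create a new event when nothing active matches.
# if not has_active_event(group_key): create_event(...)
```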

@SerialVelocity (Author)

I don't think those options actually stop the duplicates. Let's say you have repeat_interval set to 2h. With your changes, I think you'd ignore the first duplicate at the 2-hour mark, but then the alert would be closed by xMatters after 3 hours, and a new one would be triggered at the 4-hour mark?

@xMTinkerer (Contributor)

The event can be configured to live for up to 24 hours. So the question is how long until it would be considered a new alert? I don't think it makes sense to deduplicate forever, so there must be a timeframe limit.

I don't think the active/closed status of the event is considered in the flood control, so that is irrelevant. Let's walk through an example. Assume the repeat_interval in your alertmanager.yaml is set to 2h, and you set the flood control in xMatters to "More than 1 events within 4 hours".

12:00 - Prometheus triggers the xMatters integration and flood control allows it.
14:00 - Prometheus triggers the integration again because the repeat_interval elapses, but flood control stops the event from being created because it is still within the 4-hour window.
16:00 - Prometheus triggers the xMatters integration again and flood control allows it because the 4-hour window has elapsed.
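
The suppression in that walkthrough amounts to a sliding window per groupKey. A small illustrative sketch (not xMatters internals) that replays the timeline above:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=4)  # "More than 1 events within 4 hours"
last_created = {}            # groupKey -> time of the last event that was actually created

def allow_event(group_key: str, now: datetime) -> bool:
    """Return True if an event should be created, False if flood control suppresses it."""
    last = last_created.get(group_key)
    if last is not None and now - last < WINDOW:
        return False
    last_created[group_key] = now
    return True

day = datetime(2020, 9, 9)
for hour in (12, 14, 16):
    print(f"{hour}:00", allow_event("gk", day + timedelta(hours=hour)))
# 12:00 True  (event created)
# 14:00 False (suppressed: still inside the 4-hour window)
# 16:00 True  (window elapsed, new event created)
```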

What would you like to see happen?

@SerialVelocity (Author)

So, these are the cases I'm thinking of:

  • An alert is triggered. You fix it in 10 minutes and the alert is resolved. You then get the same alert 20 minutes later because something went wrong again.

    • A concrete example is orphaned pods in Kubernetes. Pods can be left in an inconsistent state when volumes are unmounted improperly. This can happen fairly often.
  • A low-priority alert goes off on Friday evening. Because the weekend is coming up, it is deferred until Monday. You don't want the alert to retrigger, but you still want it to be active.

    • A concrete example is CephClusterWarnState, which means that something is not quite right in Ceph. This could be a disk that is >75% full or a service that crashed and restarted. Both are low priority to look into.

@xMTinkerer (Contributor)

Ah, that's helpful context.
I'm guessing both of these would be set up with different groupKey values? If not, any kind of deduplication would do the wrong thing and lead to under- or over-alerting.

Is this where silences could be helpful? You could reply with a silence and we would write the silence back to Prometheus. The duration of the silence might be something to work out, or we could provide multiple options such as "Silence for 30 minutes", "Silence for 2 days", etc.
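
For what it's worth, writing the silence back is a single call to Alertmanager's v2 silences endpoint. A hedged sketch, with placeholder URL, matcher, and durations:

```python
from datetime import datetime, timedelta, timezone
import requests

ALERTMANAGER = "http://alertmanager.example:9093"  # placeholder address

def create_silence(label: str, value: str, hours: float, author: str, comment: str) -> str:
    """Silence alerts matching label=value for the given number of hours.

    Returns the silence ID reported by Alertmanager.
    """
    now = datetime.now(timezone.utc)
    body = {
        "matchers": [{"name": label, "value": value, "isRegex": False}],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": author,
        "comment": comment,
    }
    resp = requests.post(f"{ALERTMANAGER}/api/v2/silences", json=body, timeout=10)
    resp.raise_for_status()
    return resp.json()["silenceID"]

# e.g. a "Silence for 2 days" response option for the CephClusterWarnState case above:
# create_silence("alertname", "CephClusterWarnState", 48, "xmatters", "Deferred to Monday")
```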

@SerialVelocity (Author)

I think the groupKey would be the same.

Is there a way to check if there is an active alert with the same group key? If there is, can you extend the alert that already exists?

The workflow above would work like this:

  • Prometheus alert goes off, an xmatters alert gets created
  • 2h later, prometheus sends the alert again. xmatters finds the existing active alert and extends its lifetime to the maximum again (repeat for two days)
  • 1h later, you fix the issue and prometheus sends a "resolved" alert. Xmatters resolves the active alert
  • 10m later, the alert gets triggered again. Xmatters can't find an existing active alert so it creates a new one
  • 10m later, you fix the issue again, prometheus sends a "resolved" alert, xmatters resolves the active alert

@xMTinkerer (Contributor)

Is there a way to check if there is an active alert with the same group key?

Yes, the Get Events step hooked up in my screenshot above will do this. You would pass the groupKey name and value as propertyName and propertyValue, respectively. As noted, you would need to parse the groupKey out of the HTTP trigger. (This is a code change I'll make in the branch here.)

can you extend the alert that already exists?

Aside from our newly launched incidents, runtime objects in xMatters are "events"; they indicate that something changed, which is why we don't actually term them alerts.
Once events are created they can be terminated (or suspended and resumed), but not otherwise altered. This makes sense because we wouldn't want to page one person with some information and then, 5 minutes later, escalate to another person and deliver different information for the same event. A change in the information is itself a change that should be communicated.

With that all said, how does this sound?

  • Prometheus alert goes off, an xMatters ~~alert~~ event gets created
  • 2h later, Prometheus sends the alert again. xMatters finds the existing active ~~alert~~ event and ~~extends its lifetime to the maximum again (repeat for two days)~~ does nothing.
  • 1h later, you fix the issue and Prometheus sends a "resolved" alert. xMatters ~~resolves~~ terminates the active ~~alert~~ event (should we notify people here?)
  • 10m later, the alert gets triggered again. xMatters can't find an existing active event so it creates a new one
  • 10m later, you fix the issue again, Prometheus sends a "resolved" alert, xMatters resolves the active event
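
Put together, the "resolved" branch of that flow reduces to: find the ACTIVE events carrying the same groupKey and terminate them. A hedged sketch follows; the status-change call (POSTing an id and status back to the events endpoint) is an assumption about the REST API rather than confirmed behavior, and the URL and credentials are placeholders.

```python
import requests

BASE = "https://example.xmatters.com/api/xm/1"  # placeholder instance
AUTH = ("rest-user", "rest-password")

def handle_resolved(group_key: str) -> None:
    """Sketch of the 'resolved' branch: terminate every ACTIVE event whose
    groupKey#en property matches the incoming Alertmanager groupKey."""
    resp = requests.get(
        f"{BASE}/events",
        params={"status": "ACTIVE", "propertyName": "groupKey#en", "propertyValue": group_key},
        auth=AUTH,
        timeout=10,
    )
    resp.raise_for_status()
    for event in resp.json().get("data", []):
        # Assumed shape of the status-change request; verify against your xMatters instance.
        requests.post(
            f"{BASE}/events",
            json={"id": event["id"], "status": "TERMINATED"},
            auth=AUTH,
            timeout=10,
        ).raise_for_status()
```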

Is there any particular reason you want to keep something open in xMatters? Do you have additional reporting in xMatters that you can't get from Prometheus?

I mentioned the new incidents, and I'm almost wondering if creating an incident might solve all of this. I haven't played with this much, and this is the initial launch with a couple of missing pieces we'd need to flesh out fully, but it might work something like:

  • Prometheus alert goes off, an xmatters incident gets created
  • 2h later, prometheus sends the alert again. xmatters finds the existing active incident and makes a comment.
  • 1h later, you fix the issue and prometheus sends a "resolved" alert. xMatters resolves the incident, terminates all events with the matching groupKey and sends a notification indicating the incident has been resolved.
  • 10m later, the alert gets triggered again. Xmatters can't find an existing active incident so it creates a new one
  • 10m later, you fix the issue again, prometheus sends a "resolved" alert, xmatters resolves the active incident.

You'd end up with two incidents, as there were two alerts, which makes sense to me because you'd want to track any downtime.

@SerialVelocity (Author)

Is there any particular reason you want to keep something open in xMatters? Do you have additional reporting in xMatters that you can't get from Prometheus?

It was mainly for a one-to-one mapping.

The incidents workflow seems more like what I'm looking for!

@SerialVelocity (Author)

And yes, a resolved notification would be nice (but it should not be something people have to acknowledge).

@xMTinkerer (Contributor)

Ok, this is helpful info. We just launched incidents, and there are a couple of features we need to finish development on before we can build this kind of integration between Prometheus and xMatters Incidents.

For now I'd say use the Event Flood Control or the Option 2 I listed above.

I'll keep this issue open and when we revisit this integration in the coming months we'll see about adding the incidents.
