
Deduplicate alerts #11

Open
SerialVelocity opened this issue Sep 6, 2020 · 11 comments

@SerialVelocity

Hey,

Currently, each alert (or set of alerts) creates a new notification. It would be nice if duplicates didn't create a new notification every time. Maybe a UUID hash of the groupKey?
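
For reference, the Alertmanager webhook payload already carries the group key, so repeated notifications for the same group arrive with an identical value. A minimal sketch (plain Python for illustration, not the actual xMatters trigger script) of pulling it out as a dedup key:

```python
import json

def extract_group_key(webhook_body: str) -> str:
    """Return the Alertmanager groupKey from a webhook notification.

    The payload shape follows the documented Alertmanager webhook format,
    which includes "groupKey", "status", "groupLabels", and "alerts".
    """
    payload = json.loads(webhook_body)
    return payload["groupKey"]  # e.g. '{}:{alertname="InstanceDown"}'
```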

@xMTinkerer (Contributor)

Ah, yeah, that makes sense. No one wants duplicated alerts.
Would you be averse to an MD5 hash? Something like this one. Or is a full UUID the way to go? Or can we just match on the groupKey itself and skip the hash? These guys seem to be talking about keeping the groupKey in plaintext.
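
For comparison, here is a hedged sketch of the three options on the table: matching on the raw groupKey, an MD5 digest of it, and a deterministic UUID derived from it. The example value is made up.

```python
import hashlib
import uuid

group_key = '{}:{alertname="InstanceDown"}'  # illustrative groupKey value

plain = group_key                                            # match on the groupKey itself
md5_id = hashlib.md5(group_key.encode("utf-8")).hexdigest()  # 32-char hex digest
uuid_id = str(uuid.uuid5(uuid.NAMESPACE_URL, group_key))     # name-based UUID, stable per groupKey

print(plain, md5_id, uuid_id, sep="\n")
```

Any of the three is stable across repeats of the same group, so the choice mostly comes down to whether a fixed-length identifier is easier to store and compare.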

@xMTinkerer (Contributor) commented Sep 9, 2020

EDIT: Actually, a better way to do this is with Event Flood Control. I've updated the below to use that method instead.

I can work on re-releasing the workflow, but that would require you to re-upload the zip file and then reconfigure your Alertmanager URLs. If you'd like to add this to your existing workflow:

  1. Add the groupKey as an output to the HTTP trigger:
    [screenshot]

  2. Add a groupKey property to the event form. This is done on the Form > Layout page. After creating the property, drag it onto the layout:
    [screenshot]

  3. Navigate to Flood Control and select the rule with Source=Prometheus. Then drag the new groupKey property into Selected Properties and set the trigger conditions in the section below:
    [screenshot]


Option 2

  1. Add a Get Events and a switch step before the create event step:
    [screenshot]

  2. Edit the Get Events step like so:
    Status: ACTIVE
    PropertyName: groupKey#en
    PropertyValue: groupKey
    [screenshot]

  3. Update the switch step to inspect the number of events found by the Get Events step (see the sketch after this list):
    [screenshot]
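
Outside of Flow Designer, the same check can be sketched against the xMatters REST API. This is a hedged illustration: the base URL and credentials are placeholders, the `groupKey#en` property name comes from step 2 above, and the response field used for the count is an assumption about the paginated events response.

```python
import requests

# Placeholder instance URL and basic-auth credentials; substitute your own.
BASE = "https://example.xmatters.com/api/xm/1"
AUTH = ("rest-user", "rest-password")

def has_active_event(group_key: str) -> bool:
    """Sketch of the Get Events check: look for ACTIVE events whose
    groupKey#en property matches the incoming Alertmanager groupKey."""
    resp = requests.get(
        f"{BASE}/events",
        params={
            "status": "ACTIVE",
            "propertyName": "groupKey#en",  # property name plus language suffix, as in step 2
            "propertyValue": group_key,
        },
        auth=AUTH,
        timeout=10,
    )
    resp.raise_for_status()
    # Assumes the events endpoint returns a paginated object with a "total" field.
    return resp.json().get("total", 0) > 0

# Switch-step equivalent: only create a new event when nothing active matches.
# if not has_active_event(group_key): create_event(...)
```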

@SerialVelocity (Author)

I don't think those options actually stop the duplicates. Let's say you have repeat_interval set to 2h. With your changes, I think you'd ignore the first duplicate at the 2-hour mark, but then the alert would be closed by xMatters after 3 hours, and a new one would be triggered at the 4-hour mark?

@xMTinkerer (Contributor)

The event can be configured to live for up to 24 hours. So the question is how long until it would be considered a new alert? I don't think it makes sense to deduplicate forever, so there must be a timeframe limit.

I don't think the active/closed status of the event is considered in the flood control, so that is irrelevant. Let's walk through an example. Assume the repeat_interval in your alertmanager.yaml is set to 2h, and you set the flood control in xMatters to "More than 1 events within 4 hours".

12:00 - Prometheus triggers the xMatters integration and flood control allows it.
14:00 - Prometheus triggers the integration again because the repeat_interval elapses, but flood control stops the event from being created because it is still within the 4-hour window.
16:00 - Prometheus triggers the xMatters integration again and flood control allows it because the 4-hour window has elapsed.
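
The suppression in that walkthrough amounts to a sliding window per groupKey. A small illustrative sketch (not xMatters internals) that replays the timeline above:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=4)  # "More than 1 events within 4 hours"
last_created = {}            # groupKey -> time of the last event that was actually created

def allow_event(group_key: str, now: datetime) -> bool:
    """Return True if an event should be created, False if flood control suppresses it."""
    last = last_created.get(group_key)
    if last is not None and now - last < WINDOW:
        return False
    last_created[group_key] = now
    return True

day = datetime(2020, 9, 9)
for hour in (12, 14, 16):
    print(f"{hour}:00", allow_event("gk", day + timedelta(hours=hour)))
# 12:00 True  (event created)
# 14:00 False (suppressed: still inside the 4-hour window)
# 16:00 True  (window elapsed, new event created)
```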

What would you like to see happen?

@SerialVelocity (Author)

So, these are the cases I'm thinking of:

  • An alert is triggered. You fix it in 10 minutes and the alert is resolved. You then get the same alert 20 minutes later because something went wrong again.

    • A concrete example is orphaned pods in Kubernetes. Pods can be left in an inconsistent state when volumes are unmounted improperly. This can happen fairly often.
  • A low-priority alert goes off on Friday evening. Because the weekend is coming up, it is deferred until Monday. You don't want the alert to retrigger, but you still want it to be active.

    • A concrete example is CephClusterWarnState, which means that something is not quite right in Ceph. This could be a disk that is >75% full or a service that crashed and restarted. Both are low priority to look into.

@xMTinkerer (Contributor)

Ah, that's helpful context.
I'm guessing both of these would be set up with different groupKey values? If not, any kind of deduplication would do the wrong thing and lead to under- or over-alerting.

Is this where silences could be helpful? You could reply with a silence and we would write the silence back to Prometheus. The duration of the silence might be something to work out, or we could provide multiple options such as "Silence for 30 minutes", "Silence for 2 days", etc.
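
For what it's worth, writing the silence back is a single call to Alertmanager's v2 silences endpoint. A hedged sketch, with placeholder URL, matcher, and durations:

```python
from datetime import datetime, timedelta, timezone
import requests

ALERTMANAGER = "http://alertmanager.example:9093"  # placeholder address

def create_silence(label: str, value: str, hours: float, author: str, comment: str) -> str:
    """Silence alerts matching label=value for the given number of hours.

    Returns the silence ID reported by Alertmanager.
    """
    now = datetime.now(timezone.utc)
    body = {
        "matchers": [{"name": label, "value": value, "isRegex": False}],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": author,
        "comment": comment,
    }
    resp = requests.post(f"{ALERTMANAGER}/api/v2/silences", json=body, timeout=10)
    resp.raise_for_status()
    return resp.json()["silenceID"]

# e.g. a "Silence for 2 days" response option for the CephClusterWarnState case above:
# create_silence("alertname", "CephClusterWarnState", 48, "xmatters", "Deferred to Monday")
```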

@SerialVelocity (Author)

I think the groupKey would be the same.

Is there a way to check if there is an active alert with the same group key? If there is, can you extend the alert that already exists?

The workflow above would work like this:

  • Prometheus alert goes off, an xmatters alert gets created
  • 2h later, prometheus sends the alert again. xmatters finds the existing active alert and extends its lifetime to the maximum again (repeat for two days)
  • 1h later, you fix the issue and prometheus sends a "resolved" alert. Xmatters resolves the active alert
  • 10m later, the alert gets triggered again. Xmatters can't find an existing active alert so it creates a new one
  • 10m later, you fix the issue again, prometheus sends a "resolved" alert, xmatters resolves the active alert

@xMTinkerer (Contributor)

Is there a way to check if there is an active alert with the same group key?

Yes, the Get Events step hooked up in my screenshot above will do this. You would pass the groupKey name and value as propertyName and propertyValue, respectively. As noted, you would need to parse the groupKey out of the HTTP trigger. (This is a code change I'll make in the branch here.)

can you extend the alert that already exists?

Aside from our newly launched incidents, runtime objects in xMatters are "events"; they indicate that something changed, which is why we don't actually term them alerts.
Once events are created they can be terminated (or suspended and resumed), but not otherwise altered. This makes sense because we wouldn't want to page one person with some information and then, 5 minutes later, escalate to another person and deliver different information for the same event. A change in the information is itself a change that should be communicated.

With that all said, how does this sound?

  • Prometheus alert goes off, an xMatters ~~alert~~ event gets created
  • 2h later, Prometheus sends the alert again. xMatters finds the existing active ~~alert~~ event and ~~extends its lifetime to the maximum again (repeat for two days)~~ does nothing.
  • 1h later, you fix the issue and Prometheus sends a "resolved" alert. xMatters ~~resolves~~ terminates the active ~~alert~~ event (should we notify people here?)
  • 10m later, the alert gets triggered again. xMatters can't find an existing active event so it creates a new one
  • 10m later, you fix the issue again, Prometheus sends a "resolved" alert, xMatters resolves the active event
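
Put together, the "resolved" branch of that flow reduces to: find the ACTIVE events carrying the same groupKey and terminate them. A hedged sketch follows; the status-change call (POSTing an id and status back to the events endpoint) is an assumption about the REST API rather than confirmed behavior, and the URL and credentials are placeholders.

```python
import requests

BASE = "https://example.xmatters.com/api/xm/1"  # placeholder instance
AUTH = ("rest-user", "rest-password")

def handle_resolved(group_key: str) -> None:
    """Sketch of the 'resolved' branch: terminate every ACTIVE event whose
    groupKey#en property matches the incoming Alertmanager groupKey."""
    resp = requests.get(
        f"{BASE}/events",
        params={"status": "ACTIVE", "propertyName": "groupKey#en", "propertyValue": group_key},
        auth=AUTH,
        timeout=10,
    )
    resp.raise_for_status()
    for event in resp.json().get("data", []):
        # Assumed shape of the status-change request; verify against your xMatters instance.
        requests.post(
            f"{BASE}/events",
            json={"id": event["id"], "status": "TERMINATED"},
            auth=AUTH,
            timeout=10,
        ).raise_for_status()
```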

Is there any particular reason you want to keep something open in xMatters? Do you have additional reporting in xMatters that you can't get from Prometheus?

I mentioned the new incidents, and I'm almost wondering if creating an incident might solve all of this. I haven't played with this much, and this is the initial launch with a couple of missing pieces we'd need to flesh out fully, but it might work something like:

  • Prometheus alert goes off, an xmatters incident gets created
  • 2h later, prometheus sends the alert again. xmatters finds the existing active incident and makes a comment.
  • 1h later, you fix the issue and prometheus sends a "resolved" alert. xMatters resolves the incident, terminates all events with the matching groupKey and sends a notification indicating the incident has been resolved.
  • 10m later, the alert gets triggered again. Xmatters can't find an existing active incident so it creates a new one
  • 10m later, you fix the issue again, prometheus sends a "resolved" alert, xmatters resolves the active incident.

You'd end up with two incidents, as there were two alerts, which makes sense to me because you'd want to track any downtime.

@SerialVelocity (Author)

Is there any particular reason you want to keep something open in xMatters? Do you have additional reporting in xMatters that you can't get from Prometheus?

It was mainly for a one-to-one mapping.

The incidents workflow seems more like what I'm looking for!

@SerialVelocity (Author)

And yes, a resolved notification would be nice (but it should not be something people have to acknowledge).

@xMTinkerer (Contributor)

Ok, this is helpful info. We just launched incidents, and there are a couple of features we need to finish development on before we can build this kind of integration between Prometheus and xMatters Incidents.

For now I'd say use the Event Flood Control or the Option 2 I listed above.

I'll keep this issue open and when we revisit this integration in the coming months we'll see about adding the incidents.
