Add AlertLifeCycleObserver that allows consumers to hook into Alert life cycle #3461
base: main
Conversation
Force-pushed from 50b60d1 to b284f53
api/v1/api.go
Outdated
@@ -447,6 +451,9 @@ func (api *API) insertAlerts(w http.ResponseWriter, r *http.Request, alerts ...*
		if err := a.Validate(); err != nil {
			validationErrs.Add(err)
			api.m.Invalid().Inc()
			if api.alertLCObserver != nil {
				api.alertLCObserver.Rejected("Invalid", a)
Can we change "Invalid" to the actual error?
updated
api/v1/api.go
Outdated
@@ -456,8 +463,14 @@ func (api *API) insertAlerts(w http.ResponseWriter, r *http.Request, alerts ...*
			typ: errorInternal,
			err: err,
		}, nil)
		if api.alertLCObserver != nil {
			api.alertLCObserver.Rejected("Failed to create", validAlerts...)
Why is this rejecting?
This is when alerts.Put failed. Since we don't end up recording the alert, I considered it rejected.
api/v1/api_test.go
Outdated
@@ -153,6 +154,20 @@ func TestAddAlerts(t *testing.T) {
			body, _ := io.ReadAll(res.Body)

			require.Equal(t, tc.code, w.Code, fmt.Sprintf("test case: %d, StartsAt %v, EndsAt %v, Response: %s", i, tc.start, tc.end, string(body)))

			observer := alertobserver.NewFakeAlertLifeCycleObserver()
nit: maybe create a separate test case?
updated
dispatch/dispatch_test.go
Outdated
	}
	require.Equal(t, 1, len(recorder.Alerts()))
	require.Equal(t, inputAlerts[0].Fingerprint(), observer.AggrGroupAlerts[0].Fingerprint())
	o, ok := notify.AlertLCObserver(dispatcher.ctx)
Can we create a fake observer, for example one that increments a counter, and then verify that the observer's function gets called?
Yes, we already do that. In line 598 we create a fake observer and in line 616 we verify that the function was called by checking the recorded alert.
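For readers skimming the thread, the fake observer used in these tests is essentially a recorder; a minimal sketch of the idea (names follow the test snippets above, but the actual helper in the alertobserver package may differ) looks like this:

```go
package alertobserver

import "github.com/prometheus/alertmanager/types"

// FakeAlertLifeCycleObserver records every alert it is notified about so a
// test can assert that the corresponding hook was called.
type FakeAlertLifeCycleObserver struct {
	AggrGroupAlerts []*types.Alert
}

func NewFakeAlertLifeCycleObserver() *FakeAlertLifeCycleObserver {
	return &FakeAlertLifeCycleObserver{}
}

// AddedToAggrGroup is an illustrative hook name; at this point in the review
// the interface still had one method per life-cycle event.
func (o *FakeAlertLifeCycleObserver) AddedToAggrGroup(alerts ...*types.Alert) {
	o.AggrGroupAlerts = append(o.AggrGroupAlerts, alerts...)
}
```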
dispatch/dispatch.go
Outdated
	d.ctx, d.cancel = context.WithCancel(context.Background())
	ctx := context.Background()
	if d.alertLCObserver != nil {
		ctx = notify.WithAlertLCObserver(ctx, d.alertLCObserver)
Should we put the observer into the stages rather than in ctx?
You mean pass it as one of the arguments in the Exec call instead of adding it in the context?
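For context, the ctx-based approach in the diff above presumably relies on a small helper pair in the notify package; a sketch of that mechanism (not the PR's exact code) could look like:

```go
package notify

import (
	"context"

	"github.com/prometheus/alertmanager/alertobserver"
)

// alertLCObserverKey is an unexported key type so the stored value cannot
// collide with other context values.
type alertLCObserverKey struct{}

// WithAlertLCObserver stores the observer in the context for later pipeline stages.
func WithAlertLCObserver(ctx context.Context, o alertobserver.AlertLifeCycleObserver) context.Context {
	return context.WithValue(ctx, alertLCObserverKey{}, o)
}

// AlertLCObserver retrieves the observer and reports whether one was set.
func AlertLCObserver(ctx context.Context) (alertobserver.AlertLifeCycleObserver, bool) {
	o, ok := ctx.Value(alertLCObserverKey{}).(alertobserver.AlertLifeCycleObserver)
	return o, ok
}
```

Passing the observer as an explicit Exec argument instead would make the dependency visible in every stage signature, at the cost of touching each Stage implementation.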
Force-pushed from b284f53 to 5fbcf6f
This is great! I've been thinking about doing something similar, for the exact reasons mentioned.
alertobserver/alertobserver.go
Outdated
"github.com/prometheus/alertmanager/types" | ||
) | ||
|
||
type AlertLifeCycleObserver interface { |
Instead of having a large interface with a method per event, have you considered having a generic Observe method that accepts metadata?
For example:
-type AlertLifeCycleObserver interface {
+type LifeCycleObserver interface {
+	Observe(event string, alerts []*types.Alert, meta Metadata)
+}
The metadata could be something as simple as:
type Metadata map[string]interface{}
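A call site with that shape might then look roughly like this (a fragment; the event name, metadata keys, and in-scope variables are illustrative, not constants defined by this PR):

```go
// Fragment: assumes an observer, an alert, and a validation error in scope.
if observer != nil {
	observer.Observe("rejected", []*types.Alert{alert}, Metadata{"reason": err.Error()})
}
```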
Agreed, I'm not a fan of large interfaces either.
Sure, I can update the code as suggested. Thanks for checking 🙇
updated 🙇
I'm not 100% sure I understand how it would be used outside of prometheus/alertmanager. Can you share some code?
Also, though not exactly the same, I wonder if we shouldn't implement tracing inside Alertmanager to provide this visibility into "where's my alert?".
The use that we are thinking of is just adding logs for these events. It sort of becomes an alert history that we can query when the customer comes in. We would like to have the flexibility to decide how we collect and format the logs and how we store them.
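As a rough illustration of that use case (consumer-side code, not part of this PR), the observer could simply translate events into log lines that later serve as an alert history:

```go
package alertlog

import (
	"github.com/go-kit/log"
	"github.com/go-kit/log/level"

	"github.com/prometheus/alertmanager/types"
)

// logObserver is a sketch of a consumer implementation; the Observe
// signature follows the generic form discussed above and may not match the
// PR exactly.
type logObserver struct {
	logger log.Logger
}

func (o *logObserver) Observe(event string, alerts []*types.Alert, meta map[string]interface{}) {
	for _, a := range alerts {
		// One line per alert per event gives a queryable history of what
		// happened to each alert and when.
		level.Info(o.logger).Log("msg", "alert life-cycle event", "event", event, "fingerprint", a.Fingerprint().String())
	}
}
```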
Force-pushed from a8d13d2 to 2b4fc5f
Force-pushed from 2b4fc5f to 494e304
Force-pushed from a5d618e to dee2f48
@@ -338,6 +345,9 @@ func (d *Dispatcher) processAlert(alert *types.Alert, route *Route) {
	// function, to make sure that when the run() will be executed the 1st
	// alert is already there.
	ag.insert(alert)
	if d.alertLCObserver != nil {
Do we need an event at d.metrics.aggrGroupLimitReached.Inc()?
notify/notify.go
Outdated
	m := alertobserver.AlertEventMeta{
		"ctx":         ctx,
		"msg":         "Unrecoverable error",
		"integration": r.integration.Name(),
Do we care about each retry? Should we just record the final fail or final success in func (r RetryStage) Exec()?
I don't think we should care about retries here; currently we only record the final success/fail, hence the if !retry check.
I mean, why not put this into the Exec function at line 758?
I updated the code to log the sent alerts instead because that is the correct list of alerts that was sent. Since we don't return the sent alerts, I think we have to keep the code where it currently is.
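Concretely, the pattern being described keeps the observer call inside RetryStage but only fires it on the terminal attempt, along these lines (a sketch using the surrounding RetryStage variables; the event name and meta keys are assumptions, not the exact diff):

```go
// Inside RetryStage, after a notification attempt: only observe terminal
// outcomes, since the list of sent alerts is known here and is not returned.
if err != nil && !retry {
	if o, ok := AlertLCObserver(ctx); ok {
		o.Observe("failed", sent, alertobserver.AlertEventMeta{
			"msg":         "Unrecoverable error",
			"integration": r.integration.Name(),
		})
	}
}
```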
Just some nits but overall looks good!
Force-pushed from 7eb6d7b to fef64c8
Force-pushed from 9a6a3ea to c700916
@grobinson-grafana @simonpasquier could you have a look at this PR when you have time? Thank you
Force-pushed from c700916 to 1335a7f
Rebased PR and fixed conflicts
@simonpasquier this draft PR in Cortex gives the general idea of our use case for this feature: https://github.com/cortexproject/cortex/pull/5602/commits
Force-pushed from 4de8e25 to 34e94ef
…ife cycle Signed-off-by: Emmanuel Lodovice <[email protected]>
Force-pushed from 34e94ef to a2495fb
@gotjosh good day. Can you take a look at this one?
What this pull request does
This pull request introduces a new AlertLifeCycleObserver interface that is accepted by the API, the Dispatcher, and the notification pipeline. The interface contains methods that allow tracking what happens to an alert inside Alertmanager.
Motivation
When a customer complains "I think my alert is delayed", we currently have no straightforward way to troubleshoot. At a minimum, we should be able to quickly identify whether the problem is post-notification (we sent to the receiver on time but the receiver has some delay) or pre-notification.
By introducing a new interface that allows hooking into the alert life cycle, consumers of the Alertmanager package can implement whatever observability solution works best for them.
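In its updated, generic form (as suggested during review), the hook boils down to roughly the following sketch; the exact names and signatures in the PR may differ slightly:

```go
package alertobserver

import "github.com/prometheus/alertmanager/types"

// AlertEventMeta carries free-form context about a life-cycle event
// (assumed here to be a simple map, as suggested in the review).
type AlertEventMeta map[string]interface{}

// AlertLifeCycleObserver is invoked by the API, the Dispatcher and the
// notification pipeline whenever something notable happens to an alert,
// e.g. it is received, rejected, added to an aggregation group, or sent.
type AlertLifeCycleObserver interface {
	Observe(event string, alerts []*types.Alert, meta AlertEventMeta)
}
```

A consumer would pass its own implementation when constructing the API and the Dispatcher and can then forward these events to whatever logging, metrics, or tracing backend it prefers.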