Add AlertLifeCycleObserver that allows consumers to hook into Alert life cycle #3461
base: main
Conversation
Force-pushed from 50b60d1 to b284f53
api/v1/api.go
Outdated
@@ -447,6 +451,9 @@ func (api *API) insertAlerts(w http.ResponseWriter, r *http.Request, alerts ...*
		if err := a.Validate(); err != nil {
			validationErrs.Add(err)
			api.m.Invalid().Inc()
			if api.alertLCObserver != nil {
				api.alertLCObserver.Rejected("Invalid", a)
Can we change "Invalid" to the actual error?
updated
api/v1/api.go
Outdated
@@ -456,8 +463,14 @@ func (api *API) insertAlerts(w http.ResponseWriter, r *http.Request, alerts ...*
			typ: errorInternal,
			err: err,
		}, nil)
		if api.alertLCObserver != nil {
			api.alertLCObserver.Rejected("Failed to create", validAlerts...)
Why is this rejecting?
This is when alerts.Put failed. Since we don't end up recording the alert, I considered it rejected.
api/v1/api_test.go
Outdated
@@ -153,6 +154,20 @@ func TestAddAlerts(t *testing.T) {
			body, _ := io.ReadAll(res.Body)

			require.Equal(t, tc.code, w.Code, fmt.Sprintf("test case: %d, StartsAt %v, EndsAt %v, Response: %s", i, tc.start, tc.end, string(body)))

			observer := alertobserver.NewFakeAlertLifeCycleObserver()
nit: maybe create a separate test case?
updated
dispatch/dispatch_test.go
Outdated
	}
	require.Equal(t, 1, len(recorder.Alerts()))
	require.Equal(t, inputAlerts[0].Fingerprint(), observer.AggrGroupAlerts[0].Fingerprint())
	o, ok := notify.AlertLCObserver(dispatcher.ctx)
Can we create a fake observer, for example one that increments a counter, and then verify that the observer's function gets called?
Yes, we already do that. In line 598 we create a fake observer and in line 616 we verify that the function was called by checking the recorded alert.
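For readers skimming the thread, the fake observer used in these tests is essentially a recorder; a minimal sketch of the idea (names follow the test snippets above, but the actual helper in the alertobserver package may differ) looks like this:

```go
package alertobserver

import "github.com/prometheus/alertmanager/types"

// FakeAlertLifeCycleObserver records every alert it is notified about so a
// test can assert that the corresponding hook was called.
type FakeAlertLifeCycleObserver struct {
	AggrGroupAlerts []*types.Alert
}

func NewFakeAlertLifeCycleObserver() *FakeAlertLifeCycleObserver {
	return &FakeAlertLifeCycleObserver{}
}

// AddedToAggrGroup is an illustrative hook name; at this point in the review
// the interface still had one method per life-cycle event.
func (o *FakeAlertLifeCycleObserver) AddedToAggrGroup(alerts ...*types.Alert) {
	o.AggrGroupAlerts = append(o.AggrGroupAlerts, alerts...)
}
```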
dispatch/dispatch.go
Outdated
	d.ctx, d.cancel = context.WithCancel(context.Background())
	ctx := context.Background()
	if d.alertLCObserver != nil {
		ctx = notify.WithAlertLCObserver(ctx, d.alertLCObserver)
Should we put the observer into the stages rather than in ctx?
You mean pass it as one of the arguments in the Exec call instead of adding it in the context?
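For context, the ctx-based approach in the diff above presumably relies on a small helper pair in the notify package; a sketch of that mechanism (not the PR's exact code) could look like:

```go
package notify

import (
	"context"

	"github.com/prometheus/alertmanager/alertobserver"
)

// alertLCObserverKey is an unexported key type so the stored value cannot
// collide with other context values.
type alertLCObserverKey struct{}

// WithAlertLCObserver stores the observer in the context for later pipeline stages.
func WithAlertLCObserver(ctx context.Context, o alertobserver.AlertLifeCycleObserver) context.Context {
	return context.WithValue(ctx, alertLCObserverKey{}, o)
}

// AlertLCObserver retrieves the observer and reports whether one was set.
func AlertLCObserver(ctx context.Context) (alertobserver.AlertLifeCycleObserver, bool) {
	o, ok := ctx.Value(alertLCObserverKey{}).(alertobserver.AlertLifeCycleObserver)
	return o, ok
}
```

Passing the observer as an explicit Exec argument instead would make the dependency visible in every stage signature, at the cost of touching each Stage implementation.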
Force-pushed from b284f53 to 5fbcf6f
This is great! I've been thinking about doing something similar, for the exact reasons mentioned.
alertobserver/alertobserver.go
Outdated
"github.com/prometheus/alertmanager/types" | ||
) | ||
|
||
type AlertLifeCycleObserver interface { |
Instead of having a large interface with a method per event, have you considered having a generic Observe method that accepts metadata?
For example:
-type AlertLifeCycleObserver interface {
+type LifeCycleObserver interface {
+	Observe(event string, alerts []*types.Alert, meta Metadata)
+}
The metadata could be something as simple as:
type Metadata map[string]interface{}
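A call site with that shape might then look roughly like this (a fragment; the event name, metadata keys, and in-scope variables are illustrative, not constants defined by this PR):

```go
// Fragment: assumes an observer, an alert, and a validation error in scope.
if observer != nil {
	observer.Observe("rejected", []*types.Alert{alert}, Metadata{"reason": err.Error()})
}
```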
Agreed, I'm not a fan of large interfaces either.
Sure, I can update the code as suggested. Thanks for checking 🙇
updated 🙇
I'm not 100% sure I understand how it would be used outside of prometheus/alertmanager. Can you share some code?
Also, though not exactly the same, I wonder if we shouldn't implement tracing inside Alertmanager to provide this visibility into "where's my alert?".
The use that we are thinking of is just adding logs for these events. It sort of becomes an alert history that we can query when the customer comes in. We would like to have the flexibility to decide how we collect and format the logs and how we store them.
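As a rough illustration of that use case (consumer-side code, not part of this PR), the observer could simply translate events into log lines that later serve as an alert history:

```go
package alertlog

import (
	"github.com/go-kit/log"
	"github.com/go-kit/log/level"

	"github.com/prometheus/alertmanager/types"
)

// logObserver is a sketch of a consumer implementation; the Observe
// signature follows the generic form discussed above and may not match the
// PR exactly.
type logObserver struct {
	logger log.Logger
}

func (o *logObserver) Observe(event string, alerts []*types.Alert, meta map[string]interface{}) {
	for _, a := range alerts {
		// One line per alert per event gives a queryable history of what
		// happened to each alert and when.
		level.Info(o.logger).Log("msg", "alert life-cycle event", "event", event, "fingerprint", a.Fingerprint().String())
	}
}
```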
Force-pushed from a8d13d2 to 2b4fc5f
Force-pushed from 2b4fc5f to 494e304
Force-pushed from a5d618e to dee2f48
@@ -338,6 +345,9 @@ func (d *Dispatcher) processAlert(alert *types.Alert, route *Route) {
	// function, to make sure that when the run() will be executed the 1st
	// alert is already there.
	ag.insert(alert)
	if d.alertLCObserver != nil {
Do we need an event at d.metrics.aggrGroupLimitReached.Inc()?
notify/notify.go
Outdated
	m := alertobserver.AlertEventMeta{
		"ctx":         ctx,
		"msg":         "Unrecoverable error",
		"integration": r.integration.Name(),
Do we care about each retry? Should we just record the final fail or final success in func (r RetryStage) Exec()?
I don't think we should care about retries here; currently we only record the final success/fail, hence the if !retry check.
I mean, why not put this into the Exec function at line 758?
I updated the code to log the sent alerts instead because that is the correct list of alerts that was sent. Since we don't return the sent alerts, I think we have to keep the code where it currently is.
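Concretely, the pattern being described keeps the observer call inside RetryStage but only fires it on the terminal attempt, along these lines (a sketch using the surrounding RetryStage variables; the event name and meta keys are assumptions, not the exact diff):

```go
// Inside RetryStage, after a notification attempt: only observe terminal
// outcomes, since the list of sent alerts is known here and is not returned.
if err != nil && !retry {
	if o, ok := AlertLCObserver(ctx); ok {
		o.Observe("failed", sent, alertobserver.AlertEventMeta{
			"msg":         "Unrecoverable error",
			"integration": r.integration.Name(),
		})
	}
}
```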
Just some nits but overall looks good!
Force-pushed from 7eb6d7b to fef64c8
Force-pushed from 9a6a3ea to c700916
@grobinson-grafana @simonpasquier could you have a look at this PR when you have time? Thank you
Force-pushed from c700916 to 1335a7f
Rebased PR and fixed conflicts
@simonpasquier this draft PR in Cortex gives the general idea of our use case for this feature: https://github.com/cortexproject/cortex/pull/5602/commits
Force-pushed from 4de8e25 to 34e94ef
…ife cycle Signed-off-by: Emmanuel Lodovice <[email protected]>
Force-pushed from 34e94ef to a2495fb
@gotjosh good day. Can you take a look at this one?
What this pull request does
This pull request introduces a new AlertLifeCycleObserver interface that is accepted by the API, the Dispatcher, and the notification pipeline. The interface contains methods that allow tracking what happens to an alert inside Alertmanager.
Motivation
When a customer complains "I think my alert is delayed", we currently have no straightforward way to troubleshoot. At a minimum, we should be able to quickly identify whether the problem is post-notification (we sent to the receiver on time but the receiver has some delay) or pre-notification.
By introducing a new interface that allows hooking into the alert life cycle, consumers of the Alertmanager package can implement whatever observability solution works best for them.
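In its updated, generic form (as suggested during review), the hook boils down to roughly the following sketch; the exact names and signatures in the PR may differ slightly:

```go
package alertobserver

import "github.com/prometheus/alertmanager/types"

// AlertEventMeta carries free-form context about a life-cycle event
// (assumed here to be a simple map, as suggested in the review).
type AlertEventMeta map[string]interface{}

// AlertLifeCycleObserver is invoked by the API, the Dispatcher and the
// notification pipeline whenever something notable happens to an alert,
// e.g. it is received, rejected, added to an aggregation group, or sent.
type AlertLifeCycleObserver interface {
	Observe(event string, alerts []*types.Alert, meta AlertEventMeta)
}
```

A consumer would pass its own implementation when constructing the API and the Dispatcher and can then forward these events to whatever logging, metrics, or tracing backend it prefers.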