Sane Opsgenie alerting #487

wejdross · 2024-09-23T07:55:56Z

This PR adds new Opsgenie alerts that resolve themselves based on a time window, so that whoever is on Responsible Ops stays sane. The old alert logic generated by the Sloth tool stays in place, but its alerts are no longer forwarded to Opsgenie.

Checklist

The PR has a meaningful title. It will be used to auto generate the changelog.
The PR has a meaningful description that sums up the change. It will be linked in the changelog.
PR contains a single logical change (to build a better changelog).
Categorize the PR by adding one of the labels: bug, enhancement, documentation, change, breaking, dependency, as they show up in the changelog.
Link this PR to related issues or PRs.

Kidswiss

OnCall != Ops ;)

Kidswiss · 2024-09-23T08:28:03Z

component/slos.libsonnet

@@ -86,7 +86,7 @@ local generateSlothInput(name, uptime) =
          },
          labels+: {
            service: 'VSHN' + name,
-            OnCall: '{{ if eq $labels.sla "guaranteed" }}true{{ else }}false{{ end }}',
+            OnCall: false,


Let's remove this label completely; it doesn't serve any purpose anymore.

Also, this alone will not disable alerting to Opsgenie during office hours. It only disables routing to on-call, so whoever does ops will still get bothered by these alerts. You'll have to investigate how to disable the routing to Opsgenie. It might even need a new routing rule.
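
If the label is dropped as suggested, the `labels+` block from the diff would shrink to roughly the following. This is only a sketch: `name` is a placeholder value here, and it assumes no parent object already defines `OnCall`, since `+:` merges with inherited fields. It does not address the Opsgenie routing question raised above.

```jsonnet
// Placeholder value; in the component this is an argument of generateSlothInput.
local name = 'PostgreSQL';

{
  labels+: {
    // OnCall is no longer set here, so this block stops contributing the label.
    service: 'VSHN' + name,
  },
}
```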

Kidswiss · 2024-09-23T08:31:16Z

component/vshn_alerting.jsonnet

+            alert: 'vshn-' + std.asciiLower(serviceName) + '-opsgenie-ha',
+            // this query can be read as: if the rate of probes that are not successful is higher than 0.2 in the last 5 minutes and in the last minute, then alert
+            // rate works on per second basis, so 0.2 means 20% of the probes are failing, which for 5 minutes is 1 minute and for 1 minute is 12 seconds
+            expr: 'rate(appcat_probes_seconds_count{reason!="success", service="' + serviceName + '", ha="true", maintenance="false"}[5m]) > 0.2 and rate(appcat_probes_seconds_count{reason!="success", service="' + serviceName + '", ha="true", maintenance="false"}[1m]) > 0.2',


You have to remove the maintenance="false" here.

If HA instances trigger an alert during maintenance, it should be considered broken and investigated.

Also, 12 seconds within 1 minute might be a bit too low for that case, as there will be failovers during maintenance. Maybe we should increase it a bit to avoid false positives?

I just increased the 1-minute value to 45 seconds instead of 12; that should eliminate false positives.

Also removed the maintenance label.
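
The threshold arithmetic in this thread can be sketched in Jsonnet. It follows the code comment in the diff, which treats the rate as the fraction of failing probes (this implies roughly one probe per second). The helper names `thresholdFor`, `failureRate` and `haExpr`, and the service name `VSHNPostgreSQL`, are hypothetical. The 1-minute threshold of 0.75 is inferred from "45 seconds out of 60"; the value actually committed is not shown in this thread.

```jsonnet
// Assumption: one probe per second, so "N failing seconds in a window"
// corresponds to a rate threshold of N / windowSeconds.
local thresholdFor(failingSeconds, windowSeconds) = failingSeconds / windowSeconds;

// Hypothetical helper: failure-rate term for one window.
// The maintenance="false" matcher is left out, as requested in the review.
local failureRate(serviceName, window) =
  'rate(appcat_probes_seconds_count{reason!="success", service="' + serviceName + '", ha="true"}[' + window + '])';

// Both windows must exceed their threshold before the alert fires.
local haExpr(serviceName) =
  failureRate(serviceName, '5m') + ' > ' + thresholdFor(60, 300)
  + ' and '
  + failureRate(serviceName, '1m') + ' > ' + thresholdFor(45, 60);

{
  alert: 'vshn-' + std.asciiLower('VSHNPostgreSQL') + '-opsgenie-ha',
  expr: haExpr('VSHNPostgreSQL'),
}
```

With these inputs the 5-minute threshold evaluates to 0.2 (60 of 300 seconds) and the 1-minute threshold to 0.75 (45 of 60 seconds). Keeping the arithmetic in a helper makes the intended failure duration visible next to the number it produces.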

component/vshn_alerting.jsonnet

Co-authored-by: Kidswiss <[email protected]>

Sane Opsgenie alerting

8d2a41f

wejdross added the enhancement New feature or request label Sep 23, 2024

wejdross self-assigned this Sep 23, 2024

wejdross requested a review from Kidswiss September 23, 2024 07:56

Kidswiss requested changes Sep 23, 2024

wejdross and others added 4 commits September 23, 2024 10:40

Update component/vshn_alerting.jsonnet

f38a1da

Co-authored-by: Kidswiss <[email protected]>

Update component/vshn_alerting.jsonnet

39a19c5

Co-authored-by: Kidswiss <[email protected]>

ensure only guaranteed instances alert

f8f1db3

further inprovements

a8eaf4c
