PYIC-7613: Add new alarm for 80% 5XX #1641

Open
wants to merge 1 commit into main from pyic-7613-update-apigateway-5xx-alarms

Conversation


@Wynndow Wynndow commented Nov 8, 2024

Proposed changes

What changed

Add new alarm for 80% 5XX

Why did it change

This adds a new alarm that will send a PagerDuty alert if the core-front REST API Gateway starts returning 5XX responses for more than 80% of requests.
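
As a rough illustration of the shape of such an alarm (a sketch, not the actual template - the resource name matches the diff below, but the dimensions, API name and SNS topic are assumptions), a CloudWatch metric-math alarm along these lines would page when 5XX responses exceed 80% of requests:

```yaml
# Sketch only - dimensions, API name and the PagerDuty SNS topic are assumptions.
FrontRestApiGateway5xxErrorsPercentage:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Page when more than 80% of core-front REST API Gateway responses are 5XX
    AlarmActions:
      - !Ref PagerDutyAlarmTopic            # hypothetical SNS topic wired to PagerDuty
    ComparisonOperator: GreaterThanThreshold
    Threshold: 80
    EvaluationPeriods: 1
    TreatMissingData: notBreaching
    Metrics:
      - Id: errors
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/ApiGateway
            MetricName: 5XXError            # count of 5XX responses when Stat is Sum
            Dimensions:
              - Name: ApiName
                Value: core-front           # assumed API name
          Period: 60
          Stat: Sum
      - Id: invocations
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/ApiGateway
            MetricName: Count               # total requests in the same period
            Dimensions:
              - Name: ApiName
                Value: core-front
          Period: 60
          Stat: Sum
      - Id: errorPercentage
        Label: 5XX error percentage
        ReturnData: true                    # the value the alarm actually evaluates
        Expression: 100*(errors/invocations)
```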

Issue tracking

@Wynndow Wynndow requested review from a team as code owners November 8, 2024 15:59
@@ -1527,6 +1527,55 @@ Resources:
Period: !FindInMap [ EnvironmentConfiguration, !Ref AWS::AccountId, tg500ErrorWindow ]
Stat: Sum

FrontRestApiGateway5xxErrorsPercentage:
@Wynndow (author) commented:

There appears to be some duplication with the above alarms, although those are informed by metrics from the load balancer and target groups, and they use the SUM of 5XX metrics over a much longer window.
Without really knowing how important it is for those alarms to stay in their current configuration, I've opted to add a new one rather than adjust them. Up for debate.

@Joe-Edwards-GDS commented Nov 8, 2024:

I think it would be good to assess across all of these holistically... I'd be keen to see a clear list of each alarm and what its purpose is - either as comments here, or in a runbook somewhere.

At the moment it looks like we have:

  • FrontLoadBalancer5xxErrors - ALB 5XX errors (>2 in 5m) - triggers PD
  • FrontTargetGroup5xxErrors - Target group 5XX errors (>50 in 5m) - triggers PD
  • FrontTargetGroup5xxPercentErrors - Target group 5XX errors (>5% in 1m) - used by the canary deployments
  • FE5XXErrorAlarm - API GW 5XX errors (>10% in 1m) - goes somewhere?
  • FrontRestApiGateway5xxErrorsPercentage - API GW 5XX errors (>80% in 1m) - what you've just added

There might be a good reason to have lots of different ones - e.g. I can see value in having both %-based and absolute thresholds, and it seems reasonable that the canary alarm is a bit more sensitive than the P1 incident alarm; but at the moment it's not obvious what the rationale is.

Contributor:

FWIW it's the first one that seems to trigger all the real incidents!

shivanshuit914 previously approved these changes Nov 8, 2024
This updates the existing alarm for the percentage of 5XX errors from
the frontend API gateway.

The biggest change is the API ID being targeted - previously the alarm was
connected to an unused HTTP API gateway (should we remove it?). It was also
set to alert if error rates exceeded 1%; this fixes the threshold to 10%.

It also removes the check that the number of errors exceeds a certain count -
as we're using a percentage, this is implicit in the lower limit on the
number of invocations.
@Wynndow Wynndow force-pushed the pyic-7613-update-apigateway-5xx-alarms branch from 465027b to d53685e on November 12, 2024 at 14:18

sonarcloud bot commented Nov 12, 2024

- Id: errorThreshold
  Label: Threshold error percentage
  ReturnData: true
  Expression: IF(invocations<50,0,errorPercentage)
@Wynndow (author) commented:

Overnight invocations are around the 8-10 per minute mark. These seem to come from some sort of smoke testing happening every 3 minutes. Setting this alarm to only trigger when invocations are greater than 50 means it will effectively ignore the traffic generated by the smoke tests.
If we have actual users overnight, hopefully the traffic would increase above 50 so it gets caught. We could reduce the limit here, but then we might get alerted overnight for a very low number of 500s.
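
For illustration, the guard behaves roughly like this with the traffic levels described above (assuming the `invocations` and `errorPercentage` ids are defined by companion metric queries in the same Metrics list):

```yaml
# Illustration only - ids assumed to match the companion metric queries.
- Id: errorThreshold
  Label: Threshold error percentage
  ReturnData: true
  Expression: IF(invocations<50,0,errorPercentage)
  # invocations = 9 (overnight smoke-test traffic)  -> guard returns 0, alarm stays OK
  # invocations = 49, errorPercentage = 100         -> guard returns 0, still ignored
  # invocations = 60, errorPercentage = 90          -> guard returns 90, compared against the alarm threshold
```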

@@ -1568,6 +1571,60 @@ Resources:
Period: 60
Stat: Sum

FrontRestApiGateway5XXErrorAlarm:
@Wynndow (author) commented:

Does this alarm render the FrontTargetGroup5xxErrors alarm redundant?
In prod, we'd need to see two back-to-back 5-minute periods, each with more than 50 5xx errors from the target group, for that alarm to trigger.
If we had 50 errors in minute 1 and then 50 in minute 6, the new alarm wouldn't trigger. So maybe this is still useful?
