PYIC-7613: Add new alarm for 80% 5XX #1641

Open
wants to merge 1 commit into main from pyic-7613-update-apigateway-5xx-alarms

Conversation


@Wynndow Wynndow commented Nov 8, 2024

Proposed changes

What changed

Add new alarm for 80% 5XX

Why did it change

This adds a new alarm that will send a PagerDuty alert if the core-front REST API Gateway starts returning 5XX responses for more than 80% of requests.
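
As a rough illustration of the shape of such an alarm (a sketch, not the actual template - the resource name matches the diff below, but the dimensions, API name and SNS topic are assumptions), a CloudWatch metric-math alarm along these lines would page when 5XX responses exceed 80% of requests:

```yaml
# Sketch only - dimensions, API name and the PagerDuty SNS topic are assumptions.
FrontRestApiGateway5xxErrorsPercentage:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Page when more than 80% of core-front REST API Gateway responses are 5XX
    AlarmActions:
      - !Ref PagerDutyAlarmTopic            # hypothetical SNS topic wired to PagerDuty
    ComparisonOperator: GreaterThanThreshold
    Threshold: 80
    EvaluationPeriods: 1
    TreatMissingData: notBreaching
    Metrics:
      - Id: errors
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/ApiGateway
            MetricName: 5XXError            # count of 5XX responses when Stat is Sum
            Dimensions:
              - Name: ApiName
                Value: core-front           # assumed API name
          Period: 60
          Stat: Sum
      - Id: invocations
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/ApiGateway
            MetricName: Count               # total requests in the same period
            Dimensions:
              - Name: ApiName
                Value: core-front
          Period: 60
          Stat: Sum
      - Id: errorPercentage
        Label: 5XX error percentage
        ReturnData: true                    # the value the alarm actually evaluates
        Expression: 100*(errors/invocations)
```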

Issue tracking

@Wynndow Wynndow requested review from a team as code owners November 8, 2024 15:59
@@ -1527,6 +1527,55 @@ Resources:
Period: !FindInMap [ EnvironmentConfiguration, !Ref AWS::AccountId, tg500ErrorWindow ]
Stat: Sum

FrontRestApiGateway5xxErrorsPercentage:
@Wynndow (author) commented:

There appears to be some duplication with the above alarms, although those are informed by metrics from the load balancer and target groups, and they use the SUM of 5XX metrics over a much longer window.
Without really knowing how important it is for those alarms to stay in their current configuration, I've opted to add a new one rather than adjust them. Up for debate.

@Joe-Edwards-GDS commented Nov 8, 2024:

I think it would be good to assess across all of these holistically... I'd be keen to see a clear list of each alarm and what its purpose is - either as comments here, or in a runbook somewhere.

At the moment it looks like we have:

  • FrontLoadBalancer5xxErrors - ALB 5XX errors (>2 in 5m) - triggers PD
  • FrontTargetGroup5xxErrors - Target group 5XX errors (>50 in 5m) - triggers PD
  • FrontTargetGroup5xxPercentErrors - Target group 5XX errors (>5% in 1m) - used by the canary deployments
  • FE5XXErrorAlarm - API GW 5XX errors (>10% in 1m) - goes somewhere?
  • FrontRestApiGateway5xxErrorsPercentage - API GW 5XX errors (>80% in 1m) - what you've just added

There might be a good reason to have lots of different ones - e.g. I can see value in having both %-based and absolute thresholds, and it seems reasonable that the canary alarm is a bit more sensitive than the P1 incident alarm; but at the moment it's not obvious what the rationale is.

Contributor:

FWIW it's the first one that seems to trigger all the real incidents!

shivanshuit914 previously approved these changes Nov 8, 2024
This updates the existing alarm for the percentage of 5XX errors from
the frontend API gateway.

The biggest change is the API ID being targeted - previously the alarm was
connected to an unused HTTP API gateway (should we remove it?). It was also
set to alert if error rates exceeded 1%; this fixes the threshold to 10%.

It also removes the check that the number of errors exceeds a certain count -
as we're using a percentage, this is implicit in the lower limit on the
number of invocations.
@Wynndow Wynndow force-pushed the pyic-7613-update-apigateway-5xx-alarms branch from 465027b to d53685e on November 12, 2024 at 14:18

sonarcloud bot commented Nov 12, 2024

- Id: errorThreshold
  Label: Threshold error percentage
  ReturnData: true
  Expression: IF(invocations<50,0,errorPercentage)
@Wynndow (author) commented:

Overnight invocations are around the 8-10 per minute mark. These seem to come from some sort of smoke testing happening every 3 minutes. Setting this alarm to only trigger when invocations are greater than 50 means it will effectively ignore the traffic generated by the smoke tests.
If we have actual users overnight, hopefully the traffic would increase above 50 so it gets caught. We could reduce the limit here, but then we might get alerted overnight for a very low number of 500s.
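
For illustration, the guard behaves roughly like this with the traffic levels described above (assuming the `invocations` and `errorPercentage` ids are defined by companion metric queries in the same Metrics list):

```yaml
# Illustration only - ids assumed to match the companion metric queries.
- Id: errorThreshold
  Label: Threshold error percentage
  ReturnData: true
  Expression: IF(invocations<50,0,errorPercentage)
  # invocations = 9 (overnight smoke-test traffic)  -> guard returns 0, alarm stays OK
  # invocations = 49, errorPercentage = 100         -> guard returns 0, still ignored
  # invocations = 60, errorPercentage = 90          -> guard returns 90, compared against the alarm threshold
```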

@@ -1568,6 +1571,60 @@ Resources:
Period: 60
Stat: Sum

FrontRestApiGateway5XXErrorAlarm:
@Wynndow (author) commented:

Does this alarm render the FrontTargetGroup5xxErrors alarm redundant?
In prod, we'd need to see two back-to-back 5-minute periods, each with more than 50 5xx errors from the target group, for that alarm to trigger.
If we had 50 errors in minute 1 and then 50 in minute 6, the new alarm wouldn't trigger. So maybe this is still useful?
