PYIC-7613: Add new alarm for 80% 5XX #1641
base: main
Conversation
deploy/template.yaml (Outdated)
@@ -1527,6 +1527,55 @@ Resources:
Period: !FindInMap [ EnvironmentConfiguration, !Ref AWS::AccountId, tg500ErrorWindow ]
Stat: Sum

FrontRestApiGateway5xxErrorsPercentage:
There appears to be some duplication with the above alarms, although they're informed by metrics from the load balancer and target groups. They use the SUM of 5XX metrics over a much longer window.
Without really knowing the importance of those alarms being in that configuration, I've opted to add a new one rather than adjust the existing ones. Up for debate.
I think it would be good to assess across all of these holistically... I'd be keen to see a clear list of each alarm and what it's purpose is - either as comments here, or in a runbook somewhere.
At the moment it looks like we have:
- FrontLoadBalancer5xxErrors - ALB 5XX errors (>2 in 5m) - triggers PD
- FrontTargetGroup5xxErrors - Target group 5XX errors (>50 in 5m) - triggers PD
- FrontTargetGroup5xxPercentErrors - Target group 5XX errors (>5% in 1m) - used by the canary deployments
- FE5XXErrorAlarm - API GW 5XX errors (>10% in 1m) - goes somewhere?
- FrontRestApiGateway5xxErrorsPercentage - API GW 5XX errors (>80% in 1m) - what you've just added
There might be a good reason to have lots of different ones - e.g. I can see value in having both %-based and absolute thresholds, and it seems reasonable that the canary alarm is a bit more sensitive than the P1 incident alarm; but at the moment it's not obvious what the rationale is.
FWIW it's the first one that seems to trigger all the real incidents!
Force-pushed from d2fbc99 to 465027b
This updates the existing alarm for the percentage of 5XX errors from the frontend API gateway. The biggest change is the API being targeted: previously the alarm was connected to an unused HTTP API gateway (should we remove it?). It was also set to alert at a 1% error rate; this fixes it to 10%. It also removes the separate check on the absolute number of errors, since with a percentage-based alarm that is implicitly covered by the lower limit on the number of invocations.
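For context, a rough sketch of how the corrected 5XX query could look once it targets the REST API gateway - the ApiName value and period below are illustrative assumptions, not values from the template. REST APIs publish 5XXError and Count under an ApiName dimension, whereas HTTP APIs expose a lowercase 5xx metric under ApiId, so pointing the query at the wrong gateway type leaves the alarm watching the wrong data; the alarm's own Threshold would then move from 1 to 10.

# Sketch only - dimension values and period are assumptions, not the real template.
- Id: errors
  ReturnData: false
  MetricStat:
    Metric:
      Namespace: AWS/ApiGateway
      MetricName: 5XXError        # REST API metric (HTTP APIs publish a lowercase "5xx" metric instead)
      Dimensions:
        - Name: ApiName           # REST APIs are identified by ApiName; HTTP APIs use an ApiId dimension
          Value: core-front       # hypothetical name of the REST API the alarm should watch
    Period: 60
    Stat: Sum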
Force-pushed from 465027b to d53685e
Quality Gate passed
- Id: errorThreshold
  Label: Threshold error percentage
  ReturnData: true
  Expression: IF(invocations<50,0,errorPercentage)
Overnight invocations are around the 8-10 per minute mark. These seem to be from some sort of smoke testing happening every 3 minutes. Setting this alarm to only evaluate when invocations are greater than 50 means it will effectively ignore the traffic generated by the smoke tests.
If we have actual users overnight, hopefully the traffic would increase above 50 so it gets caught. We could reduce the limit here, but then we might get alerted overnight for a very low number of 500s.
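To make the guard concrete, here is how the expression would evaluate in a couple of illustrative scenarios (the traffic figures are made up to match the numbers discussed above):

# IF(invocations<50,0,errorPercentage), evaluated per 60s period:
#   invocations = 9,  errors = 9  (overnight smoke tests, all failing)  -> returns 0  -> alarm stays OK
#   invocations = 60, errors = 54 (90% of real traffic failing)         -> returns 90 -> breaches an 80% threshold
Expression: IF(invocations<50,0,errorPercentage)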
@@ -1568,6 +1571,60 @@ Resources:
Period: 60
Stat: Sum

FrontRestApiGateway5XXErrorAlarm:
Does this alarm render the FrontTargetGroup5xxErrors alarm redundant?
In prod, we'd need to see two 5 minute periods back to back with more than 50 5xx errors from the target group for that alarm to trigger.
If we had 50 errors in minute 1 and then 50 in minute 6, the new alarm wouldn't trigger. So maybe this is still useful?
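For comparison, the existing target group alarm as described in this thread would look roughly like the sketch below - the resource shape, dimension references, and logical IDs are assumptions reconstructed from the thresholds mentioned here, not a copy of the real template:

FrontTargetGroup5xxErrors:              # hypothetical reconstruction of the existing alarm
  Type: AWS::CloudWatch::Alarm
  Properties:
    Namespace: AWS/ApplicationELB
    MetricName: HTTPCode_Target_5XX_Count
    Dimensions:
      - Name: TargetGroup
        Value: !GetAtt FrontTargetGroup.TargetGroupFullName      # assumed logical ID
      - Name: LoadBalancer
        Value: !GetAtt FrontLoadBalancer.LoadBalancerFullName    # assumed logical ID
    Statistic: Sum
    Period: 300                         # 5-minute windows
    EvaluationPeriods: 2                # two back-to-back breaching periods required
    Threshold: 50                       # more than 50 target 5XXs per period
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: notBreaching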
Proposed changes
What changed
Add new alarm for 80% 5XX
Why did it change
This adds a new alarm that will send a PagerDuty alert if more than 80% of responses from the core-front REST API gateway are 5XX errors.
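As a rough illustration of the shape such an alarm could take (the SNS topic reference, API name, and invocation floor below are assumptions for the sketch rather than the template as merged):

FrontRestApiGateway5XXErrorAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Pages when more than 80% of core-front REST API Gateway responses are 5XX
    ComparisonOperator: GreaterThanThreshold
    Threshold: 80
    EvaluationPeriods: 1
    TreatMissingData: notBreaching
    AlarmActions:
      - !ImportValue alarm-alert-topic            # assumed SNS topic that forwards to PagerDuty
    Metrics:
      - Id: invocations
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/ApiGateway
            MetricName: Count
            Dimensions:
              - Name: ApiName
                Value: core-front                 # hypothetical API name
          Period: 60
          Stat: Sum
      - Id: errors
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/ApiGateway
            MetricName: 5XXError
            Dimensions:
              - Name: ApiName
                Value: core-front
          Period: 60
          Stat: Sum
      - Id: errorPercentage
        Label: Error percentage
        ReturnData: false
        Expression: (errors/invocations)*100
      - Id: errorThreshold
        Label: Threshold error percentage
        ReturnData: true
        Expression: IF(invocations<50,0,errorPercentage)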
Issue tracking