🐛 [BUG] - SLO Generator Cloud Run service in test project crashes continuously #361

Open
1 task done
lvaylet opened this issue Oct 20, 2023 · 0 comments
Assignees
Labels
bug Something isn't working

Comments

@lvaylet
Collaborator

lvaylet commented Oct 20, 2023

SLO Generator Version

v2.5.1

Python Version

3.9

What happened?

While designing end-to-end tests in #360, I discovered that the Cloud Run service deployed when a new version is released did not respond to any request. With no Availability SLO, uptime check or alerting in place, nothing notified me of the outage before I ran these manual tests.

Looking at the logs, the issue appears to have been going on for at least 30 days (the default maximum retention period for logs). I was not able to trace the exact source of the error, but it is definitely one of the Cloud Scheduler Jobs used for simulating traffic: the Cloud Run service restarted successfully after I paused all the Cloud Scheduler Jobs, and has stayed up since.

I managed to troubleshoot and fix the configuration file as well as some of the SLO definitions. I uploaded these files to the Cloud Storage bucket used by the Cloud Scheduler Jobs, and re-enabled each job one by one.
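The pause-everything-then-re-enable-one-by-one procedure described above could be sketched as a small helper that builds the corresponding `gcloud scheduler jobs pause` / `resume` commands. This is a minimal sketch under stated assumptions, not the exact commands that were run: the job names and the `europe-west1` location are hypothetical placeholders, and only the project ID comes from the console links in this issue.

```python
import subprocess
from typing import List

PROJECT = "slo-generator-ci-a2b4"  # project ID taken from the console links in this issue
LOCATION = "europe-west1"  # assumption: the actual job location is not stated in the issue


def scheduler_command(action: str, job: str) -> List[str]:
    """Build a `gcloud scheduler jobs pause|resume` command for a single job."""
    if action not in ("pause", "resume"):
        raise ValueError(f"unsupported action: {action}")
    return [
        "gcloud", "scheduler", "jobs", action, job,
        f"--location={LOCATION}", f"--project={PROJECT}",
    ]


def run_for_jobs(action: str, jobs: List[str], dry_run: bool = True) -> List[List[str]]:
    """Apply an action to each job in order.

    In dry-run mode (the default) the commands are only returned, not executed.
    """
    commands = [scheduler_command(action, job) for job in jobs]
    if not dry_run:
        for command in commands:
            # check=True stops at the first failing job instead of continuing silently.
            subprocess.run(command, check=True)
    return commands


# Hypothetical job names, for illustration only.
for command in run_for_jobs("pause", ["traffic-job-a", "traffic-job-b"]):
    print(" ".join(command))
```

Resuming one job at a time with `run_for_jobs("resume", [job], dry_run=False)` and checking the Cloud Run service between calls would make it easier to see which job triggers the crash.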

I was not able to troubleshoot and fix all the jobs though. The SLO definitions that still need attention are in the GCS bucket and date from Oct 27, 2022 (vs. the new ones, uploaded on Oct 20, 2023).
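Since the two batches of files are distinguishable by date, the definitions that still need attention could be listed by comparing each object's last-updated timestamp against the Oct 20, 2023 upload. The sketch below works on a plain mapping of object name to timestamp, which could for instance be filled from the `updated` property of each blob in the `google-cloud-storage` client. The object names are hypothetical, and using midnight UTC as the cutoff is an assumption, since the issue gives no time of day.

```python
from datetime import datetime, timezone
from typing import Dict, List, Tuple

# Assumption: anything last updated before the Oct 20, 2023 upload still needs attention.
CUTOFF = datetime(2023, 10, 20, tzinfo=timezone.utc)


def split_by_upload_date(
    updated_by_name: Dict[str, datetime],
) -> Tuple[List[str], List[str]]:
    """Return (stale, fixed) object names, each list sorted alphabetically."""
    stale = sorted(name for name, ts in updated_by_name.items() if ts < CUTOFF)
    fixed = sorted(name for name, ts in updated_by_name.items() if ts >= CUTOFF)
    return stale, fixed


# Hypothetical object names and timestamps, for illustration only.
example = {
    "slo_old_definition.yaml": datetime(2022, 10, 27, 9, 0, tzinfo=timezone.utc),
    "slo_new_definition.yaml": datetime(2023, 10, 20, 14, 30, tzinfo=timezone.utc),
}
stale, fixed = split_by_upload_date(example)
print("still to fix:", stale)
print("already fixed:", fixed)
```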

Finally, I configured an uptime check and an Availability SLO, both with alerting to [email protected], to prevent the issue from happening again (or at least from going unnoticed for a long period of time).

What did you expect?

I expected the Cloud Run service to be available for my end-to-end tests.

Screenshots

Cloud Scheduler Jobs are here:
https://console.cloud.google.com/cloudscheduler?referrer=search&authuser=2&project=slo-generator-ci-a2b4

Config and SLO definitions are here:
https://console.cloud.google.com/storage/browser/slo-generator-ci-a2b4?authuser=2&project=slo-generator-ci-a2b4

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@lvaylet lvaylet added bug Something isn't working triage labels Oct 20, 2023
@lvaylet lvaylet self-assigned this Oct 20, 2023
@lvaylet lvaylet removed the triage label Oct 20, 2023