
Prod Incident 25/07/24 - Hasura 429/504 errors, Postgres Connection Limit, and Unexpected Deprovisioning #930

darunrs commented Jul 26, 2024

On July 25, 2024, Rate Exceeded errors were observed from the production Hasura instance. An investigation was performed with the help of SRE, and one of the actions taken was raising the concurrent request limit of each Hasura instance from 80 to 200 while increasing the max instance count from 5 to 10. This was sufficient to stop the Rate Exceeded errors.

The following morning, it was discovered that the number of DB connections had spiked to 600 and was hovering there, with roughly 400 of them permanently active. As a result, Hasura did not have enough connections to maintain its metadata, causing it to fall out of sync, and QueryApi once again began experiencing issues. After QueryApi was shut down in prod and the database was restarted, the connection count fell.

However, when QueryApi restarted, it immediately began to deprovision many indexers without cause. QueryApi was shut down again and the deprovisioning was investigated. After the impacted indexers were documented, QueryApi was restarted with a custom commit that increased the timeout between stalled stream/executor restart attempts and disabled deprovisioning. The deprovisioned indexers were all brought back and backfilled on Jul 26, 2024.
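For reference, the scaling change described above has roughly the following shape, assuming Hasura is deployed on Cloud Run (the service name `hasura` is a placeholder; the real deployment may use different tooling):

```sh
# Hedged sketch, assuming a Cloud Run deployment; "hasura" is a placeholder.
# Per-instance concurrent request limit: 80 -> 200
# Autoscaling ceiling: 5 -> 10 instances
gcloud run services update hasura --concurrency=200 --max-instances=10
```

One design note worth flagging: each Hasura instance holds its own Postgres connection pool (`HASURA_GRAPHQL_PG_CONNECTIONS`, which defaults to 50), so if the pool size was left at the default, raising the instance ceiling to 10 means Hasura alone could hold around 500 connections; together with other clients, that is consistent with the ~600 connections observed the next morning.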

TLDR:

  • Hasura rejected all requests due to accumulated timed-out queries, coming either from a block stream that was being repeatedly restarted or from KitWallet, which continued querying Hasura after Postgres connections reached their limit
  • Postgres connections rapidly rose to the maximum because the timed-out QueryApi queries left connections permanently active (see the SQL sketch below)
  • Indexers were suddenly deprovisioned even though they had not been deleted from the registry contract
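As a hedged illustration (standard Postgres introspection, not a transcript of the actual remediation), this is how the stuck connections could be diagnosed and bounded; `queryapi_role` is a placeholder role name:

```sql
-- Count connections by state: a large, stable block of "active" sessions
-- matches the ~400 permanently active connections seen in this incident.
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state
ORDER BY count(*) DESC;

-- Inspect the longest-running active queries to identify the offenders.
SELECT pid, now() - query_start AS runtime, left(query, 80) AS query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY runtime DESC
LIMIT 10;

-- One possible guardrail (an assumption, not the documented fix): cap query
-- runtime for the role QueryApi connects as, so queries that Hasura has
-- already timed out cannot hold their connections open indefinitely.
ALTER ROLE queryapi_role SET statement_timeout = '30s';
```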

More details can be found in the Incident Document.

I've separated the task list into two, as the two incidents are unrelated.

Deprovisioning Incident

  1. component: Registry (assigned: darunrs)
  2. component: Coordinator (assigned: darunrs)
  3. component: Coordinator (assigned: darunrs)
  4. (assigned: darunrs)
  5. component: Runner (assigned: darunrs)
  6. component: Coordinator (ungroomed)
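For context on the mitigation in the custom commit mentioned above (longer timeout between stalled stream/executor restarts, deprovisioning disabled), here is a minimal Rust sketch of the shape of that change. Every name and value here (`RestartPolicy`, `incident_recovery`, the backoff durations) is a hypothetical illustration, not the actual Coordinator code:

```rust
use std::time::Duration;

/// Hypothetical knobs mirroring the custom commit: a longer backoff between
/// restart attempts for stalled streams/executors, plus a kill switch that
/// disables deprovisioning entirely. Names and values are illustrative only.
pub struct RestartPolicy {
    /// How long to wait before restarting a stalled block stream or executor.
    pub stalled_restart_backoff: Duration,
    /// When false, Coordinator never tears down an indexer's resources, even
    /// if it believes the indexer was removed from the registry contract.
    pub deprovisioning_enabled: bool,
}

impl RestartPolicy {
    /// Policy used for incident recovery: restarts are spaced far apart so
    /// repeated restarts cannot pile timed-out queries onto Hasura/Postgres,
    /// and deprovisioning stays off until the root cause is understood.
    pub fn incident_recovery() -> Self {
        Self {
            // Raised from a hypothetical short default to five minutes.
            stalled_restart_backoff: Duration::from_secs(300),
            deprovisioning_enabled: false,
        }
    }
}

fn main() {
    let policy = RestartPolicy::incident_recovery();
    println!(
        "backoff = {:?}, deprovisioning enabled = {}",
        policy.stalled_restart_backoff, policy.deprovisioning_enabled
    );
}
```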