This repository has been archived by the owner on Jun 11, 2024. It is now read-only.
We've identified a number of weaknesses in the hydra design and implementation that cause ungraceful failures (worker crashes) and downtime when utilization spikes. The problem occurred in the window 7/7/2021-7/21/2021.
Problem analysis (theory)
The backend Postgres database can become overloaded under a high volume of DHT requests to the hydras.
This causes database query times to increase. This in turn causes DHT requests to back up in the provider manager loop, which in turn causes the hydra nodes to crash.
Verify that a sustained increase in request load at the hydra level does not propagate to the backing Postgres datastore. This should be ensured by the graceful quality-degradation measures at the DHT provider manager.
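One way to picture graceful degradation under load is a bounded queue that drops new requests instead of blocking when it is full, so a slow datastore cannot cause an unbounded backlog. The following is only a minimal sketch under that assumption; `request`, `sheddingQueue`, and `Offer` are hypothetical names for illustration and are not taken from hydra-booster or go-libp2p-kad-dht.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// ErrOverloaded is returned when the queue is full and the request is shed.
var ErrOverloaded = errors.New("provider queue full: request dropped")

// request is a hypothetical stand-in for a DHT provider request.
type request struct{ key string }

// sheddingQueue is a bounded queue that drops new work instead of blocking.
type sheddingQueue struct {
	ch      chan request
	dropped int64 // number of shed requests, updated atomically
}

func newSheddingQueue(capacity int) *sheddingQueue {
	return &sheddingQueue{ch: make(chan request, capacity)}
}

// Offer enqueues r if there is room. If the queue is full it records the
// drop and returns ErrOverloaded immediately, without blocking the caller.
func (q *sheddingQueue) Offer(r request) error {
	select {
	case q.ch <- r:
		return nil
	default:
		atomic.AddInt64(&q.dropped, 1)
		return ErrOverloaded
	}
}

// Dropped reports how many requests have been shed so far.
func (q *sheddingQueue) Dropped() int64 {
	return atomic.LoadInt64(&q.dropped)
}

func main() {
	q := newSheddingQueue(2)
	for i := 0; i < 5; i++ {
		err := q.Offer(request{key: fmt.Sprintf("k%d", i)})
		fmt.Println(i, err)
	}
	fmt.Println("queued:", len(q.ch), "dropped:", q.Dropped())
}
```

With this shape, memory use is capped by the queue capacity, and the drop counter gives a signal that can be exported as a metric to check the acceptance criterion.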
@petar : thanks for putting this together. A few comments/questions coming to mind:
I'm not saying we need to backfill now, but in future I think it would be ideal to include the data that led us to our theory.
Do we know why we're crashing now vs. not previously?
What's the impact of Hydra nodes crashing? Does the whole network see an impact? Or is our ability to monitor/inspect the network impaired?
Is there anything else we could do architecturally or infrastructure-wise that would help here? I'm not saying we should, but for example, would AWS RDS Postgres Aurora help?
You don't need to answer these questions here. They are the things that came to mind while reading this.
Corrective steps
Keep Hydra head peerIDs between restarts #128
Resolved by Allow deterministic key generation from seed #130
Use fast approximate queries to Postgres for metrics collection #133
Allow the ProviderManager to have more parallelism go-libp2p-kad-dht#729
Degrade provider handling quality gracefully under load go-libp2p-kad-dht#730
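As an illustration of what a fast approximate query for metrics can look like, Postgres keeps a row-count estimate in `pg_class.reltuples`, which can be read without scanning the table the way an exact `COUNT(*)` does. The sketch below assumes that approach; it is not necessarily what #133 implements, and the table name `records` and the helper `approxRowCount` are hypothetical.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// approxCountQuery reads the planner's row estimate for a table.
// The estimate is refreshed by VACUUM/ANALYZE, so it can lag behind the
// true count, and newer Postgres versions report -1 for a table that has
// never been analyzed.
const approxCountQuery = `SELECT reltuples::bigint FROM pg_class WHERE relname = $1`

// approxRowCount returns the estimated number of rows in the named table.
// The caller supplies an open *sql.DB backed by a Postgres driver.
func approxRowCount(ctx context.Context, db *sql.DB, table string) (int64, error) {
	var n int64
	err := db.QueryRowContext(ctx, approxCountQuery, table).Scan(&n)
	return n, err
}

func main() {
	// No database connection is opened here; this only shows the query.
	// With a live connection: n, err := approxRowCount(ctx, db, "records")
	fmt.Println(approxCountQuery)
}
```

The trade-off is accuracy for cost: the number is only as fresh as the last analyze run, which is usually acceptable for a dashboard gauge but not for anything that needs an exact count.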
Acceptance criteria