### Problem summary

I'm using `pgxpool` in a service backed by a Postgres database. Occasionally, some event interrupts connectivity between the service and the DB for a brief period (tens of seconds, perhaps), and the system enters a state where 100% of requests time out and it cannot recover without a restart of all service instances. During this time, new connections are constantly being reestablished, causing high CPU usage on the target database instance, which further exacerbates the problem. After a restart of the app, things are fine again. The overall scenario is similar to what @ericreis describes in #1282 (though in our case we're not particularly concerned with doing automatic retries in-band).

### Repro case

My coworkers and I have narrowed this down to what we think is a representative and locally-runnable repro case:

```go
package main

import (
    "context"
    "fmt"
    "github.com/jackc/pgx/v5/pgxpool"
    "log"
    "os"
    "runtime"
    "time"
)

func main() {
    // urlExample := "postgres://username:password@localhost:5432/database_name"
    ctx := context.Background()
    cfg, err := pgxpool.ParseConfig(os.Getenv("DATABASE_URL"))
    if err != nil {
        panic(err)
    }
    cfg.MinConns = 10
    cfg.MaxConns = 10
    pool, err := pgxpool.NewWithConfig(context.Background(), cfg)
    if err != nil {
        panic(err)
    }
    defer pool.Close()

    // Wait until the database is reachable.
    for {
        if err := pool.Ping(context.Background()); err == nil {
            break
        }
        fmt.Printf("got error pinging pool, trying again in 10ms\n")
        time.Sleep(10 * time.Millisecond)
    }

    _, err = pool.Exec(ctx, "CREATE TABLE IF NOT EXISTS beeps (id integer NOT NULL)")
    if err != nil {
        log.Fatal(err)
    }

    // Fire a query every 2ms and print aggregate stats once per second.
    ticker := time.NewTicker(2 * time.Millisecond)
    statsTicker := time.NewTicker(1 * time.Second)
    errors := make(chan error, 100)
    successCount := 0
    errCount := 0
    for {
        select {
        case <-ticker.C:
            go runOnce(ctx, pool, errors)
        case <-statsTicker.C:
            successRate := float64(successCount) / float64(successCount+errCount)
            fmt.Printf("%d requests, success rate: %.2f%%, # goroutines = %d\n", successCount+errCount, successRate*100, runtime.NumGoroutine())
            successCount = 0
            errCount = 0
        case err := <-errors:
            if err == nil {
                successCount += 1
            } else {
                errCount += 1
            }
        }
    }
}

func runOnce(ctx context.Context, pool *pgxpool.Pool, errs chan error) {
    ctx, cancel := context.WithTimeout(ctx, time.Second)
    defer cancel()
    result, err := pool.Query(ctx, "SELECT 1 FROM beeps LIMIT 1")
    if err != nil {
        errs <- err
        return
    }
    result.Close()
    errs <- nil
}
```

If I run this against a local DB, all is well: the per-second stats show a steady 100% success rate.
However, if I induce a series of timeouts by briefly taking a lock that blocks the `SELECT`s (one way to do this is sketched below), the test driver enters a state where the success rate drops to zero and stays there long after the lock is released.
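For anyone trying to reproduce this, here is a minimal sketch of one way to hold such a lock. I'm assuming an `ACCESS EXCLUSIVE` lock on `beeps` held for about 30 seconds; anything that blocks the driver's `SELECT`s for longer than their 1-second timeouts should behave similarly. Run it as a separate process against the same `DATABASE_URL`:

```go
package main

import (
    "context"
    "fmt"
    "os"
    "time"

    "github.com/jackc/pgx/v5"
)

func main() {
    ctx := context.Background()
    conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
    if err != nil {
        panic(err)
    }
    defer conn.Close(ctx)

    tx, err := conn.Begin(ctx)
    if err != nil {
        panic(err)
    }
    // ACCESS EXCLUSIVE conflicts with the ACCESS SHARE lock taken by SELECT,
    // so the driver's queries block (and time out) until the transaction ends.
    if _, err := tx.Exec(ctx, "LOCK TABLE beeps IN ACCESS EXCLUSIVE MODE"); err != nil {
        panic(err)
    }
    fmt.Println("holding lock on beeps for 30s...")
    time.Sleep(30 * time.Second)
    if err := tx.Rollback(ctx); err != nil {
        panic(err)
    }
    fmt.Println("lock released")
}
```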
In this test scenario the system does eventually recover, but only after a minute or more. During that time, CPU on the database process is highly elevated because of the heavy volume of new connections being created.

### The question

Can anyone offer techniques for allowing the system to recover more quickly from events like this? We tried rate-limiting new connection creation with a static limit, but it doesn't seem to help with recovery time, and it can make things worse depending on the limit you choose. Another idea we've discussed is limiting the number of concurrently-constructing connections, but pgx doesn't seem to offer enough hooks to do this cleanly, since there's no hook for when creation of a new connection fails or times out. We've also talked about a backoff mechanism that inserts increasing delays between connection creation attempts in the face of repeated failures, but again, because there's no callback for failed connection creation, there's no obvious way to do it.
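For concreteness, here is a rough sketch of the kind of static limit I mean, assuming it is wired in through `pgxpool`'s `BeforeConnect` hook and `golang.org/x/time/rate` (not necessarily exactly what we ran; the package and function names are made up for illustration):

```go
package dbtools // hypothetical package name, for illustration

import (
    "context"

    "github.com/jackc/pgx/v5"
    "github.com/jackc/pgx/v5/pgxpool"
    "golang.org/x/time/rate"
)

// LimitConnectRate caps how fast the pool may open new connections: every
// connection attempt first waits on a token-bucket limiter before dialing.
func LimitConnectRate(cfg *pgxpool.Config, maxPerSecond float64) {
    limiter := rate.NewLimiter(rate.Limit(maxPerSecond), 1)
    cfg.BeforeConnect = func(ctx context.Context, _ *pgx.ConnConfig) error {
        // Blocks until a token is available or ctx is cancelled.
        return limiter.Wait(ctx)
    }
}
```

Called as something like `LimitConnectRate(cfg, 5)` after `ParseConfig` and before `NewWithConfig`. As noted above, this kind of static limit didn't meaningfully shorten recovery time in our tests.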
---
Gah, I just realized I had left a call to `pool.Reset()` in the test case.
---
It turns out that my test case was poorly constructed (as mentioned above, I had left a call to `pool.Reset()` in it). We've essentially resolved this issue in production by implementing a circuit-breaker pattern for DB queries, where >N consecutive failures results in fast-failing all DB queries for a brief period of time before trying again.
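In case it helps anyone who lands here later, here is a minimal sketch of the pattern; the names and thresholds are made up for illustration, and our production implementation differs in its details:

```go
package dbtools // hypothetical package name, for illustration

import (
    "errors"
    "sync"
    "time"
)

// ErrCircuitOpen is returned while the breaker is fast-failing calls.
var ErrCircuitOpen = errors.New("db circuit breaker is open")

// Breaker fast-fails calls for `cooldown` after `maxFailures` consecutive
// failures, then lets traffic through again.
type Breaker struct {
    mu          sync.Mutex
    maxFailures int
    cooldown    time.Duration
    failures    int
    openUntil   time.Time
}

func NewBreaker(maxFailures int, cooldown time.Duration) *Breaker {
    return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Do runs fn unless the breaker is currently open, and records the outcome.
func (b *Breaker) Do(fn func() error) error {
    b.mu.Lock()
    if time.Now().Before(b.openUntil) {
        b.mu.Unlock()
        return ErrCircuitOpen
    }
    b.mu.Unlock()

    err := fn()

    b.mu.Lock()
    defer b.mu.Unlock()
    if err == nil {
        b.failures = 0
        return nil
    }
    b.failures++
    if b.failures >= b.maxFailures {
        b.openUntil = time.Now().Add(b.cooldown)
        b.failures = 0
    }
    return err
}
```

Each DB call in the service then goes through something like `breaker.Do(func() error { _, err := pool.Exec(ctx, query, args...); return err })`, so that during an outage most requests fail immediately instead of piling up on new connection attempts.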