### Problem summary

I'm using `pgxpool` in a service backed by a Postgres database. Occasionally, some event interrupts connectivity between the service and the DB for a brief period (tens of seconds, perhaps), and the system enters a state where 100% of requests time out and it cannot recover without a restart of all service instances. During this time, new connections are constantly being reestablished, causing high CPU usage on the target database instance, which further exacerbates the problem. After a restart of the app, things are fine again. The overall scenario is similar to what @ericreis describes in #1282 (though in our case we're not particularly concerned with doing automatic retries in-band).

### Repro case

My coworkers and I have narrowed this down to what we think is a representative and locally-runnable repro case:

```go
package main

import (
    "context"
    "fmt"
    "github.com/jackc/pgx/v5/pgxpool"
    "log"
    "os"
    "runtime"
    "time"
)

func main() {
    // urlExample := "postgres://username:password@localhost:5432/database_name"
    ctx := context.Background()
    cfg, err := pgxpool.ParseConfig(os.Getenv("DATABASE_URL"))
    if err != nil {
        panic(err)
    }
    cfg.MinConns = 10
    cfg.MaxConns = 10
    pool, err := pgxpool.NewWithConfig(context.Background(), cfg)
    if err != nil {
        panic(err)
    }
    defer pool.Close()

    // Wait until the database is reachable.
    for {
        if err := pool.Ping(context.Background()); err == nil {
            break
        }
        fmt.Printf("got error pinging pool, trying again in 10ms\n")
        time.Sleep(10 * time.Millisecond)
    }

    _, err = pool.Exec(ctx, "CREATE TABLE IF NOT EXISTS beeps (id integer NOT NULL)")
    if err != nil {
        log.Fatal(err)
    }

    // Fire a query every 2ms and print aggregate stats once per second.
    ticker := time.NewTicker(2 * time.Millisecond)
    statsTicker := time.NewTicker(1 * time.Second)
    errors := make(chan error, 100)
    successCount := 0
    errCount := 0
    for {
        select {
        case <-ticker.C:
            go runOnce(ctx, pool, errors)
        case <-statsTicker.C:
            successRate := float64(successCount) / float64(successCount+errCount)
            fmt.Printf("%d requests, success rate: %.2f%%, # goroutines = %d\n", successCount+errCount, successRate*100, runtime.NumGoroutine())
            successCount = 0
            errCount = 0
        case err := <-errors:
            if err == nil {
                successCount += 1
            } else {
                errCount += 1
            }
        }
    }
}

func runOnce(ctx context.Context, pool *pgxpool.Pool, errs chan error) {
    ctx, cancel := context.WithTimeout(ctx, time.Second)
    defer cancel()
    result, err := pool.Query(ctx, "SELECT 1 FROM beeps LIMIT 1")
    if err != nil {
        errs <- err
        return
    }
    result.Close()
    errs <- nil
}
```

If I run this against a local DB, all is well: the per-second stats show a steady 100% success rate.
However, if I induce a series of timeouts by briefly taking a lock that blocks the `SELECT`s (one way to do this is sketched below), the test driver enters a state where the success rate drops to zero and stays there long after the lock is released.
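For anyone trying to reproduce this, here is a minimal sketch of one way to hold such a lock. I'm assuming an `ACCESS EXCLUSIVE` lock on `beeps` held for about 30 seconds; anything that blocks the driver's `SELECT`s for longer than their 1-second timeouts should behave similarly. Run it as a separate process against the same `DATABASE_URL`:

```go
package main

import (
    "context"
    "fmt"
    "os"
    "time"

    "github.com/jackc/pgx/v5"
)

func main() {
    ctx := context.Background()
    conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
    if err != nil {
        panic(err)
    }
    defer conn.Close(ctx)

    tx, err := conn.Begin(ctx)
    if err != nil {
        panic(err)
    }
    // ACCESS EXCLUSIVE conflicts with the ACCESS SHARE lock taken by SELECT,
    // so the driver's queries block (and time out) until the transaction ends.
    if _, err := tx.Exec(ctx, "LOCK TABLE beeps IN ACCESS EXCLUSIVE MODE"); err != nil {
        panic(err)
    }
    fmt.Println("holding lock on beeps for 30s...")
    time.Sleep(30 * time.Second)
    if err := tx.Rollback(ctx); err != nil {
        panic(err)
    }
    fmt.Println("lock released")
}
```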
In this test scenario the system does eventually recover, but only after a minute or more. During that time, CPU on the database process is highly elevated because of the heavy volume of new connections being created.

### The question

Can anyone offer techniques for allowing the system to recover more quickly from events like this? We tried rate-limiting new connection creation with a static limit, but it doesn't seem to help with recovery time, and it can make things worse depending on the limit you choose. Another idea we've discussed is limiting the number of concurrently-constructing connections, but pgx doesn't seem to offer enough hooks to do this cleanly, since there's no hook for when creation of a new connection fails or times out. We've also talked about a backoff mechanism that inserts increasing delays between connection creation attempts in the face of repeated failures, but again, because there's no callback for failed connection creation, there's no obvious way to do it.
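For concreteness, here is a rough sketch of the kind of static limit I mean, assuming it is wired in through `pgxpool`'s `BeforeConnect` hook and `golang.org/x/time/rate` (not necessarily exactly what we ran; the package and function names are made up for illustration):

```go
package dbtools // hypothetical package name, for illustration

import (
    "context"

    "github.com/jackc/pgx/v5"
    "github.com/jackc/pgx/v5/pgxpool"
    "golang.org/x/time/rate"
)

// LimitConnectRate caps how fast the pool may open new connections: every
// connection attempt first waits on a token-bucket limiter before dialing.
func LimitConnectRate(cfg *pgxpool.Config, maxPerSecond float64) {
    limiter := rate.NewLimiter(rate.Limit(maxPerSecond), 1)
    cfg.BeforeConnect = func(ctx context.Context, _ *pgx.ConnConfig) error {
        // Blocks until a token is available or ctx is cancelled.
        return limiter.Wait(ctx)
    }
}
```

Called as something like `LimitConnectRate(cfg, 5)` after `ParseConfig` and before `NewWithConfig`. As noted above, this kind of static limit didn't meaningfully shorten recovery time in our tests.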
---
Gah, I just realized I had left a call to `pool.Reset()` in the test case.
---
It turns out that my test case was poorly constructed (as mentioned above, I had left a call to `pool.Reset()` in it). We've essentially resolved this issue in production by implementing a circuit-breaker pattern for DB queries, where >N consecutive failures results in fast-failing all DB queries for a brief period of time before trying again.
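In case it helps anyone who lands here later, here is a minimal sketch of the pattern; the names and thresholds are made up for illustration, and our production implementation differs in its details:

```go
package dbtools // hypothetical package name, for illustration

import (
    "errors"
    "sync"
    "time"
)

// ErrCircuitOpen is returned while the breaker is fast-failing calls.
var ErrCircuitOpen = errors.New("db circuit breaker is open")

// Breaker fast-fails calls for `cooldown` after `maxFailures` consecutive
// failures, then lets traffic through again.
type Breaker struct {
    mu          sync.Mutex
    maxFailures int
    cooldown    time.Duration
    failures    int
    openUntil   time.Time
}

func NewBreaker(maxFailures int, cooldown time.Duration) *Breaker {
    return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Do runs fn unless the breaker is currently open, and records the outcome.
func (b *Breaker) Do(fn func() error) error {
    b.mu.Lock()
    if time.Now().Before(b.openUntil) {
        b.mu.Unlock()
        return ErrCircuitOpen
    }
    b.mu.Unlock()

    err := fn()

    b.mu.Lock()
    defer b.mu.Unlock()
    if err == nil {
        b.failures = 0
        return nil
    }
    b.failures++
    if b.failures >= b.maxFailures {
        b.openUntil = time.Now().Add(b.cooldown)
        b.failures = 0
    }
    return err
}
```

Each DB call in the service then goes through something like `breaker.Do(func() error { _, err := pool.Exec(ctx, query, args...); return err })`, so that during an outage most requests fail immediately instead of piling up on new connection attempts.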