returnPermits exception stalls connection pool #255

erikbeerepoot · 2024-12-09T17:54:36Z

We are seeing an issue with drainLoop in SimpleDequePool that is similar to this issue. We see the following pattern:

Elevated errors over a period of hours (in our case, we see a high count of StackOverflowErrors). We are still attempting to understand the cause of these errors.
We observe a Too many permits returned error in our logs. In our case, the connection pool is configured with 500 connections, so the error is Too many permits returned: returned=1, would bring to 501/500.
Idle and active connections go to 0 (observed via metrics) and pending connections continue to build (we've seen many thousands).
No more outgoing requests are observed.
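
To make the arithmetic behind the error message concrete, here is a minimal sketch of a size-bounded permit counter. `BoundedPermits` is a hypothetical stand-in written for illustration, not reactor-pool's actual strategy, and the double release shown in `main` is only an assumed trigger; we have not confirmed that this is what happens in production.

```kotlin
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical, simplified model of a size-bounded permit counter.
// It only illustrates the bounds check; it is not reactor-pool code.
class BoundedPermits(private val max: Int) {
    private val available = AtomicInteger(max)

    /** Takes up to [desired] permits and returns how many were granted. */
    fun getPermits(desired: Int): Int {
        while (true) {
            val current = available.get()
            val granted = minOf(desired, current)
            if (granted == 0) return 0
            if (available.compareAndSet(current, current - granted)) return granted
        }
    }

    /** Gives permits back; rejects a return that would exceed [max]. */
    fun returnPermits(returned: Int) {
        while (true) {
            val current = available.get()
            require(current + returned <= max) {
                "Too many permits returned: returned=$returned, would bring to ${current + returned}/$max"
            }
            if (available.compareAndSet(current, current + returned)) return
        }
    }

    fun availableCount(): Int = available.get()
}

fun main() {
    val permits = BoundedPermits(500)
    permits.getPermits(1)        // one connection allocated: 499 permits left
    permits.returnPermits(1)     // released once: back to 500
    try {
        permits.returnPermits(1) // assumed double release: would be 501/500
    } catch (e: IllegalArgumentException) {
        println(e.message)
    }
}
```

In this model the counter itself stays at 500 after the rejected return, so the damage would have to come from how the caller reacts to the exception rather than from the counter.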

We have attempted to reproduce this issue manually by implementing a custom AllocationStrategy that delegates to an actual allocation strategy but throws an exception in returnPermits (either randomly, or after a given number of calls). After this exception is thrown, we observe the same behaviour as above. We don't currently understand how we reach a state where the PERMITS count is off. We are unable to enable DEBUG logging in production due to PII concerns, but we are looking to deploy some changes that will add logging on the number of permits as well as the state of the connection pool. Any pointers you have would be appreciated.

One hypothesis is that the exception is not handled, and WIP is not decremented. When doAcquire() is called, elements are added to the pending queue as before, but drainLoop cannot be entered as WIP will not be 0.
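
The hypothesis can be illustrated with a small model of a WIP-guarded drain loop. `DrainModel` and its members are hypothetical names invented for this sketch, and the loop is a generic version of the pattern rather than the SimpleDequePool source, so treat it as an illustration of the suspected mechanism only.

```kotlin
import java.util.ArrayDeque
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical model of a work-in-progress (WIP) guarded drain loop.
class DrainModel(private val onItem: (Int) -> Unit) {
    private val wip = AtomicInteger(0)
    private val pending = ArrayDeque<Int>()
    val processed = mutableListOf<Int>()

    fun submit(item: Int) {
        synchronized(pending) { pending.add(item) }
        drain()
    }

    private fun drain() {
        // Only the caller that moves WIP from 0 to 1 enters the loop.
        if (wip.getAndIncrement() != 0) return
        var missed = 1
        while (true) {
            while (true) {
                val next = synchronized(pending) { pending.poll() } ?: break
                onItem(next) // if this throws, the decrement below never runs
                processed.add(next)
            }
            missed = wip.addAndGet(-missed)
            if (missed == 0) return
        }
    }

    fun pendingCount(): Int = synchronized(pending) { pending.size }
    fun wipValue(): Int = wip.get()
}

fun main() {
    val model = DrainModel { item -> if (item == 2) throw RuntimeException("Boom!") }
    model.submit(1)
    runCatching { model.submit(2) } // exception escapes drain(); WIP stays at 1
    model.submit(3)                 // enqueued, but drain() returns immediately
    model.submit(4)
    // prints: processed=[1] pending=2 wip=3
    println("processed=${model.processed} pending=${model.pendingCount()} wip=${model.wipValue()}")
}
```

Once the exception escapes, every later `submit` only increments WIP and grows the pending queue, which would match the symptoms above: no active work and an ever-growing pending count.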

Another option is that it's hitting this exception in destroyPoolable.

Expected Behavior

Ideally, the connection pool would continue to function. I don't know if that is realistic given that PERMITS would continue to be wrong.

Actual Behavior

No more connections are made.

Steps to Reproduce

We realize this is a contrived example, but it matches what we end up seeing in production.

class ThrowingAllocationStrategy(...) {
    private val allocationStrategyDelegate = // your actual allocation strategy here, e.g. Http2AllocationStrategy, SizeBasedAllocationStrategy
    // implement the remainder by just delegating the calls to the underlying allocation strategy

    override fun returnPermits(p0: Int) {
        randomlyThrow()

        allocationStrategyDelegate.returnPermits(p0)
    }

    private fun randomlyThrow() {
        if (Random.nextInt(100) == 0) {
            throw RuntimeException("Boom!")
        }
    }
}

<SNIP>
// init custom connection provider with the custom allocation strategy
return ConnectionProvider
        .allocationStrategy(loggingAllocationStrategy!!)
        .build()

Possible Solution

Given that the PERMITS count is off, it's unclear to me what the proper resolution would be. We noticed that disposing of the connection provider did resolve the issue as expected, presumably because the new connection pool is fresh and not in a bad state. I'm not sure what the implications would be of continuing to use the existing connection pool, as I would expect to see the permits exception being thrown repeatedly.

Your Environment

reactor-pool == 0.2.13
io.projectreactor:reactor-core:3.5.20 -> 3.6.9
org.springframework:spring-webflux -> 6.1.12
io.projectreactor.netty:reactor-netty-core:1.1.22
netty:4.1.112.Final

Also tried the latest main commit for reactor-core and reactor-pool.

Other relevant libraries versions (eg. netty, ...): netty == 4.1.x
JVM version (java -version): 21, 17.
OS and version (eg uname -a):
