Connection recovery hangs infinitely #52

Open
marcuslinke opened this issue Jul 21, 2015 · 18 comments

@marcuslinke

During my experiments with two RabbitMQ nodes behind a load balancer, I ran into a situation where both nodes were down for a short period. In this case the connection recovery hangs indefinitely, although both nodes came back up after a few seconds:

Name: main
State: WAITING on net.jodah.lyra.internal.util.concurrent.ReentrantCircuit$Sync@2224ea85
Total blocked: 4  Total waited: 53

Stack trace: 
 sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
net.jodah.lyra.internal.util.concurrent.ReentrantCircuit.await(ReentrantCircuit.java:63)
net.jodah.lyra.internal.RetryableResource.callWithRetries(RetryableResource.java:76)
net.jodah.lyra.internal.ChannelHandler.invoke(ChannelHandler.java:167)
com.sun.proxy.$Proxy1.basicPublish(Unknown Source)
de.na.messaging.rabbitmq.TestLoadBalancer.impl(TestLoadBalancer.java:124)

Here is my configuration:

RecoveryPolicy recoveryPolicy = new RecoveryPolicy().withBackoff(Duration.seconds(2), Duration.seconds(20))
                .withMaxDuration(Duration.seconds(20));

RetryPolicy retryPolicy = new RetryPolicy().withBackoff(Duration.seconds(3), Duration.seconds(60));

Shouldn't the recovery attempt time out after 20 seconds at the latest, so that a retry occurs, or have I misunderstood something?
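
For reference, here is a minimal sketch of how these policies are wired up (not my exact test code; the host name is a placeholder for the load balancer address):

    import com.rabbitmq.client.Connection;

    import net.jodah.lyra.ConnectionOptions;
    import net.jodah.lyra.Connections;
    import net.jodah.lyra.config.Config;
    import net.jodah.lyra.config.RecoveryPolicy;
    import net.jodah.lyra.config.RetryPolicy;
    import net.jodah.lyra.util.Duration;

    public class RecoveryTimeoutTest {
      public static void main(String[] args) throws Exception {
        // Recovery attempts back off from 2s up to 20s; the whole recovery
        // phase is meant to be capped at 20s by withMaxDuration.
        RecoveryPolicy recoveryPolicy = new RecoveryPolicy()
            .withBackoff(Duration.seconds(2), Duration.seconds(20))
            .withMaxDuration(Duration.seconds(20));

        // Blocked invocations (e.g. basicPublish) retry with 3s..60s backoff.
        RetryPolicy retryPolicy = new RetryPolicy()
            .withBackoff(Duration.seconds(3), Duration.seconds(60));

        Config config = new Config()
            .withRecoveryPolicy(recoveryPolicy)
            .withRetryPolicy(retryPolicy);

        // "lb.example.test" is a placeholder for the load balancer address.
        Connection connection = Connections.create(
            new ConnectionOptions().withHost("lb.example.test"), config);

        // This is the call that parks forever in ReentrantCircuit.await.
        connection.createChannel().basicPublish("", "test", null, "ping".getBytes());
      }
    }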

@michaelklishin
Collaborator

Can you post a JVM thread dump when that happens?

@marcuslinke
Author

Of course. Here it is:

2015-07-21 10:18:39
Full thread dump Java HotSpot(TM) 64-Bit Server VM (20.65-b04-466.1 mixed mode):

"AWT-Shutdown" prio=5 tid=7fc6a9000000 nid=0x1111b1000 in Object.wait() [1111b0000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <7f42b0100> (a java.lang.Object)
    at sun.awt.AWTAutoShutdown.run(AWTAutoShutdown.java:284)
    - locked <7f42b0100> (a java.lang.Object)
    at java.lang.Thread.run(Thread.java:695)

   Locked ownable synchronizers:
    - None

"RMI TCP Connection(2)-192.168.59.3" daemon prio=9 tid=7fc6a60bc000 nid=0x1142b8000 runnable [1142b7000]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    - locked <7f329f5c8> (a java.io.BufferedInputStream)
    at java.io.FilterInputStream.read(FilterInputStream.java:66)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:517)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:695)

   Locked ownable synchronizers:
    - <7f30e80b0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

"JMX server connection timeout 25" daemon prio=9 tid=7fc6a5a76000 nid=0x1141b5000 in Object.wait() [1141b4000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <7f315db70> (a [I)
    at com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout.run(ServerCommunicatorAdmin.java:150)
    - locked <7f315db70> (a [I)
    at java.lang.Thread.run(Thread.java:695)

   Locked ownable synchronizers:
    - None

"RMI Scheduler(0)" daemon prio=9 tid=7fc6a9013000 nid=0x1140b2000 waiting on condition [1140b1000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <7f43b31b0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:196)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
    at java.util.concurrent.DelayQueue.take(DelayQueue.java:164)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:609)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:602)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:957)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:917)
    at java.lang.Thread.run(Thread.java:695)

   Locked ownable synchronizers:
    - None

"RMI TCP Connection(1)-192.168.59.3" daemon prio=9 tid=7fc6ab802000 nid=0x113faf000 runnable [113fae000]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    - locked <7f310d958> (a java.io.BufferedInputStream)
    at java.io.FilterInputStream.read(FilterInputStream.java:66)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:517)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:695)

   Locked ownable synchronizers:
    - <7f30e75a8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

"RMI TCP Accept-0" daemon prio=9 tid=7fc6a59dc000 nid=0x113eac000 runnable [113eab000]
   java.lang.Thread.State: RUNNABLE
    at java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:439)
    - locked <7f42b4088> (a java.net.SocksSocketImpl)
    at java.net.ServerSocket.implAccept(ServerSocket.java:468)
    at java.net.ServerSocket.accept(ServerSocket.java:436)
    at sun.management.jmxremote.LocalRMIServerSocketFactory$1.accept(LocalRMIServerSocketFactory.java:34)
    at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
    at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)
    at java.lang.Thread.run(Thread.java:695)

   Locked ownable synchronizers:
    - None

"AWT-AppKit" daemon prio=5 tid=7fc6a60b7800 nid=0x7fff7a1bd310 runnable [00000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
    - None

"Attach Listener" daemon prio=9 tid=7fc6a60bd800 nid=0x10ecf5000 waiting on condition [00000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
    - None

"Low Memory Detector" daemon prio=5 tid=7fc6a600e000 nid=0x10f209000 runnable [00000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
    - None

"C2 CompilerThread1" daemon prio=9 tid=7fc6a593b000 nid=0x10f106000 waiting on condition [00000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
    - None

"C2 CompilerThread0" daemon prio=9 tid=7fc6a593a800 nid=0x10f003000 waiting on condition [00000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
    - None

"Signal Dispatcher" daemon prio=9 tid=7fc6a5939800 nid=0x10ef00000 runnable [00000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
    - None

"Surrogate Locker Thread (Concurrent GC)" daemon prio=5 tid=7fc6a600d000 nid=0x10edfd000 waiting on condition [00000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
    - None

"Finalizer" daemon prio=8 tid=7fc6a5939000 nid=0x10ebf2000 in Object.wait() [10ebf1000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <7f42b2440> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
    - locked <7f42b2440> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
    at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:171)

   Locked ownable synchronizers:
    - None

"Reference Handler" daemon prio=10 tid=7fc6a5938000 nid=0x10eaef000 in Object.wait() [10eaee000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <7f42b6000> (a java.lang.ref.Reference$Lock)
    at java.lang.Object.wait(Object.java:485)
    at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
    - locked <7f42b6000> (a java.lang.ref.Reference$Lock)

   Locked ownable synchronizers:
    - None

"main" prio=5 tid=7fc6a3801000 nid=0x105ced000 waiting on condition [105cec000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <7f42ba010> (a net.jodah.lyra.internal.util.concurrent.ReentrantCircuit$Sync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
    at net.jodah.lyra.internal.util.concurrent.ReentrantCircuit.await(ReentrantCircuit.java:63)
    at net.jodah.lyra.internal.RetryableResource.callWithRetries(RetryableResource.java:76)
    at net.jodah.lyra.internal.ChannelHandler.invoke(ChannelHandler.java:167)
    at com.sun.proxy.$Proxy1.basicPublish(Unknown Source)
    at de.na.messaging.rabbitmq.TestLoadBalancer.impl(TestLoadBalancer.java:124)
    at de.na.messaging.rabbitmq.TestLoadBalancer.setup(TestLoadBalancer.java:110)
    at de.na.messaging.rabbitmq.TestLoadBalancer.main(TestLoadBalancer.java:51)

   Locked ownable synchronizers:
    - None

"VM Thread" prio=9 tid=7fc6a5933800 nid=0x10e9ec000 runnable 

"Gang worker#0 (Parallel GC Threads)" prio=9 tid=7fc6a6000800 nid=0x10a0c5000 runnable 

"Gang worker#1 (Parallel GC Threads)" prio=9 tid=7fc6a6001000 nid=0x10a1c8000 runnable 

"Gang worker#2 (Parallel GC Threads)" prio=9 tid=7fc6a3802800 nid=0x10a2cb000 runnable 

"Gang worker#3 (Parallel GC Threads)" prio=9 tid=7fc6a3803000 nid=0x10a3ce000 runnable 

"Gang worker#4 (Parallel GC Threads)" prio=9 tid=7fc6a6002000 nid=0x10a4d1000 runnable 

"Gang worker#5 (Parallel GC Threads)" prio=9 tid=7fc6a6002800 nid=0x10a5d4000 runnable 

"Gang worker#6 (Parallel GC Threads)" prio=9 tid=7fc6a3803800 nid=0x10a6d7000 runnable 

"Gang worker#7 (Parallel GC Threads)" prio=9 tid=7fc6a5800800 nid=0x10a7da000 runnable 

"Concurrent Mark-Sweep GC Thread" prio=9 tid=7fc6a58dd800 nid=0x10e664000 runnable 
"Gang worker#0 (Parallel CMS Threads)" prio=9 tid=7fc6a58dc800 nid=0x10dc5e000 runnable 

"Gang worker#1 (Parallel CMS Threads)" prio=9 tid=7fc6a3804000 nid=0x10dd61000 runnable 

"VM Periodic Task Thread" prio=10 tid=7fc6a9011000 nid=0x10f30c000 waiting on condition 

"Exception Catcher Thread" prio=10 tid=7fc6a3801800 nid=0x106f01000 runnable 
JNI global references: 1310

@jhalterman
Owner

What version of Lyra and amqp-client are you using? Also, what are the exact steps to reproduce? Can it be done with a single, non-clustered instance?

@marcuslinke
Author

I'm using Lyra v0.5.2, amqp-client v3.5.3, and RabbitMQ v3.5.3. The behavior is NOT reproducible with a single instance. If it helps, I could provide a network packet dump.

@Minalan

Minalan commented Aug 4, 2015

I have basically the exact same problem (my thread dumps look nearly the same in the Lyra frames -- I can post them if needed).

In my case, there is no load balancer, but the connection bounced up and down a few times.

RecoveryPolicy policy = new RecoveryPolicy();
RetryPolicy retryPolicy = RetryPolicies.retryAlways();

if (_maxAttempts > 0) {
  policy.withMaxAttempts(_maxAttempts);
  retryPolicy.withMaxAttempts(_maxAttempts);
}

if (_interval > 0) {
  policy.withInterval(Duration.seconds(_interval));
  retryPolicy.withInterval(Duration.seconds(_interval));
}

if (_maxDuration > 0) {
  policy.withMaxDuration(Duration.seconds(_maxDuration));
  retryPolicy.withMaxDuration(Duration.seconds(_maxDuration));
}

if (_backoffTime > 0) {
  policy.withBackoff(Duration.seconds(_backoffTime), Duration.seconds(_backoffMaxInterval));
  retryPolicy.withBackoff(Duration.seconds(_backoffTime), Duration.seconds(_backoffMaxInterval));
}

return new Config().withRecoveryPolicy(policy).withRetryPolicy(retryPolicy);

Currently the variables are set like this:
_maxAttempts = 0
_interval = 0
_maxDuration = 0
_backoffTime = 1
_backoffMaxInterval = 60

From my understanding, that should create Recovery and Retry policies that continuously retry (forever), waiting longer and longer between retries until there is 1 minute between each.
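
For clarity, with those values the method above should reduce to roughly this (a sketch, assuming the fields are set exactly as listed):

    // _backoffTime = 1, _backoffMaxInterval = 60, everything else 0:
    // both policies retry forever, backing off from 1s up to 60s between attempts.
    RecoveryPolicy policy = new RecoveryPolicy()
        .withBackoff(Duration.seconds(1), Duration.seconds(60));
    RetryPolicy retryPolicy = RetryPolicies.retryAlways()
        .withBackoff(Duration.seconds(1), Duration.seconds(60));
    return new Config().withRecoveryPolicy(policy).withRetryPolicy(retryPolicy);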

As a note, I have two different channels from one connection; not sure if that is relevant.

I created connection and channel listeners with logging, and it looks like the first time the connection went down, recovery happened normally. My log shows about 10 retries, then the connection, channels, and consumers all come back up; then the connection drops back out and Lyra locks up.

Edit: forgot to post versions -- Lyra 0.4.3, amqp-client 3.4.1, RabbitMQ (server) 3.4.2

@jhalterman
Owner

@Minalan are you also using a cluster? Do you know if one or all nodes go down around the same time?

@Minalan

Minalan commented Aug 4, 2015

Not using a cluster, only one server. If it just goes down and comes back up, recovery works fine. If it bounces up and down quickly, the lockup happens. Not sure if it is caused by having more than one channel go down at a time.

@jhalterman
Owner

@Minalan What are you doing with the two channels when the failure happens?

@Minalan

Minalan commented Aug 4, 2015

One channel is reading from a queue (pretty continuously -- there are nearly always messages there). The other is publishing to several different queues, including the one that the other channel is reading from (it is not attached to a consumer at all -- this publishing is not nearly as continuous as the reading, as only some of the messages require a response).

@jhalterman
Owner

Do you know what sort of failures are happening on your server, or how I might reproduce this? Trying to reproduce it by killing networking or restarting the machine repeatedly doesn't do it.

@Minalan

Minalan commented Aug 4, 2015

I don't know what the failure is specifically, but I do know that the Java app is connecting to a RMQ server in another city over a VPN. It could be the VPN going up and down.

@atfire

atfire commented Sep 15, 2015

Here are my steps to reproduce:

  1. I run both the server and the client locally on my Mac, with this config:
    new Config()
        .withRecoveryPolicy(RecoveryPolicies.recoverAlways())
        .withRetryPolicy(RetryPolicies.retryAlways())
  2. Stop the RabbitMQ server by running: rabbitmqctl stop.

In debug mode I see that Lyra tries to reconnect once, then gets an IOException (with a ShutdownSignalException cause) and never reconnects again. It happens in

    RetryableResource:58

    if (sse != null && (recovery || !recoverable))
        throw e;

The exception is just rethrown and lost. No further reconnection attempts are made -- Lyra doesn't get this SSE and doesn't trigger reconnection. So the last line I see in the logs is

[ConnectionHandler.java:240] Recovering connection cxn-7 to [localhost:5672]

and it hangs indefinitely even after I start rabbitmq-server back up.

  3. If I kill the connection from the admin console (the server is not shut down), everything seems to be fine; reconnection just works.

Lyra 0.5.2, RabbitMQ 3.5.3.

So the steps to reproduce are easy (sketched below). Am I doing anything wrong?
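
For completeness, the client side of my reproduction looks roughly like this (a sketch; the queue name is arbitrary and the loop is simplified):

    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;

    import net.jodah.lyra.ConnectionOptions;
    import net.jodah.lyra.Connections;
    import net.jodah.lyra.config.Config;
    import net.jodah.lyra.config.RecoveryPolicies;
    import net.jodah.lyra.config.RetryPolicies;

    public class StopBrokerRepro {
      public static void main(String[] args) throws Exception {
        Config config = new Config()
            .withRecoveryPolicy(RecoveryPolicies.recoverAlways())
            .withRetryPolicy(RetryPolicies.retryAlways());

        Connection connection = Connections.create(
            new ConnectionOptions().withHost("localhost"), config);
        Channel channel = connection.createChannel();
        channel.queueDeclare("lyra-test", false, false, false, null);

        // Run `rabbitmqctl stop` while this loop publishes, then restart the
        // broker. With the bug, the publishing thread parks forever in
        // ReentrantCircuit.await instead of resuming.
        while (true) {
          channel.basicPublish("", "lyra-test", null, "ping".getBytes());
          Thread.sleep(1000);
        }
      }
    }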

@jhalterman
Owner

@atfire @Minalan @marcuslinke What OS are you running on? As #53 brought up, this could be the result of a different exception being thrown on your platform, one that Lyra isn't recovering from.

@atfire

atfire commented Nov 30, 2015

I was testing on Mac OS 10.9.5

@marcuslinke
Author

Me too

@avratnam

My setup is similar to marcuslinke's: I am using Lyra 0.5.2 and RabbitMQ 3.5.3 (server & client), and I am able to reproduce this issue on Mac OS 10.10.5 as well as on RHEL Enterprise 6.5. I see an IOException thrown when this happens. Is there any solution for this issue?

@jottinger

I'm having this problem too. Is there a cookbook entry that shows the proper way to recover when a RabbitMQ node goes down? The process I'm following is:

  1. connect
  2. set up a consumer
  3. send messages (which get consumed properly)
  4. shut down rabbitmq
  5. start rabbitmq
  6. (observe exception)
  7. send messages (which do not get consumed, because the connection is hung)

This is using the following configuration (steps 1-3 are sketched below):

lyraConfig = new Config()
        .withRecoveryPolicy(RecoveryPolicies.recoverAlways())
        .withRetryPolicy(RetryPolicies.retryAlways())
        .withConsumerRecovery(true);
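
Steps 1-3 above look roughly like this, continuing from the lyraConfig above (a sketch; the queue name and consumer body are simplified, and imports from com.rabbitmq.client and net.jodah.lyra are omitted):

    // Step 1: connect through Lyra with the config above.
    Connection connection = Connections.create(
        new ConnectionOptions().withHost("localhost"), lyraConfig);
    Channel channel = connection.createChannel();
    channel.queueDeclare("demo", false, false, false, null);

    // Step 2: set up a consumer.
    channel.basicConsume("demo", true, new DefaultConsumer(channel) {
      @Override
      public void handleDelivery(String consumerTag, Envelope envelope,
          AMQP.BasicProperties properties, byte[] body) {
        System.out.println("consumed: " + new String(body));
      }
    });

    // Step 3: send messages (these stop being consumed after the broker restart).
    channel.basicPublish("", "demo", null, "hello".getBytes());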

@michaelklishin
Collaborator

I suspect this is a variation of #53 (comment). Without knowing what the exception is, and without the server logs, it is difficult to reconstruct what's going on.
