Consumer enters rebalance loop when connect function is triggered during a scheduled heartbeat #1279

ajwootto · 2019-06-05T05:51:11Z

Bug Report

Environment

Node version: 8
Kafka-node version: 4.1.3
Kafka version: 1.10

This is a bit of an edge case but we've run into it pretty consistently with our setup. The logical steps are as follows:

Given two consumers that have successfully connected to a broker and started heartbeats:

1. next heartbeat is currently scheduled
2. connect is called outside the heartbeat loop (due to a closed socket, etc.)
3. next heartbeat happens and gets a rebalance error because of the in-progress reconnect
4. another reconnect is scheduled due to the heartbeat error
5. first connect finishes
6. heartbeat interval is cleared and restarted
7. next heartbeat succeeds on the latest generation id
8. scheduled reconnect occurs from the previous heartbeat failure (outside the context of the current heartbeat loop, i.e. from the old generation id)
GOTO 3.

Basically, the problem seems to be kicked off by connect() getting called by some mechanism other than a heartbeat failure (in this case a socket close event, which triggers a reconnect). Since this path does not cancel the heartbeat interval, the scheduled heartbeat can fire during the connection (rebalance) process. When it does, the heartbeat receives error code 27 (REBALANCE_IN_PROGRESS) and triggers a rebalance, scheduling another connection for 1 second in the future. Assuming the first connect() call finishes in time, it starts a new heartbeat loop but does not clear the currently scheduled reconnect. One second later that reconnect occurs, but the latest heartbeat loop is still scheduled and receives error code 27 on its next request, triggering another reconnect, and so on.
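To make the two missing cancellations concrete, here is a minimal sketch of the kind of guard that would break the cycle. It is not kafka-node's actual code: `GroupSession`, its method names, and the `sendHeartbeat` / `rejoin` callbacks are hypothetical, and the default timings are assumptions (only the 1 second reconnect delay comes from the description above). The idea is that every connect() clears both the heartbeat interval and any pending reconnect, and that heartbeat errors from an old generation id are ignored.

```javascript
// Hypothetical sketch: GroupSession and its method names are illustrative,
// not part of kafka-node.
class GroupSession {
  constructor({ heartbeatMs = 3000, reconnectMs = 1000, sendHeartbeat = () => {}, rejoin = () => {} } = {}) {
    this.heartbeatMs = heartbeatMs;
    this.reconnectMs = reconnectMs;
    this.sendHeartbeat = sendHeartbeat; // assumed callback: sends one heartbeat request
    this.rejoin = rejoin;               // assumed callback: starts the join/sync round trip
    this.generationId = null;
    this.connecting = false;
    this.heartbeatTimer = null;
    this.reconnectTimer = null;
  }

  // Single entry point for every reconnect trigger (socket close, heartbeat error, ...).
  connect() {
    if (this.connecting) return false; // a join is already in flight
    this.connecting = true;
    this.clearTimers(); // no heartbeat or stale reconnect may fire mid-rebalance
    this.rejoin();
    return true;
  }

  // Call once the group join has completed with the new generation id.
  onConnected(generationId) {
    this.connecting = false;
    this.generationId = generationId;
    this.clearTimers();
    this.heartbeatTimer = setInterval(() => this.sendHeartbeat(generationId), this.heartbeatMs);
  }

  // Call when a heartbeat sent for `generationId` fails with error code 27.
  onHeartbeatError(generationId) {
    // Ignore failures from an old generation or while a join is already running.
    if (this.connecting || generationId !== this.generationId) return false;
    if (this.reconnectTimer === null) {
      this.reconnectTimer = setTimeout(() => {
        this.reconnectTimer = null;
        this.connect();
      }, this.reconnectMs);
    }
    return true;
  }

  // Cancel the heartbeat interval and any pending reconnect.
  clearTimers() {
    clearInterval(this.heartbeatTimer);
    clearTimeout(this.reconnectTimer);
    this.heartbeatTimer = null;
    this.reconnectTimer = null;
  }
}
```

With this shape, a socket close handler and a heartbeat failure both go through `connect()`, so neither can leave a timer behind that belongs to a previous generation.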

To simulate this problem, I added some code to the consumerGroup that calls connect() a few times one second apart. This is enough to throw it into a loop when running with two consumers against my local Kafka.

taplytics@8fd6b92

Just set process.env.FAKE_CONNECT=1 for one consumer and not the other.
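For anyone who wants to try this without checking out the commit, the reproduction amounts to something like the sketch below. This is not the code from the linked commit: `scheduleFakeConnects` and `maybeFakeConnect` are hypothetical helper names, and the count of three calls is an assumption standing in for "a few times". It only relies on the consumer group exposing connect().

```javascript
// Hypothetical reproduction helpers; the actual change lives in the linked commit.

// Calls consumerGroup.connect() `times` times, `intervalMs` apart.
// `setTimer` is injectable so the schedule can be inspected without waiting.
function scheduleFakeConnects(consumerGroup, { times = 3, intervalMs = 1000, setTimer = setTimeout } = {}) {
  const timers = [];
  for (let i = 1; i <= times; i++) {
    timers.push(setTimer(() => consumerGroup.connect(), i * intervalMs));
  }
  return timers;
}

// Only the consumer started with FAKE_CONNECT=1 gets the extra connect() calls.
function maybeFakeConnect(consumerGroup, env = process.env, options = {}) {
  if (env.FAKE_CONNECT !== '1') return [];
  return scheduleFakeConnects(consumerGroup, options);
}
```

Calling `maybeFakeConnect(consumerGroup)` once the group has connected, in both consumer processes, leaves the one without the environment variable untouched.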


thynson · 2019-06-10T08:09:12Z

Looks like an issue I want to fix in PR #1281.

ajwootto · 2019-06-10T20:14:35Z

I don't think it's the same problem. In my case, there is only one topic involved.
