Consumer enters rebalance loop when connect function is triggered during a scheduled heartbeat #1279

Open
ajwootto opened this issue Jun 5, 2019 · 2 comments


ajwootto commented Jun 5, 2019

Bug Report

Environment

  • Node version: 8
  • Kafka-node version: 4.1.3
  • Kafka version: 1.10

This is a bit of an edge case but we've run into it pretty consistently with our setup. The logical steps are as follows:

Given two consumers that have successfully connected to a broker and started heartbeats:

1. The next heartbeat is already scheduled.
2. connect() is called from outside the heartbeat loop (e.g. because the socket closed).
3. The scheduled heartbeat fires and fails with a rebalance error because the reconnect is still in progress.
4. Another reconnect is scheduled because of the heartbeat error.
5. The first connect finishes.
6. The heartbeat interval is cleared and restarted.
7. The next heartbeat succeeds on the latest generation id.
8. The reconnect scheduled by the earlier heartbeat failure fires (outside the context of the current heartbeat loop, i.e. from the old generation id).
Go to step 3.
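
The steps above can be replayed with a toy model. This is only a sketch and not kafka-node code: `FakeClock`, `ModelConsumer`, `simulate` and the three timing constants are made up for illustration, and the "broker" is reduced to a single `connecting` flag that makes a heartbeat fail while a connect is in flight. A virtual clock is used so the sequence is deterministic.

```javascript
// Toy model of the timer race; this is NOT kafka-node code.
// All names and timings are assumptions chosen for illustration.
const HEARTBEAT_MS = 300;   // delay between heartbeats
const CONNECT_MS = 450;     // how long a connect (rebalance) takes
const RECONNECT_MS = 1000;  // delay before reconnecting after a failed heartbeat

// Virtual clock so the sequence replays deterministically without real timers.
class FakeClock {
  constructor() {
    this.now = 0;
    this.nextId = 1;
    this.tasks = [];
  }
  setTimeout(fn, delayMs) {
    const id = this.nextId++;
    this.tasks.push({ id, at: this.now + delayMs, fn });
    return id;
  }
  clearTimeout(id) {
    this.tasks = this.tasks.filter((task) => task.id !== id);
  }
  // Run every task due at or before `limit`, earliest first.
  runUntil(limit) {
    for (;;) {
      const due = this.tasks
        .filter((task) => task.at <= limit)
        .sort((a, b) => a.at - b.at || a.id - b.id)[0];
      if (!due) break;
      this.clearTimeout(due.id);
      this.now = due.at;
      due.fn();
    }
    this.now = limit;
  }
}

class ModelConsumer {
  constructor(clock) {
    this.clock = clock;
    this.connecting = false;
    this.heartbeatTimer = null;
    this.connectTimes = []; // virtual timestamp of every connect() call
  }
  startHeartbeat() {
    this.clock.clearTimeout(this.heartbeatTimer);
    this.heartbeatTimer = this.clock.setTimeout(() => this.heartbeat(), HEARTBEAT_MS);
  }
  heartbeat() {
    if (this.connecting) {
      // Steps 3-4: the heartbeat fails mid-rebalance and schedules a reconnect.
      this.clock.setTimeout(() => this.connect(), RECONNECT_MS);
      return;
    }
    this.startHeartbeat(); // healthy heartbeat (step 7), schedule the next one
  }
  connect() {
    // Neither the pending heartbeat nor a pending reconnect is cancelled here.
    this.connectTimes.push(this.clock.now);
    this.connecting = true;
    this.clock.setTimeout(() => {
      this.connecting = false; // step 5
      this.startHeartbeat();   // step 6
    }, CONNECT_MS);
  }
}

function simulate(durationMs) {
  const clock = new FakeClock();
  const consumer = new ModelConsumer(clock);
  consumer.startHeartbeat();                       // step 1
  clock.setTimeout(() => consumer.connect(), 100); // step 2, e.g. a socket close
  clock.runUntil(durationMs);
  return consumer.connectTimes;
}

console.log(simulate(5000)); // → [ 100, 1300, 2450, 3650, 4800 ]
```

In this model a single out-of-band connect() at t=100 is enough to keep the consumer reconnecting roughly once a second for as long as the simulation runs.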

The problem seems to be kicked off by connect() being called from some mechanism other than a heartbeat failure (in this case a socket close event, which triggers a reconnect). Since that path does not cancel the heartbeat interval, the already-scheduled heartbeat can fire while the connection (rebalance) is still in progress. When that happens, the heartbeat receives error code 27 (REBALANCE_IN_PROGRESS) and triggers a rebalance, which schedules another connect for 1 second in the future. Assuming the first connect() call finishes in time, it starts a new heartbeat loop but does not clear the reconnect that is already scheduled. One second later that reconnect fires while the new heartbeat loop is still running, so the next heartbeat receives error code 27 as well and triggers yet another reconnect, and so on.
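
One possible direction for a fix, sketched under assumptions: route every connect through a single place that stops the heartbeat and drops any pending reconnect first. `ReconnectGuard` and all of its methods are hypothetical names for illustration. This is not kafka-node's API and not necessarily how the maintainers would fix it.

```javascript
// Hypothetical helper, NOT part of kafka-node. It owns both timers so that a
// connect attempt always starts with no heartbeat and no reconnect pending.
class ReconnectGuard {
  // `connectFn` performs the real connect; `timers` is injectable for testing.
  constructor(connectFn, timers = { setTimeout, clearTimeout, setInterval, clearInterval }) {
    this.connectFn = connectFn;
    this.timers = timers;
    this.heartbeatTimer = null;
    this.reconnectTimer = null;
  }

  stopHeartbeat() {
    if (this.heartbeatTimer !== null) {
      this.timers.clearInterval(this.heartbeatTimer);
      this.heartbeatTimer = null;
    }
  }

  cancelReconnect() {
    if (this.reconnectTimer !== null) {
      this.timers.clearTimeout(this.reconnectTimer);
      this.reconnectTimer = null;
    }
  }

  // Call from any trigger (heartbeat error, socket close, ...).
  connect() {
    this.stopHeartbeat();   // no heartbeat can fire mid-rebalance
    this.cancelReconnect(); // no stale reconnect can fire afterwards
    return this.connectFn();
  }

  // Call once the connect has completed.
  onConnected(heartbeatFn, intervalMs) {
    this.cancelReconnect(); // drop a reconnect queued while connecting
    this.stopHeartbeat();
    this.heartbeatTimer = this.timers.setInterval(heartbeatFn, intervalMs);
  }

  // Call when a heartbeat fails; keeps at most one reconnect pending.
  scheduleReconnect(delayMs) {
    this.cancelReconnect();
    this.reconnectTimer = this.timers.setTimeout(() => {
      this.reconnectTimer = null;
      this.connect();
    }, delayMs);
  }
}
```

The trade-off in this sketch is that a reconnect queued during a connect is discarded once that connect succeeds, which assumes the fresh join already covers whatever the queued reconnect was meant to fix.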

To simulate this problem, I added some code to the consumerGroup that calls connect() a few times one second apart. This is enough to throw it into a loop when running with two consumers against my local Kafka.

taplytics@8fd6b92

Just set process.env.FAKE_CONNECT=1 for one consumer and not the other.
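
For anyone who doesn't want to read the commit, the idea is roughly the following. This is only a sketch of the approach, not the code from the commit: `scheduleFakeConnects`, its options, and the default of three calls are assumptions, and the only thing it needs from the consumer is a connect() method.

```javascript
// Hypothetical reproduction hook; the actual change lives in the linked commit.
// When FAKE_CONNECT=1, it calls connect() a few times, one second apart,
// from outside the heartbeat loop.
function scheduleFakeConnects(consumerGroup, options = {}) {
  const {
    env = process.env,     // injectable so the hook can be tested
    times = 3,             // "a few times": 3 is an assumed value
    intervalMs = 1000,     // one second apart
    schedule = setTimeout, // injectable timer function
  } = options;

  if (env.FAKE_CONNECT !== '1') {
    return 0; // hook disabled for this consumer
  }
  for (let i = 1; i <= times; i += 1) {
    schedule(() => consumerGroup.connect(), i * intervalMs);
  }
  return times; // number of connect() calls scheduled
}
```

Calling something like `scheduleFakeConnects(consumerGroup)` after the consumer has joined the group, with the variable set for only one of the two processes, should be enough to trigger the out-of-band connects.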


thynson commented Jun 10, 2019

This looks like the issue I want to fix in PR #1281.

ajwootto (Author) commented

I don't think it's the same problem. In my case, there is only one topic involved.
