RabbitMQ 3.13.0 nodes with Consul peer discovery enabled fail to form a cluster #10760
…callbacks [Why] The Consul peer discovery backend needs to create a session before it can acquire a lock. This session is also required for nodes to discover each other. It must open the session before the `list_nodes/0` callback can return meaningful results. [How] The new `pre_discovery/0` and `post_discovery/1` callbacks are used to create and delete that session before the whole discover/lock/join process. Fixes #10760.
After more investigation, it looks like the problem is not the lock-related changes, but the fact that we register a node after the discovery step. This means that a node can't discover itself (among other members of a cluster). This was fine in RabbitMQ 3.12.x and before because we had far fewer checks in place; in particular, there was no requirement that the current node be among the discovered nodes. After some timeout, one node would give up on peer discovery and boot. As part of the boot, it would register itself, and the other nodes would then discover it. With the new checks in place in 3.13.x, we reject the discovered nodes list until the same timeout. At that point, all nodes boot as standalone nodes because they didn't discover anyone. One possible solution is to register first, then run peer discovery.
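To make the ordering problem concrete, here is a minimal toy model in Python. It is only a sketch: the registry is a plain in-memory set, every function name is hypothetical, and nothing here is RabbitMQ or Consul code. It also collapses the retry-until-timeout loop into a single discovery attempt per node.

```python
# Toy model only: none of these names are RabbitMQ or Consul APIs.

def discover(registry):
    """Return the nodes currently registered with the (hypothetical) discovery service."""
    return sorted(registry)


def accept_discovered(self_node, discovered):
    """Model of the 3.13.x-style check: the list must contain the current node."""
    return self_node in discovered


def boot_discover_then_register(self_node, registry):
    """Discovery runs first, so the node can never see itself in the result."""
    ok = accept_discovered(self_node, discover(registry))
    registry.add(self_node)  # registration comes too late to help the check
    return ok


def boot_register_then_discover(self_node, registry):
    """Proposed order: register first, then run discovery."""
    registry.add(self_node)
    return accept_discovered(self_node, discover(registry))


nodes = ["rabbit@a", "rabbit@b"]

registry_a = set()
result_old_order = [boot_discover_then_register(n, registry_a) for n in nodes]

registry_b = set()
result_new_order = [boot_register_then_discover(n, registry_b) for n in nodes]

print(result_old_order)  # every node rejects its discovered list
print(result_new_order)  # every node finds itself in the list
```

In this model, the discover-then-register order fails the check on every node regardless of how many peers are already registered, which matches the "all nodes boot as standalone" outcome described above.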
This issue should be fixed by #11045.
The conclusion of the discussion in #10661 is that the Consul peer discovery backend broke in RabbitMQ 3.13.0, following the rewrite of peer discovery in #9797. In this rewrite, the behavior changed significantly. In particular, the lock is only acquired after the discovery phase and only if the node executing peer discovery must join another node.
This breaks the Consul peer discovery backend because the lock also opened a Consul session. Discoverable nodes are those that have a session open.
This was ok in RabbitMQ 3.12.x and before because the steps were:
In RabbitMQ 3.13.0, the behavior is:
While looking at this, I see that the session is never explicitly closed. Another thing to fix, once I know how to improve peer discovery to allow the Consul backend to open a session early, separate from the locking.
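As a rough illustration of the direction described here (open the session early, independently of the lock, and close it explicitly), the following Python sketch uses an in-memory stand-in for Consul. `FakeConsul`, `discovery_session` and `run_peer_discovery` are hypothetical names invented for this example; they are not the Consul HTTP API or the RabbitMQ backend, and whether the real backend should close the session right after discovery is an assumption of the sketch.

```python
from contextlib import contextmanager


class FakeConsul:
    """In-memory stand-in for the service; not the Consul HTTP API."""

    def __init__(self):
        self.sessions = set()

    def open_session(self, node):
        self.sessions.add(node)

    def close_session(self, node):
        self.sessions.discard(node)

    def list_nodes(self):
        # Only nodes with an open session are discoverable.
        return sorted(self.sessions)


@contextmanager
def discovery_session(consul, node):
    """Open the session before discovery and always close it afterwards."""
    consul.open_session(node)  # plays the role of a pre-discovery hook
    try:
        yield
    finally:
        consul.close_session(node)  # plays the role of a post-discovery hook


def run_peer_discovery(consul, node):
    """Discovery happens while the session is open; no lock is involved here."""
    with discovery_session(consul, node):
        return consul.list_nodes()


consul = FakeConsul()
consul.open_session("rabbit@b")  # a peer that is already discoverable
seen = run_peer_discovery(consul, "rabbit@a")
print(seen)
print(consul.list_nodes())
```

The `try`/`finally` inside the context manager is the point of the sketch: the session is released even if discovery raises, so it is never left open implicitly.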