rabbit_peer_discovery: Rewrite the core logic #9797
Conversation
Force-pushed from d37aab6 to ed97c34
Force-pushed from 2246618 to 3eb3b34
After several iterations on the patch and more discussions with the team, I will restart from scratch. The idea is still to use the common `rabbit_db_cluster:join/2` code path. The current consensus is to have the following properties:
This pull request will cover properties from point 1 to point 4.
Sounds great, thanks for all the work you are putting into this! Perhaps this is implicitly covered by 4B, but one thing I'd add is that if the cluster size hint is greater than 1, a node should not end up forming a standalone cluster on its own.
This is not covered by 4B. I clarified my comment. The idea is that if the cluster size hint is 2 or more, peer discovery should expect the backend to return at least two nodes. This is to avoid a situation where an early query of the backend returns only a single node, which would then form a standalone cluster.
Do you suggest that peer discovery should wait for a list of discovered nodes of size N? Or should peer discovery finish, but later in the boot process, we pause until all N nodes joined the cluster? What about deployments where nodes are started sequentially? Or deployments where the peer discovery backend and the configured cluster hint are out-of-sync? Should the node fail to boot after some time, or should it boot with a warning?
If we wait for N nodes to appear, and N = the total number of nodes, a single node that fails to boot will cause issues. Also, Kubernetes generally assumes that nodes do not have any inter-dependencies. Hence the idea to re-evaluate cluster membership and retry after node boot, with a delay.
I guess I haven't thought that through before. Indeed, with sequentially booted nodes, each node would need to be fully booted for the next one to even start booting (e.g. that's the default behaviour for StatefulSets on Kubernetes; our Operator sets the startup policy to `Parallel`).
Small update to this: instead of using the configured target cluster size hint alone, I take the max between this value and the number of nodes returned by the backend. This is handy with the classic config backend for instance: the length of the static list of nodes acts as the cluster size hint.
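A minimal sketch of that computation (the function and variable names are illustrative, not the actual code in the patch):

```erlang
%% Illustrative sketch only: the effective cluster size hint is the maximum
%% of the configured target cluster size hint and the number of nodes the
%% backend returned.
effective_cluster_size_hint(ConfiguredHint, DiscoveredNodes)
  when is_integer(ConfiguredHint), is_list(DiscoveredNodes) ->
    erlang:max(ConfiguredHint, length(DiscoveredNodes)).
```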
Force-pushed from d35c752 to e39d53c
Force-pushed from e39d53c to 89bbfc3
Force-pushed from 686084b to 18be108
I finished another round of fixes. In particular:
We've run many tests on Kubernetes (kind and GKE) and we found no issues. It's certainly much better than the current `main`, which has multiple issues.
Once merged, we can ask the community for additional testing in other environments.
…_subset_of_nodes_coming_online`

[Why]
The testcase was broken as part of the work on Khepri (#7206): all nodes were started, making it an equivalent of the `successful_discovery` testcase.

[How]
We drop the first entry in the list of nodes given to `rabbit_ct_broker_helpers`. This way, it won't be started at all while still being listed in the classic config parameter.
[Why]
This work started as an effort to add peer discovery support to our Khepri integration. Indeed, as part of the task to integrate Khepri, we missed the fact that `rabbit_peer_discovery:maybe_create_cluster/1` was called from the Mnesia-specific code only. We knew about it though, because we hit many issues caused by the fact that `join_cluster` and peer discovery use different code paths to create a cluster.

To add support for Khepri, the first version of this patch was to move the call to `rabbit_peer_discovery:maybe_create_cluster/1` to `rabbit_db_cluster` instead of `rabbit_mnesia`. To achieve that, it made sense to unify the code and simply call `rabbit_db_cluster:join/2` instead of duplicating the work.

Unfortunately, doing so highlighted another issue: the way the node to cluster with was selected. Indeed, it could cause situations where multiple clusters are created instead of one, without resorting to out-of-band counter-measures, like a 30-second delay added in the Kubernetes operator (rabbitmq/cluster-operator#1156). This problem was even more frequent when we tried to unify the code path and call `join_cluster`.

After several iterations on the patch and even more discussions with the team, we decided to rewrite the algorithm to make node selection more robust and still use `rabbit_db_cluster:join/2` to create the cluster.

[How]
This commit is only about the rewrite of the algorithm. Calling peer discovery from `rabbit_db_cluster` instead of `rabbit_mnesia` (and thus making peer discovery work with Khepri) will be done in a follow-up commit.

We wanted the new algorithm to fulfill the following properties:

1. `rabbit_peer_discovery` should provide the ability to re-trigger it easily to re-evaluate the cluster. The new public API is `rabbit_peer_discovery:sync_desired_cluster/0`.

2. The selection of the node to join should be designed in a way that all nodes select the same one, regardless of the order in which they become available. The adopted solution is to sort the list of discovered nodes with the following criteria (in that order):

   1. the size of the cluster a discovered node is part of; sorted from bigger to smaller clusters
   2. the start time of a discovered node; sorted from older to younger nodes
   3. the name of a discovered node; sorted alphabetically

   The first node in that list will not join anyone and simply proceed with its boot process. Other nodes will try to join the first node.

3. To reduce the chance of incorrectly having multiple standalone nodes because the discovery backend returned only a single node, we want to apply the following constraints to the list of nodes after it is filtered and sorted (see property 2 above):

   * The list must contain `node()` (i.e. the node running peer discovery itself).
   * If the RabbitMQ cluster size hint is greater than 1, the list must have at least two nodes. The cluster size hint is the maximum between the configured target cluster size hint and the number of elements in the nodes list returned by the backend.

   If one of the constraints is not met, the entire peer discovery process is restarted after a delay.

4. The lock is acquired only to protect the actual join, not the discovery step where the backend is queried to get the list of peers. With the node selection described above, this lets the first node start without acquiring the lock.

5. The cluster membership views queried as part of the algorithm to sort the list of nodes will be used to detect additional clusters or standalone nodes that did not cluster correctly. These nodes will be asked to re-evaluate peer discovery to increase the chance of forming a single cluster.

6. After some delay, peer discovery will be re-evaluated to further eliminate the chances of having multiple clusters instead of one.

This commit covers properties from point 1 to point 4. Remaining properties will be the scope of additional pull requests after this one works.

If there is a failure at any point during discovery, filtering/sorting, locking or joining, the entire process is restarted after a delay. This is configured using the following parameters:

* cluster_formation.discovery_retry_limit
* cluster_formation.discovery_retry_interval

The default parameters were bumped to 30 retries with a delay of 1 second between each. The locking retries/interval parameters are not used by the new algorithm anymore.

There are extra minor changes that come with the rewrite:

* The configured backend is cached in a persistent term. The goal is to make sure we use the same backend throughout the entire process and when we call `maybe_unregister/0`, even if the configuration changed for whatever reason in between.
* `maybe_register/0` is called from `rabbit_db_cluster` instead of at the end of a successful peer discovery process. `rabbit_db_cluster` had to call `maybe_register/0` if the node was not virgin anyway. So make it simpler and always call it in `rabbit_db_cluster` regardless of the state of the node.
* `log_configured_backend/0` is gone. `maybe_init/0` can log the backend directly. There is no need to explicitly call another function for that.
* Messages are logged using `?LOG_*()` macros instead of the old `rabbit_log` module.
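A minimal sketch of the sort described in point 2 of the commit message above. Only the function name `sort_nodes_and_props` appears in the actual patch; the property map shape and the exact comparison below are illustrative assumptions:

```erlang
%% Illustrative only: sort candidate nodes so that every node computes the
%% same "seed node". Bigger clusters first, then older nodes, then by name.
sort_nodes_and_props(NodesAndProps) ->
    lists:sort(
      fun(#{cluster_size := SizeA, start_time := StartA, node := NodeA},
          #{cluster_size := SizeB, start_time := StartB, node := NodeB}) ->
              %% Negating the cluster size sorts larger clusters first; start
              %% times and node names then sort in ascending order.
              {-SizeA, StartA, NodeA} =< {-SizeB, StartB, NodeB}
      end, NodesAndProps).
```

Because the comparison is a pure function of each node's properties, every node that sorts the same discovered list ends up with the same head of the list, and therefore the same node to join.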
[Why]
We go through a temporary hidden node to query all other discovered peers' properties, instead of querying them directly. The reason is that we don't want Erlang to automatically connect all nodes together as a side effect (to form the full mesh network by default). If we let Erlang do that, we may interfere with the Feature flags controller, which is globally registered when it performs an operation. If all nodes become connected, it's possible that two or more globally registered controllers end up connected before they are ready to be clustered, and thus in the same "namespace". `global' will kill all but one of them.

[How]
By using a temporary intermediate hidden node, we ask Erlang not to connect everyone automatically.

V2: Set `-setcookie <cookie>` in the temporary hidden node's VM arguments if one was set in the RabbitMQ context. This is required if the Erlang cookie is not written to disk; it might be the case with some container deployments.
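A rough sketch of that approach using the standard `peer` module. The function name and option handling are assumptions, not the code from the patch; `do_query_node_props` is the name used in the diff quoted later in this conversation:

```erlang
%% Illustrative sketch: start a temporary *hidden* Erlang node, run the
%% property query there, and make sure the node is stopped even if the
%% remote call raises.
query_via_hidden_node(Nodes, Cookie) ->
    {ok, PeerPid, PeerNode} =
        peer:start(#{name => peer:random_name(),
                     %% `-hidden' keeps the temporary node out of the full
                     %% mesh; `-setcookie' is only needed when the cookie is
                     %% not on disk (e.g. some container deployments).
                     args => ["-hidden", "-setcookie", atom_to_list(Cookie)]}),
    try
        erpc:call(PeerNode, ?MODULE, do_query_node_props, [Nodes])
    after
        peer:stop(PeerPid)
    end.
```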
…nodes only

[Why]
A lock is acquired to protect against concurrent cluster joins. Some backends used to use the entire list of discovered nodes and used `global` as the lock implementation. This was a problem because a side effect was that all discovered Erlang nodes were connected to each other. This led to conflicts in the global process name registry and thus processes were killed randomly. This was the case with the feature flags controller for instance. Nodes run some feature flags operations early in boot, before they are ready to cluster or run the peer discovery code. But if another node was executing peer discovery, it could make all nodes connected. Unrelated feature flags controller instances were thus killed because of another node running peer discovery.

[How]
Acquiring a lock on the joining and the joined nodes only is enough to achieve the goal of protecting against concurrent joins. This is possible because of the new core logic which ensures the same node is used as the "seed node", i.e. all nodes will join the same node.

Therefore the API of `rabbit_peer_discovery_backend:lock/1` is changed to take a list of nodes (the two nodes mentioned above) instead of one node (which was the current node, so not that helpful in the first place). These backends also used to check if the current node was part of the discovered nodes, but that's already handled in the generic peer discovery code.

CAUTION: This brings a breaking change in the peer discovery backend API. The `Backend:lock/1` callback now takes a list of node names instead of a single node name. This list will contain the current node name.
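A hypothetical sketch of how the generic code might use the changed callback. The function name, the `disc` node type argument and the return shapes of `lock/1`/`unlock/1` are assumptions; only the fact that `lock/1` now receives the joining and joined nodes comes from the commit above:

```erlang
%% Sketch only: lock just the joining node and the selected seed node, then
%% join, then unlock, even if the join fails.
maybe_join_selected_node(Backend, SelectedNode) when SelectedNode =/= node() ->
    case Backend:lock([node(), SelectedNode]) of
        {ok, LockData} ->
            try
                rabbit_db_cluster:join(SelectedNode, disc)
            after
                Backend:unlock(LockData)
            end;
        not_supported ->
            rabbit_db_cluster:join(SelectedNode, disc);
        {error, _Reason} = Error ->
            Error
    end.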
[Why]
The group leader for all processes on the temporary hidden node is the calling process' group leader on the upstream node. When we use `erpc:call/4` (or the multicall equivalent) to execute code on one of the given nodes, the remotely executed code will also use the calling process' group leader by default.

We use this temporary hidden node to ensure the downstream node will not be connected to the upstream node. Therefore, we must change the group leader as well, otherwise any I/O from the downstream node will send a message to the upstream node's group leader and thus open a connection. This would defeat the entire purpose of this temporary hidden node.

[How]
To avoid this, we start a proxy process which we use as a group leader. This process sends all messages it receives to the group leader on the upstream node.

There is one caveat: the logger (local to the temporary hidden node) forwards log messages to the upstream logger (on the upstream node) only if the group leader of that message is a remote PID. Because we set a local PID, it stops forwarding log messages originating from that temporary hidden node. That's why we use `with_group_leader_proxy/2` to set the group leader to our proxy only around the use of `erpc`.

That's a lot just to keep logging working while not revealing the upstream node to the downstream node...
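A minimal sketch of such a proxy process. The function names are hypothetical; the `stop_proxy`/`proxy_stopped` messages mirror the snippet quoted later in this review:

```erlang
%% Sketch only: a group-leader proxy that relays every message it receives
%% to the real group leader on the upstream node.
start_group_leader_proxy(UpstreamGroupLeader) ->
    Parent = self(),
    spawn_link(fun() -> proxy_loop(Parent, UpstreamGroupLeader) end).

proxy_loop(Parent, UpstreamGroupLeader) ->
    receive
        stop_proxy ->
            %% Tell the caller the inbox was flushed before stopping.
            Parent ! proxy_stopped,
            ok;
        Message ->
            %% I/O protocol requests ({io_request, From, ReplyAs, Request})
            %% are forwarded as-is; the upstream group leader replies to the
            %% original requester directly.
            UpstreamGroupLeader ! Message,
            proxy_loop(Parent, UpstreamGroupLeader)
    end.
```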
…art-cluster`

[Why]
So far, we use the CLI to create the cluster after starting the individual nodes. It's faster to use peer discovery and it gives more exposure to the feature. Thus it will be easier to quickly test changes to the peer discovery subsystem with a simple `make start-cluster`.

[How]
We pass the classic configuration `cluster_nodes` application environment variable on all nodes' command line. This is only done for the `start-cluster` target, not `start-brokers`.
[Why]
If a node joins the selected node but the selected node's DB layer is not ready, the join will fail and the whole peer discovery process will restart (until the selected node is ready). That's fine, but scary messages are logged for a situation that is not really an actual error at this point.

[How]
While querying the properties of all discovered nodes, we also check if the DB layer is ready using `rabbit_db:is_init_finished/0`. We then use this property to determine if we can try to join or if we should wait and retry. This avoids a join which we know will fail eventually, and thus avoids the error messages.
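A rough sketch of that readiness gate. `rabbit_db:is_init_finished/0` is the call named in the commit; the wrapper functions, timeout, `disc` argument and retry tuple below are assumptions:

```erlang
%% Sketch only: only attempt the join when the selected node's DB layer
%% reports that its init phase has finished; otherwise let the caller retry
%% after a delay.
query_db_readiness(Node) ->
    try
        erpc:call(Node, rabbit_db, is_init_finished, [], 10000)
    catch
        _:_ -> false
    end.

maybe_join_when_ready(SelectedNode) ->
    case query_db_readiness(SelectedNode) of
        true  -> rabbit_db_cluster:join(SelectedNode, disc);
        false -> {retry, db_layer_not_ready}   %% caller retries after a delay
    end.
```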
... instead of `rabbit_mnesia`. [Why] We need it for both Mnesia and Khepri. So instead of calling it in `rabbit_khepri` too, let's manage this from `rabbit_db` directly.
Force-pushed from fef910b to e1261b9
%% Peer discovery may have been a no-op if it decided that all other nodes
%% should join this one. Therefore, we need to look at if this node is
%% still virgin and finish our use of Mnesia accordingly. In particular,
%% this second part crates all our Mnesia tables.
%% this second part creates all our Mnesia tables.
Addressed in #10073. Thanks!
maybe_init() ->
    Backend = backend(),
    ?LOG_DEBUG(
?LOG_INFO(
Is there another place where the configured / in-use backend is logged at an INFO level?
Addressed in #10073. Thanks!
Ret = erpc:call(Peer, ?MODULE, do_query_node_props, [Nodes]),
peer:stop(Pid),
Ret;
Ret = try
          erpc:call(Peer, ?MODULE, do_query_node_props, [Nodes])
      after
          peer:stop(Pid)
      end,
Ret;
Would this help ensure no dangling peers in the case of an `erpc:call` exception? Apologies if my suggested code doesn't compile OOTB 😬
Addressed in #10073. Thanks!
UpstreamGroupLeader = erlang:group_leader(),
true = erlang:group_leader(ProxyGroupLeader, self()),
Ret = Fun(),
true = erlang:group_leader(UpstreamGroupLeader, self()),
Ret.
UpstreamGroupLeader = erlang:group_leader(),
Ret = try
          true = erlang:group_leader(ProxyGroupLeader, self()),
          Fun()
      after
          true = erlang:group_leader(UpstreamGroupLeader, self())
      end,
Ret.
Overkill? Not sure.
Addressed in #10073. Thanks!
NodesAndProps2 = sort_nodes_and_props(NodesAndProps1),
%% Wait for the proxy group leader to flush its inbox.
ProxyGroupLeader ! stop_proxy,
receive proxy_stopped -> ok end,
Should there be a timeout here?
Addressed in #10073. Thanks!
Only one more suggestion.
    ThisNodeIsIncluded andalso HasEnoughNodes;
can_use_discovered_nodes(_DiscoveredNodes, []) ->
    ?LOG_DEBUG(
This should be LOG_INFO or WARNING
Addressed in #10073. Thanks!
Follow up to #9797. Submitted by: @lukebakken
[Why] They both can be useful for diagnosing issues before debug messages are enabled. Also, the second message tells the user to enable debug messages to get more details, but it was itself logged at the debug level :-) Follow up to #9797. Submitted by: @lukebakken
[Why] We start a temporary hidden node, then ask it to execute some code, then stop it. But if there is an exception in between, we leave the temporary hidden node running. Likewise when we mess with the group leader: if the function executed after overriding the group leader raises an exception, the group leader change becomes permanent and we may miss log messages. [How] We simply use a try/after block to ensure the temporary things are reverted at the end, regardless of success or failure. Follow up to #9797. Submitted by: @lukebakken
… leader ... to exit. [How] It should never be stuck obviously™. But in case planets are not aligned, we wait for its exit with a timeout. If it stays around, this is not the end of the world. Follow up to #9797. Submitted by: @lukebakken
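A small sketch of waiting for the proxy to exit with a timeout. The monitoring approach, function name and the 60-second value are assumptions, not necessarily what the follow-up commit does:

```erlang
%% Sketch only: ask the proxy group leader to stop, then wait for it to exit,
%% but give up after a timeout so the caller is never stuck.
stop_group_leader_proxy(ProxyGroupLeader) ->
    MRef = erlang:monitor(process, ProxyGroupLeader),
    ProxyGroupLeader ! stop_proxy,
    receive
        {'DOWN', MRef, process, ProxyGroupLeader, _Reason} ->
            ok
    after 60000 ->
            %% Not the end of the world if it stays around; just move on.
            erlang:demonitor(MRef, [flush]),
            ok
    end.
```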
[Why]
The Consul peer discovery backend needs to create a session before it can acquire a lock. This session is also required for nodes to discover each other. This session was created as part of the lock callback. However, after pull request #9797, the lock is only acquired if and when a node has to join another one, thus after the actual discovery phase. This broke Consul peer discovery because the discovery was performed before that Consul session was created.

[How]
We introduce two new callbacks, `pre_discovery/0` and `post_discovery/1`, to allow a backend to perform actions before and after the whole discover/lock/join process. To remain compatible with other peer discovery backends, the new callbacks are optional.
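A sketch of how such optional callbacks are typically declared in an Erlang behaviour. Only the callback names come from the commit above; the argument and return types, and the probing helper, are assumptions:

```erlang
%% In the behaviour-defining module (sketch): pre_discovery/0 runs before the
%% discover/lock/join process and may return backend-specific state which
%% post_discovery/1 receives afterwards for cleanup.
-callback pre_discovery() -> {ok, State :: term()} | {error, Reason :: term()}.
-callback post_discovery(State :: term()) -> ok.

%% Marking them optional keeps existing backends working unchanged.
-optional_callbacks([pre_discovery/0, post_discovery/1]).

%% In the generic code (sketch): probe for the callback before calling it,
%% assuming the backend module is already loaded.
maybe_pre_discovery(Backend) ->
    case erlang:function_exported(Backend, pre_discovery, 0) of
        true  -> Backend:pre_discovery();
        false -> {ok, undefined}
    end.
```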
Why

This work started as an effort to add peer discovery support to our Khepri integration. Indeed, as part of the task to integrate Khepri, we missed the fact that `rabbit_peer_discovery:maybe_create_cluster/1` was called from the Mnesia-specific code only. We knew about it though, because we hit many issues caused by the fact that `join_cluster` and peer discovery use different code paths to create a cluster.

To add support for Khepri, the first version of this patch was to move the call to `rabbit_peer_discovery:maybe_create_cluster/1` to `rabbit_db_cluster` instead of `rabbit_mnesia`. To achieve that, it made sense to unify the code and simply call `rabbit_db_cluster:join/2` instead of duplicating the work.

Unfortunately, doing so highlighted another issue: the way the node to cluster with was selected. Indeed, it could cause situations where multiple clusters are created instead of one, without resorting to out-of-band counter-measures, like a 30-second delay added in the Kubernetes operator (rabbitmq/cluster-operator#1156). This problem was even more frequent when we tried to unify the code path and call `join_cluster`.

After several iterations on the patch and even more discussions with the team, we decided to rewrite the algorithm to make node selection more robust and still use `rabbit_db_cluster:join/2` to create the cluster.

How
We wanted the new algorithm to fulfill the following properties:

1. `rabbit_peer_discovery` should provide the ability to re-trigger it easily to re-evaluate the cluster. The new public API is `rabbit_peer_discovery:sync_desired_cluster/0`.

2. The selection of the node to join should be designed in a way that all nodes select the same one, regardless of the order in which they become available. The adopted solution is to sort the list of discovered nodes with the following criteria (in that order):

   1. the size of the cluster a discovered node is part of; sorted from bigger to smaller clusters
   2. the start time of a discovered node; sorted from older to younger nodes
   3. the name of a discovered node; sorted alphabetically

   The first node in that list will not join anyone and simply proceed with its boot process. Other nodes will try to join the first node.

3. To reduce the chance of incorrectly having multiple standalone nodes because the discovery backend returned only a single node, we want to apply the following constraints to the list of nodes after it is filtered and sorted (see property 2 above):

   * The list must contain `node()` (i.e. the node running peer discovery itself).
   * If the RabbitMQ cluster size hint is greater than 1, the list must have at least two nodes. The cluster size hint is the maximum between the configured target cluster size hint and the number of elements in the nodes list returned by the backend.

   If one of the constraints is not met, the entire peer discovery process is restarted after a delay.

4. The lock is acquired only to protect the actual join, not the discovery step where the backend is queried to get the list of peers. With the node selection described above, this lets the first node start without acquiring the lock.

5. The cluster membership views queried as part of the algorithm to sort the list of nodes will be used to detect additional clusters or standalone nodes that did not cluster correctly. These nodes will be asked to re-evaluate peer discovery to increase the chance of forming a single cluster.

6. After some delay, peer discovery will be re-evaluated to further eliminate the chances of having multiple clusters instead of one.

This commit covers properties from point 1 to point 4. Remaining properties will be the scope of additional pull requests after this one works.
If there is a failure at any point during discovery, filtering/sorting, locking or joining, the entire process is restarted after a delay. This is configured using the following parameters:

* `cluster_formation.discovery_retry_limit`
* `cluster_formation.discovery_retry_interval`

The default parameters were bumped to 30 retries with a delay of 1 second between each. The locking retries/interval parameters are not used by the new algorithm anymore.
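For illustration, the defaults described above would look like this in `rabbitmq.conf` (assuming the interval is expressed in milliseconds, as with the other cluster formation retry settings):

```
# Defaults after this PR: 30 attempts, 1 second between each.
cluster_formation.discovery_retry_limit = 30
cluster_formation.discovery_retry_interval = 1000
```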
There are extra minor changes that come with the rewrite:

* The configured backend is cached in a persistent term. The goal is to make sure we use the same backend throughout the entire process and when we call `maybe_unregister/0`, even if the configuration changed for whatever reason in between.
* `maybe_register/0` is called from `rabbit_db_cluster` instead of at the end of a successful peer discovery process. `rabbit_db_cluster` had to call `maybe_register/0` if the node was not virgin anyway. So make it simpler and always call it in `rabbit_db_cluster` regardless of the state of the node.
* `log_configured_backend/0` is gone. `maybe_init/0` can log the backend directly. There is no need to explicitly call another function for that.
* Messages are logged using `?LOG_*()` macros instead of the old `rabbit_log` module.