rabbit_peer_discovery: Rewrite the core logic #9797
Conversation
Force-pushed from d37aab6 to ed97c34
Force-pushed from 2246618 to 3eb3b34
After several iterations on the patch and more discussions with the team, I will restart from scratch. The idea is still to use the common `rabbit_db_cluster:join/2` code path. The current consensus is to have the following properties:
This pull request will cover properties from point 1 to point 4.
Sounds great, thanks for all the work you are putting into this! Perhaps this is implicitly covered by 4B, but one thing I'd add is that if the cluster size hint is greater than 1, a node should not end up forming a standalone cluster on its own.
This is not covered by 4B. I clarified my comment. The idea is that if the cluster size hint is 2 or more, peer discovery should expect the backend to return at least two nodes. This is to avoid a situation where an early query of the backend returns only a single node, which would then form a standalone cluster.
Do you suggest that peer discovery should wait for a list of discovered nodes of size N? Or should peer discovery finish, but later in the boot process, we pause until all N nodes joined the cluster? What about deployments where nodes are started sequentially? Or deployments where the peer discovery backend and the configured cluster hint are out-of-sync? Should the node fail to boot after some time, or should it boot with a warning?
If we wait for N nodes to appear, and N = the total number of nodes, a single node that fails to boot will cause issues. Also, Kubernetes generally assumes that nodes do not have any inter-dependencies. Hence the idea to re-evaluate cluster membership and retry after node boot, with a delay.
I guess I haven't thought that through before. Indeed, with sequentially booted nodes, each node would need to be fully booted for the next one to even start booting (e.g. that's the default behaviour for StatefulSets on Kubernetes; our Operator sets the startup policy to `Parallel`).
Small update to this: instead of using the configured target cluster size hint alone, I take the max between this value and the number of nodes returned by the backend. This is handy with the classic config backend for instance: the length of the static list of nodes acts as the cluster size hint.
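A minimal sketch of that computation (the function and variable names are illustrative, not the actual code in the patch):

```erlang
%% Illustrative sketch only: the effective cluster size hint is the maximum
%% of the configured target cluster size hint and the number of nodes the
%% backend returned.
effective_cluster_size_hint(ConfiguredHint, DiscoveredNodes)
  when is_integer(ConfiguredHint), is_list(DiscoveredNodes) ->
    erlang:max(ConfiguredHint, length(DiscoveredNodes)).
```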
Force-pushed from d35c752 to e39d53c
Force-pushed from e39d53c to 89bbfc3
Force-pushed from 686084b to 18be108
I finished another round of fixes. In particular:
We've run many tests on Kubernetes (kind and GKE) and we found no issues. It's certainly much better than the current `main`, which has multiple issues.
Once merged, we can ask the community for additional testing in other environments.
…_subset_of_nodes_coming_online`

[Why]
The testcase was broken as part of the work on Khepri (#7206): all nodes were started, making it an equivalent of the `successful_discovery` testcase.

[How]
We drop the first entry in the list of nodes given to `rabbit_ct_broker_helpers`. This way, it won't be started at all while still being listed in the classic config parameter.
[Why]
This work started as an effort to add peer discovery support to our Khepri integration. Indeed, as part of the task to integrate Khepri, we missed the fact that `rabbit_peer_discovery:maybe_create_cluster/1` was called from the Mnesia-specific code only. We knew about it though, because we hit many issues caused by the fact that `join_cluster` and peer discovery use different code paths to create a cluster.

To add support for Khepri, the first version of this patch was to move the call to `rabbit_peer_discovery:maybe_create_cluster/1` to `rabbit_db_cluster` instead of `rabbit_mnesia`. To achieve that, it made sense to unify the code and simply call `rabbit_db_cluster:join/2` instead of duplicating the work.

Unfortunately, doing so highlighted another issue: the way the node to cluster with was selected. Indeed, it could cause situations where multiple clusters are created instead of one, without resorting to out-of-band counter-measures, like a 30-second delay added in the Kubernetes operator (rabbitmq/cluster-operator#1156). This problem was even more frequent when we tried to unify the code path and call `join_cluster`.

After several iterations on the patch and even more discussions with the team, we decided to rewrite the algorithm to make node selection more robust and still use `rabbit_db_cluster:join/2` to create the cluster.

[How]
This commit is only about the rewrite of the algorithm. Calling peer discovery from `rabbit_db_cluster` instead of `rabbit_mnesia` (and thus making peer discovery work with Khepri) will be done in a follow-up commit.

We wanted the new algorithm to fulfill the following properties:

1. `rabbit_peer_discovery` should provide the ability to re-trigger it easily to re-evaluate the cluster. The new public API is `rabbit_peer_discovery:sync_desired_cluster/0`.

2. The selection of the node to join should be designed in a way that all nodes select the same one, regardless of the order in which they become available. The adopted solution is to sort the list of discovered nodes with the following criteria (in that order):

   1. the size of the cluster a discovered node is part of; sorted from bigger to smaller clusters
   2. the start time of a discovered node; sorted from older to younger nodes
   3. the name of a discovered node; sorted alphabetically

   The first node in that list will not join anyone and simply proceed with its boot process. Other nodes will try to join the first node.

3. To reduce the chance of incorrectly having multiple standalone nodes because the discovery backend returned only a single node, we want to apply the following constraints to the list of nodes after it is filtered and sorted (see property 2 above):

   * The list must contain `node()` (i.e. the node running peer discovery itself).
   * If the RabbitMQ cluster size hint is greater than 1, the list must have at least two nodes. The cluster size hint is the maximum between the configured target cluster size hint and the number of elements in the nodes list returned by the backend.

   If one of the constraints is not met, the entire peer discovery process is restarted after a delay.

4. The lock is acquired only to protect the actual join, not the discovery step where the backend is queried to get the list of peers. With the node selection described above, this lets the first node start without acquiring the lock.

5. The cluster membership views queried as part of the algorithm to sort the list of nodes will be used to detect additional clusters or standalone nodes that did not cluster correctly. These nodes will be asked to re-evaluate peer discovery to increase the chance of forming a single cluster.

6. After some delay, peer discovery will be re-evaluated to further eliminate the chances of having multiple clusters instead of one.

This commit covers properties from point 1 to point 4. Remaining properties will be the scope of additional pull requests after this one works.

If there is a failure at any point during discovery, filtering/sorting, locking or joining, the entire process is restarted after a delay. This is configured using the following parameters:

* cluster_formation.discovery_retry_limit
* cluster_formation.discovery_retry_interval

The default parameters were bumped to 30 retries with a delay of 1 second between each. The locking retries/interval parameters are not used by the new algorithm anymore.

There are extra minor changes that come with the rewrite:

* The configured backend is cached in a persistent term. The goal is to make sure we use the same backend throughout the entire process and when we call `maybe_unregister/0`, even if the configuration changed for whatever reason in between.
* `maybe_register/0` is called from `rabbit_db_cluster` instead of at the end of a successful peer discovery process. `rabbit_db_cluster` had to call `maybe_register/0` if the node was not virgin anyway. So make it simpler and always call it in `rabbit_db_cluster` regardless of the state of the node.
* `log_configured_backend/0` is gone. `maybe_init/0` can log the backend directly. There is no need to explicitly call another function for that.
* Messages are logged using `?LOG_*()` macros instead of the old `rabbit_log` module.
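A minimal sketch of the sort described in point 2 of the commit message above. Only the function name `sort_nodes_and_props` appears in the actual patch; the property map shape and the exact comparison below are illustrative assumptions:

```erlang
%% Illustrative only: sort candidate nodes so that every node computes the
%% same "seed node". Bigger clusters first, then older nodes, then by name.
sort_nodes_and_props(NodesAndProps) ->
    lists:sort(
      fun(#{cluster_size := SizeA, start_time := StartA, node := NodeA},
          #{cluster_size := SizeB, start_time := StartB, node := NodeB}) ->
              %% Negating the cluster size sorts larger clusters first; start
              %% times and node names then sort in ascending order.
              {-SizeA, StartA, NodeA} =< {-SizeB, StartB, NodeB}
      end, NodesAndProps).
```

Because the comparison is a pure function of each node's properties, every node that sorts the same discovered list ends up with the same head of the list, and therefore the same node to join.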
[Why]
We go through a temporary hidden node to query all other discovered peers' properties, instead of querying them directly. The reason is that we don't want Erlang to automatically connect all nodes together as a side effect (to form the full mesh network by default). If we let Erlang do that, we may interfere with the Feature flags controller, which is globally registered when it performs an operation. If all nodes become connected, it's possible that two or more globally registered controllers end up connected before they are ready to be clustered, and thus in the same "namespace". `global' will kill all but one of them.

[How]
By using a temporary intermediate hidden node, we ask Erlang not to connect everyone automatically.

V2: Set `-setcookie <cookie>` in the temporary hidden node's VM arguments if one was set in the RabbitMQ context. This is required if the Erlang cookie is not written to disk; it might be the case with some container deployments.
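A rough sketch of that approach using the standard `peer` module. The function name and option handling are assumptions, not the code from the patch; `do_query_node_props` is the name used in the diff quoted later in this conversation:

```erlang
%% Illustrative sketch: start a temporary *hidden* Erlang node, run the
%% property query there, and make sure the node is stopped even if the
%% remote call raises.
query_via_hidden_node(Nodes, Cookie) ->
    {ok, PeerPid, PeerNode} =
        peer:start(#{name => peer:random_name(),
                     %% `-hidden' keeps the temporary node out of the full
                     %% mesh; `-setcookie' is only needed when the cookie is
                     %% not on disk (e.g. some container deployments).
                     args => ["-hidden", "-setcookie", atom_to_list(Cookie)]}),
    try
        erpc:call(PeerNode, ?MODULE, do_query_node_props, [Nodes])
    after
        peer:stop(PeerPid)
    end.
```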
…nodes only

[Why]
A lock is acquired to protect against concurrent cluster joins. Some backends used to use the entire list of discovered nodes and used `global` as the lock implementation. This was a problem because a side effect was that all discovered Erlang nodes were connected to each other. This led to conflicts in the global process name registry and thus processes were killed randomly. This was the case with the feature flags controller for instance. Nodes run some feature flags operations early in boot, before they are ready to cluster or run the peer discovery code. But if another node was executing peer discovery, it could make all nodes connected. Unrelated feature flags controller instances were thus killed because of another node running peer discovery.

[How]
Acquiring a lock on the joining and the joined nodes only is enough to achieve the goal of protecting against concurrent joins. This is possible because of the new core logic which ensures the same node is used as the "seed node", i.e. all nodes will join the same node.

Therefore the API of `rabbit_peer_discovery_backend:lock/1` is changed to take a list of nodes (the two nodes mentioned above) instead of one node (which was the current node, so not that helpful in the first place). These backends also used to check if the current node was part of the discovered nodes, but that's already handled in the generic peer discovery code.

CAUTION: This brings a breaking change in the peer discovery backend API. The `Backend:lock/1` callback now takes a list of node names instead of a single node name. This list will contain the current node name.
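A hypothetical sketch of how the generic code might use the changed callback. The function name, the `disc` node type argument and the return shapes of `lock/1`/`unlock/1` are assumptions; only the fact that `lock/1` now receives the joining and joined nodes comes from the commit above:

```erlang
%% Sketch only: lock just the joining node and the selected seed node, then
%% join, then unlock, even if the join fails.
maybe_join_selected_node(Backend, SelectedNode) when SelectedNode =/= node() ->
    case Backend:lock([node(), SelectedNode]) of
        {ok, LockData} ->
            try
                rabbit_db_cluster:join(SelectedNode, disc)
            after
                Backend:unlock(LockData)
            end;
        not_supported ->
            rabbit_db_cluster:join(SelectedNode, disc);
        {error, _Reason} = Error ->
            Error
    end.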
[Why]
The group leader for all processes on the temporary hidden node is the calling process' group leader on the upstream node. When we use `erpc:call/4` (or the multicall equivalent) to execute code on one of the given nodes, the remotely executed code will also use the calling process' group leader by default.

We use this temporary hidden node to ensure the downstream node will not be connected to the upstream node. Therefore, we must change the group leader as well, otherwise any I/O from the downstream node will send a message to the upstream node's group leader and thus open a connection. This would defeat the entire purpose of this temporary hidden node.

[How]
To avoid this, we start a proxy process which we use as a group leader. This process sends all messages it receives to the group leader on the upstream node.

There is one caveat: the logger (local to the temporary hidden node) forwards log messages to the upstream logger (on the upstream node) only if the group leader of that message is a remote PID. Because we set a local PID, it stops forwarding log messages originating from that temporary hidden node. That's why we use `with_group_leader_proxy/2` to set the group leader to our proxy only around the use of `erpc`.

That's a lot just to keep logging working while not revealing the upstream node to the downstream node...
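A minimal sketch of such a proxy process. The function names are hypothetical; the `stop_proxy`/`proxy_stopped` messages mirror the snippet quoted later in this review:

```erlang
%% Sketch only: a group-leader proxy that relays every message it receives
%% to the real group leader on the upstream node.
start_group_leader_proxy(UpstreamGroupLeader) ->
    Parent = self(),
    spawn_link(fun() -> proxy_loop(Parent, UpstreamGroupLeader) end).

proxy_loop(Parent, UpstreamGroupLeader) ->
    receive
        stop_proxy ->
            %% Tell the caller the inbox was flushed before stopping.
            Parent ! proxy_stopped,
            ok;
        Message ->
            %% I/O protocol requests ({io_request, From, ReplyAs, Request})
            %% are forwarded as-is; the upstream group leader replies to the
            %% original requester directly.
            UpstreamGroupLeader ! Message,
            proxy_loop(Parent, UpstreamGroupLeader)
    end.
```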
…art-cluster`

[Why]
So far, we use the CLI to create the cluster after starting the individual nodes. It's faster to use peer discovery and it gives more exposure to the feature. Thus it will be easier to quickly test changes to the peer discovery subsystem with a simple `make start-cluster`.

[How]
We pass the classic configuration `cluster_nodes` application environment variable on all nodes' command line. This is only done for the `start-cluster` target, not `start-brokers`.
[Why]
If a node joins the selected node but the selected node's DB layer is not ready, the join will fail and the whole peer discovery process will restart (until the selected node is ready). That's fine, but scary messages are logged for a situation that is not really an actual error at this point.

[How]
While querying the properties of all discovered nodes, we also check if the DB layer is ready using `rabbit_db:is_init_finished/0`. We then use this property to determine if we can try to join or if we should wait and retry. This avoids a join which we know will fail eventually, and thus avoids the error messages.
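A rough sketch of that readiness gate. `rabbit_db:is_init_finished/0` is the call named in the commit; the wrapper functions, timeout, `disc` argument and retry tuple below are assumptions:

```erlang
%% Sketch only: only attempt the join when the selected node's DB layer
%% reports that its init phase has finished; otherwise let the caller retry
%% after a delay.
query_db_readiness(Node) ->
    try
        erpc:call(Node, rabbit_db, is_init_finished, [], 10000)
    catch
        _:_ -> false
    end.

maybe_join_when_ready(SelectedNode) ->
    case query_db_readiness(SelectedNode) of
        true  -> rabbit_db_cluster:join(SelectedNode, disc);
        false -> {retry, db_layer_not_ready}   %% caller retries after a delay
    end.
```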
... instead of `rabbit_mnesia`. [Why] We need it for both Mnesia and Khepri. So instead of calling it in `rabbit_khepri` too, let's manage this from `rabbit_db` directly.
Force-pushed from fef910b to e1261b9
%% Peer discovery may have been a no-op if it decided that all other nodes
%% should join this one. Therefore, we need to look at if this node is
%% still virgin and finish our use of Mnesia accordingly. In particular,
%% this second part crates all our Mnesia tables.
%% this second part creates all our Mnesia tables.
Addressed in #10073. Thanks!
maybe_init() ->
    Backend = backend(),
    ?LOG_DEBUG(
?LOG_INFO(
Is there another place where the configured / in-use backend is logged at an INFO level?
Addressed in #10073. Thanks!
Ret = erpc:call(Peer, ?MODULE, do_query_node_props, [Nodes]),
peer:stop(Pid),
Ret;
Ret = try
          erpc:call(Peer, ?MODULE, do_query_node_props, [Nodes])
      after
          peer:stop(Pid)
      end,
Ret;
Would this help ensure no dangling peers in the case of an `erpc:call` exception? Apologies if my suggested code doesn't compile OOTB 😬
Addressed in #10073. Thanks!
UpstreamGroupLeader = erlang:group_leader(),
true = erlang:group_leader(ProxyGroupLeader, self()),
Ret = Fun(),
true = erlang:group_leader(UpstreamGroupLeader, self()),
Ret.
UpstreamGroupLeader = erlang:group_leader(),
Ret = try
          true = erlang:group_leader(ProxyGroupLeader, self()),
          Fun()
      after
          true = erlang:group_leader(UpstreamGroupLeader, self())
      end,
Ret.
Overkill? Not sure.
Addressed in #10073. Thanks!
NodesAndProps2 = sort_nodes_and_props(NodesAndProps1),
%% Wait for the proxy group leader to flush its inbox.
ProxyGroupLeader ! stop_proxy,
receive proxy_stopped -> ok end,
Should there be a timeout here?
Addressed in #10073. Thanks!
Only one more suggestion.
    ThisNodeIsIncluded andalso HasEnoughNodes;
can_use_discovered_nodes(_DiscoveredNodes, []) ->
    ?LOG_DEBUG(
This should be LOG_INFO or WARNING
Addressed in #10073. Thanks!
Follow up to #9797. Submitted by: @lukebakken
[Why] They both can be useful for diagnosing issues before debug messages are enabled. Also, the second message tells the user to enable debug messages to get more details, but it was itself logged at the debug level :-) Follow up to #9797. Submitted by: @lukebakken
[Why] We start a temporary hidden node, then ask it to execute some code, then stop it. But if there is an exception in between, we leave the temporary hidden node running. Likewise when we mess with the group leader: if the function executed after overriding the group leader raises an exception, the group leader change becomes permanent and we may miss log messages. [How] We simply use a try/after block to ensure the temporary things are reverted at the end, regardless of success or failure. Follow up to #9797. Submitted by: @lukebakken
… leader ... to exit. [How] It should never be stuck obviously™. But in case planets are not aligned, we wait for its exit with a timeout. If it stays around, this is not the end of the world. Follow up to #9797. Submitted by: @lukebakken
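A small sketch of waiting for the proxy to exit with a timeout. The monitoring approach, function name and the 60-second value are assumptions, not necessarily what the follow-up commit does:

```erlang
%% Sketch only: ask the proxy group leader to stop, then wait for it to exit,
%% but give up after a timeout so the caller is never stuck.
stop_group_leader_proxy(ProxyGroupLeader) ->
    MRef = erlang:monitor(process, ProxyGroupLeader),
    ProxyGroupLeader ! stop_proxy,
    receive
        {'DOWN', MRef, process, ProxyGroupLeader, _Reason} ->
            ok
    after 60000 ->
            %% Not the end of the world if it stays around; just move on.
            erlang:demonitor(MRef, [flush]),
            ok
    end.
```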
[Why]
The Consul peer discovery backend needs to create a session before it can acquire a lock. This session is also required for nodes to discover each other. This session was created as part of the lock callback. However, after pull request #9797, the lock is only acquired if and when a node has to join another one, thus after the actual discovery phase. This broke Consul peer discovery because the discovery was performed before that Consul session was created.

[How]
We introduce two new callbacks, `pre_discovery/0` and `post_discovery/1`, to allow a backend to perform actions before and after the whole discover/lock/join process. To remain compatible with other peer discovery backends, the new callbacks are optional.
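A sketch of how such optional callbacks are typically declared in an Erlang behaviour. Only the callback names come from the commit above; the argument and return types, and the probing helper, are assumptions:

```erlang
%% In the behaviour-defining module (sketch): pre_discovery/0 runs before the
%% discover/lock/join process and may return backend-specific state which
%% post_discovery/1 receives afterwards for cleanup.
-callback pre_discovery() -> {ok, State :: term()} | {error, Reason :: term()}.
-callback post_discovery(State :: term()) -> ok.

%% Marking them optional keeps existing backends working unchanged.
-optional_callbacks([pre_discovery/0, post_discovery/1]).

%% In the generic code (sketch): probe for the callback before calling it,
%% assuming the backend module is already loaded.
maybe_pre_discovery(Backend) ->
    case erlang:function_exported(Backend, pre_discovery, 0) of
        true  -> Backend:pre_discovery();
        false -> {ok, undefined}
    end.
```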
Why

This work started as an effort to add peer discovery support to our Khepri integration. Indeed, as part of the task to integrate Khepri, we missed the fact that `rabbit_peer_discovery:maybe_create_cluster/1` was called from the Mnesia-specific code only. We knew about it though, because we hit many issues caused by the fact that `join_cluster` and peer discovery use different code paths to create a cluster.

To add support for Khepri, the first version of this patch was to move the call to `rabbit_peer_discovery:maybe_create_cluster/1` to `rabbit_db_cluster` instead of `rabbit_mnesia`. To achieve that, it made sense to unify the code and simply call `rabbit_db_cluster:join/2` instead of duplicating the work.

Unfortunately, doing so highlighted another issue: the way the node to cluster with was selected. Indeed, it could cause situations where multiple clusters are created instead of one, without resorting to out-of-band counter-measures, like a 30-second delay added in the Kubernetes operator (rabbitmq/cluster-operator#1156). This problem was even more frequent when we tried to unify the code path and call `join_cluster`.

After several iterations on the patch and even more discussions with the team, we decided to rewrite the algorithm to make node selection more robust and still use `rabbit_db_cluster:join/2` to create the cluster.

How
We wanted the new algorithm to fulfill the following properties:

1. `rabbit_peer_discovery` should provide the ability to re-trigger it easily to re-evaluate the cluster. The new public API is `rabbit_peer_discovery:sync_desired_cluster/0`.

2. The selection of the node to join should be designed in a way that all nodes select the same one, regardless of the order in which they become available. The adopted solution is to sort the list of discovered nodes with the following criteria (in that order):

   1. the size of the cluster a discovered node is part of; sorted from bigger to smaller clusters
   2. the start time of a discovered node; sorted from older to younger nodes
   3. the name of a discovered node; sorted alphabetically

   The first node in that list will not join anyone and simply proceed with its boot process. Other nodes will try to join the first node.

3. To reduce the chance of incorrectly having multiple standalone nodes because the discovery backend returned only a single node, we want to apply the following constraints to the list of nodes after it is filtered and sorted (see property 2 above):

   * The list must contain `node()` (i.e. the node running peer discovery itself).
   * If the RabbitMQ cluster size hint is greater than 1, the list must have at least two nodes. The cluster size hint is the maximum between the configured target cluster size hint and the number of elements in the nodes list returned by the backend.

   If one of the constraints is not met, the entire peer discovery process is restarted after a delay.

4. The lock is acquired only to protect the actual join, not the discovery step where the backend is queried to get the list of peers. With the node selection described above, this lets the first node start without acquiring the lock.

5. The cluster membership views queried as part of the algorithm to sort the list of nodes will be used to detect additional clusters or standalone nodes that did not cluster correctly. These nodes will be asked to re-evaluate peer discovery to increase the chance of forming a single cluster.

6. After some delay, peer discovery will be re-evaluated to further eliminate the chances of having multiple clusters instead of one.

This commit covers properties from point 1 to point 4. Remaining properties will be the scope of additional pull requests after this one works.
If there is a failure at any point during discovery, filtering/sorting, locking or joining, the entire process is restarted after a delay. This is configured using the following parameters:

* `cluster_formation.discovery_retry_limit`
* `cluster_formation.discovery_retry_interval`

The default parameters were bumped to 30 retries with a delay of 1 second between each. The locking retries/interval parameters are not used by the new algorithm anymore.
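For illustration, the defaults described above would look like this in `rabbitmq.conf` (assuming the interval is expressed in milliseconds, as with the other cluster formation retry settings):

```
# Defaults after this PR: 30 attempts, 1 second between each.
cluster_formation.discovery_retry_limit = 30
cluster_formation.discovery_retry_interval = 1000
```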
There are extra minor changes that come with the rewrite:

* The configured backend is cached in a persistent term. The goal is to make sure we use the same backend throughout the entire process and when we call `maybe_unregister/0`, even if the configuration changed for whatever reason in between.
* `maybe_register/0` is called from `rabbit_db_cluster` instead of at the end of a successful peer discovery process. `rabbit_db_cluster` had to call `maybe_register/0` if the node was not virgin anyway. So make it simpler and always call it in `rabbit_db_cluster` regardless of the state of the node.
* `log_configured_backend/0` is gone. `maybe_init/0` can log the backend directly. There is no need to explicitly call another function for that.
* Messages are logged using `?LOG_*()` macros instead of the old `rabbit_log` module.