khepri_cluster: Use key metrics to determine if a Ra server is running #292

dumbbell · 2024-09-09T13:04:14Z

Why

The previous use of ra:ping/2 was too expensive. As khepri_cluster:is_store_running/1 has been used by mnesia_to_khepri:is_migration_finished/2 and mnesia_to_khepri:handle_fallback/5 since khepri_mnesia_migration 0.6.0, we saw a performance regression in RabbitMQ.

khepri_mnesia_migration was using a very basic and incomplete version of is_store_running() before. That's why the issue was not spotted earlier.

How

The new code uses ra:key_metrics/2, which simply checks whether the process is running and queries a few local counters. This is much faster because it does not send messages to the Ra server.
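
To illustrate the cost difference in general terms, here is a minimal, self-contained sketch contrasting a message round trip with a local registry lookup. It is not the Khepri or Ra code: the module `liveness_sketch`, its functions and the toy server loop are hypothetical and use only plain Erlang/OTP primitives.

```erlang
-module(liveness_sketch).
-export([ping/2, is_registered/1, demo/0]).

%% Message-based check: a full round trip through the target's mailbox.
%% The target must be scheduled and must handle the message to answer.
ping(Name, Timeout) ->
    Ref = make_ref(),
    try Name ! {ping, self(), Ref} of
        _ ->
            receive
                {pong, Ref} -> true
            after Timeout ->
                false
            end
    catch
        %% Sending to an unregistered name raises badarg.
        error:badarg -> false
    end.

%% Lookup-based check: reads the local process registry and sends nothing.
is_registered(Name) ->
    erlang:whereis(Name) =/= undefined.

%% Small demonstration with a toy server (hypothetical, for illustration).
demo() ->
    Name = liveness_sketch_server,
    Pid = spawn(fun loop/0),
    true = register(Name, Pid),
    true = ping(Name, 1000),
    true = is_registered(Name),
    true = unregister(Name),
    false = ping(Name, 1000),
    false = is_registered(Name),
    Pid ! stop,
    ok.

loop() ->
    receive
        {ping, From, Ref} ->
            From ! {pong, Ref},
            loop();
        stop ->
            ok
    end.
```

Calling `liveness_sketch:demo()` exercises both checks. The lookup variant never touches the target's mailbox, so a busy server does not delay the answer.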

codecov · 2024-09-09T13:08:47Z

Codecov Report

Attention: Patch coverage is 75.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 89.67%. Comparing base (c5d02f4) to head (b47b2fa).
Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/khepri_cluster.erl | 75.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #292   +/-   ##
=======================================
  Coverage   89.67%   89.67%           
=======================================
  Files          21       21           
  Lines        3187     3187           
=======================================
  Hits         2858     2858           
  Misses        329      329

| Flag | Coverage Δ | |
|---|---|---|
| erlang-25 | `88.79% <75.00%> (-0.04%)` | ⬇️ |
| erlang-26 | `89.42% <75.00%> (-0.19%)` | ⬇️ |
| erlang-27 | `89.61% <75.00%> (+0.03%)` | ⬆️ |
| os-ubuntu-latest | `89.67% <75.00%> (ø)` | |
| os-windows-latest | `89.64% <75.00%> (+0.09%)` | ⬆️ |

kjnilsson · 2024-09-09T13:15:09Z

Would it not be even faster to avoid creating that map (which isn't used) and just do an erpc with a whereis call on the process name?

dumbbell · 2024-09-09T13:30:14Z

I thought about that but that’s a bit too much knowledge of the Ra implementation. As far as Khepri is concerned, the server ID is opaque.

ra:key_metrics() is the closest I could find to a bare rpc + whereis without adding knowledge of the Ra internals into Khepri.

the-mikedavis · 2024-09-09T13:53:20Z

We could consider adding a ra:whereis/1 function that accepts a ra:server_id(). That would leave the implementation details to Ra and hopefully be flexible should we decide to change how processes are registered (also see rabbitmq/ra#12).

I tried this branch in RabbitMQ. The change to use whereis/1 would be better (especially since we're querying only the local node), but as-is on this branch the is_store_running check becomes barely noticeable when looking at a flamegraph.

dumbbell · 2024-09-09T14:56:32Z

What about ra:is_server_running/{1,2}? I would prefer such a function because the intent and the purpose are clear. ra:whereis() exposes an internal detail, and it's not obvious that you should use this API rather than another one to determine whether the Ra server is running.

@the-mikedavis: I see you approved this pull request. Do you believe we should merge it as is instead of waiting for a more appropriate API in Ra?

the-mikedavis · 2024-09-09T15:13:35Z

Since we need this to restore performance in RabbitMQ and the performance is already pretty good I think we should merge this as-is. I was thinking that we should make changes to Ra as a follow-up.

michaelklishin · 2024-09-09T15:26:51Z

@the-mikedavis I agree with you. We can always optimize things some more in 4.0.x and 4.x.

kjnilsson · 2024-09-10T09:22:54Z

The ra:server_id() type is a public type (not opaque) so you can depend on the structure of the type just fine.

the-mikedavis · 2024-09-11T13:54:37Z

Is it guaranteed to always be the registered name of the process (at least until the next major version)? Not that we would want to do this, but Ra could change to use the first element in the tuple to look up the actual process in an ETS table without changing the type of ra:server_id().

kjnilsson · 2024-09-11T14:04:49Z

> Is it guaranteed to always be the registered name of the process (at least until the next major version)? Not that we would want to do this, but Ra could change to use the first element in the tuple to look up the actual process in an ETS table without changing the type of ra:server_id().

If we decided to introduce a means of dynamically discovering the remote pid of a server, it would be encoded as an explicit new server_id() type case which newly declared servers would have to opt into. I don't see the current approach ever disappearing. We don't even have a reasonable way to do dynamic discovery, or any ideas of how to do it well (without depending on some other consensus system).

dumbbell added the enhancement New feature or request label Sep 9, 2024

dumbbell added this to the v0.15.1 milestone Sep 9, 2024

dumbbell requested a review from the-mikedavis September 9, 2024 13:04

dumbbell self-assigned this Sep 9, 2024

dumbbell force-pushed the optimize-is_store_running branch from eb270ef to b47b2fa Compare September 9, 2024 13:14

the-mikedavis approved these changes Sep 9, 2024

dumbbell marked this pull request as ready for review September 9, 2024 16:59

dumbbell merged commit e31b236 into main Sep 9, 2024
12 checks passed

dumbbell deleted the optimize-is_store_running branch September 9, 2024 17:00

the-mikedavis mentioned this pull request Sep 11, 2024

Use erlang:whereis/1 for checking if a store is running #296

Merged
