Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

khepri_cluster: Use key metrics to determine if a Ra server is running #292

Merged
merged 1 commit into from
Sep 9, 2024

Conversation

dumbbell
Copy link
Member

@dumbbell dumbbell commented Sep 9, 2024

Why

The previous use of ra:ping/2 was too expensive.

As
khepri_cluster:is_store_running/1 is now used by mnesia_to_khepri:is_migration_finished/2 and
mnesia_to_khepri:hande_fallback/5 since khepri_mnesia_migration 0.6.0, we saw a regression in performance in RabbitMQ because of this.

khepri_mnesia_migration was using a very basic and incomplete version of is_store_running() before. That's why the issue was not spotted earlier.

How

The new code uses ra:key_metrics/2 which simply checks if the process us running and query a few local counters. This is way faster because it does not send messages to the Ra server.

@dumbbell dumbbell added the enhancement New feature or request label Sep 9, 2024
@dumbbell dumbbell added this to the v0.15.1 milestone Sep 9, 2024
@dumbbell dumbbell self-assigned this Sep 9, 2024
Copy link

codecov bot commented Sep 9, 2024

Codecov Report

Attention: Patch coverage is 75.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 89.67%. Comparing base (c5d02f4) to head (b47b2fa).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/khepri_cluster.erl 75.00% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #292   +/-   ##
=======================================
  Coverage   89.67%   89.67%           
=======================================
  Files          21       21           
  Lines        3187     3187           
=======================================
  Hits         2858     2858           
  Misses        329      329           
Flag Coverage Δ
erlang-25 88.79% <75.00%> (-0.04%) ⬇️
erlang-26 89.42% <75.00%> (-0.19%) ⬇️
erlang-27 89.61% <75.00%> (+0.03%) ⬆️
os-ubuntu-latest 89.67% <75.00%> (ø)
os-windows-latest 89.64% <75.00%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

[Why]
The previous use of `ra:ping/2` was too expensive.

As
`khepri_cluster:is_store_running/1` is now used by
`mnesia_to_khepri:is_migration_finished/2` and
`mnesia_to_khepri:hande_fallback/5` since khepri_mnesia_migration 0.6.0,
we saw a regression in performance in RabbitMQ because of this.

khepri_mnesia_migration was using a very basic and incomplete version of
`is_store_running()` before. That's why the issue was not spotted
earlier.

[How]
The new code uses `ra:key_metrics/2` which simply checks if the process
us running and query a few local counters. This is way faster because it
does not send messages to the Ra server.
@kjnilsson
Copy link

would it not be even faster to avoid creating that map (that isn't used) and just do an erpc with a whereis call on the process name?

@dumbbell
Copy link
Member Author

dumbbell commented Sep 9, 2024

I thought about that but that’s a bit too much knowledge of the Ra implementation. As far as Khepri is concerned, the server ID is opaque.

ra:key_metrics() is the closest I could find to a bare rpc + whereis without adding knowledge of the Ra internals into Khepri.

@the-mikedavis
Copy link
Member

the-mikedavis commented Sep 9, 2024

We could consider adding a ra:whereis/1 function that accepts a ra:server_id(). That would leave the implementation details to Ra and hopefully be flexible should we decide to change how processes are registered (also see rabbitmq/ra#12).

I tried this branch in RabbitMQ and the change to use whereis/1 would be better (especially since we're querying only the local node) but as-is on this branch the is_store_running check becomes barely noticeable looking at a flamegraph.

@dumbbell
Copy link
Member Author

dumbbell commented Sep 9, 2024

What about ra:is_server_running/{1,2}? I would prefer such a function because the intent and the purpose is clear. ra:whereis() share an internal detail and it’s not straightforward you should use this API over another one to determine if the Ra server is running or not.

@the-mikedavis: I see you approved this pull request. Do you believe we should merge it as is instead of waiting for a more appropriate API in Ra?

@the-mikedavis
Copy link
Member

Since we need this to restore performance in RabbitMQ and the performance is already pretty good I think we should merge this as-is. I was thinking that we should make changes to Ra as a follow-up.

@michaelklishin
Copy link
Member

@the-mikedavis I agree with you. We can always optimize things some more in 4.0.x and 4.x.

@dumbbell dumbbell marked this pull request as ready for review September 9, 2024 16:59
@dumbbell dumbbell merged commit e31b236 into main Sep 9, 2024
12 checks passed
@dumbbell dumbbell deleted the optimize-is_store_running branch September 9, 2024 17:00
@kjnilsson
Copy link

The ra:server_id() type is a public type (not opaque) so you can depend on the structure of the type just fine.

@the-mikedavis
Copy link
Member

Is it guaranteed to always be the registered name of the process (at least until the next major version)? Not that we would want to do this but Ra could change to use the first element in the tuple to lookup the actual process in an ETS table without changing the type of ra:server_id().

@kjnilsson
Copy link

Is it guaranteed to always be the registered name of the process (at least until the next major version)? Not that we would want to do this but Ra could change to use the first element in the tuple to lookup the actual process in an ETS table without changing the type of ra:server_id().

If we decided to introduce a means of dynamically discovering the remote pid of a server it would be encoded as an explicit new server_id() type case which newly declared servers would have to opt into to use. I don't see the current approach disappear ever. We don't even have a reasonable way to do dynamic discovery or any ideas of how to do it will (without depending on some other consensus system).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants