cumulus/minimal-node: added prometheus metrics for the RPC client #5572
Signed-off-by: Iulian Barbu <[email protected]>
looks good, nits only.
"polkadot_parachain_relay_chain_rpc_interface",
"Tracks stats about cumulus relay chain RPC interface",
),
buckets: prometheus::exponential_buckets(0.001, 4.0, 9)
Just curious: any reason for these values?
Am I right that the buckets will be the following?
0.001
0.004
0.016
0.064
0.256
1.024
4.096
16.384
65.536
Correct for the buckets.
I picked the bucket split after seeing it used for other request timers in the code (related to substrate libp2p), although there isn't a particular relationship between them. Ideally we would take some preliminary measurements first and then pick the buckets. I am OK with these values though, because they correspond to some rough back-of-the-envelope measurements for exchanging data over the network (e.g. USA -> EU -> USA ~ 150 ms). I think the higher buckets (e.g. >1s) can be considered extreme, and might correspond to very infrequent outliers (assuming the network runs fine most of the time).
Later edit: regarding my note on the higher buckets above, it depends, as usual. We implicitly measure the time it takes for the external RPC to process the request and return the response, so I think the higher buckets can hold some of the observations, depending on the nature of the RPC call.
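The bucket list confirmed above can be reproduced with a few lines of plain Rust. This is a minimal sketch, assuming only the parameters quoted in the diff (start 0.001, factor 4.0, count 9); the function below is a hand-rolled stand-in written for illustration, not the `prometheus` crate's own implementation.

```rust
/// Illustrative stand-in for an exponential bucket generator: returns
/// `count` upper bounds, starting at `start` and multiplying by `factor`
/// at each step. Not the `prometheus` crate's code.
fn exponential_buckets(start: f64, factor: f64, count: usize) -> Vec<f64> {
    assert!(start > 0.0 && factor > 1.0 && count >= 1);
    let mut bounds = Vec::with_capacity(count);
    let mut next = start;
    for _ in 0..count {
        bounds.push(next);
        next *= factor;
    }
    bounds
}
```

Calling `exponential_buckets(0.001, 4.0, 9)` yields nine bounds in seconds, from 1 ms up to roughly 65.5 s, so the ~150 ms round trip mentioned above would be counted under the 0.256 bound.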
@@ -127,6 +128,7 @@ pub async fn build_minimal_relay_chain_node_with_rpc(
let client = cumulus_relay_chain_rpc_interface::create_client_and_start_worker(
relay_chain_url,
task_manager,
polkadot_config.prometheus_registry(),
I am thinking about whether this is better registered on the parachain side.
Technically, this is doing relay chain calls. However, the calls to the relay chain are basically made only on collators, and the code lives in cumulus. As a user I would probably expect these metrics to be attached to the parachain prometheus endpoint.
I looked for reasons to keep the metrics on the relay chain side but couldn't find any. To be honest, it is still not clear to me what kind of metrics should fall under the relay chain prometheus exporter (its concerns on the collator side are not crystal clear in my mind yet), but for our case I agree that these metrics seem more relevant to the internals of how parachains work, so it would be useful to expose them in the "parachain" prometheus exporter.
Changed this here: 309ae23.
To be honest, it is still not clear to me what kind of metrics should fall under the relay chain prometheus exporter
So when you don't use --relay-chain-rpc-urls, the collator will start an embedded node, which is basically the same as a polkadot full node that you start with the polkadot binary.
Of course, this internal node will export all its metrics; there is a whole bunch of them defined in substrate. So in these scenarios it can make sense to monitor the relay chain node separately. This RPC functionality, however, is in the end parachain-specific and therefore goes to the parachain prometheus endpoint.
...and rename for clarity Signed-off-by: Iulian Barbu <[email protected]>
nice!
Description
When we start a node with connections to external RPC servers (as a minimal node), we lack metrics around how many individual calls we're doing to the remote RPC servers and their duration. This PR adds metrics that measure durations of each RPC call made by the minimal nodes, and implicitly how many calls there are.
Closes #5409
Closes #5689
Integration
Node operators should be able to track minimal node metrics and decide on appropriate actions based on how they interpret them. The added metrics can be observed by curl'ing the prometheus metrics endpoint for the parachain (it was changed from the relay chain based on the review). The metrics are exposed under the relay_chain_rpc_interface namespace, renamed from polkadot_parachain_relay_chain_rpc_interface (I realized lining up parachain_relay_chain in the same metric might be confusing :). Excerpt from the curl:
Review Notes
The way we measure durations/hits is based on the HistogramVec struct, which allows us to collect timings for each RPC client method called from the minimal node. It can be extended to measure the RPCs against other dimensions too (status codes, response sizes, etc.). The timing is measured at the level of the relay-chain-rpc-interface, in the RelayChainRpcClient struct's method request_tracing, a single entry point for all RPC requests done through the relay-chain-rpc-interface. The request durations will fall under exponential buckets described by start 0.001, factor 4 and count 9.
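To make the per-method bucketing concrete, here is a minimal sketch of the idea using only the Rust standard library. It is not the PR's code and not the `prometheus` crate's `HistogramVec`; the `MethodHistogram` type, its methods, and the RPC method names used with it are hypothetical, chosen only to illustrate how one label (the method name) plus the bucket parameters above gives both durations and call counts.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical, simplified stand-in for a histogram labelled by RPC method.
/// Counts are stored per bucket (not cumulative), with one extra overflow
/// slot for observations above the largest bound.
struct MethodHistogram {
    bounds: Vec<f64>,
    counts: HashMap<String, Vec<u64>>,
}

impl MethodHistogram {
    /// Builds exponential upper bounds: start, start*factor, start*factor^2, ...
    fn new(start: f64, factor: f64, count: usize) -> Self {
        let bounds = (0..count).map(|i| start * factor.powi(i as i32)).collect();
        Self { bounds, counts: HashMap::new() }
    }

    /// Records one call of `method` and returns the index of the bucket used.
    fn observe(&mut self, method: &str, elapsed: Duration) -> usize {
        let secs = elapsed.as_secs_f64();
        let slots_len = self.bounds.len() + 1;
        // First bucket whose upper bound is >= the observation, else overflow.
        let slot = self
            .bounds
            .iter()
            .position(|bound| secs <= *bound)
            .unwrap_or(self.bounds.len());
        let slots = self
            .counts
            .entry(method.to_string())
            .or_insert_with(|| vec![0; slots_len]);
        slots[slot] += 1;
        slot
    }

    /// Total number of recorded calls for `method` (the implicit hit count).
    fn calls(&self, method: &str) -> u64 {
        self.counts.get(method).map_or(0, |slots| slots.iter().sum())
    }
}
```

In this sketch a single wrapper around every request would measure the elapsed time and call `observe` with the method name, which is why timing at one entry point is enough to cover all RPC calls. Extending the key from a method name to a tuple (for example method plus status) would model the extra dimensions mentioned above.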