Duplicated heavy operation in TransportClusterHealthAction.executeHealth #88303

AlexanderGunnarssonMW · 2022-07-06T09:42:49Z

Description

We are running a large-scale Elasticsearch 7 cluster. When profiling the master node, we observed that about 50% of the CPU time is spent in TransportClusterHealthAction.executeHealth. Half of this time is spent in the validateRequest method and the other half in getResponse. These two methods perform the same heavy operation twice: they build a ClusterHealthResponse first to validate it and then again to return it. For our workload, removing the duplication would save about 25% of the overall CPU time spent by the master node and presumably make cluster health requests respond about twice as fast.

I have already written the small code change needed to optimize this (just a new nullable method validateRequestAndGetResponse). Would you like me to submit this as a PR?
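To illustrate the idea, here is a minimal sketch of the build-once pattern. It is not the code from the actual change and not Elasticsearch code: the class HealthOnceSketch, the Health type, buildHealth, isValid, and the "enough active shards" rule are all hypothetical stand-ins, and it assumes that "nullable" means the combined method returns null when validation fails.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified stand-ins; none of this is Elasticsearch code.
final class HealthOnceSketch {

    // Stand-in for ClusterHealthResponse.
    static final class Health {
        final int activeShards;
        Health(int activeShards) { this.activeShards = activeShards; }
    }

    // Counts how often the expensive computation runs.
    static final AtomicInteger builds = new AtomicInteger();

    // Stand-in for the expensive clusterHealth(...) computation.
    static Health buildHealth(int activeShards) {
        builds.incrementAndGet();
        return new Health(activeShards);
    }

    // Hypothetical validation rule: enough active shards.
    static boolean isValid(Health health, int waitForActiveShards) {
        return health.activeShards >= waitForActiveShards;
    }

    // Duplicated pattern: one build to validate, a second build to respond.
    static Health validateThenGetResponse(int activeShards, int waitForActiveShards) {
        if (!isValid(buildHealth(activeShards), waitForActiveShards)) {
            return null;
        }
        return buildHealth(activeShards);
    }

    // Combined pattern: build once, return it if valid, otherwise null.
    static Health validateRequestAndGetResponse(int activeShards, int waitForActiveShards) {
        Health health = buildHealth(activeShards);
        return isValid(health, waitForActiveShards) ? health : null;
    }

    public static void main(String[] args) {
        validateThenGetResponse(10, 5);
        int duplicated = builds.getAndSet(0);
        validateRequestAndGetResponse(10, 5);
        int combined = builds.get();
        // prints "2 builds vs 1 build"
        System.out.println(duplicated + " builds vs " + combined + " build");
    }
}
```

The counter only makes the saving visible: for a request that passes validation, the combined method does half the work of the two-step version.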


elasticmachine · 2022-07-12T16:17:13Z

Pinging @elastic/es-distributed (Team:Distributed)

henningandersen · 2022-08-12T10:03:13Z

@AlexanderGunnarssonMW thanks for your interest in improving Elasticsearch. We'd like to understand the issue a bit better. Can you provide a few more details to help us grasp the problem?

1. ES version (including minor version).
2. Number of indices and shards in the cluster.
3. Frequency of calling _cluster/health (and perhaps why, if it is high).
4. Size of the master node (how many CPUs).

Also, if you can produce a flamegraph using async profiler for this, it would help us understand where the problem lies.

The direction you chose here, reducing the number of times we do the calculation, may not be the final direction we'd want to pursue. We hope that there are more fundamental improvements we can make to optimize the calculation of the response.

AlexanderGunnarssonMW · 2022-08-29T08:10:30Z

1. Version 7.17.3
2. 4600 indices, 42k shards
3. See below
4. The master has 16 vCPUs (recently increased from 4 for other reasons)

Just last week, we identified by chance one of our services that made an unnecessary number of cluster health requests. Fixing this reduced the master node's average CPU usage from roughly 24% to 6%. When running a smaller master node (before reducing health checks), typical usage was around 50% with occasional spikes.

So we don't consider this a problem for us anymore, but the calculation is still unnecessarily done twice, as seen in the attached flame graph. In red I have marked the duplicated calculation, first done by validateRequest and then by getResponse, which calls clusterHealth again with the same parameters. Each marked red block occupies roughly 30% of the flame graph's width. So to be clear, there is logically only one calculation, but it is physically executed twice, unnecessarily.

AlexanderGunnarssonMW · 2022-08-29T08:20:50Z

Here is a PR on our forked repository to show a solution in code: meltwater#3

pxsalehi · 2024-06-04T08:06:04Z

"we don't consider this a problem for us anymore"

I'll close this then. It doesn't seem that we plan to work on the mentioned optimization. Feel free to reopen if necessary.

AlexanderGunnarssonMW added the >enhancement and needs:triage labels on Jul 6, 2022

DJRickyB added the :Distributed Coordination/Cluster Coordination label and removed the needs:triage label on Jul 12, 2022

elasticmachine added the Team:Distributed label on Jul 12, 2022

pxsalehi closed this as completed Jun 4, 2024
