Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicated heavy operation in TransportClusterHealthAction.executeHealth #88303

Closed
AlexanderGunnarssonMW opened this issue Jul 6, 2022 · 5 comments
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement Team:Distributed Meta label for distributed team (obsolete)

Comments

@AlexanderGunnarssonMW
Copy link

Description

We are running a large scale Elasticsearch 7 cluster. When performing profiling on the master node, we observed that about 50% of the CPU time is spent in TransportClusterHealthAction.executeHealth. Half of this time is spent in the validateRequest method, and the other half in getResponse. These two methods are doing the same heavy operation twice, building a ClusterHealthResponse first to validate it and then to return it. Optimizing this would for our workload save 25% of the overall CPU time spent by the master node and presumably make cluster health requests respond twice as fast.

I have already written the small code change needed to optimize this (just a new nullable method validateRequestAndGetResponse). Would you like me to submit this as a PR?

@AlexanderGunnarssonMW AlexanderGunnarssonMW added >enhancement needs:triage Requires assignment of a team area label labels Jul 6, 2022
@DJRickyB DJRickyB added :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. and removed needs:triage Requires assignment of a team area label labels Jul 12, 2022
@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team (obsolete) label Jul 12, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@henningandersen
Copy link
Contributor

@AlexanderGunnarssonMW thanks for your interest in improving Elasticsearch. We'd like to understand the issue a bit better, can you provide a few more details to help us grasp the problem?

  • ES version (including minor version).
  • Number of indices and shards in the cluster
  • Frequency of calling _cluster/health (and perhaps why, if it is high).
  • Size of the master node (how many cpus)

Also, if you can produce a flamegraph using async profiler for this, it would help us understand where the problem lies.

The direction you chose here in reducing the number of times we do the calculation may not be the final direction we'd want to pursue. We hope that there is more fundamental improvements we can make to optimize the calculation of the response.

@AlexanderGunnarssonMW
Copy link
Author

  • Version 7.17.3
  • 4600 indices, 42k shards
  • See below
  • The master has 16 vCPU's (recently increased from 4 for other reasons)

Just last week, we by chance identified one of our services that did an unnecessary amount of cluster health requests. Fixing this reduced the master nodes average cpu usage from roughly 24% to 6%. When running a smaller master node (before reducing health checks), typical usage was around 50% with occasional spikes.

So we don't consider this a problem for us anymore, but it is still the case that the calculation is unnecessarily done twice, as seen in the attached flame graph. In red I have marked the duplicated calculation, first done by validateRequest and then by getResponse, calling clusterHealth with the same parameters again. The width one marked red block occupies on the flame graph is roughly 30%. So to be clear, the logical calculation is just done once (but physically twice, unnecessarily).

flamegraph

@AlexanderGunnarssonMW
Copy link
Author

Here is a PR on our forked repository to show a solution in code: meltwater#3

@pxsalehi
Copy link
Member

pxsalehi commented Jun 4, 2024

we don't consider this a problem for us anymore

I'll close this then. It doesn't seem that we plan to work on the mentioned optimization. Feel free to reopen if necessary.

@pxsalehi pxsalehi closed this as completed Jun 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement Team:Distributed Meta label for distributed team (obsolete)
Projects
None yet
Development

No branches or pull requests

5 participants