Duplicated heavy operation in TransportClusterHealthAction.executeHealth #88303
Comments
Pinging @elastic/es-distributed (Team:Distributed)
@AlexanderGunnarssonMW thanks for your interest in improving Elasticsearch. We'd like to understand the issue a bit better. Can you provide a few more details to help us grasp the problem?
Also, if you can produce a flamegraph using async profiler for this, it would help us understand where the problem lies. The direction you chose here, reducing the number of times we do the calculation, may not be the final direction we'd want to pursue. We hope that there are more fundamental improvements we can make to optimize the calculation of the response.
Just last week, we happened to identify one of our services that was making an unnecessary number of cluster health requests. Fixing this reduced the master nodes' average CPU usage from roughly 24% to 6%. When running a smaller master node (before reducing health checks), typical usage was around 50% with occasional spikes. So we don't consider this a problem for us anymore, but it is still the case that the calculation is unnecessarily done twice, as seen in the attached flame graph. In red I have marked the duplicated calculation, first done by
Here is a PR on our forked repository to show a solution in code: meltwater#3
I'll close this then. It doesn't seem that we plan to work on the mentioned optimization. Feel free to reopen if necessary.
Description
We are running a large-scale Elasticsearch 7 cluster. When profiling the master node, we observed that about 50% of the CPU time is spent in TransportClusterHealthAction.executeHealth. Half of this time is spent in the validateRequest method, and the other half in getResponse. These two methods perform the same heavy operation twice, building a ClusterHealthResponse first to validate it and then to return it. For our workload, optimizing this would save 25% of the overall CPU time spent by the master node and presumably make cluster health requests respond twice as fast.

I have already written the small code change needed to optimize this (just a new nullable method validateRequestAndGetResponse). Would you like me to submit this as a PR?
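
To make the duplication concrete, here is a minimal Java sketch of the pattern described above. It is not the Elasticsearch source. HealthSketch, Response, buildResponse, the int[] stand-in for cluster state, and the minActiveShards condition are all hypothetical simplifications. Only the three method names (validateRequest, getResponse, validateRequestAndGetResponse) come from this issue, and returning null when the condition is not met is an assumption about what "nullable" means here.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified illustration only; none of these types are the real Elasticsearch classes.
class HealthSketch {
    // Hypothetical stand-in for ClusterHealthResponse.
    static final class Response {
        final int activeShards;
        Response(int activeShards) { this.activeShards = activeShards; }
    }

    // Counts how often the expensive build runs, so the duplication is visible.
    static final AtomicInteger buildCount = new AtomicInteger();

    // Stand-in for the heavy computation: walks the whole (fake) cluster state.
    static Response buildResponse(int[] shardsPerIndex) {
        buildCount.incrementAndGet();
        int total = 0;
        for (int n : shardsPerIndex) {
            total += n;
        }
        return new Response(total);
    }

    // Pattern described in the issue: validation builds a response and discards it...
    static boolean validateRequest(int[] state, int minActiveShards) {
        return buildResponse(state).activeShards >= minActiveShards;
    }

    // ...and the caller then builds the same response again to return it.
    static Response getResponse(int[] state) {
        return buildResponse(state);
    }

    // Combined variant: build once, return the response if valid, else null (assumed semantics).
    static Response validateRequestAndGetResponse(int[] state, int minActiveShards) {
        Response response = buildResponse(state);
        return response.activeShards >= minActiveShards ? response : null;
    }

    public static void main(String[] args) {
        int[] state = {3, 5};

        // Two-step path: validate, then fetch.
        if (validateRequest(state, 8)) {
            getResponse(state);
        }
        System.out.println("builds, two-step path: " + buildCount.getAndSet(0));

        // Combined path: a single call does both.
        validateRequestAndGetResponse(state, 8);
        System.out.println("builds, combined path: " + buildCount.get());
    }
}
```

In this sketch the two-step path runs the expensive build twice per successful health check, while the combined method runs it once. Whether the real saving matches the 25% estimate above depends on how much of executeHealth is actually spent in that build.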