Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature:Configurable summarySetBlackList to set monitoring blacklist. #2165

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

zichen-gan
Copy link

@zichen-gan zichen-gan commented May 16, 2024

Issue:

  • The current ZooKeeper (ZK) cluster frequently experiences excessively long Full Garbage Collection (FGC) events (averaging once a week) that last over 10 seconds, leading to occasional, unexpected leader elections in the ZK cluster.
  • The cluster currently fails to collect mntr monitoring data.

Root Cause:

  • The ZK cluster uses the direct child nodes under the root directory (/) as Namespaces, At the Namespace level, the following have been implemented:
  • Monitoring of MIN, CNT, AVG, MAX, and SUM request counts, which can be obtained via mntr monitoring.
  • monitoring for Namespaces only accounts for additions and not removals, tasks are created directly under the root directory (/), with tasks older than seven days being cleaned daily.
  • This leads to the following phenomenon: The number of Namespaces stored in ZK nodes (157,376) is significantly greater than the actual number of Namespaces in the ZK cluster (5,856).
image image

Solution:
Provide a new version of the ZK server that supports configurable Namespace monitoring.

/**
* Base implementation of {@link MetricsProvider}.<br>
* 提供默认的实现,当前包含summarySet黑名单功能。
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use English, instead of Chinese

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👌

@maoling
Copy link
Member

maoling commented May 21, 2024

I cannot understand the intention of this PR clearly. From the description, I guess

  • it wants to solve the metric issues which is inaccurate or too expensive? which metrics(write_per_namespace,read_per_namespace)?
  • Add a blacklist can completely solve this problem?

@zichen-gan
Copy link
Author

it wants to solve metrics(write_per_namespace,read_per_namespace) which is too expensive.

  • / stores task IDs, which will be frequently added and deleted.metrics(write_per_namespace,read_per_namespace) only increase and do not decrease.
  • we do not require monitoring for write_per_namespace and read_per_namespace
    Hope to support write_per_namespace, read_per_namespace blacklist。
    @maoling

@zichen-gan
Copy link
Author

Perhaps it's a more universal and maintainable solution to remove the metric monitoring when the / path node is deleted. Regarding this PR, I will make some further modifications.

@zichen-gan zichen-gan marked this pull request as draft May 24, 2024 01:43
@zichen-gan
Copy link
Author

#2171
@maoling
Added a new test MetricSetRemoveTest to help understand this bug, and I found that fixing it using the previous blacklist method was much easier.
bug fix:At the time of ZK service delete root path, the audit and control was remove the NameSpace Metric.

@maoling
Copy link
Member

maoling commented Jun 18, 2024

Yes, I understand now. When you frequently create and delete numerous znodes at the first level, the write_per_namespace and read_per_namespace metrics will indeed expand and consume excessive memory.

A possible workaround is to add an additional layer under the root directory of the tree (e.g., /taskName/taskId-[1,n]).

While deleting the SummarySet key might work, it seems somewhat counterintuitive.

Here are a few alternative suggestions (just my personal opinion):

  1. Limit the size of a SummarySet. For example, when the number of keys exceeds 1000, prevent any further additions.
  2. Implementing a blacklist is a good idea, as we have other metrics that are costly and can be resource-intensive to query at times.
    2.1. Place the blacklist in the zoo.cfg file within the metrics section, instead of using a SystemProperty.
    2.2. Ensure the blacklist supports all metric types, not just SummarySet.
## Metrics Providers
metricsProvider.exportJvmInfo=true
# Support multiple and prefix matching
metricsProvider.blackList=prep_processor*,sync_processor*

@zichen-gan
Copy link
Author

I am the ZK cluster provider, so it is difficult for businesses to modify the way they use it.😂

The reason why i don't want to use blacklists is that each business that developer them incorrectly needs to configure a blacklist.So i deleting the SummarySet key to make ZK developer unconsciously.

I'm sorry, I don't understand why this is counterintuitive. @maoling

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants