Use SUCCESS RATE as the criteria for health assessment #7

Open
fogfish opened this issue Mar 8, 2024 · 1 comment

fogfish commented Mar 8, 2024

As a user, I want to reduce the number of false-positive reports so that my workflow is not interrupted by noise.

For example, the rule engine uses only absolute values to decide success or failure.

Should(rules.OsCpuUtil.Below(40.0, 60.0))

As a consequence, even if a single sampled value is above the threshold, the utility reports an error. This causes false positives.
Using the percentage of successful samples as the criterion would be helpful. In the example below, it would be nice to report success if the success rate is over 60%.

STATUS       %            MIN            AVG            MAX	 ID CHECK
FAILED  32.14%           0.03          13.33         250.61	 D3: storage i/o latency

fogfish self-assigned this Apr 5, 2024

fogfish commented Apr 25, 2024

The success rate is calculated as a percentile of the tAvg value, which is what actually controls the status. Instead of adding an extra config parameter, we should find better ways of educating users about configuration. Visualising raw metrics would be better.
