Add useful Operator metrics #690

janhoy · 2024-03-06T12:48:04Z

Since #307 we now have generic Go metrics, like memory, GC, threads etc.

Let's add application-level metrics for the operator itself that could be useful for Grafana dashboards and alerts. Suggestions:

- Gauge of the number of currently managed CRD instances for SolrClouds, SolrBackups, SolrPrometheusExporters
- Gauge for CRDs currently in a failure state
- Reconcile stats
  - Successful vs failed reconcile events, broken down by kind of event
  - Size of pending operations in the reconcile queue (if there is such a thing)
- Operation stats
  - For each operation type (install, upgrade, delete, backup etc.), counts and status

Goal would be to make a simple Grafana board where you can filter on namespace etc. to see raw operator health, and see at a glance whether some operations are in a failure state. Further filter by labels like SolrCloud name, so you can see the number of failed operations against each cluster, and when they happened.
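
To make the suggestions above a bit more concrete, here is a rough, dependency-free Go sketch of the data model: one gauge-like family for managed resources and one counter-like family for reconcile results, each keyed by labels such as namespace and SolrCloud name. The metric names (`solr_operator_managed_resources`, `solr_operator_reconcile_total`), the label names, and the `Metric`/`Labels` types are all hypothetical and not taken from the operator code. A real implementation would more likely use the Prometheus client library that already backs the existing Go metrics, not a hand-rolled type like this.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// Labels identifies one time series, e.g. namespace + SolrCloud name.
type Labels map[string]string

// key renders the labels in sorted order so equal label sets map to the same series.
func (l Labels) key() string {
	names := make([]string, 0, len(l))
	for n := range l {
		names = append(names, n)
	}
	sort.Strings(names)
	parts := make([]string, 0, len(names))
	for _, n := range names {
		parts = append(parts, fmt.Sprintf("%s=%q", n, l[n]))
	}
	return strings.Join(parts, ",")
}

// Metric is a hypothetical labelled series family, used here for both gauges and counters.
type Metric struct {
	mu     sync.Mutex
	name   string
	values map[string]float64
}

// NewMetric creates an empty family with the given (assumed) metric name.
func NewMetric(name string) *Metric {
	return &Metric{name: name, values: map[string]float64{}}
}

// Set overwrites the value of one series (gauge semantics).
func (m *Metric) Set(l Labels, v float64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.values[l.key()] = v
}

// Inc adds one to a series (counter semantics).
func (m *Metric) Inc(l Labels) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.values[l.key()]++
}

// Get returns the current value of one series, or 0 if it was never touched.
func (m *Metric) Get(l Labels) float64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.values[l.key()]
}

// Expose renders all series as "name{labels} value" lines, sorted by label key.
func (m *Metric) Expose() string {
	m.mu.Lock()
	defer m.mu.Unlock()
	keys := make([]string, 0, len(m.values))
	for k := range m.values {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&b, "%s{%s} %g\n", m.name, k, m.values[k])
	}
	return b.String()
}

func main() {
	// Hypothetical metric names; a reconciler would update these on each pass.
	managed := NewMetric("solr_operator_managed_resources")
	reconciles := NewMetric("solr_operator_reconcile_total")

	managed.Set(Labels{"kind": "SolrCloud", "namespace": "search"}, 3)
	reconciles.Inc(Labels{"kind": "SolrCloud", "namespace": "search", "name": "example", "result": "error"})

	fmt.Print(managed.Expose())
	fmt.Print(reconciles.Expose())
}
```

With `namespace`, `name`, and `result` as labels, a dashboard could filter by namespace and then drill down to failed reconciles per SolrCloud, which is the kind of view described above. Per-resource labels do increase series cardinality, so that trade-off would need checking against real cluster sizes.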
