monitoring: Configure KSM & cluster dashboard #4116

Closed

@darkowlzz darkowlzz commented Jul 31, 2023

NOTE: These changes have been added to the new flux2-monitoring-example repository in fluxcd/flux2-monitoring-example#1.

The motivation behind this change is to move responsibility for custom resource metrics from the individual controllers to kube-state-metrics (KSM). The controllers will continue to export reconciliation metrics and other controller-specific metrics, while all metrics about resources that are available on the CRD are exported using KSM. This lets users configure custom metrics to suit their needs without any changes to the controllers. It also gives static resources, which have no reconciler to report resource readiness metrics, the same monitoring capabilities. Examples are the HelmRepository source resource in OCI mode, the Alert and Provider notification resources, and upcoming API resources that may not be backed by reconcilers.

Update kube-prometheus-stack helm release values to configure kube-state-metrics and use kube-state-metrics to collect gotk resource state metrics.

  • Configure kube-state-metrics to run in custom resource state only mode. In this mode, it only watches custom resources. Also, pass empty collectors as extra args so that the core resources are not all passed as resources to watch.
  • Running kube-state-metrics in custom resource state only mode makes the default Grafana dashboards useless, so disable them.
  • Add kube-state-metrics configuration that grants it the RBAC permissions to list and watch the Flux custom resources.
  • Also, configure custom resource state for each of the Flux custom resources using an Info type metric called gotk_resource_info. KSM issues a warning if an Info type metric doesn't have the _info suffix. These metrics always have the value 1. This works well for resource state metrics, as a zero value would mean that the resource doesn't exist, in which case the resource has been deleted.
  • Update the cluster dashboard panels to use gotk_resource_info in the queries.
    • Only the following panels have been updated
      • Cluster Reconcilers
      • Failing Reconcilers
      • Cluster reconciliation readiness
      • Kubernetes Manifests Sources
      • Failing Sources
      • Source acquisition readiness
    • The panels have been updated so that they also work with static resources that don't have any status. By default, such static resources are assumed to be in a Ready state. Resources are seen as failed only when the ready value is false.
    • The queries have been changed to the Instant type in order to show the current data instead of the result of the past 15 minutes. This shows more accurate resource data as the resource metrics change.
    • The Stat visualizers have been updated to use zero as the default value when there's no data. This prevents them from showing "No data" when there are no objects. The previous configuration depended on stale metrics from the controllers and on deleted conditions to show a zero value when objects got deleted. With the fixes in the controller metrics that remove stale metrics, this no longer works, so a default is set in order to show a zero value for these stats.
    • The $namespace variable has been updated to refer to exported_namespace from gotk_resource_info.
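
The readiness rules described for the panels can be sketched in a few lines of Python. This is only an illustration of the semantics, not the dashboard queries themselves: `panel_stats` is a hypothetical helper, and it assumes each series is represented as a dict of its labels, with the `ready` label missing or empty for static resources that have no status.

```python
def panel_stats(resources):
    """Summarise a list of label dicts, one per gotk_resource_info series.

    Hypothetical helper that mirrors the panel rules from the description.
    """
    total = len(resources)
    # A resource counts as failing only when ready is explicitly "False".
    # A missing or empty ready label (static resource) is treated as Ready.
    failing = sum(1 for labels in resources if labels.get("ready") == "False")
    # With no resources at all, both numbers are 0 rather than "no data".
    return {"total": total, "failing": failing}


if __name__ == "__main__":
    series = [
        {"customresource_kind": "Kustomization", "name": "podinfo", "ready": "True"},
        {"customresource_kind": "GitRepository", "name": "test-1", "ready": "False"},
        {"customresource_kind": "Provider", "name": "slack"},  # static, no status
    ]
    print(panel_stats(series))
    print(panel_stats([]))
```

An empty input yields zero for both values, which corresponds to the zero default set on the Stat visualizers.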

Sample resource metrics from KSM:

# HELP gotk_resource_info The current state of a GitOps Toolkit resource.
# TYPE gotk_resource_info info
gotk_resource_info{customresource_group="kustomize.toolkit.fluxcd.io",customresource_kind="Kustomization",customresource_version="v1",exported_namespace="default",name="podinfo",ready="True"} 1
gotk_resource_info{customresource_group="source.toolkit.fluxcd.io",customresource_kind="GitRepository",customresource_version="v1",exported_namespace="monitoring",name="test-2",ready="True"} 1
gotk_resource_info{customresource_group="source.toolkit.fluxcd.io",customresource_kind="GitRepository",customresource_version="v1",exported_namespace="default",name="test-1",ready="True"} 1
gotk_resource_info{customresource_group="source.toolkit.fluxcd.io",customresource_kind="HelmChart",customresource_version="v1beta2",exported_namespace="default",name="podinfo",ready="True"} 1
gotk_resource_info{customresource_group="source.toolkit.fluxcd.io",customresource_kind="HelmRepository",customresource_version="v1beta2",exported_namespace="default",name="podinfo",ready="True"} 1
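
To inspect output like the sample above outside of Grafana, the lines can be parsed with a small script. The following is a minimal sketch, assuming label values contain no escaped quotes; `parse_sample` and `count_by_kind` are hypothetical helper names, not part of KSM or Flux.

```python
import re
from collections import Counter

# Matches `metric_name{labels} value`; comment lines starting with '#' do not match.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
# Matches one `key="value"` pair (assumes no escaped quotes inside the value).
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Return (metric name, labels dict, float value), or None for other lines."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        return None
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


def count_by_kind(text):
    """Count gotk_resource_info series per customresource_kind."""
    counts = Counter()
    for line in text.splitlines():
        parsed = parse_sample(line)
        if parsed is None:
            continue
        name, labels, _value = parsed
        if name == "gotk_resource_info":
            counts[labels.get("customresource_kind", "")] += 1
    return counts
```

Because every series has the value 1, counting series per kind gives the number of existing resources of that kind.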

The dashboard is nearly identical to the existing dashboard, with slight differences:
image

Kube-state-metrics custom-resource state metrics docs: https://github.com/kubernetes/kube-state-metrics/blob/main/docs/customresourcestate-metrics.md

@darkowlzz darkowlzz added the area/monitoring Monitoring related issues and pull requests label Jul 31, 2023
@stefanprodan stefanprodan mentioned this pull request Aug 2, 2023

Signed-off-by: Sunny <[email protected]>