✨ Integrate kube-state-metrics and CR config into tilt. #7095

Merged
merged 1 commit into from
Sep 1, 2022

Conversation

chrischdi
Member

@chrischdi chrischdi commented Aug 19, 2022

What this PR does / why we need it:

  • Adds kube-state-metrics to hack/observability which uses the kube-state-metrics helm chart and adds configuration for CAPI CRs
  • Integrates into tilt

Currently not added metrics:

  • *_labels

TODOs:

  • Use the new kube-state-metrics release as soon as it is published
  • Split CR configuration into multiple files if "Prevent definition of same gvk in custom resource configuration" (kubernetes/kube-state-metrics#1810) gets merged
    • Won't implement for now. The split could be done in the future once KSM provides a CR for configuration purposes.
  • Verify if all expected metrics are configured (except _labels)
  • update proposal for *_spec_paused and *_annotation_paused metrics
  • Provide docs
  • review kcp and md *_status_replicas_* metrics and unify

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Part of #6458

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 19, 2022
@k8s-ci-robot
Contributor

@chrischdi: This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Aug 19, 2022
Tiltfile (outdated, resolved)
Member

@sbueringer sbueringer left a comment

Just a few initial questions. Please just resolve if findings are just related to WIP.

Tiltfile (outdated, resolved)
hack/observability/capi-state-metrics/crd-config.yaml (outdated, resolved)
hack/observability/capi-state-metrics/kustomization.yaml (outdated, resolved)
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 24, 2022
Comment on lines 1 to 4
image:
  # TODO(chrischdi): drop to default to released image as soon as it got released
  # repository: registry.k8s.io/kube-state-metrics/kube-state-metrics
  # tag: v2.5.0
  # custom image which includes changes > v2.5.0 required for custom resource metrics
  repository: chrischdi/kube-state-metrics
  tag: v2.5.0-fe097b6c
  pullPolicy: IfNotPresent

Member Author

TODO: remove as soon as ksm release + helm chart release is out

Suggested change
image:
  # TODO(chrischdi): drop to default to released image as soon as it got released
  # repository: registry.k8s.io/kube-state-metrics/kube-state-metrics
  # tag: v2.5.0
  # custom image which includes changes > v2.5.0 required for custom resource metrics
  repository: chrischdi/kube-state-metrics
  tag: v2.5.0-fe097b6c
  pullPolicy: IfNotPresent

@chrischdi chrischdi force-pushed the obs-ksm branch 2 times, most recently from 1b69a11 to d87d785 Compare August 25, 2022 08:29
@chrischdi
Member Author

Updated, and I think it's ready for review.

The next kube-state-metrics release should happen by tomorrow, so the last open point could be addressed afterwards.

Happy to receive feedback 👍

Some first examples of what could be queried:

  • Get all resources which are paused (either by annotation or spec):

    count(label_replace((
      capi_cluster_annotation_paused == 1 or capi_cluster_spec_paused == 1
      or capi_kubeadmcontrolplane_annotation_paused == 1
      or capi_machinedeployment_annotation_paused == 1 or capi_machinedeployment_spec_paused == 1
      or capi_machineset_annotation_paused == 1
      or capi_machine_annotation_paused == 1
      or capi_machinehealthcheck_annotation_paused == 1
    ), "object", "$1", "__name__", "capi_([a-z]+)_.*")) by (name,namespace,cluster_name,object)
    
  • First example Grafana dashboard (Disclaimer: far from perfect! WIP, but this is the current state):
    [image]

    {
      "__inputs": [
        {
          "name": "DS_PROMETHEUS",
          "label": "Prometheus",
          "description": "",
          "type": "datasource",
          "pluginId": "prometheus",
          "pluginName": "Prometheus"
        }
      ],
      "__elements": [],
      "__requires": [
        {
          "type": "grafana",
          "id": "grafana",
          "name": "Grafana",
          "version": "8.4.5"
        },
        {
          "type": "panel",
          "id": "piechart",
          "name": "Pie chart",
          "version": ""
        },
        {
          "type": "datasource",
          "id": "prometheus",
          "name": "Prometheus",
          "version": "1.0.0"
        },
        {
          "type": "panel",
          "id": "table",
          "name": "Table",
          "version": ""
        }
      ],
      "annotations": {
        "list": [
          {
            "builtIn": 1,
            "datasource": "-- Grafana --",
            "enable": true,
            "hide": true,
            "iconColor": "rgba(0, 211, 255, 1)",
            "name": "Annotations & Alerts",
            "target": {
              "limit": 100,
              "matchAny": false,
              "tags": [],
              "type": "dashboard"
            },
            "type": "dashboard"
          }
        ]
      },
      "editable": true,
      "fiscalYearStartMonth": 0,
      "graphTooltip": 0,
      "id": null,
      "iteration": 1661416446097,
      "links": [],
      "liveNow": false,
      "panels": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "Lists of all Clusters",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "thresholds"
              },
              "custom": {
                "align": "auto",
                "displayMode": "auto"
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": [
              {
                "matcher": {
                  "id": "byName",
                  "options": "cluster_name"
                },
                "properties": [
                  {
                    "id": "custom.width",
                    "value": 341
                  }
                ]
              }
            ]
          },
          "gridPos": {
            "h": 8,
            "w": 14,
            "x": 0,
            "y": 0
          },
          "id": 3,
          "options": {
            "footer": {
              "fields": "",
              "reducer": [
                "sum"
              ],
              "show": false
            },
            "showHeader": true,
            "sortBy": []
          },
          "pluginVersion": "8.4.5",
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "exemplar": false,
              "expr": "count(capi_cluster_info{namespace=~\"$namespace\",name=~\"$cluster\"}) by (namespace,name,uid,topology_class,topology_version)",
              "format": "table",
              "instant": true,
              "interval": "",
              "legendFormat": "",
              "refId": "A"
            }
          ],
          "title": "Clusters",
          "transformations": [
            {
              "id": "seriesToColumns",
              "options": {
                "byField": "uid"
              }
            },
            {
              "id": "organize",
              "options": {
                "excludeByName": {
                  "Time": true,
                  "Value": true
                },
                "indexByName": {
                  "Time": 0,
                  "Value": 6,
                  "cluster_name": 4,
                  "name": 3,
                  "namespace": 2,
                  "resource": 1,
                  "uid": 5
                },
                "renameByName": {}
              }
            }
          ],
          "type": "table"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                }
              },
              "mappings": []
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 5,
            "x": 14,
            "y": 0
          },
          "id": 5,
          "options": {
            "displayLabels": [
              "name"
            ],
            "legend": {
              "displayMode": "list",
              "placement": "right",
              "values": []
            },
            "pieType": "pie",
            "reduceOptions": {
              "calcs": [
                "lastNotNull"
              ],
              "fields": "",
              "values": false
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "pluginVersion": "8.4.5",
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "exemplar": false,
              "expr": "count(capi_cluster_info{namespace=~\"$namespace\",name=~\"$cluster\"}) by (topology_class)",
              "format": "time_series",
              "instant": true,
              "interval": "",
              "legendFormat": "{{ topology_class }}",
              "refId": "A"
            }
          ],
          "title": "Clusters by Topology Classes",
          "transformations": [],
          "type": "piechart"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "",
          "fieldConfig": {
            "defaults": {
              "color": {
                "fixedColor": "blue",
                "mode": "palette-classic"
              },
              "custom": {
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                }
              },
              "mappings": []
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 5,
            "x": 19,
            "y": 0
          },
          "id": 4,
          "options": {
            "displayLabels": [
              "name"
            ],
            "legend": {
              "displayMode": "list",
              "placement": "right",
              "values": []
            },
            "pieType": "pie",
            "reduceOptions": {
              "calcs": [
                "lastNotNull"
              ],
              "fields": "",
              "values": false
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "pluginVersion": "8.4.5",
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "exemplar": false,
              "expr": "count(capi_cluster_info{namespace=~\"$namespace\",name=~\"$cluster\"}) by (topology_version)",
              "format": "time_series",
              "instant": true,
              "interval": "",
              "legendFormat": "{{ topology_version }}",
              "refId": "A"
            }
          ],
          "title": "Clusters by Topology Version",
          "transformations": [],
          "type": "piechart"
        },
        {
          "collapsed": false,
          "gridPos": {
            "h": 1,
            "w": 24,
            "x": 0,
            "y": 8
          },
          "id": 9,
          "panels": [],
          "title": "Paused resources",
          "type": "row"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "Lists all paused CAPI resources",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "thresholds"
              },
              "custom": {
                "align": "auto",
                "displayMode": "auto"
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": [
              {
                "matcher": {
                  "id": "byName",
                  "options": "cluster_name"
                },
                "properties": [
                  {
                    "id": "custom.width",
                    "value": 341
                  }
                ]
              }
            ]
          },
          "gridPos": {
            "h": 11,
            "w": 24,
            "x": 0,
            "y": 9
          },
          "id": 2,
          "options": {
            "footer": {
              "fields": "",
              "reducer": [
                "sum"
              ],
              "show": false
            },
            "showHeader": true,
            "sortBy": []
          },
          "pluginVersion": "8.4.5",
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "exemplar": false,
              "expr": "count(label_replace((\n  capi_cluster_annotation_paused{namespace=~\"$namespace\",name=~\"$cluster\"} == 1 or capi_cluster_spec_paused{namespace=~\"$namespace\",name=~\"$cluster\"} == 1\n  or capi_kubeadmcontrolplane_annotation_paused{namespace=~\"$namespace\",cluster_name=~\"$cluster\"} == 1\n  or capi_machinedeployment_annotation_paused{namespace=~\"$namespace\",cluster_name=~\"$cluster\"} == 1 or capi_machinedeployment_spec_paused{namespace=~\"$namespace\",cluster_name=~\"$cluster\"} == 1\n  or capi_machineset_annotation_paused{namespace=~\"$namespace\",cluster_name=~\"$cluster\"} == 1\n  or capi_machine_annotation_paused{namespace=~\"$namespace\",cluster_name=~\"$cluster\"} == 1\n  or capi_machinehealthcheck_annotation_paused{namespace=~\"$namespace\",cluster_name=~\"$cluster\"} == 1\n), \"resource\", \"$1\", \"__name__\", \"capi_([a-z]+)_.*\")) by (name,namespace,cluster_name,resource,uid)",
              "format": "table",
              "instant": true,
              "interval": "",
              "legendFormat": "",
              "refId": "A"
            }
          ],
          "title": "Paused resources",
          "transformations": [
            {
              "id": "seriesToColumns",
              "options": {
                "byField": "uid"
              }
            },
            {
              "id": "organize",
              "options": {
                "excludeByName": {
                  "Time": true,
                  "Value": true
                },
                "indexByName": {
                  "Time": 0,
                  "Value": 6,
                  "cluster_name": 4,
                  "name": 3,
                  "namespace": 2,
                  "resource": 1,
                  "uid": 5
                },
                "renameByName": {}
              }
            }
          ],
          "type": "table"
        }
      ],
      "refresh": "",
      "schemaVersion": 35,
      "style": "dark",
      "tags": [],
      "templating": {
        "list": [
          {
            "allValue": ".*",
            "current": {},
            "datasource": {
              "type": "prometheus",
              "uid": "${DS_PROMETHEUS}"
            },
            "definition": "label_values(capi_cluster_info, namespace)",
            "hide": 0,
            "includeAll": true,
            "label": "Namespace",
            "multi": true,
            "name": "namespace",
            "options": [],
            "query": {
              "query": "label_values(capi_cluster_info, namespace)",
              "refId": "StandardVariableQuery"
            },
            "refresh": 1,
            "regex": "",
            "skipUrlSync": false,
            "sort": 0,
            "type": "query"
          },
          {
            "allValue": ".*",
            "current": {},
            "datasource": {
              "type": "prometheus",
              "uid": "${DS_PROMETHEUS}"
            },
            "definition": "label_values(capi_cluster_info{namespace=~\"$namespace\"}, name)",
            "hide": 0,
            "includeAll": true,
            "label": "Cluster",
            "multi": true,
            "name": "cluster",
            "options": [],
            "query": {
              "query": "label_values(capi_cluster_info{namespace=~\"$namespace\"}, name)",
              "refId": "StandardVariableQuery"
            },
            "refresh": 1,
            "regex": "",
            "skipUrlSync": false,
            "sort": 0,
            "type": "query"
          }
        ]
      },
      "time": {
        "from": "now-6h",
        "to": "now"
      },
      "timepicker": {},
      "timezone": "",
      "title": "Cluster API Overview (WIP)",
      "uid": "qiW7XdZVkasd",
      "version": 5,
      "weekStart": ""
    }
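
As a side note on the paused-resources query above: its `label_replace` call derives the `object` label (named `resource` in the dashboard version) from the metric name. The following is a minimal Python sketch of that extraction, meant only as an illustration and not as part of this PR. The helper name `resource_from_metric` is hypothetical, and `re.fullmatch` is used on the assumption that it mirrors how Prometheus anchors the `label_replace` regex at both ends.

```python
import re
from typing import Optional

# Same pattern as the regex argument of label_replace in the query above.
# Prometheus anchors this regex at both ends, so fullmatch is used here.
PATTERN = re.compile(r"capi_([a-z]+)_.*")


def resource_from_metric(metric_name: str) -> Optional[str]:
    """Return what "$1" would expand to, or None if the regex does not match.

    Hypothetical helper for illustration only. In PromQL, a series whose
    source label does not match the regex is returned unchanged.
    """
    match = PATTERN.fullmatch(metric_name)
    return match.group(1) if match else None


if __name__ == "__main__":
    for name in (
        "capi_cluster_spec_paused",
        "capi_kubeadmcontrolplane_annotation_paused",
        "capi_machinedeployment_annotation_paused",
        "kube_pod_info",
    ):
        print(name, "->", resource_from_metric(name))
```

Because `[a-z]+` cannot match an underscore, the capture group stops at the first underscore after the `capi_` prefix, so every `*_annotation_paused` and `*_spec_paused` series ends up labeled with the resource kind only.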

/retitle ✨ Integrate kube-state-metrics and CR config into tilt.

@k8s-ci-robot k8s-ci-robot changed the title from "✨ [WIP] Integrate kube-state-metrics and CR config into tilt." to "✨ Integrate kube-state-metrics and CR config into tilt." Aug 25, 2022
Member

@fabriziopandini fabriziopandini left a comment

Great to see this nearing the finish line!
Overall lgtm; I still need to make a detailed pass over all the metrics.

docs/book/src/developer/tilt.md (resolved)
docs/proposals/20220411-cluster-api-state-metrics.md (outdated, resolved)
hack/observability/kube-state-metrics/crd-config.yaml (outdated, resolved)
hack/observability/kube-state-metrics/metrics/build.sh (outdated, resolved)
@chrischdi
Member Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 25, 2022
docs/book/src/developer/tilt.md (outdated, resolved)
docs/book/src/developer/tilt.md (outdated, resolved)
docs/proposals/20220411-cluster-api-state-metrics.md (outdated, resolved)
hack/observability/kube-state-metrics/metrics/regen.sh (outdated, resolved)
hack/observability/kube-state-metrics/kustomization.yaml (outdated, resolved)
@chrischdi
Member Author

/hold

pending squash + reviews :-)

@fabriziopandini
Member

Made a pass on the metrics, and I think this is a good set to start with.
Looking forward to using those in dashboards so we can better appreciate what is useful and whether something else is missing.

@sbueringer
Member

lgtm pending #7095 (comment)

@@ -0,0 +1,77 @@
image:
  tag: v2.6.0
Member

@sbueringer sbueringer Aug 30, 2022

It would be good to drop the pinned tag in a follow-up PR once the corresponding Helm chart is out
(just so that the pinned tag and the chart don't get out of sync when we pull the latest chart).

Member Author

I agree, the latest chart still refers to 2.5.0.

Member

Let's open an issue to track this

Member

done: #7143

@sbueringer
Member

/lgtm

@chrischdi hold cancel?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 30, 2022
@sbueringer
Member

/assign @fabriziopandini
for final review

@killianmuldoon
Contributor

/lgtm
This is awesome, and working well on my machine.
One question about future work: what's the scope for including other providers' objects in this? Is there an easy way for us to point to provider metrics definitions, e.g. CAPD, and have those loaded? Or is the idea that end users would load in their own metrics files and regenerate the metrics?

@fabriziopandini
Member

/lgtm
/approve

One question about future work: what's the scope for including other providers' objects in this? Is there an easy way for us to point to provider metrics definitions, e.g. CAPD, and have those loaded? Or is the idea that end users would load in their own metrics files and regenerate the metrics?

I think there is an intermediate step to get there, which is to automate the generation of the metrics configuration by introducing a new set of markers; this will make adoption in the providers straightforward.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fabriziopandini

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 1, 2022
@sbueringer
Member

/hold cancel
Thx @chrischdi
Really nice work, super super useful! :))

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 1, 2022
@k8s-ci-robot k8s-ci-robot merged commit 4cf110e into kubernetes-sigs:main Sep 1, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.3 milestone Sep 1, 2022
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
5 participants