
Raise a metric whenever CAPI cannot see a remote cluster client #5510

Closed
perithompson opened this issue Oct 27, 2021 · 18 comments
Labels
  • area/metrics: Issues or PRs related to metrics
  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@perithompson

User Story

As an operator, I would like CAPI to raise a metric whenever it cannot reach a remote cluster client, so that we can see where errors contacting the workload cluster are continuously rising and potentially raise an alert.

Detailed Description
Similar to kubernetes-sigs/cluster-api-provider-vsphere#1281, it would be useful to be able to spot when clusters are not contactable from the management cluster, so that we can monitor, alert, and pause reconciliation until the remote cluster is contactable again.

Anything else you would like to add:
This also relates to #5394, which asks for more information on the annotations to be used when cluster communication is interrupted.

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 27, 2021
@killianmuldoon
Contributor

/area health

@fabriziopandini
Member

Praise for the clear definition of the required metric.
/milestone v1.2

@k8s-ci-robot k8s-ci-robot added this to the v1.2 milestone Jan 26, 2022
@killianmuldoon
Contributor

/assign

@killianmuldoon
Contributor

All credit to @chrischdi for the following info 🙂

We have a metric, exposed by client-go through controller-runtime, which reports error responses in the client. It's called rest_client_requests_total and it has the following format:

rest_client_requests_total{code="<error>",host="172.18.0.4:6443",method="GET"} 36

The host IP here is the ControlPlaneEndpoint IP, i.e. from the Cluster's .spec.controlPlaneEndpoint.host.

It is currently exposed by CAPI's metrics endpoint (by default :8080/metrics). You can see it yourself (with your kubectl context set to the management cluster) using:

kubectl port-forward -n capi-system deployments/capi-controller-manager 8080:8080 &
curl localhost:8080/metrics | grep -i rest_client_requests_total
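
As a starting point, a rough PromQL sketch for watching error responses with what exists today (per endpoint host only, since nothing links the host back to a Cluster yet; the exact code label values are an assumption and depend on the client-go version):

sum(rate(rest_client_requests_total{code=~"5..|<error>"}[5m])) by (host, code)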

For now we don't have an automated way to link the cluster IP to the Cluster in Prometheus. Once #6404 is added to the repo, we can add a metric that links these two pieces of information together, giving remote client errors by cluster name/namespace.

So this metric should be used to understand when the remote cluster is uncontactable. Does this suit your use case, @perithompson?

@sbueringer
Member

sbueringer commented Apr 26, 2022

@killianmuldoon @chrischdi I wonder if it would be possible to extend this metric, and similar ones, with an additional label for the cluster. I think this might be a very nice improvement, as it makes the metrics easier to use and avoids joins in PromQL (to use those metrics you basically always have to join with another metric which has the IP).

@chrischdi
Member

@killianmuldoon @chrischdi I wonder if it would be possible to extend this metric, and similar ones, with an additional label for the cluster. I think this might be a very nice improvement, as it makes the metrics easier to use and avoids joins in PromQL (to use those metrics you basically always have to join with another metric which has the IP).

In theory this would be possible by:

  • passing the cluster value via the request context to the metrics adapter, and reading it from the context there.

However, we could only implement this in one of three ways, each of which I think is either too much effort or has too many cons:

  • modify controller-runtime to support / allow this change in some way
  • write our own metric (see the sketch after this list), but:
    • the metric would need a different name (not rest_client_requests_total, because controller-runtime registers that metric in a func init(){...} call xref)
    • and by doing that, client-go would no longer populate rest_client_requests_total, so that metric would cease to exist
  • rewrite the whole controller-runtime metrics package and use that for the metrics HTTP endpoint
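
As a rough Go sketch of the second option (hypothetical package and metric names throughout; an illustration of the context-based idea, not how controller-runtime actually wires its metrics):

package remotemetrics

import (
    "context"
    "net/http"
    "strconv"

    "github.com/prometheus/client_golang/prometheus"
)

// clusterKey is a hypothetical context key under which a reconciler would
// store the workload cluster's namespace/name before issuing requests.
type clusterKey struct{}

// WithCluster returns a context carrying the cluster identifier.
func WithCluster(ctx context.Context, nsName string) context.Context {
    return context.WithValue(ctx, clusterKey{}, nsName)
}

// The counter needs its own name: rest_client_requests_total is already
// registered by controller-runtime itself in a func init(){...} call.
var remoteRequests = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "capi_remote_client_requests_total",
        Help: "Requests to workload cluster API servers, by cluster and code.",
    },
    []string{"cluster", "code"},
)

func init() { prometheus.MustRegister(remoteRequests) }

// instrumented wraps a rest client transport and records one sample per
// request, labelled with the cluster read back out of the context.
type instrumented struct{ next http.RoundTripper }

func (t instrumented) RoundTrip(req *http.Request) (*http.Response, error) {
    cluster, _ := req.Context().Value(clusterKey{}).(string)
    resp, err := t.next.RoundTrip(req)
    code := "<error>" // mirror client-go's label for transport-level failures
    if err == nil {
        code = strconv.Itoa(resp.StatusCode)
    }
    remoteRequests.WithLabelValues(cluster, code).Inc()
    return resp, err
}

A rest.Config's WrapTransport hook would be one place to attach such a wrapper for the remote cluster client.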

@sbueringer
Member

Maybe 1. or 3. is an option for the future. The upside of investing the effort in controller-runtime is that a lot of folks (including other providers) can usually profit from it.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 25, 2022
@killianmuldoon
Contributor

killianmuldoon commented Jul 25, 2022

/remove-lifecycle stale

Work is still ongoing (tracked in #6458) to make the UX for this better in CAPI.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 25, 2022
@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini fabriziopandini removed this from the v1.2 milestone Jul 29, 2022
@fabriziopandini fabriziopandini removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini
Member

/triage accepted
/unassign @killianmuldoon
/help

@k8s-ci-robot
Contributor

@fabriziopandini:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/triage accepted
/unassign @killianmuldoon
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Oct 3, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 8, 2023
@fabriziopandini
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 9, 2023
@chrischdi
Member

This should meanwhile be possible via PromQL queries together with the custom resource metrics configuration.

Example:

# Error-response rate by host/code/provider, joined to the owning Cluster's
# name via capi_cluster_info (host = control_plane_endpoint_host:port).
(sum(rate(rest_client_requests_total{code="<error>"}[1m])) by (host, code, provider))
* on(host) group_left(name)
label_join(capi_cluster_info, "host", ":", "control_plane_endpoint_host", "control_plane_endpoint_port")

(The accompanying graph shows the error response rate during cluster creation for this cluster, per provider/controller.)

Based on this information it should be possible to create alerts :-)
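
For instance, a Prometheus alerting rule built on the query above might look like this (a sketch only; the alert name, threshold, and window are assumptions to adapt to your environment):

groups:
- name: cluster-api-remote-connectivity
  rules:
  - alert: WorkloadClusterUnreachable
    # Fires when requests to a workload cluster's API server have been
    # erroring for 10 minutes; host is joined to the Cluster name as above.
    expr: |
      (sum(rate(rest_client_requests_total{code="<error>"}[5m])) by (host))
        * on(host) group_left(name)
          label_join(capi_cluster_info, "host", ":", "control_plane_endpoint_host", "control_plane_endpoint_port")
      > 0.1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Cluster API cannot reach workload cluster {{ $labels.name }}"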

@sbueringer
Member

Very nice!!

@fabriziopandini
Member

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Apr 12, 2024
@sbueringer
Member

/close

We already have that now; see Christian's example above.

@k8s-ci-robot
Contributor

@sbueringer: Closing this issue.

In response to this:

/close

We already have that now; see Christian's example above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
