
Raise a metric whenever CAPI cannot see a remote cluster client #5510

Closed
perithompson opened this issue Oct 27, 2021 · 18 comments
Labels
  • area/metrics: Issues or PRs related to metrics
  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@perithompson

User Story

As an operator, I would like CAPI to raise a metric whenever it cannot reach a remote cluster client, so that we can see where errors contacting the workload cluster are continuously rising and potentially raise an alert.

Detailed Description
Similar to kubernetes-sigs/cluster-api-provider-vsphere#1281, it would be useful to be able to spot when clusters are not contactable from the management cluster, so that we can monitor, alert, and pause reconciliation until the remote cluster is contactable again.

Anything else you would like to add:
This also relates to #5394, which asks for more information on the annotations to be used when cluster communication is interrupted.

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 27, 2021
@killianmuldoon
Contributor

/area health

@fabriziopandini
Member

Praise for the clear definition of the required metric.
/milestone v1.2

@k8s-ci-robot k8s-ci-robot added this to the v1.2 milestone Jan 26, 2022
@killianmuldoon
Contributor

/assign

@killianmuldoon
Contributor

All credit to @chrischdi for the following info 🙂

We have a metric, exposed by client-go through controller-runtime, which reports error responses in the client. It's called rest_client_requests_total and it has the following format:

rest_client_requests_total{code="<error>",host="172.18.0.4:6443",method="GET"} 36

The host IP here is the ControlPlaneEndpoint IP, i.e. from the Cluster's .spec.controlPlaneEndpoint.host.

It is currently exposed by CAPI's metrics endpoint (by default :8080/metrics). You can see it yourself (with your kubectl context set to the management cluster) using:

kubectl port-forward -n capi-system deployments/capi-controller-manager 8080:8080 &
curl localhost:8080/metrics | grep -i rest_client_requests_total
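
As a starting point, a rough PromQL sketch for watching error responses with what exists today (per endpoint host only, since nothing links the host back to a Cluster yet; the exact code label values are an assumption and depend on the client-go version):

sum(rate(rest_client_requests_total{code=~"5..|<error>"}[5m])) by (host, code)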

For now we don't have an automated way to link the cluster IP to the Cluster in Prometheus. Once #6404 is added to the repo, we can add a metric that links these two pieces of information together, giving remote client errors by cluster name/namespace.

So this metric should be used to understand when the remote cluster is uncontactable. Does this suit your use case, @perithompson?

@sbueringer
Member

sbueringer commented Apr 26, 2022

@killianmuldoon @chrischdi I wonder if it would be possible to extend this metric, and similar ones, with an additional label for the cluster. I think this might be a very nice improvement, as it makes the metrics easier to use and avoids joins in PromQL (to use those metrics you basically always have to join with another metric which has the IP).

@chrischdi
Member

@killianmuldoon @chrischdi I wonder if it would be possible to extend this metric, and similar ones, with an additional label for the cluster. I think this might be a very nice improvement, as it makes the metrics easier to use and avoids joins in PromQL (to use those metrics you basically always have to join with another metric which has the IP).

In theory this would be possible by:

  • passing the cluster value via the request context to the metrics adapter, and reading it from the context there.

However, we could only implement this in one of three ways, each of which I think is either too much effort or has too many cons:

  • modify controller-runtime to support / allow this change in some way
  • write our own metric (see the sketch after this list), but:
    • the metric would need a different name (not rest_client_requests_total, because controller-runtime registers that metric in a func init(){...} call xref)
    • and by doing that, client-go would no longer populate rest_client_requests_total, so that metric would cease to exist
  • rewrite the whole controller-runtime metrics package and use that for the metrics HTTP endpoint
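
As a rough Go sketch of the second option (hypothetical package and metric names throughout; an illustration of the context-based idea, not how controller-runtime actually wires its metrics):

package remotemetrics

import (
    "context"
    "net/http"
    "strconv"

    "github.com/prometheus/client_golang/prometheus"
)

// clusterKey is a hypothetical context key under which a reconciler would
// store the workload cluster's namespace/name before issuing requests.
type clusterKey struct{}

// WithCluster returns a context carrying the cluster identifier.
func WithCluster(ctx context.Context, nsName string) context.Context {
    return context.WithValue(ctx, clusterKey{}, nsName)
}

// The counter needs its own name: rest_client_requests_total is already
// registered by controller-runtime itself in a func init(){...} call.
var remoteRequests = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "capi_remote_client_requests_total",
        Help: "Requests to workload cluster API servers, by cluster and code.",
    },
    []string{"cluster", "code"},
)

func init() { prometheus.MustRegister(remoteRequests) }

// instrumented wraps a rest client transport and records one sample per
// request, labelled with the cluster read back out of the context.
type instrumented struct{ next http.RoundTripper }

func (t instrumented) RoundTrip(req *http.Request) (*http.Response, error) {
    cluster, _ := req.Context().Value(clusterKey{}).(string)
    resp, err := t.next.RoundTrip(req)
    code := "<error>" // mirror client-go's label for transport-level failures
    if err == nil {
        code = strconv.Itoa(resp.StatusCode)
    }
    remoteRequests.WithLabelValues(cluster, code).Inc()
    return resp, err
}

A rest.Config's WrapTransport hook would be one place to attach such a wrapper for the remote cluster client.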

@sbueringer
Member

Maybe 1. or 3. is an option for the future. The upside of investing the effort in controller-runtime is that a lot of folks (including other providers) can usually profit from it.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 25, 2022
@killianmuldoon
Contributor

killianmuldoon commented Jul 25, 2022

/remove-lifecycle stale

Work is still ongoing (tracked in #6458) to make the UX for this better in CAPI.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 25, 2022
@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini fabriziopandini removed this from the v1.2 milestone Jul 29, 2022
@fabriziopandini fabriziopandini removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini
Member

/triage accepted
/unassign @killianmuldoon
/help

@k8s-ci-robot
Contributor

@fabriziopandini:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/triage accepted
/unassign @killianmuldoon
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Oct 3, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 8, 2023
@fabriziopandini
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 9, 2023
@chrischdi
Member

This should meanwhile be possible via PromQL queries together with the custom resource metrics configuration.

Example:

# Error-response rate by host/code/provider, joined to the owning Cluster's
# name via capi_cluster_info (host = control_plane_endpoint_host:port).
(sum(rate(rest_client_requests_total{code="<error>"}[1m])) by (host, code, provider))
* on(host) group_left(name)
label_join(capi_cluster_info, "host", ":", "control_plane_endpoint_host", "control_plane_endpoint_port")

(The accompanying graph shows the error response rate during cluster creation for this cluster, per provider/controller.)

Based on this information it should be possible to create alerts :-)
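
For instance, a Prometheus alerting rule built on the query above might look like this (a sketch only; the alert name, threshold, and window are assumptions to adapt to your environment):

groups:
- name: cluster-api-remote-connectivity
  rules:
  - alert: WorkloadClusterUnreachable
    # Fires when requests to a workload cluster's API server have been
    # erroring for 10 minutes; host is joined to the Cluster name as above.
    expr: |
      (sum(rate(rest_client_requests_total{code="<error>"}[5m])) by (host))
        * on(host) group_left(name)
          label_join(capi_cluster_info, "host", ":", "control_plane_endpoint_host", "control_plane_endpoint_port")
      > 0.1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Cluster API cannot reach workload cluster {{ $labels.name }}"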

@sbueringer
Member

Very nice!!

@fabriziopandini
Member

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Apr 12, 2024
@sbueringer
Member

/close

We already have that now; see Christian's example above.

@k8s-ci-robot
Contributor

@sbueringer: Closing this issue.

In response to this:

/close

We already have that now; see Christian's example above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
