Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exposed Prometheus metrics for unreachable workload clusters #1281

Closed
perithompson opened this issue Oct 27, 2021 · 8 comments
Closed

Exposed Prometheus metrics for unreachable workload clusters #1281

perithompson opened this issue Oct 27, 2021 · 8 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@perithompson
Copy link

/kind feature

Describe the solution you'd like
When a Management cluster cannot reach a Workload Cluster it may be necessary to pause reconciliation of that cluster until a time when cluster connectivity can be restored in order to prevent capi from constantly trying to reconcile something it can't. There is an issue(#5394) that outlines some gaps in the documentation around this.

One problem operators will face is knowing when this state occurs. A simple way to monitor for this state is a metric for monitoring workload clusters. Having count of workload clusters, count of paused and count of unreachable clusters exposed to Prometheus would allow for an alert on a change to unreachable cluster count and then operators could implement some automation or SOPs to check and pause clusters that cannot reconcile for any reason.

Anything else you would like to add:

Environment:

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 25, 2022
@srm09
Copy link
Contributor

srm09 commented Jan 28, 2022

@perithompson This seems like something that will sit in the CAPI repo. I am happy to keep this one around if CAPV would need to do something specifically, but I think the entire change will rest directly in CAPI.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 27, 2022
@srm09
Copy link
Contributor

srm09 commented Mar 1, 2022

/remove-lifecycle rotten
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Mar 1, 2022
@chrischdi
Copy link
Member

Maybe gets resolved or partially resolved in #2061

@killianmuldoon
Copy link
Contributor

I think this should be closed in favor of a solution on the CAPI side - there's a related issue here: kubernetes-sigs/cluster-api#5510

@sbueringer
Copy link
Member

+1

/close

@k8s-ci-robot
Copy link
Contributor

@sbueringer: Closing this issue.

In response to this:

+1

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.
Projects
None yet
Development

No branches or pull requests

7 participants