-
Notifications
You must be signed in to change notification settings - Fork 319
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-33659: Improve resliency of size tagging when hostedcluster KAS down #4034
OCPBUGS-33659: Improve resliency of size tagging when hostedcluster KAS down #4034
Conversation
@csrwng: This pull request references Jira Issue OCPBUGS-33659, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: csrwng The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller_test.go
Show resolved
Hide resolved
hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller_test.go
Show resolved
Hide resolved
hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller.go
Outdated
Show resolved
Hide resolved
if hccoReportsNodeCount { | ||
nodeCountRequiresAPIServer = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we also let the HCCO to compute nodecount from the NodePools spec?
Even without facing this issue wouldn't that make sense from the autoscaling functionallity pov?
If so we could change existing hostedControlPlane.Status.NodeCount or maybe have a new one e.g. hostedControlPlane.Status.NodeCountIntent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's harder (since the hcco now needs to be aware of nodepools or machinesets). Could we instead use the hcco count if the KAS is functional, otherwise fallback to the nodepool.spec count if no autoscaling is in use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually scratch that. In the cases where we don't have nodepools, but nodes exist (IBM Cloud), falling back on the nodepools will give us the wrong answer.
Relates to kubernetes-sigs/cluster-api#10195 |
hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller.go
Outdated
Show resolved
Hide resolved
95b8a59
to
7a2df01
Compare
d3a91c1
to
e411488
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/retest-required |
/override ci/prow/e2e-kubevirt-aws-ovn |
@csrwng: Overrode contexts on behalf of csrwng: ci/prow/e2e-kubevirt-aws-ovn In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
e411488
to
e468da3
Compare
When the kube apiserver of a hosted cluster is not available, the replica status of nodepools will not be accurate because the CAPI controllers can no longer get node counts from the API server. This commit improves the handling of this situation with 2 changes: - Switches to use .spec.replicas to determine node count of nodepools that do not have autoscaling turned on. - Once a hosted cluster has been tagged with a size, only if the kube apiserver of the hosted cluster is available is the hosted cluster allowed to move to a different size.
e468da3
to
001916e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/lgtm |
/retest |
/jira refresh The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target different OCP), recalculating validity. |
@openshift-bot: This pull request references Jira Issue OCPBUGS-33659, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
/retest-required |
/override ci/prow/e2e-aws |
/override ci/prow/e2e-kubevirt-aws-ovn |
CI is currently flaking on cloud quota issues, and this PR does not impact any codepath tested by CI |
@csrwng: Overrode contexts on behalf of csrwng: ci/prow/e2e-aws In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
@csrwng: Overrode contexts on behalf of csrwng: ci/prow/e2e-kubevirt-aws-ovn In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
@csrwng: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
630c4d8
into
openshift:main
@csrwng: Jira Issue OCPBUGS-33659: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-33659 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
[ART PR BUILD NOTIFIER] This PR has been included in build ose-hypershift-container-v4.17.0-202405171741.p0.g630c4d8.assembly.stream.el9 for distgit hypershift. |
What this PR does / why we need it:
When the kube apiserver of a hosted cluster is not available, the replica status of nodepools will not be accurate because the CAPI controllers can no longer get node counts from the API server. This commit improves the handling of this situation with 2 changes:
Which issue(s) this PR fixes (optional, use
fixes #<issue_number>(, fixes #<issue_number>, ...)
format, where issue_number might be a GitHub issue, or a Jira story:Fixes #OCPBUGS-33659
Checklist