Remediate Ready Worker Node in E2E #159
Improve getAvailableWorkerNodes to getReadyWorkerNodes by getting a list of all the ready worker nodes. Otherwise we will try to remediate a node that is not ready and fail on fetching its boot time at BeforeEach. Furthermore, the getNodeRoleFromMachine function will add the node role to the list of nodes prior to every e2e test
The bad error return and the continue inside the if were removed in favor of an else branch on the if statement
Skipping CI for Draft Pull Request.
/test 4.14-openshift-e2e
/test 4.16-openshift-e2e
/test 4.12-openshift-e2e
/test 4.16-openshift-e2e
A RetryCount of 5 is too low for running the AWS agent: it leads to a context timeout, after which we stop trying to reboot the unhealthy node
… for test pods Move cluster information from BeforeEach to an independent test, and skip subsequent tests on failure. The tested pods' toleration should use the Exists operator rather than Equal, as the toleration value is irrelevant
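To illustrate why the operator matters, here is a minimal sketch of toleration matching, assuming simplified stand-in types rather than the real Kubernetes core/v1 structs. The `Taint` and `Toleration` types, the `tolerates` helper, and the `example.com/remediation` key are all hypothetical and only mirror the documented matching rules: with Exists the taint value is ignored, with Equal it must match exactly.

```go
package main

import "fmt"

// Taint and Toleration are simplified, hypothetical stand-ins for the
// Kubernetes core/v1 types; only the fields needed for matching are kept.
type Taint struct {
	Key, Value, Effect string
}

type Toleration struct {
	Key      string
	Operator string // "Exists" or "Equal" (empty is treated as "Equal")
	Value    string
	Effect   string // empty means "any effect"
}

// tolerates reports whether t tolerates the given taint.
func tolerates(t Toleration, taint Taint) bool {
	if t.Effect != "" && t.Effect != taint.Effect {
		return false
	}
	if t.Key != "" && t.Key != taint.Key {
		return false
	}
	switch t.Operator {
	case "", "Equal":
		// Equal compares values, so a changed taint value breaks the match.
		return t.Value == taint.Value
	case "Exists":
		// Exists ignores the value entirely.
		return true
	default:
		return false
	}
}

func main() {
	// Hypothetical taint; the key and values are made up for the sketch.
	taint := Taint{Key: "example.com/remediation", Value: "v2", Effect: "NoExecute"}

	exists := Toleration{Key: "example.com/remediation", Operator: "Exists", Effect: "NoExecute"}
	equal := Toleration{Key: "example.com/remediation", Operator: "Equal", Value: "v1", Effect: "NoExecute"}

	fmt.Println("Exists tolerates:", tolerates(exists, taint))
	fmt.Println("Equal tolerates:", tolerates(equal, taint))
}
```

When a test pod only needs to survive the taint regardless of its value, Exists avoids coupling the test to a specific value string.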
/test 4.15-openshift-e2e
General remark: we should fail tests when we don't run them. Otherwise the risk of just being happy about fast, green tests while something significant went wrong is too high, IMHO
test/e2e/far_e2e_test.go
Outdated
		}
	}
	if len(readyWorkerNodes.Items) < 1 {
		Skip("There isn't an available (and ready) worker node in the cluster")
this should result in a failure IMHO
test/e2e/far_e2e_test.go
Outdated
 }

 // pickRemediatedNode randomly returns a next remediated node from the current available nodes,
 // and then the node is removed from the list of available nodes
 func pickRemediatedNode(availableNodes *corev1.NodeList) *corev1.Node {
 	if len(availableNodes.Items) < 1 {
-		Fail("No available node found for remediation")
+		Skip("There isn't an available (and ready) worker node in the cluster")
why switch from Fail to Skip? This is hiding that we don't run tests...
Actually, I'm not sure this if is needed, as it was already checked (and will Fail) in https://github.com/medik8s/fence-agents-remediation/pull/159/files/8589575a67f85b8b82334a1ea5edd51f334659c6#diff-b9729ac4458c7637438e3bf558999b35adedcaf4d13af1e3eb752db501fd4997R272.
Do you see value in checking that again after the getReadyWorkerNodes function? It might be nice to make them independent 🤔
What about moving these checks to the place where both functions are called?
getWorkerNodes returns an empty list. The caller decides if that's a problem. If yes, fail. If not, go on with pickNode
...
Previously it was just skipping, which hides that we don't run the tests while still getting a green result
/lgtm Feel free to address my suggestion or unhold :)
Move the check for a missing Ready worker node out of the previous two functions, which simplifies the code
/test 4.15-openshift-e2e
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: razo7, slintes
It would be nice if we could merge openshift/release#58825 and then test the PR with OCP 4.18
The openshift/release#58825 tests are failing, and they will only turn green again once this PR is merged. /retest
/unhold
Why we need this PR
resourceDeletion strategy and it is selected for remediation for outOfService strategy. Then there is a failure getting the node boot time in the BeforeEach prior to CR creation https://github.com/medik8s/fence-agents-remediation/blob/main/test/e2e/far_e2e_test.go#L114
Changes made
Improve getAvailableWorkerNodes to getReadyWorkerNodes by getting a list of all the ready worker nodes. Otherwise, we will try to remediate the node and fail to fetch its boot time at BeforeEach.
getNodeRoleFromMachine will add the node role to the list of nodes before every e2e test
Which issue(s) this PR fixes
ECOPROJECT-2337
Test plan