Remediate Ready Worker Node in E2E #159

razo7 · 2024-11-06T14:11:44Z

Why we need this PR

Fix a trailing error when a node becomes unresponsive from the resourceDeletion strategy and it is selected for remediation for outOfService strategy. Then there is a failure of getting the node boot time in the BeforeEach prior to CR creation https://github.com/medik8s/fence-agents-remediation/blob/main/test/e2e/far_e2e_test.go#L114

  [FAILED] failed to get boot time of the node
  Unexpected error:
      <wait.errInterrupted>: 
      timed out waiting for the condition
      {
          cause: <*errors.errorString | 0xc0003cd250>{
              s: "timed out waiting for the condition",
          },
      }
  occurred

Reboot doesn't succeed on the second node remediation (per strategy), as it can't get the boot time of the node https://github.com/medik8s/fence-agents-remediation/blob/main/test/e2e/far_e2e_test.go#L430 passing a timeout.

Timeout for node reboot has passed, even though FAR CR has been created 
The function passed to Eventually returned the following error: <wait.errInterrupted>: timed out waiting for the condition { cause: <*errors.errorString | 0xc0003cd250> { s: "timed out waiting for the condition", }, }

The tested pods for checking the boot time have been evicted due to FAR remediation taint, while they should tolerate it.

Changes made

Improve the function getAvailableWorkerNodes to getReadyWorkerNodes by getting a list of all the ready worker nodes. Otherwise, we will try to remediate the node and fail to fetch its boot time at BeforeEach.
Function getNodeRoleFromMachine will add the node role to the list of nodes before every e2e test
Fix listing machines with no associated node - too long and messy return message.
Extend RetryCount for (AWS) agent.
Move cluster information from BeforeEach to an independent test, and skip subsequent tests on failure.
Tested pods toleration should use the Exist operator over Equal as the toleration value is irrelevant

Which issue(s) this PR fixes

ECOPROJECT-2337

Test plan

Improve getAvailableWorkerNodes to getReadyWorkerNodes by getting a list of all the ready worker nodes. Otherwise we will try to remediate node and fail on fetching its boot time at BeforeEach. Furthemrmore getNodeRoleFromMachine function will add the node role at the list of nodes prior to every e2e test

Bad return error and continue inside if was removed in favor of else to the if statement

openshift-ci · 2024-11-06T14:11:49Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

razo7 · 2024-11-06T14:39:31Z

/test 4.14-openshift-e2e
/test 4.15-openshift-e2e

razo7 · 2024-11-09T09:39:15Z

/test 4.16-openshift-e2e
/test 4.17-openshift-e2e

razo7 · 2024-11-11T14:02:45Z

/test 4.12-openshift-e2e
/test 4.13-openshift-e2e

razo7 · 2024-11-11T15:59:23Z

/test 4.16-openshift-e2e
/test 4.17-openshift-e2e

RetryCount of 5 is too low for running the AWS agent. Leading to a context timed out, and stop trying to reboot the unhealthy node

… for test pods Move cluster information from BeforeEach to an independent test, and skip subsequent tests on failure. Tested pods toleration should use the Exist operator over Equal as the toleration value is irrelevant

razo7 · 2024-11-13T05:25:59Z

/test 4.15-openshift-e2e
/test 4.16-openshift-e2e
/test 4.17-openshift-e2e

slintes

general remark: we should fail tests when we don't run them, otherwise the risk to just be happy about fast and green tests, while something significant went wrong, is too high IMHO

test/e2e/far_e2e_test.go

slintes · 2024-11-13T08:08:22Z

test/e2e/far_e2e_test.go

+		}
+	}
+	if len(readyWorkerNodes.Items) < 1 {
+		Skip("There isn't an available (and ready) worker node in the cluster")


this should result in a failure IMHO

test/e2e/far_e2e_test.go

slintes · 2024-11-13T08:09:30Z

test/e2e/far_e2e_test.go

 }

 // pickRemediatedNode randomly returns a next remediated node from the current available nodes,
 // and then the node is removed from the list of available nodes
 func pickRemediatedNode(availableNodes *corev1.NodeList) *corev1.Node {
 	if len(availableNodes.Items) < 1 {
-		Fail("No available node found for remediation")
+		Skip("There isn't an available (and ready) worker node in the cluster")


why switch from Fail to Skip? This is hiding that we don't run tests...

Actually not sure if this If is needed as it was already checked (and will Fail) in https://github.com/medik8s/fence-agents-remediation/pull/159/files/8589575a67f85b8b82334a1ea5edd51f334659c6#diff-b9729ac4458c7637438e3bf558999b35adedcaf4d13af1e3eb752db501fd4997R272.
Do you see a value of checking that again after getReadyWorkerNodes function? It might be nice to make them independent 🤔

What about moving these checks to the place where both functions are called?
getWorkerNodes returns an empty list. Caller decides if that´s a problem. If yes, fail. If not, go on with pickNode...

test/e2e/far_e2e_test.go

Previously it was just skiping but it hides that we don't run tests and still getting a green result

slintes · 2024-11-13T15:48:18Z

/lgtm

Feel free to address my suggestion or unhold :)
/hold

Move the check of missing Ready worker node outside of the previous two functions. It simplifies the code

razo7 · 2024-11-14T06:37:53Z

/test 4.15-openshift-e2e
/test 4.16-openshift-e2e
/test 4.17-openshift-e2e

openshift-ci · 2024-11-14T10:05:46Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: razo7, slintes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [razo7,slintes]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

razo7 · 2024-11-14T12:27:29Z

It would be nice if we can merge openshift/release#58825 and then test the PR with OCP 4.18

razo7 · 2024-11-14T14:13:22Z

openshift/release#58825 tests are failing, and only when this PR will be merged then the tests will begin to be green again.

/retest

razo7 · 2024-11-14T14:13:33Z

/unhold

razo7 added 2 commits November 6, 2024 15:57

Fix listing machines with no associated node

c2baa2f

Bad return error and continue inside if was removed in favor of else to the if statement

openshift-ci bot added the do-not-merge/work-in-progress label Nov 6, 2024

openshift-ci bot added the approved label Nov 6, 2024

razo7 mentioned this pull request Nov 11, 2024

[DO NOT MERGE] Increase timeout to test e2e failures on OCPs #157

Closed

razo7 added 2 commits November 12, 2024 12:37

Extend RetryCount for AWS agent

b9cc66b

RetryCount of 5 is too low for running the AWS agent. Leading to a context timed out, and stop trying to reboot the unhealthy node

slintes requested changes Nov 13, 2024

View reviewed changes

openshift-ci bot assigned slintes Nov 13, 2024

Fail on missing Ready Worker nodes

156442e

Previously it was just skiping but it hides that we don't run tests and still getting a green result

openshift-ci bot added do-not-merge/hold lgtm labels Nov 13, 2024

Fail early on missing Ready Worker node

556f51a

Move the check of missing Ready worker node outside of the previous two functions. It simplifies the code

openshift-ci bot removed the lgtm label Nov 14, 2024

slintes approved these changes Nov 14, 2024

View reviewed changes

openshift-ci bot added the lgtm label Nov 14, 2024

razo7 marked this pull request as ready for review November 14, 2024 10:53

openshift-ci bot removed the do-not-merge/work-in-progress label Nov 14, 2024

openshift-ci bot requested review from clobrano and slintes November 14, 2024 10:53

openshift-ci bot removed the do-not-merge/hold label Nov 14, 2024

openshift-merge-bot bot merged commit d4f78cd into medik8s:main Nov 14, 2024
27 checks passed

razo7 mentioned this pull request Nov 15, 2024

Error aggregation for Validation Webhooks #154

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remediate Ready Worker Node in E2E #159

Remediate Ready Worker Node in E2E #159

razo7 commented Nov 6, 2024 •

edited by openshift-ci bot

Loading

openshift-ci bot commented Nov 6, 2024

razo7 commented Nov 6, 2024

razo7 commented Nov 9, 2024

razo7 commented Nov 11, 2024

razo7 commented Nov 11, 2024

razo7 commented Nov 13, 2024

slintes left a comment

slintes Nov 13, 2024

slintes Nov 13, 2024

razo7 Nov 13, 2024

slintes Nov 13, 2024

slintes commented Nov 13, 2024

razo7 commented Nov 14, 2024

openshift-ci bot commented Nov 14, 2024

razo7 commented Nov 14, 2024

razo7 commented Nov 14, 2024

razo7 commented Nov 14, 2024

Remediate Ready Worker Node in E2E #159

Remediate Ready Worker Node in E2E #159

Conversation

razo7 commented Nov 6, 2024 • edited by openshift-ci bot Loading

Why we need this PR

Changes made

Which issue(s) this PR fixes

Test plan

openshift-ci bot commented Nov 6, 2024

razo7 commented Nov 6, 2024

razo7 commented Nov 9, 2024

razo7 commented Nov 11, 2024

razo7 commented Nov 11, 2024

razo7 commented Nov 13, 2024

slintes left a comment

Choose a reason for hiding this comment

slintes Nov 13, 2024

Choose a reason for hiding this comment

slintes Nov 13, 2024

Choose a reason for hiding this comment

razo7 Nov 13, 2024

Choose a reason for hiding this comment

slintes Nov 13, 2024

Choose a reason for hiding this comment

slintes commented Nov 13, 2024

razo7 commented Nov 14, 2024

openshift-ci bot commented Nov 14, 2024

razo7 commented Nov 14, 2024

razo7 commented Nov 14, 2024

razo7 commented Nov 14, 2024

razo7 commented Nov 6, 2024 •

edited by openshift-ci bot

Loading