Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remediate Ready Worker Node in E2E #159

Merged
merged 6 commits into from
Nov 14, 2024

Conversation

razo7
Copy link
Member

@razo7 razo7 commented Nov 6, 2024

Why we need this PR

  1. Fix a trailing error when a node becomes unresponsive from the resourceDeletion strategy and it is selected for remediation for outOfService strategy. Then there is a failure of getting the node boot time in the BeforeEach prior to CR creation https://github.com/medik8s/fence-agents-remediation/blob/main/test/e2e/far_e2e_test.go#L114
  [FAILED] failed to get boot time of the node
  Unexpected error:
      <wait.errInterrupted>: 
      timed out waiting for the condition
      {
          cause: <*errors.errorString | 0xc0003cd250>{
              s: "timed out waiting for the condition",
          },
      }
  occurred
  1. Reboot doesn't succeed on the second node remediation (per strategy), as it can't get the boot time of the node https://github.com/medik8s/fence-agents-remediation/blob/main/test/e2e/far_e2e_test.go#L430 passing a timeout.
Timeout for node reboot has passed, even though FAR CR has been created 
The function passed to Eventually returned the following error: <wait.errInterrupted>: timed out waiting for the condition { cause: <*errors.errorString | 0xc0003cd250> { s: "timed out waiting for the condition", }, }
  1. The tested pods for checking the boot time have been evicted due to FAR remediation taint, while they should tolerate it.

Changes made

  1. Improve the function getAvailableWorkerNodes to getReadyWorkerNodes by getting a list of all the ready worker nodes. Otherwise, we will try to remediate the node and fail to fetch its boot time at BeforeEach.
  2. Function getNodeRoleFromMachine will add the node role to the list of nodes before every e2e test
  3. Fix listing machines with no associated node - too long and messy return message.
  4. Extend RetryCount for (AWS) agent.
  5. Move cluster information from BeforeEach to an independent test, and skip subsequent tests on failure.
  6. Tested pods toleration should use the Exist operator over Equal as the toleration value is irrelevant

Which issue(s) this PR fixes

ECOPROJECT-2337

Test plan

Improve getAvailableWorkerNodes to getReadyWorkerNodes by getting a list of all the ready worker nodes. Otherwise we will try to remediate node and fail on fetching its boot time at BeforeEach. Furthemrmore getNodeRoleFromMachine function will add the node role at the list of nodes prior to every e2e test
Bad return error and continue inside if was removed in favor of else to the if statement
Copy link
Contributor

openshift-ci bot commented Nov 6, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the approved label Nov 6, 2024
@razo7
Copy link
Member Author

razo7 commented Nov 6, 2024

/test 4.14-openshift-e2e
/test 4.15-openshift-e2e

@razo7
Copy link
Member Author

razo7 commented Nov 9, 2024

/test 4.16-openshift-e2e
/test 4.17-openshift-e2e

@razo7
Copy link
Member Author

razo7 commented Nov 11, 2024

/test 4.12-openshift-e2e
/test 4.13-openshift-e2e

@razo7
Copy link
Member Author

razo7 commented Nov 11, 2024

/test 4.16-openshift-e2e
/test 4.17-openshift-e2e

RetryCount of 5 is too low for running the AWS agent. Leading to a context timed out, and stop trying to reboot the unhealthy node
… for test pods

Move cluster information from BeforeEach to an independent test, and skip subsequent tests on failure. Tested pods toleration should use the Exist operator over Equal as the toleration value is irrelevant
@razo7
Copy link
Member Author

razo7 commented Nov 13, 2024

/test 4.15-openshift-e2e
/test 4.16-openshift-e2e
/test 4.17-openshift-e2e

Copy link
Member

@slintes slintes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

general remark: we should fail tests when we don't run them, otherwise the risk to just be happy about fast and green tests, while something significant went wrong, is too high IMHO

test/e2e/far_e2e_test.go Show resolved Hide resolved
}
}
if len(readyWorkerNodes.Items) < 1 {
Skip("There isn't an available (and ready) worker node in the cluster")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should result in a failure IMHO

test/e2e/far_e2e_test.go Show resolved Hide resolved
}

// pickRemediatedNode randomly returns a next remediated node from the current available nodes,
// and then the node is removed from the list of available nodes
func pickRemediatedNode(availableNodes *corev1.NodeList) *corev1.Node {
if len(availableNodes.Items) < 1 {
Fail("No available node found for remediation")
Skip("There isn't an available (and ready) worker node in the cluster")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why switch from Fail to Skip? This is hiding that we don't run tests...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually not sure if this If is needed as it was already checked (and will Fail) in https://github.com/medik8s/fence-agents-remediation/pull/159/files/8589575a67f85b8b82334a1ea5edd51f334659c6#diff-b9729ac4458c7637438e3bf558999b35adedcaf4d13af1e3eb752db501fd4997R272.
Do you see a value of checking that again after getReadyWorkerNodes function? It might be nice to make them independent 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about moving these checks to the place where both functions are called?
getWorkerNodes returns an empty list. Caller decides if that´s a problem. If yes, fail. If not, go on with pickNode...

test/e2e/far_e2e_test.go Show resolved Hide resolved
Previously it was just skiping but it hides that we don't run tests and still getting a green result
@slintes
Copy link
Member

slintes commented Nov 13, 2024

/lgtm

Feel free to address my suggestion or unhold :)
/hold

Move the check of missing Ready worker node outside of the previous two functions. It simplifies the code
@openshift-ci openshift-ci bot removed the lgtm label Nov 14, 2024
@razo7
Copy link
Member Author

razo7 commented Nov 14, 2024

/test 4.15-openshift-e2e
/test 4.16-openshift-e2e
/test 4.17-openshift-e2e

@openshift-ci openshift-ci bot added the lgtm label Nov 14, 2024
Copy link
Contributor

openshift-ci bot commented Nov 14, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: razo7, slintes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@razo7 razo7 marked this pull request as ready for review November 14, 2024 10:53
@razo7
Copy link
Member Author

razo7 commented Nov 14, 2024

It would be nice if we can merge openshift/release#58825 and then test the PR with OCP 4.18

@razo7
Copy link
Member Author

razo7 commented Nov 14, 2024

openshift/release#58825 tests are failing, and only when this PR will be merged then the tests will begin to be green again.

/retest

@razo7
Copy link
Member Author

razo7 commented Nov 14, 2024

/unhold

@openshift-merge-bot openshift-merge-bot bot merged commit d4f78cd into medik8s:main Nov 14, 2024
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants