Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set Node Unhealthy in E2E by Stopping Kubelet #68

Merged
merged 3 commits into from
Jul 31, 2023

Conversation

razo7
Copy link
Member

@razo7 razo7 commented Jul 24, 2023

The E2E test should emphasize a realistic scenario of only when a node is unhealthy FAR CR should be created. We achieve that by stopping kubelet on the desired node.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 24, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@razo7
Copy link
Member Author

razo7 commented Jul 24, 2023

/test 4.13-openshift-e2e

razo7 added a commit to razo7/fence-agents-remediation that referenced this pull request Jul 24, 2023
razo7 added a commit to razo7/fence-agents-remediation that referenced this pull request Jul 24, 2023
@razo7
Copy link
Member Author

razo7 commented Jul 24, 2023

/test 4.13-openshift-e2e
/test 4.14-openshift-e2e

@razo7
Copy link
Member Author

razo7 commented Jul 24, 2023

/test 4.14-openshift-e2e

test/e2e/far_e2e_test.go Outdated Show resolved Hide resolved
test/e2e/far_e2e_test.go Outdated Show resolved Hide resolved
@razo7
Copy link
Member Author

razo7 commented Jul 25, 2023

/test 4.14-openshift-e2e

@razo7
Copy link
Member Author

razo7 commented Jul 26, 2023

/retest

@@ -201,17 +206,19 @@ func buildNodeParameters(clusterPlatformType configv1.PlatformType) (map[v1alpha
return testNodeParam, nil
}

// randomizeWorkerNode returns a worker node name which is different than the previous one
// randomizeWorkerNode returns a worker node that his name is different than the previous one
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this returns a random node out of the given node list.

that his name is different than the previous one

I don't understand this part

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we randomly pick a worker node and there are only 3 available nodes then we can pick the same node, even though we use a different seed for the random function. Therefore, we store the name of the last unhealthy node so we won't try to create second CR for the same node.

previous one

Refers to he previous remediation, so on the first remediation previousNodeName is nil, and on the second remediation it stores the name of the previous unheathy node in order to not create a CR on that node again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see now.
I advise against such implementation.

  • IMO it's bad practice to use global var in such way (not concurrent safe and confusing)
  • I think the method is confusing
  • we have no idea how many iteration will occur before the for condition is broker because we can always get randomly the same node, also seems like a risk of a deadlock.

Instead I suggest the following , let randomizeWorkerNode do just that - get a random node out of the list:

func randomizeWorkerNode(nodes *corev1.NodeList) *corev1.Node {
	// Generate a random seed based on the current time
	r := rand.New(rand.NewSource(time.Now().UnixNano()))
	// Randomly select a worker node
	return &nodes.Items[r.Intn(len(nodes.Items))]
}

And just pass to it an already filtered list (filter out the node you don't want it to get).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point, I hope the new change helps.

@mshitrit
Copy link
Member

/lgtm
/hold

test/e2e/utils/command.go Outdated Show resolved Hide resolved
test/e2e/utils/command.go Outdated Show resolved Hide resolved
@openshift-ci openshift-ci bot removed the lgtm label Jul 26, 2023
test/e2e/far_e2e_test.go Outdated Show resolved Hide resolved
test/e2e/far_e2e_test.go Outdated Show resolved Hide resolved
Copy link
Member

@slintes slintes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one last thing... other than that lgtm

far := createFAR(nodeName, fenceAgent, testShareParam, testNodeParam)
DeferCleanup(deleteFAR, far)
})
When("running FAR to reboot one node", func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably caused by rebasing etc., but I think this When() block can be removed. We test 2 remediations in the When() block below, that's enough I think :)

The E2E test should emphasize a realistic scenario of only when a node is unhealthy FAR CR should be created. We achieve that by stopping kubelet on the desired node
Test one remediation and two remediations while filtering the first node from selection after the first remediation
@openshift-ci openshift-ci bot added the lgtm label Jul 27, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 27, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: razo7, slintes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Function declaration and comments
@openshift-ci openshift-ci bot removed the lgtm label Jul 30, 2023
@razo7
Copy link
Member Author

razo7 commented Jul 30, 2023

/unhold

@razo7
Copy link
Member Author

razo7 commented Jul 30, 2023

/retest

1 similar comment
@razo7
Copy link
Member Author

razo7 commented Jul 30, 2023

/retest

@slintes
Copy link
Member

slintes commented Jul 31, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jul 31, 2023
@openshift-merge-robot openshift-merge-robot merged commit 830e87c into medik8s:main Jul 31, 2023
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants