Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test/e2e: Add debug of failed pods #2216

Merged

Conversation

stevenhorsman
Copy link
Member

@stevenhorsman stevenhorsman commented Dec 18, 2024

Print logs of failed pods and CAA in case of failures to help with debugging in the case of failures

@stevenhorsman stevenhorsman added the test_e2e_libvirt Run Libvirt e2e tests label Dec 18, 2024
@stevenhorsman
Copy link
Member Author

Typically - when I want tests to fail they all pass! Re-running...

@stevenhorsman stevenhorsman force-pushed the debug-failed-pods branch 2 times, most recently from 692fef8 to b25dcff Compare December 19, 2024 15:38
@stevenhorsman
Copy link
Member Author

https://github.com/confidential-containers/cloud-api-adaptor/actions/runs/12415908135/job/34663884546?pr=2216 shows an example (with deliberately failing tests of it selecting the correct bit of the CAA log related to the pod that fails.

@stevenhorsman
Copy link
Member Author

https://github.com/confidential-containers/cloud-api-adaptor/actions/runs/12417178259/job/34668109916?pr=2216 shows that the CAA log is now trimmed to only show the messages related to the pod that failed.

@stevenhorsman stevenhorsman marked this pull request as ready for review December 20, 2024 09:51
@stevenhorsman stevenhorsman requested a review from a team as a code owner December 20, 2024 09:51
@mkulke
Copy link
Collaborator

mkulke commented Dec 20, 2024

can you maybe screen shot an example? I fail to see spot anything in the debug step output, but maybe I'm looking in the wrong place

@stevenhorsman
Copy link
Member Author

stevenhorsman commented Dec 20, 2024

can you maybe screen shot an example? I fail to see spot anything in the debug step output, but maybe I'm looking in the wrong place

Sorry, it's not in the debug step, but inline with the tests (I'm not sure there is a way to separate and report later, as by the time we debug we've lost the specifics). As an example you can search the test output for --- FAIL: TestLibvirtCreatePeerPodAndCheckWorkDirLogs (40.12s) and see that just before before it we have:
image
which is the pod log, the pod events that were in warning/error (none in this case as I just changed the expected output to force a test failure) and the CAA logs starting with the request to create the sandbox for the pod

@mkulke
Copy link
Collaborator

mkulke commented Dec 20, 2024

can you maybe screen shot an example? I fail to see spot anything in the debug step output, but maybe I'm looking in the wrong place

Sorry, it's not in the debug step, but inline with the tests (I'm not sure there is a way to separate and report later, as by the time we debug we've lost the specifics)

I guess the logs can be written to file in a ./fail folder to be printed out in the debug step, but having them in the test output also works

func GetPodsFromJob(ctx context.Context, t *testing.T, client klient.Client, job *batchv1.Job) (*v1.PodList, error) {
clientset, err := kubernetes.NewForConfig(client.RESTConfig())
if err != nil {
return nil, fmt.Errorf("GetPodFromJob: get Kubernetes clientSet failed: %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @stevenhorsman !

s/GetPodFromJob/GetPodsFromJob/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This returns a PodList, so thats why I called it GetPods as it could return multiple

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant to say that the function's name being printed should be GetPodsFromJob

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I didn't look at the line properly!


pods, err := clientset.CoreV1().Pods(job.Namespace).List(context.TODO(), metav1.ListOptions{LabelSelector: "job-name=" + job.Name})
if err != nil {
return nil, fmt.Errorf("GetPodFromJob: get pod list failed: %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be fixed

}

// Find the logs starting with the pod
pod_matcher := regexp.MustCompile(`[0-9]{4}\/[0-9]{2}\/[0-9]{2} ([0-1]?[0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9] \[adaptor\/cloud\] create a sandbox [0-9|a-f]* for pod ` + pod.Name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This regex is pure evil :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, I'll split it up to be clearer

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine. It just caught my attention, I used to enjoy these crazy regex :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've provided an example of what we are trying to match against and broken out the date and time matcher, so I think it should be more understandable now. Just re-testing with a deliberate error again as regex is prone to failures and I don't test my manual testing as I'm on a run of mistakes!

Copy link
Member

@wainersm wainersm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @stevenhorsman !

If a test fails, log debug info for the pod/job
- The log of the pod
- The extract of CAA log related to that pod
- The warning/error events for that pod

Signed-off-by: stevenhorsman <[email protected]>
@stevenhorsman
Copy link
Member Author

https://github.com/confidential-containers/cloud-api-adaptor/actions/runs/12434706844/job/34719400000?pr=2216 shows that the CAA pod specific logging is still working, so my regex looks okay

If we just hit timeout whilst waiting for pods to get to
into the expected state, or during testing/clean-up
of resources, then `Error` instead of `Fatal`
as we don't want to skip the logging and cleanup.

Signed-off-by: stevenhorsman <[email protected]>
@stevenhorsman stevenhorsman merged commit cbbbc2f into confidential-containers:main Jan 3, 2025
42 checks passed
@stevenhorsman stevenhorsman deleted the debug-failed-pods branch January 3, 2025 15:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test_e2e_libvirt Run Libvirt e2e tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants