OOMKilled error for gitclone does not say which container was killed? init, main, wait? #10007

Open · 2 of 3 tasks
tooptoop4 opened this issue Nov 10, 2022 · 1 comment · May be fixed by #13790
Labels
area/controller (Controller issues, panics), type/feature (Feature request)

Comments

@tooptoop4 (Contributor)

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

[screenshot of the OOMKilled error message]

I have 5G of memory set for this step; it sometimes works and sometimes doesn't.

I'm not sure whether I need to increase the init or wait container memory too.

Is it possible to make the error message say which container was OOMKilled?

Version

3.4.3

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Like this one: https://github.com/argoproj/argo-workflows/blob/9f5759b5bd2a01d0f2930faa20ad5a769395eb99/examples/input-artifact-git.yaml

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

@alexec added the type/feature label and removed type/bug on Nov 16, 2022
@tooptoop4 (Contributor, Author)

My guess is that the relevant code is inferFailedReason in the controller:

func (woc *wfOperationCtx) inferFailedReason(pod *apiv1.Pod, tmpl *wfv1.Template) (wfv1.NodePhase, string) {
	if pod.Status.Message != "" {
		// Pod has a nice error message. Use that.
		return wfv1.NodeFailed, pod.Status.Message
	}
	// We only get one message to set for the overall node status.
	// If multiple containers failed, in order of preference:
	// init containers (will be appended later), main (annotated), main (exit code), wait, sidecars.
	order := func(n string) int {
		switch {
		case tmpl.IsMainContainerName(n):
			return 1
		case n == common.WaitContainerName:
			return 2
		default:
			return 3
		}
	}
	ctrs := pod.Status.ContainerStatuses
	sort.Slice(ctrs, func(i, j int) bool { return order(ctrs[i].Name) < order(ctrs[j].Name) })
	// Init containers have the highest preferences over other containers.
	ctrs = append(pod.Status.InitContainerStatuses, ctrs...)
	// When there isn't any containstatus (such as no stock in public cloud), return Failure.
	if len(ctrs) == 0 {
		return wfv1.NodeFailed, fmt.Sprintf("can't find failed message for pod %s namespace %s", pod.Name, pod.Namespace)
	}
	for _, ctr := range ctrs {
		// Virtual Kubelet environment will not set the terminate on waiting container
		// https://github.com/argoproj/argo-workflows/issues/3879
		// https://github.com/virtual-kubelet/virtual-kubelet/blob/7f2a02291530d2df14905702e6d51500dd57640a/node/sync.go#L195-L208
		if ctr.State.Waiting != nil {
			return wfv1.NodeError, fmt.Sprintf("Pod failed before %s container starts due to %s: %s", ctr.Name, ctr.State.Waiting.Reason, ctr.State.Waiting.Message)
		}
		t := ctr.State.Terminated
		if t == nil {
			// We should never get here
			log.Warnf("Pod %s phase was Failed but %s did not have terminated state", pod.Name, ctr.Name)
			continue
		}
		if t.ExitCode == 0 {
			continue
		}
		msg := fmt.Sprintf("%s (exit code %d)", t.Reason, t.ExitCode)
		if t.Message != "" {
			msg = fmt.Sprintf("%s: %s", msg, t.Message)
		}
		switch {
		case ctr.Name == common.InitContainerName:
			return wfv1.NodeError, msg
		case tmpl.IsMainContainerName(ctr.Name):
			return wfv1.NodeFailed, msg
		case ctr.Name == common.WaitContainerName:
			return wfv1.NodeError, msg
		default:
			if t.ExitCode == 137 || t.ExitCode == 143 {
				// if the sidecar was SIGKILL'd (exit code 137) assume it was because argoexec
				// forcibly killed the container, which we ignore the error for.
				// Java code 143 is a normal exit 128 + 15 https://github.com/elastic/elasticsearch/issues/31847
				log.Infof("Ignoring %d exit code of container '%s'", t.ExitCode, ctr.Name)
			} else {
				return wfv1.NodeFailed, msg
			}
		}
	}
	// If we get here, we have detected that the main/wait containers succeed but the sidecar(s)
	// were SIGKILL'd. The executor may have had to forcefully terminate the sidecar (kill -9),
	// resulting in a 137 exit code (which we had ignored earlier). If failMessages is empty, it
	// indicates that this is the case and we return Success instead of Failure.
	return wfv1.NodeSucceeded, ""
}
We could append the failing container's name (init, wait, etc.) to the message.
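
As a minimal sketch of that idea (the helper name and message format below are assumptions for illustration, not existing project code), the container name could be included when the failure message is built:

package main

import "fmt"

// buildFailureMessage is a hypothetical helper showing how the node failure
// message could name the container that failed, so an OOMKilled report says
// whether it was the init, main, wait, or a sidecar container.
func buildFailureMessage(containerName, reason, message string, exitCode int32) string {
	msg := fmt.Sprintf("%s: %s (exit code %d)", containerName, reason, exitCode)
	if message != "" {
		msg = fmt.Sprintf("%s: %s", msg, message)
	}
	return msg
}

func main() {
	// Example: the wait container killed by the OOM killer (SIGKILL, exit code 137).
	fmt.Println(buildFailureMessage("wait", "OOMKilled", "", 137))
	// Prints: wait: OOMKilled (exit code 137)
}

In inferFailedReason this would amount to prefixing ctr.Name onto the msg built from t.Reason and t.ExitCode before it is returned.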

@agilgur5 added the area/controller (Controller issues, panics) label on Oct 19, 2024