OOMKilled error for gitclone does not say which container was killed? init, main, wait? #10007

Open · 2 of 3 tasks
tooptoop4 opened this issue Nov 10, 2022 · 1 comment · May be fixed by #13790
Labels
area/controller (Controller issues, panics), type/feature (Feature request)

Comments

@tooptoop4 (Contributor)

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

[screenshot of the OOMKilled error message]

I have 5G of memory set for this step; it sometimes works and sometimes doesn't.

I'm not sure whether I need to increase the init or wait container memory too.

Is it possible to make the error message say which container was OOMKilled?

Version

3.4.3

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Like this one: https://github.com/argoproj/argo-workflows/blob/9f5759b5bd2a01d0f2930faa20ad5a769395eb99/examples/input-artifact-git.yaml

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

@alexec added the type/feature label and removed type/bug on Nov 16, 2022
@tooptoop4 (Contributor, Author)

My guess is that the relevant code is inferFailedReason in the controller:

func (woc *wfOperationCtx) inferFailedReason(pod *apiv1.Pod, tmpl *wfv1.Template) (wfv1.NodePhase, string) {
	if pod.Status.Message != "" {
		// Pod has a nice error message. Use that.
		return wfv1.NodeFailed, pod.Status.Message
	}
	// We only get one message to set for the overall node status.
	// If multiple containers failed, in order of preference:
	// init containers (will be appended later), main (annotated), main (exit code), wait, sidecars.
	order := func(n string) int {
		switch {
		case tmpl.IsMainContainerName(n):
			return 1
		case n == common.WaitContainerName:
			return 2
		default:
			return 3
		}
	}
	ctrs := pod.Status.ContainerStatuses
	sort.Slice(ctrs, func(i, j int) bool { return order(ctrs[i].Name) < order(ctrs[j].Name) })
	// Init containers have the highest preferences over other containers.
	ctrs = append(pod.Status.InitContainerStatuses, ctrs...)
	// When there isn't any containstatus (such as no stock in public cloud), return Failure.
	if len(ctrs) == 0 {
		return wfv1.NodeFailed, fmt.Sprintf("can't find failed message for pod %s namespace %s", pod.Name, pod.Namespace)
	}
	for _, ctr := range ctrs {
		// Virtual Kubelet environment will not set the terminate on waiting container
		// https://github.com/argoproj/argo-workflows/issues/3879
		// https://github.com/virtual-kubelet/virtual-kubelet/blob/7f2a02291530d2df14905702e6d51500dd57640a/node/sync.go#L195-L208
		if ctr.State.Waiting != nil {
			return wfv1.NodeError, fmt.Sprintf("Pod failed before %s container starts due to %s: %s", ctr.Name, ctr.State.Waiting.Reason, ctr.State.Waiting.Message)
		}
		t := ctr.State.Terminated
		if t == nil {
			// We should never get here
			log.Warnf("Pod %s phase was Failed but %s did not have terminated state", pod.Name, ctr.Name)
			continue
		}
		if t.ExitCode == 0 {
			continue
		}
		msg := fmt.Sprintf("%s (exit code %d)", t.Reason, t.ExitCode)
		if t.Message != "" {
			msg = fmt.Sprintf("%s: %s", msg, t.Message)
		}
		switch {
		case ctr.Name == common.InitContainerName:
			return wfv1.NodeError, msg
		case tmpl.IsMainContainerName(ctr.Name):
			return wfv1.NodeFailed, msg
		case ctr.Name == common.WaitContainerName:
			return wfv1.NodeError, msg
		default:
			if t.ExitCode == 137 || t.ExitCode == 143 {
				// if the sidecar was SIGKILL'd (exit code 137) assume it was because argoexec
				// forcibly killed the container, which we ignore the error for.
				// Java code 143 is a normal exit 128 + 15 https://github.com/elastic/elasticsearch/issues/31847
				log.Infof("Ignoring %d exit code of container '%s'", t.ExitCode, ctr.Name)
			} else {
				return wfv1.NodeFailed, msg
			}
		}
	}
	// If we get here, we have detected that the main/wait containers succeed but the sidecar(s)
	// were SIGKILL'd. The executor may have had to forcefully terminate the sidecar (kill -9),
	// resulting in a 137 exit code (which we had ignored earlier). If failMessages is empty, it
	// indicates that this is the case and we return Success instead of Failure.
	return wfv1.NodeSucceeded, ""
}
We could append the failing container's name (init, wait, etc.) to the message.
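
As a minimal sketch of that idea (the helper name and message format below are assumptions for illustration, not existing project code), the container name could be included when the failure message is built:

package main

import "fmt"

// buildFailureMessage is a hypothetical helper showing how the node failure
// message could name the container that failed, so an OOMKilled report says
// whether it was the init, main, wait, or a sidecar container.
func buildFailureMessage(containerName, reason, message string, exitCode int32) string {
	msg := fmt.Sprintf("%s: %s (exit code %d)", containerName, reason, exitCode)
	if message != "" {
		msg = fmt.Sprintf("%s: %s", msg, message)
	}
	return msg
}

func main() {
	// Example: the wait container killed by the OOM killer (SIGKILL, exit code 137).
	fmt.Println(buildFailureMessage("wait", "OOMKilled", "", 137))
	// Prints: wait: OOMKilled (exit code 137)
}

In inferFailedReason this would amount to prefixing ctr.Name onto the msg built from t.Reason and t.ExitCode before it is returned.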

@agilgur5 added the area/controller (Controller issues, panics) label on Oct 19, 2024