3.4.9 - Failed to get a template #11441

Closed · 3 tasks done
ckyongrtl opened this issue Jul 25, 2023 · 6 comments · May be fixed by #13862
Labels
  • area/controller: Controller issues, panics
  • P2: Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important
  • solution/suggested: A solution to the bug has been suggested. Someone needs to implement it.
  • solution/unimplemented: This was not implemented, but may be in the future
  • type/bug
  • type/regression: Regression from previous behavior (a specific type of bug)

Comments

@ckyongrtl

ckyongrtl commented Jul 25, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

I'm not sure exactly how to report this, as it is more of a sporadic bug, but I'm hoping to get some pointers or tips here. I first asked in the Slack community, but someone told me to raise a bug ticket here. So here goes:

The issue

After we migrated from 3.3.7 to 3.4.8 (and subsequently to 3.4.9), we started seeing errors in our workflow controller. There are two errors that keep popping up:

time="2023-07-25T07:20:19.034Z" level=error msg="Mark error node" error="failed to get a template" namespace=watch nodeName="video-processing-pwnzp(0).start-ingest-workflow(0).delete-files-from-storage(0)" workflow=video-processing-pwnzp pod=argoworkflows-workflow-controller-77cd69fc74-hxzw4 filename=/var/log/pods/argo_argoworkflows-workflow-controller-77cd69fc74-hxzw4_15b23469-10bb-45d7-b38b-2d85df6d1e49/controller/0.log node_name=aks-standardhigh-16716760-vmss00008a

And the other one (should I create a separate issue for it?):

time="2023-07-25T01:18:47.145Z" level=error msg="Mark error node" error="failed to look-up entrypoint/cmd for image \"myacr.azurecr.io/checkfile:20230724.5\", you must either explicitly specify the command, or list the image's command in the index: https://argoproj.github.io/argo-workflows/workflow-executors/#emissary-emissary: GET https://myacr.azurecr.io/oauth2/token?scope=repository%3Acheckfile%3Apull&service=myacr.azurecr.io: UNAUTHORIZED: authentication required, visit https://aka.ms/acr/authorization for more information." namespace=watch nodeName="clip-created-fppgr.check-file(0)" workflow=clip-created-fppgr pod=argoworkflows-workflow-controller-77cd69fc74-hxzw4 filename=/var/log/pods/argo_argoworkflows-workflow-controller-77cd69fc74-hxzw4_15b23469-10bb-45d7-b38b-2d85df6d1e49/controller/0.log node_name=aks-standardhigh-16716760-vmss00008a

After a retry, the workflow continues and succeeds. Most of our workflows are fine, but sometimes we get this error, and we need to manually retry them to continue operations.

Workaround

Add an extensive retry policy such as the following:

retryStrategy:
  retryPolicy: "OnError"
  limit: 5
  backoff:
    duration: "2" # default unit is seconds
    factor: 2
    maxDuration: "1m"
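
For reference, a minimal sketch of where such a retryStrategy sits, using the check-file template name and image from the logs above (the container spec is abbreviated and partly assumed):

# Sketch only: retryStrategy is attached at the template level, so the node is
# retried whenever the controller marks it as errored.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: check-file
spec:
  templates:
  - name: check-file
    retryStrategy:
      retryPolicy: "OnError"
      limit: 5
      backoff:
        duration: "2" # default unit is seconds
        factor: 2
        maxDuration: "1m"
    container: # assumed; the real template may be a script
      image: myacr.azurecr.io/checkfile:20230724.5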

Environment and setup

Azure Kubernetes Service (AKS), Kubernetes version 1.25.6.
Private images are hosted on Azure Container Registry; AKS connects to it via the managed identity flow.
Replicas of the workflow controller and server are set to the recommended settings:

  • Controller with 2 replicas, PDB with MaxUnavailable: 1
  • Server with 3 replicas, HPA set to 3 to 5 replicas.

Expected behaviour

Images should be pulled and containers should spin up.

Version

3.4.9

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

# This template is called by another template, which is called by another Workflow.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: ingest
spec:
  templates:
  - name: ingest-workflow
    dag:
      tasks:
      - name: external-template-script
        templateRef:
          name: run-script
          template: run-script
      - name: private-image-template
        templateRef:
          name: create-item
          template: create-item
---
# Template dependencies for the above.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: run-script
spec:
  templates:
  - name: run-script
    activeDeadlineSeconds: 900
    script:
      name: copyfiles
      image: bash
      command: [bash]
      source: |
        echo 'Copying...'
---
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: create-item
spec:
  templates:
  - name: create-item
    activeDeadlineSeconds: 900
    script:
      name: my-private-code
      image: myacr/private-image-with-entrypoint
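
For completeness, a hypothetical parent Workflow that could sit on top of these templates; the video-processing and start-ingest-workflow names are taken from the node name in the logs, the rest is illustrative:

# Hypothetical top-level Workflow (not part of the original setup) that invokes
# the ingest WorkflowTemplate above via templateRef.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: video-processing-
spec:
  entrypoint: main
  templates:
  - name: main
    dag:
      tasks:
      - name: start-ingest-workflow
        templateRef:
          name: ingest
          template: ingest-workflow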

Logs from the workflow controller

See description above

Logs from in your workflow's wait container

N/A. Containers aren't even started.
@alexec
Contributor

alexec commented Jul 25, 2023

The code that gets the image needs to have a retry. That could be added here.

@CK-Yong

CK-Yong commented Jul 25, 2023

@alexec Thanks for your response! I'm definitely willing to contribute to this. I do have a question regarding this:

Would it be a good idea to make the retries configurable? For instance through the controller ConfigMap? Or would we rather just opt for a simple retry with some fixed values?
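
Purely for illustration, a hypothetical ConfigMap entry sketching what configurable retries could look like (this key does not exist in Argo Workflows; it is only meant to make the question concrete):

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
data:
  # Hypothetical key, not an existing Argo Workflows option.
  imageIndexLookupRetry: |
    limit: 3
    duration: 2s
    factor: 2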

Edit: I do see the following code in workflow/controller/operator.go:

	err := waitutil.Backoff(retry.DefaultRetry, func() (bool, error) {
		err := woc.controller.kubeclientset.PolicyV1().PodDisruptionBudgets(woc.wf.Namespace).Delete(ctx, woc.wf.Name, metav1.DeleteOptions{})
		if apierr.IsNotFound(err) {
			return true, nil
		}
		return !errorsutil.IsTransientErr(err), err
	})

Is this the standard way of performing such retries? If so, I will implement it when I have the time. Thanks again!

@CK-Yong

CK-Yong commented Jul 27, 2023

I managed to run Argo Workflows locally, and this problem does not seem to occur there; the same error leads to a retry. We will try using the latest version in our testing environment and see if we can reproduce the error.

@tooptoop4
Contributor

@CK-Yong are you getting the same issue in 3.4.8 and 3.4.9, or just in 3.4.9?

@CK-Yong

CK-Yong commented Jul 28, 2023

@CK-Yong are you getting the same issue in 3.4.8 and 3.4.9, or just in 3.4.9?

This issue was first seen in 3.4.8, shortly after we deployed it. 3.4.9 also shows this issue, but the errors slowly dropped off after it had been running on our live servers for a couple of days (presumably due to caching). Currently, the failed to look-up entrypoint/cmd for image "myacr.azurecr.io/checkfile:20230724.5" error is not popping up anymore, but the Failed to get a template error is.

Strangely enough, I just tested this locally by pointing a workflow at an image in a private repo. This is expected to fail with an Unauthorized code, but in this case it goes into ImagePullBackOff properly. I tested this on 3.4.9 and master.


@caelan-io added the type/regression and solution/suggested labels Aug 31, 2023
@agilgur5 added the area/controller label Sep 1, 2023
@juliev0 added the P3 Low priority label Sep 7, 2023
@agilgur5 added the P2 label and removed the P3 Low priority label Nov 15, 2023
@ckyongrtl
Author

ckyongrtl commented Dec 4, 2023

I have an update regarding this: we found that this issue occurs when a lot of nodes are spinning up on our cluster. This happened because we had a lot of containers without any resource requests or limits specified. What we did to resolve this was to introduce a LimitRange so that pods are assigned default resource requests and limits.
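
A minimal sketch of such a LimitRange, assuming illustrative resource values and the watch namespace from the logs above:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-resources
  namespace: watch
spec:
  limits:
  - type: Container
    defaultRequest: # default requests for containers that don't specify any
      cpu: 100m
      memory: 128Mi
    default: # default limits for containers that don't specify any
      cpu: "1"
      memory: 1Gi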

Since then, the issue has not presented itself anymore. Will reopen the ticket when we encounter this again.

@agilgur5 closed this as not planned (won't fix, can't repro, duplicate, stale) Jan 7, 2024
@agilgur5 added the solution/unimplemented label Feb 9, 2024