3.4.9 - Failed to get a template #11441

Closed · 3 tasks done
ckyongrtl opened this issue Jul 25, 2023 · 6 comments · May be fixed by #13862
Labels
  • area/controller: Controller issues, panics
  • P2: Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important
  • solution/suggested: A solution to the bug has been suggested. Someone needs to implement it.
  • solution/unimplemented: This was not implemented, but may be in the future
  • type/bug
  • type/regression: Regression from previous behavior (a specific type of bug)

Comments

@ckyongrtl

ckyongrtl commented Jul 25, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

I'm not sure exactly how to report this, as it is more of a sporadic bug, but I'm hoping to get some pointers or tips here. I first asked in the Slack community, but someone told me to raise a bug ticket here. So here goes:

The issue

After we migrated from 3.3.7 to 3.4.8 (and subsequently to 3.4.9), we started seeing errors in our workflow controller. There are two errors that keep popping up:

time="2023-07-25T07:20:19.034Z" level=error msg="Mark error node" error="failed to get a template" namespace=watch nodeName="video-processing-pwnzp(0).start-ingest-workflow(0).delete-files-from-storage(0)" workflow=video-processing-pwnzp pod=argoworkflows-workflow-controller-77cd69fc74-hxzw4 filename=/var/log/pods/argo_argoworkflows-workflow-controller-77cd69fc74-hxzw4_15b23469-10bb-45d7-b38b-2d85df6d1e49/controller/0.log node_name=aks-standardhigh-16716760-vmss00008a

And the other one (should I create a separate issue for it?):

time="2023-07-25T01:18:47.145Z" level=error msg="Mark error node" error="failed to look-up entrypoint/cmd for image \"myacr.azurecr.io/checkfile:20230724.5\", you must either explicitly specify the command, or list the image's command in the index: https://argoproj.github.io/argo-workflows/workflow-executors/#emissary-emissary: GET https://myacr.azurecr.io/oauth2/token?scope=repository%3Acheckfile%3Apull&service=myacr.azurecr.io: UNAUTHORIZED: authentication required, visit https://aka.ms/acr/authorization for more information." namespace=watch nodeName="clip-created-fppgr.check-file(0)" workflow=clip-created-fppgr pod=argoworkflows-workflow-controller-77cd69fc74-hxzw4 filename=/var/log/pods/argo_argoworkflows-workflow-controller-77cd69fc74-hxzw4_15b23469-10bb-45d7-b38b-2d85df6d1e49/controller/0.log node_name=aks-standardhigh-16716760-vmss00008a

After a retry, the workflow continues and succeeds. Most of our workflows are fine, but sometimes we get this error, and we need to manually retry them to continue operations.

Workaround

Add an extensive retry policy such as the following:

retryStrategy:
  retryPolicy: "OnError"
  limit: 5
  backoff:
    duration: "2" # default unit is seconds
    factor: 2
    maxDuration: "1m"
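
For reference, a minimal sketch of where such a retryStrategy sits, using the check-file template name and image from the logs above (the container spec is abbreviated and partly assumed):

# Sketch only: retryStrategy is attached at the template level, so the node is
# retried whenever the controller marks it as errored.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: check-file
spec:
  templates:
  - name: check-file
    retryStrategy:
      retryPolicy: "OnError"
      limit: 5
      backoff:
        duration: "2" # default unit is seconds
        factor: 2
        maxDuration: "1m"
    container: # assumed; the real template may be a script
      image: myacr.azurecr.io/checkfile:20230724.5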

Environment and setup

Azure Kubernetes Service (AKS), Kubernetes version 1.25.6.
Private images are hosted on Azure Container Registry; AKS connects to it via the managed identity flow.
Replicas of the workflow controller and server are set to the recommended settings:

  • Controller with 2 replicas, PDB with MaxUnavailable: 1
  • Server with 3 replicas, HPA set to 3 to 5 replicas.

Expected behaviour

Images should be pulled and containers should spin up.

Version

3.4.9

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

# This template is called by another template, which is called by another Workflow.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: ingest
spec:
  templates:
  - name: ingest-workflow
    dag:
      tasks:
      - name: external-template-script
        templateRef:
          name: run-script
          template: run-script
      - name: private-image-template
        templateRef:
          name: create-item
          template: create-item
---
# Template dependencies for the above.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: run-script
spec:
  templates:
  - name: run-script
    activeDeadlineSeconds: 900
    script:
      name: copyfiles
      image: bash
      command: [bash]
      source: |
        echo 'Copying...'
---
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: create-item
spec:
  templates:
  - name: create-item
    activeDeadlineSeconds: 900
    script:
      name: my-private-code
      image: myacr/private-image-with-entrypoint
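
For completeness, a hypothetical parent Workflow that could sit on top of these templates; the video-processing and start-ingest-workflow names are taken from the node name in the logs, the rest is illustrative:

# Hypothetical top-level Workflow (not part of the original setup) that invokes
# the ingest WorkflowTemplate above via templateRef.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: video-processing-
spec:
  entrypoint: main
  templates:
  - name: main
    dag:
      tasks:
      - name: start-ingest-workflow
        templateRef:
          name: ingest
          template: ingest-workflow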

Logs from the workflow controller

See description above

Logs from in your workflow's wait container

N/A. Containers aren't even started.
@alexec
Contributor

alexec commented Jul 25, 2023

The code that gets the image needs to have a retry. That could be added here.

@CK-Yong

CK-Yong commented Jul 25, 2023

@alexec Thanks for your response! I'm definitely willing to contribute to this. I do have a question regarding this:

Would it be a good idea to make the retries configurable? For instance through the controller ConfigMap? Or would we rather just opt for a simple retry with some fixed values?
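
Purely for illustration, a hypothetical ConfigMap entry sketching what configurable retries could look like (this key does not exist in Argo Workflows; it is only meant to make the question concrete):

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
data:
  # Hypothetical key, not an existing Argo Workflows option.
  imageIndexLookupRetry: |
    limit: 3
    duration: 2s
    factor: 2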

Edit: I do see the following code in workflow/controller/operator.go:

	err := waitutil.Backoff(retry.DefaultRetry, func() (bool, error) {
		err := woc.controller.kubeclientset.PolicyV1().PodDisruptionBudgets(woc.wf.Namespace).Delete(ctx, woc.wf.Name, metav1.DeleteOptions{})
		if apierr.IsNotFound(err) {
			return true, nil
		}
		return !errorsutil.IsTransientErr(err), err
	})

Is this the standard way of performing such retries? If so, I will implement it when I have the time. Thanks again!

@CK-Yong

CK-Yong commented Jul 27, 2023

I managed to run Argo Workflows locally, and this problem does not seem to occur there; the same error leads to a retry. We will try using the latest version in our testing environment and see if we can reproduce the error.

@tooptoop4
Contributor

@CK-Yong are you getting the same issue in 3.4.8 and 3.4.9, or just in 3.4.9?

@CK-Yong

CK-Yong commented Jul 28, 2023

@CK-Yong are you getting the same issue in 3.4.8 and 3.4.9, or just in 3.4.9?

This issue was first seen in 3.4.8, shortly after we deployed it. 3.4.9 also shows this issue, but the errors slowly dropped off after it had been running on our live servers for a couple of days (presumably due to caching). Currently, the failed to look-up entrypoint/cmd for image "myacr.azurecr.io/checkfile:20230724.5" error is not popping up anymore, but the Failed to get a template error is.

Strangely enough, I just tested this locally by pointing a workflow at an image in a private repo. This is expected to fail with an Unauthorized code, but in this case it goes into ImagePullBackOff properly. I tested this on 3.4.9 and master.


@caelan-io added the type/regression and solution/suggested labels Aug 31, 2023
@agilgur5 added the area/controller label Sep 1, 2023
@juliev0 added the P3 Low priority label Sep 7, 2023
@agilgur5 added the P2 label and removed the P3 Low priority label Nov 15, 2023
@ckyongrtl
Author

ckyongrtl commented Dec 4, 2023

I have an update regarding this: we found that this issue occurs when a lot of nodes are spinning up on our cluster. This happened because we had a lot of containers without any resource requests or limits specified. What we did to resolve this was to introduce a LimitRange so that pods are assigned default resource requests and limits.
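
A minimal sketch of such a LimitRange, assuming illustrative resource values and the watch namespace from the logs above:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-resources
  namespace: watch
spec:
  limits:
  - type: Container
    defaultRequest: # default requests for containers that don't specify any
      cpu: 100m
      memory: 128Mi
    default: # default limits for containers that don't specify any
      cpu: "1"
      memory: 1Gi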

Since then, the issue has not presented itself anymore. Will reopen the ticket when we encounter this again.

@agilgur5 closed this as not planned (won't fix, can't repro, duplicate, stale) Jan 7, 2024
@agilgur5 added the solution/unimplemented label Feb 9, 2024