3.4.9 - Failed to get a template #11441
Comments
The code that gets the image needs to have a retry. That could be added here.
@alexec Thanks for your response! I'm definitely willing to contribute to this. I do have a question regarding this: would it be a good idea to make the retries configurable, for instance through the controller ConfigMap? Or would we rather just opt for a simple retry with some fixed values?

Edit: I do see the following code in the controller:

    err := waitutil.Backoff(retry.DefaultRetry, func() (bool, error) {
        err := woc.controller.kubeclientset.PolicyV1().PodDisruptionBudgets(woc.wf.Namespace).Delete(ctx, woc.wf.Name, metav1.DeleteOptions{})
        if apierr.IsNotFound(err) {
            return true, nil
        }
        return !errorsutil.IsTransientErr(err), err
    })

Is this the standard way of performing such retries? If so, I will implement it when I have the time. Thanks again!
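For illustration, the same backoff helper could wrap the template/image lookup. Below is a minimal sketch, not actual controller code: the fetchTemplate stand-in and the import paths are assumptions for the example; only waitutil.Backoff, retry.DefaultRetry, and errorsutil.IsTransientErr are taken from the snippet above.

    // Minimal sketch: wrap a hypothetical fetchTemplate call in the same
    // transient-error backoff used in the snippet above.
    package main

    import (
        "fmt"

        "k8s.io/client-go/util/retry"

        // Assumed import paths for the helpers shown in the snippet above.
        errorsutil "github.com/argoproj/argo-workflows/v3/util/errors"
        waitutil "github.com/argoproj/argo-workflows/v3/util/wait"
    )

    // fetchTemplate stands in for whatever call resolves the template/image;
    // it is hypothetical and only here so the example is self-contained.
    func fetchTemplate(name string) error {
        return fmt.Errorf("template %q not found", name)
    }

    func fetchTemplateWithRetry(name string) error {
        return waitutil.Backoff(retry.DefaultRetry, func() (bool, error) {
            err := fetchTemplate(name)
            if err == nil {
                return true, nil
            }
            // Keep retrying only while the error looks transient; give up otherwise.
            return !errorsutil.IsTransientErr(err), err
        })
    }

    func main() {
        if err := fetchTemplateWithRetry("main"); err != nil {
            fmt.Println("failed after retries:", err)
        }
    }

The closure returns done=true on success or on a non-transient error, so only transient failures are retried, mirroring the pattern in the quoted code.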
I managed to run Argo Workflows locally, and it seems this problem does not occur there; the same error leads to a retry. We will try using the
@CK-Yong Are you getting the same issue in both 3.4.8 and 3.4.9, or just in 3.4.9?
This issue was first seen in 3.4.8, shortly after we deployed it. 3.4.9 also shows the issue, but the errors slowly tapered off after it had been running on our live servers for a couple of days (presumably due to caching). Currently, the

Strangely enough, I just tested this locally by pointing a workflow at an image in a private repo. This is expected to fail with an Unauthorized code, but in this case it performs an ImagePullBackOff properly. Tested this on 3.4.9 and
Have an update regarding this: we found that this issue occurs when there are a lot of nodes spinning up on our cluster. This happened because we had a lot of containers without any resource quotas specified. What we did to resolve this was to introduce a

Since then, the issue has not presented itself anymore. Will reopen the ticket when we encounter this again.
Pre-requisites
What happened/what you expected to happen?
I'm not sure exactly how to report this, as it is more of a sporadic bug, but I'm hoping to get some pointers or tips here. I first asked in the Slack community, but someone told me to raise a bug ticket here. So here goes:
The issue
After we migrated from 3.3.7 to 3.4.8 (and subsequently to 3.4.9), we started seeing errors in our workflow-controller. There are two errors that keep popping up:
And the other one (should I create a separate issue for it?):
After a retry, the workflow continues and succeeds. Most of our workflows are fine, but sometimes we get this error, and we need to manually retry them to continue operations.
Workaround
Add extensive retry policies, for example:
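A minimal sketch of what such a retry policy can look like on an Argo Workflows template; the limits, backoff values, and image below are illustrative, not our exact configuration:

    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: retry-example-
    spec:
      entrypoint: main
      templates:
        - name: main
          # Retry the step on any failure, with exponential backoff between attempts.
          retryStrategy:
            limit: "5"
            retryPolicy: Always
            backoff:
              duration: "30s"
              factor: "2"
              maxDuration: "10m"
          container:
            image: alpine:3.18
            command: [sh, -c, "echo hello"]

retryPolicy: Always retries errors as well as transient failures; OnTransientError is a narrower alternative if only transient errors should be retried.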
Environment and setup
Azure Kubernetes Service, Kubernetes version 1.25.6. Private images are hosted on Azure Container Registry; the connection from AKS is through the Managed Identity flow.
Replicas of the workflow-controller and server are set to the recommended values:
MaxUnavailable: 1
Expected behaviour
Images should be pulled and containers should spin up.
Version
3.4.9
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container