Rollout failing with msg "the object has been modified; please apply your changes to the latest version" #3080
Comments
Hi @zachaller I am still seeing this |
@bpoland It's a relatively normal log; Rollouts will retry the update. Is there an issue you think this might be causing? |
Yeah our rollout got stuck and I saw this message over and over. The behaviour we saw was:
|
I'm also seeing a LOT of these in the controller logs with a very similar setup to @bpoland, and we're also seeing rollouts getting stuck unless we manually retry the failures |
Do you guys have any type of policy agent that modifies the ReplicaSets, or possibly the pod spec in the ReplicaSet? I have not experienced this and have never been able to reproduce it. In most of the cases I have seen, people had some other controller fighting with the rollouts controller over the ReplicaSet. That's not to say there isn't some issue within the rollouts controller; I just need to be able to reproduce it. |
@pdeva Are you able to reliably reproduce this? Also, your image shows a bunch of issues with the VirtualService, but then you also show a log line on the ReplicaSet, so it could be something else also modifying the VirtualService. |
@zachaller We have policy agents, but it seemed to work fine on v1.3.0, which we just upgraded from. I managed to find a rollout stuck in progress because it seemed like it wasn't updating the replica count in the new ReplicaSet. As a follow-up, we rolled back to v1.3.0 and everything started working again |
We have a linkerd injector which adds a container, maybe that is related? Similar to @mclarke47 though, we have not experienced this previously (currently trying to upgrade from 1.4.1) |
We are also seeing this happen a lot more. Yesterday HPA increased the number of replicas, but the Rollout did not bring up more pods. The Rollout object itself had the correct number set; it's just that the new pods weren't coming up. Killing the Argo Rollouts controller always fixes these stuck cases. It's definitely happening a lot more with the 1.6 version than before. |
Question, would something like HPA modifying the number of replicas count as something that modifies the replicaset and might cause this issue? |
Here's an example. We started seeing these messages at: 2023-11-14T22:10:00Z
And they continued; this is the last one, at 2023-11-15T00:44:37Z
That's two hours. And it only started working again when we killed the argo controller pod. Would it be possible to include in the message what changed? Perhaps that will lead to some clue as to why this is happening? This is the first message (referencing the same replicaset) after the controller restarted:
Can you please explain a bit more about these conflicts that cause the "the object has been modified" errors? What is a common cause? How is the controller meant to deal with them? Presumably nothing was modifying this replicaset for 2 hours straight... is the idea that the controller modifies it, and then something else also modifies it (maybe reverting something), and that's what the controller notices? This is now happening to us daily, so anything we can do to help figure this out, please let us know. We are on 1.6.2. Thank you |
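For background (standard Kubernetes behavior, not specific to this thread): every object carries a metadata.resourceVersion, and the API server rejects any update that is submitted with a stale value, returning exactly this "the object has been modified; please apply your changes to the latest version" error. A minimal sketch with hypothetical names and values:
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: my-app-6bf54dc9d       # hypothetical name
  resourceVersion: "123456"    # opaque version; changes on every write
spec:
  replicas: 5
# If the rollouts controller reads the object at "123456", another writer
# (an HPA, a mutating webhook, another controller) bumps it to "123457",
# and the controller then submits an update still based on "123456", the
# API server rejects it with the conflict error and the controller must
# re-read the object and retry.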
btw, we also run gatekeeper, but it only has one mutating webhook which has to do with HPA, this is what it looks like:
So in theory this shouldn't be touching the replicaset at all. The other webhooks are constraints that have to do with labels and annotations, nothing that would mess with a running pod. |
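Purely as an illustration (not the poster's actual config): a hypothetical Gatekeeper Assign mutation scoped only to HorizontalPodAutoscaler objects, which by construction would not touch ReplicaSets, might look something like this:
apiVersion: mutations.gatekeeper.sh/v1
kind: Assign
metadata:
  name: hpa-example-mutation          # hypothetical
spec:
  applyTo:
    - groups: ["autoscaling"]
      versions: ["v2"]
      kinds: ["HorizontalPodAutoscaler"]
  match:
    scope: Namespaced
    kinds:
      - apiGroups: ["autoscaling"]
        kinds: ["HorizontalPodAutoscaler"]
  location: "metadata.labels.team"    # hypothetical field being set
  parameters:
    assign:
      value: "platform"
# This only matches HorizontalPodAutoscaler objects, so it would not rewrite
# ReplicaSets or the pod spec inside them.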
So the last event that happened before the errors started was HPA taking down one replica. Which maybe was the trigger and what changed in the replicaset since argo saw it last, but somehow it didn't manage to reconcile that properly. This is the HPA view: it looks like it was trying to increase the number of replicas. I wonder if this is Argo and HPA fighting it out then? |
Note that while HPA shows current and desired replicas = 124, the actual number of replicas was 112. So this is similar to what I saw a couple of days ago, where HPA said "bring up more replicas" and argo did not. I assume the "current replicas" comes from the controller (argo in this case). And I can confirm that I did see the Rollout object have the correct desired number of pods, while the number of running pods was smaller. |
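For reference, the HPA-plus-Rollout setup being described wires the HPA directly to the Rollout via scaleTargetRef, so the HPA and the rollouts controller both end up writing replica counts. A minimal sketch with hypothetical names and thresholds:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa                    # hypothetical
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout                     # the HPA scales the Rollout, not a Deployment
    name: my-app
  minReplicas: 10
  maxReplicas: 150
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
# The HPA updates the Rollout's replica count while the rollouts controller
# propagates it to the ReplicaSets, which is where the two writers can race.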
I just want to comment that I think we are also seeing some issues with this in one of our clusters, so I'm spending some time looking into it. |
Do any of you use notifications within your rollouts specs? Trying to see if there is a correlation with notifications updating the ReplicaSet spec. |
yes.
…On Tue, Nov 28, 2023 at 11:56 AM Zach Aller wrote:
Do you guys use notifications within your rollouts specs? |
We don't use notifications currently. |
No notification use here |
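For context on what "notifications within your rollouts specs" refers to: Rollouts notifications are subscribed via annotations on the Rollout object itself. A minimal hypothetical sketch (the name, trigger, and channel are illustrative only):
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app                        # hypothetical
  annotations:
    # Subscribes this Rollout to the on-rollout-completed trigger on a Slack
    # notifier; the notifications engine reads these annotations off the Rollout.
    notifications.argoproj.io/subscribe.on-rollout-completed.slack: my-channel
spec:
  # ... the rest of the Rollout spec (template, strategy, etc.) is unchanged ...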
I have faced similar problems in our cluster.
1. Rollout is stuck during a canary update
When a Rollout is updated, both the old and the new ReplicaSet are running, and then the Rollout is stuck.
Here is a snippet of kubectl get replicaset. Hash
I deleted the old ReplicaSet and then the Rollout status became Healthy.
2. Rollout status becomes Degraded even if pods are running
When a Rollout is updated, it becomes Degraded even if all the new pods are running.
I could refresh the status of the Rollout by restarting the argo-rollouts controller. |
Thank you for fixing this (hopefully for good)! What is the eta for when this might make it into a release? |
Just released it, can you try it out? It will still log the conflict but rollouts should not get stuck anymore. |
I also probably found the root cause of the conflicts; I'm just not sure how to deal with it yet. But they also should not cause any issues, because they do get retried, and we have had this code for a while now: #3218 |
I have updated argo-rollouts to v1.6.3 and this problem seems resolved. |
We've also upgraded and haven't seen the issue again since. Thanks! |
hey folks, we're also seeing this on the latest Argo Rollouts version (the unreleased 1.7.x). In our case, we have a process which annotates (and labels)
In some cases, Argo Rollout Controller seems to lock up and stop reporting any data for that particular |
@NaurisSadovskis Did you also see the issue on 1.5? The logs would be different and not log the error, but could you also see if rollouts got stuck? |
I updated the controller to v1.6.4 and this problem occurs again. As a workaround, we run a CronJob to restart the controller every day. |
@NaurisSadovskis would you be able to test a version with this patch: #3272 |
@zachaller updated and the problem persists - more specifically, the controller is active, but it gets stuck on rolling out the new ReplicaSet. @int128's solution of restarting the controller solves this again. |
Just experienced this. 1.6.6. Definitely seems related to HPA; the replicaset was scaled up at the time. edit: argocd 2.10.2, on EKS 1.28. Restarting the controller fixed it. |
Does it make sense to reopen this? |
We're also seeing this issue, using the latest release. |
@zachaller can we reopen this issue? We are also continuing to hit it |
We are facing the same issue on v1.6.6 |
Hey, coming here to say that we face the exact same issue. When the HPA scales a rollout, it comes into conflict with the argo-rollouts controller. |
I think this is the same issue, so maybe we should keep posting over there: #3316 |
Unfortunately, we still need to restart the controllers of our clusters every hour. Here is an example of such a CronJob.
# Restart argo-rollouts every hour to avoid the following error:
# https://github.com/argoproj/argo-rollouts/issues/3080#issuecomment-1835809731
apiVersion: batch/v1
kind: CronJob
metadata:
  name: argo-rollouts-periodically-restart
spec:
  schedule: "0 * * * *" # every hour
  jobTemplate:
    spec:
      backoffLimit: 2
      ttlSecondsAfterFinished: 259200 # 3 days
      template:
        spec:
          restartPolicy: Never
          serviceAccountName: argo-rollouts-periodically-restart
          containers:
            - name: kubectl-rollout-restart
              image: public.ecr.aws/bitnami/kubectl:1.29.1
              command:
                - kubectl
                - --v=3
                - --namespace=argo-rollouts
                - rollout
                - restart
                - deployment/argo-rollouts
              securityContext:
                allowPrivilegeEscalation: false
                capabilities:
                  drop:
                    - "ALL"
              resources:
                limits:
                  memory: 64Mi
                requests:
                  cpu: 10m
                  memory: 64Mi
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: argo-rollouts-periodically-restart
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: argo-rollouts-periodically-restart
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: argo-rollouts-periodically-restart
subjects:
  - kind: ServiceAccount
    name: argo-rollouts-periodically-restart
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: argo-rollouts-periodically-restart
rules:
  # https://stackoverflow.com/a/68980720
  - apiGroups: ["apps"]
    resources: ["deployments"]
    resourceNames: ["argo-rollouts"]
    verbs: ["get", "patch"] |
We see the same issue on 1.6.6, with the rollout using HPA. Restarting the argo-rollouts pods made the issue go away for the application. Logs from the rollouts controller:
|
Rollouts v1.7.1 should fix this; if people still see this on v1.7.x, please report it. |
Good afternoon, I am testing it with 1.7.1. At the moment there are few retries, and I have not yet had to restart the argo-rollouts pods.
|
Is this error something similar?
|
Checklist:
Describe the bug
Updates to services in Argo Rollouts are suddenly failing with this message for no reason. The only change we made was changing the image tag of the Rollout.
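To make that concrete, here is a minimal sketch (hypothetical names, image, and steps) of a canary Rollout where the only edit is the container image tag:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service                    # hypothetical
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: my-service
          image: registry.example.com/my-service:v1.2.4   # only this tag changed (was v1.2.3)
  strategy:
    canary:
      steps:
        - setWeight: 25
        - pause: {}
# Bumping the image tag is the only edit; the rollouts controller then creates
# a new ReplicaSet for the new pod-template hash and shifts traffic per the steps.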
To Reproduce
It fails and gets in this state when multiple rollout image tags are updated at once. If we then do a rollout retry one service at a time, each service succeeds.
Expected behavior
Rollout should succeed; it has no reason to fail, since the only thing changed is the updated image tag.
Screenshots
Version
1.6.0
Logs
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.