Resource conflict regression from v0.60.0 #1037

Open
universam1 opened this issue Dec 2, 2024 · 9 comments
Labels
bug This issue describes a defect or unexpected behavior carvel accepted This issue should be considered for future work and that the triage process has been completed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@universam1

What steps did you take:

We are unable to use any version newer than v0.59.4 with app-deploy; newer versions fail with a resource conflict (approved diff no longer matches).

By process of elimination, we tested the following versions:

0.63.3: FAIL
0.62.1: FAIL
0.61.0: FAIL
0.60.2: FAIL
0.60.0: FAIL
0.59.4: SUCCESS

What happened:

We are deploying full cluster config from scratch via kapp app-deploy, in sum ~800 resources, within a single app. This works amazingly well with Kapp, way better than Helm!

However, since v0.60.0 on the first apply we encounter this error:

  - update daemonset/aws-node (apps/v1) namespace: kube-system: Failed to update due to resource conflict (approved diff no longer matches): Updating resource daemonset/aws-node (apps/v1) namespace: kube-system: API server says: Operation cannot be fulfilled on daemonsets.apps \"aws-node\": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict): Recalculated diff:
  3,  3 -   annotations:
  4,  3 -     deprecated.daemonset.template.generation: \"1\"
  9,  7 -     app.kubernetes.io/managed-by: Helm
 11,  8 -     app.kubernetes.io/version: v1.19.0
 12,  8 -     helm.sh/chart: aws-vpc-cni-1.19.0
 14,  9 +     kapp.k14s.io/app: \"1733129085919676830\"
 14, 10 +     kapp.k14s.io/association: v1.ca251169611f162ef5186bbf4f512ca0
326,323 -   revisionHistoryLimit: 10
332,328 -       creationTimestamp: null
337,332 +         kapp.k14s.io/app: \"1733129085919676830\"
337,333 +         kapp.k14s.io/association: v1.ca251169611f162ef5186bbf4f512ca0
356,353 -                 - hybrid
357,353 -                 - auto
362,357 -         - name: ANNOTATE_POD_IP
363,357 -           value: \"false\"
384,377 -         - name: CLUSTER_NAME
385,377 -           value: o11n-eks-int-4151
399,390 -           value: \"false\"
400,390 +           value: \"true\"
402,393 +         - name: MINIMUM_IP_TARGET
402,394 +           value: \"25\"
405,398 -           value: v1.19.0
406,398 -         - name: VPC_ID
407,398 -           value: vpc-23837a4a
408,398 +           value: v1.18.2
409,400 -           value: \"1\"
410,400 +           value: \"0\"
410,401 +         - name: WARM_IP_TARGET
410,402 +           value: \"5\"
411,404 -           value: \"1\"
412,404 +           value: \"0\"
422,415 -         image: 602401143452.dkr.ecr.us-east-2.amazonaws.com/amazon-k8s-cni:v1.19.0-eksbuild.1
423,415 -         imagePullPolicy: IfNotPresent
424,415 +         image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.18.2
431,423 -           failureThreshold: 3
433,424 -           periodSeconds: 10
434,424 -           successThreshold: 1
440,429 -           protocol: TCP
448,436 -           failureThreshold: 3
450,437 -           periodSeconds: 10
451,437 -           successThreshold: 1
455,440 -             cpu: 25m
456,440 +             cpu: 50m
456,441 +             memory: 80Mi
461,447 -         terminationMessagePath: /dev/termination-log
462,447 -         terminationMessagePolicy: File
489,473 -         image: 602401143452.dkr.ecr.us-east-2.amazonaws.com/amazon/aws-network-policy-agent:v1.1.5-eksbuild.1
490,473 -         imagePullPolicy: Always
491,473 +         image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-network-policy-agent:v1.1.2
494,477 -             cpu: 25m
495,477 +             cpu: 50m
495,478 +             memory: 80Mi
500,484 -         terminationMessagePath: /dev/termination-log
501,484 -         terminationMessagePolicy: File
511,493 -       dnsPolicy: ClusterFirst
519,500 -         image: 602401143452.dkr.ecr.us-east-2.amazonaws.com/amazon-k8s-cni-init:v1.19.0-eksbuild.1
520,500 -         imagePullPolicy: Always
521,500 +         image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.18.2
524,504 -             cpu: 25m
525,504 +             cpu: 50m
525,505 +             memory: 80Mi
527,508 -         terminationMessagePath: /dev/termination-log
528,508 -         terminationMessagePolicy: File
533,512 -       restartPolicy: Always
534,512 -       schedulerName: default-scheduler
536,513 -       serviceAccount: aws-node
544,520 -           type: \"\"
548,523 -           type: \"\"
552,526 -           type: \"\"
568,541 -       maxSurge: 0


  - update daemonset/kube-proxy (apps/v1) namespace: kube-system: Failed to update due to resource conflict (approved diff no longer matches): Updating resource daemonset/kube-proxy (apps/v1) namespace: kube-system: API server says: Operation cannot be fulfilled on daemonsets.apps \"kube-proxy\": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict): Recalculated diff:
  3,  3 -   annotations:
  4,  3 -     deprecated.daemonset.template.generation: \"1\"
 10,  8 +     kapp.k14s.io/app: \"1733129085919676830\"
 10,  9 +     kapp.k14s.io/association: v1.5c5a114581f350e2b57df0ed7799471d
134,134 +         kapp.k14s.io/app: \"1733129085919676830\"
134,135 +         kapp.k14s.io/association: v1.5c5a114581f350e2b57df0ed7799471d
153,155 -                 - auto
159,160 -         - --hostname-override=$(NODE_NAME)
160,160 -         env:
161,160 -         - name: NODE_NAME
162,160 -           valueFrom:
163,160 -             fieldRef:
164,160 -               apiVersion: v1
165,160 -               fieldPath: spec.nodeName
166,160 -         image: 602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/kube-proxy:v1.29.10-minimal-eksbuild.3
167,160 +         image: 602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/kube-proxy:v1.29.7-eksbuild.2
171,165 -             cpu: 100m
172,165 +             cpu: 50m
172,166 +             memory: 45Mi

My assumption is that a webhook or a controller might interfere here with Kapp on certain fields.
However, we need to be able to configure the EKS cluster via Kapp even under a temporary clash.

What did you expect:
Kapp to retry

@praveenrewar

Vote on this request

This is an invitation to the community to vote on issues, to help us prioritize our backlog. Use the "smiley face" up to the right of this comment to vote.

👍 "I would like to see this addressed as soon as possible"
👎 "There are other more important things to focus on right now"

We are also happy to receive and review Pull Requests if you want to help working on this issue.

@universam1 universam1 added bug This issue describes a defect or unexpected behavior carvel triage This issue has not yet been reviewed for validity labels Dec 2, 2024
@carvel-bot carvel-bot added this to Carvel Dec 2, 2024
@praveenrewar
Member

Thank you for creating the issue @universam1!

We are deploying full cluster config from scratch via kapp app-deploy, in sum ~800 resources, within a single app. This works amazingly well with Kapp

🙏🏻

However, since v0.60.0 on the first apply we encounter this error:

  • Do these resources already exist on the cluster? (and hence an update)
  • Would you be able to share the complete output for these 2 resources with the --diff-changes flag? Then we can see what the original diff was and compare it with the recalculated diff.
  • I will try to figure out what could have caused this regression in v0.60.0. I took a quick look at the release notes for v0.60.0 but couldn't make out what could have caused it; I will take a closer look in some time.
  • Did the same issue happen even after retrying?

Kapp to retry

One of the principles of kapp is that it guarantees it will only apply the changes that have been approved by the user. If we want to retry on this particular error, it would mean getting a confirmation from the user again, which might not be a great user experience. It would be ideal to retry the kapp deploy from outside, i.e. via some pipeline or some controller like kapp-controller.

@universam1
Author

Thank you for creating the issue @universam1!
Likewise for the quick response!

  • Do these resources already exist on the cluster? (and hence an update)

Maybe! The scenario is a brand new, vanilla EKS cluster: right after CloudFormation returns success, we call Kapp to deploy the core services. Those core services include updates to existing daemonsets. Apparently, but this is not clear, EKS might have delayed deployments that might happen during Kapp runtime.

  • Would you be able to share the complete output for these 2 resources with the --diff-changes flag? Then we can see what the original diff was and compare it with the recalculated diff.

Since this is a transient error, it is quite hard to reproduce, but I'll try.

  • Did the same issue happen even after retrying?

Probably not. It is hard to test since we are in a CI pipeline here, but it seems like it does succeed after retrying. However, we cannot do that in CI because of the one-time session zero credentials to EKS, which makes this an "it either succeeded or not" problem.

One of the principles of kapp is that it guarantees it will only apply the changes that have been approved by the user. If we want to retry on this particular error, it would mean getting a confirmation from the user again, which might not be a great user experience. It would be ideal to retry the kapp deploy from outside, i.e. via some pipeline or some controller like kapp-controller.

Please consider that we are not in an interactive session here but in a CI pipeline, running app-deploy. The GitOps setup is mandatory by all means! And we cannot restart the pipeline at this point; we have to succeed or the cluster is broken forever.

I agree, in an interactive session it makes sense to require another user interaction, but here in a headless mode in CI, Kapp should have an option to enforce a desired state!

@praveenrewar
Member

Those core services include updates to existing daemonsets. Apparently, but this is not clear, EKS might have delayed deployments that might happen during Kapp runtime.

Yeah, that could be the reason.

Since this is a transient error, it is quite hard to reproduce, but I'll try.

I see, thanks, if we can check both the original diff and the recalculated diff, it would help us in determining the exact fields due to which the diff is changing and we can probably add rebase rules to ignore those fields.

Probably not. It is hard to test since we are in a CI pipeline here, but it seems like it does succeed after retrying.

Curious to know how you were able to pinpoint the exact version of kapp with the issue.

I agree, in an interactive session it makes sense to require another user interaction, but here in a headless mode in CI, Kapp should have an option to enforce a desired state!

I agree that such an option would be useful, and I have seen a few similar requests in the past. I think it would be good to first determine the root cause and see if a rebase rule would help; otherwise we can think of the best way to retry in such cases.

@universam1
Author

Those core services include updates to existing daemonsets. Apparently, but this is not clear, EKS might have delayed deployments that might happen during Kapp runtime.

Yeah, that could be the reason.

I have more results from testing, and the problem is not limited to resources managed by EKS. It also happens for resources that are solely owned by Kapp! And it is reproducible. Let me attach examples below.

I see, thanks, if we can check both the original diff and the recalculated diff, it would help us in determining the exact fields due to which the diff is changing and we can probably add rebase rules to ignore those fields.

See the following examples; those resources are Kapp-owned and not touched by any other operator. This is the output of a 3rd retry (I was able to implement CI job retries)!

original diff
@@ update deployment/skipper-ingress (apps/v1) namespace: kube-system @@
  ...
205,205   spec:
206     -   progressDeadlineSeconds: 600
207,206     replicas: 2
208     -   revisionHistoryLimit: 10
209,207     selector:
210,208       matchLabels:
  ...
215,213         maxUnavailable: 0
216     -     type: RollingUpdate
217,214     template:
218,215       metadata:
219     -       creationTimestamp: null
220,216         labels:
221,217           application: skipper-ingress
  ...
289,285           image: registry.opensource.zalan.do/teapot/skipper:v0.21.223
290     -         imagePullPolicy: IfNotPresent
291,286           name: skipper
292,287           ports:
  ...
294,289             name: ingress-port
295     -           protocol: TCP
296,290           - containerPort: 9998
297,291             name: redirect-port
298     -           protocol: TCP
299,292           - containerPort: 9911
300,293             name: metrics-port
301     -           protocol: TCP
302,294           readinessProbe:
303     -           failureThreshold: 3
304,295             httpGet:
305,296               path: /kube-system/healthz
  ...
308,299             initialDelaySeconds: 5
309     -           periodSeconds: 10
310     -           successThreshold: 1
311,300             timeoutSeconds: 1
312,301           resources:
  ...
315,304               memory: 200Mi
316     -         terminationMessagePath: /dev/termination-log
317     -         terminationMessagePolicy: File
318,305           volumeMounts:
319,306           - mountPath: /etc/skipper-cert
  ...
329,316             name: skipper-init
330     -       dnsPolicy: ClusterFirst
331,317         priorityClassName: system-cluster-critical
332     -       restartPolicy: Always
333     -       schedulerName: default-scheduler
334     -       securityContext: {}
335     -       serviceAccount: skipper-ingress
336,318         serviceAccountName: skipper-ingress
337     -       terminationGracePeriodSeconds: 30
338,319         tolerations:
339,320         - effect: NoExecute
  ...
346,327           secret:
347     -           defaultMode: 420
348,328             secretName: skipper-cert
349,329         - name: vault-tls
350,330           secret:
351     -           defaultMode: 420
352,331             secretName: vault-tls
353,332         - name: oidc-secret-file
354,333           secret:
355     -           defaultMode: 420
356,334             secretName: skipper-oidc-secret
357,335         - configMap:

@@ update poddisruptionbudget/skipper-ingress (policy/v1) namespace: kube-system @@
  ...
  2,  2   metadata:
  3     -   annotations: {}
  4,  3     creationTimestamp: "2024-12-02T08:22:15Z"
  5,  4     generation: 1

@@ update prometheus/k8s (monitoring.coreos.com/v1) namespace: monitoring @@
  ...
  2,  2   metadata:
  3     -   annotations: {}
  4,  3     creationTimestamp: "2024-12-02T08:22:20Z"
  5,  4     generation: 1

recalculated diff
Error: 
  - update deployment/skipper-ingress (apps/v1) namespace: kube-system: Failed to update due to resource conflict (approved diff no longer matches): Updating resource deployment/skipper-ingress (apps/v1) namespace: kube-system: API server says: Operation cannot be fulfilled on deployments.apps "skipper-ingress": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict): Recalculated diff:
207,207 -   progressDeadlineSeconds: 600
209,208 -   revisionHistoryLimit: 10
217,215 -     type: RollingUpdate
220,217 -       creationTimestamp: null
291,287 -         imagePullPolicy: IfNotPresent
296,291 -           protocol: TCP
299,293 -           protocol: TCP
302,295 -           protocol: TCP
304,296 -           failureThreshold: 3
310,301 -           periodSeconds: 10
311,301 -           successThreshold: 1
317,306 -         terminationMessagePath: /dev/termination-log
318,306 -         terminationMessagePolicy: File
331,318 -       dnsPolicy: ClusterFirst
333,319 -       restartPolicy: Always
334,319 -       schedulerName: default-scheduler
335,319 -       securityContext: {}
336,319 -       serviceAccount: skipper-ingress
338,320 -       terminationGracePeriodSeconds: 30
348,329 -           defaultMode: 420
352,332 -           defaultMode: 420
356,335 -           defaultMode: 420
  - update poddisruptionbudget/skipper-ingress (policy/v1) namespace: kube-system: Failed to update due to resource conflict (approved diff no longer matches): Updating resource poddisruptionbudget/skipper-ingress (policy/v1) namespace: kube-system: API server says: Operation cannot be fulfilled on poddisruptionbudgets.policy "skipper-ingress": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict): Recalculated diff:
  3,  3 -   annotations: {}
  - update horizontalpodautoscaler/skipper-ingress (autoscaling/v2) namespace: kube-system: Failed to update due to resource conflict (approved diff no longer matches): Updating resource horizontalpodautoscaler/skipper-ingress (autoscaling/v2) namespace: kube-system: API server says: Operation cannot be fulfilled on horizontalpodautoscalers.autoscaling "skipper-ingress": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict): Recalculated diff:
 95, 95 -       selectPolicy: Max
102,101 -       selectPolicy: Max
  - update prometheus/k8s (monitoring.coreos.com/v1) namespace: monitoring: Failed to update due to resource conflict (approved diff no longer matches): Updating resource prometheus/k8s (monitoring.coreos.com/v1) namespace: monitoring: API server says: Operation cannot be fulfilled on prometheuses.monitoring.coreos.com "k8s": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict): Recalculated diff:
  3,  3 -   annotations: {}
189,188 -   evaluationInterval: 30s
205,203 -   portName: web
233,230 -   scrapeInterval: 30s

Curious to know how you were able to pinpoint the exact version of kapp with the issue.

We were running v0.58.0 in production. After upgrading to 0.63.3, all integration pipelines failed consistently. To determine the problematic release, I built versions of our CI tooling with all minor versions of Kapp between those two and found that the latest working version is v0.59.4. We are now running that version in production.
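
For reference, the elimination described above boils down to finding the boundary between the newest passing and the oldest failing version. A small sketch, assuming results are listed newest first as in the issue description and that failures are contiguous; `result` and `boundary` are made-up names for illustration.

```go
package main

import "fmt"

// result records the outcome of one CI run with a given kapp version.
type result struct {
	version string
	ok      bool
}

// boundary expects results ordered newest first. It returns the newest passing
// version and the failing version listed directly before it. found is false
// when nothing passes or when the newest version already passes.
func boundary(results []result) (lastGood, firstBad string, found bool) {
	for i, r := range results {
		if !r.ok {
			continue
		}
		if i == 0 {
			return r.version, "", false
		}
		return r.version, results[i-1].version, true
	}
	return "", "", false
}

func main() {
	// Outcomes as reported in the issue description.
	tested := []result{
		{"0.63.3", false},
		{"0.62.1", false},
		{"0.61.0", false},
		{"0.60.2", false},
		{"0.60.0", false},
		{"0.59.4", true},
	}
	good, bad, ok := boundary(tested)
	fmt.Println(good, bad, ok) // prints "0.59.4 0.60.0 true"
}
```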

I agree that such an option would be useful, and I have seen a few similar requests in the past. I think it would be good to first determine the root cause and see if a rebase rule would help; otherwise we can think of the best way to retry in such cases.

BTW, I was able to implement a Kapp retry in our CI tool nevertheless. However, even that fails consistently, and even with 3 retries we are unable to converge successfully! It just fails on other resources. So this looks like a fundamental regression.

@praveenrewar
Member

praveenrewar commented Dec 3, 2024

Thanks a lot for the details @universam1!
Out of the 3 resources that you have shared, 2 have an original diff that is identical to the recalculated diff, which is definitely weird and probably an issue. I have a hunch about a few changes that could have caused this in v0.60.0. I will try taking a closer look at those changes to see which one could be the root cause. Since I am not able to reproduce the issue on my end, I might need your help in validating the fix.
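
One way to check "same original and recalculated diff" mechanically is to compare only the changed lines and ignore the line-number columns. The sketch below is an assumption-laden helper, not kapp code: the regular expression is derived purely from the output pasted in this thread (a change is either `N,N <sign> text` or `N   <sign> text`), and `changes` and `sameChanges` are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// changeLine matches a changed line as it appears in the pasted output:
// "N,N <sign> text" with exactly one space before the sign, or a single line
// number followed by spaces and the sign. Context lines have three spaces
// after "N,N" and therefore do not match, even if their text starts with "- ".
var changeLine = regexp.MustCompile(`^\s*\d+(?:,\s*\d+ |\s+)([-+]) (.*)$`)

// changes counts every "<sign> text" pair found in diff.
func changes(diff string) map[string]int {
	out := map[string]int{}
	for _, line := range strings.Split(diff, "\n") {
		if m := changeLine.FindStringSubmatch(line); m != nil {
			out[m[1]+" "+m[2]]++
		}
	}
	return out
}

// sameChanges reports whether both diffs contain the same changed lines,
// regardless of line numbers and surrounding context.
func sameChanges(a, b string) bool {
	ca, cb := changes(a), changes(b)
	if len(ca) != len(cb) {
		return false
	}
	for k, n := range ca {
		if cb[k] != n {
			return false
		}
	}
	return true
}

func main() {
	// PodDisruptionBudget example copied from the comment above.
	original := `
  2,  2   metadata:
  3     -   annotations: {}
  4,  3     creationTimestamp: "2024-12-02T08:22:15Z"
  5,  4     generation: 1`
	recalculated := `  3,  3 -   annotations: {}`
	fmt.Println(sameChanges(original, recalculated))
}
```

The deployment diff pasted above is elided with `...`, so a comparison like this would only be meaningful on the full `--diff-changes` output.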

@universam1
Author

universam1 commented Dec 3, 2024

Thank you @praveenrewar for your help! Happy to assist, let me know where I can help!
BTW, we are using Kapp as a Go package in our CI tool, in case that matters.

@universam1
Author

@praveenrewar One interesting detail when comparing the logs: the working versions of Kapp output a lot of "Retryable error:" lines and eventually succeed, while from v0.60 on not a single retryable log line is emitted.
Could it be that this internal retry logic no longer catches these errors?

example retryable logs
8:09:40AM: create issuer/vault-secrets-webhook-ca (cert-manager.io/v1) namespace: vault
8:09:40AM:  ^ Retryable error: Creating resource issuer/vault-secrets-webhook-ca (cert-manager.io/v1) namespace: vault: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:09:40AM: create certificate/vault-secrets-webhook-webhook-tls (cert-manager.io/v1) namespace: vault
8:09:40AM:  ^ Retryable error: Creating resource certificate/vault-secrets-webhook-webhook-tls (cert-manager.io/v1) namespace: vault: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:09:40AM: create issuer/vault-secrets-webhook-selfsign (cert-manager.io/v1) namespace: vault
8:09:40AM:  ^ Retryable error: Creating resource issuer/vault-secrets-webhook-selfsign (cert-manager.io/v1) namespace: vault: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:09:40AM: create certificate/vault-secrets-webhook-ca (cert-manager.io/v1) namespace: vault
8:09:40AM:  ^ Retryable error: Creating resource certificate/vault-secrets-webhook-ca (cert-manager.io/v1) namespace: vault: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:09:45AM: create clusterissuer/selfsigned-issuer (cert-manager.io/v1) cluster
8:09:45AM:  ^ Retryable error: Creating resource clusterissuer/selfsigned-issuer (cert-manager.io/v1) cluster: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:09:45AM: create certificate/hubble-server-certs (cert-manager.io/v1) namespace: kube-system
8:09:45AM:  ^ Retryable error: Creating resource certificate/hubble-server-certs (cert-manager.io/v1) namespace: kube-system: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:09:45AM: create certificate/hubble-relay-client-certs (cert-manager.io/v1) namespace: kube-system
8:09:45AM:  ^ Retryable error: Creating resource certificate/hubble-relay-client-certs (cert-manager.io/v1) namespace: kube-system: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:09:45AM: create certificate/cilium-selfsigned-ca (cert-manager.io/v1) namespace: cert-manager
8:09:45AM:  ^ Retryable error: Creating resource certificate/cilium-selfsigned-ca (cert-manager.io/v1) namespace: cert-manager: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:10:29AM: create certificate/aws-load-balancer-serving-cert (cert-manager.io/v1) namespace: kube-system
8:10:29AM:  ^ Retryable error: Creating resource certificate/aws-load-balancer-serving-cert (cert-manager.io/v1) namespace: kube-system: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:10:34AM: create issuer/self-signer (cert-manager.io/v1) namespace: kube-system
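
To compare runs across versions, the retryable lines can simply be counted per log. A rough sketch, assuming the log is available as plain text and that the marker string is exactly the one shown above; `countRetryable` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// retryMarker is the prefix seen in the log lines of the working versions.
const retryMarker = "Retryable error:"

// countRetryable returns the number of log lines that contain retryMarker.
func countRetryable(log string) int {
	n := 0
	for _, line := range strings.Split(log, "\n") {
		if strings.Contains(line, retryMarker) {
			n++
		}
	}
	return n
}

func main() {
	// Two lines shortened from the example log above.
	sample := "8:09:40AM: create issuer/vault-secrets-webhook-ca (cert-manager.io/v1) namespace: vault\n" +
		"8:09:40AM:  ^ Retryable error: Creating resource issuer/vault-secrets-webhook-ca (cert-manager.io/v1) namespace: vault\n"
	fmt.Println(countRetryable(sample))
}
```

A count of zero on v0.60+ for the same manifests would support the observation above, although it would not say whether the retry path is broken or simply never reached.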

@praveenrewar
Member

It might be that the conflict is happening before these retryable errors.

@renuy renuy added carvel accepted This issue should be considered for future work and that the triage process has been completed and removed carvel triage This issue has not yet been reviewed for validity labels Dec 6, 2024
@renuy renuy moved this to Prioritized Backlog in Carvel Dec 6, 2024
@renuy renuy added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Dec 6, 2024
@renuy
Contributor

renuy commented Jan 10, 2025

Hard to reproduce. (haven't been able to reproduce this)

Projects
Status: Prioritized Backlog
Development

No branches or pull requests

3 participants