Race in Workflow Controller: Periodic ConfigMap update races with Pod Creation Code #14027

Open
3 of 4 tasks
rtartler-blp opened this issue Dec 23, 2024 · 0 comments
Labels
area/controller Controller issues, panics type/bug

rtartler-blp commented Dec 23, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

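We observed the workflow controller failing several workflows with a pod validation error that is unrelated to the workflow input. The rejected pod requests 512Mi of memory against a 128Mi limit, a combination that none of our configuration specifies on its own: the 512Mi request matches mainContainer in the controller ConfigMap, while the only 128Mi memory limit we configure is the executor's.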

A full stacktrace can be found at https://gist.github.com/rtartler-blp/804126f5725c067290a91bc9699e192d.

cc: @Joibel
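
To illustrate the kind of race the title describes, here is a minimal Python sketch of one generic way to keep a periodic configuration reload from interfering with code that reads and merges that configuration. It is not the workflow controller's code (which is written in Go), and it is not a diagnosis of this bug. The `ConfigHolder` class and its methods are hypothetical names introduced only for this example.

```python
import copy
import threading


class ConfigHolder:
    """Hypothetical holder for shared configuration (illustration only)."""

    def __init__(self, config):
        self._lock = threading.Lock()
        self._config = copy.deepcopy(config)

    def replace(self, new_config):
        # Build the new object first, then swap the reference under the lock,
        # so a reader never sees a half-updated configuration.
        fresh = copy.deepcopy(new_config)
        with self._lock:
            self._config = fresh

    def snapshot(self):
        # Return an independent copy, so merges done by the caller
        # cannot leak back into the shared configuration.
        with self._lock:
            return copy.deepcopy(self._config)


holder = ConfigHolder({
    "mainContainer": {
        "resources": {
            "limits": {"memory": "512Mi"},
            "requests": {"memory": "512Mi"},
        }
    }
})

# A caller (for example, pod creation) works on its own snapshot.
snap = holder.snapshot()
snap["mainContainer"]["resources"]["limits"]["memory"] = "128Mi"

# The shared configuration is unchanged by the caller's modification.
shared_limit = holder.snapshot()["mainContainer"]["resources"]["limits"]["memory"]
```

The design choice here is to trade some copying cost for isolation: every reader gets a private copy, and the reloader replaces the whole object rather than editing fields in place.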

Version(s)

v3.4.17

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

The full workflow spec cannot be shared, but the relevant part of the workflow that specifies resource limits reads:

        resources:
          limits:
            cpu: "1"
            memory: 512Mi
          requests:
            cpu: 200m
            memory: 128Mi

Our workflow controller ConfigMap specifies:

# kubectl get cm workflow-controller-configmap -o yaml  | yq .data.config
parallelism: 200
namespaceParallelism: 200
mainContainer:
  imagePullPolicy: Always
  resources:
    limits:
      cpu: 200m
      memory: 512Mi
    requests:
      cpu: 200m
      memory: 512Mi
executor:
  imagePullPolicy: IfNotPresent
  resources:
    limits:
      cpu: 200m
      memory: 128Mi
    requests:
      cpu: 200m
      memory: 128Mi

Logs from the workflow controller

The workflow controller log contains messages such as:

Pod "bvlprcdb-master-update-v1.0.14-b2s1-fffp8-bvlprcdb-dump-1440964031" is invalid: [spec.containers[0].resources.requests: Invalid value: "512Mi": must be less than or equal to memory limit of 128Mi, spec.initContainers[0].resources.requests: Invalid value: "512Mi": must be less than or equal to memory limit of 128Mi]
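
The check behind this message can be sketched in Python to show that each configured resource block is valid on its own, while the combination reported in the log is not. This is a minimal sketch, not the Kubernetes validation code: `parse_quantity` and `violations` are hypothetical helpers, and the parser handles only the few quantity suffixes that appear in this issue.

```python
from decimal import Decimal

# Subset of Kubernetes quantity suffixes (assumption: only these are needed here).
_SUFFIXES = {
    "Ki": Decimal(2) ** 10,
    "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30,
    "m": Decimal("0.001"),  # milli, as used for CPU
}


def parse_quantity(text):
    """Parse quantities such as "512Mi", "200m", or "1" into a Decimal."""
    for suffix, factor in _SUFFIXES.items():
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * factor
    return Decimal(text)


def violations(resources):
    """Return the resource names whose request exceeds its limit."""
    limits = resources.get("limits", {})
    requests = resources.get("requests", {})
    return [
        name
        for name, request in requests.items()
        if name in limits and parse_quantity(request) > parse_quantity(limits[name])
    ]


# Values copied from the workflow spec and the ConfigMap above.
workflow = {
    "limits": {"cpu": "1", "memory": "512Mi"},
    "requests": {"cpu": "200m", "memory": "128Mi"},
}
main_container = {
    "limits": {"cpu": "200m", "memory": "512Mi"},
    "requests": {"cpu": "200m", "memory": "512Mi"},
}
executor = {
    "limits": {"cpu": "200m", "memory": "128Mi"},
    "requests": {"cpu": "200m", "memory": "128Mi"},
}

# The memory combination reported in the error message.
observed = {"limits": {"memory": "128Mi"}, "requests": {"memory": "512Mi"}}

for name, block in [("workflow", workflow), ("mainContainer", main_container),
                    ("executor", executor), ("observed", observed)]:
    print(name, violations(block))
```

Only the `observed` block reports a violation, which is consistent with the invalid pod spec being a mix of values from different sources rather than any single configured block.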
@blkperl blkperl added the area/controller Controller issues, panics label Dec 24, 2024