The mutex is not released after the pod completes. #14002

Open
3 of 4 tasks
waring92 opened this issue Dec 16, 2024 · 3 comments

Comments

waring92 commented Dec 16, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

I have a WorkflowTemplate that defines sequential workflows using a DAG.
The DAG is responsible solely for managing the sequence of workflows, and each step references a workflow using templateRef.

Some of these steps (workflows) implement their own mutex synchronization mechanisms, but these are not applied to the entire WorkflowTemplate. In other words, certain steps need synchronization, while others do not.

The issue arises after a workflow with its own mutex completes execution: the mutex stays held even though the workflow's pod has finished.
At this point, another parallel workflow becomes stuck with the message:

Waiting for … lock. Lock status: 0/1

From my understanding, once the pod with the mutex has completed execution, the mutex should be released, allowing the next workflow to acquire the lock.
However, it appears that mutexes for all workflows within the template are only released after the entire WorkflowTemplate has completed execution.

Am I misunderstanding how the mutex synchronization works in this context?
Or is there a configuration or behavior I may have overlooked that ensures the mutex is released immediately after the specific workflow (or pod) finishes?

I am filing this issue against version 3.5.5 because there have been no updates regarding this feature.

Version(s)

v3.5.5

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

The following is the WorkflowTemplate

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: ...
  generateName: ...
  namespace: argo
spec:
  entrypoint: ...
  serviceAccountName: argo
  templates:
  - name: ...
    inputs:
      parameters:
      - ...
    dag:
      tasks:
      ...
      - name: get-workflow
        templateRef:
          name: specific-workflow
          template: specific-workflow-template
        arguments:
          parameters:
          - name: mutex_key
            value: "mutex/key"
            ...

And the following is the referenced WorkflowTemplate.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: specific-workflow
  namespace: argo
spec:
  serviceAccountName: argo
  templates:
  - name: specific-workflow-template
    inputs:
      parameters:
      ...
      - name: mutex_key
        value: "default"
    synchronization:
      mutex:
        name: "{{inputs.parameters.mutex_key}}"
    script:
    ...
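
Because both templates above are elided, the following is a minimal sketch of what a runnable reproduction might look like. It is an assumption-laden illustration, not the reporter's actual manifests: every name (`mutex-child`, `locked-step`, `mutex-parent-`, `main`, `tail`), the `alpine:3.19` image, and the sleep durations are hypothetical placeholders. Only the mutex name `"mutex/key"`, the `argo` namespace and service account, and the `templateRef` plus template-level `synchronization.mutex` structure come from the report. It has not been run against a cluster.

```yaml
# Hypothetical minimal reproduction (placeholder names throughout).
# The child template takes the mutex name as an input parameter,
# mirroring "specific-workflow-template" in the report.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: mutex-child              # placeholder
  namespace: argo
spec:
  templates:
  - name: locked-step            # placeholder
    inputs:
      parameters:
      - name: mutex_key
    synchronization:
      mutex:
        name: "{{inputs.parameters.mutex_key}}"
    container:
      image: alpine:3.19         # public image, chosen for illustration
      command: [sh, -c]
      args: ["echo holding lock; sleep 10"]
---
# The parent runs the locked task first, then a long unlocked task.
# If the mutex is released per task, a second copy of this workflow
# should acquire the lock while the first is still running "tail".
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: mutex-parent-    # placeholder
  namespace: argo
spec:
  entrypoint: main
  serviceAccountName: argo
  templates:
  - name: main
    dag:
      tasks:
      - name: locked
        templateRef:
          name: mutex-child
          template: locked-step
        arguments:
          parameters:
          - name: mutex_key
            value: "mutex/key"   # value taken from the report
      - name: after-lock
        dependencies: [locked]
        template: tail
  - name: tail
    container:
      image: alpine:3.19
      command: [sh, -c]
      args: ["echo lock should be free now; sleep 120"]
```

Submitting this Workflow twice in quick succession would, under these assumptions, show whether the second run stays at "Waiting for … lock" until the first run's `tail` task finishes.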

Logs from the workflow controller

time="2024-12-16T04:54:59.763Z" level=info msg="Could not acquire lock named: &{argo mutex-key  Mutex}" namespace=argo workflow=...

Logs from your workflow's wait container

Waiting for argo/Mutex/mutex-key lock. Lock status: 0/1

isubasinghe commented Dec 16, 2024

@waring92 can you check if this is an issue in 3.5.11 please? I suspect that this issue is fixed there.
#13553 this fix is not in 3.5.5

waring92 commented

> @waring92 can you check if this is an issue in 3.5.11 please? I suspect that this issue is fixed there. #13553 this fix is not in 3.5.5

Thank you for your reply.
However, the same behavior occurs in v3.5.11.

waring92 commented

After numerous attempts, I discovered that a mutex whose name contains the "/" character remains locked.
Is this the expected behavior?
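
If that observation holds, one simple check is to pass a mutex name without "/" in the calling DAG task and see whether the lock is then released once the task finishes. The fragment below is only a sketch of that check: the replacement value `"mutex-key"` is an assumption chosen for illustration, and nothing in this thread confirms whether "/" is unsupported in mutex names or whether this is a bug.

```yaml
# Fragment of the calling DAG task from the report, with only the
# mutex name changed. "mutex-key" is a hypothetical replacement.
- name: get-workflow
  templateRef:
    name: specific-workflow
    template: specific-workflow-template
  arguments:
    parameters:
    - name: mutex_key
      value: "mutex-key"   # was "mutex/key"; avoids "/" in the lock name
```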
