Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature] Expose argo synchronization options #6553

Open
jmcarp opened this issue Sep 13, 2021 · 19 comments · May be fixed by #11370
Open

[feature] Expose argo synchronization options #6553

jmcarp opened this issue Sep 13, 2021 · 19 comments · May be fixed by #11370

Comments

@jmcarp
Copy link
Contributor

jmcarp commented Sep 13, 2021

Feature Area

/area sdk

What feature would you like to see?

Argo includes useful synchronization features for workflows and templates: https://argoproj.github.io/argo-workflows/synchronization/. It would be helpful to expose these in kfp without having to manipulate manifests as yaml.

I'd be happy to work on this if the request makes sense. It seems like workflow and template level synchronization options would live on the pipeline config and container op classes respectively, but let me know if there's a better place to put them.

What is the use case or pain point?

Allow users to set a semaphore on a workflow or a step.

Is there a workaround currently?

Edit generated yaml.


Love this idea? Give it a 👍. We prioritize fulfilling features with the most 👍.

@jmcarp
Copy link
Contributor Author

jmcarp commented Sep 14, 2021

@Bobgy @capri-xiyue: if this request makes sense, let me know if there's a good place to add these options in the dsl, and I can write up a patch.

@zijianjoy
Copy link
Collaborator

Hello @jmcarp , what is the CUJ for setting limitation of synchronization on 1)workflow level, 2) step level?

@jmcarp
Copy link
Contributor Author

jmcarp commented Sep 17, 2021

Hi @zijianjoy, the typical use case in either case is to prevent concurrent execution of tasks that could otherwise change some shared resource. For example, if there's a race condition between tasks, or between multiple runs of the same task, it's helpful to use argo synchronization to prevent those tasks from running concurrently.

This is a useful argo feature, and it would be helpful to have it in kfp as well.

@capri-xiyue
Copy link
Contributor

capri-xiyue commented Sep 20, 2021

Hi @zijianjoy, the typical use case in either case is to prevent concurrent execution of tasks that could otherwise change some shared resource. For example, if there's a race condition between tasks, or between multiple runs of the same task, it's helpful to use argo synchronization to prevent those tasks from running concurrently.

This is a useful argo feature, and it would be helpful to have it in kfp as well.

Hi @jmcarp , in your use case, can you add synchronization logic for the shared resources? I think when it comes to shared resources, it also makes sense that the synchronization logic is added to the shared resources instead of the workflows.

@jmcarp
Copy link
Contributor Author

jmcarp commented Sep 21, 2021

I can implement my own synchronization, but that's the kind of feature I expect the workflow engine to provide--it's easy to get wrong, so ideally it's implemented once at the workflow layer and not independently by each user of the workflow tool. That's why tools like argo workflows and apache airflow include synchronization primitives, for example.

@jmcarp
Copy link
Contributor Author

jmcarp commented Sep 30, 2021

I see a 👍 from @Bobgy. Would you accept a patch for this?

@Bobgy
Copy link
Contributor

Bobgy commented Oct 19, 2021

I think that sounds like a fair argument, but regarding project status, we are transitioning to v2, so it's better reconsider this feature request after other critical features are ready.

@stale
Copy link

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Mar 2, 2022
@kdubovikov
Copy link

Is there any update on that feature? We are currently using kfp 1.8 and are evaluating possibilities of migration towards v2.x. However, we rely heavily on Argo synchronization features, which we were able to utilize by patching Argo YAML that Kubeflow compiler was outputting. With IR YAML this will become impossible, since we won't have direct control over Argo YAML

@stale stale bot removed the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Aug 23, 2023
@kdubovikov
Copy link

@Bobgy hi. Since 2.0 is officially released for quite some time, should this feature that allows semaphore usage get more attention?

Copy link

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Jun 16, 2024
@gregsheremeta
Copy link
Contributor

gonna pick this up / not stale

@stale stale bot removed the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Jun 20, 2024
Copy link

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Aug 20, 2024
@gregsheremeta
Copy link
Contributor

working on this

@stale stale bot removed the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Aug 20, 2024
@gregsheremeta
Copy link
Contributor

/assign @gregsheremeta

@gregsheremeta
Copy link
Contributor

It seems like workflow and template level synchronization options would live on the pipeline config and container op classes respectively, but let me know if there's a better place to put them.

I've been working on this, and I don't think template/component level is going to be trivial to get working.

Turns out all Components share a single template in the generated Workflow, and it looks roughly like this:

    - container:
        command:
          - should-be-overridden-during-runtime
          SNIP
        image: gcr.io/ml-pipeline/should-be-overridden-during-runtime
        name: ''
        resources: {}
        volumeMounts:
          - mountPath: /kfp-launcher
            SNIP
      initContainers:
        - command:
            - launcher-v2
            SNIP
      podSpecPatch: '{{inputs.parameters.pod-spec-patch}}'
      inputs:
        parameters:
          - name: pod-spec-patch
      name: system-container-impl
      metadata:
            SNIP
      outputs: {}
      volumes:
        - emptyDir: {}
          name: kfp-launcher
            SNIP

The Argo Workflows synchronization block for templates goes at that top level, alongside container, initContainer, etc. Since there is only one shared system-container-impl template for all user Components/Tasks, the semaphore or mutex name for a Component can't be set here.

We do have a dynamic way to set the things that change per each Component (command, image, volumes, etc.) -- podSpecPatch. Unfortunately, podSpecPatch is just what it sounds like ... it uses the PodSpec type to do the strategic merge on top of the (static) template at runtime. PodSpec has no synchronization fields (since that is a purely Argo Workflows concept), so there's no way to use PodSpec to dynamically set synchronization fields.

Possible paths forward that I can think of:

  1. Give up on Component/Task level synchronization and settle for Pipeline (Workflow) level.
  2. Move away from a single shared template for all Components. Seems like a large, complex change.
  3. Figure out some way other than podSpecPatch to dynamically inject things into an Argo Workflows template. Maybe something like this already exists that isn't podSpecPatch, or perhaps it would have to be an Argo Workflows enhancement.

@gregsheremeta
Copy link
Contributor

Not sure if any interested parties are still following this issue, but if so, I have some design questions.

Per my previous comment, I'm focusing on Pipeline (Workflow) level synchronization (locking) for now. I have a prototype working, but I have some questions about how we should move forward.

The pipeline code looks like this (note PipelineConfig is not yet merged, but can be found in #11112):

from kfp import dsl
from kfp.dsl import PipelineConfig


@dsl.component
def empty_component():
    pass

config = PipelineConfig()
config.set_semaphore_name("semaphore-mutex-pipeline")

@dsl.pipeline(name='semaphore-mutex-pipeline', pipeline_config=config)
def semaphore_mutex_pipeline():
    task = empty_component()
    task.set_caching_options(False)

The semaphore configmap looks like this:

kind: ConfigMap
apiVersion: v1
metadata:
  name: semaphore-mutex-pipeline
  namespace: kubeflow
data:
  semaphore: '1'

When I almost-simultaneously start 3 runs of this pipeline, I expect that the first one will start, and the other two will wait until the lock frees up; then the second one will start and complete; and then the last one will go. And this is what happens. This flow renders quite nicely in the Argo Workflows UI (reverse chronological order):
Screenshot from 2024-09-10 10-19-41
You can see that there are nice informational messages, and the run state icons are pretty clear.

When everything is fully complete, you can see how the start and end times were staggered.
Screenshot from 2024-09-10 10-21-49
Each run takes 40 seconds, and notice how the start and finish times are staggered by 40 seconds.

However, in KFP, we don't know anything about what's holding things up in Argo Workflows. So KFP thinks all 3 started at the same time. It thinks run 1 takes ~40 seconds. It thinks run 2 takes ~80 seconds. It thinks run 3 takes ~120 seconds. That renders like this:
Screenshot from 2024-09-10 10-25-39
It's not horrible, but it's not great either, and it makes me wonder if the very simplistic approach where we treat locks as "just another random thing to set on the workflow" is the right approach. If we want to build lock awareness into the KFP UI, that might require adding the lock concept as a first class citizen in the KFP (apiserver) API.

Another thing that's making me ask this question is what happens when the semaphore isn't on the cluster. The Workflow fails very quickly when this happens. In the Argo UI, we get a very nice message about the reason. In the KFP UI, there's just a generic pipeline failed message -- nothing about the lock configmap missing.

Do folks think it's acceptable to treat locks as "just another random thing to set on the workflow" and thereby have KFP be basically ignorant of what's going on, or would it be better to make KFP more aware of locking and add first-class support for it to the apiserver and the UI?

Copy link

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Nov 12, 2024
@gregsheremeta
Copy link
Contributor

/remove-lifecycle stale
/lifecycle frozen

@google-oss-prow google-oss-prow bot added lifecycle/frozen and removed lifecycle/stale The issue / pull request is stale, any activities remove this label. labels Nov 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants