feat(backend): Add Parallelism Limit to ParallelFor tasks. Fixes #8718 #10798
Conversation
looks like a CI infra issue? /retest just to check
@gmfrasca: The
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/retest
tested with this pipeline: parallelfor.py

```python
from kfp import compiler
from kfp import dsl
from kfp.dsl import Input, InputPath, Output, OutputPath, Dataset, Model, component


@dsl.component(base_image="quay.io/opendatahub/ds-pipelines-ci-executor-image:v1.0")
def preprocess(
    message: Input[str],
    output_model: Output[Model]
):
    import random
    line = "some_model"
    print(f"Message: {message}")
    with open(output_model.path, 'w') as output_file:
        output_file.write('line: {}'.format(line))
    output_model.metadata['accuracy'] = random.uniform(0, 1)


@dsl.component(base_image="quay.io/opendatahub/ds-pipelines-ci-executor-image:v1.0")
def train(
    model: Input[Model],
    epoch: Input[int],
    trained_model: Output[Model],
):
    import random
    line = "some_model"
    print(f"Train for epoch: {epoch}")
    with open(trained_model.path, 'w') as output_file:
        output_file.write('line: {}'.format(line))
    trained_model.metadata['accuracy'] = random.uniform(0, 1)


@dsl.pipeline(pipeline_root='', name='tutorial-data-passing')
def data_passing_pipeline():
    preprocess_task = preprocess(message="dataset").set_caching_options(enable_caching=False)
    with dsl.ParallelFor(items=[1, 5, 10, 25], parallelism=2) as epochs:
        train(model=preprocess_task.outputs['output_model'], epoch=epochs).set_caching_options(enable_caching=False)


if __name__ == '__main__':
    compiler.Compiler().compile(data_passing_pipeline, __file__ + '.yaml')
```
Worked successfully:

```console
~ $ kubectl -n ${kfp_ns} get workflow tutorial-data-passing-drsfz -o yaml | yq '.spec.templates[-2]'
```

```yaml
dag:
  tasks:
  - arguments:
      parameters:
      - name: component
        value: '{{workflow.annotations.pipelines.kubeflow.org/components-comp-for-loop-2}}'
      - name: parent-dag-id
        value: '{{inputs.parameters.parent-dag-id}}'
      - name: task
        value: '{"componentRef":{"name":"comp-for-loop-2"},"dependentTasks":["preprocess"],"inputs":{"artifacts":{"pipelinechannel--preprocess-output_model":{"taskOutputArtifact":{"outputArtifactKey":"output_model","producerTask":"preprocess"}}}},"iteratorPolicy":{"parallelismLimit":2},"parameterIterator":{"itemInput":"pipelinechannel--loop-item-param-1","items":{"raw":"[1, 5, 10, 25]"}},"taskInfo":{"name":"for-loop-2"}}'
    depends: preprocess.Succeeded
    name: for-loop-2-driver
    template: system-dag-driver
  - arguments:
      parameters:
      - name: parent-dag-id
        value: '{{tasks.for-loop-2-driver.outputs.parameters.execution-id}}'
      - name: iteration-index
        value: '{{item}}'
    depends: for-loop-2-driver.Succeeded
    name: for-loop-2-iterations
    template: comp-for-loop-2-for-loop-2
    withSequence:
      count: '{{tasks.for-loop-2-driver.outputs.parameters.iteration-count}}'
  - arguments:
      parameters:
      - name: component
        value: '{{workflow.annotations.pipelines.kubeflow.org/components-comp-preprocess}}'
      - name: task
        value: '{"cachingOptions":{},"componentRef":{"name":"comp-preprocess"},"inputs":{"parameters":{"message":{"runtimeValue":{"constant":"dataset"}}}},"taskInfo":{"name":"preprocess"}}'
      - name: container
        value: '{{workflow.annotations.pipelines.kubeflow.org/implementations-comp-preprocess}}'
      - name: parent-dag-id
        value: '{{inputs.parameters.parent-dag-id}}'
    name: preprocess-driver
    template: system-container-driver
  - arguments:
      parameters:
      - name: pod-spec-patch
        value: '{{tasks.preprocess-driver.outputs.parameters.pod-spec-patch}}'
      - default: "false"
        name: cached-decision
        value: '{{tasks.preprocess-driver.outputs.parameters.cached-decision}}'
    depends: preprocess-driver.Succeeded
    name: preprocess
    template: system-container-executor
inputs:
  parameters:
  - name: parent-dag-id
metadata:
  annotations:
    sidecar.istio.io/inject: "false"
name: root
outputs: {}
parallelism: 2
```
Note the `parallelism: 2` set on the root DAG template.
The UI feedback on this could be better: currently all 4 iterations show as executing at the same time (and they never really show the done checkmark even once they finish). But this problem existed prior to this change, seems out of scope for this task, and should go in a follow-up issue. However, I confirmed the pods are scheduled 2 at a time in this example (since parallelism=2).
Hrmm, testing it with the above pipeline amended to:

```python
@dsl.pipeline(pipeline_root='', name='tutorial-data-passing')
def data_passing_pipeline():
    preprocess_task = preprocess(message="dataset").set_caching_options(enable_caching=False)
    with dsl.ParallelFor(items=[1, 5, 10, 25], parallelism=2) as epochs:
        train(model=preprocess_task.outputs['output_model'], epoch=epochs).set_caching_options(enable_caching=False)
    with dsl.ParallelFor(items=[6, 12, 24, 48], parallelism=4) as epochs:
        train(model=preprocess_task.outputs['output_model'], epoch=epochs).set_caching_options(enable_caching=False)
```

It looks like the workflow will use a single parallelism value for the whole DAG, even though the two loops request different limits.
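For illustration, a rough sketch of why the two loops clash under the current layout (not actual compiler output; the for-loop-3 names are hypothetical, borrowed by analogy from the single-loop dump above): both ParallelFor groups land in the one root DAG template, which carries only one `parallelism` field.

```yaml
# Sketch only - the root template has a single parallelism field,
# so it cannot hold 2 for one loop and 4 for the other at the same time.
name: root
parallelism: 2   # or 4 - only one value can apply to the whole DAG
dag:
  tasks:
  - name: for-loop-2-iterations        # requested parallelism: 2
    template: comp-for-loop-2-for-loop-2
    withSequence:
      count: '{{tasks.for-loop-2-driver.outputs.parameters.iteration-count}}'
  - name: for-loop-3-iterations        # requested parallelism: 4 (hypothetical name)
    template: comp-for-loop-3-for-loop-3
    withSequence:
      count: '{{tasks.for-loop-3-driver.outputs.parameters.iteration-count}}'
```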
As @HumairAK noted, it looks like there's a problem when there are multiple ParallelFor components in a single DAG, as the parallelism limit lands on the root DAG template and therefore applies to the whole DAG. The obvious solution/workaround for this is to add another layer of abstraction on iterator tasks (i.e., the root DAG calls a new "XYZ-iterator" DAG, which contains the loop's iteration tasks, so the limit can be set on that template instead).
Hi @gmfrasca, I think it's really important to have a limit that applies to the individual for loop, not the entire DAG. Have you considered using Argo's implementation of loop parallelism? Using that might even simplify the DAG. However, implementing this could lead to issues with other backends.
Hey @hsteude - so this implementation actually already leverages the Argo loop parallelism mechanism. The issue here is that the current compiled architecture of a pipeline aggregates all KFP pipeline steps into sequential tasks of a single root DAG template, so an Argo `parallelism` value set there throttles the entire pipeline rather than an individual loop. The workaround/DAG re-architecture I mentioned above would bump out each of these loops into its own sub-DAG template, so the limit is scoped to just that loop's iterations.
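A minimal sketch of that re-architecture, with hypothetical `-iterator` wrapper template names (driver tasks and most parameters omitted for brevity); this illustrates the idea, not the literal templates emitted by this PR:

```yaml
# Sketch only - each loop gets its own wrapper DAG template, so each
# parallelism limit applies to that loop's iterations and nothing else.
- name: root
  dag:
    tasks:
    - name: for-loop-2-iterator
      template: comp-for-loop-2-iterator     # hypothetical wrapper DAG
    - name: for-loop-3-iterator
      template: comp-for-loop-3-iterator     # hypothetical wrapper DAG
- name: comp-for-loop-2-iterator
  parallelism: 2                             # limit scoped to this loop only
  inputs:
    parameters:
    - name: iteration-count
  dag:
    tasks:
    - name: for-loop-2-iterations
      template: comp-for-loop-2-for-loop-2
      withSequence:
        count: '{{inputs.parameters.iteration-count}}'
- name: comp-for-loop-3-iterator
  parallelism: 4                             # second loop keeps its own limit
  inputs:
    parameters:
    - name: iteration-count
  dag:
    tasks:
    - name: for-loop-3-iterations
      template: comp-for-loop-3-for-loop-3
      withSequence:
        count: '{{inputs.parameters.iteration-count}}'
```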
/test kubeflow-pipelines-samples-v2
(force-pushed from 9780650 to 1148242)
@gmfrasca: The following tests failed, say
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/lgtm
I reviewed this PR a month ago and it is still open. I think it would be good to have one more review.
/hold does hold work here? I'd like to review this before it's merged 😄
I've seen this in other pipelines too -- ones without loops. Might be anything with a sub-DAG.
I'm caught off guard a bit because it looks like ParallelFor was mostly already working, except for the limit? And this PR fixes the limit and that's about it, right? Is #8718 the correct Issue to fix, then? Should it be tweaked to call out that only the parallelization limit was left? Or should 8718 be closed and a new Issue opened?
While running this PR in an attempt to validate it and understand it better, I found that I can reliably cause an apiserver panic. The same pipeline works fine on both upstream master of https://github.com/kubeflow/pipelines and https://github.com/opendatahub-io/data-science-pipelines. test pipeline (adapted from what Humair posted above):
compiled version of that pipeline:
The pipeline uploads fine and renders fine. However, when I run it, I get the following error in the UI:
I can see the following in the apiserver logs:
removing the
removing the loop entirely does help, i.e. no more panic:
leaving the hold in place :)
(force-pushed from 1166016 to e74b9ce)
@gmfrasca this pipeline failed for the 2nd loop: pipeline.py

```python
from kfp import compiler
from kfp import dsl
from kfp.dsl import Input, InputPath, Output, OutputPath, Dataset, Model, component


@dsl.component(base_image="quay.io/opendatahub/ds-pipelines-ci-executor-image:v1.0")
def preprocess(
    message: Input[str],
    output_model: Output[Model]
):
    import random
    line = "some_model"
    print(f"Message: {message}")
    with open(output_model.path, 'w') as output_file:
        output_file.write('line: {}'.format(line))
    output_model.metadata['accuracy'] = random.uniform(0, 1)


@dsl.component(base_image="quay.io/opendatahub/ds-pipelines-ci-executor-image:v1.0")
def train(
    model: Input[Model],
    epoch: Input[int],
    trained_model: Output[Model],
):
    import random
    line = "some_model"
    print(f"Train for epoch: {epoch}")
    with open(trained_model.path, 'w') as output_file:
        output_file.write('line: {}'.format(line))
    trained_model.metadata['accuracy'] = random.uniform(0, 1)


@dsl.pipeline(pipeline_root='', name='tutorial-data-passing')
def data_passing_pipeline():
    preprocess_task = preprocess(message="dataset").set_caching_options(enable_caching=False)
    with dsl.ParallelFor(items=[1, 5, 10, 25], parallelism=2) as epochs:
        train(model=preprocess_task.outputs['output_model'], epoch=epochs).set_caching_options(enable_caching=False)
    with dsl.ParallelFor(items=[6, 12, 24, 48], parallelism=4) as epochs:
        train(model=preprocess_task.outputs['output_model'], epoch=epochs).set_caching_options(enable_caching=False)


if __name__ == '__main__':
    compiler.Compiler().compile(data_passing_pipeline, __file__ + '.yaml')
```

With the error in a DAG driver pod:
Seems like there are multiple executions writing the same task name.
I'm thinking we should make some of these pipelines into integration tests to ensure we're catching these cases.
(force-pushed from e74b9ce to 1362de6)
/lgtm After following Greg's comment in #10798 (comment), I had a successful pipeline run using his example code.
@gmfrasca is there any pending work for this PR?
/ok-to-test
Approvals successfully granted for pending runs. |
tested it out side-by-side with master. Makes sense, works, code lgtm. Can you squash it and rebase it and then I'll tag it? Thanks!
Signed-off-by: Giulio Frasca <[email protected]>
…mpiler Signed-off-by: Giulio Frasca <[email protected]>
…ted tasks Signed-off-by: Giulio Frasca <[email protected]>
Signed-off-by: Giulio Frasca <[email protected]>
Signed-off-by: Giulio Frasca <[email protected]>
- Passthrough ParentDagID rather than DriverExecutionID to iterator such that iteration item correctly detects dependentTasks.
- Remove depends from iterator DAG as it is already handled by root-level task
- Update Iterator template names/nomenclature for clarity
- Update tests accordingly

Signed-off-by: Giulio Frasca <[email protected]>
- Removes the Driver pod from the Iterator abstraction-layer template as it confuses MLMD and is purely an Argo implementation
- Drivers still used on the Component and Iteration-item templates

Signed-off-by: Giulio Frasca <[email protected]>
(force-pushed from 1362de6 to f13c0c2)
/lgtm yesssss
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: HumairAK

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Description of your changes:
Fixes #8718
Adds the `Parallelism` item to a DAG template if specified by a task's IteratorPolicy (i.e. a ParallelFor with a parallelism limit); see the sketch below.

Checklist:
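As a rough sketch of the end-to-end effect, assuming the iterator re-architecture discussed above (illustrative only, not the literal compiler output): a task whose IR carries `"iteratorPolicy":{"parallelismLimit":2}`, as in the workflow dump earlier in this thread, compiles to a loop-level DAG template that carries `parallelism: 2`, which Argo then uses to throttle the `withSequence` iterations.

```yaml
# Illustrative sketch only - shows where the new parallelism field surfaces
# in the compiled Argo workflow for a ParallelFor with parallelism=2.
- name: comp-for-loop-2                  # template backing the ParallelFor group
  parallelism: 2                         # emitted from iteratorPolicy.parallelismLimit
  inputs:
    parameters:
    - name: iteration-count
  dag:
    tasks:
    - name: for-loop-2-iterations
      template: comp-for-loop-2-for-loop-2
      withSequence:
        count: '{{inputs.parameters.iteration-count}}'
```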