[Core Feature] Failure-Node support #1506

kumare3 · 2021-09-20T21:18:21Z

Motivation: Why do you think this is important?
Flyte backend supports a Failure-node for every workflow / sub-workflow. This is not currently exposed in flytekit (python or Java)

Goal: What should the final outcome look like, ideally?
Users should be able to define failure nodes for their workflows. An example for the python SDK is as follows

@task
def my_error_handler(...):
   ...

@workflow(on_error=my_error_handler)
def my_wf():
   ...

If my_wf() fails at any point during execution, it'll call my_error_handler() task and will pass some context (error info... etc.) to allow it to handle the error. The expectation is that my_error_handler() would do things like clean up resources, log/send customized notifications... etc. The thing it will NOT let you do is recover from failure... The execution of this workflow will still fail, be marked as failure and upstream callers will still be notified of its failure.

An example of sub-workflows:

@workflow(on_error=my_error_handler)
def my_sub_wf():
    ...

@workflow(failure_policy=WorkflowFailurePolicy.FAIL_AFTER_EXECUTABLE_NODES_COMPLETE)
def my_parent_wf() -> str:
   n1 = my_sub_wf()
   n2 = my_sub_wf()
   n3 = my_sub_wf()

   return n3.out1

In this case, my_parent_wf will continue running even if any of the nodes fails. The overall status of the execution will again be marked as failure but it'll let as many nodes as possible to execute... Whenever my_sub_wf fails, it'll invoke an instance of my_error_handler task to cleanup resources... etc.

Describe alternatives you've considered
NA

[Optional] Propose: Link/Inline OR Additional context
More discussion in
#1012

Related flytekit java issue - #1012

kumare3 · 2021-09-20T21:18:41Z

cc @EngHabu / @cosmicBboy / @kanterov

fvde · 2023-03-30T08:06:54Z

As old as this issue may be, we would absolutely LOVE this and it has been a stable feature on other orchestration engines (such as Kubeflow) for many years.

dylanwilder · 2023-05-04T13:59:31Z

@kumare3 is this a flytekit only change or would it require changes to propellor to propagate error state

EngHabu · 2023-05-04T14:06:38Z

@dylanwilder the changes in the backend are mostly already done.. it's possible they have regressed because of the lack of end to end testing for it (because it's not implemented in flytekit).. would you be able to help with the flytekit side if things?

dylanwilder · 2023-05-04T17:36:52Z

Potentially we could pitch in since we'd like to see this, do you have anything outlining what's required?

EngHabu · 2023-05-10T14:23:48Z

@dylanwilder I think this is really close. Maybe attempt to use it in an example and start the debugging journey from there? Happy to be pulled in once you get flytekit to produce the spec in case you deem it a problem with the backend...

dylanwilder · 2023-05-17T15:00:33Z

Thanks will take a look and see!

kumare3 · 2023-05-19T06:11:23Z

@eapolinario is probably also looking into this

gitgraghu · 2023-05-22T21:17:43Z

wondering if there is a way to support inputs and outputs that are different from the workflow interface in the failure handler..below is an example use case we were trying to implement:

A validation task which creates a dataframe and fails the task if there are invalid values.
On workflow failure, return this validation dataframe as an output of the workflow.

EngHabu · 2023-10-24T20:46:18Z

Was just brainstorming with @pingsutw now on this... here are my thoughts on UX:

@workflow
def my_wf(a: int) -> str:
   b = my_task(a=a)
   flytekit.current_context().on_failure = clean_up(a=a, b=b)
   return b

@task
def clean_up(err: Error, a: Optional[int], b: Optional[str]) -> str:
   ...

Set failure node within the workflow code instead of the decorator to allow passing inputs (don't know if return b or clean_up(a=a, b=b) as an alternative syntax is too hacky?
clean_up must take Error as the first parameter and can take any number of extra inputs as long as they are all Optional. Propeller will fill them in if they are available or None otherwise... it's the implementor's job to handle those cases correctly within clean_up

wdyt @kumare3 @eapolinario @gitgraghu

pingsutw · 2023-10-26T19:51:11Z

Need some discussion about

if a subworkflow and the parent workflow both have failure handlers… do both run?
if i have parent_wf and sub_wf… both have failure handlers. i can see an argument to say - when i run parent_wf, if the error was in the sub_wf, just run the sub_wf failure handler, if the error happened in the parent_wf then run the parent_wf. or both run.

PRs for failure node. (still WIP)

also cc @cosmicBboy @wild-endeavor

EngHabu · 2023-10-28T16:19:02Z

I believe it should
The way I think about it is that first the subworkflow fails, so it kicks off the failure node handler. After it finishes, the subworkflow itself will report that it failed overall, then the parent node will handle failure as well..

wild-endeavor · 2024-05-14T19:32:33Z

is this okay to close?

kumare3 added enhancement New feature or request untriaged This issues has not yet been looked at by the Maintainers and removed untriaged This issues has not yet been looked at by the Maintainers labels Sep 20, 2021

kumare3 mentioned this issue Jan 25, 2022

[flytekit-java] Support Workflow#failure_node #1012

Open

EngHabu added this to the 0.19.3 - Feb 2021 milestone Feb 2, 2022

EngHabu assigned kumare3 Feb 9, 2022

EngHabu changed the title ~~[Core Feature]Failure-Node support in flytekit~~ [Core Feature]Failure-Node support Feb 16, 2022

EngHabu changed the title ~~[Core Feature]Failure-Node support~~ [Core Feature] Failure-Node support Feb 16, 2022

EngHabu modified the milestones: 0.19.3 - Feb 2021, 1.0.0 - Phoenix! Feb 16, 2022

EngHabu modified the milestones: 1.0.0 - Phoenix!, 1.0.1 Mar 9, 2022

EngHabu removed this from the 1.0.1 milestone May 4, 2022

eapolinario added this to the 1.3.0-candidate milestone Oct 6, 2022

eapolinario mentioned this issue Oct 7, 2022

Add support failure node flyteorg/flytekit#840

Merged

kumare3 modified the milestones: 1.3.0-candidate, 1.4.0 Nov 15, 2022

cosmicBboy removed this from the 1.4.0 milestone Jan 18, 2023

eapolinario added the backlogged For internal use. Reserved for contributor team workflow. label Oct 27, 2023

pingsutw mentioned this issue Nov 17, 2023

Add support failure node #4308

Merged

Future-Outlier mentioned this issue Apr 6, 2024

[easy] Show failed_node_id in failure node local execution flyteorg/flytekit#2334

Merged

3 tasks

runllm bot mentioned this issue Jun 26, 2024

SubWorkflow Error handling #5513

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Core Feature] Failure-Node support #1506

[Core Feature] Failure-Node support #1506

kumare3 commented Sep 20, 2021 •

edited

Loading

kumare3 commented Sep 20, 2021

fvde commented Mar 30, 2023

dylanwilder commented May 4, 2023

EngHabu commented May 4, 2023

dylanwilder commented May 4, 2023

EngHabu commented May 10, 2023

dylanwilder commented May 17, 2023

kumare3 commented May 19, 2023

gitgraghu commented May 22, 2023

EngHabu commented Oct 24, 2023

pingsutw commented Oct 26, 2023

EngHabu commented Oct 28, 2023

wild-endeavor commented May 14, 2024

[Core Feature] Failure-Node support #1506

[Core Feature] Failure-Node support #1506

Comments

kumare3 commented Sep 20, 2021 • edited Loading

kumare3 commented Sep 20, 2021

fvde commented Mar 30, 2023

dylanwilder commented May 4, 2023

EngHabu commented May 4, 2023

dylanwilder commented May 4, 2023

EngHabu commented May 10, 2023

dylanwilder commented May 17, 2023

kumare3 commented May 19, 2023

gitgraghu commented May 22, 2023

EngHabu commented Oct 24, 2023

pingsutw commented Oct 26, 2023

EngHabu commented Oct 28, 2023

wild-endeavor commented May 14, 2024

kumare3 commented Sep 20, 2021 •

edited

Loading