[Core Feature] Allow Some Dynamic Sub-workflows to Fail #878
Comments
+1 to this feature. We have many use cases for a dynamic_task that executes a list of sub-workflows or sub-tasks that are all independent of each other, so there's no reason for all of them to get aborted when one has an error. Example: …
@giorgiawillits / @michaels-lyft, I want to understand the use case a little better. You're saying the top workflow should continue even if a child workflow fails, since they are independent. IMO this is a cleaner feature than adding weird behavior to the parent. The reason to wait today is that the parent can consume the outputs. We can probably easily model async workflow triggers, where outputs are not available.
Thanks for following up on this @kumare3! For my use case, we would actually like a downstream workflow to consume the output of the parent workflow. The reason we are okay with some sub-workflows/sub-tasks failing is that we split work by date, and we are okay with missing data from some dates as long as there's enough data at the end. With the async solution, as long as there's an API to wait and "join" all the detached processes, it should be fine for our use case.
@michaels-lyft as a follow-up, would you be OK doing this yourself in a regular python task, or would you prefer an idiomatic solution in the programming model?
Unfortunately we can't migrate the sub-workflow to a regular python task, since the workflow spins out several sub-tasks and sub-dynamic-tasks... Could this be implemented idiomatically so that …?
We have a similar use case, but haven't implemented it yet. Our expectation was that with FAIL_AFTER_EXECUTABLE_NODES_COMPLETE, workflow execution will try to make as much progress as it can: if any node fails, all independent nodes should still be attempted. In the case of a sub-workflow, all nodes that depend on the WorkflowNode should be skipped, but all independent nodes should be attempted. In our case async execution is not an option, because we care about the completion of each sub-workflow. Example:
B is a sub-workflow consisting of: …
If …
@kanterov / @michaels-lyft this is an interesting discussion. @michaels-lyft, reading closely to understand. I think @kanterov your execution should work, but @michaels-lyft I think for you a … What's your current state, can you use the new flytekit interface?
Yes, a …
The workflow fans out jobs split by date and hour. We don't need data for every hour of every day to have enough data for the downstream job. As long as a certain % of the jobs succeed, we can start the downstream job. The reason some jobs might fail is either a lack of data for that particular hour or Flyte infra issues.
@michaels-lyft we have started the exploration for this feature in the current milestone, but we will deliver it in the next milestone.
@kumare3 Thanks for the update! How long is a milestone?
1 milestone is a month long.
Hi @kumare3 Just wondering if there's an update on this issue?
Hey @michaels-lyft, I wanted to clarify whether the proposed try-finally behavior will solve your problem. The behavior we are proposing is something like this:
If …
An example of sub-workflows: …
In this case, … Note that …
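The details of this proposal are truncated above. As a rough illustration of the try/finally shape being discussed, here is a minimal sketch using the on_failure hook that later flytekit releases expose; all task and workflow names are placeholders, and the parent run is still reported as failed, which is exactly the limitation raised in the next comment.

```python
from flytekit import task, workflow

@task
def risky_step() -> str:
    # Placeholder for a node (e.g. a sub-workflow launch) that may fail.
    return "ok"

@task
def clean_up():
    # The "finally" part: runs even when the workflow fails.
    print("cleaning up after failure")

# on_failure registers clean_up as a failure handler for the whole workflow.
# The workflow itself is still marked FAILED after clean_up completes.
@workflow(on_failure=clean_up)
def my_parent_wf() -> str:
    return risky_step()
```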
Hi @EngHabu, thanks for the proposal! I don't think it would work for our use case, since downstream workflows rely on this workflow being marked successful in order to be executed. If this ended up being the implementation of try-catch, then I think we would need option 1 from the description for our use case to work. I can think of a very hacky workaround using the proposed try-catch implementation, though: it would be to set the downstream workflow as the …
Hey @michaels-lyft, can you elaborate on how you will implement that function to make it handle the error and return a successful run? Especially if you have a workflow that returns an output, like this:

```python
@workflow(failure_policy=WorkflowFailurePolicy.FAIL_AFTER_EXECUTABLE_NODES_COMPLETE)
def my_parent_wf() -> str:
    n1 = my_sub_wf()
    n2 = my_sub_wf()
    n3 = my_sub_wf()
    return n3.out1
```

Let's say it fails in executing …
Our workflow looks more like: …

Ideally, the … If the above approach doesn't work (i.e. it requires more work to fix), then I think the try-except scenario would work as long as there's a … converting the error handler to a separate task.
@michaels-lyft you can launch sub-workflows and then decide to fail or succeed without needing any additional work from our team. It just needs to follow some rules: …

If the above 2 conditions are met, then you should be able to launch the workflows using a python task and … An example of doing so would be (in the new API):

```python
@task
def foo():
    remote = flytekit.remote()
    lp = remote.fetch_launch_plan(...)
    e1 = remote.execute(lp, inputs, name)
    ...
    e1.wait()
    e2.wait()
    remote.execute(lp, inputs=e1.outputs...)
```

This is completely OK; sadly, the parallelizing and failure detection fall on you, but this is resilient. In the future we will support a new task type that will allow one to run this, and we will automatically provide futures to ensure that things run, that we can wait correctly, and that data dependencies are automatically propagated. Ping me if you have any questions. cc @EngHabu
Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏
Hello 👋, This issue has been inactive for over 9 months and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏
Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable. |
Motivation: Why do you think this is important?
We have a lot of scenarios where a dynamic_task executes a list of sub-workflows or sub-dynamic_tasks. Today, when one of these sub-executions fails, it causes the parent dynamic_task to fail, even with FAIL_AFTER_EXECUTABLE_NODES_COMPLETE enabled on the parent workflow or allowed_failure_ratio enabled on the dynamic_task (which only works with sub-python_tasks). This automatically aborts any dependent downstream tasks. But sometimes we would like to allow a certain % of the tasks to fail, since we may not always need 100% of the output.

Goal: What should the final outcome look like, ideally?
There should be an option to allow a % of the sub-workflows or sub-dynamic_tasks to fail.
Describe alternatives you've considered
Gone through the discussions at #191. Tried using FAIL_AFTER_EXECUTABLE_NODES_COMPLETE and allowed_failure_ratio, but neither works in this scenario.
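For concreteness, a minimal sketch (newer flytekit API, placeholder names) of roughly how these existing knobs are attached today; as described above, the failure policy keeps independent sibling nodes running but still fails the fan-out node, and allowed_failure_ratio on the legacy dynamic_task only covers plain python tasks.

```python
from typing import List
from flytekit import task, dynamic, workflow, WorkflowFailurePolicy

@task
def load_partition(date: str) -> int:
    # Placeholder for the real per-date work.
    return 1

@workflow
def my_sub_wf(date: str) -> int:
    # Placeholder sub-workflow, standing in for the date-partitioned sub-workflow.
    return load_partition(date=date)

@dynamic
def fan_out(dates: List[str]) -> int:
    # One sub-workflow per date; today, a single failure fails this dynamic node.
    for d in dates:
        my_sub_wf(date=d)
    return len(dates)

@workflow(failure_policy=WorkflowFailurePolicy.FAIL_AFTER_EXECUTABLE_NODES_COMPLETE)
def parent(dates: List[str]) -> int:
    # The policy lets independent sibling nodes run to completion, but there is
    # no way to let only a percentage of the sub-workflows inside fan_out fail.
    return fan_out(dates=dates)
```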
[Optional] Propose: Link/Inline OR Additional context
I can think of two ways to do this:
1. Extend allowed_failure_ratio to include sub-workflows and sub-dynamic_tasks.
2. Support try-catch in the dynamic_task so we could catch any sub-execution failures and make the parent task succeed (as suggested here). A purely hypothetical sketch of this option follows this list.