Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Aborted child workflows should better report root cause of abort #408

Closed
3 of 20 tasks
katrogan opened this issue Jul 10, 2020 · 12 comments · Fixed by flyteorg/flyteconsole#195
Closed
3 of 20 tasks
Assignees
Labels
bug Something isn't working ui Admin console user interface

Comments

@katrogan
Copy link
Contributor

Describe the bug
Currently if a parent workflow launches a child workflow and parent workflow fails it will auto-abort children workflows. On the parent overview page the node that launched the child workflow correctly reports Some node execution failed, auto-abort. However on the page for the actual child workflow the aborted task shows as Workflow aborted. without any indication why. This is a confusing experience for users debugging or looking at the failed child workflow.

Expected behavior
A clear and concise description of what you expected to happen.

Flyte component

  • Overall
  • Flyte Setup and Installation scripts
  • Flyte Documentation
  • Flyte communication (slack/email etc)
  • FlytePropeller
  • FlyteIDL (Flyte specification language)
  • Flytekit (Python SDK)
  • FlyteAdmin (Control Plane service)
  • FlytePlugins
  • DataCatalog
  • FlyteStdlib (common libraries)
  • FlyteConsole (UI)
  • Other

To Reproduce
Steps to reproduce the behavior:

  1. Create a workflow that launches a long-ish running child workflow
  2. Have this parent workflow fail a downstream task
  3. Observe the execution page for the child workflow.

Screenshots
If applicable, add screenshots to help explain your problem.

Environment
Flyte component

  • Sandbox (local or on one machine)
  • Cloud hosted
    • AWS
    • GCP
    • Azure
  • Baremetal
  • Other

Additional context
N/A

@katrogan katrogan added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Jul 10, 2020
@gsimko-lyft
Copy link

👍

@pmahindrakar-oss
Copy link
Contributor

@katrogan I am unable to reproduce this issue.

For reproing the issue , i have used the following example from flytesnacks control_flow/subworkflows.py

And modified it to fail from the parent workflow after 45 sec . I always launch the workflow with value of a= 3 so that the parent throws an exception. Have added sleep in parent execution to allow enough time for the child workflow containers to be created and also a sleep in child workflow of 90 sec to keep them running and not be completed before the parent.

Let me know if there some other specific example that can repro it.

@task
def t1(a: int) -> typing.NamedTuple("OutputsBC", t1_int_output=int, c=str):
    if a == 3:
      time.sleep(45)
      raise Exception("Sorry, failing the parent workflow")
    time.sleep(90) 
    return a + 2, "world"

@workflow
def my_subwf(a: int = 42) -> (str, str):
    x, y = t1(a=a)
    u, v = t1(a=23)
    return y, v

@workflow
def parent_wf(a: int) -> (int, str, str):
    x, y = t1(a=a).with_overrides(node_name="node-t1-parent")
    u, v = my_subwf(a=10)
    return x, u, v

Here are the screenshots with two cases of the workflow execution
1] Parent workflow aborts due to failure in execution of a child task and child worklow is aborted due to this.
Screenshot 2021-05-20 at 4 03 50 PM

2] Execution is aborted while the parent and child workflows are executing

Screenshot 2021-05-20 at 4 09 22 PM

Also going through the propeller code to dig if there is a case i am missing here

@pmahindrakar-oss
Copy link
Contributor

Another data point here
There was an issue where the reason field was not being used at all during handling failed workflow execution.

It was fixed here
https://github.com/flyteorg/flytepropeller/pull/84/files#diff-854663e4d10286e61f445300917772f010c74cbdc602948568084d4ef894259cR97

As part of this PR flyteorg/flytepropeller#84

handleFailingWorkflow And HandleAbortedWorkflow are two different paths

Also
Log message : "Some node execution failed, auto-abort" is part of handleFailingWorkflow
Log message : "Workflow aborted" is part of the HandleAbortedWorkflow

Trying to see how these two converge

@katrogan
Copy link
Contributor Author

katrogan commented May 20, 2021

do you have screenshots of the child workflow? i think the issue is that error messaging 'Workflow aborted' is non-descriptive and doesn't indicate by whom it was aborted (the failing parent wf in thise case)

@pmahindrakar-oss
Copy link
Contributor

@katrogan can you guide me in how to look for the child workflow page.I assumed the screenshots I posted would cover this .Also there is graph view where also I saw the same data .Is there some place else ?

@pmahindrakar-oss
Copy link
Contributor

pmahindrakar-oss commented May 24, 2021

Ok . I was able to repro this with a slightly different example where we use the launchplan within another workflow


@task
def greet(day_of_week: str, number: int, am: bool) -> str:
    if day_of_week == "unknown_day" :
      time.sleep(45)
      raise Exception("Sorry, failing the parent workflow since i dont know how to greet for unknown_day")
    time.sleep(90)
    greeting = "Have a great " + day_of_week + " "
    greeting += "morning" if am else "evening"
    return greeting + "!" * number


@workflow
def go_greet(day_of_week: str, number: int, am: bool = False) -> str:
    return greet(day_of_week=day_of_week, number=number, am=am)


morning_greeting = LaunchPlan.create(
    "morning_greeting",
    go_greet,
    #fixed_inputs={"am": True},
    default_inputs={"number": 1, "am" : True},
)

@workflow
def child_workflow():
    # Let's see if we can convincingly pass a Turing test!
    today = datetime.datetime.today()
    greet(day_of_week="unknown_day", number=3, am=True)

    for n in range(7):
        day = today + datetime.timedelta(days=n)
        weekday = calendar.day_name[day.weekday()]
        if day.weekday() < 5:
            print(morning_greeting(day_of_week=weekday))
        else:
            # We're extra enthusiastic on weekends
            print(morning_greeting(number=3, day_of_week=weekday))

Parent workflow

Screenshot 2021-05-24 at 7 51 15 PM

Child workflow

Screenshot 2021-05-24 at 7 51 23 PM

cc : @kumare3

@pmahindrakar-oss
Copy link
Contributor

After debugging further with the current state the abort workflow relies on the DeleteTimestamp to determine if the workflow has been aborted and this cannot distinguish between user abort versus a failing parent task triggering a subworkflow execution abort .

The abort stage which is part of the control loop, it only has access to v1alpha1.FlyteWorkflow object which it compares desired and actual states to make changes.

When the parent Workflow fails , it calls abort with abort handler for the nodeExecutor which inturn call the subWorflow abort handler with the reason for all downstream nodes and ultimately recording this as

event.NodeExecutionEvent{
			Id:         nodeExecutionID,
			Phase:      core.NodeExecution_ABORTED,
			OccurredAt: ptypes.TimestampNow(),
			OutputResult: &event.NodeExecutionEvent_Error{
				Error: &core.ExecutionError{
					Code:    "NodeAborted",
					Message: reason,
				},
			},

with reason "Some node execution failed, auto-abort"

And this is shown in parent overview page

When this event is raised and auto-abort is called, a reason string is constructed to be sent to admin for Terminate request
eg : cascading abort as parent execution id [f38e7e5290f1d443fa5f] aborted, reason [Some node execution failed, auto-abort.]

But this information is not shown on the console. Ideally we should be pulling this in UI console.
Which is also mentioned by @EngHabu on this issue #232

Now on the child workflow page since after the control loop runs later, It sees the all child sub workflow as having a deleteTimeStamp that were terminated by parent and hence send another NodeExecutionEvent but this time with static reason 'Workflow aborted.'

No where the information of error is stored in the InMemory v1alpha1.FlyteWorkflow object which can be accessed through the control loop flow and hence when this executes, the error is stored as "Workflow aborted" in the event.NodeExecutionEvent

@pmahindrakar-oss
Copy link
Contributor

Following is the execution data of the child workflow

  "closure": {
        "abort_metadata": {
            "cause": "cascading abort as parent execution id [f38e7e5290f1d443fa5f] aborted, reason [Some node execution failed, auto-abort.]"
        },
        "phase": "ABORTED",

And this the node execution data for the same identifier.

            "closure": {
                "error": {
                    "code": "NodeAborted",
                    "message": "Workflow aborted."
                },
                "phase": "ABORTED",

UI shows the error coming from the node execution.

I see three options to fix this.
1] Flytepropeller sends the NodeExecutionEvent with abort metadata in the error object.

2] Flyteadmin get the current event but find if its an abort and then find the execution data for it and amends the error message to the value coming from the execution's abort metadata.

3] UI should show the execution error message view which shows both the node abort message and execution abort message which will make it clearer.

Cleaner in my opinion would be to show the abort metadata from the execution on the flyteconsole instead of modifying anything in propeller or admin.

@katrogan @kumare3 please suggest what would be appropriate here.

@katrogan
Copy link
Contributor Author

+1 for option 1) i like the idea of not having an external service (admin, console) deal with reconciling state. much simpler and straightforward to have it recorded correctly

@katrogan
Copy link
Contributor Author

after chatting with @kumare3 ignore my suggestion! I think admin as the control plane is better positioned to reconcile the errors message vs abort cause by traversing the parent lineage

@pmahindrakar-oss
Copy link
Contributor

pmahindrakar-oss commented May 31, 2021

hey @katrogan @kumare3 ,

Heres a simple abort usecase, the UI doesn't show the right message which was entered by user from flyteconsole.

heres the o/p from flytectl. So abort string of "Console termination " was entered from flyteconsole but it doesn't show up in the UI , since it show the node execution data from closure "error": { "code": "NodeAborted", "message": "Workflow aborted." },

But if we see the actual execution data , it does show the abort data .(Node execution closure . abort != Execution closure .abort)

In other cases where we have ERROR data populated which is basically failure cases in those cases
Node execution closure . error = = Execution closure .error AND UI correctly pull the execution closure . error.message field and displays it.

Reconciling in admin for the abort case where Node execution closure . abort != Execution closure .abort would be ok , But i think if we can have the UI show both the abort and error data from executions then that would ok too for now. I would leave the reconciliation case up to you guys.

I have setup time for discussing this . Shouldn't take much time

 ---------------------- --------------------------------------------------- ----------------------------------------------------------------------------------------------------------------- 
| NAME (79)            | ABORT DATA                                        | ERROR DATA                                                                                                      |
 ---------------------- --------------------------------------------------- ----------------------------------------------------------------------------------------------------------------- 
| ffaz4xni             | cascading abort as parent execution id            |                                                                                                                 |
|                      | [f1ac227769dfc48c481f] aborted, reason [Some node |                                                                                                                 |
|                      | execution failed, auto-abort.]                    |                                                                                                                 |
 ---------------------- --------------------------------------------------- ----------------------------------------------------------------------------------------------------------------- 
| f43be7092143744efb5e | Console termination                               |                                                                                                                 |
 ---------------------- --------------------------------------------------- ----------------------------------------------------------------------------------------------------------------- 
| fcdae8acc683240de84c |                                                   | [4/4] currentAttempt done. Last Error: USER::Traceback (most recent call last):                                 |
|                      |                                                   |   File "/opt/venv/lib/python3.8/site-packages/flytekit/bin/entrypoint.py", line 101, in _dispatch_execute       |
|                      |                                                   |     outputs = task_def.dispatch_execute(ctx, idl_input_literals)                                                |
|                      |                                                   |   File "/opt/venv/lib/python3.8/site-packages/flytekit/core/base_task.py", line 402, in dispatch_execute        |
|                      |                                                   |     native_inputs = TypeEngine.literal_map_to_kwargs(exec_ctx, input_literal_map, self.python_interface.inputs) |
|                      |                                                   |   File "/opt/venv/lib/python3.8/site-packages/flytekit/core/type_engine.py", line 306, in literal_map_to_kwargs |
|                      |                                                   |     return {k: TypeEngine.to_python_value(ctx, lm.literals[k], v) for k, v in python_types.items()}             |
|                      |                                                   |   File "/opt/venv/lib/python3.8/site-packages/flytekit/core/type_engine.py", line 306, in <dictcomp>            |
|                      |                                                   |     return {k: TypeEngine.to_python_value(ctx, lm.literals[k], v) for k, v in python_types.items()}             |
|                      |                                                   |   File "/opt/venv/lib/python3.8/site-packages/flytekit/core/type_engine.py", line 284, in to_python_value       |
|                      |                                                   |     return transformer.to_python_value(ctx, lv, expected_python_type)                                           |
|                      |                                                   |   File "/opt/venv/lib/python3.8/site-packages/flytekit/core/type_engine.py", line 186, in to_python_value       |
|                      |                                                   |     dc = expected_python_type.from_json(_json_format.MessageToJson(lv.scalar.generic))                          |
|                      |                                                   |   File "/opt/venv/lib/python3.8/site-packages/dataclasses_json/api.py", line 76, in from_json                   |
|                      |                                                   |     return cls.from_dict(kvs, infer_missing=infer_missing)                                                      |
|                      |                                                   |   File "/opt/venv/lib/python3.8/site-packages/dataclasses_json/api.py", line 83, in from_dict                   |
|                      |                                                   |     return _decode_dataclass(cls, kvs, infer_missing)                                                           |
|                      |                                                   |   File "/opt/venv/lib/python3.8/site-packages/dataclasses_json/core.py", line 159, in _decode_dataclass         |
|                      |                                                   |     field_value = kvs[field.name]                                                                               |
|                      |                                                   | KeyError: 'y                                                                                                    |
 ---------------------- --------------------------------------------------- ----------------------------------------------------------------------------------------------------------------- 
| fdeb5b4f6771a47368bc |                                                   | [1/1] currentAttempt done. Last Error: USER::Traceback (most recent call last):                                 |
|                      |                                                   |   File "/opt/venv/lib/python3.8/site-packages/flytekit/bin/entrypoint.py", line 101, in _dispatch_execute       |
|                      |                                                   |     outputs = task_def.dispatch_execute(ctx, idl_input_literals)                                                |
|                      |                                                   |   File "/opt/venv/lib/python3.8/site-packages/flytekit/core/base_task.py", line 411, in dispatch_execute        |
|                      |                                                   |     raise e                                                                                                     |
|                      |                                                   |   File "/opt/venv/lib/python3.8/site-packages/flytekit/core/base_task.py", line 408, in dispatch_execute        |
|                      |                                                   |     native_outputs = self.execute(**native_inputs)                                                              |
|                      |                                                   |   File "/opt/venv/lib/python3.8/site-packages/flytekit/core/python_function_task.py", line 130, in execute      |
|                      |                                                   |     return self._task_function(**kwargs)                                                                        |
|                      |                                                   |   File "/root/core/flyte_basics/lp.py", line 74, in greet                                                       |
|                      |                                                   |     raise Exception("Sorry, failing the parent workflow since i dont know how to greet for unknown_day")        |
|                      |                                                   | Exception: Sorry, failing the parent workflow since i dont know how to greet for unknown_da                     |

MR to dump the error and abort data from flytectl is here flyteorg/flytectl#79

@pmahindrakar-oss
Copy link
Contributor

@jsonporter , Please review this issue. Essentially, the UI is currently not displaying the abortMetadata within an execution closure

https://github.com/flyteorg/flyteidl/blob/08a958c085c939101893f40fb1582e92b31c9a40/protos/flyteidl/admin/execution.proto#L116

It only show if the error field is populated
https://github.com/flyteorg/flyteidl/blob/08a958c085c939101893f40fb1582e92b31c9a40/protos/flyteidl/admin/execution.proto#L110

In my last update i have shown how i have implemented this in flytectl . The UI should do a conditional check based on which data is available from the oneof and use that to show the abort or error data.

I am assigning this to you .Please let me know if you have any questions regarding this .

@pmahindrakar-oss pmahindrakar-oss added ui Admin console user interface and removed untriaged This issues has not yet been looked at by the Maintainers labels Jun 4, 2021
@Pianist038801 Pianist038801 self-assigned this Sep 1, 2021
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Dec 6, 2022
* Added hotfix for end2end test

Signed-off-by: Yuvraj <[email protected]>
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Dec 20, 2022
* requirements update

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* Set resources differently for SANDBOX vs prod

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* bump

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* use lower resources for sandbox

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* bump

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* bump

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* register without serialize

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* register without serialize

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* bump

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* bump

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* Update requirements

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* wip

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* Update eda requirements

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* Cleanup

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* format

Signed-off-by: Haytham Abuelfutuh <[email protected]>
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Aug 9, 2023
* Added hotfix for end2end test

Signed-off-by: Yuvraj <[email protected]>
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Apr 30, 2024
* Support envs when creating execution

Signed-off-by: Hongxin Liang <[email protected]>

* Update doc

Signed-off-by: Hongxin Liang <[email protected]>

---------

Signed-off-by: Hongxin Liang <[email protected]>
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Apr 30, 2024
* Support envs when creating execution

Signed-off-by: Hongxin Liang <[email protected]>

* Update doc

Signed-off-by: Hongxin Liang <[email protected]>

---------

Signed-off-by: Hongxin Liang <[email protected]>
austin362667 pushed a commit to austin362667/flyte that referenced this issue May 7, 2024
* Support envs when creating execution

Signed-off-by: Hongxin Liang <[email protected]>

* Update doc

Signed-off-by: Hongxin Liang <[email protected]>

---------

Signed-off-by: Hongxin Liang <[email protected]>
robert-ulbrich-mercedes-benz pushed a commit to robert-ulbrich-mercedes-benz/flyte that referenced this issue Jul 2, 2024
* Support envs when creating execution

Signed-off-by: Hongxin Liang <[email protected]>

* Update doc

Signed-off-by: Hongxin Liang <[email protected]>

---------

Signed-off-by: Hongxin Liang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working ui Admin console user interface
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants