-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow proper stack trace on eviction deadlock #530
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2467,6 +2467,8 @@ async def test_workflow_deadlock(client: Client): | |
async with new_worker( | ||
client, DeadlockedWorkflow, disable_safe_workflow_eviction=True | ||
) as worker: | ||
if worker._workflow_worker: | ||
worker._workflow_worker._deadlock_timeout_seconds = 1 | ||
deadlock_thread_event.clear() | ||
handle = await client.start_workflow( | ||
DeadlockedWorkflow.run, | ||
|
@@ -2488,7 +2490,7 @@ async def last_history_task_failure() -> str: | |
|
||
try: | ||
await assert_eq_eventually( | ||
"[TMPRL1101] Potential deadlock detected, workflow didn't yield within 2 second(s)", | ||
"[TMPRL1101] Potential deadlock detected, workflow didn't yield within 1 second(s)", | ||
last_history_task_failure, | ||
timeout=timedelta(seconds=5), | ||
interval=timedelta(seconds=1), | ||
|
@@ -2497,6 +2499,64 @@ async def last_history_task_failure() -> str: | |
deadlock_thread_event.set() | ||
|
||
|
||
@workflow.defn | ||
class EvictionDeadlockWorkflow: | ||
def __init__(self) -> None: | ||
self.val = 1 | ||
|
||
async def wait_until_positive(self): | ||
while True: | ||
await workflow.wait_condition(lambda: self.val > 0) | ||
self.val = -self.val | ||
|
||
async def wait_until_negative(self): | ||
while True: | ||
await workflow.wait_condition(lambda: self.val < 0) | ||
self.val = -self.val | ||
|
||
@workflow.run | ||
async def run(self): | ||
await asyncio.gather(self.wait_until_negative(), self.wait_until_positive()) | ||
|
||
|
||
async def test_workflow_eviction_deadlock(client: Client): | ||
# We are running the worker, but we can't ever shut it down on eviction | ||
# error so we send shutdown in the background and leave this worker dangling | ||
worker = new_worker(client, EvictionDeadlockWorkflow) | ||
if worker._workflow_worker: | ||
worker._workflow_worker._deadlock_timeout_seconds = 1 | ||
worker_task = asyncio.create_task(worker.run()) | ||
|
||
# Run workflow that deadlocks | ||
handle = await client.start_workflow( | ||
EvictionDeadlockWorkflow.run, | ||
id=f"workflow-{uuid.uuid4()}", | ||
task_queue=worker.task_queue, | ||
) | ||
|
||
async def last_history_task_failure() -> str: | ||
resp = await client.workflow_service.get_workflow_execution_history( | ||
GetWorkflowExecutionHistoryRequest( | ||
namespace=client.namespace, | ||
execution=WorkflowExecution(workflow_id=handle.id), | ||
), | ||
) | ||
for event in reversed(resp.history.events): | ||
if event.event_type == EventType.EVENT_TYPE_WORKFLOW_TASK_FAILED: | ||
return event.workflow_task_failed_event_attributes.failure.message | ||
return "<no failure>" | ||
|
||
await assert_eq_eventually( | ||
"[TMPRL1101] Potential deadlock detected, workflow didn't yield within 1 second(s)", | ||
last_history_task_failure, | ||
timeout=timedelta(seconds=5), | ||
interval=timedelta(seconds=1), | ||
) | ||
|
||
# Send cancel but don't wait | ||
worker_task.cancel() | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm finding that this test passes even with the bug fix reverted. (I know that some people disagree, but personally I think that it's a good idea to structure PRs like this as an initial commit with a failing test, followed by a commit that fixes the failure.)
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, there's no easy way without extracting logs or adding other testing-only helpers to assert which reason exists for eviction failure. The test in this case was more for observed behavior as I struggled to justify adding injection points for this test. But if we think it's important, I can add hooks in the runtime code for this one test. |
||
|
||
class PatchWorkflowBase: | ||
def __init__(self) -> None: | ||
self._result = "<unset>" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if it'd make more sense to just panic and crash at this point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This issue is not really about changing this behavior, but we can have a general discussion on that if we want though I don't think it should hold up this issue.
To do this, we might need some Core support. Today Core won't shutdown a worker if there is an outstanding eviction, and we don't want to return an eviction if it can't properly tear down because we don't want to accept more work for this workflow. What will Core do if we fail the eviction instead of hang it?
Also, a deadlocked workflow fatal'ing the worker is a bit rough on users. Unfortunately deadlocked many times means failed-eviction-in-cache-forever from a Python POV. Technically we could move can't-evict workflows off to some other dict, but that's just shifting the memory burden, it doesn't help the end user. This whole situation should be rare, hence why I was hesitant to build in some continually-retry-eviction type of logic in case the user's non-deterministic stuff started working (rare/unlikely).
Now that I think about it, I think other SDKs like Go and Java have this issue too. If they can't tear down their coroutines/threads and it hangs, there's nothing they can do, but I don't think they fatal the worker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you fail the eviction I believe it just gets re-issued. We could make a special case for this but I'd rather not.
And yeah, fair point about the other SDKs and how this is likely when you do deadlock. I'm fine with just eating up the slot.