Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] GeneratorExit inadvertently caught, and also core timer mismatch #325

Closed
cretz opened this issue Jun 5, 2023 · 1 comment · Fixed by #341
Closed

[Bug] GeneratorExit inadvertently caught, and also core timer mismatch #325

cretz opened this issue Jun 5, 2023 · 1 comment · Fixed by #341
Labels
bug Something isn't working

Comments

@cretz
Copy link
Member

cretz commented Jun 5, 2023

Describe the bug

There are two problems here:

GeneratorExit is caught somewhere

When a coroutine is closed, Python throws a GeneratorExit from the coroutine and expects that to be bubbled out. But somewhere we are not bubbling that out so Python throws a runtime exception on this. So we're getting:

RuntimeError: coroutine ignored GeneratorExit
Exception ignored in: <coroutine object _WorkflowInstanceImpl._apply_start_workflow..run_workflow at 0x000001F7B495E340>

With a stack trace.

Core timer error

When workflow is evicted, sometimes when it reruns from scratch it gets something like:

2023-06-05T22:04:56.500788Z WARN temporal_sdk_core::worker::workflow: Failing workflow task run_id=b4c01744-339b-4d39-98f8-10e0c7d3846c failure=Failure { failure: Some(Failure { message: "Fatal("Timer fired event did not have expected timer id 8, it was 2!")", source: "", stack_trace: "", encoded_attributes: None, cause: None, failure_info: Some(ApplicationFailureInfo(ApplicationFailureInfo { r#type: "", non_retryable: false, details: None })) }), force_cause: Unspecified }

Unsure if this is due to GeneratorExit above or if that failure just surfaced the problem.

Replication

import asyncio
import logging
from uuid import uuid4

from temporalio import workflow
from temporalio.testing import WorkflowEnvironment
from temporalio.worker import Worker


@workflow.defn
class MyWorkflow:
    @workflow.run
    async def run(self) -> None:
        try:
            for _ in range(20):
                await asyncio.sleep(0.5)
        finally:
            await asyncio.sleep(1)

    @workflow.signal
    async def my_signal(self) -> None:
        pass


async def main():
    logging.basicConfig(level=logging.INFO)

    async with await WorkflowEnvironment.start_local() as env:
        logging.info("Starting worker")
        task_queue = f"tq-{uuid4()}"
        async with Worker(
            env.client,
            task_queue=task_queue,
            workflows=[MyWorkflow],
            max_cached_workflows=0,
        ):

            logging.info("Starting workflow")
            handle = await env.client.start_workflow(
                MyWorkflow.run,
                id=f"wf-{uuid4()}",
                task_queue=task_queue,
            )

            logging.info("Signalling every 300ms")
            for _ in range(20):
                await asyncio.sleep(0.3)
                await handle.signal(MyWorkflow.my_signal)

            logging.info("Waiting for workflow completion")
            await handle.result()
            logging.info("Workflow done")


if __name__ == "__main__":
    asyncio.run(main())

This is based on a user repo. This disables cache and makes a bunch of timers. But this causes constant evictions for each timer and signal sent which is causing the GeneratorExit issue. But that GeneratorExit issue is also causing this core error. I have not yet determined whether the GeneratorExit issue is surfacing a core bug or is just the bug itself w/ an eviction problem and core is somehow showing it.

@cretz cretz added the bug Something isn't working label Jun 5, 2023
@cretz
Copy link
Member Author

cretz commented Jun 6, 2023

Here's the issue for GeneratorExit as comments on the original workflow:

@workflow.defn
class MyWorkflow:
    @workflow.run
    async def run(self) -> None:
        try:
            for _ in range(20):
                # With cache disabled, this is evicted which when GC'd causes
                # this to throw a GeneratorExit (which is usually ok)
                await asyncio.sleep(0.5)
        finally:
            # This tries to run on GeneratorExit, but fails with a RuntimeError
            # of "no running event loop". Since this finally was not written well
            # it swallows the GeneratorExit error making Python mad it didn't
            # receive the error
            await asyncio.sleep(1)

So, the finally block is causing an error which swallows the actual GeneratorExit making Python upset. No matter how many ways I've tried (e.g. trying to overwrite close on coroutine which are read only), I cannot prevent the coroutine.close() which does the GeneratorExit dance. From https://docs.python.org/3/reference/datamodel.html#coroutine.close:

Causes the coroutine to clean itself up and exit. If the coroutine is suspended, this method first delegates to the close() method of the iterator that caused the coroutine to suspend, if it has such a method. Then it raises GeneratorExit at the suspension point, causing the coroutine to immediately clean itself up. Finally, the coroutine is marked as having finished executing, even if it was never started.

Coroutine objects are automatically closed using the above process when they are about to be destroyed.

I am still investigating how I can try to tear down the event loop without it waking up. I am also still investigating how core timers can get out of sync here. My first thought was this is an old activation completing, but that doesn't seem accurate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant