Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

non-awaitable object makes asyncio.wait hang if there are async subprocess tasks #105288

Closed
keuin opened this issue Jun 4, 2023 · 7 comments
Closed
Labels
topic-asyncio type-bug An unexpected behavior, bug, or error

Comments

@keuin
Copy link

keuin commented Jun 4, 2023

asyncio.waitaccepts a Iterable[Awaitable[_T]] as its first parameter. However it will hang forever if we passes an invalid Awaitable to it and there are subprocesses wrapped in asyncio.Task running.

To sum up, we have to meet these conditions at once:

  1. There are tasks for subprocesses managed by asyncio.subprocess. It does not matter if we asyncio.wait for it or not.
  2. The program attempts to perform asyncio.wait on iterators that can yield non-awaitable objects (such as integers).

Here is a litmus program that reproduces this problem:

import asyncio
from asyncio import subprocess


async def whoami():
    print('Finding out who am I...')
    proc = await subprocess.create_subprocess_exec(
        'whoami',
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    stdout, _ = await proc.communicate()
    print(f'I am {stdout}')


async def main():
    t1 = asyncio.create_task(asyncio.sleep(0))
    t2 = asyncio.create_task(whoami())
    # Both `[0, t1]` and `[0, t2]` can cause the problem
    await asyncio.wait([0, t2])


if __name__ == '__main__':
    asyncio.run(main())

When running it on Linux, this program is expected to terminate immediately after throwing an exception. Because we are attempting to wait for 0, which does not implement Awaitable[_T]. However the program will barely output:

Finding out who am I...

and hangs forever. Hitting Ctrl-C does terminate it, but the stack trace points to somewhere far away from the real cause:

Traceback (most recent call last):
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/tmp/cpython/bisect/cpython/litmus.py", line 19, in main
    await asyncio.wait([0, t2])
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/tasks.py", line 418, in wait
    return await _wait(fs, timeout, return_when, loop)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/tasks.py", line 522, in _wait
    f.add_done_callback(_on_completion)
    ^^^^^^^^^^^^^^^^^^^
AttributeError: 'int' object has no attribute 'add_done_callback'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/cpython/bisect/cpython/litmus.py", line 23, in <module>
    asyncio.run(main())
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/runners.py", line 189, in run
    with Runner(debug=debug) as runner:
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/runners.py", line 63, in __exit__
    self.close()
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/runners.py", line 71, in close
    _cancel_all_tasks(loop)
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/runners.py", line 201, in _cancel_all_tasks
    loop.run_until_complete(tasks.gather(*to_cancel, return_exceptions=True))
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/base_events.py", line 640, in run_until_complete
    self.run_forever()
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/base_events.py", line 607, in run_forever
    self._run_once()
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/base_events.py", line 1884, in _run_once
    event_list = self._selector.select(timeout)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/cpython/bisect/cpython/Lib/selectors.py", line 468, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
Task was destroyed but it is pending!
task: <Task cancelling name='Task-3' coro=<whoami() running at /tmp/cpython/bisect/cpython/litmus.py:7> wait_for=<Future pending cb=[Task.task_wakeup()]> cb=[gather.<locals>._done_callback() at /tmp/cpython/bisect/cpython/Lib/asyncio/tasks.py:754]>

Process finished with exit code 130 (interrupted by signal 2: SIGINT)

It does not matter if we wait for [0, t1], or [0, t2]. As long as t2 (task of the subprocess) is created, the program will hang when awaiting on that list.

If we run it on previous versions of Python (such as the latest branch 3.10, 2c9b0f30), it will cause an exception thrown in the standard library, though the exact behavior seems to be undefined, since I have observed many different locations that throws some kind of exceptions like AttributeError in different locations, when bisecting using git. The correct output looks like:

/tmp/cpython/bisect/cpython/python /tmp/cpython/bisect/cpython/litmus.py 
Finding out who am I...
Traceback (most recent call last):
  File "/tmp/cpython/bisect/cpython/litmus.py", line 23, in <module>
    asyncio.run(main())
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/base_events.py", line 664, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/tmp/cpython/bisect/cpython/litmus.py", line 19, in main
    await asyncio.wait([0, t2])
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/tasks.py", line 426, in wait
    return await _wait(fs, timeout, return_when, loop)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/tasks.py", line 530, in _wait
    f.add_done_callback(_on_completion)
    ^^^^^^^^^^^^^^^^^^^
AttributeError: 'int' object has no attribute 'add_done_callback'

Process finished with exit code 1

My test environment is Intel 12th gen core with the latest Archlinux & zen kernel (6.3.5-zen1-1-zen). I have rerun the test on an updated Debian 11 VM and got the same result. So I don't think it is related to distros and kernel versions.

Bisecting shows this issue was directly introduced in commit 7015e137: gh-88050: Fix asyncio subprocess to kill process cleanly when process is blocked (#32073), which attempted to fix a previous asyncio issue. After investigating the source code, I believe this was caused by this commit, which has two major problems:

  1. It doesn't validate the object before accessing its attribute pipe. If we look at method _try_finish, we find that method has validated the same objects, so this should be fixed. This causes an unrelated exception and interrupts the clean-up process _process_exited. However this piece of code has been removed in a subsequent commit. So it is not relevant today.
  2. It doesn't wake up waiters in self._exit_waiters after subprocess exited. This directly causes the program hang forever.

I will post my fix and detailed explanation in a PR shortly. Please let me know if there are problems.

Linked PRs

@kumaraditya303
Copy link
Contributor

Why are you passing a non awaitable object in the first place?

@keuin
Copy link
Author

keuin commented Jun 5, 2023

Why are you passing a non awaitable object in the first place?

As you can see, that's a misuse. It reveals two problems:

  1. The implementation is not exception safe. When processing invalid input, those waiters waiting for a background subprocess are not triggered correctly, which leads to an inconsistent state.
  2. As a standard library, the implementation should fail immediately against an invalid input, or define what should happen, rather than breaking previous behaviors.

keuin added a commit to keuin/cpython that referenced this issue Jun 5, 2023
keuin added a commit to keuin/cpython that referenced this issue Jun 7, 2023
@kumaraditya303
Copy link
Contributor

The issue here isn't because of asyncio.wait but rather a background fire and forget task which hangs if not cancelled correctly. Simplified code:

import asyncio
from asyncio import subprocess


async def whoami():
    print('Finding out who am I...')
    proc = await subprocess.create_subprocess_exec(
        'whoami',
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    print('Waiting for the process to finish...', proc)
    stdout, _ = await proc.communicate()
    print(f'I am {stdout}')


async def main():
    t2 = asyncio.create_task(whoami())
    await asyncio.sleep(0)
    # await asyncio.sleep(0) Uncomment to fix hang

if __name__ == '__main__':
    asyncio.run(main())

As seen in the code adding a sleep(0) which is essentially cycle event loop once before tearing it apart fixes it.

@keuin
Copy link
Author

keuin commented Jun 8, 2023

As seen in the code adding a sleep(0) which is essentially cycle event loop once before tearing it apart fixes it.

Thanks for your detailed explanation. But I think we are a little off-topic. Do you mean that I should "fix" the litmus program rather than fix asyncio? I don't think that makes sense because the litmus program is to show the difference behavior of asyncio, which I think is a bug of the cleaning-up process. The litmus program is not to be used or debugged. Its purpose is to show the internal state of the runtime.

Let go back to the track by printing the lifetime of the litmus program and compare before/after the commit 7015e137: gh-88050: Fix asyncio subprocess to kill process cleanly when process is blocked (#32073) has been made.

Let's a line of code print('Handle:', handle, file=sys.stderr) in base_events.py:1937, just at the line after handle = self._ready.popleft():

        ntodo = len(self._ready)
        for i in range(ntodo):
            handle = self._ready.popleft()
            print('Handle:', handle, file=sys.stderr) # <---- add a log here
            if handle._cancelled:
                continue

And add another line print('Waking up exit waiters...', file=sys.stderr), in file base_subprocess.py, just one line after the comment # wake up futures waiting for wait(), which looks like:

        # wake up futures waiting for wait()
        print('Waking up exit waiters...', file=sys.stderr) # <---- add a log here
        for waiter in self._exit_waiters:
            if not waiter.cancelled():
                waiter.set_result(returncode)
        self._exit_waiters = None

To synchronize all outputs, we also add file=sys.stderr in all print calls from the litmus program.

Current behavior

When running without the PR, it prints:

/tmp/cpython/bisect/cpython/python /tmp/cpython/bisect/cpython/litmus.py 
Handle: <Handle <_asyncio.TaskStepMethWrapper object at 0x7faf2641b6a0>()>
Handle: <Handle <_asyncio.TaskStepMethWrapper object at 0x7faf26dad7b0>()>
Handle: <Handle <_asyncio.TaskStepMethWrapper object at 0x7faf26418cd0>()>
Finding out who am I...
Handle: <Handle _run_until_complete_cb(<Task finishe...e_callback'")>) at /tmp/cpython/bisect/cpython/Lib/asyncio/base_events.py:180>
Handle: <Handle <_asyncio.TaskStepMethWrapper object at 0x7faf2641b6a0>()>
Handle: <Handle <_asyncio.TaskStepMethWrapper object at 0x7faf26445570>()>
Handle: <Handle Task.task_wakeup(<Future cancelled>)>
Handle: <Handle PidfdChildWatcher._do_wait(1976482, 7, <bound method... debug=False>>, (<_UnixSubproc...76482 running>,))>
child process pid 1976482 exit status already read:  will report returncode 255
Handle: <Handle gather.<locals>._done_callback(<Task cancell...tasks.py:643>>) at /tmp/cpython/bisect/cpython/Lib/asyncio/tasks.py:757>
Handle: <Handle gather.<locals>._done_callback(<Task cancell...ocess.py:158>>) at /tmp/cpython/bisect/cpython/Lib/asyncio/tasks.py:757>
Handle: <Handle BaseEventLoop.call_soon(<bound method...6482 running>>, 255)>
Handle: <Handle BaseSelectorEventLoop._read_from_self()>
Handle: <Handle BaseSubprocessTransport._process_exited(255)>

<---- program hangs here

It's clear that the main program hangs, while the subprocess has exited.

The last "good" revision, before 7015e137

Now let's do the same experiment on commit 0e72606d, which is the last revision on which the litmus program won't hang:

/tmp/cpython/bisect/cpython/python /tmp/cpython/bisect/cpython/litmus.py 
Finding out who am I...
Handle: <Handle <TaskStepMethWrapper object at 0x7fc3e3749540>()>
Handle: <Handle <TaskStepMethWrapper object at 0x7fc3e37494b0>()>
Handle: <Handle <TaskStepMethWrapper object at 0x7fc3e3749450>()>
Handle: <Handle _run_until_complete_cb(<Task finishe...e_callback'")>) at /tmp/cpython/bisect/cpython/Lib/asyncio/base_events.py:180>
Handle: <Handle <TaskStepMethWrapper object at 0x7fc3e3749540>()>
Handle: <Handle <TaskStepMethWrapper object at 0x7fc3e41c13c0>()>
Handle: <Handle Task.task_wakeup(<Future cancelled>)>
Handle: <Handle gather.<locals>._done_callback(<Task cancell...tasks.py:635>>) at /tmp/cpython/bisect/cpython/Lib/asyncio/tasks.py:762>
Handle: <Handle gather.<locals>._done_callback(<Task cancell...ocess.py:158>>) at /tmp/cpython/bisect/cpython/Lib/asyncio/tasks.py:762>
Handle: <Handle _UnixSelectorEventLoop._child_watcher_callback(1964499, 0, <_UnixSubproc...64499 running>)>
Handle: <Handle BaseSelectorEventLoop._read_from_self()>
Handle: <Handle BaseEventLoop.call_soon(<bound method...4499 running>>, 0)>
Handle: <Handle BaseSubprocessTransport._process_exited(0)>
Waking up exit waiters...
Handle: <Handle Task.task_wakeup(<Future finished result=0>)>
Handle: <Handle gather.<locals>._done_callback(<Task cancell.../litmus.py:7>>) at /tmp/cpython/bisect/cpython/Lib/asyncio/tasks.py:762>
Handle: <Handle _run_until_complete_cb(<_GatheringFu...ledError('')]>) at /tmp/cpython/bisect/cpython/Lib/asyncio/base_events.py:180>
Handle: <Handle <TaskStepMethWrapper object at 0x7fc3e3e0f010>()>
Handle: <Handle _run_until_complete_cb(<Task finishe...> result=None>) at /tmp/cpython/bisect/cpython/Lib/asyncio/base_events.py:180>
Handle: <Handle <TaskStepMethWrapper object at 0x7fc3e41c13c0>()>
Handle: <Handle _run_until_complete_cb(<Task finishe...> result=None>) at /tmp/cpython/bisect/cpython/Lib/asyncio/base_events.py:180>
Traceback (most recent call last):
  File "/tmp/cpython/bisect/cpython/litmus.py", line 26, in <module>
    asyncio.run(main())
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/base_events.py", line 664, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/tmp/cpython/bisect/cpython/litmus.py", line 22, in main
    await asyncio.wait([0, t2])
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/tasks.py", line 426, in wait
    return await _wait(fs, timeout, return_when, loop)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/tasks.py", line 530, in _wait
    f.add_done_callback(_on_completion)
    ^^^^^^^^^^^^^^^^^^^
AttributeError: 'int' object has no attribute 'add_done_callback'

Process finished with exit code 1

The program terminates successfully without any interference. It's also clear that those waiters has been woke up after the process exits. So I think the real problem is: previous fix 7015e137 breaks this cleanup process. And we have to fix that.

The fix made by myself (PR)

After applying the fix PR, it prints:

/tmp/cpython/bisect/cpython/python /tmp/cpython/bisect/cpython/litmus.py 
Handle: <Handle <_asyncio.TaskStepMethWrapper object at 0x7f3dacf29240>()>
Handle: <Handle <_asyncio.TaskStepMethWrapper object at 0x7f3dacf29360>()>
Handle: <Handle <_asyncio.TaskStepMethWrapper object at 0x7f3dacf2a140>()>
Finding out who am I...
Handle: <Handle _run_until_complete_cb(<Task finishe...e_callback'")>) at /tmp/cpython/bisect/cpython/Lib/asyncio/base_events.py:180>
Handle: <Handle <_asyncio.TaskStepMethWrapper object at 0x7f3dacf29240>()>
Handle: <Handle <_asyncio.TaskStepMethWrapper object at 0x7f3dad919660>()>
Handle: <Handle Task.task_wakeup(<Future cancelled>)>
Handle: <Handle PidfdChildWatcher._do_wait(1980642, 7, <bound method... debug=False>>, (<_UnixSubproc...80642 running>,))>
child process pid 1980642 exit status already read:  will report returncode 255
Handle: <Handle gather.<locals>._done_callback(<Task cancell...tasks.py:643>>) at /tmp/cpython/bisect/cpython/Lib/asyncio/tasks.py:757>
Handle: <Handle gather.<locals>._done_callback(<Task cancell...ocess.py:158>>) at /tmp/cpython/bisect/cpython/Lib/asyncio/tasks.py:757>
Handle: <Handle BaseEventLoop.call_soon(<bound method...0642 running>>, 255)>
Handle: <Handle BaseSelectorEventLoop._read_from_self()>
Handle: <Handle BaseSubprocessTransport._process_exited(255)>
Waking up exit waiters...
Handle: <Handle Task.task_wakeup(<Future finished result=255>)>
Handle: <Handle gather.<locals>._done_callback(<Task cancell.../litmus.py:7>>) at /tmp/cpython/bisect/cpython/Lib/asyncio/tasks.py:757>
Handle: <Handle _run_until_complete_cb(<_GatheringFu...ledError('')]>) at /tmp/cpython/bisect/cpython/Lib/asyncio/base_events.py:180>
Handle: <Handle <_asyncio.TaskStepMethWrapper object at 0x7f3dacf2a140>()>
Handle: <Handle _run_until_complete_cb(<Task finishe...> result=None>) at /tmp/cpython/bisect/cpython/Lib/asyncio/base_events.py:180>
Handle: <Handle <_asyncio.TaskStepMethWrapper object at 0x7f3dacf29f60>()>
Handle: <Handle _run_until_complete_cb(<Task finishe...> result=None>) at /tmp/cpython/bisect/cpython/Lib/asyncio/base_events.py:180>
Traceback (most recent call last):
  File "/tmp/cpython/bisect/cpython/litmus.py", line 26, in <module>
    asyncio.run(main())
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/base_events.py", line 664, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/tmp/cpython/bisect/cpython/litmus.py", line 22, in main
    await asyncio.wait([0, t2])
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/tasks.py", line 454, in wait
    return await _wait(fs, timeout, return_when, loop)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/cpython/bisect/cpython/Lib/asyncio/tasks.py", line 537, in _wait
    f.add_done_callback(_on_completion)
    ^^^^^^^^^^^^^^^^^^^
AttributeError: 'int' object has no attribute 'add_done_callback'

Process finished with exit code 1

Now the litmus program behaves exactly the same before commit 7015e137. I think we should treat it seriously since Python is such a popular language. This flaw could also be triggered in other places, though I have not found out.

BTW, the CI is failing because I am experimenting on how to write a test case that works as expected in my branch. The test works on my Linux machine, but I doesn't have Windows or MacOS testing environments, so I have to use the public CI. The frozen event loop makes testing much harder, and I am still finding out how to mark those exceptions as expected.

Please consider twice before closing that PR. Or maybe we need another developer to review that?

@kumaraditya303 @gvanrossum

@kumaraditya303
Copy link
Contributor

Please consider twice before closing that PR. Or maybe we need another developer to review that?

I have reopened the PR (I meant to request changes but somehow it got closed and I didn't look again sorry for that ) though I wasn't referring to the test failure added by your test but rather other tests. I referred to the fix as wrong as it basically reverts my fix. That change isn't straight forward to understand, see my explanation of it in #88050 (comment)

@kumaraditya303
Copy link
Contributor

For your reference, the test failure I was referring to were from https://github.com/python/cpython/actions/runs/5212830335/jobs/9406983274, see the warnings in other tests not added by your PR.

@graingert
Copy link
Contributor

graingert commented Oct 22, 2024

This problem can be avoided by using a TaskGroup:

import asyncio
from asyncio import subprocess


async def whoami():
    print('Finding out who am I...')
    proc = await subprocess.create_subprocess_exec(
        'whoami',
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    stdout, _ = await proc.communicate()
    print(f'I am {stdout}')


async def main():
    async with asyncio.TaskGroup() as tg:
        tg.create_task(asyncio.sleep(0))
        t2 = tg.create_task(whoami())
        # Both `[0, t1]` and `[0, t2]` can cause the problem
        await asyncio.wait([0, t2])


if __name__ == '__main__':
    asyncio.run(main())

this is a duplicate of #103847

@graingert graingert closed this as not planned Won't fix, can't repro, duplicate, stale Oct 22, 2024
@github-project-automation github-project-automation bot moved this from Todo to Done in asyncio Oct 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-asyncio type-bug An unexpected behavior, bug, or error
Projects
Status: Done
Development

No branches or pull requests

4 participants