Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coverage.py test suite failure since commit 0749244d134 #115832

Closed
nedbat opened this issue Feb 22, 2024 · 5 comments · Fixed by #115856
Closed

Coverage.py test suite failure since commit 0749244d134 #115832

nedbat opened this issue Feb 22, 2024 · 5 comments · Fixed by #115856
Assignees
Labels
type-bug An unexpected behavior, bug, or error

Comments

@nedbat
Copy link
Member

nedbat commented Feb 22, 2024

Bug report

Bug description:

Coverage.py runs its test suite against nightly builds of Python. The tests stalled last night. Running them locally, there's one test that freezes the suite. The GitHub action is eventually cancelled by timing out after six hours: https://github.com/nedbat/coveragepy/actions/runs/8000886484

Bisecting CPython, the first bad commit is 0749244

commit 0749244d13412d7cb5b53d834f586f2198f5b9a6
Author: Brett Simmers <[email protected]>
Date:   Tue Feb 20 06:57:48 2024 -0800

    gh-112175: Add `eval_breaker` to `PyThreadState` (#115194)

    This change adds an `eval_breaker` field to `PyThreadState`. The primary
    motivation is for performance in free-threaded builds: with thread-local eval
    breakers, we can stop a specific thread (e.g., for an async exception) without
    interrupting other threads.

    The source of truth for the global instrumentation version is stored in the
    `instrumentation_version` field in PyInterpreterState. Threads usually read the
    version from their local `eval_breaker`, where it continues to be colocated
    with the eval breaker bits.

I'm still investigating, but I'm hoping that someone familiar with the change (@swtaarrs?) will be able to help.

To reproduce:

% git clone https://github.com/nedbat/coveragepy
% cd coveragepy
% export COVERAGE_ANYPY=/path/to/your/python3
% pip install tox
% tox -qre anypy -- -n 0 -vvv -k test_multiprocessing_simple
=== CPython 3.13.0a4+ (rev 0749244d13) with C tracer (.tox/anypy/bin/python) ===
===================================================================== test session starts ======================================================================
platform darwin -- Python 3.13.0a4+, pytest-8.0.1, pluggy-1.4.0 -- /Users/ned/coverage/trunk/.tox/anypy/bin/python
cachedir: .tox/anypy/.pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/Users/ned/coverage/trunk/.hypothesis/examples'))
rootdir: /Users/ned/coverage/trunk
configfile: pyproject.toml
plugins: flaky-3.7.0, xdist-3.5.0, hypothesis-6.98.9
collected 1386 items / 1384 deselected / 2 selected
run-last-failure: no previously failed tests, not deselecting items.

tests/test_concurrency.py::MultiprocessingTest::test_multiprocessing_simple[fork] PASSED                                                                 [ 50%]
tests/test_concurrency.py::MultiprocessingTest::test_multiprocessing_simple[spawn] PASSED                                                                [100%]

============================================================== 2 passed, 1384 deselected in 3.37s ==============================================================
=== CPython 3.13.0a4+ (rev 0749244d13) with sys.monitoring (.tox/anypy/bin/python) ===
===================================================================== test session starts ======================================================================
platform darwin -- Python 3.13.0a4+, pytest-8.0.1, pluggy-1.4.0 -- /Users/ned/coverage/trunk/.tox/anypy/bin/python
cachedir: .tox/anypy/.pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/Users/ned/coverage/trunk/.hypothesis/examples'))
rootdir: /Users/ned/coverage/trunk
configfile: pyproject.toml
plugins: flaky-3.7.0, xdist-3.5.0, hypothesis-6.98.9
collected 1386 items / 1384 deselected / 2 selected
run-last-failure: no previously failed tests, not deselecting items.

tests/test_concurrency.py::MultiprocessingTest::test_multiprocessing_simple[fork] PASSED                                                                 [ 50%]
tests/test_concurrency.py::MultiprocessingTest::test_multiprocessing_simple[spawn]

and there it is stuck.

CPython versions tested on:

CPython main branch

Operating systems tested on:

Linux, macOS

Linked PRs

@nedbat nedbat added the type-bug An unexpected behavior, bug, or error label Feb 22, 2024
@nedbat
Copy link
Member Author

nedbat commented Feb 22, 2024

A bit more explanation: coverage.py can run with three different "measurement cores": a sys.settrace function written in C, or in Python, or with sys.monitoring. The tests are fine with either sys.settrace, but freeze with sys.monitoring.

There are four different multiprocessing tests, which can run with fork or spawn. Any of the four freeze with spawn. Excluding "spawn", the entire test suite runs to completion.

@swtaarrs
Copy link
Member

I'm investigating and will report back with any findings later today.

@swtaarrs
Copy link
Member

I've figured out what's happening and I have a fix in mind, so you can assign this to me.

My commit added what's essentially a thread-local cache of the global instrumentation version. During interpreter shutdown, that can get out of sync with the global version (the root of the issue). If we run Python code later in the shutdown process (from a GC, in this case), that can cause an infinite loop because this check thinks the current code object is out of date, while this check deeper in _Py_Instrument() thinks the code object is up to date, and doesn't do anything to instrument it.

@nedbat
Copy link
Member Author

nedbat commented Feb 22, 2024

Great, I would never have been able to figure that out!

swtaarrs added a commit to swtaarrs/cpython that referenced this issue Feb 23, 2024
…ter shutdown

In python/cpython@bc613e4ecb3c, I introduced a bug to `interpreter_clear()`: it
sets `interp->ceval.instrumentation_version` to 0, without making the
corresponding change to `tstate->eval_breaker` (which holds a thread-local copy
of the version). After this happens, Python code can still run due to object
finalizers during a GC, and [this version check in
bytecodes.c](https://github.com/python/cpython/blob/4ee6bdfbaa792a3aa93c65c2022a89bd2d1e0894/Python/bytecodes.c#L147-L152)
will see a different result than [this one in
instrumentation.c](https://github.com/python/cpython/blob/4ee6bdfbaa792a3aa93c65c2022a89bd2d1e0894/Python/instrumentation.c#L894-L895),
causing an infinite loop.

The fix itself is straightforward, and is what I should've done in
`interpreter_clear()` in the first place: also clear `tstate->eval_breaker`
when clearing `interp->ceval.instrumentation_version`. I also restored a
comment that I'm not sure why I deleted in the original commit.

To make bugs of this type less likely in the future, I changed
`instrumentation.c:global_version()` to read the version from a
`PyThreadState*` rather than a `PyInterpreterState*`, so it's reading the
version from the same location as the interpreter loop. This had some fan-out
effects on callers, although a lot of them were passing `tstate->interp`
already, so this made them (very) slighty shorter.

- Issue: pythongh-115832
@swtaarrs
Copy link
Member

I have a fix ready, but it depends on #115837 and I'm not sure if there's a good way to do stacked PRs, so I'll wait for that to land before posting it.

swtaarrs added a commit to swtaarrs/cpython that referenced this issue Feb 23, 2024
…ter shutdown

In python/cpython@bc613e4ecb3c, I introduced a bug to `interpreter_clear()`: it
sets `interp->ceval.instrumentation_version` to 0, without making the
corresponding change to `tstate->eval_breaker` (which holds a thread-local copy
of the version). After this happens, Python code can still run due to object
finalizers during a GC, and [this version check in
bytecodes.c](https://github.com/python/cpython/blob/4ee6bdfbaa792a3aa93c65c2022a89bd2d1e0894/Python/bytecodes.c#L147-L152)
will see a different result than [this one in
instrumentation.c](https://github.com/python/cpython/blob/4ee6bdfbaa792a3aa93c65c2022a89bd2d1e0894/Python/instrumentation.c#L894-L895),
causing an infinite loop.

The fix itself is straightforward, and is what I should've done in
`interpreter_clear()` in the first place: also clear `tstate->eval_breaker`
when clearing `interp->ceval.instrumentation_version`. I also restored a
comment that I'm not sure why I deleted in the original commit.

To make bugs of this type less likely in the future, I changed
`instrumentation.c:global_version()` to read the version from a
`PyThreadState*` rather than a `PyInterpreterState*`, so it's reading the
version from the same location as the interpreter loop. This had some fan-out
effects on callers, although a lot of them were passing `tstate->interp`
already, so this made them (very) slighty shorter.

- Issue: pythongh-115832
swtaarrs added a commit to swtaarrs/cpython that referenced this issue Feb 23, 2024
…ter shutdown

In python/cpython@0749244d13412d, I introduced a bug to `interpreter_clear()`: it
sets `interp->ceval.instrumentation_version` to 0, without making the
corresponding change to `tstate->eval_breaker` (which holds a thread-local copy
of the version). After this happens, Python code can still run due to object
finalizers during a GC, and [this version check in
bytecodes.c](https://github.com/python/cpython/blob/4ee6bdfbaa792a3aa93c65c2022a89bd2d1e0894/Python/bytecodes.c#L147-L152)
will see a different result than [this one in
instrumentation.c](https://github.com/python/cpython/blob/4ee6bdfbaa792a3aa93c65c2022a89bd2d1e0894/Python/instrumentation.c#L894-L895),
causing an infinite loop.

The fix itself is straightforward, and is what I should've done in
`interpreter_clear()` in the first place: also clear `tstate->eval_breaker`
when clearing `interp->ceval.instrumentation_version`. I also restored a
comment that I'm not sure why I deleted in the original commit.

To make bugs of this type less likely in the future, I changed
`instrumentation.c:global_version()` to read the version from a
`PyThreadState*` rather than a `PyInterpreterState*`, so it's reading the
version from the same location as the interpreter loop. This had some fan-out
effects on callers, although a lot of them were passing `tstate->interp`
already, so this made them (very) slighty shorter.

- Issue: pythongh-115832
swtaarrs added a commit to swtaarrs/cpython that referenced this issue Feb 23, 2024
…ter shutdown

In python/cpython@0749244d13412d, I introduced a bug to `interpreter_clear()`: it
sets `interp->ceval.instrumentation_version` to 0, without making the
corresponding change to `tstate->eval_breaker` (which holds a thread-local copy
of the version). After this happens, Python code can still run due to object
finalizers during a GC, and [this version check in
bytecodes.c](https://github.com/python/cpython/blob/4ee6bdfbaa792a3aa93c65c2022a89bd2d1e0894/Python/bytecodes.c#L147-L152)
will see a different result than [this one in
instrumentation.c](https://github.com/python/cpython/blob/4ee6bdfbaa792a3aa93c65c2022a89bd2d1e0894/Python/instrumentation.c#L894-L895),
causing an infinite loop.

The fix itself is straightforward, and is what I should've done in
`interpreter_clear()` in the first place: also clear `tstate->eval_breaker`
when clearing `interp->ceval.instrumentation_version`. I also restored a
comment that I'm not sure why I deleted in the original commit.

To make bugs of this type less likely in the future, I changed
`instrumentation.c:global_version()` to read the version from a
`PyThreadState*` rather than a `PyInterpreterState*`, so it's reading the
version from the same location as the interpreter loop. This had some fan-out
effects on its transitive callers, although most of them already had the
current tstate availale.

- Issue: pythongh-115832
colesbury pushed a commit that referenced this issue Mar 4, 2024
…utdown (#115856)

A previous commit introduced a bug to `interpreter_clear()`: it set
`interp->ceval.instrumentation_version` to 0, without making the corresponding
change to `tstate->eval_breaker` (which holds a thread-local copy of the
version). After this happens, Python code can still run due to object finalizers
during a GC, and the version check in bytecodes.c will see a different result
than the one in instrumentation.c causing an infinite loop.

The fix itself is straightforward: clear `tstate->eval_breaker` when clearing
`interp->ceval.instrumentation_version`.
woodruffw pushed a commit to woodruffw-forks/cpython that referenced this issue Mar 4, 2024
…ter shutdown (python#115856)

A previous commit introduced a bug to `interpreter_clear()`: it set
`interp->ceval.instrumentation_version` to 0, without making the corresponding
change to `tstate->eval_breaker` (which holds a thread-local copy of the
version). After this happens, Python code can still run due to object finalizers
during a GC, and the version check in bytecodes.c will see a different result
than the one in instrumentation.c causing an infinite loop.

The fix itself is straightforward: clear `tstate->eval_breaker` when clearing
`interp->ceval.instrumentation_version`.
adorilson pushed a commit to adorilson/cpython that referenced this issue Mar 25, 2024
…ter shutdown (python#115856)

A previous commit introduced a bug to `interpreter_clear()`: it set
`interp->ceval.instrumentation_version` to 0, without making the corresponding
change to `tstate->eval_breaker` (which holds a thread-local copy of the
version). After this happens, Python code can still run due to object finalizers
during a GC, and the version check in bytecodes.c will see a different result
than the one in instrumentation.c causing an infinite loop.

The fix itself is straightforward: clear `tstate->eval_breaker` when clearing
`interp->ceval.instrumentation_version`.
diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024
…ter shutdown (python#115856)

A previous commit introduced a bug to `interpreter_clear()`: it set
`interp->ceval.instrumentation_version` to 0, without making the corresponding
change to `tstate->eval_breaker` (which holds a thread-local copy of the
version). After this happens, Python code can still run due to object finalizers
during a GC, and the version check in bytecodes.c will see a different result
than the one in instrumentation.c causing an infinite loop.

The fix itself is straightforward: clear `tstate->eval_breaker` when clearing
`interp->ceval.instrumentation_version`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants