Coverage.py test suite failure since commit 0749244d134 #115832

nedbat · 2024-02-22T18:45:02Z

Bug report

Bug description:

Coverage.py runs its test suite against nightly builds of Python. The tests stalled last night. Running them locally, there's one test that freezes the suite. The GitHub action is eventually cancelled by timing out after six hours: https://github.com/nedbat/coveragepy/actions/runs/8000886484

Bisecting CPython, the first bad commit is 0749244

commit 0749244d13412d7cb5b53d834f586f2198f5b9a6
Author: Brett Simmers <[email protected]>
Date:   Tue Feb 20 06:57:48 2024 -0800

    gh-112175: Add `eval_breaker` to `PyThreadState` (#115194)

    This change adds an `eval_breaker` field to `PyThreadState`. The primary
    motivation is for performance in free-threaded builds: with thread-local eval
    breakers, we can stop a specific thread (e.g., for an async exception) without
    interrupting other threads.

    The source of truth for the global instrumentation version is stored in the
    `instrumentation_version` field in PyInterpreterState. Threads usually read the
    version from their local `eval_breaker`, where it continues to be colocated
    with the eval breaker bits.

I'm still investigating, but I'm hoping that someone familiar with the change (@swtaarrs?) will be able to help.

To reproduce:

% git clone https://github.com/nedbat/coveragepy
% cd coveragepy
% export COVERAGE_ANYPY=/path/to/your/python3
% pip install tox
% tox -qre anypy -- -n 0 -vvv -k test_multiprocessing_simple
=== CPython 3.13.0a4+ (rev 0749244d13) with C tracer (.tox/anypy/bin/python) ===
===================================================================== test session starts ======================================================================
platform darwin -- Python 3.13.0a4+, pytest-8.0.1, pluggy-1.4.0 -- /Users/ned/coverage/trunk/.tox/anypy/bin/python
cachedir: .tox/anypy/.pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/Users/ned/coverage/trunk/.hypothesis/examples'))
rootdir: /Users/ned/coverage/trunk
configfile: pyproject.toml
plugins: flaky-3.7.0, xdist-3.5.0, hypothesis-6.98.9
collected 1386 items / 1384 deselected / 2 selected
run-last-failure: no previously failed tests, not deselecting items.

tests/test_concurrency.py::MultiprocessingTest::test_multiprocessing_simple[fork] PASSED                                                                 [ 50%]
tests/test_concurrency.py::MultiprocessingTest::test_multiprocessing_simple[spawn] PASSED                                                                [100%]

============================================================== 2 passed, 1384 deselected in 3.37s ==============================================================
=== CPython 3.13.0a4+ (rev 0749244d13) with sys.monitoring (.tox/anypy/bin/python) ===
===================================================================== test session starts ======================================================================
platform darwin -- Python 3.13.0a4+, pytest-8.0.1, pluggy-1.4.0 -- /Users/ned/coverage/trunk/.tox/anypy/bin/python
cachedir: .tox/anypy/.pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/Users/ned/coverage/trunk/.hypothesis/examples'))
rootdir: /Users/ned/coverage/trunk
configfile: pyproject.toml
plugins: flaky-3.7.0, xdist-3.5.0, hypothesis-6.98.9
collected 1386 items / 1384 deselected / 2 selected
run-last-failure: no previously failed tests, not deselecting items.

tests/test_concurrency.py::MultiprocessingTest::test_multiprocessing_simple[fork] PASSED                                                                 [ 50%]
tests/test_concurrency.py::MultiprocessingTest::test_multiprocessing_simple[spawn]

and there it is stuck.

CPython versions tested on:

CPython main branch

Operating systems tested on:

Linux, macOS

Linked PRs

gh-115832: Fix instrumentation version mismatch during interpreter shutdown #115856

The text was updated successfully, but these errors were encountered:

nedbat · 2024-02-22T18:57:12Z

A bit more explanation: coverage.py can run with three different "measurement cores": a sys.settrace function written in C, or in Python, or with sys.monitoring. The tests are fine with either sys.settrace, but freeze with sys.monitoring.

There are four different multiprocessing tests, which can run with fork or spawn. Any of the four freeze with spawn. Excluding "spawn", the entire test suite runs to completion.

swtaarrs · 2024-02-22T19:22:03Z

I'm investigating and will report back with any findings later today.

swtaarrs · 2024-02-22T23:09:45Z

I've figured out what's happening and I have a fix in mind, so you can assign this to me.

My commit added what's essentially a thread-local cache of the global instrumentation version. During interpreter shutdown, that can get out of sync with the global version (the root of the issue). If we run Python code later in the shutdown process (from a GC, in this case), that can cause an infinite loop because this check thinks the current code object is out of date, while this check deeper in _Py_Instrument() thinks the code object is up to date, and doesn't do anything to instrument it.

nedbat · 2024-02-22T23:11:44Z

Great, I would never have been able to figure that out!

…ter shutdown In python/cpython@bc613e4ecb3c, I introduced a bug to `interpreter_clear()`: it sets `interp->ceval.instrumentation_version` to 0, without making the corresponding change to `tstate->eval_breaker` (which holds a thread-local copy of the version). After this happens, Python code can still run due to object finalizers during a GC, and [this version check in bytecodes.c](https://github.com/python/cpython/blob/4ee6bdfbaa792a3aa93c65c2022a89bd2d1e0894/Python/bytecodes.c#L147-L152) will see a different result than [this one in instrumentation.c](https://github.com/python/cpython/blob/4ee6bdfbaa792a3aa93c65c2022a89bd2d1e0894/Python/instrumentation.c#L894-L895), causing an infinite loop. The fix itself is straightforward, and is what I should've done in `interpreter_clear()` in the first place: also clear `tstate->eval_breaker` when clearing `interp->ceval.instrumentation_version`. I also restored a comment that I'm not sure why I deleted in the original commit. To make bugs of this type less likely in the future, I changed `instrumentation.c:global_version()` to read the version from a `PyThreadState*` rather than a `PyInterpreterState*`, so it's reading the version from the same location as the interpreter loop. This had some fan-out effects on callers, although a lot of them were passing `tstate->interp` already, so this made them (very) slighty shorter. - Issue: pythongh-115832

swtaarrs · 2024-02-23T01:30:56Z

I have a fix ready, but it depends on #115837 and I'm not sure if there's a good way to do stacked PRs, so I'll wait for that to land before posting it.

…ter shutdown In python/cpython@bc613e4ecb3c, I introduced a bug to `interpreter_clear()`: it sets `interp->ceval.instrumentation_version` to 0, without making the corresponding change to `tstate->eval_breaker` (which holds a thread-local copy of the version). After this happens, Python code can still run due to object finalizers during a GC, and [this version check in bytecodes.c](https://github.com/python/cpython/blob/4ee6bdfbaa792a3aa93c65c2022a89bd2d1e0894/Python/bytecodes.c#L147-L152) will see a different result than [this one in instrumentation.c](https://github.com/python/cpython/blob/4ee6bdfbaa792a3aa93c65c2022a89bd2d1e0894/Python/instrumentation.c#L894-L895), causing an infinite loop. The fix itself is straightforward, and is what I should've done in `interpreter_clear()` in the first place: also clear `tstate->eval_breaker` when clearing `interp->ceval.instrumentation_version`. I also restored a comment that I'm not sure why I deleted in the original commit. To make bugs of this type less likely in the future, I changed `instrumentation.c:global_version()` to read the version from a `PyThreadState*` rather than a `PyInterpreterState*`, so it's reading the version from the same location as the interpreter loop. This had some fan-out effects on callers, although a lot of them were passing `tstate->interp` already, so this made them (very) slighty shorter. - Issue: pythongh-115832

…ter shutdown In python/cpython@0749244d13412d, I introduced a bug to `interpreter_clear()`: it sets `interp->ceval.instrumentation_version` to 0, without making the corresponding change to `tstate->eval_breaker` (which holds a thread-local copy of the version). After this happens, Python code can still run due to object finalizers during a GC, and [this version check in bytecodes.c](https://github.com/python/cpython/blob/4ee6bdfbaa792a3aa93c65c2022a89bd2d1e0894/Python/bytecodes.c#L147-L152) will see a different result than [this one in instrumentation.c](https://github.com/python/cpython/blob/4ee6bdfbaa792a3aa93c65c2022a89bd2d1e0894/Python/instrumentation.c#L894-L895), causing an infinite loop. The fix itself is straightforward, and is what I should've done in `interpreter_clear()` in the first place: also clear `tstate->eval_breaker` when clearing `interp->ceval.instrumentation_version`. I also restored a comment that I'm not sure why I deleted in the original commit. To make bugs of this type less likely in the future, I changed `instrumentation.c:global_version()` to read the version from a `PyThreadState*` rather than a `PyInterpreterState*`, so it's reading the version from the same location as the interpreter loop. This had some fan-out effects on callers, although a lot of them were passing `tstate->interp` already, so this made them (very) slighty shorter. - Issue: pythongh-115832

…ter shutdown In python/cpython@0749244d13412d, I introduced a bug to `interpreter_clear()`: it sets `interp->ceval.instrumentation_version` to 0, without making the corresponding change to `tstate->eval_breaker` (which holds a thread-local copy of the version). After this happens, Python code can still run due to object finalizers during a GC, and [this version check in bytecodes.c](https://github.com/python/cpython/blob/4ee6bdfbaa792a3aa93c65c2022a89bd2d1e0894/Python/bytecodes.c#L147-L152) will see a different result than [this one in instrumentation.c](https://github.com/python/cpython/blob/4ee6bdfbaa792a3aa93c65c2022a89bd2d1e0894/Python/instrumentation.c#L894-L895), causing an infinite loop. The fix itself is straightforward, and is what I should've done in `interpreter_clear()` in the first place: also clear `tstate->eval_breaker` when clearing `interp->ceval.instrumentation_version`. I also restored a comment that I'm not sure why I deleted in the original commit. To make bugs of this type less likely in the future, I changed `instrumentation.c:global_version()` to read the version from a `PyThreadState*` rather than a `PyInterpreterState*`, so it's reading the version from the same location as the interpreter loop. This had some fan-out effects on its transitive callers, although most of them already had the current tstate availale. - Issue: pythongh-115832

…utdown (#115856) A previous commit introduced a bug to `interpreter_clear()`: it set `interp->ceval.instrumentation_version` to 0, without making the corresponding change to `tstate->eval_breaker` (which holds a thread-local copy of the version). After this happens, Python code can still run due to object finalizers during a GC, and the version check in bytecodes.c will see a different result than the one in instrumentation.c causing an infinite loop. The fix itself is straightforward: clear `tstate->eval_breaker` when clearing `interp->ceval.instrumentation_version`.

…ter shutdown (python#115856) A previous commit introduced a bug to `interpreter_clear()`: it set `interp->ceval.instrumentation_version` to 0, without making the corresponding change to `tstate->eval_breaker` (which holds a thread-local copy of the version). After this happens, Python code can still run due to object finalizers during a GC, and the version check in bytecodes.c will see a different result than the one in instrumentation.c causing an infinite loop. The fix itself is straightforward: clear `tstate->eval_breaker` when clearing `interp->ceval.instrumentation_version`.

nedbat added the type-bug An unexpected behavior, bug, or error label Feb 22, 2024

nedbat assigned swtaarrs Feb 22, 2024

swtaarrs mentioned this issue Feb 23, 2024

test_monitoring contains hardcoded line numbers #115836

Closed

swtaarrs mentioned this issue Feb 23, 2024

gh-115832: Fix instrumentation version mismatch during interpreter shutdown #115856

Merged

swtaarrs linked a pull request Feb 27, 2024 that will close this issue

gh-115832: Fix instrumentation version mismatch during interpreter shutdown #115856

Merged

colesbury closed this as completed in #115856 Mar 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Coverage.py test suite failure since commit 0749244d134 #115832

Coverage.py test suite failure since commit 0749244d134 #115832

nedbat commented Feb 22, 2024 •

edited by bedevere-app bot

Loading

nedbat commented Feb 22, 2024

swtaarrs commented Feb 22, 2024

swtaarrs commented Feb 22, 2024

nedbat commented Feb 22, 2024

swtaarrs commented Feb 23, 2024

Coverage.py test suite failure since commit 0749244d134 #115832

Coverage.py test suite failure since commit 0749244d134 #115832

Comments

nedbat commented Feb 22, 2024 • edited by bedevere-app bot Loading

Bug report

Bug description:

CPython versions tested on:

Operating systems tested on:

Linked PRs

nedbat commented Feb 22, 2024

swtaarrs commented Feb 22, 2024

swtaarrs commented Feb 22, 2024

nedbat commented Feb 22, 2024

swtaarrs commented Feb 23, 2024

nedbat commented Feb 22, 2024 •

edited by bedevere-app bot

Loading