test_threading.test_4_daemon_threads() crash randomly #110052

vstinner · 2023-09-28T20:19:36Z

On Linux, when I stress test test_threading.test_4_daemon_threads(), it does crash randomly:

./python -m test test_threading -m test_4_daemon_threads -j50 -F --fail-env-changed

gdb traceback:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000006e02f4 in PyInterpreterState_ThreadHead (interp=0xdddddddddddddddd) at Python/pystate.c:1961

warning: Source file is more recent than executable.
1961	    return interp->threads.head;
[Current thread is 1 (Thread 0x7fcbb8ff96c0 (LWP 510515))]
Missing separate debuginfos, use: dnf debuginfo-install bzip2-libs-1.0.8-13.fc38.x86_64 glibc-2.37-5.fc38.x86_64 libffi-3.4.4-2.fc38.x86_64 libgcc-13.2.1-1.fc38.x86_64 openssl-libs-3.0.9-2.fc38.x86_64 xz-libs-5.4.1-1.fc38.x86_64 zlib-1.2.13-3.fc38.x86_64
(gdb) where
#0  0x00000000006e02f4 in PyInterpreterState_ThreadHead (interp=0xdddddddddddddddd) at Python/pystate.c:1961
#1  0x0000000000703090 in _Py_DumpTracebackThreads (fd=2, interp=0xdddddddddddddddd, current_tstate=0x24a34e0) at Python/traceback.c:1331
#2  0x000000000071bcef in faulthandler_dump_traceback (fd=2, all_threads=1, interp=0xa53688 <_PyRuntime+92712>) at ./Modules/faulthandler.c:195
#3  0x000000000071bfed in faulthandler_fatal_error (signum=11) at ./Modules/faulthandler.c:313
#4  <signal handler called>
#5  0x000000000069f808 in take_gil (tstate=0x24a34e0) at Python/ceval_gil.c:360
#6  0x00000000006a03ab in PyEval_RestoreThread (tstate=0x24a34e0) at Python/ceval_gil.c:714
#7  0x000000000074dd93 in portable_lseek (self=0x7fcc186c3590, posobj=0x0, whence=1, suppress_pipe_error=false) at ./Modules/_io/fileio.c:934
#8  0x000000000074de88 in _io_FileIO_tell_impl (self=0x7fcc186c3590) at ./Modules/_io/fileio.c:997
#9  0x000000000074ea4c in _io_FileIO_tell (self=0x7fcc186c3590, _unused_ignored=0x0) at ./Modules/_io/clinic/fileio.c.h:460
(...)

gdb debug:


(gdb) frame 5
#5  0x000000000069f808 in take_gil (tstate=0x24a34e0) at Python/ceval_gil.c:360
360	    struct _gil_runtime_state *gil = ceval->gil;
(gdb) l
355	    }
356	
357	    assert(_PyThreadState_CheckConsistency(tstate));
358	    PyInterpreterState *interp = tstate->interp;
359	    struct _ceval_state *ceval = &interp->ceval;
360	    struct _gil_runtime_state *gil = ceval->gil;
361	
362	    /* Check that _PyEval_InitThreads() was called to create the lock */
363	    assert(gil_created(gil));
364	

(gdb) p /x ceval
$1 = 0xdddddddddddddddd

(gdb) p /x tstate->interp
$2 = 0xdddddddddddddddd


(gdb) p /x tstate
$5 = 0x24a34e0

(gdb) p /x _PyRuntime._finalizing._value 
$3 = 0xab8fa8

(gdb) p /x _PyRuntime._finalizing_id 
$4 = 0x7fcc49fce740

I don't understand why the test didn't exit: _PyThreadState_MustExit() should return, no?

I don't understand why assert(_PyThreadState_CheckConsistency(tstate)); didn't fail.

Maybe Py_Finalize() was called between the pre-check:

    if (_PyThreadState_MustExit(tstate)) {
        /* bpo-39877: If Py_Finalize() has been called and tstate is not the
           thread which called Py_Finalize(), exit immediately the thread.

           This code path can be reached by a daemon thread after Py_Finalize()
           completes. In this case, tstate is a dangling pointer: points to
           PyThreadState freed memory. */
        PyThread_exit_thread();
    }

    assert(_PyThreadState_CheckConsistency(tstate));

and the code:

    PyInterpreterState *interp = tstate->interp;
    struct _ceval_state *ceval = &interp->ceval;
    struct _gil_runtime_state *gil = ceval->gil;

Linked PRs

The text was updated successfully, but these errors were encountered:

vstinner · 2023-09-28T20:23:06Z

See also issue gh-110031 which looks similar.

vstinner · 2023-09-28T20:53:09Z

The regression was introduced by PR gh-109794: #109794 (comment)

ericsnowcurrently · 2023-09-28T23:21:51Z

I'm looking into this.

./python -m test test_threading -m test_4_daemon_threads -j50 -F --fail-env-changed

I haven't been able to get a crash in over 16 minutes (but I certainly believe you).

vstinner · 2023-09-28T23:31:43Z

I haven't been able to get a crash in over 16 minutes (but I certainly believe you).

What is your operating system? I reproduced the crash on Linux. I didn't try other operating systems.

If you want to make daemon threads crashes more likely, add sleep(1) at the end of PyInterpreterState_Delete(), after free_interpreter().

ericsnowcurrently · 2023-09-28T23:47:31Z

What is your operating system? I reproduced the crash on Linux. I didn't try other operating systems.

It's an 8-core, 24GB RAM Hyper-V Ubuntu VM running on a Windows laptop.

If you want to make daemon threads crashes more likely, add sleep(1) at the end of PyInterpreterState_Delete(), after free_interpreter().

I'll try it.

ericsnowcurrently · 2023-09-28T23:48:00Z

gdb traceback:
,,,
gdb debug:
...
I don't understand why the test didn't exit: _PyThreadState_MustExit() should return, no?

Yeah, that's the critical question.

ericsnowcurrently · 2023-09-29T00:00:03Z

Here's the critical part from gh-109794:

diff --git a/Python/pystate.c b/Python/pystate.c
index dcc6c112215b3..570b5242600c0 100644
--- a/Python/pystate.c
+++ b/Python/pystate.c
@@ -2964,11 +2964,26 @@ _PyThreadState_MustExit(PyThreadState *tstate)
        tstate->interp->runtime to support calls from Python daemon threads.
        After Py_Finalize() has been called, tstate can be a dangling pointer:
        point to PyThreadState freed memory. */
+    unsigned long finalizing_id = _PyRuntimeState_GetFinalizingID(&_PyRuntime);
     PyThreadState *finalizing = _PyRuntimeState_GetFinalizing(&_PyRuntime);
     if (finalizing == NULL) {
+        // XXX This isn't completely safe from daemon thraeds,
+        // since tstate might be a dangling pointer.
         finalizing = _PyInterpreterState_GetFinalizing(tstate->interp);
+        finalizing_id = _PyInterpreterState_GetFinalizingID(tstate->interp);
     }
-    return (finalizing != NULL && finalizing != tstate);
+    // XXX else check &_PyRuntime._main_interpreter._initial_thread
+    if (finalizing == NULL) {
+        return 0;
+    }
+    else if (finalizing == tstate) {
+        return 0;
+    }
+    else if (finalizing_id == PyThread_get_thread_ident()) {
+        /* gh-109793: we must have switched interpreters. */
+        return 0;
+    }
+    return 1;
 }

The most obvious possibility is that we switched to a different thread state during finalization in the thread where finalization is happening and the thread state belongs to the same interpreter. However, that isn't an expected use case for PyThread_Swap() and I'm pretty sure we don't have any code that does that.

ericsnowcurrently · 2023-09-29T00:03:44Z

If you want to make daemon threads crashes more likely, add sleep(1) at the end of PyInterpreterState_Delete(), after free_interpreter().

I'll try it.

No failures in 13 minutes.

vstinner · 2023-09-29T00:16:58Z

No failures in 13 minutes.

How do you build Python? Which configure options?

I'm usually using ./configure --with-pydebug to get assertions.

ericsnowcurrently · 2023-09-29T00:18:56Z

Basically: ./configure --with-pydebug CFLAGS=-O0

ericsnowcurrently · 2023-09-29T00:26:55Z

Is this issue tied to any buildbot/CI failures?

terryjreedy · 2023-09-29T00:54:13Z

6/12 core Win 10, standard PCBuild/build.bat -d

...
0:00:31 load avg: 76.65 [ 87/1] test_threading failed (1 failure)
test test_threading failed -- Traceback (most recent call last):
  File "f:\dev\3x\Lib\test\test_threading.py", line 1172, in test_4_daemon_threads
    rc, out, err = assert_python_ok('-c', script)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "f:\dev\3x\Lib\test\support\script_helper.py", line 166, in assert_python_ok
    return _assert_python(True, *args, **env_vars)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "f:\dev\3x\Lib\test\support\script_helper.py", line 151, in _assert_python
    res.fail(cmd_line)
  File "f:\dev\3x\Lib\test\support\script_helper.py", line 76, in fail
    raise AssertionError("Process return code is %d\n"
AssertionError: Process return code is 3221225477
command line: ['f:\\dev\\3x\\PCbuild\\amd64\\python_d.exe', '-X', 'faulthandler', '-c', "if True:\n            import os\n            import random\n            import sys\n            import time\n            import threading\n\n            thread_has_run = set()\n\n            def random_io():\n                '''Loop for a while sleeping random tiny amounts and doing some I/O.'''\n                import test.test_threading as mod\n                while True:\n                    with open(mod.__file__, 'rb') as in_f:\n                        stuff = in_f.read(200)\n                        with open(os.devnull, 'wb') as null_f:\n                            null_f.write(stuff)\n                            time.sleep(random.random() / 1995)\n                    thread_has_run.add(threading.current_thread())\n\n            def main():\n                count = 0\n                for _ in range(40):\n                    new_thread = threading.Thread(target=random_io)\n                    new_thread.daemon = True\n                    new_thread.start()\n                    count += 1\n                while len(thread_has_run) < count:\n                    time.sleep(0.001)\n                # Trigger process shutdown\n                sys.exit(0)\n\n            main()\n            "]

stdout:
---

---

stderr:
---
Windows fatal exception: access violation

terryjreedy · 2023-09-29T00:58:24Z

Rerun showed this warning first.

0:00:29 load avg: 84.65 [ 75] test_threading passed
f:\dev\3x\Lib\test\support\os_helper.py:531: RuntimeWarning: tests may fail, unable to create temporary directory 'F:\\dev\\3x\\build\\test_python_3812æ': [WinError 183] Cannot create a file when that file already exists: 'F:\\dev\\3x\\build\\test_python_3812æ'
  with temp_dir(path=name, quiet=quiet) as temp_path:
test_4_daemon_threads (test.test_threading.ThreadJoinOnShutdown.test_4_daemon_threads) ... ok

Then same failure at 319.

vstinner · 2023-09-29T01:17:12Z

f:\dev\3x\Lib\test\support\os_helper.py:531: RuntimeWarning: tests may fail, unable to create temporary directory 'F:\dev\3x\build\test_python_3812æ': [WinError 183] Cannot create a file when that file already exists: 'F:\dev\3x\build\test_python_3812æ'

Time to time, you may want to run python -m test --cleanup, to remove test directories which tests left when they crashed. regrtest is supposed to clean them. But well, regrtest itself has also bugs :-)

faulthandler now detected freed interp and freed tstate, and no longer dereference them.

faulthandler now detected freed interp and freed tstate, and no longer dereference them. (cherry picked from commit 2e37a38)

gh-110052: Fix faulthandler for freed tstate (#110069) faulthandler now detected freed interp and freed tstate, and no longer dereference them. Backport to 3.11: add pycore_pymem.h include to traceback.c. (cherry picked from commit 2e37a38)

ericsnowcurrently · 2023-09-29T14:43:05Z

f:\dev\3x\Lib\test\support\os_helper.py:531: RuntimeWarning: tests may fail, unable to create temporary directory 'F:\\dev\\3x\\build\\test_python_3812æ': [WinError 183] Cannot create a file when that file already exists: 'F:\\dev\\3x\\build\\test_python_3812æ'

FWIW, I also saw this warning periodically with ./python -m test test_threading -m test_4_daemon_threads -j50 -F --fail-env-changed (when trying to reproduce the crash), though I did not get any crashes.

ericsnowcurrently · 2023-09-29T14:55:39Z

AssertionError: Process return code is 3221225477

FTR, 3221225477 == 0xC0000005 (STATUS_ACCESS_VIOLATION), AKA segfault.

vstinner · 2023-09-29T15:42:09Z

Ok, let me try to reproduce the issue. I'm testing commit 20bc5f7c28a6f8a2e156c4a748ffabb5efc7c761 (at Fri Sep 29 16:24:38 2023 +0100).

Machine

My laptop: 6 CPU cores / 12 CPU threads
Fedora 38: 12 logicial CPUs
FreeBSD 13 running in a VM on the laptop: 5 vCPUs
Windows 11 running in a VM on the laptop: 6 vCPUs

Prepare Linux

Build a fresh Python:

git checkout main
git pull
git clean -fdx
./configure --cache-file=../python-config.cache --with-pydebug CFLAGS=-O0 --with-system-expat
make -j14

Configure Linux to write coredumps in /tmp:

sudo bash -c 'echo "/tmp/%e-%p.core" > /proc/sys/kernel/core_pattern'
ulimit -c unlimited

Test coredump:

vstinner@mona$ ./python -c 'import ctypes; ctypes.string_at(0)'
Erreur de segmentation (core dumped)
vstinner@mona$ ls /tmp/python*
/tmp/python-1623389.core

Cleanup:

./python -m test --cleanup
rm -rf /tmp/test_python_* /tmp/python*.core

Prepare FreeBSD

Build a fresh Python:

git checkout main
git pull
git clean -fdx
./configure --cache-file=../python-config.cache --with-pydebug CFLAGS=-O0 --with-system-expat
make -j7

Configure FreeBSD to write coredumps in /tmp:

sudo sysctl -w 'kern.corefile=/tmp/%N-%P.core'
ulimit -c unlimited

Test coredump:

vstinner@freebsd$ ./python -c 'import ctypes; ctypes.string_at(0)'
Segmentation fault (core dumped)
vstinner@freebsd$ ls /tmp/*.core
/tmp/python-3505.core

Cleanup:

./python -m test --cleanup
rm -rf /tmp/test_python_* /tmp/python*.core

Prepare Windows

Build a fresh Python:

git checkout main
git pull
git clean -fdx
PCbuild\build.bat -e -d -p x64

Stress-test

Let's start easy, I run -j5 on the 3 operating systems:

python -m test test_threading -m test_4_daemon_threads -j5 -F --fail-env-changed

vstinner · 2023-09-29T15:42:40Z

I reproduce the crash on Windows in 1 min 31 sec:

0:01:27 load avg: 3.15 [ 64] test_threading passed
0:01:28 load avg: 3.15 [ 65] test_threading passed
0:01:31 load avg: 3.09 [ 66/1] test_threading failed (1 failure)
test test_threading failed -- Traceback (most recent call last):
  File "C:\victor\python\main\Lib\test\test_threading.py", line 1182, in test_4_daemon_threads
    rc, out, err = assert_python_ok('-c', script)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\victor\python\main\Lib\test\support\script_helper.py", line 166, in assert_python_ok
    return _assert_python(True, *args, **env_vars)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\victor\python\main\Lib\test\support\script_helper.py", line 151, in _assert_python
    res.fail(cmd_line)
  File "C:\victor\python\main\Lib\test\support\script_helper.py", line 76, in fail
    raise AssertionError("Process return code is %d\n"
AssertionError: Process return code is 3221225477

vstinner · 2023-09-29T15:55:28Z

Patch to make the bug more likely:

diff --git a/Python/ceval_gil.c b/Python/ceval_gil.c
index ba16f5eb9b..48e5412281 100644
--- a/Python/ceval_gil.c
+++ b/Python/ceval_gil.c
@@ -329,6 +329,14 @@ drop_gil(struct _ceval_state *ceval, PyThreadState *tstate)
 }
 
 
+static void
+_Py_random_sleep(void)
+{
+    unsigned long us = (unsigned long)(drand48() * 1000);
+    usleep(us);
+}
+
+
 /* Take the GIL.
 
    The function saves errno at entry and restores its value at exit.
@@ -355,6 +363,9 @@ take_gil(PyThreadState *tstate)
     }
 
     assert(_PyThreadState_CheckConsistency(tstate));
+
+_Py_random_sleep();
+
     PyInterpreterState *interp = tstate->interp;
     struct _ceval_state *ceval = &interp->ceval;
     struct _gil_runtime_state *gil = ceval->gil;

With this patch, I reproduce the crash in a few seconds on Linux with:

$ ./python -m test test_threading -m ThreadJoinOnShutdown -v --forever -j4
(...)
test_4_daemon_threads (test.test_threading.ThreadJoinOnShutdown.test_4_daemon_threads) ... FAIL

Now enjoy gdb:

vstinner@mona$ ls /tmp/python*.core
/tmp/python-1685172.core  /tmp/python-1685173.core  /tmp/python-1685174.core  /tmp/python-1685175.core

vstinner@mona$ gdb ./python -c /tmp/python-1685172.core

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000000000069f884 in take_gil (tstate=0x24f9d50) at Python/ceval_gil.c:371
371	    struct _gil_runtime_state *gil = ceval->gil;
[Current thread is 1 (Thread 0x7f04baffd6c0 (LWP 1685204))]
Missing separate debuginfos, use: dnf debuginfo-install bzip2-libs-1.0.8-13.fc38.x86_64 glibc-2.37-5.fc38.x86_64 libffi-3.4.4-2.fc38.x86_64 openssl-libs-3.0.9-2.fc38.x86_64 xz-libs-5.4.1-1.fc38.x86_64 zlib-1.2.13-3.fc38.x86_64

(gdb) where
#0  0x000000000069f884 in take_gil (tstate=0x24f9d50) at Python/ceval_gil.c:371
#1  0x00000000006a0427 in PyEval_RestoreThread (tstate=0x24f9d50) at Python/ceval_gil.c:725
#2  0x000000000074caf6 in internal_close (self=0x7f04f840ca10) at ./Modules/_io/fileio.c:120
#3  0x000000000074cbe8 in _io_FileIO_close_impl (self=0x7f04f840ca10, cls=0x241dc20) at ./Modules/_io/fileio.c:169
#4  0x000000000074e400 in _io_FileIO_close (self=0x7f04f840ca10, cls=0x241dc20, args=0x7f04baffa930, nargs=0, kwnames=0x0) at ./Modules/_io/clinic/fileio.c.h:33
#5  0x00000000004fb7d8 in method_vectorcall_FASTCALL_KEYWORDS_METHOD (func=<method_descriptor at remote 0x7f050140cdd0>, args=0x7f04baffa928, nargsf=1, kwnames=0x0)
...

(gdb) frame 0
#0  0x000000000069f884 in take_gil (tstate=0x24f9d50) at Python/ceval_gil.c:371
371	    struct _gil_runtime_state *gil = ceval->gil;

(gdb) p /x ceval
$3 = 0xdddddddddddddddd

ericsnowcurrently · 2023-09-29T15:57:40Z

Thanks. I'll give it a try.

vstinner · 2023-09-29T16:18:23Z

Sadly, I can also reproduce the crash in the 3.9, 3.10, 3.11 and 3.12 branches.

Apparently, the bug was introduced when PyInterpreterState.ceval was added by the change:

commit 6a150bcaeb190d1731b38ab9c7a5d1a352847ddc
Author: Eric Snow <[email protected]>
Date:   Sat Jun 1 15:39:46 2019 -0600

    bpo-33608: Factor out a private, per-interpreter _Py_AddPendingCall(). (gh-13714)

This change was in the 3.8 branch but then I reverted it (commit e225beb). It's part of Python 3.9.0.

This change was the starting point for a long way towards PEP 684 – A Per-Interpreter GIL which was accepted in Python 3.12. I don't think that it can be easily reverted. So instead, we should design a fix.

I failed to reproduce the issue in Python 3.8.

ericsnowcurrently · 2023-09-29T16:19:38Z

That's good to know. Thanks for the great detective work!

vstinner · 2023-09-29T16:58:35Z

The danger zone is between _PyThreadState_DeleteExcept(tstate) called by Py_FinalizeEx() in the main thread, and the process exit. During this time, PyThreadState memory is freed, tstate pointers became dangling pointers, and shortly, PyInterpeterState will also be made freed memory in PyInterpreterState_Delete().

On a debug build (gcc -O0, pydebug), there are around 10 ms between _PyThreadState_DeleteExcept() call and Python exit.
On a release build (gcc -O3), there are around 1.7 ms between _PyThreadState_DeleteExcept() call and Python exit.

Here the "Python exit" is the exit of the main() function.

vstinner · 2023-09-29T17:18:26Z

Core of the race condition:

[thread] take_gil() calls _PyThreadState_MustExit(tstate) returns false
[main] Py_Finalize() calls _PyInterpreterState_SetFinalizing()
[main] Py_Finalize() calls _PyThreadState_DeleteExcept(tstate): all other tstate become dangling pointers to freed memory
[thread] take_gil() deferences tstate: ceval = &interp->ceval (compute a pointer: ok) and then gil = ceval->gil: CRASH!

vstinner · 2023-09-29T17:23:59Z

A workaround would be to exit the thread if PyMem_IsPtrFreed(tstate->interp) is true. But this workaround is fragile, since it relies on reading freed memory, and only in debug mode, Python fills freed memory with a byte pattern (0xDD bytes).

Some kind of synchronization is needed here to get tstate->interp in a safe way. Maybe something with atomic variable,s I'm not sure.

cc @colesbury who likes atomic variables :-)

colesbury · 2023-09-29T18:03:39Z

I dealt with these problems in nogil-3.12 by:

Refcount PyThreadState. The interpreter holds one refcount and the native thread holds another, so most of the time a PyThreadState has refcount of two.
When the main thread exits and "deletes" all other threads, decrement the refcount as they're removed from the interpreter list. Only call PyMem_RawFree if the refcount is zero!
When a thread exits in _PyThreadState_MustExit, decrement the refcount. If it's zero, call PyMem_RawFree on the thread state.

The other bit that was helpful was the thread states (implemented in #109915), which can help deal with the inherent race between _PyThreadState_MustExit() and the main thread exiting. It needs a bit more work from #109915: in --disable-gil builds you need a "stop-the-world" event to make sure all threads are in sane state. You can do a similar thing in GIL builds, where you put all other threads in say a DEAD state so that they exit instead of trying to acquire the GIL (which can be a problem if we're deleting the GIL at exit!)

vstinner · 2023-09-29T18:39:57Z

Not freeing the memory sounds an appealing solution, since trying to workaround dangling pointers is... tricky, as shown in this issue.

vstinner · 2023-09-29T18:41:20Z

Since this bug exists since Python 3.9 and nobody (before me) complained, I don't think that there is an emergency to fix it in Python 3.12. We can take time in Python 3.13 to come up with a clean design, as the one described by @colesbury: refcount and states.

ericsnowcurrently · 2023-09-29T20:05:29Z

I have been thinking about making use of thread-local storage for the problems caused by daemon threads. Something like this:

#Include/internal/pycore_pystate.h

typedef struct {
    PyThreadState *tstate;
    int finalizing;
} _Py_tss_state_t;

#Include/cpython/pystate.h

struct _ts {
    ...
    _Py_tss_state_t *tss;
    ...
} ;

# Python/pystate.c

// Replaces _Py_tss_tstate.
// Guaranteed initialized to {0}?
_Py_thread_local _Py_tss_state pystate;

// Update tstate_activate() and tstate_deactivate() to set pystate.tstate
// (and update pystate.finalizing, if necessary).

void
_PyThreadState_Bind(PyThreadState *tstate)
{
    ...
    tstate->tss = &pystate;
}

// Use this with (or in place of) _PyInterpreterState_SetFinalizing()?
void
_PyInterpreterState_SetTheadsFinalizing(PyInterpreterState *interp)
{
    HEAD_LOCK(&_PyRuntime);
    assert(interp->finalizing);  // If true, new threads won't be created.
    PyThreadState *tstate = interp->threads.head;
    while (tstate != NULL) {
        if (tstate->tss->tstate == tstate) {
            tstate->tss->finalizing = 1;
        }
        tstate = tstate->next;
    }
    HEAD_UNLOCK(&_PyRuntime);
}

int
_PyThreadState_IsFinalizing(void)
{
    return pystate.finalizing;
}

# Python/ceval_gil.c

// Update _PyThreadState_MustExit() to rely on _PyThreadState_IsFinalizing().

// XXX Update take_gil() to use critical data stored in the pystate threadlocal?

ericsnowcurrently · 2023-09-29T20:07:13Z

Since we only need to worry about races in take_gil() with daemon threads, we could refcount the GIL instead of the whole thread state, if a refcount is necessary.

colesbury · 2023-09-29T20:11:34Z

@ericsnowcurrently - I think that approach opens you up to problems if the daemon thread exits before the main thread. Then the thread->tss will point to invalid memory. There's no guarantee that the daemon thread necessarily exits cleanly and cleans up it's PyThreadState.

ericsnowcurrently · 2023-09-29T20:52:31Z

Yeah, if a thread (not just daemon) exits before we call PyThreadState_Delete() then that would be a problem, at least without some thread finalizer hook (like C++).

That said, this seems like a something that would apply broadly to the whole C community, whereas our current issue of finalization races in take_gil() with daemon threads is fairly specific to us. Thus it's more likely to have been solved already, no?

ericsnowcurrently · 2023-09-29T21:54:53Z

Hmm, tss_create() can be used for key-based TSS, a la the PEP 539 API. Unlike our API though, tss_create() takes a destructor arg that gets called when the thread exits (but not with the program exit()). We could use that to clean up the thread-local variable. I'm not sure how that works for Windows. Given that C++ can do it, I'd think we could do it from C on WIndows.

colesbury · 2023-09-29T22:05:58Z

It's hard to sequence destruction of thread-local variables. In other words, the tss destructors may be called before or after thread-local storage is destroyed. So unless you create pystate via tss_create (which would make access slower), then it's really hard to know if the tss destructor will be called before or after pystate is destroyed.

For an example of this problem, see microsoft/mimalloc#164.

Is there anything wrong with the solution I've outlined above? I've tested it in nogil-3.12 to address this crash, which occurs more frequently once the GIL is disabled.

ericsnowcurrently · 2023-09-29T22:26:47Z

I didn't have a chance yet.

gh-110052: Fix faulthandler for freed tstate (#110069) faulthandler now detected freed interp and freed tstate, and no longer dereference them. (cherry picked from commit 2e37a38)

faulthandler now detected freed interp and freed tstate, and no longer dereference them.

vstinner mentioned this issue Sep 28, 2023

test_threading failed: test_reinit_tls_after_fork() failed with "env changed" on GHA Address Sanitizer: process 19857 is still running after 300.4 seconds #110031

Closed

vstinner mentioned this issue Sep 28, 2023

gh-109793: Allow Switching Interpreters During Finalization #109794

Merged

ericsnowcurrently self-assigned this Sep 28, 2023

vstinner added a commit to vstinner/cpython that referenced this issue Sep 29, 2023

pythongh-110052: Fix faulthandler for freed tstate

e4b3bc8

faulthandler now detected freed interp and freed tstate, and no longer dereference them.

vstinner added a commit to vstinner/cpython that referenced this issue Sep 29, 2023

pythongh-110052: Fix faulthandler for freed tstate

585a1a4

faulthandler now detected freed interp and freed tstate, and no longer dereference them.

bedevere-app bot mentioned this issue Sep 29, 2023

gh-110052: Fix faulthandler for freed tstate #110069

Merged

vstinner added a commit that referenced this issue Sep 29, 2023

gh-110052: Fix faulthandler for freed tstate (#110069)

2e37a38

faulthandler now detected freed interp and freed tstate, and no longer dereference them.

vstinner added a commit to vstinner/cpython that referenced this issue Sep 29, 2023

pythongh-110052: Fix faulthandler for freed tstate (python#110069)

faf7d4a

faulthandler now detected freed interp and freed tstate, and no longer dereference them. (cherry picked from commit 2e37a38)

This was referenced Sep 29, 2023

[3.12] gh-110052: Fix faulthandler for freed tstate (#110069) #110071

Merged

[3.11] gh-110052: Fix faulthandler for freed tstate (#110069) #110072

Merged

This was referenced Oct 5, 2023

gh-108987: Fix _thread.start_new_thread() race condition #109135

Merged

The Word "Finalizing" in C-API Function Names is Sometimes Misleading #110490

Open

iritkatriel added the tests Tests in the Lib/test dir label Nov 27, 2023

serhiy-storchaka added this to Threading issues 🧵 Mar 5, 2024

jonny5532 mentioned this issue Mar 6, 2024

Segfault in faulthandler signal handler #116008

Open

colesbury mentioned this issue Mar 18, 2024

TSan: data race with _PyThreadState_Attach and _PyThreadState_Detach #116910

Open

colesbury mentioned this issue Jul 14, 2024

PyThreadState_Clear, tstate_activate/_PyThreadState_Attach and drop_gil race race on TSAN #121751

Open

Glyphack pushed a commit to Glyphack/cpython that referenced this issue Sep 2, 2024

pythongh-110052: Fix faulthandler for freed tstate (python#110069)

6be8ef2

faulthandler now detected freed interp and freed tstate, and no longer dereference them.

test_threading.test_4_daemon_threads() crash randomly #110052

test_threading.test_4_daemon_threads() crash randomly #110052

Comments

vstinner commented Sep 28, 2023 • edited by bedevere-app bot Loading

Linked PRs

vstinner commented Sep 28, 2023

vstinner commented Sep 28, 2023

ericsnowcurrently commented Sep 28, 2023

vstinner commented Sep 28, 2023

ericsnowcurrently commented Sep 28, 2023

ericsnowcurrently commented Sep 28, 2023

ericsnowcurrently commented Sep 29, 2023

ericsnowcurrently commented Sep 29, 2023

vstinner commented Sep 29, 2023

ericsnowcurrently commented Sep 29, 2023

ericsnowcurrently commented Sep 29, 2023

terryjreedy commented Sep 29, 2023

terryjreedy commented Sep 29, 2023

vstinner commented Sep 29, 2023

ericsnowcurrently commented Sep 29, 2023

ericsnowcurrently commented Sep 29, 2023

vstinner commented Sep 29, 2023 • edited Loading

Machine

Prepare Linux

Prepare FreeBSD

Prepare Windows

Stress-test

vstinner commented Sep 29, 2023

vstinner commented Sep 29, 2023

ericsnowcurrently commented Sep 29, 2023

vstinner commented Sep 29, 2023

ericsnowcurrently commented Sep 29, 2023

vstinner commented Sep 29, 2023

vstinner commented Sep 29, 2023

vstinner commented Sep 29, 2023

colesbury commented Sep 29, 2023

vstinner commented Sep 29, 2023

vstinner commented Sep 29, 2023

ericsnowcurrently commented Sep 29, 2023

ericsnowcurrently commented Sep 29, 2023 • edited Loading

colesbury commented Sep 29, 2023

ericsnowcurrently commented Sep 29, 2023

ericsnowcurrently commented Sep 29, 2023

colesbury commented Sep 29, 2023

ericsnowcurrently commented Sep 29, 2023

vstinner commented Sep 28, 2023 •

edited by bedevere-app bot

Loading

vstinner commented Sep 29, 2023 •

edited

Loading

ericsnowcurrently commented Sep 29, 2023 •

edited

Loading