Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_threading.test_4_daemon_threads() crash randomly #110052

Open
vstinner opened this issue Sep 28, 2023 · 35 comments
Open

test_threading.test_4_daemon_threads() crash randomly #110052

vstinner opened this issue Sep 28, 2023 · 35 comments
Assignees
Labels
tests Tests in the Lib/test dir

Comments

@vstinner
Copy link
Member

vstinner commented Sep 28, 2023

On Linux, when I stress test test_threading.test_4_daemon_threads(), it does crash randomly:

./python -m test test_threading -m test_4_daemon_threads -j50 -F --fail-env-changed  

gdb traceback:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000006e02f4 in PyInterpreterState_ThreadHead (interp=0xdddddddddddddddd) at Python/pystate.c:1961

warning: Source file is more recent than executable.
1961	    return interp->threads.head;
[Current thread is 1 (Thread 0x7fcbb8ff96c0 (LWP 510515))]
Missing separate debuginfos, use: dnf debuginfo-install bzip2-libs-1.0.8-13.fc38.x86_64 glibc-2.37-5.fc38.x86_64 libffi-3.4.4-2.fc38.x86_64 libgcc-13.2.1-1.fc38.x86_64 openssl-libs-3.0.9-2.fc38.x86_64 xz-libs-5.4.1-1.fc38.x86_64 zlib-1.2.13-3.fc38.x86_64
(gdb) where
#0  0x00000000006e02f4 in PyInterpreterState_ThreadHead (interp=0xdddddddddddddddd) at Python/pystate.c:1961
#1  0x0000000000703090 in _Py_DumpTracebackThreads (fd=2, interp=0xdddddddddddddddd, current_tstate=0x24a34e0) at Python/traceback.c:1331
#2  0x000000000071bcef in faulthandler_dump_traceback (fd=2, all_threads=1, interp=0xa53688 <_PyRuntime+92712>) at ./Modules/faulthandler.c:195
#3  0x000000000071bfed in faulthandler_fatal_error (signum=11) at ./Modules/faulthandler.c:313
#4  <signal handler called>
#5  0x000000000069f808 in take_gil (tstate=0x24a34e0) at Python/ceval_gil.c:360
#6  0x00000000006a03ab in PyEval_RestoreThread (tstate=0x24a34e0) at Python/ceval_gil.c:714
#7  0x000000000074dd93 in portable_lseek (self=0x7fcc186c3590, posobj=0x0, whence=1, suppress_pipe_error=false) at ./Modules/_io/fileio.c:934
#8  0x000000000074de88 in _io_FileIO_tell_impl (self=0x7fcc186c3590) at ./Modules/_io/fileio.c:997
#9  0x000000000074ea4c in _io_FileIO_tell (self=0x7fcc186c3590, _unused_ignored=0x0) at ./Modules/_io/clinic/fileio.c.h:460
(...)

gdb debug:


(gdb) frame 5
#5  0x000000000069f808 in take_gil (tstate=0x24a34e0) at Python/ceval_gil.c:360
360	    struct _gil_runtime_state *gil = ceval->gil;
(gdb) l
355	    }
356	
357	    assert(_PyThreadState_CheckConsistency(tstate));
358	    PyInterpreterState *interp = tstate->interp;
359	    struct _ceval_state *ceval = &interp->ceval;
360	    struct _gil_runtime_state *gil = ceval->gil;
361	
362	    /* Check that _PyEval_InitThreads() was called to create the lock */
363	    assert(gil_created(gil));
364	

(gdb) p /x ceval
$1 = 0xdddddddddddddddd

(gdb) p /x tstate->interp
$2 = 0xdddddddddddddddd


(gdb) p /x tstate
$5 = 0x24a34e0

(gdb) p /x _PyRuntime._finalizing._value 
$3 = 0xab8fa8

(gdb) p /x _PyRuntime._finalizing_id 
$4 = 0x7fcc49fce740

I don't understand why the test didn't exit: _PyThreadState_MustExit() should return, no?

I don't understand why assert(_PyThreadState_CheckConsistency(tstate)); didn't fail.

Maybe Py_Finalize() was called between the pre-check:

    if (_PyThreadState_MustExit(tstate)) {
        /* bpo-39877: If Py_Finalize() has been called and tstate is not the
           thread which called Py_Finalize(), exit immediately the thread.

           This code path can be reached by a daemon thread after Py_Finalize()
           completes. In this case, tstate is a dangling pointer: points to
           PyThreadState freed memory. */
        PyThread_exit_thread();
    }

    assert(_PyThreadState_CheckConsistency(tstate));

and the code:

    PyInterpreterState *interp = tstate->interp;
    struct _ceval_state *ceval = &interp->ceval;
    struct _gil_runtime_state *gil = ceval->gil;

Linked PRs

@vstinner
Copy link
Member Author

See also issue gh-110031 which looks similar.

@vstinner
Copy link
Member Author

The regression was introduced by PR gh-109794: #109794 (comment)

@ericsnowcurrently ericsnowcurrently self-assigned this Sep 28, 2023
@ericsnowcurrently
Copy link
Member

I'm looking into this.

./python -m test test_threading -m test_4_daemon_threads -j50 -F --fail-env-changed

I haven't been able to get a crash in over 16 minutes (but I certainly believe you).

@vstinner
Copy link
Member Author

I haven't been able to get a crash in over 16 minutes (but I certainly believe you).

What is your operating system? I reproduced the crash on Linux. I didn't try other operating systems.

If you want to make daemon threads crashes more likely, add sleep(1) at the end of PyInterpreterState_Delete(), after free_interpreter().

@ericsnowcurrently
Copy link
Member

What is your operating system? I reproduced the crash on Linux. I didn't try other operating systems.

It's an 8-core, 24GB RAM Hyper-V Ubuntu VM running on a Windows laptop.

If you want to make daemon threads crashes more likely, add sleep(1) at the end of PyInterpreterState_Delete(), after free_interpreter().

I'll try it.

@ericsnowcurrently
Copy link
Member

gdb traceback:
,,,
gdb debug:
...
I don't understand why the test didn't exit: _PyThreadState_MustExit() should return, no?

Yeah, that's the critical question.

@ericsnowcurrently
Copy link
Member

Here's the critical part from gh-109794:

diff --git a/Python/pystate.c b/Python/pystate.c
index dcc6c112215b3..570b5242600c0 100644
--- a/Python/pystate.c
+++ b/Python/pystate.c
@@ -2964,11 +2964,26 @@ _PyThreadState_MustExit(PyThreadState *tstate)
        tstate->interp->runtime to support calls from Python daemon threads.
        After Py_Finalize() has been called, tstate can be a dangling pointer:
        point to PyThreadState freed memory. */
+    unsigned long finalizing_id = _PyRuntimeState_GetFinalizingID(&_PyRuntime);
     PyThreadState *finalizing = _PyRuntimeState_GetFinalizing(&_PyRuntime);
     if (finalizing == NULL) {
+        // XXX This isn't completely safe from daemon thraeds,
+        // since tstate might be a dangling pointer.
         finalizing = _PyInterpreterState_GetFinalizing(tstate->interp);
+        finalizing_id = _PyInterpreterState_GetFinalizingID(tstate->interp);
     }
-    return (finalizing != NULL && finalizing != tstate);
+    // XXX else check &_PyRuntime._main_interpreter._initial_thread
+    if (finalizing == NULL) {
+        return 0;
+    }
+    else if (finalizing == tstate) {
+        return 0;
+    }
+    else if (finalizing_id == PyThread_get_thread_ident()) {
+        /* gh-109793: we must have switched interpreters. */
+        return 0;
+    }
+    return 1;
 }

The most obvious possibility is that we switched to a different thread state during finalization in the thread where finalization is happening and the thread state belongs to the same interpreter. However, that isn't an expected use case for PyThread_Swap() and I'm pretty sure we don't have any code that does that.

@ericsnowcurrently
Copy link
Member

If you want to make daemon threads crashes more likely, add sleep(1) at the end of PyInterpreterState_Delete(), after free_interpreter().

I'll try it.

No failures in 13 minutes.

@vstinner
Copy link
Member Author

No failures in 13 minutes.

How do you build Python? Which configure options?

I'm usually using ./configure --with-pydebug to get assertions.

@ericsnowcurrently
Copy link
Member

Basically: ./configure --with-pydebug CFLAGS=-O0

@ericsnowcurrently
Copy link
Member

Is this issue tied to any buildbot/CI failures?

@terryjreedy
Copy link
Member

6/12 core Win 10, standard PCBuild/build.bat -d

...
0:00:31 load avg: 76.65 [ 87/1] test_threading failed (1 failure)
test test_threading failed -- Traceback (most recent call last):
  File "f:\dev\3x\Lib\test\test_threading.py", line 1172, in test_4_daemon_threads
    rc, out, err = assert_python_ok('-c', script)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "f:\dev\3x\Lib\test\support\script_helper.py", line 166, in assert_python_ok
    return _assert_python(True, *args, **env_vars)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "f:\dev\3x\Lib\test\support\script_helper.py", line 151, in _assert_python
    res.fail(cmd_line)
  File "f:\dev\3x\Lib\test\support\script_helper.py", line 76, in fail
    raise AssertionError("Process return code is %d\n"
AssertionError: Process return code is 3221225477
command line: ['f:\\dev\\3x\\PCbuild\\amd64\\python_d.exe', '-X', 'faulthandler', '-c', "if True:\n            import os\n            import random\n            import sys\n            import time\n            import threading\n\n            thread_has_run = set()\n\n            def random_io():\n                '''Loop for a while sleeping random tiny amounts and doing some I/O.'''\n                import test.test_threading as mod\n                while True:\n                    with open(mod.__file__, 'rb') as in_f:\n                        stuff = in_f.read(200)\n                        with open(os.devnull, 'wb') as null_f:\n                            null_f.write(stuff)\n                            time.sleep(random.random() / 1995)\n                    thread_has_run.add(threading.current_thread())\n\n            def main():\n                count = 0\n                for _ in range(40):\n                    new_thread = threading.Thread(target=random_io)\n                    new_thread.daemon = True\n                    new_thread.start()\n                    count += 1\n                while len(thread_has_run) < count:\n                    time.sleep(0.001)\n                # Trigger process shutdown\n                sys.exit(0)\n\n            main()\n            "]

stdout:
---

---

stderr:
---
Windows fatal exception: access violation

@terryjreedy
Copy link
Member

Rerun showed this warning first.

0:00:29 load avg: 84.65 [ 75] test_threading passed
f:\dev\3x\Lib\test\support\os_helper.py:531: RuntimeWarning: tests may fail, unable to create temporary directory 'F:\\dev\\3x\\build\\test_python_3812æ': [WinError 183] Cannot create a file when that file already exists: 'F:\\dev\\3x\\build\\test_python_3812æ'
  with temp_dir(path=name, quiet=quiet) as temp_path:
test_4_daemon_threads (test.test_threading.ThreadJoinOnShutdown.test_4_daemon_threads) ... ok

Then same failure at 319.

@vstinner
Copy link
Member Author

f:\dev\3x\Lib\test\support\os_helper.py:531: RuntimeWarning: tests may fail, unable to create temporary directory 'F:\dev\3x\build\test_python_3812æ': [WinError 183] Cannot create a file when that file already exists: 'F:\dev\3x\build\test_python_3812æ'

Time to time, you may want to run python -m test --cleanup, to remove test directories which tests left when they crashed. regrtest is supposed to clean them. But well, regrtest itself has also bugs :-)

vstinner added a commit to vstinner/cpython that referenced this issue Sep 29, 2023
faulthandler now detected freed interp and freed tstate, and no
longer dereference them.
vstinner added a commit to vstinner/cpython that referenced this issue Sep 29, 2023
faulthandler now detected freed interp and freed tstate, and no
longer dereference them.
vstinner added a commit that referenced this issue Sep 29, 2023
faulthandler now detected freed interp and freed tstate, and no
longer dereference them.
vstinner added a commit to vstinner/cpython that referenced this issue Sep 29, 2023
faulthandler now detected freed interp and freed tstate, and no
longer dereference them.

(cherry picked from commit 2e37a38)
vstinner added a commit that referenced this issue Sep 29, 2023
gh-110052: Fix faulthandler for freed tstate (#110069)

faulthandler now detected freed interp and freed tstate, and no
longer dereference them.

Backport to 3.11: add pycore_pymem.h include to traceback.c.

(cherry picked from commit 2e37a38)
@ericsnowcurrently
Copy link
Member

f:\dev\3x\Lib\test\support\os_helper.py:531: RuntimeWarning: tests may fail, unable to create temporary directory 'F:\\dev\\3x\\build\\test_python_3812æ': [WinError 183] Cannot create a file when that file already exists: 'F:\\dev\\3x\\build\\test_python_3812æ'

FWIW, I also saw this warning periodically with ./python -m test test_threading -m test_4_daemon_threads -j50 -F --fail-env-changed (when trying to reproduce the crash), though I did not get any crashes.

@ericsnowcurrently
Copy link
Member

AssertionError: Process return code is 3221225477

FTR, 3221225477 == 0xC0000005 (STATUS_ACCESS_VIOLATION), AKA segfault.

@vstinner
Copy link
Member Author

vstinner commented Sep 29, 2023

Ok, let me try to reproduce the issue. I'm testing commit 20bc5f7c28a6f8a2e156c4a748ffabb5efc7c761 (at Fri Sep 29 16:24:38 2023 +0100).

Machine

  • My laptop: 6 CPU cores / 12 CPU threads
  • Fedora 38: 12 logicial CPUs
  • FreeBSD 13 running in a VM on the laptop: 5 vCPUs
  • Windows 11 running in a VM on the laptop: 6 vCPUs

Prepare Linux

Build a fresh Python:

git checkout main
git pull
git clean -fdx
./configure --cache-file=../python-config.cache --with-pydebug CFLAGS=-O0 --with-system-expat
make -j14

Configure Linux to write coredumps in /tmp:

sudo bash -c 'echo "/tmp/%e-%p.core" > /proc/sys/kernel/core_pattern'
ulimit -c unlimited

Test coredump:

vstinner@mona$ ./python -c 'import ctypes; ctypes.string_at(0)'
Erreur de segmentation (core dumped)
vstinner@mona$ ls /tmp/python*
/tmp/python-1623389.core

Cleanup:

./python -m test --cleanup
rm -rf /tmp/test_python_* /tmp/python*.core

Prepare FreeBSD

Build a fresh Python:

git checkout main
git pull
git clean -fdx
./configure --cache-file=../python-config.cache --with-pydebug CFLAGS=-O0 --with-system-expat
make -j7

Configure FreeBSD to write coredumps in /tmp:

sudo sysctl -w 'kern.corefile=/tmp/%N-%P.core'
ulimit -c unlimited

Test coredump:

vstinner@freebsd$ ./python -c 'import ctypes; ctypes.string_at(0)'
Segmentation fault (core dumped)
vstinner@freebsd$ ls /tmp/*.core
/tmp/python-3505.core

Cleanup:

./python -m test --cleanup
rm -rf /tmp/test_python_* /tmp/python*.core

Prepare Windows

Build a fresh Python:

git checkout main
git pull
git clean -fdx
PCbuild\build.bat -e -d -p x64

Stress-test

Let's start easy, I run -j5 on the 3 operating systems:

python -m test test_threading -m test_4_daemon_threads -j5 -F --fail-env-changed

@vstinner
Copy link
Member Author

I reproduce the crash on Windows in 1 min 31 sec:

0:01:27 load avg: 3.15 [ 64] test_threading passed
0:01:28 load avg: 3.15 [ 65] test_threading passed
0:01:31 load avg: 3.09 [ 66/1] test_threading failed (1 failure)
test test_threading failed -- Traceback (most recent call last):
  File "C:\victor\python\main\Lib\test\test_threading.py", line 1182, in test_4_daemon_threads
    rc, out, err = assert_python_ok('-c', script)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\victor\python\main\Lib\test\support\script_helper.py", line 166, in assert_python_ok
    return _assert_python(True, *args, **env_vars)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\victor\python\main\Lib\test\support\script_helper.py", line 151, in _assert_python
    res.fail(cmd_line)
  File "C:\victor\python\main\Lib\test\support\script_helper.py", line 76, in fail
    raise AssertionError("Process return code is %d\n"
AssertionError: Process return code is 3221225477

@vstinner
Copy link
Member Author

Patch to make the bug more likely:

diff --git a/Python/ceval_gil.c b/Python/ceval_gil.c
index ba16f5eb9b..48e5412281 100644
--- a/Python/ceval_gil.c
+++ b/Python/ceval_gil.c
@@ -329,6 +329,14 @@ drop_gil(struct _ceval_state *ceval, PyThreadState *tstate)
 }
 
 
+static void
+_Py_random_sleep(void)
+{
+    unsigned long us = (unsigned long)(drand48() * 1000);
+    usleep(us);
+}
+
+
 /* Take the GIL.
 
    The function saves errno at entry and restores its value at exit.
@@ -355,6 +363,9 @@ take_gil(PyThreadState *tstate)
     }
 
     assert(_PyThreadState_CheckConsistency(tstate));
+
+_Py_random_sleep();
+
     PyInterpreterState *interp = tstate->interp;
     struct _ceval_state *ceval = &interp->ceval;
     struct _gil_runtime_state *gil = ceval->gil;

With this patch, I reproduce the crash in a few seconds on Linux with:

$ ./python -m test test_threading -m ThreadJoinOnShutdown -v --forever -j4
(...)
test_4_daemon_threads (test.test_threading.ThreadJoinOnShutdown.test_4_daemon_threads) ... FAIL

Now enjoy gdb:

vstinner@mona$ ls /tmp/python*.core
/tmp/python-1685172.core  /tmp/python-1685173.core  /tmp/python-1685174.core  /tmp/python-1685175.core

vstinner@mona$ gdb ./python -c /tmp/python-1685172.core

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000000000069f884 in take_gil (tstate=0x24f9d50) at Python/ceval_gil.c:371
371	    struct _gil_runtime_state *gil = ceval->gil;
[Current thread is 1 (Thread 0x7f04baffd6c0 (LWP 1685204))]
Missing separate debuginfos, use: dnf debuginfo-install bzip2-libs-1.0.8-13.fc38.x86_64 glibc-2.37-5.fc38.x86_64 libffi-3.4.4-2.fc38.x86_64 openssl-libs-3.0.9-2.fc38.x86_64 xz-libs-5.4.1-1.fc38.x86_64 zlib-1.2.13-3.fc38.x86_64

(gdb) where
#0  0x000000000069f884 in take_gil (tstate=0x24f9d50) at Python/ceval_gil.c:371
#1  0x00000000006a0427 in PyEval_RestoreThread (tstate=0x24f9d50) at Python/ceval_gil.c:725
#2  0x000000000074caf6 in internal_close (self=0x7f04f840ca10) at ./Modules/_io/fileio.c:120
#3  0x000000000074cbe8 in _io_FileIO_close_impl (self=0x7f04f840ca10, cls=0x241dc20) at ./Modules/_io/fileio.c:169
#4  0x000000000074e400 in _io_FileIO_close (self=0x7f04f840ca10, cls=0x241dc20, args=0x7f04baffa930, nargs=0, kwnames=0x0) at ./Modules/_io/clinic/fileio.c.h:33
#5  0x00000000004fb7d8 in method_vectorcall_FASTCALL_KEYWORDS_METHOD (func=<method_descriptor at remote 0x7f050140cdd0>, args=0x7f04baffa928, nargsf=1, kwnames=0x0)
...

(gdb) frame 0
#0  0x000000000069f884 in take_gil (tstate=0x24f9d50) at Python/ceval_gil.c:371
371	    struct _gil_runtime_state *gil = ceval->gil;

(gdb) p /x ceval
$3 = 0xdddddddddddddddd

@ericsnowcurrently
Copy link
Member

Thanks. I'll give it a try.

@vstinner
Copy link
Member Author

Sadly, I can also reproduce the crash in the 3.9, 3.10, 3.11 and 3.12 branches.

Apparently, the bug was introduced when PyInterpreterState.ceval was added by the change:

commit 6a150bcaeb190d1731b38ab9c7a5d1a352847ddc
Author: Eric Snow <[email protected]>
Date:   Sat Jun 1 15:39:46 2019 -0600

    bpo-33608: Factor out a private, per-interpreter _Py_AddPendingCall(). (gh-13714)

This change was in the 3.8 branch but then I reverted it (commit e225beb). It's part of Python 3.9.0.

This change was the starting point for a long way towards PEP 684 – A Per-Interpreter GIL which was accepted in Python 3.12. I don't think that it can be easily reverted. So instead, we should design a fix.


I failed to reproduce the issue in Python 3.8.

@ericsnowcurrently
Copy link
Member

That's good to know. Thanks for the great detective work!

@vstinner
Copy link
Member Author

The danger zone is between _PyThreadState_DeleteExcept(tstate) called by Py_FinalizeEx() in the main thread, and the process exit. During this time, PyThreadState memory is freed, tstate pointers became dangling pointers, and shortly, PyInterpeterState will also be made freed memory in PyInterpreterState_Delete().

  • On a debug build (gcc -O0, pydebug), there are around 10 ms between _PyThreadState_DeleteExcept() call and Python exit.
  • On a release build (gcc -O3), there are around 1.7 ms between _PyThreadState_DeleteExcept() call and Python exit.

Here the "Python exit" is the exit of the main() function.

@vstinner
Copy link
Member Author

Core of the race condition:

schema

  1. [thread] take_gil() calls _PyThreadState_MustExit(tstate) returns false
  2. [main] Py_Finalize() calls _PyInterpreterState_SetFinalizing()
  3. [main] Py_Finalize() calls _PyThreadState_DeleteExcept(tstate): all other tstate become dangling pointers to freed memory
  4. [thread] take_gil() deferences tstate: ceval = &interp->ceval (compute a pointer: ok) and then gil = ceval->gil: CRASH!

@vstinner
Copy link
Member Author

A workaround would be to exit the thread if PyMem_IsPtrFreed(tstate->interp) is true. But this workaround is fragile, since it relies on reading freed memory, and only in debug mode, Python fills freed memory with a byte pattern (0xDD bytes).

Some kind of synchronization is needed here to get tstate->interp in a safe way. Maybe something with atomic variable,s I'm not sure.

cc @colesbury who likes atomic variables :-)

@colesbury
Copy link
Contributor

I dealt with these problems in nogil-3.12 by:

  1. Refcount PyThreadState. The interpreter holds one refcount and the native thread holds another, so most of the time a PyThreadState has refcount of two.
  2. When the main thread exits and "deletes" all other threads, decrement the refcount as they're removed from the interpreter list. Only call PyMem_RawFree if the refcount is zero!
  3. When a thread exits in _PyThreadState_MustExit, decrement the refcount. If it's zero, call PyMem_RawFree on the thread state.

The other bit that was helpful was the thread states (implemented in #109915), which can help deal with the inherent race between _PyThreadState_MustExit() and the main thread exiting. It needs a bit more work from #109915: in --disable-gil builds you need a "stop-the-world" event to make sure all threads are in sane state. You can do a similar thing in GIL builds, where you put all other threads in say a DEAD state so that they exit instead of trying to acquire the GIL (which can be a problem if we're deleting the GIL at exit!)

@vstinner
Copy link
Member Author

Not freeing the memory sounds an appealing solution, since trying to workaround dangling pointers is... tricky, as shown in this issue.

@vstinner
Copy link
Member Author

Since this bug exists since Python 3.9 and nobody (before me) complained, I don't think that there is an emergency to fix it in Python 3.12. We can take time in Python 3.13 to come up with a clean design, as the one described by @colesbury: refcount and states.

@ericsnowcurrently
Copy link
Member

I have been thinking about making use of thread-local storage for the problems caused by daemon threads. Something like this:

#Include/internal/pycore_pystate.h

typedef struct {
    PyThreadState *tstate;
    int finalizing;
} _Py_tss_state_t;

#Include/cpython/pystate.h

struct _ts {
    ...
    _Py_tss_state_t *tss;
    ...
} ;

# Python/pystate.c

// Replaces _Py_tss_tstate.
// Guaranteed initialized to {0}?
_Py_thread_local _Py_tss_state pystate;

// Update tstate_activate() and tstate_deactivate() to set pystate.tstate
// (and update pystate.finalizing, if necessary).

void
_PyThreadState_Bind(PyThreadState *tstate)
{
    ...
    tstate->tss = &pystate;
}

// Use this with (or in place of) _PyInterpreterState_SetFinalizing()?
void
_PyInterpreterState_SetTheadsFinalizing(PyInterpreterState *interp)
{
    HEAD_LOCK(&_PyRuntime);
    assert(interp->finalizing);  // If true, new threads won't be created.
    PyThreadState *tstate = interp->threads.head;
    while (tstate != NULL) {
        if (tstate->tss->tstate == tstate) {
            tstate->tss->finalizing = 1;
        }
        tstate = tstate->next;
    }
    HEAD_UNLOCK(&_PyRuntime);
}

int
_PyThreadState_IsFinalizing(void)
{
    return pystate.finalizing;
}

# Python/ceval_gil.c

// Update _PyThreadState_MustExit() to rely on _PyThreadState_IsFinalizing().

// XXX Update take_gil() to use critical data stored in the pystate threadlocal?

@ericsnowcurrently
Copy link
Member

ericsnowcurrently commented Sep 29, 2023

Since we only need to worry about races in take_gil() with daemon threads, we could refcount the GIL instead of the whole thread state, if a refcount is necessary.

@colesbury
Copy link
Contributor

@ericsnowcurrently - I think that approach opens you up to problems if the daemon thread exits before the main thread. Then the thread->tss will point to invalid memory. There's no guarantee that the daemon thread necessarily exits cleanly and cleans up it's PyThreadState.

@ericsnowcurrently
Copy link
Member

Yeah, if a thread (not just daemon) exits before we call PyThreadState_Delete() then that would be a problem, at least without some thread finalizer hook (like C++).

That said, this seems like a something that would apply broadly to the whole C community, whereas our current issue of finalization races in take_gil() with daemon threads is fairly specific to us. Thus it's more likely to have been solved already, no?

@ericsnowcurrently
Copy link
Member

Hmm, tss_create() can be used for key-based TSS, a la the PEP 539 API. Unlike our API though, tss_create() takes a destructor arg that gets called when the thread exits (but not with the program exit()). We could use that to clean up the thread-local variable. I'm not sure how that works for Windows. Given that C++ can do it, I'd think we could do it from C on WIndows.

@colesbury
Copy link
Contributor

It's hard to sequence destruction of thread-local variables. In other words, the tss destructors may be called before or after thread-local storage is destroyed. So unless you create pystate via tss_create (which would make access slower), then it's really hard to know if the tss destructor will be called before or after pystate is destroyed.

For an example of this problem, see microsoft/mimalloc#164.

Is there anything wrong with the solution I've outlined above? I've tested it in nogil-3.12 to address this crash, which occurs more frequently once the GIL is disabled.

@ericsnowcurrently
Copy link
Member

I didn't have a chance yet.

Yhg1s pushed a commit that referenced this issue Oct 2, 2023
gh-110052: Fix faulthandler for freed tstate (#110069)

faulthandler now detected freed interp and freed tstate, and no
longer dereference them.

(cherry picked from commit 2e37a38)
@iritkatriel iritkatriel added the tests Tests in the Lib/test dir label Nov 27, 2023
Glyphack pushed a commit to Glyphack/cpython that referenced this issue Sep 2, 2024
faulthandler now detected freed interp and freed tstate, and no
longer dereference them.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests Tests in the Lib/test dir
Projects
Status: No status
Development

No branches or pull requests

5 participants