test_threading.test_4_daemon_threads() crashes randomly #110052
See also issue gh-110031 which looks similar.
The regression was introduced by PR gh-109794: #109794 (comment)
I'm looking into this.
I haven't been able to get a crash in over 16 minutes (but I certainly believe you).
What is your operating system? I reproduced the crash on Linux. I didn't try other operating systems. If you want to make daemon thread crashes more likely, add
It's an 8-core, 24GB RAM Hyper-V Ubuntu VM running on a Windows laptop.
I'll try it.
Yeah, that's the critical question.
Here's the critical part from gh-109794:
The most obvious possibility is that we switched to a different thread state during finalization in the thread where finalization is happening and the thread state belongs to the same interpreter. However, that isn't an expected use case for
No failures in 13 minutes.
How do you build Python? Which configure options? I'm usually using
Basically:
Is this issue tied to any buildbot/CI failures?
6/12 core Win 10, standard PCBuild/build.bat -d
Rerun showed this warning first.
Then same failure at 319.
From time to time, you may want to run
faulthandler now detected freed interp and freed tstate, and no longer dereference them.
FWIW, I also saw this warning periodically with
FTR, 3221225477 == 0xC0000005 (STATUS_ACCESS_VIOLATION on Windows).
Ok, let me try to reproduce the issue. I'm testing commit
Machine
Prepare Linux
Build a fresh Python:
Configure Linux to write coredumps in /tmp:
Test coredump:
Cleanup:
Prepare FreeBSD
Build a fresh Python:
Configure FreeBSD to write coredumps in /tmp:
Test coredump:
Cleanup:
Prepare Windows
Build a fresh Python:
Stress-test
Let's start easy, I run -j5 on the 3 operating systems:
I reproduce the crash on Windows in 1 min 31 sec:
Patch to make the bug more likely:
diff --git a/Python/ceval_gil.c b/Python/ceval_gil.c
index ba16f5eb9b..48e5412281 100644
--- a/Python/ceval_gil.c
+++ b/Python/ceval_gil.c
@@ -329,6 +329,14 @@ drop_gil(struct _ceval_state *ceval, PyThreadState *tstate)
}
+static void
+_Py_random_sleep(void)
+{
+ unsigned long us = (unsigned long)(drand48() * 1000);
+ usleep(us);
+}
+
+
/* Take the GIL.
The function saves errno at entry and restores its value at exit.
@@ -355,6 +363,9 @@ take_gil(PyThreadState *tstate)
}
assert(_PyThreadState_CheckConsistency(tstate));
+
+_Py_random_sleep();
+
PyInterpreterState *interp = tstate->interp;
struct _ceval_state *ceval = &interp->ceval;
struct _gil_runtime_state *gil = ceval->gil;
With this patch, I reproduce the crash in a few seconds on Linux with:
Now enjoy gdb:
Thanks. I'll give it a try.
Sadly, I can also reproduce the crash in the 3.9, 3.10, 3.11 and 3.12 branches. Apparently, the bug was introduced when
This change was in the 3.8 branch but then I reverted it (commit e225beb). It's part of Python 3.9.0. This change was the starting point of the long road towards PEP 684 – A Per-Interpreter GIL, which was accepted in Python 3.12. I don't think it can be easily reverted, so instead we should design a fix. I failed to reproduce the issue in Python 3.8.
That's good to know. Thanks for the great detective work!
The danger zone is between
Here the "Python exit" is the exit of the main() function.
Core of the race condition:
A workaround would be to exit the thread if
Some kind of synchronization is needed here to get
cc @colesbury who likes atomic variables :-)
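To make the shape of that workaround concrete, here is a minimal standalone sketch (all names here, state_t, finalizing, worker, are invented for illustration; this is not the CPython code) of a thread checking an atomic flag before touching state that the main thread tears down at exit:

/* Sketch only: a worker thread checks an atomic "finalizing" flag and exits
   instead of dereferencing state that the main thread frees at shutdown. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { int value; } state_t;

static state_t *shared_state;
static atomic_int finalizing;

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        /* Without this check, the dereference below could hit memory that
           the main thread already freed during finalization. */
        if (atomic_load(&finalizing)) {
            pthread_exit(NULL);
        }
        printf("state value: %d\n", shared_state->value);
    }
}

int main(void)
{
    shared_state = malloc(sizeof(state_t));
    shared_state->value = 42;

    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);

    atomic_store(&finalizing, 1);   /* announce finalization first */
    pthread_join(tid, NULL);        /* this toy program can join; CPython never
                                       joins daemon threads, which is why a flag
                                       alone leaves a race window */
    free(shared_state);
    return 0;
}

The join is what makes this toy program safe; the whole difficulty with daemon threads is that CPython never joins them, so the window between reading the flag and dereferencing the state is exactly the synchronization gap discussed above.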
I dealt with these problems in nogil-3.12 by:
The other bit that was helpful was the thread states (implemented in #109915), which can help deal with the inherent race between
Not freeing the memory sounds like an appealing solution, since trying to work around dangling pointers is... tricky, as shown in this issue.
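A rough standalone sketch of that direction (the names ts_new, ts_ref and ts_unref are invented here; this is not the actual nogil-3.12 code): the memory behind a thread state is only released when the last reference drops, so a late dereference during finalization never hits freed memory.

/* Sketch only: refcounted thread-state memory. */
#include <stdatomic.h>
#include <stdlib.h>

typedef struct thread_state {
    atomic_int refcount;   /* owning thread + anyone still allowed to look */
    int id;
    /* ... */
} thread_state_t;

static thread_state_t *ts_new(int id)
{
    thread_state_t *ts = calloc(1, sizeof(*ts));
    if (ts == NULL) {
        abort();
    }
    atomic_init(&ts->refcount, 1);   /* reference held by the owning thread */
    ts->id = id;
    return ts;
}

static void ts_ref(thread_state_t *ts)
{
    atomic_fetch_add(&ts->refcount, 1);
}

static void ts_unref(thread_state_t *ts)
{
    if (atomic_fetch_sub(&ts->refcount, 1) == 1) {
        free(ts);   /* last reference gone: now it is safe to release */
    }
}

Finalization would take a reference before walking the thread list and drop it afterwards, so even a daemon thread that is "already dead" at the Python level still points at valid memory while it unwinds.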
Since this bug has existed since Python 3.9 and nobody (before me) complained, I don't think there is any urgency to fix it in Python 3.12. We can take the time in Python 3.13 to come up with a clean design, such as the one described by @colesbury: refcounts and states.
I have been thinking about making use of thread-local storage for the problems caused by daemon threads. Something like this:
# Include/internal/pycore_pystate.h
typedef struct {
PyThreadState *tstate;
int finalizing;
} _Py_tss_state_t;
# Include/cpython/pystate.h
struct _ts {
...
_Py_tss_state_t *tss;
...
};
# Python/pystate.c
// Replaces _Py_tss_tstate.
// Guaranteed initialized to {0}?
_Py_thread_local _Py_tss_state_t pystate;
// Update tstate_activate() and tstate_deactivate() to set pystate.tstate
// (and update pystate.finalizing, if necessary).
void
_PyThreadState_Bind(PyThreadState *tstate)
{
...
tstate->tss = &pystate;
}
// Use this with (or in place of) _PyInterpreterState_SetFinalizing()?
void
_PyInterpreterState_SetThreadsFinalizing(PyInterpreterState *interp)
{
HEAD_LOCK(&_PyRuntime);
assert(interp->finalizing); // If true, new threads won't be created.
PyThreadState *tstate = interp->threads.head;
while (tstate != NULL) {
if (tstate->tss->tstate == tstate) {
tstate->tss->finalizing = 1;
}
tstate = tstate->next;
}
HEAD_UNLOCK(&_PyRuntime);
}
int
_PyThreadState_IsFinalizing(void)
{
return pystate.finalizing;
}
# Python/ceval_gil.c
// Update _PyThreadState_MustExit() to rely on _PyThreadState_IsFinalizing().
// XXX Update take_gil() to use critical data stored in the pystate threadlocal?
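Purely as an illustration of how the sketch above might be consumed (hypothetical: take_gil_like is an invented name, and the real integration point would be _PyThreadState_MustExit()/take_gil()):

// Hypothetical caller-side sketch, building on the code above.
static void
take_gil_like(PyThreadState *tstate)
{
    if (_PyThreadState_IsFinalizing()) {
        // The interpreter is tearing down in another thread: exit this
        // thread instead of dereferencing a possibly-freed tstate.
        PyThread_exit_thread();
    }
    // ... normal GIL acquisition continues here ...
}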
Since we only need to worry about races in
@ericsnowcurrently - I think that approach opens you up to problems if the daemon thread exits before the main thread. Then the
Yeah, if a thread (not just a daemon thread) exits before we call
That said, this seems like something that would apply broadly to the whole C community, whereas our current issue of finalization races in
Hmm,
It's hard to sequence destruction of thread-local variables. In other words, the tss destructors may be called before or after thread-local storage is destroyed. So unless you create
For an example of this problem, see microsoft/mimalloc#164. Is there anything wrong with the solution I've outlined above? I've tested it in nogil-3.12 to address this crash, which occurs more frequently once the GIL is disabled.
I haven't had a chance yet.
On Linux, when I stress test test_threading.test_4_daemon_threads(), it does crash randomly:
gdb traceback:
gdb debug:
I don't understand why the test didn't exit: _PyThreadState_MustExit() should return, no?
I don't understand why
assert(_PyThreadState_CheckConsistency(tstate));
didn't fail. Maybe Py_Finalize() was called between the pre-check:
and the code:
Linked PRs