Thread.join returns before PyThreadState is destroyed #63008
When a thread is started in CPython, t_bootstrap [Modules/threadmodule.c] creates a PyThreadState object and then calls Thread.__bootstrap_inner [Lib/threading.py], which calls Thread.run protected by self.__block, a Condition. Thread.join uses the same __block to block until Thread.run has finished: when Thread.run finishes, __bootstrap_inner notifies on __block, so join will return. Here lies a race condition: if a thread switch to Thread.join occurs before __bootstrap_inner returns to t_bootstrap, join returns before the PyThreadState for the thread has been destroyed by t_bootstrap. This is mostly harmless in general use, as __bootstrap_inner eventually gets scheduled again and the PyThreadState is taken care of. However, Py_EndInterpreter [Python/pythonrun.c] may only be called when the main interpreter thread is the only one left running. So when we want to call Py_EndInterpreter, we signal every other thread to stop and join them; when Thread.join returns, we call Py_EndInterpreter. Py_EndInterpreter checks whether any other PyThreadStates are still around and, if so, raises a Py_FatalError. As a workaround, we now do a sleep after join.
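For illustration, here is a minimal sketch of the join-plus-sleep workaround described above (hypothetical embedder-side Python, not code from the report; the actual Py_EndInterpreter call happens afterwards on the C side):

```python
import threading
import time

# Ask every non-main thread to finish and join it (the application-specific
# "please stop" signalling is omitted here).
for t in threading.enumerate():
    if t is not threading.current_thread():
        t.join()

# Workaround: even though join() returned, t_bootstrap may not have destroyed
# each thread's PyThreadState yet, so give it a moment before the embedding
# application calls Py_EndInterpreter() from C.
time.sleep(1)
```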
The bug was found in 2.6.5, but after a quick eyeballing, current HEAD has the same problem.
I *think* we could remove that limitation in Py_EndInterpreter(). After all, Py_Finalize() is able to join() non-daemon threads, so there's no reason Py_EndInterpreter() shouldn't allow that too. We must keep the fatal error for daemon threads, though. As for ensuring the thread state is destroyed before Thread.join() returns, I don't know how to do it: the thread state isn't a PyObject, we can't access it from Python code.
(Actually, it would almost be possible to combine weakrefs and thread locals to ensure that join() only returns when the thread state is cleared. However, the thread state is cleared before it is untied from the interpreter, and there would still be a small window for a race condition if some destructor called by clearing the thread state released the GIL.)
(Of course, the latter problem can be solved by having a dedicated sentinel in the thread state that gets DECREF'ed only *after* the thread state is removed from the interpreter...)
Removing the last-thread check sounds good; what we actually need is a Py_EndInterpreter that does not casually abort() on us - we don't really care about PyThreadStates that much.
Here is a patch for 3.3/3.4.
After a quick glance, I can't see how this patch would fix the problem. It still depends on threading's Thread.join, which is affected by the race condition in __bootstrap_inner. We already did a Thread.join before calling Py_EndInterpreter and still got bitten by the race.
Well, that's a good point. It does bring subinterpreters in line with the main interpreter when it comes to automatically joining non-daemon threads, but it doesn't solve the race condition you talked about. I forgot a bit too fast about it :-)
Here is a patch to remove the race condition. The patch is sufficiently delicate that I'd rather reserve this for 3.4, though.
The first patch looks good; as for the second one, it'll take some time :-)
New changeset becbb65074e1 by Antoine Pitrou in branch 'default':
Ok, I've committed the first patch.
Nasty problem ;-) I don't understand the need for all the indirections in the second patch. Like, why use a weakref? It's not like we have to worry about an immortal tstate keeping a measly little lock object alive forever, right? Seems to me that if the tstate object held the new lock directly, the tstate destructor could release the lock directly (and so also skip the new tstate->on_delete and tstate->on_delete_data indirections too). Then again, I'd settle for Py_EndInterpreter simply sleeping for a second and trying again when it finds "another thread" hanging around (effectively moving Tamas's sleep() into Py_EndInterpreter, but only sleeping if needed). Yes, that's theoretically insecure. But if we're worried about wildly improbable timing problems, then the patched code can appear not to honor a non-None Thread.join() timeout. But, ya, it's better to really fix it.
The problem is if the tstate holds the last reference to the lock. [...] You may suggest we only keep the raw Py_thread_lock in the tstate, rather than a full Python lock object.
Well, that sounds less predictable. Depending on machine load, a fixed sleep may not be long enough.
Ah, well, good point. It's weird it didn't get caught by the unit tests.
Ok, here is a new patch observing the timeout parameter. I've changed the implementation strategy a bit; hopefully it is sane. There's a remaining (slightly borderline) issue: the thread state lock isn't reinitialized on fork(). Should it be?
I'm getting a headache now - LOL ;-) Thanks for the explanation!

What I still don't understand: the new lock is an internal implementation detail. How would it gain a weakref with a callback? Users aren't going to mess with this lock, and if you want to stop Python maintainers from giving it a weakref with a callback, simply say they shouldn't do that (in the code comments) - you could even add code verifying it doesn't have any weakrefs outstanding (although that would likely be a waste of time and code: no maintainer is going to _want_ to make a weakref to it, let alone a weakref with a callback).

My concern is the bulk and obscurity of this code, all to plug such a minor hole. I call it "minor" because it's been reported once in the history of the project, and Tamas wormed around it with a 1-liner (added a sleep). Granted, it's much harder to fix "for real", especially when most of the interpreter has been destroyed ;-)
Well... perhaps I'm being too perfectionist, but I don't want Python [...]

Yeah, the overall concern is a bit obscure, but still: right now, if [...]
Oh, I'm not opposed, I'm just complaining ;-) It would be much nicer to have an approach that worked for all thread users, not just threading.Thread users. For example, a user can easily (well, plausibly) get into the same kinds of troubles here by calling _thread.start_new_thread() directly, then waiting for their threads "to end" before letting the program finish - they have no idea either when their tstates are actually destroyed. A high-probability way to "appear to fix" this for everyone could change Py_EndInterpreter's

```c
if (tstate != interp->tstate_head || tstate->next != NULL)
    Py_FatalError("Py_EndInterpreter: not the last thread");
```

to something like

```c
int count = 0;
while (tstate != interp->tstate_head || tstate->next != NULL) {
    ++count;
    if (count > SOME_MAGIC_VALUE)
        Py_FatalError("Py_EndInterpreter: not the last thread");
    sleep(SOME_SHORT_TIME);
}
```

In the meantime ;-), you should change this part of the new .join() code:

```python
if endtime is not None:
    waittime = endtime - _time()
    if not lock.acquire(timeout=waittime):
        return
```

The problem here is that we have no idea how much time may have elapsed before computing the new waittime, so nothing guarantees that it is still positive by the time it is passed to lock.acquire().
Oops! The docs are wrong - a negative timeout actually raises "ValueError: timeout value must be strictly positive", unless the timeout is exactly -1. All the more reason to ensure that a negative waittime isn't passed. I opened a different issue about the doc glitch.
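A minimal sketch of the kind of guard being asked for here (a hypothetical helper, not the committed code): never hand lock.acquire() a negative timeout, bail out instead when the deadline has already passed.

```python
from time import monotonic as _time

def _acquire_with_deadline(lock, endtime):
    """Acquire `lock`, giving up once `endtime` (an absolute time) has passed.

    Returns True if the lock was acquired.  Guards against passing a negative
    timeout to lock.acquire(), which would raise ValueError.
    """
    if endtime is None:
        return lock.acquire()          # no deadline: block until acquired
    waittime = endtime - _time()
    if waittime <= 0:
        return False                   # deadline already passed; don't even try
    return lock.acquire(timeout=waittime)
```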
Fudge - there's another unlikely problem here. For example: main program creates a threading.Thread t, runs it, and does t.join(5) (whatever - any timeout value). When t.join() returns, the main program has no idea whether t is done or not. Suppose t isn't done, but finishes _bootstrap_inner a microsecond (whatever) later. Now the user follows the doc's advice, and checks t.is_alive() to see whether it still needs to join t. t.is_alive() returns False! _bootstrap_inner() already finished, so the t._stopped event is already set. So the main program skips doing another t.join(), and lets the program exit. There's no guarantee then that t's tstate has been cleared, and we can be back to the same fatal error that started this.

Of course this has nothing to do with the patch switching from a Condition to an Event to manage the stopped state (good change!) - it has to do with the fact that, no matter how it's coded, _bootstrap_inner() says "this thread is done" before the tstate is unlinked from the chain.

For a related reason, note that Python's threading._shutdown() doesn't prevent the fatal shutdown error even after the patch! _pickSomeNonDaemonThread() also ignores threads for which is_alive() returns False. As above, that can return False while the tstate is still in the chain, and then threading._shutdown() never joins such a thread. join() can't fix the problem if it's not called. I suppose that one can be repaired by removing the is_alive() check in _pickSomeNonDaemonThread() (i.e., always join() every non-daemon thread).

In their favor, the probabilistic "sleep" approaches would almost always "fix" these too ;-)
The _thread module is a private API these days; I wouldn't worry too much about that. Are there reasons *not* to use the threading module (except CPython bootstrap issues that normal users need not bother about)? Also, how would they wait for the thread to end, except by triggering their own Event object?
Thanks, good point :-o As for the is_alive() method, mmh, ok, it needs to be tweaked too. That will make the patch more interesting.
So you're not concerned about a now-private API (which used to be advertised), but are concerned about a user mucking with a new private lock in an exceedingly unlikely (in the absence of malice) way. That clarifies things ;-) I'm not really concerned about either. User perversity knows no limits, though, so I wouldn't be surprised if some people are rolling their own higher-level threads in Python for reasons they think are compelling. Details don't really matter to that, right? Like:

> how would they wait for the thread to end, except by triggering their own Event object?

Any number of ways. Roll their own Event, roll their own Barrier, roll their own Condition, or even something as simple as keeping an integer count of the number of threads they created, and then (e.g.)

```python
while nthreads:
    time.sleep(1)
```

at the end, with each thread doing an `nthreads -= 1` (or the equivalent) in its end-of-life code. Essentially rolling their own clumsy variant of a Semaphore. I've seen code like that "in the wild". But, no, I'll join you in not worrying about it ;-)
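A runnable illustration of that home-grown wait (hypothetical user code, not from the stdlib): raw `_thread` threads plus an integer counter standing in for a Semaphore, which is exactly the kind of user that never learns when its tstates are destroyed.

```python
import _thread
import time

nthreads = 0
count_lock = _thread.allocate_lock()   # protects nthreads

def worker():
    global nthreads
    time.sleep(0.1)                    # pretend to do some real work
    with count_lock:
        nthreads -= 1                  # end-of-life bookkeeping

for _ in range(4):
    with count_lock:
        nthreads += 1
    _thread.start_new_thread(worker, ())

while nthreads:                        # "wait for all threads to end"
    time.sleep(1)
# Even when this loop exits, the workers' PyThreadStates may not have been
# destroyed yet - which is exactly the problem discussed in this issue.
```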
On Friday, 6 September 2013 at 00:19 +0000, Tim Peters wrote:

:-)

I guess they spell it like `import clumsy_threading as threading`.

Well, I've sure seen my share of sleep() calls as a synchronization mechanism.
All the timeout args are great! I wish Python had always had them :-) Back to the pain at hand, it's not the number of lines of code that rubs me the wrong way, but the sheer obscurity of it all. This handful of lines is - of necessity - sprayed across two C code files, a C header file, and a Python module. That makes it very hard (for anyone but you - LOL) to reconstruct the _intent_ of it all. I'm adding another patch, which is your threadstate_join_2.patch but with a new comment block (in pystate.h) spelling out the rationale behind it all. I can live with that ;-)
New patch (threadstate_join_4.patch) refactors so that is_alive() returns True until the tstate is cleared. This simplifies join() a lot (it doesn't have to roll its own timeout implementation anymore), but complicates is_alive(). Caution: I don't know anything about the dummy threading implementation. So if I broke that, sue me ;-)
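A hedged sketch of the idea described here (not the patch itself; the attribute names _is_stopped, _started and _tstate_lock follow the naming used elsewhere in this thread). The point is that is_alive() only answers False once the lock released by the C code at tstate destruction has been observed released. Both functions are meant as methods of threading.Thread:

```python
def is_alive(self):
    # Not started yet, or already known to be completely gone?
    if self._is_stopped or not self._started.is_set():
        return False
    # Non-blocking probe of the lock that the C runtime releases when it
    # destroys this thread's PyThreadState; if we can take it, the thread
    # is truly done, C part included.
    self._wait_for_tstate_lock(block=False)
    return not self._is_stopped

def _wait_for_tstate_lock(self, block=True, timeout=-1):
    lock = self._tstate_lock
    if lock is None:
        # Already determined earlier that the C side is finished.
        return
    if lock.acquire(block, timeout):
        lock.release()
        self._is_stopped = True
        self._tstate_lock = None
```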
New patch threadstate_join_5.patch adds more testing of is_alive(). An inelegance I don't care about (but someone might): if join() is called with a timeout, and the Python part of the thread ends before the timeout expires (stopped gets set), then a non-blocking attempt to acquire _tstate_lock is made, and join() returns regardless of the outcome. So, with a timeout, it's possible for join() to return before the C part of the thread is done even if the timeout isn't yet exhausted. That's unlikely, and I don't see that it makes any difference. Anyone doing a join() with a timeout has to be aware that they have no idea whether the thread is done when join() returns, and do another join() or check is_alive() later. I'd rather not complicate the code to wait longer for _tstate_lock in this case.
New changeset d52b68edbca6 by Antoine Pitrou in branch 'default':
Ok, this should be finally fixed!
Ha, apparently test_is_alive_after_fork still fails on old Linux kernels: http://buildbot.python.org/all/builders/x86%20Ubuntu%20Shared%203.x/builds/8562
Figures. That's why I wanted your name on the blamelist ;-)
You mean I actually need to pay proper attention now? :)
Slavek - this is one you may want to take a look at. Antoine and Tim are trying to fix a race condition with the destruction of thread state objects, but the current fix isn't playing nice on older Linux kernels (including the one in RHEL 6).
New changeset 5cfd7b2eb994 by Tim Peters in branch 'default':
Just pushed 5cfd7b2eb994 in a poke-and-hope blind attempt to repair the annoying ;-) buildbot failures.
Doesn't look like 5cfd7b2eb994 is going to fix it :-( So I'll revert it. Attaching the patch as blind.patch. After that patch, is_alive() only looks at Thread data members, where ._is_stopped "is obviously" True and ._tstate_lock "is obviously" None, both forced by threading.py's _after_fork() function. No theory as to how that can be screwed up.
Weird! The Ubuntu box passed test_is_alive_after_fork on its *second* run with the patch. The other box passed all tests: http://buildbot.python.org/all/builders/x86%20RHEL%206%203.x/builds/2667 So I won't revert it after all - but the Ubuntu box's behavior is still a major mystery :-(
Well, the next time the Ubuntu box ran the tests, it was clean all the way. So it's fixed! Despite that it isn't ;-)
Indeed the Ubuntu Shared buildbot started failing again. There's probably a timing-dependent behaviour here (which is why test_is_alive_after_fork() tries several times, after all).
I think I've found the answer: the thread is sometimes already stopped by the time the child is forked, so it doesn't appear in _enumerate() anymore (it left the _active dict). Therefore its locks are not reset in _after_fork(). Oh, I also get the following sporadic failure, which is triggered by the slight change in semantics of Thread.join(timeout) :-)

```
======================================================================
Traceback (most recent call last):
  File "/home/antoine/cpython/default/Lib/test/test_threading.py", line 113, in test_various_ops
    self.assertTrue(not t.is_alive())
AssertionError: False is not true
```
New changeset 74dc664ad699 by Antoine Pitrou in branch 'default':

Adding reference to failing tests on koobs-freebsd9 and koobs-freebsd10 buildbots:

```
======================================================================
Traceback (most recent call last):
  File "/usr/home/buildbot/koobs-freebsd10/3.x.koobs-freebsd10/build/Lib/test/test_threading.py", line 478, in test_is_alive_after_fork
    self.assertEqual(0, status)
AssertionError: 0 != 256
```
[Antoine]
> Traceback (most recent call last):
>   File "/home/antoine/cpython/default/Lib/test/test_threading.py", line 113, in test_various_ops
>     self.assertTrue(not t.is_alive())
> AssertionError: False is not true

Really! In context, the test does:

```python
t.join()
self.assertTrue(not t.is_alive())
```

(BTW, that would be clearer as `self.assertFalse(t.is_alive())` ;-) )

It was the intent that this continue to work - the only intended change in Python-visible semantics had to do with join'ing with a timeout. Without a timeout, I confess I don't see how this can fail. join() is join(timeout=None), which does:

```python
self._stopped.wait(timeout)
if self._stopped.is_set():
    self._wait_for_tstate_lock(timeout is None)
```

which is

```python
self._stopped.wait(None)
if self._stopped.is_set():
    self._wait_for_tstate_lock(True)
```

which should be the same as

```python
self._stopped.wait()
self._wait_for_tstate_lock(True)
```

after which _stopped should be set and _tstate_lock should be None. The subsequent is_alive() should then return False, via its

```python
return self._tstate_lock is not None
```

What's going wrong?
On Sunday, 8 September 2013 at 17:30 +0000, Tim Peters wrote:

Ah, no, the failing test did t.join(NUMTASKS) (a join with a timeout), not a bare t.join().

Yes, old coding style.
Ah - the test used to do t.join(NUMTASKS)! That's just bizarre ;-) I believe I can repair that too (well - there was never a _guarantee_ that waiting 10 seconds would be long enough), but I'll wait until this all settles down.

join() and is_alive() are too complicated now, because of the 2-step dance to check whether the thread is done: we have both an Event (_stopped) and a lock (_tstate_lock) to check now. The Event doesn't serve a purpose anymore: it's almost always uninteresting to know _just_ that the Python part of the thread has ended. The only exception I can see is the perverse case of joining the main thread, done in some of the tests (in that case we have to claim the main thread is done even though its tstate is still active).

Anyway, after getting rid of the Event it should be dead easy to make join(10) "appear to work the same as before, despite that it never really worked ;-)".
Yes, that crossed my mind too. The difficulty is that only a plain lock [...]

(After all, it's an event that's never reset, which simplifies things.)

(Also, why is the current Event implementation based on Condition? Isn't [...])
Without _stopped, join() can simply wait to acquire _tstate_lock (with or without a timeout, and skipping this if _tstate_lock is already None). Etc ;-) Of course details matter, but it's easy. I did it once, but the tests joining the main thread failed and I put the code on hold. I'll dust it off when the buildbots are all happy with the current changes.
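A minimal sketch of the simplification described here (reusing the hypothetical _wait_for_tstate_lock helper from the earlier sketch; not the committed code): with the Event gone, join() reduces to a wait on the tstate lock, clamping the timeout so a negative value is never passed down to acquire().

```python
def join(self, timeout=None):
    if not self._started.is_set():
        raise RuntimeError("cannot join thread before it is started")
    if self is current_thread():
        raise RuntimeError("cannot join current thread")
    if timeout is None:
        # Block until the C code releases the tstate lock.
        self._wait_for_tstate_lock()
    else:
        # A negative timeout would make acquire() raise, so clamp it to 0.
        self._wait_for_tstate_lock(timeout=max(timeout, 0))
```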
We'd have to ask Guido ;-) Best guess is that Condition supplied all the machinery to make Event.wait() work correctly, including waking all waiters up when the Event gets set.
Events are indeed simple :-) There are many ways to implement them, but "ain't broke, don't fix" seems the right approach to me here. In effect, if we get rid of _stopped, the code remaining will be much like an Event implementation built on the plain _tstate_lock lock.
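To illustrate that remark, here is a minimal sketch (a hypothetical class, not something proposed for the stdlib) of a one-shot event built on a plain lock: "set" is represented by the lock being released, which is exactly the shape _tstate_lock ends up having.

```python
import _thread

class OneShotEvent:
    """A set-once event built on a plain lock (illustrative sketch only)."""

    def __init__(self):
        self._lock = _thread.allocate_lock()
        self._lock.acquire()          # held while the event is "unset"

    def set(self):
        self._lock.release()          # may only be called once

    def wait(self, timeout=-1):
        # -1 means "wait forever"; any waiter that gets the lock releases it
        # again immediately so other waiters can also proceed.
        if self._lock.acquire(timeout=timeout):
            self._lock.release()
            return True
        return False
```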
Ah, of course. The main thread needs the event, since its thread state is only deleted at the very end of interpreter shutdown.
I think it will be easier than that, but we'll see ;-)
New changeset aff959a3ba13 by Tim Peters in branch 'default':
_thread._set_sentinel() and threading.Thread._tstate_lock are a great enhancement, as is Py_EndInterpreter(), which now calls threading._shutdown(). FYI I found yet another race condition, this time in threading._shutdown(). See the bpo-36402 follow-up: "threading._shutdown() race condition: test_threading test_threads_join_2() fails randomly".
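For reference, a sketch of how the pieces mentioned here fit together, based on the description in this thread: _thread._set_sentinel() returns a lock that the interpreter releases when the calling thread's PyThreadState is destroyed, and the Thread object keeps it as _tstate_lock (the helper below is a simplified illustration, not a verbatim copy of Lib/threading.py).

```python
import _thread

def _set_tstate_lock(self):
    # Meant to run early in the newly started thread's bootstrap code.
    # The sentinel lock starts out released; acquire it here, in the new
    # thread.  The C runtime releases it again when this thread's
    # PyThreadState is destroyed - that release is what join() and
    # is_alive() ultimately wait for.
    self._tstate_lock = _thread._set_sentinel()
    self._tstate_lock.acquire()
```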