Thread.join returns before PyThreadState is destroyed #63008
When a thread is started in CPython, t_bootstrap [Modules/threadmodule.c] creates a PyThreadState object and then calls Thread.__bootstrap_inner [Lib/threading.py], which calls Thread.run protected by self.__block, a Condition. Thread.join uses the same __block to block until Thread.run has finished: when Thread.run finishes, __bootstrap_inner notifies on __block, so join will return. Here lies a race condition: if a thread switch to Thread.join occurs before __bootstrap_inner returns to t_bootstrap, join returns before the PyThreadState for the thread has been destroyed by t_bootstrap. This is mostly harmless in general use, as __bootstrap_inner eventually gets scheduled again and the PyThreadState is taken care of. However, Py_EndInterpreter [Python/pythonrun.c] may only be called when the main interpreter thread is the only one left running. So when we want to call Py_EndInterpreter, we signal every other thread to stop and join them; when Thread.join returns, we call Py_EndInterpreter. Py_EndInterpreter checks whether any other PyThreadStates are still around and, if so, raises a Py_FatalError. As a workaround, we now do a sleep after join.
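For illustration, here is a minimal sketch of the join-plus-sleep workaround described above (hypothetical embedder-side Python, not code from the report; the actual Py_EndInterpreter call happens afterwards on the C side):

```python
import threading
import time

# Ask every non-main thread to finish and join it (the application-specific
# "please stop" signalling is omitted here).
for t in threading.enumerate():
    if t is not threading.current_thread():
        t.join()

# Workaround: even though join() returned, t_bootstrap may not have destroyed
# each thread's PyThreadState yet, so give it a moment before the embedding
# application calls Py_EndInterpreter() from C.
time.sleep(1)
```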
The bug was found in 2.6.5, but after a quick eyeballing, current HEAD has the same problem.
I *think* we could remove that limitation in Py_EndInterpreter(). After all, Py_Finalize() is able to join() non-daemon threads, so there's no reason Py_EndInterpreter() shouldn't allow that too. We must keep the fatal error for daemon threads, though. As for ensuring the thread state is destroyed before Thread.join() returns, I don't know how to do it: the thread state isn't a PyObject, we can't access it from Python code.
(Actually, it would almost be possible to combine weakrefs and thread locals to ensure that join() only returns when the thread state is cleared. However, the thread state is cleared before it is untied from the interpreter, and there would still be a small window for a race condition if some destructor called by clearing the thread state released the GIL.)
(Of course, the latter problem can be solved by having a dedicated sentinel in the thread state that gets DECREF'ed only *after* the thread state is removed from the interpreter...)
Removing the last-thread check sounds good; what we actually need is a Py_EndInterpreter that does not casually abort() on us - we don't really care about PyThreadStates that much.
Here is a patch for 3.3/3.4.
After a quick glance, I can't see how this patch would fix the problem. It still depends on threading's Thread.join, which is affected by the race condition in __bootstrap_inner. We already did a Thread.join before calling Py_EndInterpreter and still got bitten by the race.
Well, that's a good point. It does bring subinterpreters in line with the main interpreter when it comes to automatically joining non-daemon threads, but it doesn't solve the race condition you talked about. I forgot a bit too fast about it :-)
Here is a patch to remove the race condition. The patch is sufficiently delicate that I'd rather reserve this for 3.4, though.
The first patch looks good; as for the second one, it'll take some time :-)
New changeset becbb65074e1 by Antoine Pitrou in branch 'default':
Ok, I've committed the first patch.
Nasty problem ;-) I don't understand the need for all the indirections in the second patch. Like, why use a weakref? It's not like we have to worry about an immortal tstate keeping a measly little lock object alive forever, right? Seems to me that if the tstate object held the new lock directly, the tstate destructor could release the lock directly (and so also skip the new tstate->on_delete and tstate->on_delete_data indirections too). Then again, I'd settle for Py_EndInterpreter simply sleeping for a second and trying again when it finds "another thread" hanging around (effectively moving Tamas's sleep() into Py_EndInterpreter, but only sleeping if needed). Yes, that's theoretically insecure. But if we're worried about wildly improbable timing problems, then the patched code can appear not to honor a non-None Thread.join() timeout. But, ya, it's better to really fix it.
The problem is if the tstate holds the last reference to the lock. [...] You may suggest we only keep the raw Py_thread_lock in the tstate, rather than a full Python lock object.
Well, that sounds less predictable. Depending on machine load, a fixed sleep may not be long enough.
Ah, well, good point. It's weird it didn't get caught by the unit tests.
Ok, here is a new patch observing the timeout parameter. I've changed the implementation strategy a bit; hopefully it is sane. There's a remaining (slightly borderline) issue: the thread state lock isn't reinitialized on fork(). Should it be?
I'm getting a headache now - LOL ;-) Thanks for the explanation!

What I still don't understand: the new lock is an internal implementation detail. How would it gain a weakref with a callback? Users aren't going to mess with this lock, and if you want to stop Python maintainers from giving it a weakref with a callback, simply say they shouldn't do that (in the code comments) - you could even add code verifying it doesn't have any weakrefs outstanding (although that would likely be a waste of time and code: no maintainer is going to _want_ to make a weakref to it, let alone a weakref with a callback).

My concern is the bulk and obscurity of this code, all to plug such a minor hole. I call it "minor" because it's been reported once in the history of the project, and Tamas wormed around it with a 1-liner (added a sleep). Granted, it's much harder to fix "for real", especially when most of the interpreter has been destroyed ;-)
Well... perhaps I'm being too perfectionist, but I don't want Python [...]

Yeah, the overall concern is a bit obscure, but still: right now, if [...]
Oh, I'm not opposed, I'm just complaining ;-) It would be much nicer to have an approach that worked for all thread users, not just threading.Thread users. For example, a user can easily (well, plausibly) get into the same kinds of troubles here by calling _thread.start_new_thread() directly, then waiting for their threads "to end" before letting the program finish - they have no idea either when their tstates are actually destroyed. A high-probability way to "appear to fix" this for everyone could change Py_EndInterpreter's

```c
if (tstate != interp->tstate_head || tstate->next != NULL)
    Py_FatalError("Py_EndInterpreter: not the last thread");
```

to something like

```c
int count = 0;
while (tstate != interp->tstate_head || tstate->next != NULL) {
    ++count;
    if (count > SOME_MAGIC_VALUE)
        Py_FatalError("Py_EndInterpreter: not the last thread");
    sleep(SOME_SHORT_TIME);
}
```

In the meantime ;-), you should change this part of the new .join() code:

```python
if endtime is not None:
    waittime = endtime - _time()
    if not lock.acquire(timeout=waittime):
        return
```

The problem here is that we have no idea how much time may have elapsed before computing the new waittime, so nothing guarantees that it is still positive by the time it is passed to lock.acquire().
Oops! The docs are wrong - a negative timeout actually raises "ValueError: timeout value must be strictly positive", unless the timeout is exactly -1. All the more reason to ensure that a negative waittime isn't passed. I opened a different issue about the doc glitch.
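A minimal sketch of the kind of guard being asked for here (a hypothetical helper, not the committed code): never hand lock.acquire() a negative timeout, bail out instead when the deadline has already passed.

```python
from time import monotonic as _time

def _acquire_with_deadline(lock, endtime):
    """Acquire `lock`, giving up once `endtime` (an absolute time) has passed.

    Returns True if the lock was acquired.  Guards against passing a negative
    timeout to lock.acquire(), which would raise ValueError.
    """
    if endtime is None:
        return lock.acquire()          # no deadline: block until acquired
    waittime = endtime - _time()
    if waittime <= 0:
        return False                   # deadline already passed; don't even try
    return lock.acquire(timeout=waittime)
```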
Fudge - there's another unlikely problem here. For example: main program creates a threading.Thread t, runs it, and does t.join(5) (whatever - any timeout value). When t.join() returns, the main program has no idea whether t is done or not. Suppose t isn't done, but finishes _bootstrap_inner a microsecond (whatever) later. Now the user follows the doc's advice, and checks t.is_alive() to see whether it still needs to join t. t.is_alive() returns False! _bootstrap_inner() already finished, so the t._stopped event is already set. So the main program skips doing another t.join(), and lets the program exit. There's no guarantee then that t's tstate has been cleared, and we can be back to the same fatal error that started this.

Of course this has nothing to do with the patch switching from a Condition to an Event to manage the stopped state (good change!) - it has to do with the fact that, no matter how it's coded, _bootstrap_inner() says "this thread is done" before the tstate is unlinked from the chain.

For a related reason, note that Python's threading._shutdown() doesn't prevent the fatal shutdown error even after the patch! _pickSomeNonDaemonThread() also ignores threads for which is_alive() returns False. As above, that can return False while the tstate is still in the chain, and then threading._shutdown() never joins such a thread. join() can't fix the problem if it's not called. I suppose that one can be repaired by removing the is_alive() check in _pickSomeNonDaemonThread() (i.e., always join() every non-daemon thread).

In their favor, the probabilistic "sleep" approaches would almost always "fix" these too ;-)
The _thread module is a private API these days; I wouldn't worry too much about that. Are there reasons *not* to use the threading module (except CPython bootstrap issues that normal users need not bother about)? Also, how would they wait for the thread to end, except by triggering their own Event object?
Thanks, good point :-o As for the is_alive() method, mmh, ok, it needs to be tweaked too. That will make the patch more interesting.
So you're not concerned about a now-private API (which used to be advertised), but are concerned about a user mucking with a new private lock in an exceedingly unlikely (in the absence of malice) way. That clarifies things ;-) I'm not really concerned about either. User perversity knows no limits, though, so I wouldn't be surprised if some people are rolling their own higher-level threads in Python for reasons they think are compelling. Details don't really matter to that, right? Like:

> how would they wait for the thread to end, except by triggering their own Event object?

Any number of ways. Roll their own Event, roll their own Barrier, roll their own Condition, or even something as simple as keeping an integer count of the number of threads they created, and then (e.g.)

```python
while nthreads:
    time.sleep(1)
```

at the end, with each thread doing an `nthreads -= 1` (or the equivalent) in its end-of-life code. Essentially rolling their own clumsy variant of a Semaphore. I've seen code like that "in the wild". But, no, I'll join you in not worrying about it ;-)
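A runnable illustration of that home-grown wait (hypothetical user code, not from the stdlib): raw `_thread` threads plus an integer counter standing in for a Semaphore, which is exactly the kind of user that never learns when its tstates are destroyed.

```python
import _thread
import time

nthreads = 0
count_lock = _thread.allocate_lock()   # protects nthreads

def worker():
    global nthreads
    time.sleep(0.1)                    # pretend to do some real work
    with count_lock:
        nthreads -= 1                  # end-of-life bookkeeping

for _ in range(4):
    with count_lock:
        nthreads += 1
    _thread.start_new_thread(worker, ())

while nthreads:                        # "wait for all threads to end"
    time.sleep(1)
# Even when this loop exits, the workers' PyThreadStates may not have been
# destroyed yet - which is exactly the problem discussed in this issue.
```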
On Friday, 6 September 2013 at 00:19 +0000, Tim Peters wrote:

:-)

I guess they spell it like `import clumsy_threading as threading`.

Well, I've sure seen my share of sleep() calls as a synchronization mechanism.
All the timeout args are great! I wish Python had always had them :-) Back to the pain at hand, it's not the number of lines of code that rubs me the wrong way, but the sheer obscurity of it all. This handful of lines is - of necessity - sprayed across two C code files, a C header file, and a Python module. That makes it very hard (for anyone but you - LOL) to reconstruct the _intent_ of it all. I'm adding another patch, which is your threadstate_join_2.patch but with a new comment block (in pystate.h) spelling out the rationale behind it all. I can live with that ;-)
New patch (threadstate_join_4.patch) refactors so that is_alive() returns True until the tstate is cleared. This simplifies join() a lot (it doesn't have to roll its own timeout implementation anymore), but complicates is_alive(). Caution: I don't know anything about the dummy threading implementation. So if I broke that, sue me ;-)
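A hedged sketch of the idea described here (not the patch itself; the attribute names _is_stopped, _started and _tstate_lock follow the naming used elsewhere in this thread). The point is that is_alive() only answers False once the lock released by the C code at tstate destruction has been observed released. Both functions are meant as methods of threading.Thread:

```python
def is_alive(self):
    # Not started yet, or already known to be completely gone?
    if self._is_stopped or not self._started.is_set():
        return False
    # Non-blocking probe of the lock that the C runtime releases when it
    # destroys this thread's PyThreadState; if we can take it, the thread
    # is truly done, C part included.
    self._wait_for_tstate_lock(block=False)
    return not self._is_stopped

def _wait_for_tstate_lock(self, block=True, timeout=-1):
    lock = self._tstate_lock
    if lock is None:
        # Already determined earlier that the C side is finished.
        return
    if lock.acquire(block, timeout):
        lock.release()
        self._is_stopped = True
        self._tstate_lock = None
```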
New patch threadstate_join_5.patch adds more testing of is_alive(). An inelegance I don't care about (but someone might): if join() is called with a timeout, and the Python part of the thread ends before the timeout expires (stopped gets set), then a non-blocking attempt to acquire _tstate_lock is made, and join() returns regardless of the outcome. So, with a timeout, it's possible for join() to return before the C part of the thread is done even if the timeout isn't yet exhausted. That's unlikely, and I don't see that it makes any difference. Anyone doing a join() with a timeout has to be aware that they have no idea whether the thread is done when join() returns, and do another join() or check is_alive() later. I'd rather not complicate the code to wait longer for _tstate_lock in this case.
New changeset d52b68edbca6 by Antoine Pitrou in branch 'default':
Ok, this should be finally fixed!
Ha, apparently test_is_alive_after_fork still fails on old Linux kernels: http://buildbot.python.org/all/builders/x86%20Ubuntu%20Shared%203.x/builds/8562
Figures. That's why I wanted your name on the blamelist ;-)
You mean I actually need to pay proper attention now? :)
Slavek - this is one you may want to take a look at. Antoine and Tim are trying to fix a race condition with the destruction of thread state objects, but the current fix isn't playing nice on older Linux kernels (including the one in RHEL 6).
New changeset 5cfd7b2eb994 by Tim Peters in branch 'default':
Just pushed 5cfd7b2eb994 in a poke-and-hope blind attempt to repair the annoying ;-) buildbot failures.
Doesn't look like 5cfd7b2eb994 is going to fix it :-( So I'll revert it. Attaching the patch as blind.patch. After that patch, is_alive() only looks at Thread data members, where ._is_stopped "is obviously" True and ._tstate_lock "is obviously" None, both forced by threading.py's _after_fork() function. No theory as to how that can be screwed up.
Weird! The Ubuntu box passed test_is_alive_after_fork on its *second* run with the patch. The other box passed all tests: http://buildbot.python.org/all/builders/x86%20RHEL%206%203.x/builds/2667 So I won't revert it after all - but the Ubuntu box's behavior is still a major mystery :-(
Well, the next time the Ubuntu box ran the tests, it was clean all the way. So it's fixed! Despite that it isn't ;-)
Indeed the Ubuntu Shared buildbot started failing again. There's probably a timing-dependent behaviour here (which is why test_is_alive_after_fork() tries several times, after all).
I think I've found the answer: the thread is sometimes already stopped by the time the child is forked, so it doesn't appear in _enumerate() anymore (it left the _active dict). Therefore its locks are not reset in _after_fork(). Oh, I also get the following sporadic failure, which is triggered by the slight change in semantics of Thread.join(timeout) :-)

```
======================================================================
Traceback (most recent call last):
  File "/home/antoine/cpython/default/Lib/test/test_threading.py", line 113, in test_various_ops
    self.assertTrue(not t.is_alive())
AssertionError: False is not true
```
New changeset 74dc664ad699 by Antoine Pitrou in branch 'default':

Adding reference to failing tests on koobs-freebsd9 and koobs-freebsd10 buildbots:

```
======================================================================
Traceback (most recent call last):
  File "/usr/home/buildbot/koobs-freebsd10/3.x.koobs-freebsd10/build/Lib/test/test_threading.py", line 478, in test_is_alive_after_fork
    self.assertEqual(0, status)
AssertionError: 0 != 256
```
[Antoine]
> Traceback (most recent call last):
>   File "/home/antoine/cpython/default/Lib/test/test_threading.py", line 113, in test_various_ops
>     self.assertTrue(not t.is_alive())
> AssertionError: False is not true

Really! In context, the test does:

```python
t.join()
self.assertTrue(not t.is_alive())
```

(BTW, that would be clearer as `self.assertFalse(t.is_alive())` ;-) )

It was the intent that this continue to work - the only intended change in Python-visible semantics had to do with join'ing with a timeout. Without a timeout, I confess I don't see how this can fail. join() is join(timeout=None), which does:

```python
self._stopped.wait(timeout)
if self._stopped.is_set():
    self._wait_for_tstate_lock(timeout is None)
```

which is

```python
self._stopped.wait(None)
if self._stopped.is_set():
    self._wait_for_tstate_lock(True)
```

which should be the same as

```python
self._stopped.wait()
self._wait_for_tstate_lock(True)
```

after which _stopped should be set and _tstate_lock should be None. The subsequent is_alive() should then return False, via its

```python
return self._tstate_lock is not None
```

What's going wrong?
On Sunday, 8 September 2013 at 17:30 +0000, Tim Peters wrote:

Ah, no, the failing test did t.join(NUMTASKS) (a join with a timeout), not a bare t.join().

Yes, old coding style.
Ah - the test used to do t.join(NUMTASKS)! That's just bizarre ;-) I believe I can repair that too (well - there was never a _guarantee_ that waiting 10 seconds would be long enough), but I'll wait until this all settles down.

join() and is_alive() are too complicated now, because of the 2-step dance to check whether the thread is done: we have both an Event (_stopped) and a lock (_tstate_lock) to check now. The Event doesn't serve a purpose anymore: it's almost always uninteresting to know _just_ that the Python part of the thread has ended. The only exception I can see is the perverse case of joining the main thread, done in some of the tests (in that case we have to claim the main thread is done even though its tstate is still active).

Anyway, after getting rid of the Event it should be dead easy to make join(10) "appear to work the same as before, despite that it never really worked ;-)".
Yes, that crossed my mind too. The difficulty is that only a plain lock [...]

(After all, it's an event that's never reset, which simplifies things.)

(Also, why is the current Event implementation based on Condition? Isn't [...])
Without _stopped, join() can simply wait to acquire _tstate_lock (with or without a timeout, and skipping this if _tstate_lock is already None). Etc ;-) Of course details matter, but it's easy. I did it once, but the tests joining the main thread failed and I put the code on hold. I'll dust it off when the buildbots are all happy with the current changes.
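A minimal sketch of the simplification described here (reusing the hypothetical _wait_for_tstate_lock helper from the earlier sketch; not the committed code): with the Event gone, join() reduces to a wait on the tstate lock, clamping the timeout so a negative value is never passed down to acquire().

```python
def join(self, timeout=None):
    if not self._started.is_set():
        raise RuntimeError("cannot join thread before it is started")
    if self is current_thread():
        raise RuntimeError("cannot join current thread")
    if timeout is None:
        # Block until the C code releases the tstate lock.
        self._wait_for_tstate_lock()
    else:
        # A negative timeout would make acquire() raise, so clamp it to 0.
        self._wait_for_tstate_lock(timeout=max(timeout, 0))
```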
We'd have to ask Guido ;-) Best guess is that Condition supplied all the machinery to make Event.wait() work correctly, including waking all waiters up when the Event gets set.
Events are indeed simple :-) There are many ways to implement them, but "ain't broke, don't fix" seems the right approach to me here. In effect, if we get rid of _stopped, the code remaining will be much like an Event implementation built on the plain _tstate_lock lock.
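To illustrate that remark, here is a minimal sketch (a hypothetical class, not something proposed for the stdlib) of a one-shot event built on a plain lock: "set" is represented by the lock being released, which is exactly the shape _tstate_lock ends up having.

```python
import _thread

class OneShotEvent:
    """A set-once event built on a plain lock (illustrative sketch only)."""

    def __init__(self):
        self._lock = _thread.allocate_lock()
        self._lock.acquire()          # held while the event is "unset"

    def set(self):
        self._lock.release()          # may only be called once

    def wait(self, timeout=-1):
        # -1 means "wait forever"; any waiter that gets the lock releases it
        # again immediately so other waiters can also proceed.
        if self._lock.acquire(timeout=timeout):
            self._lock.release()
            return True
        return False
```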
Ah, of course. The main thread needs the event, since its thread state is only deleted at the very end of interpreter shutdown.
I think it will be easier than that, but we'll see ;-)
New changeset aff959a3ba13 by Tim Peters in branch 'default':
_thread._set_sentinel() and threading.Thread._tstate_lock are a great enhancement, as is Py_EndInterpreter(), which now calls threading._shutdown(). FYI I found yet another race condition, this time in threading._shutdown(). See the bpo-36402 follow-up: "threading._shutdown() race condition: test_threading test_threads_join_2() fails randomly".
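For reference, a sketch of how the pieces mentioned here fit together, based on the description in this thread: _thread._set_sentinel() returns a lock that the interpreter releases when the calling thread's PyThreadState is destroyed, and the Thread object keeps it as _tstate_lock (the helper below is a simplified illustration, not a verbatim copy of Lib/threading.py).

```python
import _thread

def _set_tstate_lock(self):
    # Meant to run early in the newly started thread's bootstrap code.
    # The sentinel lock starts out released; acquire it here, in the new
    # thread.  The C runtime releases it again when this thread's
    # PyThreadState is destroyed - that release is what join() and
    # is_alive() ultimately wait for.
    self._tstate_lock = _thread._set_sentinel()
    self._tstate_lock.acquire()
```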