classification
Title: Thread.join returns before PyThreadState is destroyed
Type: crash Stage: resolved
Components: Interpreter Core, Library (Lib) Versions: Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Tamas.K, bkabrda, csernazs, grahamd, jcea, koobs, ncoghlan, neologix, pitrou, python-dev, tim.peters
Priority: normal Keywords: patch

Created on 2013-08-22 13:56 by Tamas.K, last changed 2013-09-09 20:08 by tim.peters. This issue is now closed.

Files
File name Uploaded Description Edit
subinterp_threads.patch pitrou, 2013-08-23 18:35 review
threadstate_join.patch pitrou, 2013-08-24 18:44 review
threadstate_join_2.patch pitrou, 2013-08-31 18:44 review
threadstate_join_3.patch tim.peters, 2013-09-07 00:49 review
threadstate_join_4.patch tim.peters, 2013-09-07 05:31 review
threadstate_join_5.patch tim.peters, 2013-09-07 18:32 added more is_alive() testing review
threadstate_join_6.patch pitrou, 2013-09-07 19:37 review
subinterp_test.patch pitrou, 2013-09-07 21:29 review
blind.patch tim.peters, 2013-09-08 02:51 noble attempt ;-) to repair buildbot failures
Messages (59)
msg195891 - (view) Author: Tamas K (Tamas.K) Date: 2013-08-22 13:56
When a thread is started in CPython, t_bootstrap [Modules/threadmodule.c] creates a PyThreadState object and then calls Thread.__bootstrap_inner [Lib/threading.py], which calls Thread.run, protected by self.__block, a Condition. Thread.join uses the same __block to block until Thread.run has finished.

When Thread.run has finished, __bootstrap_inner notifies on __block, so join will return. Here lies a race condition: if a thread switch to Thread.join occurs before __bootstrap_inner returns to t_bootstrap, join will return before the PyThreadState for the thread is destroyed by t_bootstrap.

It is mostly harmless for general use, as __bootstrap_inner eventually gets scheduled again and PyThreadState is taken care of.

However. Py_EndInterpreter [Python/pythonrun.c] can be called when only the main interpreter thread is running. So when we want to call Py_EndInterpreter, we signal every other thread to stop, and join them. And when Thread.join returns, we call Py_EndInterpreter. Py_EndInterpreter checks if there are any other PyThreadStates still around and does a Py_FatalError.

As a workaround, we now do a sleep after join.
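The shutdown sequence and the workaround can be sketched as follows (a minimal illustration; `shutdown_worker_threads` and the grace period are hypothetical, not the reporter's actual code):

```python
import threading
import time

def shutdown_worker_threads(threads, grace=0.1):
    """Join every worker, then sleep briefly.

    Even after join() returns, a thread's C-level PyThreadState may not
    yet be destroyed -- t_bootstrap only clears it after
    __bootstrap_inner() notifies the joiner.  The sleep() below is the
    workaround described in this report: give __bootstrap_inner() a
    chance to return to t_bootstrap before Py_EndInterpreter runs.
    """
    for t in threads:
        t.join()
    time.sleep(grace)  # workaround: let each tstate be torn down

# Example: a few short-lived workers.
workers = [threading.Thread(target=lambda: None) for _ in range(3)]
for w in workers:
    w.start()
shutdown_worker_threads(workers)
print(all(not w.is_alive() for w in workers))  # True
```

Without the final sleep, a host embedding Python could call Py_EndInterpreter immediately after the joins and still hit the fatal error.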
msg195895 - (view) Author: Tamas K (Tamas.K) Date: 2013-08-22 14:16
The bug was found in 2.6.5, but after a quick eyeballing, current HEAD has the same problem.
msg195906 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-22 15:30
I *think* we could remove that limitation in Py_EndInterpreter(). After all, Py_Finalize() is able to join() non-daemon threads, so there's no reason Py_EndInterpreter() shouldn't do the same. We must keep the fatal error for daemon threads, though.

As for ensuring the thread state is destroyed before Thread.join() returns, I don't know how to do it: the thread state isn't a PyObject, we can't access it from Python code.
msg195909 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-22 15:50
(Actually, it would almost be possible to combine weakrefs and thread locals to ensure that join() only returns when the thread state is cleared. However, the thread state is cleared before it is untied from the interpreter, and there would still be a small window for a race condition if some destructor called by clearing the thread state released the GIL.)
msg195910 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-22 15:52
(Of course, the latter problem can be solved by having a dedicated sentinel in the thread state that gets DECREF'efed only *after* the thread state is removed from the interpreter...)
msg195911 - (view) Author: Tamas K (Tamas.K) Date: 2013-08-22 16:05
Removing the last-thread check sounds good; what we actually need is a Py_EndInterpreter that does not casually abort() on us. We don't really care about PyThreadStates that much.
msg195999 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-23 18:35
Here is a patch for 3.3/3.4.
msg196076 - (view) Author: Tamas K (Tamas.K) Date: 2013-08-24 13:41
After a quick glance, I can't see how this patch would fix the problem. It still depends on threading's Thread.join, which is affected by the race condition in __bootstrap_inner.

We already did a Thread.join before calling Py_EndInterpreter and still got bitten by the race.
msg196077 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-24 13:46
Well, that's a good point. It does bring subinterpreters in line with the main interpreter when it comes to automatically joining non-daemon threads, but it doesn't solve the race condition you talked about. I forgot a bit too fast about it :-)
msg196085 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-24 18:44
Here is a patch to remove the race condition. The patch is sufficiently delicate that I'd rather reserve this for 3.4, though.
msg196115 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-08-25 09:01
The first patch looks good, as for the second one, it'll take some time :-)
msg196154 - (view) Author: Roundup Robot (python-dev) Date: 2013-08-25 17:48
New changeset becbb65074e1 by Antoine Pitrou in branch 'default':
Issue #18808: Non-daemon threads are now automatically joined when a sub-interpreter is shutdown (it would previously dump a fatal error).
http://hg.python.org/cpython/rev/becbb65074e1
msg196156 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-25 17:50
Ok, I've committed the first patch.
msg196611 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-08-31 02:41
Nasty problem ;-)

I don't understand the need for all the indirections in the second patch.  Like, why use a weakref?  It's not like we have to worry about an immortal tstate keeping a measly little lock object alive forever, right?  Seems to me that if the tstate object held the new lock directly, the tstate destructor could release the lock directly (and so also skip the new tstate->on_delete and tstate->on_delete_data indirections too).

Then again, I'd settle for Py_EndInterpreter simply sleeping for a second and trying again when it finds "another thread" hanging around (effectively moving Tamas's sleep() into Py_EndInterpreter, but only sleeping if needed).  Yes, that's theoretically insecure.

But if we're worried about wildly improbable timing problems, then the patched code can appear not to honor a non-None Thread.join() `timeout` argument too.  That is, the first call to the new `pred()` can theoretically consume any amount of time, due to its self._tstate_lock.acquire().  The same kinds of timing implausibilities that could cause a short sleep to fail could also cause that acquire() to take "a long time" (maybe even 2 seconds - LOL).

But, ya, it's better to really fix it.
msg196627 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-31 10:13
> I don't understand the need for all the indirections in the second
> patch.  Like, why use a weakref?  It's not like we have to worry about
> an immortal tstate keeping a measly little lock object alive forever,
> right?  Seems to me that if the tstate object held the new lock
> directly, the tstate destructor could release the lock directly (and
> so also skip the new tstate->on_delete and tstate->on_delete_data
> indirections too).

The problem is if the tstate holds the last reference to the lock. It
will destroy the lock at the end, but the lock could have weakrefs
attached to it with arbitrary callbacks. Those callbacks will run
without a current thread state and may crash. OTOH, this can't happen
with a private weakref.

You may suggest we only keep the Py_thread_lock in the tstate, rather
than the enclosing PyObject. But it has lifetime problems of its own
(the Py_thread_lock doesn't have a refcount, and it is shared with a
PyObject that thinks it owns it and will destroy it on dealloc... adding
the necessary logic to circumvent this would make the final solution
more complicated than the weakref one).

> Then again, I'd settle for Py_EndInterpreter simply sleeping for a
> second and trying again when it finds "another thread" hanging around
> (effectively moving Tamas's sleep() into Py_EndInterpreter, but only
> sleeping if needed).  Yes, that's theoretically insecure.

Well, that sounds less predictable. Depending on machine load,
Py_EndInterpreter might succeed or crash with a fatal error. Users may
not like this :-)

> But if we're worried about wildly improbable timing problems, then the
> patched code can appear not to honor a non-None Thread.join()
> `timeout` argument too.  That is, the first call to the new `pred()`
> can theoretically consume any amount of time, due to its
> self._tstate_lock.acquire().

Ah, well, good point. It's weird it didn't get caught by the unit
tests... I'll take a look.
msg196663 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-31 18:44
Ok, here is a new patch observing the timeout parameter. I've changed the implementation strategy a bit, it seems hopefully sane.

There's a remaining (slightly borderline) issue: the thread state lock isn't reinitialized on fork(). Should it be?
msg196672 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-08-31 21:01
I'm getting a headache now - LOL ;-)  Thanks for the explanation!

What I still don't understand:  the new lock is an internal implementation detail.  How would it gain a weakref with a callback?  Users aren't going to mess with this lock, and if you want to stop Python maintainers from giving it a weakref with a callback, simply say they shouldn't do that (in the code comments) - you could even add code verifying it doesn't have any weakrefs outstanding (although that would likely be a waste of time and code:  no maintainer is going to _want_ to make a weakref to it, let alone a weakref with a callback).

My concern is the bulk and obscurity of this code, all to plug such a minor hole.  I call it "minor" because it's been reported once in the history of the project, and Tamas wormed around it with a 1-liner (added a sleep).

Granted, it's much harder to fix "for real" and when most of the interpreter has been destroyed ;-)
msg196923 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-04 15:07
> What I still don't understand:  the new lock is an internal
> implementation detail.  How would it gain a weakref with a callback?
>  Users aren't going to mess with this lock, and if you want to stop
> Python maintainers from giving it a weakref with a callback, simply
> say they shouldn't do that (in the code comments) - you could even
> add code verifying it doesn't have any weakrefs outstanding
> (although that would likely be a waste of time and code:  no
> maintainer is going to _want_ to make a weakref to it, let alone a
> weakref with a callback).

Well... perhaps I'm being too perfectionist, but I don't want Python
to be crashable merely by manipulating a Thread object's private
attributes.
If I add some check code, it will quickly become more complicated than
the current weakref version, which works by design :-)

> My concern is the bulk and obscurity of this code, all to plug such a
> minor hole.  I call it "minor" because it's been reported once in
> the history of the project, and Tamas wormed around it with a
> 1-liner (added a sleep).

Yeah, the overall concern is a bit obscure, but still: right now, if
you use threads inside a subinterpreter, your code can work or crash
depending on subtle timing conditions (where "crash" doesn't even mean
"raise an exception" but "abort the whole process"). So I think it's
worth trying to fix, even though the complication can look a bit
disproportionate.
msg196956 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-04 21:09
Oh, I'm not opposed, I'm just complaining ;-)

It would be much nicer to have an approach that worked for all thread users, not just threading.Thread users.  For example, a user can easily (well, plausibly) get into the same kinds of troubles here by calling _thread.start_new_thread() directly, then waiting for their threads "to end" before letting the program finish - they have no idea either when their tstates are actually destroyed.

A high-probability way to "appear to fix" this for everyone could change Py_EndInterpreter's

    if (tstate != interp->tstate_head || tstate->next != NULL)
        Py_FatalError("Py_EndInterpreter: not the last thread");

to something like

    int count = 0;
    while (tstate != interp->tstate_head || tstate->next != NULL) {
        ++count;
        if (count > SOME_MAGIC_VALUE)
            Py_FatalError("Py_EndInterpreter: not the last thread");
        sleep(SOME_SHORT_TIME);
    }

In the meantime ;-), you should change this part of the new .join() code:

        if endtime is not None:
            waittime = endtime - _time()
            if not lock.acquire(timeout=waittime):
                return

The problem here is that we have no idea how much time may have elapsed before computing the new `waittime`.  So the new `waittime` _may_ be negative, in which case we've already timed out (but passing a negative `waittime` to acquire() means "wait as long as it takes to acquire the lock").  So this block should return if waittime < 0.
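Tim's point about the recomputed `waittime` can be sketched like this (a hypothetical helper, not the patch itself; it uses `time.monotonic()` where the patched code uses `_time()`):

```python
import threading
import time

def wait_for_lock_with_deadline(lock, endtime):
    """Wait on `lock` until an absolute deadline.

    When a timeout budget is split across several waits, the remaining
    budget must be recomputed -- and checked -- before each acquire():
    an unknown amount of time may already have elapsed, so the new
    waittime may be <= 0, meaning we have already timed out.  Passing a
    negative timeout to acquire() is wrong either way (the docs said it
    blocks forever; it actually raises ValueError unless it is -1).
    """
    waittime = endtime - time.monotonic()
    if waittime <= 0:
        return False  # already timed out; don't call acquire() at all
    return lock.acquire(timeout=waittime)

lk = threading.Lock()
lk.acquire()  # held, so any acquisition attempt must time out
print(wait_for_lock_with_deadline(lk, time.monotonic() + 0.05))  # False, after ~50 ms
print(wait_for_lock_with_deadline(lk, time.monotonic() - 1))     # False, immediately
```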
msg196962 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-04 21:57
Oops!  The docs are wrong - a negative timeout actually raises:

ValueError: timeout value must be strictly positive

unless the timeout is exactly -1.  All the more reason to ensure that a negative waittime isn't passed.

I opened a different issue about the doc glitch.
msg196977 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-05 03:49
Fudge - there's another unlikely problem here.  For example:  main program creates a threading.Thread t, runs it, and does t.join(5) (whatever - any timeout value).  When t.join() returns, the main program has no idea whether t is done or not.  Suppose t isn't done, but finishes _bootstrap_inner a microsecond (whatever) later.

Now the user follows the doc's advice, and checks t.is_alive() to see whether it still needs to join t.  t.is_alive() returns False!  _bootstrap_inner() already finished, so the t._stopped event is already set.

So the main program skips doing another t.join(), and lets the program exit.  There's no guarantee then that t's tstate has been cleared, and we can be back to same fatal error that started this.

Of course this has nothing to do with the patch switching from a Condition to an Event to manage the stopped state (good change!) - it has to do with that, no matter how it's coded, _bootstrap_inner() says "this thread is done" before the tstate is unlinked from the chain.

For a related reason, note that Python's threading._shutdown() doesn't prevent the fatal shutdown error even after the patch!  _pickSomeNonDaemonThread() also ignores threads for which is_alive() returns False.  As above, that can return False while the tstate is still in the chain, and then threading._shutdown() never joins such a thread.  join() can't fix the problem if it's not called.  I suppose that one can be repaired by removing the is_alive() check in _pickSomeNonDaemonThread() (i.e., always join() every non-daemon thread).

In their favor, the probabilistic "sleep" approaches would almost always "fix" these too ;-)
msg196983 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-05 08:15
> It would be much nicer to have an approach that worked for all thread users, 
> not just threading.Thread users.  For example, a user can easily (well,
> plausibly) get into the same kinds of troubles here by calling
> _thread.start_new_thread() directly, then waiting for their threads "to end" 
> before letting the program finish - they have no idea either when their
> tstates are actually destroyed.

The _thread module is a private API these days, I wouldn't worry too much about that. Are there reasons *not* to use the threading module (except CPython bootstrap issues that normal users shouldn't have to bother about)? Also, how would they wait for the thread to end, except by triggering their own Event object?

> So the new `waittime` _may_ be negative, in which case we've already timed
> out (but passing a negative `waittime` to acquire() means "wait as long as 
> it takes to acquire the lock").  So this block should return if waittime < 0.

Thanks, good point :-o

As for the is_alive() method, mmh, ok, it needs to be tweaked too. That will make the patch more interesting.
msg197038 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-06 00:19
So you're not concerned about a now-private API (which used to be advertised), but are concerned about a user mucking with a new private lock in an exceedingly unlikely (in the absence of malice) way.  That clarifies things ;-)

I'm not really concerned about either.  User perversity knows no limits, though, so I wouldn't be surprised if some people are rolling their own higher-level threads in Python for reasons they think are compelling.

Details don't really matter to that, right?  Like:

> Also, how would they wait for the thread to end, except by
> triggering their own Event object?

Any number of ways.  Roll their own Event, roll their own Barrier, roll their own Condition, or even something as simple as keeping an integer count of the number of threads they created, and then (e.g.)

while nthreads:
    time.sleep(1)

at the end, with each thread doing a

    global nthreads
    with nthreads_lock:
        nthreads -= 1

in its end-of-life code.  Essentially rolling their own clumsy variant of a Semaphore.

I've seen code like that "in the wild".  But, no, I'll join you in not worrying about it ;-)
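Put together, the hand-rolled pattern Tim sketches above might look like this (an illustrative sketch, not code from any real project):

```python
import threading
import time

nthreads = 0
nthreads_lock = threading.Lock()

def worker():
    """Do some work, then decrement the thread count in end-of-life code."""
    global nthreads
    time.sleep(0.01)          # pretend to do some work
    with nthreads_lock:
        nthreads -= 1

for _ in range(4):
    with nthreads_lock:
        nthreads += 1
    threading.Thread(target=worker).start()

# The clumsy wait loop from the message (with a shorter poll interval).
# Note: when this loop exits, the threads' *Python* code is done, but
# their tstates may not yet be destroyed -- the very race this issue
# is about.
while nthreads:
    time.sleep(0.01)
print(nthreads)  # 0
```

This is essentially a busy-wait Semaphore; threading.Semaphore or Thread.join would be the idiomatic way, but code like this does exist in the wild, as Tim notes.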
msg197052 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-06 06:31
Le vendredi 06 septembre 2013 à 00:19 +0000, Tim Peters a écrit :
> Tim Peters added the comment:
> 
> So you're not concerned about a now-private API (which used to be
> advertised), but are concerned about a user mucking with a new private
> lock in an exceedingly unlikely (in the absence of malice) way.  That
> clarifies things ;-)

:-)
The only reason I'm concerned about the user mucking with the private
lock is that it's a new opportunity that's opened. But, yeah, I could
remove the weakref and only keep the lock. The code would only be ~10
lines shorter, though. What do other people think?

> in its end-of-life code.  Essentially rolling their own clumsy variant
> of a Semaphore.

I guess they spell it like:

  import clumsy_threading as threading

> I've seen code like that "in the wild".

Well, I've sure seen my share of sleep() calls as a synchronization
measure too (and removed a number of them)... :-)
That's one of the reasons I added the timeout arguments, actually.
msg197125 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-07 00:49
All the timeout args are great!  I wish Python had always had them :-)

Back to the pain at hand, it's not the number of lines of code that rubs me the wrong way, but the sheer obscurity of it all.  This handful of lines is - of necessity - sprayed across two C code files, a C header file, and a Python module.  That makes it very hard (for anyone but you - LOL) to reconstruct the _intent_ of it all.

I'm adding another patch, which is your threadstate_join_2.patch but with a new comment block (in pystate.h) spelling out the rationale behind it all.  I can live with that ;-)
msg197137 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-07 05:31
New patch (threadstate_join_4.patch) refactors so that is_alive() returns True until the tstate is cleared.  This simplifies join() a lot (it doesn't have to roll its own timeout implementation anymore), but complicates is_alive().

Caution:  I don't know anything about the dummy threading implementation.  So if I broke that, sue me ;-)
msg197177 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-07 18:32
New patch threadstate_join_5.patch adds more testing of is_alive().

An inelegance I don't care about (but someone might):  if join() is called with a timeout, and the Python part of the thread ends before the timeout expires (_stopped gets set), then a _non_-blocking attempt to acquire _tstate_lock is made, and join() returns regardless of the outcome.  So, with a timeout, it's possible for join() to return before the C part of the thread is done even if the timeout isn't yet exhausted.

That's unlikely, and I don't see that it makes any difference.  Anyone doing a join() with a timeout has to be aware that they have no idea whether the thread is done when join() returns, and do another join() or check is_alive() later.

I'd rather not complicate the code to wait longer for _tstate_lock in this case.
msg197178 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-07 18:59
> Anyone doing a join() with a timeout has to be aware that they have no 
> idea whether the thread is done when join() returns, and do another
> join() or check is_alive() later.

Agreed.
msg197179 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-07 19:25
I like threadstate_join_5.patch, but I got the following failure when running the test suite:

======================================================================
FAIL: test_is_alive_after_fork (test.test_threading.ThreadTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/antoine/cpython/default/Lib/test/test_threading.py", line 478, in test_is_alive_after_fork
    self.assertEqual(0, status)
AssertionError: 0 != 256
msg197181 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-07 19:34
Ah!  I'm running on Windows, where all fork() tests are skipped.  Can you fix it?  My prejudice is that anyone mixing threads with fork() should be shot for the good of humanity <0.7 wink>.
msg197182 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-07 19:37
Yes, the fix is to reinitialize the lock (in the surviving thread) or yank it (in the other threads). Patch attached.
msg197183 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-07 19:54
Cool!  What could possibly go wrong?  ;-)
msg197184 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-07 20:00
Ok, do you think the patch is finally ready? If so, I'm gonna commit (unless you'd rather do it yourself).
msg197185 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-07 21:12
It would be nice if Tamas could test it in his application, since we're not actually testing Py_EndInterpreter.  But, ya, commit it!  If it doesn't work for Tamas, we can start over again ;-)
msg197188 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-07 21:29
Ah, we *can* add a test for the race condition. Patch attached :-)
msg197190 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-07 21:36
Excellent - ship it! :-)
msg197191 - (view) Author: Roundup Robot (python-dev) Date: 2013-09-07 21:39
New changeset d52b68edbca6 by Antoine Pitrou in branch 'default':
Issue #18808: Thread.join() now waits for the underlying thread state to be destroyed before returning.
http://hg.python.org/cpython/rev/d52b68edbca6
msg197193 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-07 21:50
Ok, this should be finally fixed!
msg197195 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-07 23:03
Ha, apparently test_is_alive_after_fork still fails on old Linux kernels:

http://buildbot.python.org/all/builders/x86%20Ubuntu%20Shared%203.x/builds/8562
http://buildbot.python.org/all/builders/x86%20RHEL%206%203.x/builds/2665
msg197197 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-07 23:17
Figures.  That's why I wanted your name on the blamelist ;-)
msg197202 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-09-07 23:44
You mean I actually need to pay proper attention now? :)
msg197207 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-09-08 00:42
Slavek, this is one you may want to take a look at. Antoine and Tim are trying to fix a race condition with the destruction of thread state objects, but the current fix isn't playing nice on older Linux kernels (including the one in RHEL 6).
msg197213 - (view) Author: Roundup Robot (python-dev) Date: 2013-09-08 02:23
New changeset 5cfd7b2eb994 by Tim Peters in branch 'default':
Issue 18808:  blind attempt to repair some buildbot failures.
http://hg.python.org/cpython/rev/5cfd7b2eb994
msg197214 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-08 02:24
Just pushed 5cfd7b2eb994 in a poke-and-hope blind attempt to repair the annoying ;-) buildbot failures.
msg197215 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-08 02:51
Doesn't look like 5cfd7b2eb994 is going to fix it :-(  So I'll revert it.  Attaching the patch as blind.patch.  After that patch, is_alive() only looks at Thread data members, where ._is_stopped "is obviously" True, and ._tstate_lock "is obviously" None, both forced by threading.py's _after_fork() function.  No theory as to how that can be screwed up.
msg197217 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-08 03:01
Weird!  The Ubuntu box passed test_is_alive_after_fork on its /second/ run with the patch:

http://buildbot.python.org/all/builders/x86%20Ubuntu%20Shared%203.x/builds/8564/steps/test/logs/stdio

The other box passed all tests:

http://buildbot.python.org/all/builders/x86%20RHEL%206%203.x/builds/2667

So I won't revert it after all - but the Ubuntu box's behavior is still a major mystery :-(
msg197221 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-08 04:06
Well, the next time the Ubuntu box ran the tests, it was clean all the way. So it's fixed!  Despite that it isn't ;-)
msg197247 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-08 10:42
Indeed the Ubuntu Shared buildbot started failing again. There's probably a timing-dependent behaviour here (which is why test_is_alive_after_fork() tries several times, after all).
msg197248 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-08 11:08
I think I've found the answer: the thread is sometimes already stopped by the time the child is forked, so it doesn't appear in _enumerate() anymore (it left the _active dict). Therefore its locks are not reset in _after_fork().
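The underlying hazard is easy to demonstrate with a plain lock (a POSIX-only sketch, not the actual test or fix): a lock held at fork time stays held in the child, where no thread will ever release it.

```python
import os
import threading

lock = threading.Lock()
lock.acquire()            # held by the parent at fork time

pid = os.fork()
if pid == 0:
    # Child: the lock is a copy of the parent's and is still held,
    # but no thread in the child will ever release it.  A blocking
    # acquire() here would hang forever; this is why _after_fork()
    # must reinitialize (or discard) per-thread locks.
    still_held = not lock.acquire(blocking=False)
    os._exit(0 if still_held else 1)
else:
    _, status = os.waitpid(pid, 0)
    print(os.WEXITSTATUS(status))  # 0: the child saw the lock still held
```

A thread that finished before the fork has left the _active dict, so _after_fork() never visits it, and its _tstate_lock stays in whatever state the parent left it - hence the sporadic failures.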

Oh, I also get the following sporadic failure which is triggered by slight change in semantics with Thread.join(timeout) :-)

======================================================================
FAIL: test_various_ops (test.test_threading.ThreadTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/antoine/cpython/default/Lib/test/test_threading.py", line 113, in test_various_ops
    self.assertTrue(not t.is_alive())
AssertionError: False is not true
msg197251 - (view) Author: Roundup Robot (python-dev) Date: 2013-09-08 11:19
New changeset 74dc664ad699 by Antoine Pitrou in branch 'default':
Issue #18808 again: fix the after-fork logic for not-yet-started or already-stopped threads.
http://hg.python.org/cpython/rev/74dc664ad699
msg197253 - (view) Author: Kubilay Kocak (koobs) Date: 2013-09-08 11:39
Adding reference to failing tests on koobs-freebsd9 and koobs-freebsd10 buildbots:

======================================================================
FAIL: test_is_alive_after_fork (test.test_threading.ThreadTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/home/buildbot/koobs-freebsd10/3.x.koobs-freebsd10/build/Lib/test/test_threading.py", line 478, in test_is_alive_after_fork
    self.assertEqual(0, status)
AssertionError: 0 != 256
msg197291 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-08 17:30
[Antoine]
> Oh, I also get the following sporadic failure
> which is triggered by slight change in semantics
> with Thread.join(timeout) :-)
> ======================================================================
> FAIL: test_various_ops (test.test_threading.ThreadTests)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>  File "/home/antoine/cpython/default/Lib/test/test_threading.py", line 113, in test_various_ops
>    self.assertTrue(not t.is_alive())
> AssertionError: False is not true

Really!  In context, the test does:

            t.join()
            self.assertTrue(not t.is_alive())

(BTW, that would be clearer as self.assertFalse(t.is_alive()) ;-) )

It was the intent that this continue to work - the only intended change in Python-visible semantics had to do with join'ing with a timeout.

Without a timeout, I confess I don't see how this can fail.  join() is join(timeout=None), which does:

        self._stopped.wait(timeout)
        if self._stopped.is_set():
            self._wait_for_tstate_lock(timeout is None)

which is

        self._stopped.wait(None)
        if self._stopped.is_set():
            self._wait_for_tstate_lock(True)

which should be the same as

        self._stopped.wait()
        self._wait_for_tstate_lock(True)

after which _stopped should be set and _tstate_lock should be None.  The subsequent is_alive() should then return False, via its

        return self._tstate_lock is not None

What's going wrong?
msg197294 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-08 17:42
Le dimanche 08 septembre 2013 à 17:30 +0000, Tim Peters a écrit :
> Really!  In context, the test does:
> 
>             t.join()
>             self.assertTrue(not t.is_alive())

Ah, no, the failing test did `t.join(something)`. I removed the timeout
to remove the failure :-)

> (BTW, that would be clearer as self.assertFalse(t.is_alive()) ;-) )

Yes, old coding style.
msg197302 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-08 17:58
Ah - the test used to do t.join(NUMTASKS)!  That's just bizarre ;-)

I believe I can repair that too (well - there was never a _guarantee_ that waiting 10 seconds would be long enough), but I'll wait until this all settles down.

join() and is_alive() are too complicated now, because of the 2-step dance to check whether the thread is done:  we have both an Event (_stopped) and a lock (_tstate_lock) to check now.  The Event doesn't serve a purpose anymore:  it's almost always uninteresting to know _just_ that the Python part of the thread has ended.  The only exception I can see is the perverse case of joining the main thread done in some of the tests (in that case we have to claim the main thread is done even though its tstate is still active).

Anyway, after getting rid of the Event it should be dead easy to make join(10) "appear to work the same as before, despite that it never really worked ;-)".
msg197304 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-08 18:01
> join() and is_alive() are too complicated now, because of the 2-step
> dance to check whether the thread is done:  we have both an Event
> (_stopped) and a lock (_tstate_lock) to check now.  The Event doesn't
> serve a purpose anymore:  it's almost always uninteresting to know
> _just_ that the Python part of the thread has ended.

Yes, that crossed my mind too. The difficulty is that only plain lock
objects are available from C code, not Events. But if the first join()er
releases the lock just after taking it, won't that be enough to make the
code correct?

(after all, it's an event that's never reset, which simplifies things)

(also, why is the current Event implementation based on Condition? isn't
an Event actually simpler than a Condition?)
msg197306 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-08 18:16
Without _stopped, join() can simply wait to acquire _tstate_lock (with or without a timeout, and skipping this if _tstate_lock is already None).  Etc ;-)  Of course details matter, but it's easy.  I did it once, but the tests joining the main thread failed and I put the code on hold.  I'll dust it off when the buildbots are all happy with the current changes.

> (also, why is the current Event implementation based
> on Condition?

We'd have to ask Guido ;-)  Best guess is that Condition supplied all the machinery to make Event.wait() work correctly, including waking all waiters up when the Event gets set.

> isn't an Event actually simpler than a Condition?)

Events are indeed simple :-)  There are many ways to implement them, but "ain't broke, don't fix" seems the right approach to me here.  In effect, if we get rid of _stopped, the code remaining will be much like an Event implementation built on the plain _tstate_lock lock.
msg197307 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-08 18:18
> Without _stopped, join() can simply wait to acquire _tstate_lock (with
> or without a timeout, and skipping this if _tstate_lock is already
> None).  Etc ;-)  Of course details matter, but it's easy.  I did it
> once, but the tests joining the main thread failed and I put the code
> on hold.

Ah, of course. The main thread needs the event, since the thread state
will only be deleted at the end of Py_Finalize().
The MainThread class could override is_alive() and join(), then.
msg197310 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-08 18:27
> The MainThread class could override is_alive() and join(), then.

I think it will be easier than that, but we'll see ;-)
msg197338 - (view) Author: Roundup Robot (python-dev) Date: 2013-09-08 23:46
New changeset aff959a3ba13 by Tim Peters in branch 'default':
Issue 18984:  Remove ._stopped Event from Thread internals.
http://hg.python.org/cpython/rev/aff959a3ba13
History
Date User Action Args
2013-09-09 20:08:46  tim.peters  set  status: open -> closed
                                      stage: commit review -> resolved
2013-09-08 23:46:44  python-dev  set  messages: + msg197338
2013-09-08 21:47:44  tim.peters  link issue18984 dependencies
2013-09-08 18:27:16  tim.peters  set  messages: + msg197310
2013-09-08 18:18:52  pitrou      set  messages: + msg197307
2013-09-08 18:16:22  tim.peters  set  messages: + msg197306
2013-09-08 18:01:51  pitrou      set  messages: + msg197304
2013-09-08 17:58:10  tim.peters  set  messages: + msg197302
2013-09-08 17:42:25  pitrou      set  messages: + msg197294
2013-09-08 17:30:36  tim.peters  set  messages: + msg197291
2013-09-08 11:39:01  koobs       set  nosy: + koobs
                                      messages: + msg197253
2013-09-08 11:19:13  python-dev  set  messages: + msg197251
2013-09-08 11:08:14  pitrou      set  messages: + msg197248
2013-09-08 10:42:52  pitrou      set  messages: + msg197247
2013-09-08 04:06:16  tim.peters  set  messages: + msg197221
2013-09-08 03:01:10  tim.peters  set  messages: + msg197217
2013-09-08 02:51:19  tim.peters  set  files: + blind.patch
                                      messages: + msg197215
2013-09-08 02:24:13  tim.peters  set  messages: + msg197214
2013-09-08 02:23:23  python-dev  set  messages: + msg197213
2013-09-08 00:42:24  ncoghlan    set  nosy: + bkabrda
                                      messages: + msg197207
                                      stage: resolved -> commit review
2013-09-07 23:44:33  ncoghlan    set  messages: + msg197202
2013-09-07 23:17:07  tim.peters  set  messages: + msg197197
2013-09-07 23:03:26  pitrou      set  status: closed -> open
                                      messages: + msg197195
2013-09-07 21:50:37  pitrou      set  status: open -> closed
                                      resolution: fixed
                                      messages: + msg197193
                                      stage: patch review -> resolved
2013-09-07 21:39:16  python-dev  set  messages: + msg197191
2013-09-07 21:36:04  tim.peters  set  messages: + msg197190
2013-09-07 21:29:30  pitrou      set  files: + subinterp_test.patch
                                      messages: + msg197188
2013-09-07 21:12:38  tim.peters  set  messages: + msg197185
2013-09-07 20:00:53  pitrou      set  messages: + msg197184
2013-09-07 19:54:08  tim.peters  set  messages: + msg197183
2013-09-07 19:37:15  pitrou      set  files: + threadstate_join_6.patch
                                      messages: + msg197182
2013-09-07 19:34:42  tim.peters  set  messages: + msg197181
2013-09-07 19:25:20  pitrou      set  messages: + msg197179
2013-09-07 18:59:24  pitrou      set  messages: + msg197178
2013-09-07 18:32:00  tim.peters  set  files: + threadstate_join_5.patch
                                      messages: + msg197177
2013-09-07 05:31:51  tim.peters  set  files: + threadstate_join_4.patch
                                      messages: + msg197137
2013-09-07 00:50:02  tim.peters  set  files: + threadstate_join_3.patch
                                      messages: + msg197125
2013-09-06 06:31:48  pitrou      set  messages: + msg197052
2013-09-06 00:19:26  tim.peters  set  messages: + msg197038
2013-09-05 08:15:11  pitrou      set  messages: + msg196983
2013-09-05 03:49:41  tim.peters  set  messages: + msg196977
2013-09-04 21:57:42  tim.peters  set  messages: + msg196962
2013-09-04 21:09:25  tim.peters  set  messages: + msg196956
2013-09-04 15:07:42  pitrou      set  messages: + msg196923
2013-09-02 20:31:05  csernazs    set  nosy: + csernazs
2013-08-31 21:01:24  tim.peters  set  messages: + msg196672
2013-08-31 18:44:23  pitrou      set  files: + threadstate_join_2.patch
                                      messages: + msg196663
2013-08-31 10:13:48  pitrou      set  messages: + msg196627
2013-08-31 02:41:02  tim.peters  set  messages: + msg196611
2013-08-29 13:02:54  pitrou      set  nosy: + tim.peters
2013-08-25 17:50:26  pitrou      set  messages: + msg196156
                                      versions: - Python 3.3
2013-08-25 17:48:36  python-dev  set  nosy: + python-dev
                                      messages: + msg196154
2013-08-25 09:01:03  neologix    set  messages: + msg196115
2013-08-24 18:44:14  pitrou      set  files: + threadstate_join.patch
                                      nosy: + neologix
                                      messages: + msg196085
2013-08-24 13:46:38  pitrou      set  messages: + msg196077
2013-08-24 13:41:58  Tamas.K     set  messages: + msg196076
2013-08-24 01:19:12  jcea        set  nosy: + jcea
2013-08-23 18:35:11  pitrou      set  files: + subinterp_threads.patch
                                      versions: - Python 2.7
                                      messages: + msg195999
                                      keywords: + patch
                                      stage: needs patch -> patch review
2013-08-22 16:05:05  Tamas.K     set  messages: + msg195911
2013-08-22 15:52:16  pitrou      set  messages: + msg195910
2013-08-22 15:50:43  pitrou      set  messages: + msg195909
2013-08-22 15:30:22  pitrou      set  nosy: + ncoghlan, grahamd
                                      messages: + msg195906
                                      stage: test needed -> needs patch
2013-08-22 14:16:32  Tamas.K     set  messages: + msg195895
2013-08-22 14:05:12  serhiy.storchaka  set  nosy: + pitrou
                                      stage: test needed
                                      versions: + Python 3.4, - Python 2.6, Python 3.1, Python 3.2
2013-08-22 13:56:37  Tamas.K     create