classification
Title: Thread.join returns before PyThreadState is destroyed
Type: crash Stage: resolved
Components: Interpreter Core, Library (Lib) Versions: Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Tamas.K, bkabrda, csernazs, grahamd, jcea, koobs, ncoghlan, neologix, pitrou, python-dev, tim.peters
Priority: normal Keywords: patch

Created on 2013-08-22 13:56 by Tamas.K, last changed 2013-09-09 20:08 by tim.peters. This issue is now closed.

Files
File name Uploaded Description Edit
subinterp_threads.patch pitrou, 2013-08-23 18:35 review
threadstate_join.patch pitrou, 2013-08-24 18:44 review
threadstate_join_2.patch pitrou, 2013-08-31 18:44 review
threadstate_join_3.patch tim.peters, 2013-09-07 00:49 review
threadstate_join_4.patch tim.peters, 2013-09-07 05:31 review
threadstate_join_5.patch tim.peters, 2013-09-07 18:32 added more is_alive() testing review
threadstate_join_6.patch pitrou, 2013-09-07 19:37 review
subinterp_test.patch pitrou, 2013-09-07 21:29 review
blind.patch tim.peters, 2013-09-08 02:51 noble attempt ;-) to repair buildbot failures
Messages (59)
msg195891 - (view) Author: Tamas K (Tamas.K) Date: 2013-08-22 13:56
When a thread is started in CPython, t_bootstrap [Modules/threadmodule.c] creates a PyThreadState object and then calls Thread.__bootstrap_inner [Lib/threading.py], which calls Thread.run, protected by self.__block, a Condition. Thread.join uses the same __block to block until Thread.run has finished.

When Thread.run has finished, __bootstrap_inner notifies on __block, so join will return. Here lies a race condition: if a thread switch to Thread.join occurs before __bootstrap_inner returns to t_bootstrap, join will return before the PyThreadState for the thread is destroyed by t_bootstrap.

It is mostly harmless for general use, as __bootstrap_inner eventually gets scheduled again and PyThreadState is taken care of.

However. Py_EndInterpreter [Python/pythonrun.c] can be called when only the main interpreter thread is running. So when we want to call Py_EndInterpreter, we signal every other thread to stop, and join them. And when Thread.join returns, we call Py_EndInterpreter. Py_EndInterpreter checks if there are any other PyThreadStates still around and does a Py_FatalError.

As a workaround, we now do a sleep after join.
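The shutdown sequence and the workaround can be sketched as follows (a minimal illustration; `shutdown_worker_threads` and the grace period are hypothetical, not the reporter's actual code):

```python
import threading
import time

def shutdown_worker_threads(threads, grace=0.1):
    """Join every worker, then sleep briefly.

    Even after join() returns, a thread's C-level PyThreadState may not
    yet be destroyed -- t_bootstrap only clears it after
    __bootstrap_inner() notifies the joiner.  The sleep() below is the
    workaround described in this report: give __bootstrap_inner() a
    chance to return to t_bootstrap before Py_EndInterpreter runs.
    """
    for t in threads:
        t.join()
    time.sleep(grace)  # workaround: let each tstate be torn down

# Example: a few short-lived workers.
workers = [threading.Thread(target=lambda: None) for _ in range(3)]
for w in workers:
    w.start()
shutdown_worker_threads(workers)
print(all(not w.is_alive() for w in workers))  # True
```

Without the final sleep, a host embedding Python could call Py_EndInterpreter immediately after the joins and still hit the fatal error.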
msg195895 - (view) Author: Tamas K (Tamas.K) Date: 2013-08-22 14:16
The bug was found in 2.6.5, but after a quick eyeballing, current HEAD has the same problem.
msg195906 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-22 15:30
I *think* we could remove that limitation in Py_EndInterpreter(). After all, Py_Finalize() is able to join() non-daemon threads, so there's no reason Py_EndInterpreter() shouldn't do the same. We must keep the fatal error for daemon threads, though.

As for ensuring the thread state is destroyed before Thread.join() returns, I don't know how to do it: the thread state isn't a PyObject, we can't access it from Python code.
msg195909 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-22 15:50
(Actually, it would almost be possible to combine weakrefs and thread locals to ensure that join() only returns when the thread state is cleared. However, the thread state is cleared before it is untied from the interpreter, and there would still be a small window for a race condition if some destructor called by clearing the thread state released the GIL.)
msg195910 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-22 15:52
(Of course, the latter problem can be solved by having a dedicated sentinel in the thread state that gets DECREF'efed only *after* the thread state is removed from the interpreter...)
msg195911 - (view) Author: Tamas K (Tamas.K) Date: 2013-08-22 16:05
Removing the last-thread check sounds good; what we actually need is a Py_EndInterpreter that does not casually abort() on us. We don't really care about PyThreadStates that much.
msg195999 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-23 18:35
Here is a patch for 3.3/3.4.
msg196076 - (view) Author: Tamas K (Tamas.K) Date: 2013-08-24 13:41
After a quick glance, I can't see how this patch would fix the problem. It still depends on threading's Thread.join, which is affected by the race condition in __bootstrap_inner.

We already did a Thread.join before calling Py_EndInterpreter and still got bitten by the race.
msg196077 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-24 13:46
Well, that's a good point. It does bring subinterpreters in line with the main interpreter when it comes to automatically joining non-daemon threads, but it doesn't solve the race condition you talked about. I forgot a bit too fast about it :-)
msg196085 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-24 18:44
Here is a patch to remove the race condition. The patch is sufficiently delicate that I'd rather reserve this for 3.4, though.
msg196115 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-08-25 09:01
The first patch looks good, as for the second one, it'll take some time :-)
msg196154 - (view) Author: Roundup Robot (python-dev) Date: 2013-08-25 17:48
New changeset becbb65074e1 by Antoine Pitrou in branch 'default':
Issue #18808: Non-daemon threads are now automatically joined when a sub-interpreter is shutdown (it would previously dump a fatal error).
http://hg.python.org/cpython/rev/becbb65074e1
msg196156 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-25 17:50
Ok, I've committed the first patch.
msg196611 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-08-31 02:41
Nasty problem ;-)

I don't understand the need for all the indirections in the second patch.  Like, why use a weakref?  It's not like we have to worry about an immortal tstate keeping a measly little lock object alive forever, right?  Seems to me that if the tstate object held the new lock directly, the tstate destructor could release the lock directly (and so also skip the new tstate->on_delete and tstate->on_delete_data indirections too).

Then again, I'd settle for Py_EndInterpreter simply sleeping for a second and trying again when it finds "another thread" hanging around (effectively moving Tamas's sleep() into Py_EndInterpreter, but only sleeping if needed).  Yes, that's theoretically insecure.

But if we're worried about wildly improbable timing problems, then the patched code can appear not to honor a non-None Thread.join() `timeout` argument too.  That is, the first call to the new `pred()` can theoretically consume any amount of time, due to its self._tstate_lock.acquire().  The same kinds of timing implausibilities that could cause a short sleep to fail could also cause that acquire() to take "a long time" (maybe even 2 seconds - LOL).

But, ya, it's better to really fix it.
msg196627 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-31 10:13
> I don't understand the need for all the indirections in the second
> patch.  Like, why use a weakref?  It's not like we have to worry about
> an immortal tstate keeping a measly little lock object alive forever,
> right?  Seems to me that if the tstate object held the new lock
> directly, the tstate destructor could release the lock directly (and
> so also skip the new tstate->on_delete and tstate->on_delete_data
> indirections too).

The problem is if the tstate holds the last reference to the lock. It
will destroy the lock at the end, but the lock could have weakrefs
attached to it with arbitrary callbacks. Those callbacks will run
without a current thread state and may crash. OTOH, this can't happen
with a private weakref.

You may suggest we only keep the Py_thread_lock in the tstate, rather
than the enclosing PyObject. But it has lifetime problems of its own
(the Py_thread_lock doesn't have a refcount, and it is shared with a
PyObject that thinks it owns it and will destroy it on dealloc... adding
the necessary logic to circumvent this would make the final solution
more complicated than the weakref one).

> Then again, I'd settle for Py_EndInterpreter simply sleeping for a
> second and trying again when it finds "another thread" hanging around
> (effectively moving Tamas's sleep() into Py_EndInterpreter, but only
> sleeping if needed).  Yes, that's theoretically insecure.

Well, that sounds less predictable. Depending on machine load,
Py_EndInterpreter might succeed or crash with a fatal error. Users may
not like this :-)

> But if we're worried about wildly improbable timing problems, then the
> patched code can appear not to honor a non-None Thread.join()
> `timeout` argument too.  That is, the first call to the new `pred()`
> can theoretically consume any amount of time, due to its
> self._tstate_lock.acquire().

Ah, well, good point. It's weird it didn't get caught by the unit
tests... I'll take a look.
msg196663 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-31 18:44
Ok, here is a new patch observing the timeout parameter. I've changed the implementation strategy a bit, it seems hopefully sane.

There's a remaining (slightly borderline) issue: the thread state lock isn't reinitialized on fork(). Should it be?
msg196672 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-08-31 21:01
I'm getting a headache now - LOL ;-)  Thanks for the explanation!

What I still don't understand:  the new lock is an internal implementation detail.  How would it gain a weakref with a callback?  Users aren't going to mess with this lock, and if you want to stop Python maintainers from giving it a weakref with a callback, simply say they shouldn't do that (in the code comments) - you could even add code verifying it doesn't have any weakrefs outstanding (although that would likely be a waste of time and code:  no maintainer is going to _want_ to make a weakref to it, let alone a weakref with a callback).

My concern is the bulk and obscurity of this code, all to plug such a minor hole.  I call it "minor" because it's been reported once in the history of the project, and Tamas wormed around it with a 1-liner (added a sleep).

Granted, it's much harder to fix "for real" and when most of the interpreter has been destroyed ;-)
msg196923 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-04 15:07
> What I still don't understand:  the new lock is an internal
> implementation detail.  How would it gain a weakref with a callback?
>  Users aren't going to mess with this lock, and if you want to stop
> Python maintainers from giving it a weakref with a callback, simply
> say they shouldn't do that (in the code comments) - you could even
> add code verifying it doesn't have any weakrefs outstanding
> (although that would likely be a waste of time and code:  no
> maintainer is going to _want_ to make a weakref to it, let alone a
> weakref with a callback).

Well... perhaps I'm being too perfectionist, but I don't want Python
to be crashable merely by manipulating a Thread object's private
attributes.
If I add some check code, it will quickly become more complicated than
the current weakref version, which works by design :-)

> My concern is the bulk and obscurity of this code, all to plug such a
> minor hole.  I call it "minor" because it's been reported once in
> the history of the project, and Tamas wormed around it with a
> 1-liner (added a sleep).

Yeah, the overall concern is a bit obscure, but still: right now, if
you use threads inside a subinterpreter, your code can work or crash
depending on subtle timing conditions (where "crash" doesn't even mean
"raise an exception" but "abort the whole process"). So I think it's
worth trying to fix, even though the complication can look a bit
disproportionate.
msg196956 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-04 21:09
Oh, I'm not opposed, I'm just complaining ;-)

It would be much nicer to have an approach that worked for all thread users, not just threading.Thread users.  For example, a user can easily (well, plausibly) get into the same kinds of troubles here by calling _thread.start_new_thread() directly, then waiting for their threads "to end" before letting the program finish - they have no idea either when their tstates are actually destroyed.

A high-probability way to "appear to fix" this for everyone could change Py_EndInterpreter's

    if (tstate != interp->tstate_head || tstate->next != NULL)
        Py_FatalError("Py_EndInterpreter: not the last thread");

to something like

    int count = 0;
    while (tstate != interp->tstate_head || tstate->next != NULL) {
        ++count;
        if (count > SOME_MAGIC_VALUE)
            Py_FatalError("Py_EndInterpreter: not the last thread");
        sleep(SOME_SHORT_TIME);
    }

In the meantime ;-), you should change this part of the new .join() code:

        if endtime is not None:
            waittime = endtime - _time()
            if not lock.acquire(timeout=waittime):
                return

The problem here is that we have no idea how much time may have elapsed before computing the new `waittime`.  So the new `waittime` _may_ be negative, in which case we've already timed out (but passing a negative `waittime` to acquire() means "wait as long as it takes to acquire the lock").  So this block should return if waittime < 0.
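Tim's point about the recomputed `waittime` can be sketched like this (a hypothetical helper, not the patch itself; it uses `time.monotonic()` where the patched code uses `_time()`):

```python
import threading
import time

def wait_for_lock_with_deadline(lock, endtime):
    """Wait on `lock` until an absolute deadline.

    When a timeout budget is split across several waits, the remaining
    budget must be recomputed -- and checked -- before each acquire():
    an unknown amount of time may already have elapsed, so the new
    waittime may be <= 0, meaning we have already timed out.  Passing a
    negative timeout to acquire() is wrong either way (the docs said it
    blocks forever; it actually raises ValueError unless it is -1).
    """
    waittime = endtime - time.monotonic()
    if waittime <= 0:
        return False  # already timed out; don't call acquire() at all
    return lock.acquire(timeout=waittime)

lk = threading.Lock()
lk.acquire()  # held, so any acquisition attempt must time out
print(wait_for_lock_with_deadline(lk, time.monotonic() + 0.05))  # False, after ~50 ms
print(wait_for_lock_with_deadline(lk, time.monotonic() - 1))     # False, immediately
```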
msg196962 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-04 21:57
Oops!  The docs are wrong - a negative timeout actually raises:

ValueError: timeout value must be strictly positive

unless the timeout is exactly -1.  All the more reason to ensure that a negative waittime isn't passed.

I opened a different issue about the doc glitch.
msg196977 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-05 03:49
Fudge - there's another unlikely problem here.  For example:  main program creates a threading.Thread t, runs it, and does t.join(5) (whatever - any timeout value).  When t.join() returns, the main program has no idea whether t is done or not.  Suppose t isn't done, but finishes _bootstrap_inner a microsecond (whatever) later.

Now the user follows the doc's advice, and checks t.is_alive() to see whether it still needs to join t.  t.is_alive() returns False!  _bootstrap_inner() already finished, so the t._stopped event is already set.

So the main program skips doing another t.join(), and lets the program exit.  There's no guarantee then that t's tstate has been cleared, and we can be back to same fatal error that started this.

Of course this has nothing to do with the patch switching from a Condition to an Event to manage the stopped state (good change!) - it has to do with that, no matter how it's coded, _bootstrap_inner() says "this thread is done" before the tstate is unlinked from the chain.

For a related reason, note that Python's threading._shutdown() doesn't prevent the fatal shutdown error even after the patch!  _pickSomeNonDaemonThread() also ignores threads for which is_alive() returns False.  As above, that can return False while the tstate is still in the chain, and then threading._shutdown() never joins such a thread.  join() can't fix the problem if it's not called.  I suppose that one can be repaired by removing the is_alive() check in _pickSomeNonDaemonThread() (i.e., always join() every non-daemon thread).

In their favor, the probabilistic "sleep" approaches would almost always "fix" these too ;-)
msg196983 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-05 08:15
> It would be much nicer to have an approach that worked for all thread users, 
> not just threading.Thread users.  For example, a user can easily (well,
> plausibly) get into the same kinds of troubles here by calling
> _thread.start_new_thread() directly, then waiting for their threads "to end" 
> before letting the program finish - they have no idea either when their
> tstates are actually destroyed.

The _thread module is a private API these days, I wouldn't worry too much about that. Are there reasons *not* to use the threading module (except CPython bootstrap issues that normal users shouldn't have to bother about)? Also, how would they wait for the thread to end, except by triggering their own Event object?

> So the new `waittime` _may_ be negative, in which case we've already timed
> out (but passing a negative `waittime` to acquire() means "wait as long as 
> it takes to acquire the lock").  So this block should return if waittime < 0.

Thanks, good point :-o

As for the is_alive() method, mmh, ok, it needs to be tweaked too. That will make the patch more interesting.
msg197038 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-06 00:19
So you're not concerned about a now-private API (which used to be advertised), but are concerned about a user mucking with a new private lock in an exceedingly unlikely (in the absence of malice) way.  That clarifies things ;-)

I'm not really concerned about either.  User perversity knows no limits, though, so I wouldn't be surprised if some people are rolling their own higher-level threads in Python for reasons they think are compelling.

Details don't really matter to that, right?  Like:

> Also, how would they wait for the thread to end, except by
> triggering their own Event object?

Any number of ways.  Roll their own Event, roll their own Barrier, roll their own Condition, or even something as simple as keeping an integer count of the number of threads they created, and then (e.g.)

while nthreads:
    time.sleep(1)

at the end, with each thread doing a

    global nthreads
    with nthreads_lock:
        nthreads -= 1

in its end-of-life code.  Essentially rolling their own clumsy variant of a Semaphore.

I've seen code like that "in the wild".  But, no, I'll join you in not worrying about it ;-)
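Put together, the hand-rolled pattern Tim sketches above might look like this (an illustrative sketch, not code from any real project):

```python
import threading
import time

nthreads = 0
nthreads_lock = threading.Lock()

def worker():
    """Do some work, then decrement the thread count in end-of-life code."""
    global nthreads
    time.sleep(0.01)          # pretend to do some work
    with nthreads_lock:
        nthreads -= 1

for _ in range(4):
    with nthreads_lock:
        nthreads += 1
    threading.Thread(target=worker).start()

# The clumsy wait loop from the message (with a shorter poll interval).
# Note: when this loop exits, the threads' *Python* code is done, but
# their tstates may not yet be destroyed -- the very race this issue
# is about.
while nthreads:
    time.sleep(0.01)
print(nthreads)  # 0
```

This is essentially a busy-wait Semaphore; threading.Semaphore or Thread.join would be the idiomatic way, but code like this does exist in the wild, as Tim notes.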
msg197052 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-06 06:31
Le vendredi 06 septembre 2013 à 00:19 +0000, Tim Peters a écrit :
> Tim Peters added the comment:
> 
> So you're not concerned about a now-private API (which used to be
> advertised), but are concerned about a user mucking with a new private
> lock in an exceedingly unlikely (in the absence of malice) way.  That
> clarifies things ;-)

:-)
The only reason I'm concerned about the user mucking with the private
lock is that it's a new opportunity that's opened. But, yeah, I could
remove the weakref and only keep the lock. The code would only be ~10
lines shorter, though. What do other people think?

> in its end-of-life code.  Essentially rolling their own clumsy variant
> of a Semaphore.

I guess they spell it like:

  import clumsy_threading as threading

> I've seen code like that "in the wild".

Well, I've sure seen my share of sleep() calls as a synchronization
measure too (and removed a number of them)... :-)
That's one of the reasons I added the timeout arguments, actually.
msg197125 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-07 00:49
All the timeout args are great!  I wish Python had always had them :-)

Back to the pain at hand, it's not the number of lines of code that rubs me the wrong way, but the sheer obscurity of it all.  This handful of lines is - of necessity - sprayed across two C code files, a C header file, and a Python module.  That makes it very hard (for anyone but you - LOL) to reconstruct the _intent_ of it all.

I'm adding another patch, which is your threadstate_join_2.patch but with a new comment block (in pystate.h) spelling out the rationale behind it all.  I can live with that ;-)
msg197137 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-07 05:31
New patch (threadstate_join_4.patch) refactors so that is_alive() returns True until the tstate is cleared.  This simplifies join() a lot (it doesn't have to roll its own timeout implementation anymore), but complicates is_alive().

Caution:  I don't know anything about the dummy threading implementation.  So if I broke that, sue me ;-)
msg197177 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-07 18:32
New patch threadstate_join_5.patch adds more testing of is_alive().

An inelegance I don't care about (but someone might):  if join() is called with a timeout, and the Python part of the thread ends before the timeout expires (_stopped gets set), then a _non_-blocking attempt to acquire _tstate_lock is made, and join() returns regardless of the outcome.  So, with a timeout, it's possible for join() to return before the C part of the thread is done even if the timeout isn't yet exhausted.

That's unlikely, and I don't see that it makes any difference.  Anyone doing a join() with a timeout has to be aware that they have no idea whether the thread is done when join() returns, and do another join() or check is_alive() later.

I'd rather not complicate the code to wait longer for _tstate_lock in this case.
msg197178 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-07 18:59
> Anyone doing a join() with a timeout has to be aware that they have no 
> idea whether the thread is done when join() returns, and do another
> join() or check is_alive() later.

Agreed.
msg197179 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-07 19:25
I like threadstate_join_5.patch, but I got the following failure when running the test suite:

======================================================================
FAIL: test_is_alive_after_fork (test.test_threading.ThreadTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/antoine/cpython/default/Lib/test/test_threading.py", line 478, in test_is_alive_after_fork
    self.assertEqual(0, status)
AssertionError: 0 != 256
msg197181 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-07 19:34
Ah!  I'm running on Windows, where all fork() tests are skipped.  Can you fix it?  My prejudice is that anyone mixing threads with fork() should be shot for the good of humanity <0.7 wink>.
msg197182 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-07 19:37
Yes, the fix is to reinitialize the lock (in the surviving thread) or yank it (in the other threads). Patch attached.
msg197183 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-07 19:54
Cool!  What could possibly go wrong?  ;-)
msg197184 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-07 20:00
Ok, do you think the patch is finally ready? If so, I'm gonna commit (unless you'd rather do it yourself).
msg197185 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-07 21:12
It would be nice if Tamas could test it in his application, since we're not actually testing Py_EndInterpreter.  But, ya, commit it!  If it doesn't work for Tamas, we can start over again ;-)
msg197188 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-07 21:29
Ah, we *can* add a test for the race condition. Patch attached :-)
msg197190 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-07 21:36
Excellent - ship it! :-)
msg197191 - (view) Author: Roundup Robot (python-dev) Date: 2013-09-07 21:39
New changeset d52b68edbca6 by Antoine Pitrou in branch 'default':
Issue #18808: Thread.join() now waits for the underlying thread state to be destroyed before returning.
http://hg.python.org/cpython/rev/d52b68edbca6
msg197193 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-07 21:50
Ok, this should be finally fixed!
msg197195 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-07 23:03
Ha, apparently test_is_alive_after_fork still fails on old Linux kernels:

http://buildbot.python.org/all/builders/x86%20Ubuntu%20Shared%203.x/builds/8562
http://buildbot.python.org/all/builders/x86%20RHEL%206%203.x/builds/2665
msg197197 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-07 23:17
Figures.  That's why I wanted your name on the blamelist ;-)
msg197202 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-09-07 23:44
You mean I actually need to pay proper attention now? :)
msg197207 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-09-08 00:42
Slavek, this is one you may want to take a look at. Antoine and Tim are trying to fix a race condition with the destruction of thread state objects, but the current fix isn't playing nice on older Linux kernels (including the one in RHEL 6).
msg197213 - (view) Author: Roundup Robot (python-dev) Date: 2013-09-08 02:23
New changeset 5cfd7b2eb994 by Tim Peters in branch 'default':
Issue 18808:  blind attempt to repair some buildbot failures.
http://hg.python.org/cpython/rev/5cfd7b2eb994
msg197214 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-08 02:24
Just pushed 5cfd7b2eb994 in a poke-and-hope blind attempt to repair the annoying ;-) buildbot failures.
msg197215 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-08 02:51
Doesn't look like 5cfd7b2eb994 is going to fix it :-(  So I'll revert it.  Attaching the patch as blind.patch.  After that patch, is_alive() only looks at Thread data members, where ._is_stopped "is obviously" True, and ._tstate_lock "is obviously" None, both forced by threading.py's _after_fork() function.  No theory as to how that can be screwed up.
msg197217 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-08 03:01
Weird!  The Ubuntu box passed test_is_alive_after_fork on its /second/ run with the patch:

http://buildbot.python.org/all/builders/x86%20Ubuntu%20Shared%203.x/builds/8564/steps/test/logs/stdio

The other box passed all tests:

http://buildbot.python.org/all/builders/x86%20RHEL%206%203.x/builds/2667

So I won't revert it after all - but the Ubuntu box's behavior is still a major mystery :-(
msg197221 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-08 04:06
Well, the next time the Ubuntu box ran the tests, it was clean all the way. So it's fixed!  Despite that it isn't ;-)
msg197247 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-08 10:42
Indeed the Ubuntu Shared buildbot started failing again. There's probably a timing-dependent behaviour here (which is why test_is_alive_after_fork() tries several times, after all).
msg197248 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-08 11:08
I think I've found the answer: the thread is sometimes already stopped by the time the child is forked, so it doesn't appear in _enumerate() anymore (it left the _active dict). Therefore its locks are not reset in _after_fork().
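The underlying hazard is easy to demonstrate with a plain lock (a POSIX-only sketch, not the actual test or fix): a lock held at fork time stays held in the child, where no thread will ever release it.

```python
import os
import threading

lock = threading.Lock()
lock.acquire()            # held by the parent at fork time

pid = os.fork()
if pid == 0:
    # Child: the lock is a copy of the parent's and is still held,
    # but no thread in the child will ever release it.  A blocking
    # acquire() here would hang forever; this is why _after_fork()
    # must reinitialize (or discard) per-thread locks.
    still_held = not lock.acquire(blocking=False)
    os._exit(0 if still_held else 1)
else:
    _, status = os.waitpid(pid, 0)
    print(os.WEXITSTATUS(status))  # 0: the child saw the lock still held
```

A thread that finished before the fork has left the _active dict, so _after_fork() never visits it, and its _tstate_lock stays in whatever state the parent left it - hence the sporadic failures.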

Oh, I also get the following sporadic failure which is triggered by slight change in semantics with Thread.join(timeout) :-)

======================================================================
FAIL: test_various_ops (test.test_threading.ThreadTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/antoine/cpython/default/Lib/test/test_threading.py", line 113, in test_various_ops
    self.assertTrue(not t.is_alive())
AssertionError: False is not true
msg197251 - (view) Author: Roundup Robot (python-dev) Date: 2013-09-08 11:19
New changeset 74dc664ad699 by Antoine Pitrou in branch 'default':
Issue #18808 again: fix the after-fork logic for not-yet-started or already-stopped threads.
http://hg.python.org/cpython/rev/74dc664ad699
msg197253 - (view) Author: Kubilay Kocak (koobs) Date: 2013-09-08 11:39
Adding reference to failing tests on koobs-freebsd9 and koobs-freebsd10 buildbots:

======================================================================
FAIL: test_is_alive_after_fork (test.test_threading.ThreadTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/home/buildbot/koobs-freebsd10/3.x.koobs-freebsd10/build/Lib/test/test_threading.py", line 478, in test_is_alive_after_fork
    self.assertEqual(0, status)
AssertionError: 0 != 256
msg197291 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-08 17:30
[Antoine]
> Oh, I also get the following sporadic failure
> which is triggered by slight change in semantics
> with Thread.join(timeout) :-)
> ======================================================================
> FAIL: test_various_ops (test.test_threading.ThreadTests)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>  File "/home/antoine/cpython/default/Lib/test/test_threading.py", line 113, in test_various_ops
>    self.assertTrue(not t.is_alive())
> AssertionError: False is not true

Really!  In context, the test does:

            t.join()
            self.assertTrue(not t.is_alive())

(BTW, that would be clearer as self.assertFalse(t.is_alive()) ;-) )

It was the intent that this continue to work - the only intended change in Python-visible semantics had to do with join'ing with a timeout.

Without a timeout, I confess I don't see how this can fail.  join() is join(timeout=None), which does:

        self._stopped.wait(timeout)
        if self._stopped.is_set():
            self._wait_for_tstate_lock(timeout is None)

which is

        self._stopped.wait(None)
        if self._stopped.is_set():
            self._wait_for_tstate_lock(True)

which should be the same as

        self._stopped.wait()
        self._wait_for_tstate_lock(True)

after which _stopped should be set and _tstate_lock should be None.  The subsequent is_alive() should then return False, via its

        return self._tstate_lock is not None

What's going wrong?
msg197294 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-08 17:42
Le dimanche 08 septembre 2013 à 17:30 +0000, Tim Peters a écrit :
> Really!  In context, the test does:
> 
>             t.join()
>             self.assertTrue(not t.is_alive())

Ah, no, the failing test did `t.join(something)`. I removed the timeout
to remove the failure :-)

> (BTW, that would be clearer as self.assertFalse(t.is_alive()) ;-) )

Yes, old coding style.
msg197302 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-08 17:58
Ah - the test used to do t.join(NUMTASKS)!  That's just bizarre ;-)

I believe I can repair that too (well - there was never a _guarantee_ that waiting 10 seconds would be long enough), but I'll wait until this all settles down.

join() and is_alive() are too complicated now, because of the 2-step dance to check whether the thread is done:  we have both an Event (_stopped) and a lock (_tstate_lock) to check now.  The Event doesn't serve a purpose anymore:  it's almost always uninteresting to know _just_ that the Python part of the thread has ended.  The only exception I can see is the perverse case of joining the main thread done in some of the tests (in that case we have to claim the main thread is done even though its tstate is still active).

Anyway, after getting rid of the Event it should be dead easy to make join(10) "appear to work the same as before, despite that it never really worked ;-)".
msg197304 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-08 18:01
> join() and is_alive() are too complicated now, because of the 2-step
> dance to check whether the thread is done:  we have both an Event
> (_stopped) and a lock (_tstate_lock) to check now.  The Event doesn't
> serve a purpose anymore:  it's almost always uninteresting to know
> _just_ that the Python part of the thread has ended.

Yes, that crossed my mind too. The difficulty is that only plain lock
objects are available from C code, not Events. But if the first join()er
releases the lock just after taking it, won't that be enough to make the
code correct?

(after all, it's an event that's never reset, which simplifies things)

(also, why is the current Event implementation based on Condition? isn't
an Event actually simpler than a Condition?)
msg197306 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-08 18:16
Without _stopped, join() can simply wait to acquire _tstate_lock (with or without a timeout, and skipping this if _tstate_lock is already None).  Etc ;-)  Of course details matter, but it's easy.  I did it once, but the tests joining the main thread failed and I put the code on hold.  I'll dust it off when the buildbots are all happy with the current changes.

> (also, why is the current Event implementation based
> on Condition?

We'd have to ask Guido ;-)  Best guess is that Condition supplied all the machinery to make Event.wait() work correctly, including waking all waiters up when the Event gets set.

> isn't an Event actually simpler than a Condition?)

Events are indeed simple :-)  There are many ways to implement them, but "ain't broke, don't fix" seems the right approach to me here.  In effect, if we get rid of _stopped, the code remaining will be much like an Event implementation built on the plain _tstate_lock lock.
msg197307 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-08 18:18
> Without _stopped, join() can simply wait to acquire _tstate_lock (with
> or without a timeout, and skipping this if _tstate_lock is already
> None).  Etc ;-)  Of course details matter, but it's easy.  I did it
> once, but the tests joining the main thread failed and I put the code
> on hold.

Ah, of course. The main thread needs the event, since the thread state
will only be deleted at the end of Py_Finalize().
The MainThread class could override is_alive() and join(), then.
msg197310 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-09-08 18:27
> The MainThread class could override is_alive() and join(), then.

I think it will be easier than that, but we'll see ;-)
msg197338 - (view) Author: Roundup Robot (python-dev) Date: 2013-09-08 23:46
New changeset aff959a3ba13 by Tim Peters in branch 'default':
Issue 18984:  Remove ._stopped Event from Thread internals.
http://hg.python.org/cpython/rev/aff959a3ba13
History
Date User Action Args
2013-09-09 20:08:46  tim.peters  set  status: open -> closed
                                      stage: commit review -> resolved
2013-09-08 23:46:44  python-dev  set  messages: + msg197338
2013-09-08 21:47:44  tim.peters  link issue18984 dependencies
2013-09-08 18:27:16  tim.peters  set  messages: + msg197310
2013-09-08 18:18:52  pitrou      set  messages: + msg197307
2013-09-08 18:16:22  tim.peters  set  messages: + msg197306
2013-09-08 18:01:51  pitrou      set  messages: + msg197304
2013-09-08 17:58:10  tim.peters  set  messages: + msg197302
2013-09-08 17:42:25  pitrou      set  messages: + msg197294
2013-09-08 17:30:36  tim.peters  set  messages: + msg197291
2013-09-08 11:39:01  koobs       set  nosy: + koobs
                                      messages: + msg197253
2013-09-08 11:19:13  python-dev  set  messages: + msg197251
2013-09-08 11:08:14  pitrou      set  messages: + msg197248
2013-09-08 10:42:52  pitrou      set  messages: + msg197247
2013-09-08 04:06:16  tim.peters  set  messages: + msg197221
2013-09-08 03:01:10  tim.peters  set  messages: + msg197217
2013-09-08 02:51:19  tim.peters  set  files: + blind.patch
                                      messages: + msg197215
2013-09-08 02:24:13  tim.peters  set  messages: + msg197214
2013-09-08 02:23:23  python-dev  set  messages: + msg197213
2013-09-08 00:42:24  ncoghlan    set  nosy: + bkabrda
                                      messages: + msg197207
                                      stage: resolved -> commit review
2013-09-07 23:44:33  ncoghlan    set  messages: + msg197202
2013-09-07 23:17:07  tim.peters  set  messages: + msg197197
2013-09-07 23:03:26  pitrou      set  status: closed -> open
                                      messages: + msg197195
2013-09-07 21:50:37  pitrou      set  status: open -> closed
                                      resolution: fixed
                                      messages: + msg197193
                                      stage: patch review -> resolved
2013-09-07 21:39:16  python-dev  set  messages: + msg197191
2013-09-07 21:36:04  tim.peters  set  messages: + msg197190
2013-09-07 21:29:30  pitrou      set  files: + subinterp_test.patch
                                      messages: + msg197188
2013-09-07 21:12:38  tim.peters  set  messages: + msg197185
2013-09-07 20:00:53  pitrou      set  messages: + msg197184
2013-09-07 19:54:08  tim.peters  set  messages: + msg197183
2013-09-07 19:37:15  pitrou      set  files: + threadstate_join_6.patch
                                      messages: + msg197182
2013-09-07 19:34:42  tim.peters  set  messages: + msg197181
2013-09-07 19:25:20  pitrou      set  messages: + msg197179
2013-09-07 18:59:24  pitrou      set  messages: + msg197178
2013-09-07 18:32:00  tim.peters  set  files: + threadstate_join_5.patch
                                      messages: + msg197177
2013-09-07 05:31:51  tim.peters  set  files: + threadstate_join_4.patch
                                      messages: + msg197137
2013-09-07 00:50:02  tim.peters  set  files: + threadstate_join_3.patch
                                      messages: + msg197125
2013-09-06 06:31:48  pitrou      set  messages: + msg197052
2013-09-06 00:19:26  tim.peters  set  messages: + msg197038
2013-09-05 08:15:11  pitrou      set  messages: + msg196983
2013-09-05 03:49:41  tim.peters  set  messages: + msg196977
2013-09-04 21:57:42  tim.peters  set  messages: + msg196962
2013-09-04 21:09:25  tim.peters  set  messages: + msg196956
2013-09-04 15:07:42  pitrou      set  messages: + msg196923
2013-09-02 20:31:05  csernazs    set  nosy: + csernazs
2013-08-31 21:01:24  tim.peters  set  messages: + msg196672
2013-08-31 18:44:23  pitrou      set  files: + threadstate_join_2.patch
                                      messages: + msg196663
2013-08-31 10:13:48  pitrou      set  messages: + msg196627
2013-08-31 02:41:02  tim.peters  set  messages: + msg196611
2013-08-29 13:02:54  pitrou      set  nosy: + tim.peters
2013-08-25 17:50:26  pitrou      set  messages: + msg196156
                                      versions: - Python 3.3
2013-08-25 17:48:36  python-dev  set  nosy: + python-dev
                                      messages: + msg196154
2013-08-25 09:01:03  neologix    set  messages: + msg196115
2013-08-24 18:44:14  pitrou      set  files: + threadstate_join.patch
                                      nosy: + neologix
                                      messages: + msg196085
2013-08-24 13:46:38  pitrou      set  messages: + msg196077
2013-08-24 13:41:58  Tamas.K     set  messages: + msg196076
2013-08-24 01:19:12  jcea        set  nosy: + jcea
2013-08-23 18:35:11  pitrou      set  files: + subinterp_threads.patch
                                      versions: - Python 2.7
                                      messages: + msg195999
                                      keywords: + patch
                                      stage: needs patch -> patch review
2013-08-22 16:05:05  Tamas.K     set  messages: + msg195911
2013-08-22 15:52:16  pitrou      set  messages: + msg195910
2013-08-22 15:50:43  pitrou      set  messages: + msg195909
2013-08-22 15:30:22  pitrou      set  nosy: + ncoghlan, grahamd
                                      messages: + msg195906
                                      stage: test needed -> needs patch
2013-08-22 14:16:32  Tamas.K     set  messages: + msg195895
2013-08-22 14:05:12  serhiy.storchaka  set  nosy: + pitrou
                                      stage: test needed
                                      versions: + Python 3.4, - Python 2.6, Python 3.1, Python 3.2
2013-08-22 13:56:37  Tamas.K     create