classification
Title: Thread.isAlive() sometimes True after fork
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.4, Python 3.3, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: emptysquare, gregory.p.smith, koobs, neologix, pitrou, python-dev, sbt
Priority: normal Keywords: patch

Created on 2013-07-09 22:13 by emptysquare, last changed 2013-09-08 11:40 by pitrou. This issue is now closed.

Files
File name Uploaded Description Edit
fix_is_alive_and_fork.patch emptysquare, 2013-07-09 22:13 Patch to test and fix this bug. review
fix_is_alive_and_fork_2.patch emptysquare, 2013-07-15 23:51 Second patch, for 2.7 branch. review
fix_is_alive_and_fork_python_33.patch emptysquare, 2013-07-16 00:05 Similar patch for 3.3 branch. review
issue18418-2.patch emptysquare, 2013-07-25 18:00 review
issue18418-3.patch emptysquare, 2013-08-04 15:50 review
Messages (23)
msg192780 - (view) Author: A. Jesse Jiryu Davis (emptysquare) * Date: 2013-07-09 22:13
In threading.Thread.__bootstrap_inner(), the thread sets self.__started before adding itself to the _active collection. If a different thread forks between these two operations, we're left with a thread that believes it's alive although it isn't. _after_fork() doesn't call __stop() on this thread because the thread isn't in _active.

In the attached patch, Thread adds itself to _active before setting self.__started. This way, if it's interrupted by a fork, it may end up stopped without being started, rather than the reverse. In this case isAlive() correctly returns False.

One concern about the patch: my new test in test_threading.py only reveals the bug in Python 2.7.5 very occasionally. I don't know how to force the test to fail more frequently.

Looking at the threading.py on the 3.3 branch, it appears to have the same bug, but I can't reproduce it there, only on the 2.7 branch.
msg192933 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-07-12 08:22
This is just a special-case of the more general problem of forking() in a multi-threaded program (see #16500 and #6721).
Since fork() can occur at any time, even though your data structures are protected by locks, they can end up in an inconsistent state in the child process. There's nothing we can do about it, and it affects *every code that's not atomic*.
For example, even though your patch fixes this specific issue, you can still end up with the thread both in _active and _limbo at the same time (which shouldn't be a problem since both are cleared after fork(), but still).

OTOH, the patch is simple and safe, so it might be interesting (another maybe better approach would be to also iterate over _limbo in _after_fork()).

As for the reproduction in 3.3, you're probably having trouble to reproduce it because of the GIL change. You might want to play with sys.setswitchinterval().
msg193104 - (view) Author: A. Jesse Jiryu Davis (emptysquare) * Date: 2013-07-15 14:17
Charles-François, I agree that many guarantees are impossible to enforce in a multithreaded application that calls fork().

But the threading module does try to guarantee, after a fork, that isAlive() is False for all threads but one. I claim that it can more reliably provide this guarantee with my patch applied.
msg193124 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2013-07-15 19:15
This patch makes sense to me.  I've gone over the code and I cannot spot any adverse effects.  I was wondering in particular if anything would be surprised to find a non-started thread in _active within the short window where that will be true but nothing appears to care about that.

You should add a comment to the code mentioning this issue and why the .set() is done after the _active insertion.  After that, I'd say commit it.

If you want a consistent reproducible test case for this I believe you will need to replace the Thread object's __started with a test wrapper who's set() method blocks waiting for for the fork to have happened before doing the actual set().  That is a bit tricky and may not be worth it.
msg193127 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-07-15 20:30
> If you want a consistent reproducible test case for this I believe you will need to replace the Thread object's __started with a test wrapper who's set() method blocks waiting for for the fork to have happened before doing the actual set().  That is a bit tricky and may not be worth it.

Actually, setting sys.setswitchinterval(1e-6) makes it fairly easy to reproduce.
msg193137 - (view) Author: A. Jesse Jiryu Davis (emptysquare) * Date: 2013-07-15 23:51
Patch #2:

* Add comment before .set() as requested.

* setcheckinterval(0) and try 20 times to repro the bug, inspired by test_enumerate_after_join. This reliably repros in 2.7.5.
msg193539 - (view) Author: A. Jesse Jiryu Davis (emptysquare) * Date: 2013-07-22 14:56
Bump.
msg193689 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-07-25 09:02
> Bump.

Did you see my review at http://bugs.python.org/review/18418/#ps8668 ?
msg193693 - (view) Author: A. Jesse Jiryu Davis (emptysquare) * Date: 2013-07-25 11:13
Oh, no I didn't. I'm learning how this system works. I'll respond there
soon.

Thanks

On Thursday, July 25, 2013, Charles-François Natali wrote:

>
> Charles-François Natali added the comment:
>
> > Bump.
>
> Did you see my review at http://bugs.python.org/review/18418/#ps8668 ?
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org <javascript:;>>
> <http://bugs.python.org/issue18418>
> _______________________________________
>
msg193709 - (view) Author: A. Jesse Jiryu Davis (emptysquare) * Date: 2013-07-25 18:00
New patch for 3.3 branch after Charles-François's critique: instead of changing startup sequence in Thread._bootstrap_inner, stop all threads in _limbo after a fork.
msg193710 - (view) Author: A. Jesse Jiryu Davis (emptysquare) * Date: 2013-07-25 18:24
(Sorry about the extraneous XML file from my IDE, I made a mistake and allowed the diff to include it.)
msg193770 - (view) Author: A. Jesse Jiryu Davis (emptysquare) * Date: 2013-07-27 01:03
By the way, I'm going on vacation through August 4. When I return I'll
check if there's anything else I need to do to help resolve this.

On Thursday, July 25, 2013, A. Jesse Jiryu Davis wrote:

>
> A. Jesse Jiryu Davis added the comment:
>
> (Sorry about the extraneous XML file from my IDE, I made a mistake and
> allowed the diff to include it.)
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org <javascript:;>>
> <http://bugs.python.org/issue18418>
> _______________________________________
>
msg194160 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-08-02 07:07
I've posted another review (not sure you receive notifications).
msg194385 - (view) Author: A. Jesse Jiryu Davis (emptysquare) * Date: 2013-08-04 15:42
I'm back from a Zen retreat--no email!--will address your comments
momentarily.

On Fri, Aug 2, 2013 at 3:07 AM, Charles-François Natali <
report@bugs.python.org> wrote:

>
> Charles-François Natali added the comment:
>
> I've posted another review (not sure you receive notifications).
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue18418>
> _______________________________________
>
msg194387 - (view) Author: A. Jesse Jiryu Davis (emptysquare) * Date: 2013-08-04 15:50
New patch for 3.3 branch after Charles-François's critique: use _enumerate() where appropriate, and join the test thread before finishing the test.
msg194792 - (view) Author: A. Jesse Jiryu Davis (emptysquare) * Date: 2013-08-10 14:09
Bump.

I think my most recent patch (August 4) addresses all of Charles-François's comments.
msg195580 - (view) Author: A. Jesse Jiryu Davis (emptysquare) * Date: 2013-08-18 22:09
FYI I'm going on vacation again 8/20 through 8/25. I'll check in again as soon as I return to see if there's anything else I should do.
msg196576 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-30 21:14
Well... I can't say anything except that the patch looks harmless.
It's a pity the test failure is hard to trigger, though.
msg196581 - (view) Author: Roundup Robot (python-dev) Date: 2013-08-30 21:34
New changeset ba54011aa295 by Charles-François Natali in branch '2.7':
Issue #18418: After fork(), reinit all threads states, not only active ones.
http://hg.python.org/cpython/rev/ba54011aa295

New changeset 29fce7f31539 by Charles-François Natali in branch '3.3':
Issue #18418: After fork(), reinit all threads states, not only active ones.
http://hg.python.org/cpython/rev/29fce7f31539

New changeset 8f39e2f987fb by Charles-François Natali in branch 'default':
Issue #18418: After fork(), reinit all threads states, not only active ones.
http://hg.python.org/cpython/rev/8f39e2f987fb
msg196612 - (view) Author: A. Jesse Jiryu Davis (emptysquare) * Date: 2013-08-31 02:49
Thanks for accepting this patch!

Antoine, I initially had trouble writing a test that reliably reproduced the bug, especially in Python 3.3. But I read other threading tests and got smarter. The final patch includes tests that are very reliable at revealing the bug in 2.7 and 3.3.
msg196622 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-08-31 07:37
> The final patch includes tests that are very reliable at
> revealing the bug in 2.7 and 3.3.

Indeed, I could reproduce it systematically without the patch.

> Thanks for accepting this patch!

Well, thanks for the patch!
msg197252 - (view) Author: koobs (koobs) Date: 2013-09-08 11:37
For reference, this test is successfuly identifying failures on koobs-freebsd and koobs-freebsd10 buildbots:

======================================================================
FAIL: test_is_alive_after_fork (test.test_threading.ThreadTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/home/buildbot/koobs-freebsd10/3.x.koobs-freebsd10/build/Lib/test/test_threading.py", line 478, in test_is_alive_after_fork
    self.assertEqual(0, status)
AssertionError: 0 != 256
msg197254 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-08 11:40
I've identified a possible cause (and fix) for the sporadic test failures in 74dc664ad699. If the solution holds, it should be backported to 3.3 and 2.7.
History
Date User Action Args
2013-09-08 11:40:19pitrousetmessages: + msg197254
2013-09-08 11:37:04koobssetnosy: + koobs
messages: + msg197252
2013-08-31 07:37:57neologixsetstatus: open -> closed
type: behavior
messages: + msg196622

resolution: fixed
stage: commit review -> resolved
2013-08-31 02:49:17emptysquaresetmessages: + msg196612
2013-08-30 21:34:38python-devsetnosy: + python-dev
messages: + msg196581
2013-08-30 21:14:53pitrousetmessages: + msg196576
2013-08-30 16:34:20neologixsetnosy: + pitrou
stage: patch review -> commit review

versions: + Python 3.3, Python 3.4
2013-08-18 22:09:10emptysquaresetmessages: + msg195580
2013-08-18 15:12:04neologixsetstage: patch review
2013-08-10 14:09:41emptysquaresetmessages: + msg194792
2013-08-04 15:50:59emptysquaresetfiles: + issue18418-3.patch

messages: + msg194387
2013-08-04 15:42:49emptysquaresetmessages: + msg194385
2013-08-02 07:07:35neologixsetmessages: + msg194160
2013-07-27 01:03:29emptysquaresetmessages: + msg193770
2013-07-25 18:24:49emptysquaresetmessages: + msg193710
2013-07-25 18:00:42emptysquaresetfiles: + issue18418-2.patch

messages: + msg193709
2013-07-25 11:13:08emptysquaresetmessages: + msg193693
2013-07-25 09:02:24neologixsetmessages: + msg193689
2013-07-22 14:56:02emptysquaresetmessages: + msg193539
2013-07-16 00:05:05emptysquaresetfiles: + fix_is_alive_and_fork_python_33.patch
2013-07-15 23:51:03emptysquaresetfiles: + fix_is_alive_and_fork_2.patch

messages: + msg193137
2013-07-15 20:30:57neologixsetmessages: + msg193127
2013-07-15 19:15:57gregory.p.smithsetmessages: + msg193124
2013-07-15 14:17:39emptysquaresetmessages: + msg193104
2013-07-12 08:22:36neologixsetnosy: + neologix
messages: + msg192933
2013-07-09 23:03:08pitrousetnosy: + gregory.p.smith, sbt
2013-07-09 22:13:50emptysquarecreate