classification
Title: Throw away more radioactive locks that could be held across a fork in threading.py
Type:
Stage:
Components: Interpreter Core, Library (Lib)
Versions: Python 2.6
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: gregory.p.smith
Nosy List: Rhamphoryncus, automatthias, barry, collinwinter, gregory.p.smith, jyasskin, nadeem.vawda, rnk
Priority: release blocker
Keywords: needs review, patch

Created on 2009-08-04 18:56 by rnk, last changed 2013-07-30 10:54 by automatthias. This issue is now closed.

Files
File name | Uploaded | Description
forkjoindeadlock.py | rnk, 2009-08-04 18:56 | Failing fork/threads test case.
forkdeadlock.diff | rnk, 2009-08-04 21:18 | Patch to fix the deadlock
thread-fork-join.diff | rnk, 2010-07-10 19:35 | Updated patch
thread-fork-join.diff | rnk, 2010-07-10 20:39 | clear condition waiters also
issue6643-release26_maint_gps01.diff | gregory.p.smith, 2011-01-04 01:25 |
test_thread.diff | nadeem.vawda, 2011-01-04 16:30 | Patch to fix AttributeError in test_thread
Messages (14)
msg91265 - Author: Reid Kleckner (rnk) (Python committer) Date: 2009-08-04 18:56
This bug is similar to the importlock deadlock, and it's really part of
a larger problem: you should release all locks before you fork.
However, we can fix this in the threading module directly by freeing and
resetting the locks on the main thread after a fork.

I've attached a test case that inserts calls to sleep at the right
places to make the following occur:
- Main thread spawns a worker thread.
- Main thread joins worker thread.
- To join, the main thread acquires the lock on the condition variable
(worker.__block.acquire()).
== switch to worker ==
- Worker thread forks.
== switch to child process ==
- Worker thread, which is now the only thread in the process, returns.
- __bootstrap_inner calls self.__stop() to notify any other threads
waiting for it that it returned.
- __stop() tries to acquire self.__block, which has been left in an
acquired state, so the child process hangs here.
== switch to worker in parent process ==
- Worker thread calls os.waitpid(), which hangs, since the child never
returns.

So there's the deadlock.

I think I should be able to fix it just by resetting the condition
variable lock and any other locks hanging off the only thread left
standing after the fork.
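
The core of the sequence above can be reproduced without the threading internals. The following is a minimal sketch, assuming a POSIX system with os.fork(); the helper name lock_is_stuck_in_child is made up for illustration and is not the attached test case. The roles are swapped for simplicity (a helper thread holds the lock and the main thread forks), and the child uses a non-blocking acquire so that it reports the stuck lock instead of hanging on it.

```python
import os
import threading
import warnings


def lock_is_stuck_in_child(lock):
    """Fork while another thread holds `lock`; report what the child sees.

    Hypothetical helper for illustration only (POSIX, needs os.fork()).
    """
    held = threading.Event()
    done = threading.Event()

    def holder():
        with lock:
            held.set()   # tell the forking thread the lock is now held
            done.wait()  # keep holding it until the parent has finished

    worker = threading.Thread(target=holder)
    worker.start()
    held.wait()

    with warnings.catch_warnings():
        # Newer Python versions warn about fork() in a multi-threaded process.
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()

    if pid == 0:
        # Child: the holder thread does not exist here, but the lock was
        # copied in its acquired state, so nothing will ever release it.
        stuck = not lock.acquire(False)
        os._exit(1 if stuck else 0)

    # Parent: collect the child's verdict, then let the holder finish.
    _, status = os.waitpid(pid, 0)
    done.set()
    worker.join()
    return os.WEXITSTATUS(status) == 1
```

A blocking acquire in the child at that point would wait forever, which is the hang described for __stop().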
msg91273 - Author: Reid Kleckner (rnk) (Python committer) Date: 2009-08-04 21:18
Here's a patch for 3.2 which adds the fix and a test case.  I also
verified that the problem exists in 3.1, 2.7, and 2.6 and backported the
patch to those versions, but someone should review this one before I
upload those.
msg109914 - Author: Reid Kleckner (rnk) (Python committer) Date: 2010-07-10 19:35
Here's an updated patch for py3k (3.2).  The test still fails without the fix, and passes with the fix.

Thinking more about this, I'll try summarizing the bug more coherently:

When the main thread joins the child threads, it acquires some locks.  If a fork in a child thread occurs while those locks are held, they remain locked in the child process.  My solution is to do here what we do elsewhere in CPython: abandon radioactive locks and allocate fresh ones.
msg109933 - Author: Reid Kleckner (rnk) (Python committer) Date: 2010-07-10 20:39
I realized that in a later fix for unladen-swallow, we also cleared the condition variable waiters list, since it has radioactive synchronization primitives in it as well.

Here's an updated patch that simplifies the fix by just using __init__() to completely reinitialize the condition variables and adds a test.

This corresponds to unladen-swallow revisions r799 and r834.
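
The same idea, abandoning an inherited condition variable and rebuilding it with __init__(), can be sketched in user code. This is not the attached patch (which changes threading.py itself); it is an illustration that assumes Python 3.7+ for os.register_at_fork() and a POSIX system, and the names ForkAwareState and child_can_acquire are hypothetical.

```python
import os
import threading
import warnings


class ForkAwareState:
    """Hypothetical object that guards its state with a Condition."""

    def __init__(self):
        self.cond = threading.Condition()
        # Ask the interpreter to call _reinit in the child after each fork.
        os.register_at_fork(after_in_child=self._reinit)

    def _reinit(self):
        # Throw away the inherited lock and waiter list; calling __init__()
        # again gives the Condition a fresh lock and an empty waiter list.
        self.cond.__init__()


def child_can_acquire(state):
    """Fork while another thread holds state.cond; can the child take it?"""
    held = threading.Event()
    done = threading.Event()

    def holder():
        with state.cond:
            held.set()
            done.wait()

    worker = threading.Thread(target=holder)
    worker.start()
    held.wait()

    with warnings.catch_warnings():
        # Newer Python versions warn about fork() in a multi-threaded process.
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()

    if pid == 0:
        # Child: _reinit has already run, so the lock is a fresh one.
        ok = state.cond.acquire(False)
        os._exit(0 if ok else 1)

    _, status = os.waitpid(pid, 0)
    done.set()
    worker.join()
    return os.WEXITSTATUS(status) == 0
```

One caveat of this sketch: registering a bound method keeps the object alive for the life of the process, so it suits long-lived objects only.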
msg110071 - Author: Adam Olsen (Rhamphoryncus) Date: 2010-07-12 06:34
I don't have any direct opinions on this, as it is just a bandaid.  fork, as defined by POSIX, doesn't allow what we do with it, so we're reliant on a great deal of OS and library implementation details.  The only portable and robust solution would be to replace it with a unified fork-and-exec API that's implemented directly in C.
msg110092 - Author: Reid Kleckner (rnk) (Python committer) Date: 2010-07-12 15:11
I completely agree, but the cat is out of the bag on this one.  I don't see how we could get rid of fork until Py4K, and even then I'm sure there will be people who don't want to see it go, and I'd rather not spend my time arguing this point.

The only application of fork that doesn't use exec that I've heard of is pre-forked Python servers.  But those don't seem like they would be very useful, since with refcounting the copy-on-write behavior doesn't get you very many wins.

The problem that this bandaid solves for me is that test_threading.py already tests thread+fork behaviors, and can fail non-deterministically.

This problem was exacerbated while I was working on making the compilation thread.

I don't think we can un-support fork and threads in the near future either, because subprocess.py uses fork, and libraries can use fork behind the user's back.
msg125236 - Author: Gregory P. Smith (gregory.p.smith) (Python committer) Date: 2011-01-03 20:48
fwiw a unified fork-and-exec API implemented in C is what I added in Modules/_posixsubprocess.c to at least avoid this issue as much as possible when using subprocess.
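
From the caller's side, relying on that path means launching children through subprocess rather than calling os.fork() directly. A minimal sketch, assuming Python 3.7+ for the capture_output and text arguments; the helper name run_child_from_thread is hypothetical:

```python
import subprocess
import sys
import threading


def run_child_from_thread(code):
    """Run `code` in a fresh interpreter, launched from a worker thread."""
    result = {}

    def worker():
        # On POSIX, subprocess does the fork and the exec together in C,
        # so the child does not run Python code on inherited locks.
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            check=True,
        )
        result["stdout"] = proc.stdout

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return result["stdout"]
```

This sidesteps the problem only when the child is meant to exec a new program; code that forks and keeps running Python in the child still needs the lock reset discussed in this issue.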
msg125240 - Author: Gregory P. Smith (gregory.p.smith) (Python committer) Date: 2011-01-03 21:07
patch looks good.  committed in r87710 for 3.2.  needs backporting to 3.1 and 2.7 and optionally 2.6.
msg125270 - Author: Gregory P. Smith (gregory.p.smith) (Python committer) Date: 2011-01-04 01:10
r87726 for release31-maint
r87727 for release27-maint - this required a bit more fiddling as _block, _started and _cond were __ private (name-mangled) there.
msg125273 - Author: Gregory P. Smith (gregory.p.smith) (Python committer) Date: 2011-01-04 01:25
Attached is a patch for Python 2.6 release26_maint for reference in case someone wants it.  That branch is closed - security fixes only.
msg125338 - Author: Nadeem Vawda (nadeem.vawda) (Python committer) Date: 2011-01-04 16:30
r87710 introduces an AttributeError in test_thread's TestForkInThread test case. If os.fork() is called from a thread created by the _thread module, threading._after_fork() will get a _DummyThread (with no _block attribute) as the current thread.

I've attached a patch that checks whether the thread has a _block attribute before trying to reinitialize it.
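
The shape of that guard can be sketched as follows. This is an illustration rather than the attached diff: WithBlock and WithoutBlock are hypothetical stand-ins for a regular Thread and a _DummyThread, and reinit_block is a made-up helper name.

```python
import threading


class WithBlock:
    """Stand-in for a thread object that owns a `_block` condition."""

    def __init__(self):
        self._block = threading.Condition()


class WithoutBlock:
    """Stand-in for a _DummyThread-like object that has no `_block`."""


def reinit_block(thread):
    """Reinitialize `thread._block` if it exists; return whether it did."""
    if not hasattr(thread, "_block"):
        # Nothing to reset; touching thread._block here is what would
        # raise the AttributeError.
        return False
    thread._block.__init__()
    return True
```

With this check, an object lacking the attribute is skipped instead of raising.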
msg125346 - Author: Gregory P. Smith (gregory.p.smith) (Python committer) Date: 2011-01-04 18:34
eek, thanks for noticing that!

r87740 fixes this in py3k.  backporting to 3.1 and 2.7 now.
msg125350 - Author: Gregory P. Smith (gregory.p.smith) (Python committer) Date: 2011-01-04 18:44
r87741 3.1
r87742 2.7
msg193923 - Author: Maciej Bliziński (automatthias) Date: 2013-07-30 10:54
Python version: 2.7.5
OS: Solaris 9

I'm still observing this issue (or Issue5114) on Solaris 9. The symptom is that test_threading hangs indefinitely (tested overnight). Running pstack on the hung process shows:

-----------------  lwp# 1 / thread# 1  --------------------
 ff3dc734 lwp_park (0, 0, 0)
 ff3d3c74 s9_lwp_park (0, 0, 0, 1, feed4f48, 18f5a4) + 28
 ff3dc698 s9_handler (0, 0, 0, 1, feed4f48, 18f5a4) + 90
 ff1dea70 _sema_wait (0, feee66a0, fed6b054, feee6000, 2a298478, d1f20) + 1d4
 ff1dec30 sema_wait (81aa8, ff1dec24, 722a5b4b, 1101c, feed4f48, 134d60) + c
 feed4f48 sem_wait (81aa8, 0, fed6b1ac, 0, 0, 1) + 20
 ff050890 PyThread_acquire_lock (81aa8, 1, fed6b214, 2, 0, 1ae778) + 5c
 ff05524c lock_PyThread_acquire_lock (0, 22030, 0, 13ee40, 16a298, 55150) + 50
 fefa779c PyCFunction_Call (1ae788, 22030, 0, ff0d8eb8, 55150, ff0551fc) + e4
 ff016b14 PyEval_EvalFrameEx (18f5a0, 0, 0, d4f66, 16a298, 22030) + 5ee8
 ff0185d0 PyEval_EvalCodeEx (12c968, 0, 18f5a0, 4, 1, 18f5a4) + 924
 ff0168f8 PyEval_EvalFrameEx (1902b8, 0, 1, 1765c0, 16a298, 1b12d0) + 5ccc
 ff0185d0 PyEval_EvalCodeEx (13f608, 0, 1902b8, 4, 1, 1902bc) + 924
 ff0168f8 PyEval_EvalFrameEx (154748, 0, 1, 31f7f, 16a298, 1b1250) + 5ccc
 ff0185d0 PyEval_EvalCodeEx (10d650, 54a50, 154748, 2203c, 0, 2203c) + 924
 fef8e11c function_call (22038, 22030, 1386f0, 2203c, 130730, 22030) + 168
 fef604e8 PyObject_Call (130730, 22030, 1386f0, ff0e0340, fef8dfb4, 0) + 60
 ff0137dc PyEval_EvalFrameEx (169110, 0, 22030, 10e62d, 16a298, 22030) + 2bb0
 ff017478 PyEval_EvalFrameEx (168f80, 0, 169114, 1769fa, 16a298, 16a298) + 684c
 ff017478 PyEval_EvalFrameEx (176cb0, 0, 168f84, 12a2c0, 16a298, 16a298) + 684c
 ff0185d0 PyEval_EvalCodeEx (13f410, 176cb4, 176cb0, 13433c, 1, 0) + 924
 fef8e040 function_call (1b26f0, 134330, 0, ff1bc000, 1b26f0, 0) + 8c
 fef604e8 PyObject_Call (1b26f0, 134330, 0, ff0e0340, fef8dfb4, 134320) + 60
 fef6e530 instancemethod_call (0, 134330, 0, 0, 1b26f0, 134bd0) + a4
 fef604e8 PyObject_Call (c3b48, 22030, 0, ff0e0340, fef6e48c, 0) + 60
 ff01051c PyEval_CallObjectWithKeywords (c3b48, 22030, 0, 0, 0, 0) + 68
 ff05568c t_bootstrap (63bd0, 0, 0, 0, 16a298, ff0e2804) + 4c
 ff1e53a4 _lwp_start (0, 0, 0, 0, 0, 0)
-----------------  lwp# 2 / thread# 2  --------------------
 ff3dc734 lwp_park (0, 0, 0)
 ff3d3c74 s9_lwp_park (0, 0, 0, 1, b64a0d58, 136818) + 28
 ff3dc698 s9_handler (0, 0, 0, 1, b64a0d58, 136818) + 90
 ff1dea70 _sema_wait (0, feee66a0, fec6b054, feee6000, 2a298478, d1f20) + 1d4
 ff1dec30 sema_wait (8ab00, ff1dec24, 722a5b4b, 1101c, feed4f48, 134d60) + c
 feed4f48 sem_wait (8ab00, 0, fec6b1ac, 0, 0, 1) + 20
 ff050890 PyThread_acquire_lock (8ab00, 1, fec6b214, 2, 0, 1ae610) + 5c
 ff05524c lock_PyThread_acquire_lock (0, 22030, 0, 13ee40, 156168, 55160) + 50
 fefa779c PyCFunction_Call (1ae620, 22030, 0, ff0d8eb8, 55160, ff0551fc) + e4
 ff016b14 PyEval_EvalFrameEx (18fe60, 0, 0, d4f66, 156168, 22030) + 5ee8
 ff0185d0 PyEval_EvalCodeEx (12c968, 0, 18fe60, 4, 1, 18fe64) + 924
 ff0168f8 PyEval_EvalFrameEx (18fce8, 0, 1, 1765c0, 156168, 1b11b0) + 5ccc
 ff0185d0 PyEval_EvalCodeEx (13f608, 0, 18fce8, 4, 1, 18fcec) + 924
 ff0168f8 PyEval_EvalFrameEx (18fb88, 0, 1, 136155, 156168, 1a2930) + 5ccc
 ff0185d0 PyEval_EvalCodeEx (48b60, 18fb8c, 18fb88, 19d41c, 1, 2203c) + 924
 fef8e11c function_call (22038, 19d410, 1b3c00, 2203c, 130370, 22030) + 168
 fef604e8 PyObject_Call (130370, 19d410, 1b3c00, ff0e0340, fef8dfb4, 19d400) + 60
 ff0137dc PyEval_EvalFrameEx (18fa20, 0, 19d410, 10e62d, 156168, 134950) + 2bb0
 ff017478 PyEval_EvalFrameEx (18f890, 0, 18fa24, 1769fa, 156168, 156168) + 684c
 ff017478 PyEval_EvalFrameEx (18f728, 0, 18f894, 12a2c0, 156168, 156168) + 684c
 ff0185d0 PyEval_EvalCodeEx (13f410, 18f72c, 18f728, 19d3fc, 1, 0) + 924
 fef8e040 function_call (1b26f0, 19d3f0, 0, ff1bc000, 1b26f0, 0) + 8c
 fef604e8 PyObject_Call (1b26f0, 19d3f0, 0, ff0e0340, fef8dfb4, 19d3e0) + 60
 fef6e530 instancemethod_call (0, 19d3f0, 0, 0, 1b26f0, 1b1250) + a4
 fef604e8 PyObject_Call (1aeaf8, 22030, 0, ff0e0340, fef6e48c, 0) + 60
 ff01051c PyEval_CallObjectWithKeywords (1aeaf8, 22030, 0, 0, 0, 0) + 68
 ff05568c t_bootstrap (63c30, 0, 0, 0, 156168, ff0e2804) + 4c
 ff1e53a4 _lwp_start (0, 0, 0, 0, 0, 0)

The problem does not occur on Solaris 10.
History
Date | User | Action | Args
2013-07-30 10:54:40 | automatthias | set | nosy: + automatthias; messages: + msg193923
2011-06-25 10:43:10 | neologix | link | issue5114 superseder
2011-01-04 18:44:17 | gregory.p.smith | set | status: open -> closed; messages: + msg125350; resolution: accepted -> fixed; nosy: barry, collinwinter, gregory.p.smith, Rhamphoryncus, jyasskin, nadeem.vawda, rnk
2011-01-04 18:34:43 | gregory.p.smith | set | priority: normal -> release blocker; nosy: + barry; messages: + msg125346
2011-01-04 16:40:56 | pitrou | set | status: closed -> open; nosy: collinwinter, gregory.p.smith, Rhamphoryncus, jyasskin, nadeem.vawda, rnk
2011-01-04 16:30:45 | nadeem.vawda | set | files: + test_thread.diff; nosy: + nadeem.vawda; messages: + msg125338
2011-01-04 01:25:20 | gregory.p.smith | set | status: open -> closed; files: + issue6643-release26_maint_gps01.diff; versions: - Python 2.7; nosy: collinwinter, gregory.p.smith, Rhamphoryncus, jyasskin, rnk; messages: + msg125273; keywords: + patch
2011-01-04 01:10:40 | gregory.p.smith | set | nosy: collinwinter, gregory.p.smith, Rhamphoryncus, jyasskin, rnk; messages: + msg125270; versions: - Python 3.1, Python 3.2
2011-01-03 21:07:41 | gregory.p.smith | set | assignee: rnk -> gregory.p.smith; messages: + msg125240; resolution: accepted; nosy: collinwinter, gregory.p.smith, Rhamphoryncus, jyasskin, rnk
2011-01-03 20:48:25 | gregory.p.smith | set | nosy: collinwinter, gregory.p.smith, Rhamphoryncus, jyasskin, rnk; messages: + msg125236
2010-07-18 14:50:49 | rnk | link | issue6642 dependencies
2010-07-18 14:49:23 | rnk | set | keywords: + needs review, - patch; assignee: rnk
2010-07-12 15:11:29 | rnk | set | messages: + msg110092
2010-07-12 06:34:26 | Rhamphoryncus | set | messages: + msg110071
2010-07-11 14:49:42 | pitrou | set | nosy: + gregory.p.smith, Rhamphoryncus
2010-07-11 13:22:50 | rnk | set | title: joining a child that forks can deadlock in the forked child process -> Throw away more radioactive locks that could be held across a fork in threading.py
2010-07-10 20:39:02 | rnk | set | files: + thread-fork-join.diff; messages: + msg109933
2010-07-10 19:35:50 | rnk | set | files: + thread-fork-join.diff; messages: + msg109914
2009-08-11 18:19:33 | collinwinter | set | nosy: + jyasskin, collinwinter; components: + Interpreter Core
2009-08-04 21:18:43 | rnk | set | files: + forkdeadlock.diff; keywords: + patch; messages: + msg91273; versions: + Python 3.1, Python 2.7, Python 3.2
2009-08-04 18:56:48 | rnk | create |