classification
Title: deadlock in subprocess while running several threads using Popen
Type: behavior Stage: committed/rejected
Components: Versions: Python 3.3, Python 3.2, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Martijn.van.Oosterhout, Stephen.White, doughellmann, glaubich, neologix, pitrou, python-dev, rosslagerwall
Priority: normal Keywords: needs review, patch

Created on 2012-01-18 18:00 by glaubich, last changed 2012-02-15 21:44 by doughellmann. This issue is now closed.

Files
File name Uploaded Description Edit
subprocess_deadlock.py glaubich, 2012-01-18 18:00
reinit_tls.diff neologix, 2012-01-19 14:45
reinit_tls_test.diff neologix, 2012-01-28 14:40
reinit_tls_27.diff neologix, 2012-02-01 22:12
Messages (9)
msg151556 - (view) Author: Christoph Glaubitz (glaubich) Date: 2012-01-18 18:00
Starting several threads, each just starting a subprocess.Popen and use communicate cause a deadlock in subprocess. (see attached file)

I can only reproduce this with python 2.7.2, not with any other versions. 2.6.5, 2.6.7 and 3.2.2 are working fine. Also 2.4.6 is working, with a problem fixed by http://bugs.python.org/issue1731717.

The attached script does not dead lock all the time, but at least every second call.

I am on a linux x86_64. The python's are compiled by hand, except of 2.6.5.
msg151621 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012-01-19 13:36
Here's the backtrace:

"""
#0  0x0000003bfb20c9b1 in sem_wait () from /lib64/libpthread.so.0
#1  0x000000000051a7c3 in PyThread_acquire_lock (lock=0x17db0750, waitflag=1)
    at Python/thread_pthread.h:321
#2  0x000000000051a9b4 in find_key (key=1, value=0x0) at Python/thread.c:268
#3  0x000000000051abdc in PyThread_get_key_value (key=1) at Python/thread.c:360
#4  0x00000000005025b1 in PyGILState_GetThisThreadState () at Python/pystate.c:598
#5  0x00000000005024f5 in _PyGILState_Reinit () at Python/pystate.c:547
#6  0x0000000000521fc7 in PyOS_AfterFork () at ./Modules/signalmodule.c:979
#7  0x000000000052461d in posix_fork (self=0x0, noargs=0x0) at ./Modules/posixmodule.c:3695
"""

It's stuck in _PyGILState_Reinit(), when calling PyGILState_GetThisThreadState().
That's because in 2.7, TLS is emulated (see Python/thread.c), and it uses a global mutex.
If this mutex is locked at the time of fork(), then the next call to TLS primitives (even PyGILState_GetThisThreadState()) will deadlock.

Now, this particular bug is fixed in 2.7 since #13156, which backed-out  _PyGILState_Reinit() because it was only relevant for native TLS implementations.

The code is still present in 3.2 and and default, but this problem doesn't affect native TLS implementations.

Just to be extra safe, we PyThread_ReInitTLS() - which resets this global mutex on emulated implementations, and is just a no-op on pthread and windows - should be moved earlier in PyOS_AfterFork(), to avoid this type of deadlock (I mean, PyGILState_GetThisThreadState() deadlock after fork() is bad).
Patch attached.
msg151623 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-19 14:16
I think you forgot to attach the patch :)
msg151630 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012-01-19 14:45
Here's the patch.
It's probably possible to add a test for this, however I don't have access to my development machine, so I can't write it now.
msg151683 - (view) Author: Martijn van Oosterhout (Martijn.van.Oosterhout) Date: 2012-01-20 10:51
What a concidence, we had this problem as well and just today tracked it down, and turns out someone filed this bug yesterday.

I note the referenced bug (#13156) says that it's been fixed in the release branch but not yet released. I don't suppose there will be a release of 2.7 in the near future?
msg152162 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012-01-28 14:40
And here's the test.
To sum up:
- reinit_tls.diff makes sure to call PyThread_ReInitTLS() at the beginning of PyOS_AfterFork(), so that the TLS API is usable within PyOS_AfterFork() (e.g. in _after_fork()). It would be applied to 3.2 and default, even though they are not affected (at least on platform withs a native TLS implementation)
- reinit_tls_test.diff would be applied to 2.7, 3.2 and default: it should not trigger a failure on any of the active branches (2.7, 3.2 and default), it's just to prevent potential future regressions.
Notes regarding the test:
- I replaced subprocess with bare fork(), since it's way cheaper than spawning an entire subprocess
- I reduced the number of threads compared to Christoph's reproducer, because I feel uncomfortable forking up to 1024 processes in parallel (even though in practise they won't be that much, because child processes will terminate before the other threads call fork())
- even with 2.7.2, it won't deadlock 100%, since the race window is really short (but having a failure on one of the buildbots once would be enough to detect the problem, should it ever resurface)
msg152445 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012-02-01 22:12
I thought about this a bit more, and I realized that a slight variation
of this bug also affects 2.7, and also older versions (i.e. before
_PyGILState_Reinit() was introduced), because any code that gets called
from PyOS_AfterFork() and uses the TLS API suffers from the same
problem.
I had the new test deadlock once, in theading._after_fork(): when trying
to acquire an (unlocked) lock, it got stuck in Py_END_ALLOW_THREADS,
which calls indirectly PyGILState_GetThisThreadState(), which calls
find_key().
Although less likely, this bug has been present for a *long* time, it's
surprising it had never been noticed before.
So the fix - calling PyThread_ReInitTLS() right at the beginning of PyOS_AfterFork() - should also be applied to 2.7.
I'll commit it tomorrow.
msg152474 - (view) Author: Roundup Robot (python-dev) Date: 2012-02-02 19:39
New changeset c3649173d093 by Charles-François Natali in branch '2.7':
Issue #13817: After fork(), reinit the ad-hoc TLS implementation earlier to fix
http://hg.python.org/cpython/rev/c3649173d093

New changeset 7b24dd587a7b by Charles-François Natali in branch '3.2':
Issue #13817: After fork(), reinit the ad-hoc TLS implementation earlier to fix
http://hg.python.org/cpython/rev/7b24dd587a7b

New changeset a0100852b6fe by Charles-François Natali in branch 'default':
Issue #13817: After fork(), reinit the ad-hoc TLS implementation earlier to fix
http://hg.python.org/cpython/rev/a0100852b6fe
msg152480 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012-02-02 21:56
Committed.

Christoph, thanks for the report.
History
Date User Action Args
2012-02-15 21:44:07doughellmannsetnosy: + doughellmann
2012-02-02 21:56:43neologixsetstatus: open -> closed
resolution: fixed
messages: + msg152480

stage: commit review -> committed/rejected
2012-02-02 19:39:18python-devsetnosy: + python-dev
messages: + msg152474
2012-02-02 16:15:40Stephen.Whitesetnosy: + Stephen.White
2012-02-01 22:12:07neologixsetfiles: + reinit_tls_27.diff

messages: + msg152445
2012-01-31 17:35:07neologixsetstage: patch review -> commit review
2012-01-29 13:00:08rosslagerwallsetnosy: + rosslagerwall
2012-01-28 14:40:29neologixsetfiles: + reinit_tls_test.diff

messages: + msg152162
versions: + Python 2.7
2012-01-20 10:51:04Martijn.van.Oosterhoutsetnosy: + Martijn.van.Oosterhout
messages: + msg151683
2012-01-19 14:45:39neologixsetfiles: + reinit_tls.diff

messages: + msg151630
2012-01-19 14:16:31pitrousetnosy: + pitrou
messages: + msg151623
2012-01-19 13:36:59neologixsetkeywords: + patch, needs review

stage: patch review
messages: + msg151621
versions: + Python 3.2, Python 3.3, - Python 2.7
2012-01-19 12:54:46pitrousetnosy: + neologix
2012-01-18 18:00:35glaubichcreate