classification
Title: Popen wait() doesn't handle spurious wakeups
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.4, Python 3.2, Python 3.3, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: gregory.p.smith Nosy List: amscanne, gregory.p.smith, neologix, python-dev
Priority: normal Keywords: needs review, patch

Created on 2012-03-24 04:04 by amscanne, last changed 2012-11-11 05:16 by gregory.p.smith. This issue is now closed.

Files
File name | Uploaded | Description
waitpid-2.7.patch | amscanne, 2012-03-24 04:04 | Fix for 2.7 based on 75701:d46c1973d3c4
Messages (7)
msg156683 - (view) Author: Adin Scannell (amscanne) Date: 2012-03-24 04:04
While running a complex python process that executes a bunch of subprocesses (using the subprocess module, specifically calling communicate()), I found myself with occasional zombie processes piling up. Turns out Python is not correctly wait()ing for the children. Although in my case it happens for < 5% of subprocesses, it happens for random Popen objects, used in different ways (using Popen() and then read()/write()/wait() directly or with communicate()). I'd love to find out I'm crazy, but I'm not doing anything too sneaky and the patch below fixes the problem.

I'm not sure why it's happening in my particular environment (maybe it just so happens that the child processes enter into states with particular timing, or the parent receives signals at the wrong moments) but it's very reproducible for me.

I believe that the cause of the zombie processes is as follows:

If you read the description of the waitpid system call (http://www.kernel.org/doc/man-pages/online/pages/man2/wait.2.html), there are several events that could cause waitpid() to return. I have no idea why, but even without WNOHANG set, it looks like I'm getting back an occasional 0 return value from waitpid(). Interrupted system call? Stopped child process? Not sure why at the moment. The documentation is a bit ambiguous as to whether this can happen, BUT looking at the example code at the bottom, it seems to handle this spurious wakeup case (which subprocess does not). The net result is that the child process has *not* exited or been killed, yet waitpid() has returned. The python code paths don't consider this possibility (as I believe in normal circumstances, it rarely happens!).

I discovered this bug on 2.7.2. I've prepared a patch for the 2.7 branch (75701:d46c1973d3c4), although I'm certain almost all versions, including the tip, suffer from this problem. I'm happy to port to other branches if necessary, although I think appropriate maintainers could whip it up in no time flat. I've tested my 2.7 fix and it solves my problem -- no more zombies. This patch does not change the behaviour of the Popen class in the normal case but allows it to handle spurious wakeups.
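
For illustration, the retry idea described in this message can be sketched as below. This is not the attached patch and not the subprocess module's code; `wait_for_child` and its injectable `waitpid` parameter are hypothetical names introduced here. It assumes a POSIX system and simply calls waitpid() again whenever the returned pid is not the child's pid.

```python
import os


def wait_for_child(pid, waitpid=os.waitpid):
    """Block until `pid` has been reaped and return its raw wait status.

    Hypothetical helper: `waitpid` is injectable only so the retry logic
    can be exercised without a real child process.
    """
    while True:
        reaped_pid, status = waitpid(pid, 0)
        if reaped_pid == pid:
            # The child really did change state and has been reaped.
            return status
        # reaped_pid == 0 is the spurious wakeup reported in this issue:
        # the child is still there, so wait again instead of giving up.


if __name__ == "__main__":
    child = os.fork()
    if child == 0:
        os._exit(3)  # child: exit immediately with status 3
    raw_status = wait_for_child(child)
    print(os.WEXITSTATUS(raw_status))  # prints 3
```

With a loop like this, a 0 return never gets mistaken for "the child is gone", so the child is eventually reaped instead of being left as a zombie.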
msg156684 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-03-24 07:05
Thanks.  I'll see that this fix gets into 2.7, 3.2 and 3.3.

Out of curiosity, what Linux kernel version and glibc version were you using?

I'm somewhat surprised that I haven't run into this before. :)
msg156685 - (view) Author: Adin Scannell (amscanne) Date: 2012-03-24 08:14
Kernel is 3.0.0-15-generic (I believe stock Ubuntu Oneiric kernel).
Version for glibc is Ubuntu EGLIBC 2.13-20ubuntu5.

I'm working on figuring out the exact conditions under which it happens and creating a harness. I'll post it when I've got it.
msg156743 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012-03-25 08:40
> I'm working on figuring out the exact conditions under which it 
> happens and creating a harness. I'll post it when I've got it.

Please do so, because I'm quite skeptical about waitpid() returning 0
without WNOHANG.
If you can reproduce it fairly consistently, you could try running
under strace to see what's happening.
msg175320 - (view) Author: Roundup Robot (python-dev) Date: 2012-11-11 05:10
New changeset d478df13abde by Gregory P. Smith in branch '3.2':
Fixes issue #14396: Handle the odd rare case of waitpid returning 0 when
http://hg.python.org/cpython/rev/d478df13abde

New changeset 61a0eace0f2e by Gregory P. Smith in branch '3.3':
Fixes issue #14396: Handle the odd rare case of waitpid returning 0
http://hg.python.org/cpython/rev/61a0eace0f2e

New changeset 512c1120332f by Gregory P. Smith in branch 'default':
Fixes issue #14396: Handle the odd rare case of waitpid returning 0
http://hg.python.org/cpython/rev/512c1120332f
msg175321 - (view) Author: Roundup Robot (python-dev) Date: 2012-11-11 05:13
New changeset 82711f5ab507 by Gregory P. Smith in branch '2.7':
Fixes issue #14396: Handle the odd rare case of waitpid returning 0
http://hg.python.org/cpython/rev/82711f5ab507
msg175322 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-11-11 05:16
regardless of knowing how to reproduce this system call behavior, the changes necessary to handle it robustly are easy enough.  fixed.

3.3+ already handled it if a timeout was specified (new feature).  I only had to fix the default no timeout case.
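
A rough sketch of why a timeout-style wait copes with a 0 return naturally: when polling with WNOHANG, 0 already means "child still running", so the loop just tries again until the deadline. This is an illustration only, not the code in subprocess; `wait_with_timeout`, `poll_interval`, and the injectable `waitpid` parameter are hypothetical names, and a POSIX system is assumed.

```python
import os
import time


def wait_with_timeout(pid, timeout, waitpid=os.waitpid, poll_interval=0.01):
    """Poll for `pid` until it is reaped or `timeout` seconds elapse.

    Returns the raw wait status, or None if the deadline passed first.
    Hypothetical helper; `waitpid` is injectable for testing.
    """
    deadline = time.monotonic() + timeout
    while True:
        reaped_pid, status = waitpid(pid, os.WNOHANG)
        if reaped_pid == pid:
            return status
        # A 0 pid here means the child has not changed state yet.
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)
```

The no-timeout path has no such polling loop by default, which is where a 0 return could previously slip through.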
History
Date | User | Action | Args
2012-11-11 05:16:10 | gregory.p.smith | set | status: open -> closed; resolution: fixed; messages: + msg175322; versions: + Python 3.4
2012-11-11 05:13:36 | python-dev | set | messages: + msg175321
2012-11-11 05:10:39 | python-dev | set | nosy: + python-dev; messages: + msg175320
2012-03-25 08:40:40 | neologix | set | messages: + msg156743
2012-03-24 10:44:15 | pitrou | set | nosy: + neologix
2012-03-24 08:14:38 | amscanne | set | messages: + msg156685
2012-03-24 07:05:37 | gregory.p.smith | set | assignee: gregory.p.smith; messages: + msg156684; nosy: + gregory.p.smith, - gps; versions: + Python 3.2, Python 3.3
2012-03-24 05:48:26 | brian.curtin | set | keywords: + needs review; nosy: + gps; stage: patch review; versions: - Python 2.6, Python 3.1, Python 3.2, Python 3.3, Python 3.4
2012-03-24 04:04:23 | amscanne | create |