This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author amscanne
Recipients amscanne
Date 2012-03-24.04:04:20
SpamBayes Score 0.0
Marked as misclassified No
Message-id <1332561864.36.0.581817043223.issue14396@psf.upfronthosting.co.za>
In-reply-to
Content
While running a complex python process that executes a bunch of subprocesses (using the subprocess module, specifically calling communicate()), I found myself with occasional zombie processes piling up. Turns out Python is not correctly wait()ing for the children. Although in my case it happens for < 5% of subprocesses, it happens for random Popen objects, used in different ways (using Popen() and then read()/write()/wait() directly or with communicate()). I'd love to find out I'm crazy, but I'm not doing anything too sneaky and the patch below fixes the problem.

I'm not sure why it's happening in my particular environment (maybe it just so happens that the child processes enter into states with particular timing, or the parent receives signals at the wrong moments) but it's very reproducible for me.

I believe that the cause of the zombie processes is as follows:

If you read the description of the waitpid system call (http://www.kernel.org/doc/man-pages/online/pages/man2/wait.2.html), there are several events that could cause waitpid() to return. I have no idea why, but even without WNOHANG set, it looks I'm getting back an occasional 0 return value from waitpid(). Interrupted system call? Stopped child process? Not sure why at the moment. The documentation is a bit ambiguous as to whether this can happen, BUT looking at the example code at the bottom, it seems to handle this spurious wakeup case (which subprocess does not). The net result is that this process has *not* exited or been killed. The python code paths don't consider this possibility (as I believe in normal circumstances, it rarely happens!).

I discovered this bug on 2.7.2. I've prepared a patch for the 2.7 branch (75701:d46c1973d3c4), although I'm certain almost all versions, including the tip suffer from this problem. I'm happy to port to other branches if necessary, although I think appropriate maintainers could whip it up in no time flat. I've tested my 2.7 fix and it solves my problem -- no more zombies. This patch does not change the behaviour of the Popen class in the normal case but allows it to handle spurious wakeups.
History
Date User Action Args
2012-03-24 04:04:24amscannesetrecipients: + amscanne
2012-03-24 04:04:24amscannesetmessageid: <1332561864.36.0.581817043223.issue14396@psf.upfronthosting.co.za>
2012-03-24 04:04:23amscannelinkissue14396 messages
2012-03-24 04:04:21amscannecreate