classification
Title: Subprocess leaks fd upon kill()
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: gabriele.trombetti, gregory.p.smith, haypo, neologix, python-dev, rosslagerwall
Priority: normal Keywords: patch

Created on 2011-07-28 13:58 by gabriele.trombetti, last changed 2011-08-19 10:27 by neologix. This issue is now closed.

Files
File name Uploaded Description Edit
issue_12650_default.diff neologix, 2011-07-28 21:30 review
issue_12650_27.diff neologix, 2011-07-28 21:30 review
issue_12650_32.diff neologix, 2011-07-28 21:30 review
Messages (13)
msg141295 - (view) Author: (gabriele.trombetti) Date: 2011-07-28 13:58
There seems to be a file descriptor (fd) leak in subprocess upon call to kill() and then destroying the current subprocess object (e.g. by scope exit) if the pipe functionality is being used.

This is a reproducer:

(Linux 2.6.25, Python 2.7.1 (r271:86832))

import subprocess, time
def leaktest():
    subp = subprocess.Popen("echo hi; sleep 200; echo bye", shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    #you can do something here
    subp.terminate()
    #subp.wait() #Fixes the bug
    #time.sleep(0.001) #fixes the bug
    #time.sleep(0.0000001) #doesn't fix the bug
    return True

Launch the function multiple times interactively, each time the number of open file descriptors for the python process will increase by 2. You can see that by "ls -l /proc/<pid>/fd"
This seems to be a race condition because adding a time.sleep(0.001) before the return fixes the problem. Probably some kernel delay is responsible for the race. 

This bug is significant for daemons because the daemon will die once the number of open file descriptors hits the ulimit, usually 1024, so please fix. 

Note: until the bug is fixed, a simple workaround (not documented in module docs though) is to call Popen.wait() after Popen.kill()

Thank you
msg141307 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-07-28 16:02
There's indeed a leak in your code, because you don't close the stdout and stderr file descriptors.

Try with:

    subp.terminate()
    subp.stdout.close()
    subp.stderr.close()
    return True

And you'll be just fine.

The reason why this works with a short delay is subtle however, and there's actually a bug:
- when a subprocess.Popen is created, its stdout and stderr are wrapped with os.fdopen()
- file objects returned by fdopen() are automatically closed when the object is deallocated (i.e. when the file object goes out of scope in cPython)
- so in theory, when your subprocess.Popen goes out of scope, the corresponding FDs should be closed, and you shoudn't have a leak (note that this assumes cPython, and will maybe not work with other implementations - you shouldn't rely on this)
- the problem is that when subprocess.Popen's __del__ method is called before the process has exited (i.e. if you return from leaktest() before the process has exited), the Popen object is added to a list (subprocess._active), so that it can be waited (to avoid zombies)
- Popen objects from this list are collected (i.e. the process is waited for, and if terminated it's removed from the list) synchronously when a new Popen() object is created (_cleanup funtion)

The problem is that there's a bug in the collection code:

def _cleanup():
    for inst in _active[:]:
       res = inst._internal_poll(_deadstate=sys.maxint)
       if res is not None and res >= 0:
           try:
               _active.remove(inst)
           except ValueError:
               # This can happen if two threads create a new Popen instance.
               # It's harmless that it was already removed, so ignore.
               pass

       res = inst._internal_poll(_deadstate=sys.maxint)
       if res is not None and res >= 0:

If the process exit code is negative (in your case, -SIGTERM) then the Popen object is not removed from the list.
Two consequences:
- the object is not deallocated, the corresponding FDs neither, and you hit RLIMIT_NOFILE
- even if stdout and stderr are closed manually, the object itself is not deallocated, so you have an ever-growing _active list, and a memory leak (and walking the list takes longer and longer)

Changing
       if res is not None and res >= 0:
to
       if res is not None:

fixes this.
I'll look at this in more detail and provide a patch.
msg141317 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-07-28 21:30
Alright.
I tested this on default, and couldn't reproduce the FD leak.
It turned out to be due to another bug, affecting only the code path
which calls pure C _posixsubprocess (which is the only implementation
left in 3.3, but 3.2 still has the old pure-Python version).
The code just forgets to set Popen._child_created to true after
fork(), so when Popen.__del__() gets called before the process has
exited, the object is not added to the _active list, and gets
deallocated immediately. While this accidentaly "fixes" the FD leak,
this has another - worse - side effet: the process remains a zombie.

I'm thus attaching three patches, with tests:
- one for 2.7, which fixes the original problem (i.e. remove the
process from _active once exited, even if it got killed by a signal)
- one for default, which also sets _child_created to True after fork()
- another one for 3.2, which does the same thing as the one for
default (but the code is a little different because 3.2 has both
pure-Python and C implementation)

Reviews welcome!
msg141329 - (view) Author: Ross Lagerwall (rosslagerwall) (Python committer) Date: 2011-07-29 05:43
The patches look good and seem to fix the issue.
msg142337 - (view) Author: Roundup Robot (python-dev) Date: 2011-08-18 15:18
New changeset 9ee802642d86 by Charles-François Natali in branch '2.7':
Issue #12650: Fix a race condition where a subprocess.Popen could leak
http://hg.python.org/cpython/rev/9ee802642d86
msg142348 - (view) Author: Roundup Robot (python-dev) Date: 2011-08-18 15:54
New changeset 7d358379c5fb by Charles-François Natali in branch '2.7':
Issue #12650: only run the tests on Unix.
http://hg.python.org/cpython/rev/7d358379c5fb
msg142366 - (view) Author: Roundup Robot (python-dev) Date: 2011-08-18 16:48
New changeset 8ca7f109ce79 by Charles-François Natali in branch '3.2':
Issue #12650: Fix a race condition where a subprocess.Popen could leak
http://hg.python.org/cpython/rev/8ca7f109ce79
msg142373 - (view) Author: Roundup Robot (python-dev) Date: 2011-08-18 17:12
New changeset 148d75d106f1 by Charles-François Natali in branch 'default':
Issue #12650: Fix a race condition where a subprocess.Popen could leak
http://hg.python.org/cpython/rev/148d75d106f1
msg142391 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-08-18 18:21
Patch committed.
This should also fix the original problem.
Gabriele, thanks for the report.
msg142402 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-08-18 21:43
Reopening.
The new tests are failing on solaris and Debian parallel buildbots:
http://www.python.org/dev/buildbot/all/builders/x86 debian parallel 3.x/builds/2734/steps/test/logs/stdio
http://www.python.org/dev/buildbot/all/builders/sparc solaris10 gcc 3.x/builds/3528/steps/test/logs/stdio
"""
======================================================================
FAIL: test_zombie_fast_process_del (test.test_subprocess.POSIXProcessTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home2/buildbot2/slave/3.x.loewis-parallel/build/Lib/test/test_subprocess.py", line 1519, in test_zombie_fast_process_del
    self.assertRaises(OSError, os.waitpid, pid, 0)
AssertionError: OSError not raised by waitpid
"""

I've no idea what's going on here...
This means that the child process didn't get reaped by _cleanup().
It could also maybe be due to the first child's PID being reused immediately by the second.
msg142404 - (view) Author: Roundup Robot (python-dev) Date: 2011-08-18 22:25
New changeset 122f952add3c by Charles-François Natali in branch '3.2':
Issue #12650: fix failures on some buildbots, when a subprocess takes a long
http://hg.python.org/cpython/rev/122f952add3c

New changeset a32ae2749cd1 by Charles-François Natali in branch 'default':
Issue #12650: fix failures on some buildbots, when a subprocess takes a long
http://hg.python.org/cpython/rev/a32ae2749cd1

New changeset 5333ed548257 by Charles-François Natali in branch '2.7':
Issue #12650: fix failures on some buildbots, when a subprocess takes a long
http://hg.python.org/cpython/rev/5333ed548257
msg142405 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-08-18 22:32
> This means that the child process didn't get reaped by _cleanup().

The most likely cause is that when _cleanup() calls waitpid(pid, WNOHANG), the child process hasn't exited yet.
Since I had already set a rather high timeout to avoid this problem and we're only interested in checking that the child is correctly added to the _active list, I've removed the part of test_zombie_fast_process_del which checked that the child got collected when the next Popen instance is allocated (I added this check in test_leak_fast_process_del_killed, which doesn't have this timeout problem because the process gets killed).
I'll watch the buildbots.
msg142422 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-08-19 10:27
The test now passes on the buildbots, closing.
History
Date User Action Args
2011-08-19 10:27:03neologixsetstatus: open -> closed

messages: + msg142422
2011-08-18 22:32:06neologixsetmessages: + msg142405
2011-08-18 22:25:42python-devsetmessages: + msg142404
2011-08-18 21:43:04neologixsetstatus: closed -> open

messages: + msg142402
2011-08-18 18:21:59neologixsetstatus: open -> closed
resolution: fixed
messages: + msg142391

stage: resolved
2011-08-18 17:12:56python-devsetmessages: + msg142373
2011-08-18 16:48:47python-devsetmessages: + msg142366
2011-08-18 15:54:50python-devsetmessages: + msg142348
2011-08-18 15:18:11python-devsetnosy: + python-dev
messages: + msg142337
2011-07-29 05:43:06rosslagerwallsetmessages: + msg141329
2011-07-28 21:30:25neologixsetfiles: + issue_12650_default.diff, issue_12650_27.diff, issue_12650_32.diff
keywords: + patch
messages: + msg141317
2011-07-28 16:22:37rosslagerwallsetnosy: + rosslagerwall
2011-07-28 16:08:56hayposetnosy: + haypo
2011-07-28 16:02:01neologixsetnosy: + neologix
messages: + msg141307
2011-07-28 14:13:45r.david.murraysetnosy: + gregory.p.smith
2011-07-28 13:58:59gabriele.trombetticreate