classification
Title: urllib2's urlopen() method causes a memory leak
Type: resource usage Stage:
Components: Library (Lib) Versions: Python 3.1, Python 3.2, Python 2.7, Python 2.6
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: BreamoreBoy, a.badger, akuchling, amaury.forgeotdarc, bwelling, gdub, holdenweb, jafo, jhylton, manekcz, nswinton, orsenthil, peci, pitrou, schmir, stephbul
Priority: normal Keywords:

Created on 2005-05-25 09:20 by manekcz, last changed 2013-04-11 13:24 by BreamoreBoy. This issue is now closed.

Files
File name Uploaded Description Edit
urllib2leak.py stephbul, 2009-06-03 13:31 main test
urllib2.py peci, 2009-09-04 10:17
Messages (21)
msg60743 - (view) Author: Petr Toman (manekcz) Date: 2005-05-25 09:20
It seems that the urlopen(url) method of the urllib2 module 
leaves some uncollectable objects in memory.

Please try the following code:
==========================
if __name__ == '__main__':
  import urllib2
  a = urllib2.urlopen('http://www.google.com')
  del a # or a = None or del(a)
  
  # check memory on memory leaks
  import gc
  gc.set_debug(gc.DEBUG_SAVEALL)
  gc.collect()
  for it in gc.garbage:
    print it
==========================

In our code, we're using lots of urlopens in a loop and 
the number of unreachable objects grows beyond all 
limits :) We also tried a.close() but it didn't help.

You can also try the following:
==========================
def print_unreachable_len():
  # check memory on memory leaks
  import gc
  gc.set_debug(gc.DEBUG_SAVEALL)
  gc.collect()
  unreachableL = []
  for it in gc.garbage:
    unreachableL.append(it)
  return len(str(unreachableL))
  
if __name__ == '__main__':
  print "at the beginning", print_unreachable_len()

  import urllib2
  print "after import of urllib2", print_unreachable_len()

  a = urllib2.urlopen('http://www.google.com')
  print 'after urllib2.urlopen', print_unreachable_len()

  del a
  print 'after del', print_unreachable_len()
==========================
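As a side note, the print_unreachable_len() helper above returns the length of the list's string representation rather than an object count. A variant that counts the unreachable objects directly (a sketch using the same gc flags; the helper name is made up) would be:

```python
import gc

def count_unreachable():
    # DEBUG_SAVEALL makes the collector append everything it finds
    # unreachable to gc.garbage instead of freeing it, so the result
    # is cumulative across calls, just like the script above.
    gc.set_debug(gc.DEBUG_SAVEALL)
    gc.collect()
    return len(gc.garbage)
```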

We're using WindowsXP with latest patches, Python 2.4
(ActivePython 2.4 Build 243 (ActiveState Corp.) based on
Python 2.4 (#60, Nov 30 2004, 09:34:21) [MSC v.1310 
32 bit (Intel)] on win32).
msg60744 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2005-06-01 23:13
Logged In: YES 
user_id=11375

Confirmed.  The objects involved seem to be an HTTPResponse and the 
socket._fileobject wrapper; the assignment 'r.recv=r.read' around line 1013 
of urllib2.py seems to be critical to creating the cycle.
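For illustration, the cycle can be reproduced without urllib2 at all: assigning a bound method to an attribute of its own instance makes the instance reachable from itself, so plain reference counting can never free it (a minimal sketch; Response is a stand-in class, not the real HTTPResponse):

```python
import gc

class Response:              # stand-in for httplib.HTTPResponse
    def read(self):
        return b""

gc.collect()                 # start from a clean slate
r = Response()
r.recv = r.read              # the bound method references r back
del r                        # refcounting alone cannot reclaim this

found = gc.collect()         # but the cyclic collector can
print(found)                 # > 0: the cycle was found
```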
msg60745 - (view) Author: Sean Reifschneider (jafo) * (Python committer) Date: 2005-06-29 03:27
I can reproduce this in both the python.org 2.4 RPM and in a
freshly built copy from CVS.  If I run a few thousand
urlopen()s, I get:

Traceback (most recent call last):
  File "/tmp/mt", line 26, in ?
  File "/tmp/python/dist/src/Lib/urllib2.py", line 130, in
urlopen
  File "/tmp/python/dist/src/Lib/urllib2.py", line 361, in open
  File "/tmp/python/dist/src/Lib/urllib2.py", line 379, in _open
  File "/tmp/python/dist/src/Lib/urllib2.py", line 340, in
_call_chain
  File "/tmp/python/dist/src/Lib/urllib2.py", line 1026, in
http_open
  File "/tmp/python/dist/src/Lib/urllib2.py", line 1001, in
do_open
urllib2.URLError: <urlopen error (24, 'Too many open files')>

This happens even if I do a.close().  I'll investigate a bit further.

Sean
msg60746 - (view) Author: Sean Reifschneider (jafo) * (Python committer) Date: 2005-06-29 03:52
I give up, this code is kind of a maze of twisty little
passages.  I did try doing "a.fp.close()" and that didn't
seem to help at all, and I couldn't really make any progress
there.  I also tried doing "if a.headers.fp:
a.headers.fp.close()", which didn't do anything noticeable.
msg60747 - (view) Author: Brian Wellington (bwelling) Date: 2005-08-12 02:22
We just ran into this same problem, and worked around it by
simply removing the 'r.recv = r.read' line in urllib2.py,
and creating a recv alias to the read function in
HTTPResponse ('recv = read' in the class).

Not sure if this is the best solution, but it seems to work.
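The class-level alias avoids the cycle because a plain function stored on the class does not hold a reference to any particular instance; the cycle only appears when a bound method is stored on the instance itself. A sketch of the difference (stand-in class, not the real HTTPResponse):

```python
import gc

class Response:
    def read(self):
        return b""
    recv = read              # class-level alias: no per-instance reference

gc.collect()                 # clean slate
r = Response()
assert r.recv() == b""       # recv binds to r only at call time
del r                        # freed immediately by reference counting

print(gc.collect())          # 0: no cycle was created
```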
msg60748 - (view) Author: Sean Reifschneider (jafo) * (Python committer) Date: 2005-08-12 22:30
I've just tried it again using the current CVS version as
well as the version installed with Fedora Core 4, and in
both cases I was able to run over 100,000 retrievals of
http://127.0.0.1/test.html and http://127.0.0.1/google.html.
 test.html is just "it works" and google.html was generated
with "wget -O google.html http://google.com/".

I was able to reproduce this before, but now am not.  My
urllib2.py includes the r.recv=r.read line.  I have upgraded
from FC3 to FC4, could this be something related to an OS or
library interaction?  I was going to try to confirm the last
message, but now I can't reproduce the failure.
msg60749 - (view) Author: Brian Wellington (bwelling) Date: 2005-08-15 18:13
The real problem we were seeing wasn't the memory leak, it
was a file descriptor leak.  Leaking references within the
interpreter is bad, but the garbage collector will
eventually notice that the system is out of memory and clean
them up.  Leaking file descriptors is much worse, as gc won't
be triggered when the process has reached its limit, and
the process will start failing with "Too many open files".

To easily show this problem, run the following from an
interactive python interpreter:

import urllib2
f = urllib2.urlopen('http://www.google.com')
f.close()

and from another window, run "lsof -p <pid of interpreter>".
It should show a TCP socket in CLOSE_WAIT, which means the
file descriptor is still open.  I'm seeing weirdness on
Fedora Core 4 today that I didn't see last week, where after
a few seconds the file descriptor is listed as "can't
identify protocol" instead of TCP, but that's not too
relevant, since it's still open.

Repeating the urllib2.urlopen()/close() pairs of statements
in the interpreter will cause more fds to be leaked, which
can also be seen by lsof.
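The lsof check can also be done from inside the process: os.fstat() fails with EBADF once a descriptor is truly closed. A small self-contained helper (the name fd_is_open is made up for this sketch):

```python
import errno
import os

def fd_is_open(fd):
    """Return True if fd is an open file descriptor in this process."""
    try:
        os.fstat(fd)
        return True
    except OSError as e:
        if e.errno == errno.EBADF:
            return False
        raise

r, w = os.pipe()
print(fd_is_open(r))   # True
os.close(r)
print(fd_is_open(r))   # False
os.close(w)
```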
msg60750 - (view) Author: Steve Holden (holdenweb) * (Python committer) Date: 2005-10-14 04:13
The Windows 2.4.1 build doesn't show this error, but the
Cygwin 2.4.1 build does still have uncollectable objects
after a urllib2.urlopen(), so there may be a platform
dependency here. No 2.4.2 on Cygwin yet, so nothing
conclusive as lsof isn't available.
msg60751 - (view) Author: Neil Swinton (nswinton) Date: 2005-10-18 15:00
It's not the prettiest thing, but you can work around this
by setting the socket's recv method to None before closing it.

import urllib2
f = urllib2.urlopen('http://www.google.com')
text=f.read()
f.fp._sock.recv=None # hacky avoidance
f.close()

msg76298 - (view) Author: Toshio Kuratomi (a.badger) * Date: 2008-11-24 05:20
I tried to repeat the test in http://bugs.python.org/msg60749 and found
that the descriptors will close if you read from the file before closing.

So this leads to open descriptors::

  import urllib2
  f = urllib2.urlopen('http://www.google.com')
  f.close()

while this does not::

  import urllib2
  f = urllib2.urlopen('http://www.google.com')
  f.read(1)
  f.close()
msg76300 - (view) Author: Toshio Kuratomi (a.badger) * Date: 2008-11-24 05:47
One further data point: on two rhel5 systems with identical kernels,
both x86_64, both python-2.4.3 (basically, everything I've thought to
check is identical), I ran the test code with f.read() in an infinite loop.
One system only has one TCP socket in use at a time; the other has
multiple TCP sockets in use, but they all close eventually.

/usr/sbin/lsof -p INTERPRETER_PID|wc -l reported 

96 67 97 63 91 62 94 78

on subsequent runs.
msg76350 - (view) Author: Jeremy Hylton (jhylton) (Python triager) Date: 2008-11-24 18:03
Python 2.4 is now in security-fix-only mode. No new features are being
added, and bugs are not fixed anymore unless they affect the stability
and security of the interpreter, or of Python applications.
http://www.python.org/download/releases/2.4.5/

This bug doesn't rise to the level of making into a 2.4.6.
msg76368 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-11-24 22:22
Reopening: I reproduce the problem consistently with both 2.6 and trunk 
versions (not with python 3.0), on Windows XP.
msg76683 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2008-12-01 10:40
> Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
> 
> Reopening: I reproduce the problem consistently with both 2.6 and trunk 
> versions (not with python 3.0), on Windows XP.
> 

I think this bug is ONLY on Windows systems.
I am not able to reproduce it with the current trunk on Linux (Ubuntu
8.04). I tried 100 and 1000 instances of open and close, and every time
the file descriptors for the TCP connections go through ESTABLISHED and
SYN_SENT and are closed.

And yes, certain instances randomly showed 'can't identify protocol',
but that's a different issue.

The original bug, raised against Python 2.4 on Linux, seems to have
been fixed.
A Windows expert should comment on whether this is consistently
reproducible on Windows.
msg88812 - (view) Author: BULOT (stephbul) Date: 2009-06-03 13:31
Hello, 

I'm facing a urllib2 memory leak issue in one of my scripts, which is not
threaded. I made a few tests in order to check what was going on, and I
found this already existing (but old) bug thread.

I'm not able to figure out the issue yet, but here is some
information:
Platform: Debian
Python version 2.5.4

I made a script (2 attached files) that accesses a web page
(http://www.google.com) every second and monitors the number of file
descriptors and the memory footprint.
I also used the gc module (garbage collector) to retrieve the
number of objects that are not freed (as already proposed in this
thread, but more focused on the gc.DEBUG_LEAK flag)

Here are my results:
First access output:
gc: collectable <dict 0xb793c604>
gc: collectable <HTTPResponse instance at 0xb7938f6c>
gc: collectable <dict 0xb793c4f4>
gc: collectable <HTTPMessage instance at 0xb793d0ec>
gc: collectable <dict 0xb793c02c>
gc: collectable <list 0xb7938e8c>
gc: collectable <list 0xb7938ecc>
gc: collectable <instancemethod 0xb79cf824>
gc: collectable <dict 0xb793c79c>
gc: collectable <HTTPResponse instance at 0xb793d2cc>
gc: collectable <instancemethod 0xb79cf874>
unreachable objects:  11
File descriptors number: 32
Memory: 4612

Tenth access:
gc: collectable <dict 0xb78f14f4>
gc: collectable <HTTPResponse instance at 0xb78f404c>
gc: collectable <dict 0xb78f13e4>
gc: collectable <HTTPMessage instance at 0xb78f462c>
gc: collectable <dict 0xb78e5f0c>
gc: collectable <list 0xb78eeb4c>
gc: collectable <list 0xb78ee2ac>
gc: collectable <instancemethod 0xb797b7fc>
gc: collectable <dict 0xb78f168c>
gc: collectable <HTTPResponse instance at 0xb78f442c>
gc: collectable <instancemethod 0xb78eaa7c>
unreachable objects:  110
File descriptors number: 32
Memory: 4680

After hundred access:
gc: collectable <dict 0x89e2e84>
gc: collectable <HTTPResponse instance at 0x89e3e2c>
gc: collectable <dict 0x89e2d74>
gc: collectable <HTTPMessage instance at 0x89e3ccc>
gc: collectable <dict 0x89db0b4>
gc: collectable <list 0x89e3cac>
gc: collectable <list 0x89e32ec>
gc: collectable <instancemethod 0x89d8964>
gc: collectable <dict 0x89e60b4>
gc: collectable <HTTPResponse instance at 0x89e50ac>
gc: collectable <instancemethod 0x89ddb1c>
unreachable objects:  1100
File descriptors number: 32
Memory: 5284

Each call to urllib2.urlopen() leaves 11 new unreachable objects and
increases the memory footprint, without leaving new open files.

Do you have any idea?
With the hack proposed in message
http://bugs.python.org/issue1208304#msg60751, the number of unreachable
objects goes down to 8, but memory still increases.

Regards.

stephbul

PS
My urllib2leak.py test calls a monitor script (I'm not able to attach it):
#! /bin/sh

PROCS='urllib2leak.py'

RUNPID=`ps aux | grep "$PROCS" | grep -v "grep" | awk '{printf $2}'`
FDESC=`lsof -p $RUNPID | wc -l`
MEM=`ps aux | grep "$PROCS" | grep -v "grep" | awk '{printf $6 }'`

echo "File descriptors number: "$FDESC
echo "Memory: "$MEM
msg92245 - (view) Author: clemens pecinovsky (peci) Date: 2009-09-04 10:17
I also ran into the problem of cyclic dependencies. I know that calling
gc.collect() would solve the problem, but calling gc.collect()
takes a long time.

The problem is the cyclic dependency created by
r.recv = r.read

I have fixed it locally by wrapping the addinfourl in a new class (I
called it addinfourlFixCyclRef) and overriding the close method; within
close, recv is set back to None.

class addinfourlFixCyclRef(addinfourl):
    def close(self):
        if self.fp is not None and hasattr(self.fp, "_sock"):
            self.fp._sock.recv = None
        addinfourl.close(self)

....

        r.recv = r.read
        fp = socket._fileobject(r, close=True)

        resp = addinfourlFixCyclRef(fp, r.msg, req.get_full_url())


When I call .close() on the response, it just works. Unfortunately I
had to patch even more to handle the case where an exception is raised.
For the whole fix, see the attachment.
msg114503 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-08-21 15:52
On Windows Vista I can consistently reproduce this with 2.6 and 2.7 but not with 3.1 or 3.2.
msg186550 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-04-11 09:27
The entire description of this issue is bogus. Reference cycles are not a bug, since Python has a cyclic garbage collector. Closing as invalid.
msg186552 - (view) Author: Ralf Schmitt (schmir) Date: 2013-04-11 09:52
I'd consider reference cycles a bug, especially if they prevent file descriptors from being closed. Please read the comments.
msg186556 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-04-11 12:39
I see no file descriptor leak myself:

>>> f = urllib2.urlopen("http://www.google.com")
>>> f.fileno()
3
>>> os.fstat(3)
posix.stat_result(st_mode=49663, st_ino=5045244, st_dev=7L, st_nlink=1, st_uid=1000, st_gid=1000, st_size=0, st_atime=0, st_mtime=0, st_ctime=0)
>>> del f
>>> os.fstat(3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 9] Bad file descriptor

Ditto with Python 3:

>>> f = urllib.request.urlopen("http://www.google.com")
>>> f.fileno()
3
>>> os.fstat(3)
posix.stat_result(st_mode=49663, st_ino=5071469, st_dev=7, st_nlink=1, st_uid=1000, st_gid=1000, st_size=0, st_atime=0, st_mtime=0, st_ctime=0)
>>> del f
>>> os.fstat(3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 9] Bad file descriptor

Furthermore, you can use the `with` statement to ensure timely disposal of system resources:

>>> f = urllib.request.urlopen("http://www.google.com")
>>> with f: f.fileno()
... 
3
>>> os.fstat(3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 9] Bad file descriptor
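On Python 2, where urllib2's response objects are not context managers themselves, contextlib.closing gives the same guarantee: it calls .close() on exit regardless of the object's type. A sketch using an in-memory file as a stand-in for a live response:

```python
from contextlib import closing
import io

resp = io.BytesIO(b"payload")    # stand-in for a urllib2 response
with closing(resp):
    data = resp.read()

print(resp.closed)               # True: closed even on an exception path
```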
msg186560 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2013-04-11 13:24
Where did file descriptors come into it? Surely this is all about memory leaks. In any case, it's hardly a show stopper, as there are at least three references above to the problem line of code and three workarounds.
History
Date User Action Args
2013-04-11 13:24:20BreamoreBoysetmessages: + msg186560
2013-04-11 12:39:18pitrousetmessages: + msg186556
2013-04-11 09:52:24schmirsetmessages: + msg186552
2013-04-11 09:27:26pitrousetstatus: open -> closed

nosy: + pitrou
messages: + msg186550

resolution: not a bug
2013-04-10 22:03:15schmirsetnosy: + schmir
2011-02-09 23:18:38gdubsetnosy: + gdub
2010-08-21 15:52:52BreamoreBoysetnosy: + BreamoreBoy
messages: + msg114503
2010-07-20 03:16:43BreamoreBoysetversions: + Python 2.6, Python 3.1, Python 2.7, Python 3.2, - Python 2.5
2009-09-04 10:17:26pecisetfiles: + urllib2.py
nosy: + peci
messages: + msg92245

2009-06-03 13:31:30stephbulsetfiles: + urllib2leak.py
versions: - Python 2.6, Python 2.7
nosy: + stephbul

messages: + msg88812
2008-12-01 10:40:12orsenthilsetnosy: + orsenthil
messages: + msg76683
2008-11-29 01:16:45gregory.p.smithsettype: resource usage
components: + Library (Lib), - Extension Modules
versions: + Python 2.6, Python 2.5, Python 2.7, - Python 2.4
2008-11-24 22:22:34amaury.forgeotdarcsetstatus: closed -> open
nosy: + amaury.forgeotdarc
resolution: wont fix -> (no value)
messages: + msg76368
2008-11-24 18:03:58jhyltonsetstatus: open -> closed
nosy: + jhylton
resolution: wont fix
messages: + msg76350
2008-11-24 05:47:08a.badgersetmessages: + msg76300
2008-11-24 05:20:15a.badgersetnosy: + a.badger
messages: + msg76298
2005-05-25 09:20:22manekczcreate