classification
Title: file read preallocs 'size' bytes which can cause memory problems
Type: resource usage Stage:
Components: Interpreter Core Versions:
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: anthon, dalke, pitrou
Priority: normal Keywords:

Created on 2008-08-09 02:29 by dalke, last changed 2008-09-13 21:10 by pitrou. This issue is now closed.

Messages (12)
msg70924 - (view) Author: Andrew Dalke (dalke) * (Python committer) Date: 2008-08-09 02:29
I wrote a buggy PNG parser which ended up doing several file.read(large 
value).  It causes a MemoryError, which was strange because the file was 
only a few KB long.

I tracked it down to the implementation of read().  When given a size 
hint it preallocates the return string with that size.  If the hint is 
for 10MB then the string returned will be preallocated fro 10MB, even if 
the actual read is empty.

Here's a reproducible

BLOCKSIZE = 10*1024*1024

f=open("empty.txt", "w")
f.close()

f=open("empty.txt")
data = []
for i in range(10000):
    s = f.read(BLOCKSIZE)
    assert len(s) == 0
    data.append(s)


I wasn't sure if this is properly a bug, but since the MemoryError 
exception I got was quite unexpected and required digging into the 
source code to figure out, I'll say that it is.
msg70925 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-09 09:39
I can't reproduce, your code snippet works fine. What Python version is it?
msg70926 - (view) Author: Andrew Dalke (dalke) * (Python committer) Date: 2008-08-09 10:00
I tested it with Python 2.5 on a Mac, Python 2.5 on FreeBSD, and Python 
2.6b2+ (from SVN as of this morning) on a Mac.

Perhaps the memory allocator on your machine is making a promise it can't 
keep?
msg70927 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-09 10:15
Perhaps. I'm under Linux.

However, at the end of the file_read() implementation in fileobject.c,
you can find the following lines:

if (bytesread != buffersize)
    _PyString_Resize(&v, bytesread);

Which means that the string *is* resized at the end.
msg70930 - (view) Author: Andrew Dalke (dalke) * (Python committer) Date: 2008-08-09 11:26
You're right.  I mistook the string implementation for the list one 
which does keep a preallocated section in case of growth.  Strings of 
course don't grow so there's no need for that.

I tracked the memory allocation all the way down to 
obmalloc.c:PyObject_Realloc .  The call goes to realloc(p, nbytes) which 
is a C lib call.  It appears that the memory space is not reallocated.

That was enough to be able to find the python-dev thread "Darwin's 
realloc(...) implementation never shrinks allocations" from Jan. 2005, 
Bob Ippolito's post "realloc.. doesn’t?" 
(http://bob.pythonmac.org/archives/2005/01/01/realloc-doesnt/ ) and Issue1092502 .

Mind you, I also get the problem on FreeBSD 2.6 so it isn't Darwin 
specific.
msg70931 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-09 11:31
Le samedi 09 août 2008 à 11:26 +0000, Andrew Dalke a écrit :
> Mind you, I also get the problem on FreeBSD 2.6 so it isn't Darwin 
> specific.

Darwin and the BSD's supposedly share a lot of common stuff.
But FreeBSD 2.6 is a bit old, isn't it?
msg70934 - (view) Author: Andrew Dalke (dalke) * (Python committer) Date: 2008-08-09 16:42
FreeBSD is why my hosting provider uses.  Freebsd.org calls 2.6 "legacy" 
but the latest update was earlier this year.

There is shared history with Macs.  I don't know the details though.  I 
just point out that the problem isn't only on Darwin.
msg73010 - (view) Author: Anthon van der Neut (anthon) * Date: 2008-09-11 09:37
FWIW:
I have performance problems on Windows XP (SP2) with Python 2.5.1 that
could be caused by this behaviour.
My code regularily calculates the sha1 sum of 10.000 files and because
in another reuse of the code had to deal with files too big to fit in memory
I set a limit of 256Mb. It looks like that is now allocated and
deallocated for every one of the 10.000 files, making things *very* slow.
msg73022 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-09-11 13:23
> My code regularily calculates the sha1 sum of 10.000 files and because
> in another reuse of the code had to deal with files too big to fit in
> memory I set a limit of 256Mb.

Why don't you use a sensible buffer size, e.g. 1MB? Reading data in
256MB chunks sounds foolish in any case.
msg73023 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-09-11 13:28
Andrew, as for memory reallocation issues, you may take a look at #3526
where someone has similar problems on SunOS.

If nobody objects, I will close the present bug as invalid.
msg73030 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-09-11 14:07
Le jeudi 11 septembre 2008 à 16:01 +0200, Anthon van der Neut a écrit :
> The thing however was resolved by reading multiple smaller chunks indeed
> 1Mb if the filesize exceeds 1Mb (in the latter case the original read()
> is done.

It's too complicated. Just use chunks in all cases (even small files)
and you are done. There should be no visible performance downside to
doing so.

Using fixed-size chunks to read binary data from a file of an unknown
size isn't a Python-specific idiom, really. It's the same in C, C++,
PHP, etc.
msg73066 - (view) Author: Andrew Dalke (dalke) * (Python committer) Date: 2008-09-11 22:58
I'm still undecided on if this is a bug or not.  The problem occurs even 
when I'm not reading "data from a file of an unknown size."  My example 
causes a MemoryError on my machine even though the file I'm reading 
contains 0 bytes.

The problem is Python's implementation is "alloc the requested bytes and 
truncate if needed" vs what I expected "read chunks at a time up to the 
requested number of bytes."  There's nothing in the documentation which 
states the implementation, although "Note that this method may call the 
underlying C function fread more than once in an effort to acquire as 
close to size bytes as possible." leans slightly towards my 
interpretation.

I looked a little for real-world cases that could cause a denial-of-
service attack but didn't find one.

If there is a problem, it will occur very rarely.  Go ahead an mark it 
as "will not fix" or something similar.  I don't think the change in the 
code is justifiable.
History
Date User Action Args
2008-09-13 21:10:47pitrousetstatus: open -> closed
resolution: rejected
2008-09-11 22:58:40dalkesetmessages: + msg73066
2008-09-11 14:07:12pitrousetmessages: + msg73030
title: file read preallocs 'size' bytes which can cause memory problems -> file read preallocs 'size' bytes which can cause memory problems
2008-09-11 13:28:23pitrousetmessages: + msg73023
2008-09-11 13:23:48pitrousetmessages: + msg73022
2008-09-11 09:37:49anthonsetnosy: + anthon
messages: + msg73010
2008-08-09 16:42:06dalkesetmessages: + msg70934
2008-08-09 11:31:41pitrousetmessages: + msg70931
2008-08-09 11:26:49dalkesetmessages: + msg70930
2008-08-09 10:15:08pitrousetmessages: + msg70927
2008-08-09 10:00:09dalkesetmessages: + msg70926
2008-08-09 09:39:07pitrousetnosy: + pitrou
messages: + msg70925
2008-08-09 02:29:44dalkecreate