msg70924 - (view) |
Author: Andrew Dalke (dalke) * |
Date: 2008-08-09 02:29 |
I wrote a buggy PNG parser which ended up doing several file.read(large
value). It causes a MemoryError, which was strange because the file was
only a few KB long.
I tracked it down to the implementation of read(). When given a size
hint it preallocates the return string with that size. If the hint is
for 10MB then the string returned will be preallocated fro 10MB, even if
the actual read is empty.
Here's a reproducible
BLOCKSIZE = 10*1024*1024
f=open("empty.txt", "w")
f.close()
f=open("empty.txt")
data = []
for i in range(10000):
s = f.read(BLOCKSIZE)
assert len(s) == 0
data.append(s)
I wasn't sure if this is properly a bug, but since the MemoryError
exception I got was quite unexpected and required digging into the
source code to figure out, I'll say that it is.
|
msg70925 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2008-08-09 09:39 |
I can't reproduce, your code snippet works fine. What Python version is it?
|
msg70926 - (view) |
Author: Andrew Dalke (dalke) * |
Date: 2008-08-09 10:00 |
I tested it with Python 2.5 on a Mac, Python 2.5 on FreeBSD, and Python
2.6b2+ (from SVN as of this morning) on a Mac.
Perhaps the memory allocator on your machine is making a promise it can't
keep?
|
msg70927 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2008-08-09 10:15 |
Perhaps. I'm under Linux.
However, at the end of the file_read() implementation in fileobject.c,
you can find the following lines:
if (bytesread != buffersize)
_PyString_Resize(&v, bytesread);
Which means that the string *is* resized at the end.
|
msg70930 - (view) |
Author: Andrew Dalke (dalke) * |
Date: 2008-08-09 11:26 |
You're right. I mistook the string implementation for the list one
which does keep a preallocated section in case of growth. Strings of
course don't grow so there's no need for that.
I tracked the memory allocation all the way down to
obmalloc.c:PyObject_Realloc . The call goes to realloc(p, nbytes) which
is a C lib call. It appears that the memory space is not reallocated.
That was enough to be able to find the python-dev thread "Darwin's
realloc(...) implementation never shrinks allocations" from Jan. 2005,
Bob Ippolito's post "realloc.. doesn’t?"
(http://bob.pythonmac.org/archives/2005/01/01/realloc-doesnt/ ) and Issue1092502 .
Mind you, I also get the problem on FreeBSD 2.6 so it isn't Darwin
specific.
|
msg70931 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2008-08-09 11:31 |
Le samedi 09 août 2008 à 11:26 +0000, Andrew Dalke a écrit :
> Mind you, I also get the problem on FreeBSD 2.6 so it isn't Darwin
> specific.
Darwin and the BSD's supposedly share a lot of common stuff.
But FreeBSD 2.6 is a bit old, isn't it?
|
msg70934 - (view) |
Author: Andrew Dalke (dalke) * |
Date: 2008-08-09 16:42 |
FreeBSD is why my hosting provider uses. Freebsd.org calls 2.6 "legacy"
but the latest update was earlier this year.
There is shared history with Macs. I don't know the details though. I
just point out that the problem isn't only on Darwin.
|
msg73010 - (view) |
Author: Anthon van der Neut (anthon) * |
Date: 2008-09-11 09:37 |
FWIW:
I have performance problems on Windows XP (SP2) with Python 2.5.1 that
could be caused by this behaviour.
My code regularily calculates the sha1 sum of 10.000 files and because
in another reuse of the code had to deal with files too big to fit in memory
I set a limit of 256Mb. It looks like that is now allocated and
deallocated for every one of the 10.000 files, making things *very* slow.
|
msg73022 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2008-09-11 13:23 |
> My code regularily calculates the sha1 sum of 10.000 files and because
> in another reuse of the code had to deal with files too big to fit in
> memory I set a limit of 256Mb.
Why don't you use a sensible buffer size, e.g. 1MB? Reading data in
256MB chunks sounds foolish in any case.
|
msg73023 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2008-09-11 13:28 |
Andrew, as for memory reallocation issues, you may take a look at #3526
where someone has similar problems on SunOS.
If nobody objects, I will close the present bug as invalid.
|
msg73030 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2008-09-11 14:07 |
Le jeudi 11 septembre 2008 à 16:01 +0200, Anthon van der Neut a écrit :
> The thing however was resolved by reading multiple smaller chunks indeed
> 1Mb if the filesize exceeds 1Mb (in the latter case the original read()
> is done.
It's too complicated. Just use chunks in all cases (even small files)
and you are done. There should be no visible performance downside to
doing so.
Using fixed-size chunks to read binary data from a file of an unknown
size isn't a Python-specific idiom, really. It's the same in C, C++,
PHP, etc.
|
msg73066 - (view) |
Author: Andrew Dalke (dalke) * |
Date: 2008-09-11 22:58 |
I'm still undecided on if this is a bug or not. The problem occurs even
when I'm not reading "data from a file of an unknown size." My example
causes a MemoryError on my machine even though the file I'm reading
contains 0 bytes.
The problem is Python's implementation is "alloc the requested bytes and
truncate if needed" vs what I expected "read chunks at a time up to the
requested number of bytes." There's nothing in the documentation which
states the implementation, although "Note that this method may call the
underlying C function fread more than once in an effort to acquire as
close to size bytes as possible." leans slightly towards my
interpretation.
I looked a little for real-world cases that could cause a denial-of-
service attack but didn't find one.
If there is a problem, it will occur very rarely. Go ahead an mark it
as "will not fix" or something similar. I don't think the change in the
code is justifiable.
|
|
Date |
User |
Action |
Args |
2022-04-11 14:56:37 | admin | set | github: 47781 |
2008-09-13 21:10:47 | pitrou | set | status: open -> closed resolution: rejected |
2008-09-11 22:58:40 | dalke | set | messages:
+ msg73066 |
2008-09-11 14:07:12 | pitrou | set | messages:
+ msg73030 title: file read preallocs 'size' bytes which can cause memory problems -> file read preallocs 'size' bytes which can cause memory problems |
2008-09-11 13:28:23 | pitrou | set | messages:
+ msg73023 |
2008-09-11 13:23:48 | pitrou | set | messages:
+ msg73022 |
2008-09-11 09:37:49 | anthon | set | nosy:
+ anthon messages:
+ msg73010 |
2008-08-09 16:42:06 | dalke | set | messages:
+ msg70934 |
2008-08-09 11:31:41 | pitrou | set | messages:
+ msg70931 |
2008-08-09 11:26:49 | dalke | set | messages:
+ msg70930 |
2008-08-09 10:15:08 | pitrou | set | messages:
+ msg70927 |
2008-08-09 10:00:09 | dalke | set | messages:
+ msg70926 |
2008-08-09 09:39:07 | pitrou | set | nosy:
+ pitrou messages:
+ msg70925 |
2008-08-09 02:29:44 | dalke | create | |