classification
Title: file read preallocs 'size' bytes which can cause memory problems
Type: resource usage
Stage:
Components: Interpreter Core
Versions:

process
Status: closed
Resolution: rejected
Dependencies:
Superseder:
Assigned To:
Nosy List: anthon, dalke, pitrou
Priority: normal
Keywords:

Created on 2008-08-09 02:29 by dalke, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (12)
msg70924 - Author: Andrew Dalke (dalke) (Python committer) Date: 2008-08-09 02:29
I wrote a buggy PNG parser which ended up doing several file.read(large 
value) calls.  It caused a MemoryError, which was strange because the file 
was only a few KB long.

I tracked it down to the implementation of read().  When given a size 
hint it preallocates the return string with that size.  If the hint is 
for 10MB then the returned string will be preallocated for 10MB, even if 
the actual read is empty.

Here's a reproducible example:

BLOCKSIZE = 10*1024*1024  # ask for 10MB per read

# Create an empty file.
f = open("empty.txt", "w")
f.close()

f = open("empty.txt")
data = []
for i in range(10000):
    # Each read() preallocates BLOCKSIZE bytes for the result even
    # though the file is empty; on affected platforms this eventually
    # raises MemoryError.
    s = f.read(BLOCKSIZE)
    assert len(s) == 0
    data.append(s)


I wasn't sure whether this properly counts as a bug, but since the 
MemoryError exception I got was quite unexpected and required digging 
into the source code to figure out, I'll say that it is.
msg70925 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2008-08-09 09:39
I can't reproduce it; your code snippet works fine. Which Python version is this?
msg70926 - Author: Andrew Dalke (dalke) (Python committer) Date: 2008-08-09 10:00
I tested it with Python 2.5 on a Mac, Python 2.5 on FreeBSD, and Python 
2.6b2+ (from SVN as of this morning) on a Mac.

Perhaps the memory allocator on your machine is making a promise it can't 
keep?
msg70927 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2008-08-09 10:15
Perhaps. I'm under Linux.

However, at the end of the file_read() implementation in fileobject.c,
you can find the following lines:

if (bytesread != buffersize)
    _PyString_Resize(&v, bytesread);

Which means that the string *is* resized at the end.
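
In Python terms the strategy looks roughly like this (a sketch only; the
helper name read_prealloc is illustrative, and the file must be opened in
binary mode):

def read_prealloc(f, size):
    # Preallocate `size` bytes up front, as file_read() does.
    buf = bytearray(size)
    n = f.readinto(buf)  # may read fewer bytes than requested
    # Shrink to the bytes actually read, like the
    # _PyString_Resize() call above.
    del buf[n:]
    return bytes(buf)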
msg70930 - Author: Andrew Dalke (dalke) (Python committer) Date: 2008-08-09 11:26
You're right.  I mistook the string implementation for the list one, 
which does keep a preallocated section in case of growth.  Strings of 
course don't grow, so there's no need for that.
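
The list over-allocation is easy to see from Python (an illustrative
snippet; exact sizes vary by platform and Python version):

import sys

items = []
for i in range(20):
    items.append(i)
    # getsizeof() grows in jumps: lists over-allocate so that
    # repeated appends run in amortized constant time.
    print("%d %d" % (len(items), sys.getsizeof(items)))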

I tracked the memory allocation all the way down to 
obmalloc.c:PyObject_Realloc .  The call goes to realloc(p, nbytes), which 
is a C library call.  It appears that the memory space is never actually 
shrunk.

That was enough to be able to find the python-dev thread "Darwin's 
realloc(...) implementation never shrinks allocations" from Jan. 2005, 
Bob Ippolito's post "realloc.. doesn’t?" 
(http://bob.pythonmac.org/archives/2005/01/01/realloc-doesnt/) and Issue 1092502.

Mind you, I also get the problem on FreeBSD 2.6 so it isn't Darwin 
specific.
msg70931 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2008-08-09 11:31
On Saturday 9 August 2008 at 11:26 +0000, Andrew Dalke wrote:
> Mind you, I also get the problem on FreeBSD 2.6 so it isn't Darwin 
> specific.

Darwin and the BSDs supposedly share a lot of common stuff.
But FreeBSD 2.6 is a bit old, isn't it?
msg70934 - Author: Andrew Dalke (dalke) (Python committer) Date: 2008-08-09 16:42
FreeBSD is what my hosting provider uses.  Freebsd.org calls 2.6 "legacy", 
but the latest update was earlier this year.

There is shared history with Macs.  I don't know the details though.  I 
just point out that the problem isn't only on Darwin.
msg73010 - Author: Anthon van der Neut (anthon) Date: 2008-09-11 09:37
FWIW:
I have performance problems on Windows XP (SP2) with Python 2.5.1 that
could be caused by this behaviour.
My code regularly calculates the sha1 sum of 10,000 files, and because
another use of the code had to deal with files too big to fit in memory,
I set a limit of 256MB. It looks like 256MB is now allocated and
deallocated for every one of the 10,000 files, making things *very* slow.
msg73022 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2008-09-11 13:23
> My code regularly calculates the sha1 sum of 10,000 files, and because
> another use of the code had to deal with files too big to fit in
> memory, I set a limit of 256MB.

Why don't you use a sensible buffer size, e.g. 1MB? Reading data in
256MB chunks sounds foolish in any case.
msg73023 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2008-09-11 13:28
Andrew, as for memory reallocation issues, you may take a look at #3526
where someone has similar problems on SunOS.

If nobody objects, I will close the present bug as invalid.
msg73030 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2008-09-11 14:07
On Thursday 11 September 2008 at 16:01 +0200, Anthon van der Neut wrote:
> The thing however was resolved by reading multiple smaller 1MB chunks
> when the file size exceeds 1MB (otherwise the original read() is done).

It's too complicated. Just use chunks in all cases (even small files)
and you are done. There should be no visible performance downside to
doing so.

Using fixed-size chunks to read binary data from a file of an unknown
size isn't a Python-specific idiom, really. It's the same in C, C++,
PHP, etc.
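
A minimal sketch of that idiom for the sha1 case (the helper name and
the 1MB chunk size here are illustrative):

import hashlib

def sha1_of_file(path, chunk_size=1024 * 1024):
    # Fixed-size chunks: at most chunk_size bytes are buffered at a
    # time, no matter how large the file is.
    h = hashlib.sha1()
    f = open(path, "rb")
    try:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    finally:
        f.close()
    return h.hexdigest()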
msg73066 - Author: Andrew Dalke (dalke) (Python committer) Date: 2008-09-11 22:58
I'm still undecided about whether this is a bug or not.  The problem occurs even 
when I'm not reading "data from a file of an unknown size."  My example 
causes a MemoryError on my machine even though the file I'm reading 
contains 0 bytes.

The problem is that Python's implementation is "alloc the requested bytes 
and truncate if needed" versus what I expected: "read chunks at a time up 
to the requested number of bytes."  Nothing in the documentation states 
which implementation is used, although "Note that this method may call 
the underlying C function fread more than once in an effort to acquire 
as close to size bytes as possible." leans slightly towards my 
interpretation.
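
A Python-level sketch of that second strategy (the helper name and the
64KB chunk size are illustrative):

def read_up_to(f, size, chunk_size=64 * 1024):
    # Grow the result as data arrives instead of preallocating
    # `size` bytes up front; an empty file then costs almost nothing.
    parts = []
    remaining = size
    while remaining > 0:
        chunk = f.read(min(chunk_size, remaining))
        if not chunk:
            break
        parts.append(chunk)
        remaining -= len(chunk)
    return "".join(parts)  # for a binary-mode file, join with b""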

I looked a little for real-world cases that could cause a 
denial-of-service attack but didn't find one.

If there is a problem, it will occur very rarely.  Go ahead and mark it 
as "will not fix" or something similar.  I don't think the change to the 
code is justifiable.
History
Date                 User    Action  Args
2022-04-11 14:56:37  admin   set     github: 47781
2008-09-13 21:10:47  pitrou  set     status: open -> closed; resolution: rejected
2008-09-11 22:58:40  dalke   set     messages: + msg73066
2008-09-11 14:07:12  pitrou  set     messages: + msg73030
2008-09-11 13:28:23  pitrou  set     messages: + msg73023
2008-09-11 13:23:48  pitrou  set     messages: + msg73022
2008-09-11 09:37:49  anthon  set     nosy: + anthon; messages: + msg73010
2008-08-09 16:42:06  dalke   set     messages: + msg70934
2008-08-09 11:31:41  pitrou  set     messages: + msg70931
2008-08-09 11:26:49  dalke   set     messages: + msg70930
2008-08-09 10:15:08  pitrou  set     messages: + msg70927
2008-08-09 10:00:09  dalke   set     messages: + msg70926
2008-08-09 09:39:07  pitrou  set     nosy: + pitrou; messages: + msg70925
2008-08-09 02:29:44  dalke   create