Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

file read preallocs 'size' bytes which can cause memory problems #47781

Closed
dalke mannequin opened this issue Aug 9, 2008 · 12 comments
Closed

file read preallocs 'size' bytes which can cause memory problems #47781

dalke mannequin opened this issue Aug 9, 2008 · 12 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@dalke
Copy link
Mannequin

dalke mannequin commented Aug 9, 2008

BPO 3531
Nosy @AvdN, @pitrou

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2008-09-13.21:10:47.769>
created_at = <Date 2008-08-09.02:29:44.060>
labels = ['interpreter-core', 'performance']
title = "file read preallocs 'size' bytes which can cause memory\tproblems"
updated_at = <Date 2008-09-13.21:10:47.768>
user = 'https://bugs.python.org/dalke'

bugs.python.org fields:

activity = <Date 2008-09-13.21:10:47.768>
actor = 'pitrou'
assignee = 'none'
closed = True
closed_date = <Date 2008-09-13.21:10:47.769>
closer = 'pitrou'
components = ['Interpreter Core']
creation = <Date 2008-08-09.02:29:44.060>
creator = 'dalke'
dependencies = []
files = []
hgrepos = []
issue_num = 3531
keywords = []
message_count = 12.0
messages = ['70924', '70925', '70926', '70927', '70930', '70931', '70934', '73010', '73022', '73023', '73030', '73066']
nosy_count = 3.0
nosy_names = ['anthon', 'dalke', 'pitrou']
pr_nums = []
priority = 'normal'
resolution = 'rejected'
stage = None
status = 'closed'
superseder = None
type = 'resource usage'
url = 'https://bugs.python.org/issue3531'
versions = []

@dalke
Copy link
Mannequin Author

dalke mannequin commented Aug 9, 2008

I wrote a buggy PNG parser which ended up doing several file.read(large
value). It causes a MemoryError, which was strange because the file was
only a few KB long.

I tracked it down to the implementation of read(). When given a size
hint it preallocates the return string with that size. If the hint is
for 10MB then the string returned will be preallocated fro 10MB, even if
the actual read is empty.

Here's a reproducible

BLOCKSIZE = 10*1024*1024

f=open("empty.txt", "w")
f.close()

f=open("empty.txt")
data = []
for i in range(10000):
    s = f.read(BLOCKSIZE)
    assert len(s) == 0
    data.append(s)

I wasn't sure if this is properly a bug, but since the MemoryError
exception I got was quite unexpected and required digging into the
source code to figure out, I'll say that it is.

@dalke dalke mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Aug 9, 2008
@pitrou
Copy link
Member

pitrou commented Aug 9, 2008

I can't reproduce, your code snippet works fine. What Python version is it?

@dalke
Copy link
Mannequin Author

dalke mannequin commented Aug 9, 2008

I tested it with Python 2.5 on a Mac, Python 2.5 on FreeBSD, and Python
2.6b2+ (from SVN as of this morning) on a Mac.

Perhaps the memory allocator on your machine is making a promise it can't
keep?

@pitrou
Copy link
Member

pitrou commented Aug 9, 2008

Perhaps. I'm under Linux.

However, at the end of the file_read() implementation in fileobject.c,
you can find the following lines:

if (bytesread != buffersize)
_PyString_Resize(&v, bytesread);

Which means that the string *is* resized at the end.

@dalke
Copy link
Mannequin Author

dalke mannequin commented Aug 9, 2008

You're right. I mistook the string implementation for the list one
which does keep a preallocated section in case of growth. Strings of
course don't grow so there's no need for that.

I tracked the memory allocation all the way down to
obmalloc.c:PyObject_Realloc . The call goes to realloc(p, nbytes) which
is a C lib call. It appears that the memory space is not reallocated.

That was enough to be able to find the python-dev thread "Darwin's
realloc(...) implementation never shrinks allocations" from Jan. 2005,
Bob Ippolito's post "realloc.. doesn’t?"
(http://bob.pythonmac.org/archives/2005/01/01/realloc-doesnt/ ) and bpo-1092502 .

Mind you, I also get the problem on FreeBSD 2.6 so it isn't Darwin
specific.

@pitrou
Copy link
Member

pitrou commented Aug 9, 2008

Le samedi 09 août 2008 à 11:26 +0000, Andrew Dalke a écrit :

Mind you, I also get the problem on FreeBSD 2.6 so it isn't Darwin
specific.

Darwin and the BSD's supposedly share a lot of common stuff.
But FreeBSD 2.6 is a bit old, isn't it?

@dalke
Copy link
Mannequin Author

dalke mannequin commented Aug 9, 2008

FreeBSD is why my hosting provider uses. Freebsd.org calls 2.6 "legacy"
but the latest update was earlier this year.

There is shared history with Macs. I don't know the details though. I
just point out that the problem isn't only on Darwin.

@AvdN
Copy link
Mannequin

AvdN mannequin commented Sep 11, 2008

FWIW:
I have performance problems on Windows XP (SP2) with Python 2.5.1 that
could be caused by this behaviour.
My code regularily calculates the sha1 sum of 10.000 files and because
in another reuse of the code had to deal with files too big to fit in memory
I set a limit of 256Mb. It looks like that is now allocated and
deallocated for every one of the 10.000 files, making things *very* slow.

@pitrou
Copy link
Member

pitrou commented Sep 11, 2008

My code regularily calculates the sha1 sum of 10.000 files and because
in another reuse of the code had to deal with files too big to fit in
memory I set a limit of 256Mb.

Why don't you use a sensible buffer size, e.g. 1MB? Reading data in
256MB chunks sounds foolish in any case.

@pitrou
Copy link
Member

pitrou commented Sep 11, 2008

Andrew, as for memory reallocation issues, you may take a look at bpo-3526
where someone has similar problems on SunOS.

If nobody objects, I will close the present bug as invalid.

@pitrou
Copy link
Member

pitrou commented Sep 11, 2008

Le jeudi 11 septembre 2008 à 16:01 +0200, Anthon van der Neut a écrit :

The thing however was resolved by reading multiple smaller chunks indeed
1Mb if the filesize exceeds 1Mb (in the latter case the original read()
is done.

It's too complicated. Just use chunks in all cases (even small files)
and you are done. There should be no visible performance downside to
doing so.

Using fixed-size chunks to read binary data from a file of an unknown
size isn't a Python-specific idiom, really. It's the same in C, C++,
PHP, etc.

@pitrou pitrou changed the title file read preallocs 'size' bytes which can cause memory problems file read preallocs 'size' bytes which can cause memory problems Sep 11, 2008
@dalke
Copy link
Mannequin Author

dalke mannequin commented Sep 11, 2008

I'm still undecided on if this is a bug or not. The problem occurs even
when I'm not reading "data from a file of an unknown size." My example
causes a MemoryError on my machine even though the file I'm reading
contains 0 bytes.

The problem is Python's implementation is "alloc the requested bytes and
truncate if needed" vs what I expected "read chunks at a time up to the
requested number of bytes." There's nothing in the documentation which
states the implementation, although "Note that this method may call the
underlying C function fread more than once in an effort to acquire as
close to size bytes as possible." leans slightly towards my
interpretation.

I looked a little for real-world cases that could cause a denial-of-
service attack but didn't find one.

If there is a problem, it will occur very rarely. Go ahead an mark it
as "will not fix" or something similar. I don't think the change in the
code is justifiable.

@pitrou pitrou closed this as completed Sep 13, 2008
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage
Projects
None yet
Development

No branches or pull requests

1 participant