Message 210371 - Python tracker


Author	gromgull
Recipients	gromgull
Date	2014-02-06.10:08:44
Message-id	<1391681325.2.0.183803213887.issue20528@psf.upfronthosting.co.za>
In-reply-to

Content

When reading large files with fileinput in the normal way, it works as expected and processes only one line at a time. But if you add a hook_encoded openhook, it reads the whole file into memory before returning the first line.

Verify by running this program on a large text file: 

import fileinput

for l in fileinput.input(openhook=fileinput.hook_encoded('iso-8859-1')):
    raw_input()

and check how much memory it uses. Remove the openhook and memory usage drops to almost nothing.

The problem is that fileinput calls readlines with a size hint, and readlines in codecs.StreamReader explicitly ignores this hint and reads all lines into memory.
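The behaviour described above can be reproduced without fileinput. The following is a minimal sketch, assuming that a reader obtained from codecs.getreader is a fair stand-in for the stream that hook_encoded opens; the in-memory data and the variable names are made up for illustration.

```python
import codecs
import io

# Build 1000 short encoded lines in memory (stands in for a large file).
data = b"".join(("line %d\n" % i).encode("iso-8859-1") for i in range(1000))

# A plain byte stream honours the size hint and stops after a few lines.
plain_lines = io.BytesIO(data).readlines(10)

# The codecs StreamReader accepts the same hint but decodes the whole stream.
reader = codecs.getreader("iso-8859-1")(io.BytesIO(data))
decoded_lines = reader.readlines(10)

print(len(plain_lines), len(decoded_lines))
```

The second count covers every line in the stream, which is why memory grows with file size once the openhook is in place.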

http://bugs.python.org/issue20501 is open for fixing up the documentation for fileinput, but an actual fix would also be nice.

I see two options: 

1. As suggested by r.david.murray: give us a way of signaling to fileinput that it should not use readlines, for instance by setting buffer=None.

2. Fix the codecs module so that StreamReader respects the hint if one is given. Although the comment there says there is no efficient way to do this, even an inefficient way would be better than reading in a possibly infinite stream. A simple solution would be to call readline repeatedly. A more complicated solution would be to read chunks from the stream and then decode them, just like the readline method does.
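The "simple solution" in option 2 could look roughly like the sketch below. It is not the codecs implementation: readlines_with_hint is a hypothetical helper, and it assumes the hint may be counted in decoded characters rather than bytes.

```python
import codecs
import io


def readlines_with_hint(reader, sizehint=None):
    """Hypothetical helper: gather lines with readline() until roughly
    sizehint characters have been collected, or until end of stream."""
    lines = []
    total = 0
    while True:
        line = reader.readline()  # line endings are kept by default
        if not line:  # empty string means end of stream
            break
        lines.append(line)
        total += len(line)
        # Assumption: the hint is measured in decoded characters, not bytes.
        if sizehint is not None and sizehint > 0 and total >= sizehint:
            break
    return lines


data = b"".join(("line %d\n" % i).encode("iso-8859-1") for i in range(1000))
reader = codecs.getreader("iso-8859-1")(io.BytesIO(data))

first_batch = readlines_with_hint(reader, 10)  # stops shortly after the hint
rest = readlines_with_hint(reader)             # no hint: reads to the end
```

Each call returns only a small batch, so a caller like fileinput would hold at most about one hint's worth of lines at a time. The per-line readline calls make this slower than a chunked approach.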

BTW - this issue is py2.7 only. I tested a file object from io.open with an encoding in 3.3, and its readlines handles the size hint just fine.

History
Date	User	Action	Args
2014-02-06 10:08:45	gromgull	set	recipients: + gromgull
2014-02-06 10:08:45	gromgull	set	messageid: <1391681325.2.0.183803213887.issue20528@psf.upfronthosting.co.za>
2014-02-06 10:08:45	gromgull	link	issue20528 messages
2014-02-06 10:08:44	gromgull	create