Issue 20501: fileinput module will read whole file into memory when using fileinput.hook_encoded

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/64700

classification

Title:	fileinput module will read whole file into memory when using fileinput.hook_encoded
Type:	behavior	Stage:	resolved
Components:	Documentation, IO	Versions:	Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	serhiy.storchaka	Nosy List:	benjamin.peterson, docs@python, gromgull, hynek, pitrou, python-dev, r.david.murray, serhiy.storchaka, stutzbach, vajrasky, zach.ware
Priority:	normal	Keywords:	patch

Created on 2014-02-03 14:27 by gromgull, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
20501.output	zach.ware, 2014-02-26 15:30		review
fileinput_hook_encoded_3.patch	serhiy.storchaka, 2014-02-26 17:15		review

Messages (22)
msg210130 - (view)	Author: Gunnar Aastrand Grimnes (gromgull)	Date: 2014-02-03 14:27
When reading large files with fileinput, it will work as expected and only process a line at a time when used normally, but if you add an hook_encoded openhook it will read the whole file into memory before returning the first line. Verify by running this program on a large text file: import fileinput for l in fileinput.input(openhook=fileinput.hook_encoded('iso-8859-1')): raw_input() and check how much memory it uses. Remove the openhook and memory usage goes down to nothing.
msg210133 - (view)	Author: Gunnar Aastrand Grimnes (gromgull)	Date: 2014-02-03 14:55
The problem lies in codecs.py here: http://hg.python.org/cpython/file/ae7facd874ba/Lib/codecs.py#l581
msg210144 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2014-02-03 15:48
This requires (1) a doc update to indicate the problem and (2) a way to tell fileinput to not use readlines to optimize by calling readlines(bufsize), since in the case of using codecs it can be a problem. Presumably buffer=None would be the logical way to spell this. I'm marking this as a doc bug, a separate issue should be opened for adding the fileinput buffer=None enhancement. (The codec module's comments indicate there's no practical way to implement sizehint.)
msg210317 - (view)	Author: Gunnar Aastrand Grimnes (gromgull)	Date: 2014-02-05 14:12
Agreed that a doc-fix is the first step. Shall I make another issue for the fix in fileinput? I also wonder if the comment in codecs is not too negative - it is not easy to get it 100% right, but the method just above readlines is readline, which does somehow manage to read a single line. To me it seems like it would be better to makes readlines repeatedly call readline. It would also be inefficient, but still better than reading the whole (possibly infinite) stream in ...
msg210318 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2014-02-05 14:16
Yes, please make another issue for the enhancement request. That issue can start out agnostic as to whether it is codecs or fileinput that could use improvement, if you prefer.
msg210342 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-02-05 19:42
As far as the issue is in the codecs module, this is 2.7 only issue, because codecs.open is used only in 2.7. And I suppose that replacing codecs.open by io.open should fix this issue.
msg212132 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-02-24 19:06
Can anyone please test the patch on Windows?
msg212202 - (view)	Author: Zachary Ware (zach.ware) *	Date: 2014-02-25 19:32
> Can anyone please test the patch on Windows? It seems to work; memory usage is much lower with the patch than without using an 8 MB file. I don't notice any behavioral change with the patch; if there's anything specific to look for on that front, give me a test script in either English or Python :)
msg212216 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-02-25 21:36
Thank you Zachary. Here is a patch with a test. I'm not sure that it is successful on Windows.
msg212218 - (view)	Author: Zachary Ware (zach.ware) *	Date: 2014-02-25 21:50
The new test passes on Windows with the whole patch applied, but fails without the changes to fileinput.py. Is this change meant to fix behavior, or just the memory usage issue? Just for completeness, here's the failure output (with unpatched fileinput.py): ====================================================================== FAIL: test_modes (__main__.Test_hook_encoded) ---------------------------------------------------------------------- Traceback (most recent call last): File "P:\ath\to\2.7\cpython\lib\test\test_fileinput.py", line 235, in test_modes check('r', [u'A\n', u'B\r\n', u'C\rD\u20ac']) File "P:\ath\to\2.7\cpython\lib\test\test_fileinput.py", line 233, in check self.assertEqual(lines, expected_lines) AssertionError: Lists differ: [u'A\n', u'B\r\n', u'C\r', u'D... != [u'A\n', u'B\r\n', u'C\rD\u20a... First differing element 2: C D\u20ac First list contains 1 additional elements. First extra element 3: D\u20ac - [u'A\n', u'B\r\n', u'C\r', u'D\u20ac'] ? ----- + [u'A\n', u'B\r\n', u'C\rD\u20ac'] ----------------------------------------------------------------------
msg212224 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-02-25 22:01
What would be the failure output when comment out first 1, 2 or three checks?
msg212263 - (view)	Author: Zachary Ware (zach.ware) *	Date: 2014-02-26 15:30
Output attached.
msg212277 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-02-26 17:15
Same as on Linux (and differs from 3.x). Thank you Zachary. Here is corrected patch. Added also a test which tests that readline() doesn't read whole file.
msg212280 - (view)	Author: Zachary Ware (zach.ware) *	Date: 2014-02-26 18:00
New patch passes on Windows. Without the patch to fileinput.py, the new hook_encoded tests pass and the new test_readline test fails (as expected).
msg212290 - (view)	Author: Roundup Robot (python-dev)	Date: 2014-02-26 19:05
New changeset 1a1a9d6fb278 by Serhiy Storchaka in branch '2.7': Issue #20501: fileinput module no longer reads whole file into memory when using http://hg.python.org/cpython/rev/1a1a9d6fb278 New changeset b4a139713b3b by Serhiy Storchaka in branch '3.3': Added tests for issue #20501. http://hg.python.org/cpython/rev/b4a139713b3b New changeset 1a38fa1f701d by Serhiy Storchaka in branch 'default': Added tests for issue #20501. http://hg.python.org/cpython/rev/1a38fa1f701d
msg212292 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-02-26 19:13
Thank you Zachary that ran tests for me. Thank you Gunnar for your report.
msg212334 - (view)	Author: Vajrasky Kok (vajrasky) *	Date: 2014-02-27 02:28
From http://hg.python.org/cpython/rev/1a1a9d6fb278 + t1 = TESTFN + #t1 = writeTmp(1, ['A\nB\r\nC\rD+IKw-'], mode='wb') + self.addCleanup(safe_unlink, TESTFN) You left out the debugging statement. And this English statement: # Unlikely UTF-7 is locale encoding could be made clearer. Does it mean: "UTF-7 is an unlikely locale encoding"?
msg212357 - (view)	Author: Zachary Ware (zach.ware) *	Date: 2014-02-27 14:42
Agreed with Vajrasky, the 2.7 commit wasn't as clean as it could be (sorry I didn't get an actual review done before you committed, I posted my last test results in a bit of a hurry). The commented out line should either be removed, or used instead of the 'with' block above it. I would actually vote for using writeTmp, and make a new issue for making writeTmp use a context manager. Also, I think the new test_readline could be made a little clearer by putting + fi = FileInput(files=TESTFN, openhook=hook_encoded('ascii'), bufsize=8) + self.assertEqual(fi.readline(), u'A\n') + self.assertEqual(fi.readline(), u'B\r\n') + self.assertEqual(fi.readline(), u'C\r') inside a `try ... except UnicodeDecodeError: self.fail('Read to end of file')` block.
msg212474 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-02-28 21:26
My bad. I had cleaned patch before committing to 3.x, but forgot to revert and correct it in 2.7. Thank you Vajrasky an Zachary. > Does it mean: "UTF-7 is an unlikely locale encoding"? Sorry for my bad English. I meant that UTF-7 is used here because it is unlikely that it is locale encoding. Is "an" needed here? > I would actually vote for using writeTmp, and make a new issue for making writeTmp use a context manager. I think that explicit two-line code is better than calling superfluous function. We don't need generated filename, we write only one chunk of code, we use context manager, so the code is only two lines long. > inside a `try ... except UnicodeDecodeError: self.fail('Read to end of file')` block. May be, but in 2.7 a traceback doesn't contain info about original exception and this can hide necessary details in case of failure.
msg212479 - (view)	Author: Zachary Ware (zach.ware) *	Date: 2014-02-28 22:22
For the UTF-7 comment, I'd go with something like "locale encoding is probably not UTF-7" or "UTF-7 is a convenient, seldom used encoding", or something to that effect. About writeTmp vs. context manager: as you wish. It would still be good to convert writeTmp to use a context manager though, but that is a separate issue. Just cleaning up 2.7 to match 3.x is enough to make me happy :) For the explicit failure message, perhaps the best we can do in 2.7 is a comment saying that the most likely failure is a UnicodeDecodeError due to the entire file being read when it shouldn't have been. That can be deduced from the separate comments you already have there (and especially from the issue link), but I think it may be better to just come right out and say it. If exception chaining means the original exception is kept with a self.fail call in 3.x, I think 3.x should get the explicit failure message, though. I don't feel too strongly about it, though; I'll leave it to you to change or not as you like :)
msg212657 - (view)	Author: Roundup Robot (python-dev)	Date: 2014-03-03 19:21
New changeset d37f963394aa by Serhiy Storchaka in branch '2.7': Correct and improve comments in test_fileinput (closes #20501). http://hg.python.org/cpython/rev/d37f963394aa New changeset 204ef3bca9c8 by Serhiy Storchaka in branch '3.3': Correct comments and improve failure reports in test_fileinput (closes #20501). http://hg.python.org/cpython/rev/204ef3bca9c8 New changeset c47cc6351ce7 by Serhiy Storchaka in branch 'default': Correct comments and improve failure reports in test_fileinput (closes #20501). http://hg.python.org/cpython/rev/c47cc6351ce7
msg212658 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-03-03 19:24
Thank you Vajrasky and Zachary.

History
Date	User	Action	Args
2022-04-11 14:57:58	admin	set	github: 64700
2014-03-03 19:24:04	serhiy.storchaka	set	messages: + msg212658
2014-03-03 19:21:11	python-dev	set	messages: + msg212657
2014-02-28 22:22:39	zach.ware	set	messages: + msg212479
2014-02-28 21:26:12	serhiy.storchaka	set	messages: + msg212474
2014-02-27 14:42:52	zach.ware	set	messages: + msg212357
2014-02-27 02:28:14	vajrasky	set	nosy: + vajrasky messages: + msg212334
2014-02-26 19:13:50	serhiy.storchaka	set	assignee: docs@python -> serhiy.storchaka
2014-02-26 19:13:25	serhiy.storchaka	set	status: open -> closed resolution: fixed messages: + msg212292 stage: patch review -> resolved
2014-02-26 19:05:27	python-dev	set	nosy: + python-dev messages: + msg212290
2014-02-26 18:00:35	zach.ware	set	messages: + msg212280
2014-02-26 17:29:55	serhiy.storchaka	set	files: - fileinput_hook_encoded_2.patch
2014-02-26 17:24:32	serhiy.storchaka	set	files: - fileinput_hook_encoded.patch
2014-02-26 17:15:02	serhiy.storchaka	set	files: + fileinput_hook_encoded_3.patch messages: + msg212277
2014-02-26 15:30:19	zach.ware	set	files: + 20501.output messages: + msg212263
2014-02-25 22:01:10	serhiy.storchaka	set	messages: + msg212224
2014-02-25 21:50:13	zach.ware	set	messages: + msg212218
2014-02-25 21:36:09	serhiy.storchaka	set	files: + fileinput_hook_encoded_2.patch messages: + msg212216
2014-02-25 19:32:30	zach.ware	set	nosy: + zach.ware messages: + msg212202
2014-02-24 19:06:06	serhiy.storchaka	set	messages: + msg212132
2014-02-06 14:10:06	r.david.murray	link	issue20528 superseder
2014-02-05 19:42:46	serhiy.storchaka	set	files: + fileinput_hook_encoded.patch versions: - Python 3.3, Python 3.4 keywords: + patch nosy: + hynek, stutzbach, benjamin.peterson, pitrou, serhiy.storchaka messages: + msg210342 stage: patch review
2014-02-05 14:16:42	r.david.murray	set	messages: + msg210318
2014-02-05 14:12:48	gromgull	set	messages: + msg210317
2014-02-03 15:48:35	r.david.murray	set	versions: + Python 3.3, Python 3.4 nosy: + docs@python, r.david.murray messages: + msg210144 assignee: docs@python components: + Documentation
2014-02-03 14:55:51	gromgull	set	messages: + msg210133
2014-02-03 14:27:33	gromgull	create