classification
Title: Unify buffered readers
Type: enhancement Stage: patch review
Components: IO, Library (Lib) Versions: Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: alanmcintyre, benjamin.peterson, martin.panter, nadeem.vawda, pitrou, serhiy.storchaka, stutzbach
Priority: normal Keywords: patch

Created on 2013-09-19 14:22 by serhiy.storchaka, last changed 2015-04-18 08:22 by martin.panter.

Files
File name Uploaded Description Edit
buffered_reader.diff serhiy.storchaka, 2013-09-19 14:22 review
read_bench.py serhiy.storchaka, 2013-09-19 14:33 Benchmark script
read_bench_cmp serhiy.storchaka, 2013-09-19 14:34 Benchmark results
Messages (9)
msg198075 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-19 14:22
There are some classes in the gzip, bz2, lzma, and zipfile modules which implement a buffered reader interface. They read chunks of data from the underlying file object, decompress them, save them in an internal buffer, and provide common methods to read from this buffer. Maintaining duplicated code is cumbersome and error prone. The proposed preliminary patch moves the common code into a new private class _io2._BufferedReaderMixin. If the proposal is accepted in general, I'm going to write a C version and move it into the io module, and perhaps even merge it with io.BufferedIOBase.

The idea is that all buffered reading functions (read(), read1(), readline(), peek(), etc.) can be expressed in terms of one function which returns raw unbuffered data. Subclasses need define only one such function and get the whole buffered reader interface. In the case of the classes mentioned above, this function reads and decompresses a chunk of data from the underlying file. The HTTPResponse class would perhaps benefit too (issue19009).
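A minimal sketch of the idea (class and method names here are illustrative, not taken from the patch): the subclass defines a single _read_raw() returning the next unbuffered chunk, and the mixin builds read() and readline() on top of it.

```python
class BufferedReaderMixin:
    """Sketch: subclasses define only _read_raw(), which returns the
    next chunk of raw data (b'' at EOF), and inherit the buffered
    reading methods built on top of it."""

    def __init__(self):
        self._buffer = b''

    def _read_raw(self):
        raise NotImplementedError

    def _fill(self):
        # Append one raw chunk to the buffer; return False at EOF.
        chunk = self._read_raw()
        if not chunk:
            return False
        self._buffer += chunk
        return True

    def read(self, size=-1):
        if size < 0:
            while self._fill():
                pass
            data, self._buffer = self._buffer, b''
            return data
        while len(self._buffer) < size and self._fill():
            pass
        data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data

    def readline(self):
        while b'\n' not in self._buffer and self._fill():
            pass
        i = self._buffer.find(b'\n')
        end = len(self._buffer) if i < 0 else i + 1
        line, self._buffer = self._buffer[:end], self._buffer[end:]
        return line


class ChunkReader(BufferedReaderMixin):
    """Toy subclass: yields fixed chunks from an in-memory list,
    standing in for 'read and decompress one chunk from the file'."""

    def __init__(self, chunks):
        super().__init__()
        self._chunks = iter(chunks)

    def _read_raw(self):
        return next(self._chunks, b'')
```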
msg198077 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-19 14:33
Here are benchmark script and its results.
msg198078 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-19 14:47
See issue12053 for a more flexible primitive.
msg198082 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-19 15:26
This primitive doesn't fit the case of compressed streams well. A chunk of compressed data read from the underlying file object can decompress to unpredictably large data. We can't limit the size of the buffer.

Another point is that the buffer interface is not very appropriate for a Python implementation. And we want to leave as much Python code in gzip, bz2, lzma and zipfile as possible. Copying from bytes into a buffer and back would just waste resources.
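The blow-up described here is easy to demonstrate with zlib (a minimal illustration, not part of the patch):

```python
import zlib

# A megabyte of zeros compresses to roughly a kilobyte; decompressing
# that one small chunk reproduces the full megabyte, so the size of the
# decompressed output cannot be bounded from the size of the chunk read.
payload = zlib.compress(b'\0' * 10**6)
data = zlib.decompress(payload)
```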
msg198143 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-20 12:32
If you want this, I think it should be somehow folded into existing classes (for example BufferedIOBase). Yet another implementation of readline() isn't really a good idea.
msg233747 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-01-09 12:22
For what it’s worth, it would be better if compressed streams did limit the amount of data they decompressed, so that they are not susceptible to decompression bombs; see Issue 15955. But having a flexible-sized buffer could be useful in other cases.

I haven’t looked closely at the code, but I wonder if there is much difference from the existing BufferedReader. Perhaps just that the underlying raw stream in this case can deliver data in arbitrary-sized chunks, but BufferedReader expects its raw stream to deliver data in limited-sized chunks?

If you exposed the buffer it could be useful to do many things more efficiently:

* readline() with custom newline or end-of-record codes, solving Issue 1152248, Issue 17083
* scan the buffer using string operations or regular expressions etc, e.g. to skip whitespace, read a run of unescaped symbols
* tentatively read data to see if a keyword is present, but roll back if the data doesn’t match the keyword
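The tentative-read case in the last bullet can already be approximated with peek() on a BufferedReader (a hedged sketch; the stream contents and keyword are invented):

```python
import io

raw = io.BytesIO(b'BEGIN data...')
buf = io.BufferedReader(raw)

# peek() returns buffered bytes without advancing the stream position;
# it may return more than asked for, so slice to the length needed.
head = buf.peek(5)[:5]
if head == b'BEGIN':
    buf.read(5)  # keyword matched: now actually consume it
# on a mismatch nothing was consumed, so parsing can try another branch
```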
msg233806 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-01-10 08:57
Parts of the patch here actually do the same thing as my LZMAFile patch for Issue 15955. I wish I had looked at the patch earlier! The difference is I used a proposed max_length parameter for the decompressor rather than unlimited decompression, and I used the existing BufferedReader class rather than implementing a new custom one.

The changes for the “gzip” module could probably be merged with my GzipFile patch at Issue 15955 and made to use BufferedReader.
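For reference, the max_length approach from Issue 15955 looks roughly like this with the LZMADecompressor API that was eventually added in Python 3.5 (the sample data is invented):

```python
import lzma

payload = lzma.compress(b'x' * 100000)

d = lzma.LZMADecompressor()
out = []
data = payload
while not d.eof:
    # max_length caps each output chunk, so a small compressed input
    # cannot force unbounded memory use; unconsumed input is retained
    # internally and drained by passing b'' on subsequent calls.
    chunk = d.decompress(data, max_length=8192)
    out.append(chunk)
    data = b''
result = b''.join(out)
```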
msg240695 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-04-13 18:18
Is it still relevant, now that #23529 has been committed?
msg241403 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-04-18 08:22
The LZMA, gzip and bzip modules now all use BufferedReader, so Serhiy’s patch is no longer relevant for them. Serhiy’s patch also changed the zipfile module, which may be still relevant. On the other hand, perhaps it would be more ideal to use BufferedReader for the zipfile module as well.
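The shape the modules converged on can be sketched like this: expose the decompressed stream as a RawIOBase and let io.BufferedReader supply everything else (ZeroSource is a toy stand-in for the real decompressing reader, not the actual helper class):

```python
import io

class ZeroSource(io.RawIOBase):
    """Toy raw stream standing in for a decompressing reader: it
    delivers a fixed number of b'A' bytes through readinto()."""

    def __init__(self, size):
        self._left = size

    def readable(self):
        return True

    def readinto(self, b):
        # Fill the caller's buffer with up to len(b) bytes; 0 means EOF.
        n = min(len(b), self._left)
        b[:n] = b'A' * n
        self._left -= n
        return n

# BufferedReader then provides read(), read1(), readline(), peek(),
# etc. for free, with no per-module reimplementation.
reader = io.BufferedReader(ZeroSource(10))
```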
History
Date User Action Args
2015-04-18 08:22:44martin.pantersetmessages: + msg241403
2015-04-13 18:18:54pitrousetmessages: + msg240695
2015-01-10 08:57:31martin.pantersetmessages: + msg233806
2015-01-09 12:22:56martin.pantersetmessages: + msg233747
2014-12-21 03:08:24martin.pantersetnosy: + martin.panter
2013-09-20 12:32:24pitrousetmessages: + msg198143
2013-09-19 15:46:33serhiy.storchakasetmessages: - msg198083
2013-09-19 15:26:50serhiy.storchakasetmessages: + msg198083
2013-09-19 15:26:00serhiy.storchakasetmessages: + msg198082
2013-09-19 14:47:25pitrousetmessages: + msg198078
2013-09-19 14:34:26serhiy.storchakasetfiles: + read_bench_cmp
2013-09-19 14:33:17serhiy.storchakasetfiles: + read_bench.py
messages: + msg198077
2013-09-19 14:22:09serhiy.storchakacreate