classification
Title: Unify buffered readers
Type: enhancement Stage: patch review
Components: IO, Library (Lib) Versions: Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: alanmcintyre, benjamin.peterson, martin.panter, nadeem.vawda, pitrou, serhiy.storchaka, stutzbach
Priority: normal Keywords: patch

Created on 2013-09-19 14:22 by serhiy.storchaka, last changed 2015-04-18 08:22 by martin.panter.

Files
File name Uploaded Description Edit
buffered_reader.diff serhiy.storchaka, 2013-09-19 14:22 review
read_bench.py serhiy.storchaka, 2013-09-19 14:33 Benchmark script
read_bench_cmp serhiy.storchaka, 2013-09-19 14:34 Benchmark results
Messages (9)
msg198075 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-19 14:22
There are some classes in the gzip, bz2, lzma, and zipfile modules which implement a buffered reader interface. They read chunks of data from the underlying file object, decompress them, save them in an internal buffer, and provide common methods to read from this buffer. Maintaining duplicated code is cumbersome and error prone. The proposed preliminary patch moves the common code into a new private class _io2._BufferedReaderMixin. If the proposal is accepted in general, I'm going to write a C version and move it into the io module, and perhaps even merge it with io.BufferedIOBase.

The idea is that all buffered reading functions (read(), read1(), readline(), peek(), etc.) can be expressed in terms of one function which returns raw unbuffered data. Subclasses need define only one such function and get the whole buffered reader interface. In the case of the classes mentioned above, this function reads and decompresses a chunk of data from the underlying file. The HTTPResponse class would perhaps benefit too (issue19009).
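A minimal sketch of the idea (class and method names here are illustrative, not taken from the patch): the subclass defines a single _read_raw() returning the next unbuffered chunk, and the mixin builds read() and readline() on top of it.

```python
class BufferedReaderMixin:
    """Sketch: subclasses define only _read_raw(), which returns the
    next chunk of raw data (b'' at EOF), and inherit the buffered
    reading methods built on top of it."""

    def __init__(self):
        self._buffer = b''

    def _read_raw(self):
        raise NotImplementedError

    def _fill(self):
        # Append one raw chunk to the buffer; return False at EOF.
        chunk = self._read_raw()
        if not chunk:
            return False
        self._buffer += chunk
        return True

    def read(self, size=-1):
        if size < 0:
            while self._fill():
                pass
            data, self._buffer = self._buffer, b''
            return data
        while len(self._buffer) < size and self._fill():
            pass
        data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data

    def readline(self):
        while b'\n' not in self._buffer and self._fill():
            pass
        i = self._buffer.find(b'\n')
        end = len(self._buffer) if i < 0 else i + 1
        line, self._buffer = self._buffer[:end], self._buffer[end:]
        return line


class ChunkReader(BufferedReaderMixin):
    """Toy subclass: yields fixed chunks from an in-memory list,
    standing in for 'read and decompress one chunk from the file'."""

    def __init__(self, chunks):
        super().__init__()
        self._chunks = iter(chunks)

    def _read_raw(self):
        return next(self._chunks, b'')
```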
msg198077 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-19 14:33
Here are benchmark script and its results.
msg198078 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-19 14:47
See issue12053 for a more flexible primitive.
msg198082 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-19 15:26
This primitive doesn't fit the case of compressed streams well. A chunk of compressed data read from the underlying file object can decompress to unpredictably large data. We can't limit the size of the buffer.

Another point is that the buffer interface is not very appropriate for a Python implementation. And we want to leave as much Python code in gzip, bz2, lzma and zipfile as possible. Copying from bytes into a buffer and back would just waste resources.
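The blow-up described here is easy to demonstrate with zlib (a minimal illustration, not part of the patch):

```python
import zlib

# A megabyte of zeros compresses to roughly a kilobyte; decompressing
# that one small chunk reproduces the full megabyte, so the size of the
# decompressed output cannot be bounded from the size of the chunk read.
payload = zlib.compress(b'\0' * 10**6)
data = zlib.decompress(payload)
```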
msg198143 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-20 12:32
If you want this, I think it should be somehow folded into existing classes (for example BufferedIOBase). Yet another implementation of readline() isn't really a good idea.
msg233747 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-01-09 12:22
For what it’s worth, it would be better if compressed streams did limit the amount of data they decompressed, so that they are not susceptible to decompression bombs; see Issue 15955. But having a flexible-sized buffer could be useful in other cases.

I haven’t looked closely at the code, but I wonder if there is much difference from the existing BufferedReader. Perhaps just that the underlying raw stream in this case can deliver data in arbitrary-sized chunks, but BufferedReader expects its raw stream to deliver data in limited-sized chunks?

If you exposed the buffer it could be useful to do many things more efficiently:

* readline() with custom newline or end-of-record codes, solving Issue 1152248, Issue 17083
* scan the buffer using string operations or regular expressions etc, e.g. to skip whitespace, read a run of unescaped symbols
* tentatively read data to see if a keyword is present, but roll back if the data doesn’t match the keyword
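The tentative-read case in the last bullet can already be approximated with peek() on a BufferedReader (a hedged sketch; the stream contents and keyword are invented):

```python
import io

raw = io.BytesIO(b'BEGIN data...')
buf = io.BufferedReader(raw)

# peek() returns buffered bytes without advancing the stream position;
# it may return more than asked for, so slice to the length needed.
head = buf.peek(5)[:5]
if head == b'BEGIN':
    buf.read(5)  # keyword matched: now actually consume it
# on a mismatch nothing was consumed, so parsing can try another branch
```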
msg233806 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-01-10 08:57
Parts of the patch here actually do the same thing as my LZMAFile patch for Issue 15955. I wish I had looked at the patch earlier! The difference is I used a proposed max_length parameter for the decompressor rather than unlimited decompression, and I used the existing BufferedReader class rather than implementing a new custom one.

The changes for the “gzip” module could probably be merged with my GzipFile patch at Issue 15955 and made to use BufferedReader.
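For reference, the max_length approach from Issue 15955 looks roughly like this with the LZMADecompressor API that was eventually added in Python 3.5 (the sample data is invented):

```python
import lzma

payload = lzma.compress(b'x' * 100000)

d = lzma.LZMADecompressor()
out = []
data = payload
while not d.eof:
    # max_length caps each output chunk, so a small compressed input
    # cannot force unbounded memory use; unconsumed input is retained
    # internally and drained by passing b'' on subsequent calls.
    chunk = d.decompress(data, max_length=8192)
    out.append(chunk)
    data = b''
result = b''.join(out)
```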
msg240695 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-04-13 18:18
Is it still relevant, now that #23529 has been committed?
msg241403 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-04-18 08:22
The LZMA, gzip and bzip modules now all use BufferedReader, so Serhiy’s patch is no longer relevant for them. Serhiy’s patch also changed the zipfile module, which may be still relevant. On the other hand, perhaps it would be more ideal to use BufferedReader for the zipfile module as well.
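The shape the modules converged on can be sketched like this: expose the decompressed stream as a RawIOBase and let io.BufferedReader supply everything else (ZeroSource is a toy stand-in for the real decompressing reader, not the actual helper class):

```python
import io

class ZeroSource(io.RawIOBase):
    """Toy raw stream standing in for a decompressing reader: it
    delivers a fixed number of b'A' bytes through readinto()."""

    def __init__(self, size):
        self._left = size

    def readable(self):
        return True

    def readinto(self, b):
        # Fill the caller's buffer with up to len(b) bytes; 0 means EOF.
        n = min(len(b), self._left)
        b[:n] = b'A' * n
        self._left -= n
        return n

# BufferedReader then provides read(), read1(), readline(), peek(),
# etc. for free, with no per-module reimplementation.
reader = io.BufferedReader(ZeroSource(10))
```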
History
Date User Action Args
2015-04-18 08:22:44martin.pantersetmessages: + msg241403
2015-04-13 18:18:54pitrousetmessages: + msg240695
2015-01-10 08:57:31martin.pantersetmessages: + msg233806
2015-01-09 12:22:56martin.pantersetmessages: + msg233747
2014-12-21 03:08:24martin.pantersetnosy: + martin.panter
2013-09-20 12:32:24pitrousetmessages: + msg198143
2013-09-19 15:46:33serhiy.storchakasetmessages: - msg198083
2013-09-19 15:26:50serhiy.storchakasetmessages: + msg198083
2013-09-19 15:26:00serhiy.storchakasetmessages: + msg198082
2013-09-19 14:47:25pitrousetmessages: + msg198078
2013-09-19 14:34:26serhiy.storchakasetfiles: + read_bench_cmp
2013-09-19 14:33:17serhiy.storchakasetfiles: + read_bench.py
messages: + msg198077
2013-09-19 14:22:09serhiy.storchakacreate