classification
Title: Add io.BinaryTransformWrapper and a "transform" parameter to open()
Type: enhancement Stage: needs patch
Components: Interpreter Core, IO, Library (Lib) Versions: Python 3.5
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: benjamin.peterson, ezio.melotti, hynek, lemburg, martin.panter, ncoghlan, pitrou, serhiy.storchaka, stutzbach, vstinner
Priority: normal Keywords:

Created on 2014-01-27 05:24 by ncoghlan, last changed 2014-02-10 22:27 by martin.panter.

Messages (12)
msg209398 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-01-27 05:24
Issue 20404 points out that io.TextIOWrapper can't be used with binary transform codecs like bz2 because the types are wrong.

By contrast, codecs.open() still defaults to working in binary mode, and just switches to returning a different type based on the specified encoding (exactly the kind of value-driven output type changes we're trying to eliminate from the core text model):

>>> import codecs
>>> print(codecs.open('hex.txt').read())
b'aabbccddeeff'
>>> print(codecs.open('hex.txt', encoding='hex').read())
b'\xaa\xbb\xcc\xdd\xee\xff'
>>> print(codecs.open('hex.txt', encoding='utf-8').read())
aabbccddeeff

While for 3.4, I plan to just extend the issue 19619 blacklist to also cover TextIOWrapper (and hence open()), it seems to me that there is a valid use case for bytes-to-bytes transform support directly in the IO stack.

A PEP for 3.5 could propose:

- providing a public API that allows codecs to be classified into at least the following groups ("binary" = memoryview-compatible data exporters, including both bytes and bytearray):
  - text encodings (decodes binary to str, encodes str to bytes)
  - binary transforms (decodes *and* encodes binary to bytes)
  - text transforms (decodes and encodes str to str)
  - hybrid transforms (acts as both a binary transform *and* as a text transform)
  - hybrid encodings (decodes binary and potentially str to str, encodes binary and str to bytes)
  - arbitrary encodings (decodes and encodes object to object, without fitting any of the above categories)

- adding io.BinaryTransformWrapper that applies binary transforms when reading and writing data (similar to the way TextIOWrapper applies text encodings)

- adding a "transform" parameter to open() that inserts BinaryTransformWrapper into the stack at the appropriate place (the PEP process would need to decide between supporting just a single transform per stream or multiple). In text mode, TextIOWrapper would be added to the stack after any binary transforms.

Optionally, the idea could also be extended to adding io.TextTransformWrapper and a "text_transform" parameter, but those seem somewhat less useful.
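To make the proposal concrete, here is a minimal read-only sketch of what such a wrapper could look like. The class name and constructor signature are hypothetical (the PEP would define the real API); only the codecs machinery it builds on exists today:

```python
import codecs
import io

class BinaryTransformReader:
    """Hypothetical read-only sketch of the proposed BinaryTransformWrapper."""
    def __init__(self, raw, transform):
        self._raw = raw
        # Binary transform codecs such as "hex_codec" provide incremental decoders
        self._decoder = codecs.getincrementaldecoder(transform)()

    def read(self, size=-1):
        data = self._raw.read(size)
        # Signal end-of-stream so the decoder can flush any pending state
        return self._decoder.decode(data, final=not data)

stream = BinaryTransformReader(io.BytesIO(b"aabbccddeeff"), "hex_codec")
stream.read()  # b'\xaa\xbb\xcc\xdd\xee\xff'
```

A write-side counterpart would hold a codecs incremental encoder and feed encoded chunks to the underlying raw stream.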
msg209404 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-01-27 06:51
I think this is redundant because codecs.StreamReader and codecs.StreamWriter already exist. They are buggy, but they are now less buggy than at the time when Victor wrote PEP 400, and can be improved further. TextIOWrapper serves an important special case, but for binary->binary and text->text transformations codecs.Stream* should be enough (after fixing some misbehaving codecs, of course).
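For reference, the existing codecs stream API mentioned here can already be used for binary->binary reads (hex_codec chosen only because it is easy to test):

```python
import codecs
import io

# codecs.getreader() returns the codec's StreamReader class; for a
# binary transform codec it reads bytes from the stream and yields bytes
reader = codecs.getreader("hex_codec")(io.BytesIO(b"616263"))
reader.read()  # b'abc'
```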
msg209406 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-01-27 07:25
That's certainly a reasonable position to take - they use the same object->object model that the codecs module in general provides, which means Python 3.x can already handle the relevant use cases.

Any such PEP would be about deciding whether or not binary transforms are a use case worth having additional infrastructure to support, or whether we just say that anyone wanting to deal with codecs other than text encodings should use the type-neutral codec APIs.

In the latter case, all that would be needed is a simple "is_text_encoding" flag, inspired by the private flag we already added to implement the non-text-encoding blacklist in 3.4.
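The private flag referred to here already exists on codecs.CodecInfo as of 3.4 (note the leading underscore; a public, supported version of it is what this message proposes):

```python
import codecs

# Text encodings report True; binary transforms such as hex report False
codecs.lookup("utf-8")._is_text_encoding      # True
codecs.lookup("hex_codec")._is_text_encoding  # False
```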
msg209426 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-01-27 09:55
That doesn't sound terribly useful indeed. The "hex" example is a toy example. Real-world examples would involve compression (zlib...) but then it is probably much more efficient to have a dedicated implementation (GzipFile) rather than blindly call zlib.compress() or zlib.decompress() at each round.
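The dedicated implementations referred to here compress incrementally as data flows through, rather than calling zlib.compress()/zlib.decompress() on each chunk, e.g.:

```python
import gzip
import io

buf = io.BytesIO()
# GzipFile feeds written data through an incremental zlib compressor
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(b"some payload")

buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode="rb") as f:
    f.read()  # b'some payload'
```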
msg209427 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-01-27 09:59
I agree with Antoine; I dislike the idea of BinaryTransformWrapper, it reminds me of the evil codecs.EncodedFile thing.

What are the use cases?
msg209430 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2014-01-27 10:26
On 27.01.2014 11:00, STINNER Victor wrote:
> 
> STINNER Victor added the comment:
> 
> I agree with Antoine; I dislike the idea of BinaryTransformWrapper, it reminds me of the evil codecs.EncodedFile thing.
>
> What are the use cases?

Ever used "recode" ?

The purpose of EncodedFile/StreamRecoder was to convert an externally
used encoding to a standard internal one - mainly to allow programs
that didn't want to use Unicode for processing to still benefit from
the codecs that come with Python.

E.g. the example at the end of codecs.py allows using Latin-1 within
the application, while talking to the console using UTF-8.
msg209431 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-01-27 10:29
Nobody talks to the console using hex_codec.
msg209433 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-01-27 11:05
I only used hex as the example because it was trivial to generate test data for.

The stackable streaming IO model is an extremely powerful one - the approach we already use in the io module has some similarities to the one Texas Instruments uses in DSP/BIOS (http://www.ti.com/tool/dspbios), and I know from experience how convenient that is. The model means you can push a lot of your data manipulation into your stream definitions, and keep all that data transformation logic out of your main application. (In my case, it let us mostly ignore the differences between a-law, u-law and ADPCM encoded audio, since we just built the IO streams differently depending on which one we were dealing with.)

However, relative to DSP/BIOS, our stream model is currently missing the "stackable" piece - it's difficult to plug additional wrappers into the stream, because we don't have either the "binary in, binary out" or the "text in, text out" component.

A well designed streaming codec should be able to sit in the pipeline providing transparent encryption whether you're piping to a file, to another process or to a socket. If you're handling audio or video data, then you would also be able to place your codecs directly in the stream pipeline, rather than needing to come up with your own custom data pipeline model.

This isn't a novel design overall - it's the way the signal processing world has been doing things for decades (I first learned this model when using DSP/BIOS more than a decade ago, and Linux STREAMS, which includes some similar concepts, is substantially older than that). The only novel concept here is the idea of offering this feature as part of Python 3's native io model.

DSP/BIOS and STREAMS also have some solid design concepts around using gather/scatter devices for stream multiplexing, but that's not related to codec handling improvements.
msg209434 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-01-27 11:11
Note that this is something that could (and should) start life as a module on PyPI, which would also provide cross version support.
msg210069 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-02-03 01:24
> Ever used "recode" ?

No, what is it? I once used iconv for short tests, but I never needed it to convert a real document.

> E.g. the example at the end of codecs.py allows using Latin-1 within
> the application, while talking to the console using UTF-8.

That no longer makes sense in Python 3; strings are now stored as Unicode within the application.
msg210114 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2014-02-03 11:18
On 03.02.2014 02:24, STINNER Victor wrote:
> 
> STINNER Victor added the comment:
> 
>> Ever used "recode" ?
> 
> No, what is it? I once used iconv for short tests, but I never required iconv to convert a real document.

It's a command line tool to convert documents in various encodings
to other encodings:

http://recode.progiciels-bpi.ca/index.html
https://github.com/pinard/Recode

It's similar to iconv.
msg210115 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-02-03 11:34
We already have stackable pieces for gzip, bz2 and lzma compressed streams -- GzipFile, BZ2File and LZMAFile. They are more powerful and more efficient than the generic codecs.StreamReader/codecs.StreamWriter (and note that most binary codecs just don't work correctly with codecs streams).
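Stacking those existing pieces under TextIOWrapper already gives the layered behavior discussed in this thread (this is exactly what gzip.open() does in text mode):

```python
import gzip
import io

buf = io.BytesIO()
# TextIOWrapper (text layer) stacked on GzipFile (binary transform layer)
with io.TextIOWrapper(gzip.GzipFile(fileobj=buf, mode="wb"),
                      encoding="utf-8") as f:
    f.write("compressed text")

with io.TextIOWrapper(gzip.GzipFile(fileobj=io.BytesIO(buf.getvalue())),
                      encoding="utf-8") as f:
    f.read()  # 'compressed text'
```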
History
Date User Action Args
2014-02-10 22:27:34martin.pantersetnosy: + martin.panter
2014-02-03 11:34:08serhiy.storchakasetmessages: + msg210115
2014-02-03 11:18:18lemburgsetmessages: + msg210114
2014-02-03 01:24:22vstinnersetmessages: + msg210069
2014-01-27 11:11:21ncoghlansetmessages: + msg209434
2014-01-27 11:05:14ncoghlansetmessages: + msg209433
2014-01-27 10:29:54serhiy.storchakasetmessages: + msg209431
2014-01-27 10:26:11lemburgsetmessages: + msg209430
2014-01-27 10:00:00vstinnersetmessages: + msg209427
2014-01-27 09:55:56pitrousetmessages: + msg209426
2014-01-27 07:25:06ncoghlansetmessages: + msg209406
2014-01-27 06:51:10serhiy.storchakasetnosy: lemburg, ncoghlan, pitrou, vstinner, benjamin.peterson, stutzbach, ezio.melotti, hynek, serhiy.storchaka
messages: + msg209404
2014-01-27 05:24:39ncoghlancreate