Message 203039 - Python tracker

Author	lemburg
Recipients	doerwalter, ezio.melotti, lemburg, ncoghlan, serhiy.storchaka, vstinner
Date	2013-11-16.13:26:27
Message-id	<5287727F.9020702@egenix.com>
In-reply-to	<1384605873.95.0.519157543259.issue19619@psf.upfronthosting.co.za>

Content

On 16.11.2013 13:44, Nick Coghlan wrote:
> 
> Nick Coghlan added the comment:
> 
> Now that I understand Victor's proposal better, I actually agree with it, I just think the attribute names need to be "encodes_to" and "decodes_to".
> 
> With Victor's proposal, *input* validity checks (including type checks) would remain the responsibility of the codec itself. What the new attributes would enable is *output* type checks *without having to perform the encoding or decoding operation first*. Codecs will be free to leave these as None to retain the current behaviour of "try it and see".
> 
> The specific field names "input_type" and "output_type" aren't accurate, since the acceptable input types for encoding or decoding are likely to be more permissive than the specific output type for the other operation. Most of the binary codecs, for example, accept any bytes-like object as input, but produce bytes objects as output for both encoding and decoding. For Unicode encodings, encoding is strictly str->bytes, but decoding is generally the more permissive bytes-like object -> str.
> 
> I would still suggest providing the following helper function in the codecs module (the name has changed from my earlier suggestion and I now suggest implementing it in terms of Victor's suggestion with more appropriate field names):
> 
>     def is_text_encoding(name):
>         """Returns true if the named encoding is a Unicode text encoding"""
>         info = codecs.lookup(name)
>         return info.encodes_to is bytes and info.decodes_to is str
> 
> This approach covers all the current stdlib codecs:
> 
> - the text encodings encode to bytes and decode to str
> - the binary transforms encode to bytes and also decode to bytes
> - the lone text transform (rot_13) encodes and decodes to str
> 
> This approach also makes it possible for a type inference engine (like mypy) to potentially analyse codec use, and could be expanded in 3.5 to offer type checked binary and text transform APIs that filtered codecs appropriately according to their output types.
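
For reference, the three categories in the quoted list can be observed with the existing stdlib codecs by simply running them ("try it and see"). This is only a small illustration using `codecs.encode()` and `codecs.decode()`; the choice of `utf-8`, `hex_codec` and `rot_13` as representatives is mine, not part of the quoted proposal.

```python
import codecs

# Text encoding: str -> bytes when encoding, bytes -> str when decoding.
text_encoded = codecs.encode("abc", "utf-8")
text_decoded = codecs.decode(text_encoded, "utf-8")

# Binary transform: bytes -> bytes in both directions.
hex_encoded = codecs.encode(b"abc", "hex_codec")
hex_decoded = codecs.decode(hex_encoded, "hex_codec")

# Text transform: str -> str in both directions.
rot_encoded = codecs.encode("abc", "rot_13")
rot_decoded = codecs.decode(rot_encoded, "rot_13")

for label, value in [
    ("utf-8 encode", text_encoded),
    ("utf-8 decode", text_decoded),
    ("hex_codec encode", hex_encoded),
    ("hex_codec decode", hex_decoded),
    ("rot_13 encode", rot_encoded),
    ("rot_13 decode", rot_decoded),
]:
    print(label, type(value).__name__, repr(value))
```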

Nick, you are missing an important point: codecs can have any
number of input/output type combinations, e.g. they may
convert bytes -> str and str -> str (output type depends on
input type).

For this reason the simplistic approach with just one type
conversion will not work. Codecs will have to provide a
*mapping* of input to output types for each direction
(encoding and decoding) - either as a Python mapping or
as a list of mapping tuples.
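
A minimal sketch of what such a per-direction mapping could look like, to make the idea concrete. Everything here is hypothetical: `CODEC_TYPE_MAPS`, `output_type()` and the `example_mixed` codec are invented for illustration and are not part of the `codecs` module or of any proposal in this issue.

```python
# Hypothetical registry: for each codec and direction, map an accepted
# input type to the output type produced for that input.
CODEC_TYPE_MAPS = {
    "utf-8": {
        "encode": {str: bytes},
        "decode": {bytes: str, bytearray: str},
    },
    "rot_13": {
        "encode": {str: str},
        "decode": {str: str},
    },
    # Invented codec whose output type depends on the input type.
    "example_mixed": {
        "encode": {str: bytes},
        "decode": {bytes: str, str: str},
    },
}


def output_type(name, direction, input_type):
    """Return the declared output type, or None if nothing is declared."""
    type_map = CODEC_TYPE_MAPS.get(name, {}).get(direction, {})
    for accepted, produced in type_map.items():
        if issubclass(input_type, accepted):
            return produced
    return None


print(output_type("example_mixed", "decode", bytes))
print(output_type("example_mixed", "decode", str))
print(output_type("rot_13", "encode", bytes))
```

The same information could equally be stored as a list of `(input_type, output_type)` tuples per direction; a dict is used here only because it keeps the lookup short. A result of `None` plays the role of "try it and see".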

History
Date	User	Action	Args
2013-11-16 13:26:27	lemburg	set	recipients: + lemburg, doerwalter, ncoghlan, vstinner, ezio.melotti, serhiy.storchaka
2013-11-16 13:26:27	lemburg	link	issue19619 messages
2013-11-16 13:26:27	lemburg	create