Message 145979 - Python tracker


Author	ncoghlan
Recipients	belopolsky, benjamin.peterson, cben, eric.araujo, flox, georg.brandl, gvanrossum, lemburg, loewis, ncoghlan, ssbarnea, vstinner
Date	2011-10-19.22:09:42
Message-id	<1319062183.68.0.0151270752774.issue7475@psf.upfronthosting.co.za>
In-reply-to

Content

Some further comments after getting back up to speed with the actual status of this problem (i.e. that we had issues with the error checking and reporting in the original 3.2 commit).

1. I agree with the position that the codecs module itself is intended to be a type neutral codec registry. It encodes and decodes things, but shouldn't actually care about the types involved. If that is currently not the case in 3.x, it needs to be fixed.

This type neutrality was blurred in 2.x by the fact that it only implemented str->str translations, and even further obscured by the coupling to the .encode() and .decode() convenience APIs. The fact that the type neutrality of the registry itself is currently broken in 3.x is a *regression*, not an improvement. (The convenience APIs, on the other hand, are definitely *not* type neutral, and aren't intended to be.)
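
As a rough illustration of what a type-neutral registry looks like from the caller's side, the sketch below uses only the module-level codecs.encode() and codecs.decode() functions. It assumes a Python version in which the 'rot_13' and 'hex_codec' transform codecs are available through the registry (true of current CPython releases); it is not a statement about how 3.2 behaved.

```python
import codecs

# Character encoding: text in, binary out.
ascii_encoded = codecs.encode("a", "ascii")

# Text transform: text in, text out.
rot13_encoded = codecs.encode("a", "rot_13")

# Binary transform: binary in, binary out, and back again.
hex_encoded = codecs.encode(b"a", "hex_codec")
hex_decoded = codecs.decode(hex_encoded, "hex_codec")

print(type(ascii_encoded).__name__, type(rot13_encoded).__name__,
      type(hex_encoded).__name__)
```

The same two functions serve all three kinds of codec; only the type-specific convenience methods need to care which kind they were handed.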

2. To assist in producing nice error messages, and to allow restrictions to be enforced on type-specific convenience APIs, the CodecInfo objects should grow additional state as MAL suggests. To avoid redundancy (and inaccurate overspecification), my suggested colour for that particular bikeshed is:

Character encoding codec:
  .decoded_format = 'text'
  .encoded_format = 'binary'

Binary transform codec:
  .decoded_format = 'binary'
  .encoded_format = 'binary'

Text transform codec:
  .decoded_format = 'text'
  .encoded_format = 'text'
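
A minimal sketch of how those labels might be carried, assuming a hypothetical CodecInfo subclass. The class name FormatAwareCodecInfo and the helper wrap() are invented for illustration; only the attribute names decoded_format and encoded_format come from the proposal above, and none of this is existing stdlib API.

```python
import codecs


class FormatAwareCodecInfo(codecs.CodecInfo):
    """Hypothetical CodecInfo that carries the two proposed format labels."""

    def __new__(cls, encode, decode, *, name,
                decoded_format="text", encoded_format="binary"):
        self = super().__new__(cls, encode, decode, name=name)
        self.decoded_format = decoded_format
        self.encoded_format = encoded_format
        return self


def wrap(codec_name, decoded_format, encoded_format):
    """Build a labelled copy of an already registered codec (illustrative helper)."""
    base = codecs.lookup(codec_name)
    return FormatAwareCodecInfo(base.encode, base.decode, name=base.name,
                                decoded_format=decoded_format,
                                encoded_format=encoded_format)


ascii_info = wrap("ascii", "text", "binary")        # character encoding codec
hex_info = wrap("hex_codec", "binary", "binary")    # binary transform codec
rot13_info = wrap("rot_13", "text", "text")         # text transform codec
```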

I suggest using the fuzzy format labels mainly due to the existence of the buffer API - most codec operations that consume binary data will accept anything that implements the buffer API, so referring specifically to 'bytes' in error messages would be inaccurate.
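
To make the buffer API point concrete, here is a small sketch that feeds the same payload to a binary transform codec as bytes, bytearray and memoryview. It assumes the 'hex_codec' codec is reachable through codecs.encode(), as in current CPython; the variable names are illustrative only.

```python
import codecs

payload = b"hi"

# Three different objects that all expose the same data via the buffer API.
buffers = [payload, bytearray(payload), memoryview(payload)]

# The codec accepts each of them, which is why 'binary' is a more accurate
# label than 'bytes' for the input side.
encoded = [codecs.encode(buf, "hex_codec") for buf in buffers]

for buf, result in zip(buffers, encoded):
    print(type(buf).__name__, "->", result)
```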

The convenience APIs can then emit errors like:

  'a'.encode('rot_13') ==>
  CodecLookupError: text <-> binary codec expected ('rot_13' is text <-> text)

  b'a'.decode('rot_13') ==>
  CodecLookupError: text <-> binary codec expected ('rot_13' is text <-> text)

  'a'.transform('bz2') ==>
  CodecLookupError: text <-> text codec expected ('bz2' is binary <-> binary)

  'a'.transform('ascii') ==>
  CodecLookupError: text <-> text codec expected ('ascii' is text <-> binary)

  b'a'.transform('ascii') ==>
  CodecLookupError: binary <-> binary codec expected ('ascii' is text <-> binary)

For backwards compatibility with 3.2, codecs that do not specify their formats should be treated as character encoding codecs (i.e. decoded format is 'text', encoded format is 'binary').
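
The check a convenience API would perform could be sketched roughly as follows. CodecLookupError is taken from the sample messages above and defined here as a hypothetical LookupError subclass; codec_formats() and check_codec() are invented helper names, and SimpleNamespace objects stand in for labelled CodecInfo instances. The getattr() defaults implement the backwards compatibility rule for codecs that do not declare their formats.

```python
import codecs
from types import SimpleNamespace


class CodecLookupError(LookupError):
    """Hypothetical exception, named after the sample messages."""


def codec_formats(info):
    """Return (decoded, encoded) labels, defaulting to a character encoding."""
    return (getattr(info, "decoded_format", "text"),
            getattr(info, "encoded_format", "binary"))


def check_codec(name, info, expected_decoded, expected_encoded):
    """Raise CodecLookupError unless the codec has the expected formats."""
    decoded, encoded = codec_formats(info)
    if (decoded, encoded) != (expected_decoded, expected_encoded):
        raise CodecLookupError(
            "{} <-> {} codec expected ({!r} is {} <-> {})".format(
                expected_decoded, expected_encoded, name, decoded, encoded))


# Stand-in for a text transform codec that declares its formats.
rot13 = SimpleNamespace(decoded_format="text", encoded_format="text")

# A codec looked up from the real registry declares nothing, so it is
# treated as text <-> binary.
unlabelled = codecs.lookup("ascii")

try:
    check_codec("rot_13", rot13, "text", "binary")   # what str.encode() would ask for
except CodecLookupError as exc:
    message = str(exc)
    print(message)

check_codec("ascii", unlabelled, "text", "binary")   # passes silently
```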

History
Date	User	Action	Args
2011-10-19 22:09:43	ncoghlan	set	recipients: + ncoghlan, lemburg, gvanrossum, loewis, georg.brandl, cben, belopolsky, vstinner, benjamin.peterson, eric.araujo, ssbarnea, flox
2011-10-19 22:09:43	ncoghlan	set	messageid: <1319062183.68.0.0151270752774.issue7475@psf.upfronthosting.co.za>
2011-10-19 22:09:43	ncoghlan	link	issue7475 messages
2011-10-19 22:09:42	ncoghlan	create