
Author ncoghlan
Recipients belopolsky, benjamin.peterson, cben, eric.araujo, flox, georg.brandl, gvanrossum, lemburg, loewis, ncoghlan, ssbarnea, vstinner
Date 2011-10-19.22:09:42
Message-id <1319062183.68.0.0151270752774.issue7475@psf.upfronthosting.co.za>
Content
Some further comments after getting back up to speed with the actual status of this problem (i.e. that we had issues with the error checking and reporting in the original 3.2 commit).

1. I agree with the position that the codecs module itself is intended to be a type-neutral codec registry. It encodes and decodes things, but shouldn't actually care about the types involved. If that is currently not the case in 3.x, it needs to be fixed.

This type neutrality was blurred in 2.x by the fact that it only implemented str->str translations, and was further obscured by the coupling to the .encode() and .decode() convenience APIs. The fact that the type neutrality of the registry itself is currently broken in 3.x is a *regression*, not an improvement. (The convenience APIs, on the other hand, are definitely *not* type neutral, and aren't intended to be.)

2. To assist in producing nice error messages, and to allow restrictions to be enforced on type-specific convenience APIs, the CodecInfo objects should grow additional state as MAL suggests. To avoid redundancy (and inaccurate overspecification), my suggested colour for that particular bikeshed is:

Character encoding codec:
  .decoded_format = 'text'
  .encoded_format = 'binary'

Binary transform codec:
  .decoded_format = 'binary'
  .encoded_format = 'binary'

Text transform codec:
  .decoded_format = 'text'
  .encoded_format = 'text'

I suggest using the fuzzy format labels mainly due to the existence of the buffer API - most codec operations that consume binary data will accept anything that implements the buffer API, so referring specifically to 'bytes' in error messages would be inaccurate.
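
A minimal sketch of the three format combinations, and of why 'binary' is a better label than 'bytes'. The CodecFormats class and the three constant names are hypothetical stand-ins for the proposed CodecInfo attributes, not an existing API; only codecs.decode() is real.

```python
import codecs
from dataclasses import dataclass


@dataclass(frozen=True)
class CodecFormats:
    """Hypothetical holder for the two proposed CodecInfo attributes."""
    decoded_format: str  # 'text' or 'binary'
    encoded_format: str  # 'text' or 'binary'

    def describe(self) -> str:
        return f"{self.decoded_format} <-> {self.encoded_format}"


# The three kinds of codec listed above (names are illustrative only).
CHARACTER_ENCODING = CodecFormats("text", "binary")
BINARY_TRANSFORM = CodecFormats("binary", "binary")
TEXT_TRANSFORM = CodecFormats("text", "text")

# 'binary' is deliberately fuzzy: the ascii codec decodes any object that
# supports the buffer API, not only bytes instances.
decoded = [
    codecs.decode(data, "ascii")
    for data in (b"abc", bytearray(b"abc"), memoryview(b"abc"))
]
```

All three inputs decode to the same string, so an error message that named 'bytes' specifically would understate what the codec accepts.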

The convenience APIs can then emit errors like:

  'a'.encode('rot_13') ==>
  CodecLookupError: text <-> binary codec expected ('rot_13' is text <-> text)

  b'a'.decode('rot_13') ==>
  CodecLookupError: text <-> binary codec expected ('rot_13' is text <-> text)

  'a'.transform('bz2') ==>
  CodecLookupError: text <-> text codec expected ('bz2' is binary <-> binary)

  'a'.transform('ascii') ==>
  CodecLookupError: text <-> text codec expected ('ascii' is text <-> binary)

  b'a'.transform('ascii') ==>
  CodecLookupError: binary <-> binary codec expected ('ascii' is text <-> binary)
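
The checks behind those messages could be sketched roughly as follows. Everything here is an assumption for illustration: CodecLookupError is modelled as a LookupError subclass, the FORMATS table stands in for state that would live on CodecInfo, and check_codec is a hypothetical helper that a convenience API would call before using the codec.

```python
class CodecLookupError(LookupError):
    """Hypothetical exception, named after the messages above."""


# Hypothetical format table: name -> (decoded_format, encoded_format).
# In the proposal this information would come from the CodecInfo object.
FORMATS = {
    "ascii": ("text", "binary"),
    "rot_13": ("text", "text"),
    "bz2": ("binary", "binary"),
}

# What each convenience API would require of the codec it is given.
STR_ENCODE = BYTES_DECODE = ("text", "binary")
STR_TRANSFORM = ("text", "text")
BYTES_TRANSFORM = ("binary", "binary")


def check_codec(name, expected):
    """Return name if the codec has the expected formats, else raise."""
    actual = FORMATS[name]
    if actual != expected:
        raise CodecLookupError(
            "%s <-> %s codec expected (%r is %s <-> %s)"
            % (expected + (name,) + actual)
        )
    return name


def error_for(name, expected):
    """Helper for demonstration: return the error text, or None if accepted."""
    try:
        check_codec(name, expected)
    except CodecLookupError as exc:
        return str(exc)
    return None
```

With this sketch, error_for('rot_13', STR_ENCODE) produces the first message above, while error_for('ascii', STR_ENCODE) returns None because the codec is accepted.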

For backwards compatibility with 3.2, codecs that do not specify their formats should be treated as character encoding codecs (i.e. decoded format is 'text', encoded format is 'binary').
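
One way that fallback might look, as a sketch only: codec_formats is a hypothetical helper, and the SimpleNamespace object merely imitates a codec that declares the proposed attributes. The stdlib CodecInfo returned by codecs.lookup() does not define them, so it takes the default path.

```python
import codecs
from types import SimpleNamespace


def codec_formats(info):
    """Return (decoded_format, encoded_format), defaulting to text <-> binary."""
    return (
        getattr(info, "decoded_format", "text"),
        getattr(info, "encoded_format", "binary"),
    )


# A real CodecInfo with no format attributes: treated as a character encoding.
legacy = codecs.lookup("ascii")

# Stand-in for a codec that declares itself as a text transform.
declared = SimpleNamespace(decoded_format="text", encoded_format="text")

legacy_formats = codec_formats(legacy)
declared_formats = codec_formats(declared)
```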