
Author pjenvey
Recipients pjenvey
Date 2012-11-30.20:20:21
Message-id <1354306822.13.0.130768178209.issue16585@psf.upfronthosting.co.za>
Content
The surrogateescape error handler is documented as being "implemented by all standard Python codecs":

http://docs.python.org/3/library/codecs.html#codec-base-classes

However, it fails on encode with the multibyte codecs:

Python 3.2.3+ (3.2:eb999002916c, Oct 26 2012, 16:11:03) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> "\u30fb".encode('gb18030')
b'\x819\xa79'
>>> "\u30fb\udc80".encode('gb18030', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: encoding error handler must return (unicode, int) tuple

The problem is that multibytecodec.c requires the result returned by an error handler to always be unicode, but surrogateescape returns bytes here.

(surrogatepass similarly returns bytes, but it claims to be UTF-8 only.)
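
The mismatch can be reproduced in a few lines. The sketch below is illustrative only: it round-trips an undecodable byte through surrogateescape with codecs that accept the handler, then attempts the gb18030 call inside a try block, because the TypeError shown above comes from the 3.2 build in this report and other versions may behave differently. The helper name try_encode is hypothetical.

```python
def try_encode(text, encoding, errors="surrogateescape"):
    """Return the encoded bytes, or the exception raised while encoding."""
    try:
        return text.encode(encoding, errors)
    except (TypeError, UnicodeEncodeError) as exc:
        return exc

# surrogateescape maps the undecodable byte 0x80 to the lone surrogate U+DC80 ...
escaped = b"\x80".decode("ascii", "surrogateescape")
assert escaped == "\udc80"

# ... and the ASCII and UTF-8 encoders turn it back into the original byte.
assert try_encode(escaped, "ascii") == b"\x80"
assert try_encode("\u30fb" + escaped, "utf-8") == b"\xe3\x83\xbb\x80"

# The multibyte codec from the report: the outcome is either the encoded
# bytes or the exception object, depending on the interpreter version.
outcome = try_encode("\u30fb" + escaped, "gb18030")
print(type(outcome).__name__, outcome)
```

Returning the exception instead of letting it propagate keeps the comparison between codecs in one place.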

The error handler spec seems to imply that error handlers should always return unicode, because it says "The encoder will encode the replacement":

http://docs.python.org/3/library/codecs.html#codecs.register_error

But that is evidently not the case: some codecs special-case bytes results and copy them directly to the output, e.g.:

http://hg.python.org/cpython/file/ce3f0399ea33/Objects/unicodeobject.c#l6305
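
A custom handler shows the bytes special case from the caller's side. This is a minimal sketch, assuming an encoder (ASCII here) that copies a bytes replacement straight into the output. The function bytes_escape and the handler name "demo_bytes_escape" are made up for illustration and are not part of the standard library.

```python
import codecs

def bytes_escape(exc):
    """Hypothetical handler: turn U+DC80..U+DCFF back into raw bytes."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = bytearray()
    for ch in exc.object[exc.start:exc.end]:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            out.append(code - 0xDC00)  # recover the original byte value
        else:
            raise exc  # not an escaped byte, so re-raise the original error
    # Return bytes, not str: the encoder is expected to copy them verbatim.
    return bytes(out), exc.end

codecs.register_error("demo_bytes_escape", bytes_escape)

result = "a\udc80b".encode("ascii", "demo_bytes_escape")
print(result)
```

A codec that insists on a (unicode, int) tuple would reject this handler in the same way as the traceback above.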