Message176711
surrogateescape claims to be "implemented by all standard Python codecs"
http://docs.python.org/3/library/codecs.html#codec-base-classes
However, it fails with the multibyte codecs on encode:
Python 3.2.3+ (3.2:eb999002916c, Oct 26 2012, 16:11:03)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> "\u30fb".encode('gb18030')
b'\x819\xa79'
>>> "\u30fb\udc80".encode('gb18030', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: encoding error handler must return (unicode, int) tuple
The problem is that multibytecodec.c requires error handler results to always be unicode, whereas surrogateescape returns bytes here.
(surrogatepass similarly returns bytes, but it claims to be UTF-8 only.)
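To make the bytes return value visible, here is a minimal sketch that calls the handler directly through codecs.lookup_error(). The exception is constructed by hand purely for illustration; the encoding name and reason string passed to it are placeholders, not what multibytecodec.c would actually pass.

```python
import codecs

# Look up the built-in handler by name.
handler = codecs.lookup_error("surrogateescape")

# Hand-built exception describing the lone surrogate at index 0.
# "gb18030" and the reason text are illustrative placeholders.
exc = UnicodeEncodeError("gb18030", "\udc80", 0, 1, "illegal multibyte sequence")

replacement, resume_at = handler(exc)

# On encode, the replacement is a bytes object, not str.
print(type(replacement).__name__, replacement, resume_at)
```

So any codec that insists on a (unicode, int) tuple will reject this handler on encode.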
The error handler spec seems to imply that error handlers should always return unicode, because "The encoder will encode the replacement"
http://docs.python.org/3/library/codecs.html#codecs.register_error
but that is clearly not the case in practice: some codecs special-case bytes results and copy them directly to the output, e.g.:
http://hg.python.org/cpython/file/ce3f0399ea33/Objects/unicodeobject.c#l6305
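The special-casing can be observed from Python with a custom handler that returns bytes. This is only a sketch: the handler and its registered name "demo_bytes_replacement" are made up for this example, and it uses the ascii codec as one codec that accepts bytes results.

```python
import codecs

def bytes_replacement(exc):
    # Hypothetical handler: substitute a raw b"?" for the unencodable span.
    if isinstance(exc, UnicodeEncodeError):
        return (b"?", exc.end)
    raise exc

# The name is arbitrary and exists only for this example.
codecs.register_error("demo_bytes_replacement", bytes_replacement)

# The ascii codec copies the bytes result straight into the output.
encoded = "a\u30fbb".encode("ascii", "demo_bytes_replacement")
print(encoded)
```

If the codec only accepted unicode results, the same call would fail with the TypeError shown above instead.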
Date | User | Action | Args
2012-11-30 20:20:22 | pjenvey | set | recipients: + pjenvey
2012-11-30 20:20:22 | pjenvey | set | messageid: <1354306822.13.0.130768178209.issue16585@psf.upfronthosting.co.za>
2012-11-30 20:20:22 | pjenvey | link | issue16585 messages
2012-11-30 20:20:21 | pjenvey | create |