classification
Title: surrogateescape broken w/ multibytecodecs' encode
Type: behavior
Stage: resolved
Components: Library (Lib)
Versions: Python 3.4, Python 3.2, Python 3.3
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: benjamin.peterson, docs@python, doerwalter, ezio.melotti, haypo, lemburg, pitrou, pjenvey, python-dev, serhiy.storchaka
Priority: normal
Keywords:

Created on 2012-11-30 20:20 by pjenvey, last changed 2012-12-02 16:33 by python-dev. This issue is now closed.

Messages (5)
msg176711 - Author: Philip Jenvey (pjenvey) (Python committer) Date: 2012-11-30 20:20
surrogateescape claims to be "implemented by all standard Python codecs"

http://docs.python.org/3/library/codecs.html#codec-base-classes

However, it fails with the multibyte codecs on encode:

Python 3.2.3+ (3.2:eb999002916c, Oct 26 2012, 16:11:03) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> "\u30fb".encode('gb18030')
b'\x819\xa79'
>>> "\u30fb\udc80".encode('gb18030', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: encoding error handler must return (unicode, int) tuple

The problem is that multibytecodec.c requires the error handler's return value to always be unicode, but surrogateescape returns bytes here.
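
For comparison, here is a minimal sketch of what the same call should produce once the multibyte codecs accept bytes from the error handler. It assumes an interpreter that includes the fix for this issue; the helper name `encode_escaped` is made up for illustration.

```python
# Sketch only: assumes an interpreter that includes the fix for this issue.
# encode_escaped is a hypothetical helper, not part of the codecs module.

def encode_escaped(text, encoding):
    """Encode text, mapping lone surrogates U+DC80..U+DCFF back to raw bytes."""
    return text.encode(encoding, "surrogateescape")

# The encodable character on its own, as in the report.
plain = encode_escaped("\u30fb", "gb18030")

# The same character followed by an escaped byte 0x80 (U+DC80).
escaped = encode_escaped("\u30fb\udc80", "gb18030")

print(plain)    # b'\x819\xa79'
print(escaped)  # b'\x819\xa79\x80'
```

On an unfixed interpreter the second call raises the TypeError shown above instead.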

(surrogatepass also returns bytes, but it claims to be utf-8 only)

The error handler spec seems to imply that error handlers should always return unicode, because "The encoder will encode the replacement"

http://docs.python.org/3/library/codecs.html#codecs.register_error

but obviously that's not really the case: some codecs special-case bytes results and copy them directly to the output, e.g.:

http://hg.python.org/cpython/file/ce3f0399ea33/Objects/unicodeobject.c#l6305
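
A small sketch of that special-casing as seen from Python, using a custom handler that returns bytes. The handler and its registered name `example.raw_question_mark` are hypothetical; `codecs.register_error` is the real registration API.

```python
import codecs

def raw_question_mark(exc):
    """Hypothetical handler: emit one raw b'?' per unencodable character."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # Returning bytes: the codec copies them to the output without encoding.
    return (b"?" * (exc.end - exc.start), exc.end)

codecs.register_error("example.raw_question_mark", raw_question_mark)

# Built-in encoders such as ascii and utf-8 accept the bytes result.
ascii_result = "a\udc80b".encode("ascii", "example.raw_question_mark")
utf8_result = "a\udc80b".encode("utf-8", "example.raw_question_mark")

print(ascii_result)  # b'a?b'
print(utf8_result)   # b'a?b'
```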
msg176717 - Author: Benjamin Peterson (benjamin.peterson) (Python committer) Date: 2012-11-30 20:50
Codecs should be fixed to accept bytes from the error handler and the definition in the docs loosened. Returning bytes seems to be useful.
msg176780 - Author: Walter Dörwald (doerwalter) (Python committer) Date: 2012-12-02 10:38
And returning bytes is documented in PEP 383, as an extension to the PEP 293 machinery:

"""To convert non-decodable bytes, a new error handler ([2]) "surrogateescape" is introduced, which produces these surrogates. On encoding, the error handler converts the surrogate back to the corresponding byte. This error handler will be used in any API that receives or produces file names, command line arguments, or environment variables.

The error handler interface is extended to allow the encode error handler to return byte strings immediately, in addition to returning Unicode strings which then get encoded again (also see the discussion below)."""
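
The two return types described in that passage can be contrasted with a short sketch. It assumes an interpreter in which the multibyte codecs accept bytes from the handler (i.e. one that includes the fix for this issue); both handlers and their registered names are made up for illustration.

```python
import codecs

def replace_with_text(exc):
    """Hypothetical handler returning str: the codec encodes the replacement."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    return ("\u30fb", exc.end)

def replace_with_bytes(exc):
    """Hypothetical handler returning bytes: copied to the output verbatim."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    return (b"\x80", exc.end)

codecs.register_error("example.replace_with_text", replace_with_text)
codecs.register_error("example.replace_with_bytes", replace_with_bytes)

source = "a\udc80b"  # the lone surrogate cannot be encoded by gb18030

# str replacement: U+30FB goes through the gb18030 encoder again.
text_result = source.encode("gb18030", "example.replace_with_text")

# bytes replacement: the single byte 0x80 is emitted as-is.
bytes_result = source.encode("gb18030", "example.replace_with_bytes")

print(text_result)
print(bytes_result)
```

The bytes form is what lets surrogateescape restore a byte that is not valid in the target encoding, which a str replacement could never produce.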
msg176798 - Author: Roundup Robot (python-dev) Date: 2012-12-02 16:21
New changeset 5c88c72dec60 by Benjamin Peterson in branch '3.3':
support encoding error handlers that return bytes (closes #16585)
http://hg.python.org/cpython/rev/5c88c72dec60

New changeset 2181c37977d3 by Benjamin Peterson in branch 'default':
merge 3.3 (#16585)
http://hg.python.org/cpython/rev/2181c37977d3
msg176799 - Author: Roundup Robot (python-dev) Date: 2012-12-02 16:33
New changeset 777aabdff35a by Benjamin Peterson in branch '3.3':
document that encoding error handlers may return bytes (#16585)
http://hg.python.org/cpython/rev/777aabdff35a
History
Date User Action Args
2012-12-02 16:33:24 python-dev set messages: + msg176799
2012-12-02 16:21:14 python-dev set status: open -> closed
    nosy: + python-dev
    messages: + msg176798
    resolution: fixed
    stage: needs patch -> resolved
2012-12-02 12:04:01 pitrou set assignee: docs@python ->
    components: + Library (Lib), - Documentation, Interpreter Core, Unicode
2012-12-02 10:38:36 doerwalter set nosy: + doerwalter
    messages: + msg176780
2012-11-30 21:29:14 serhiy.storchaka set assignee: docs@python
    nosy: + docs@python
    components: + Documentation
    stage: needs patch
2012-11-30 20:50:11 benjamin.peterson set messages: + msg176717
2012-11-30 20:28:55 serhiy.storchaka set nosy: + lemburg, pitrou, haypo, benjamin.peterson, ezio.melotti, serhiy.storchaka
    type: behavior
    components: + Unicode
    versions: + Python 3.4
2012-11-30 20:20:22 pjenvey create