classification
Title: utf-7 inconsistent with surrogates
Type: behavior
Stage: resolved
Components: Interpreter Core, Unicode
Versions: Python 3.2, Python 3.3, Python 2.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: ezio.melotti, loewis, petri.lehtinen, pitrou, python-dev
Priority: normal
Keywords: patch

Created on 2011-11-03 12:13 by pitrou, last changed 2011-11-15 00:58 by pitrou. This issue is now closed.

Files
File name Uploaded Description
utf7.patch pitrou, 2011-11-14 21:33
utf7-nogit.patch pitrou, 2011-11-14 23:42
Messages (11)
msg146919 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-11-03 12:13
The utf-7 codec happily encodes lone surrogates, but it won't decode them:

>>> "\ud801".encode("utf-7")
b'+2AE-'
>>> "\ud801\ud801".encode("utf-7")
b'+2AHYAQ-'
>>> "\ud801".encode("utf-7").decode("utf-7")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/cpython/default/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-4: second surrogate missing at end of shift sequence
>>> "\ud801\ud801".encode("utf-7").decode("utf-7")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/cpython/default/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-6: second surrogate missing


I don't know which behaviour is better, but round-tripping is certainly a desirable property of any codec.
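The round-trip property can be checked with a small helper. This is only a sketch: `round_trips` is a hypothetical name, not part of the codecs module, and the result for the lone-surrogate samples depends on the interpreter. On an interpreter with the behaviour reported above they fail to decode, so the helper returns False for them.

```python
def round_trips(text, encoding="utf-7"):
    """Return True if encode followed by decode gives back `text` unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        # Affected interpreters raise UnicodeDecodeError for lone surrogates.
        return False


# A lone surrogate, two lone surrogates, a non-BMP character, plain BMP text.
samples = ["\ud801", "\ud801\ud801", "\U000104A0", "a\u20acb"]

for sample in samples:
    print(ascii(sample), round_trips(sample))
```

Catching `UnicodeError` rather than letting it propagate keeps the check usable as a simple predicate over many samples.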
msg146951 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-11-03 17:28
RFC 2152 talks about encoding 16-bit Unicode, and clarifies:

 Surrogate pairs (UTF-16) are converted by treating each half 
 of the pair as a separate 16 bit quantity (i.e., no special
 treatment).

So lone surrogates clearly should be supported.

This text could be interpreted as saying that decoding surrogate pairs should also keep them (rather than combining them). However, the RFC also assumes that the decoded form will use 16-bit code units; for Python, I think we should continue combining surrogate pairs on decoding UTF-7 when we find them.
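The "no special treatment" reading can be illustrated by building a shift sequence by hand. This is a minimal sketch, not the codec's implementation: `utf7_shift_sequence` is a hypothetical helper, it always emits a shifted sequence (even for characters that UTF-7 would normally write directly), and it assumes a Python 3 where the `surrogatepass` error handler is available for the UTF-16 codecs.

```python
import base64


def utf7_shift_sequence(text):
    """Encode `text` as one '+...-' shift sequence, one 16-bit unit at a time."""
    # "surrogatepass" lets lone surrogates through the UTF-16 encoder, so each
    # half of a pair (or a lone half) is just another 16-bit quantity.
    units = text.encode("utf-16-be", "surrogatepass")
    # The modified Base64 of RFC 2152 drops the trailing '=' padding.
    payload = base64.b64encode(units).rstrip(b"=")
    return b"+" + payload + b"-"


print(utf7_shift_sequence("\ud801"))        # lone surrogate
print(utf7_shift_sequence("\ud801\ud801"))  # two lone high surrogates
print(utf7_shift_sequence("\U000104A0"))    # non-BMP: encoded as two halves
```

For the two lone-surrogate inputs this reproduces the byte strings shown in the original report, which suggests the encoder side already follows the RFC's wording.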
msg147457 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-11-12 01:55
FWIW Wikipedia says "Other characters must be encoded in UTF-16 (hence U+10000 and higher would be encoded into surrogates) and then in modified Base64."

So one possible interpretation is that when encoding a non-BMP char, it should first be converted into a surrogate pair, and then each of the surrogates should be encoded just like any other 16-bit code unit.
While decoding, it seems reasonable to do the opposite, i.e. recombine the surrogate pair.

The RFC doesn't say anything about lone surrogates, but I think that the fact that surrogates are used internally doesn't necessarily mean that the codec should be able to encode/decode them when they are not paired.  The other UTF-* codecs reject them, but that's because it is explicitly forbidden by their respective standards.

So I'm +1 on recombining them while decoding, and ±0 on allowing lone surrogates.
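For reference, the split and recombine steps discussed here are the standard UTF-16 surrogate arithmetic. The sketch below uses hypothetical helper names (`split_surrogates`, `join_surrogates`) and does no validation of its inputs.

```python
def split_surrogates(code_point):
    """Split a non-BMP code point (>= 0x10000) into (high, low) surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def join_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = split_surrogates(0x104A0)
print(hex(high), hex(low))
print(hex(join_surrogates(high, low)))
```

A decoder that recombines pairs would apply `join_surrogates` only when a high surrogate is immediately followed by a low one; anything else is a lone surrogate and has to be either passed through or rejected.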
msg147635 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-11-14 21:33
Here is a patch.
msg147639 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-11-14 23:26
Can you please regenerate the patch against default's head?
msg147640 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-11-14 23:29
It's a patch for 3.2.
msg147643 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-11-14 23:32
Please don't use git-style diffs then, since otherwise the review tool can't figure out what the patch applies to (and neither could I figure that out).
msg147646 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-11-14 23:42
Here is a non-git diff then :)
msg147647 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-11-15 00:16
LGTM.
msg147648 - (view) Author: Roundup Robot (python-dev) Date: 2011-11-15 00:55
New changeset ddfcb0de564f by Antoine Pitrou in branch '3.2':
Issue #13333: The UTF-7 decoder now accepts lone surrogates
http://hg.python.org/cpython/rev/ddfcb0de564f

New changeset 250091e60f28 by Antoine Pitrou in branch 'default':
Issue #13333: The UTF-7 decoder now accepts lone surrogates
http://hg.python.org/cpython/rev/250091e60f28

New changeset 050772822bde by Antoine Pitrou in branch '2.7':
Issue #13333: The UTF-7 decoder now accepts lone surrogates
http://hg.python.org/cpython/rev/050772822bde
msg147649 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-11-15 00:58
I made a little fix to the patch for wide unicode builds and then committed it. Thank you!
History
Date User Action Args
2011-11-15 00:58:41 pitrou set status: open -> closed; resolution: fixed; messages: + msg147649; stage: patch review -> resolved
2011-11-15 00:55:06 python-dev set nosy: + python-dev; messages: + msg147648
2011-11-15 00:16:36 loewis set messages: + msg147647
2011-11-14 23:42:40 pitrou set files: + utf7-nogit.patch; messages: + msg147646
2011-11-14 23:32:57 loewis set messages: + msg147643
2011-11-14 23:29:37 pitrou set messages: + msg147640
2011-11-14 23:26:28 loewis set messages: + msg147639
2011-11-14 21:33:26 pitrou set files: + utf7.patch; keywords: + patch; messages: + msg147635; stage: patch review
2011-11-12 01:55:54 ezio.melotti set messages: + msg147457
2011-11-03 17:28:59 loewis set messages: + msg146951
2011-11-03 12:24:00 petri.lehtinen set nosy: + petri.lehtinen
2011-11-03 12:13:46 pitrou create