Message 146919 - Python tracker

Author	pitrou
Recipients	ezio.melotti, loewis, pitrou
Date	2011-11-03.12:13:46
SpamBayes Score	7.7327034e-14
Marked as misclassified	No
Message-id	<1320322427.22.0.464090396955.issue13333@psf.upfronthosting.co.za>
In-reply-to

Content
The utf-7 codec happily encodes lone surrogates, but it won't decode them:

>>> "\ud801".encode("utf-7")
b'+2AE-'
>>> "\ud801\ud801".encode("utf-7")
b'+2AHYAQ-'
>>> "\ud801".encode("utf-7").decode("utf-7")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/cpython/default/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-4: second surrogate missing at end of shift sequence
>>> "\ud801\ud801".encode("utf-7").decode("utf-7")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/cpython/default/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-6: second surrogate missing


I don't know which behaviour is better, but round-tripping is certainly a desirable property of any codec.
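
For anyone who wants to check the round-trip property on their own build, here is a minimal sketch. The helper name `roundtrips` is hypothetical (it is not part of the `codecs` module), and the result for lone surrogates will depend on the interpreter version, so no expected output is given:

```python
def roundtrips(text, encoding="utf-7"):
    """Return True if text survives encode followed by decode unchanged.

    Hypothetical helper for illustration; not part of the standard library.
    """
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        # UnicodeEncodeError and UnicodeDecodeError are both subclasses
        # of UnicodeError, so a failure on either side counts as "no".
        return False


# The two inputs from the report: one lone surrogate, then two in a row.
for sample in ("\ud801", "\ud801\ud801"):
    # ascii() keeps the surrogate from being written raw to stdout.
    print(ascii(sample), roundtrips(sample))
```

The helper deliberately treats an encoder that refuses the input and a decoder that rejects the encoder's output the same way, since both break the round trip.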

History
Date	User	Action	Args
2011-11-03 12:13:47	pitrou	set	recipients: + pitrou, loewis, ezio.melotti
2011-11-03 12:13:47	pitrou	set	messageid: <1320322427.22.0.464090396955.issue13333@psf.upfronthosting.co.za>
2011-11-03 12:13:46	pitrou	link	issue13333 messages
2011-11-03 12:13:46	pitrou	create