
Author pitrou
Recipients ezio.melotti, loewis, pitrou
Date 2011-11-03.12:13:46
Message-id <1320322427.22.0.464090396955.issue13333@psf.upfronthosting.co.za>
In-reply-to
Content
The utf-7 codec happily encodes lone surrogates, but it won't decode them:

>>> "\ud801".encode("utf-7")
b'+2AE-'
>>> "\ud801\ud801".encode("utf-7")
b'+2AHYAQ-'
>>> "\ud801".encode("utf-7").decode("utf-7")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/cpython/default/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-4: second surrogate missing at end of shift sequence
>>> "\ud801\ud801".encode("utf-7").decode("utf-7")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/cpython/default/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-6: second surrogate missing


I don't know which behaviour is better, but round-tripping is certainly a desirable property of any codec.
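
To compare interpreter versions, the asymmetry above can be checked with a small helper. This is only a minimal sketch: the name `roundtrip_status` and its return labels are made up for illustration and are not part of the `codecs` module. It uses the default strict error handling, and the result for the surrogate samples may well differ between Python versions.

```python
def roundtrip_status(text, encoding):
    """Classify what happens when `text` is encoded and then decoded again.

    Hypothetical helper for illustration only; uses strict error handling.
    """
    try:
        raw = text.encode(encoding)
    except UnicodeEncodeError:
        # The codec refuses to encode the input at all.
        return "encode-error"
    try:
        decoded = raw.decode(encoding)
    except UnicodeDecodeError:
        # The codec produced bytes it cannot read back (the case reported here).
        return "decode-error"
    return "ok" if decoded == text else "mismatch"


# Plain ASCII, one lone surrogate, two lone surrogates in a row.
samples = ["abc", "\ud801", "\ud801\ud801"]
for sample in samples:
    print(ascii(sample), roundtrip_status(sample, "utf-7"))
```

A codec that round-trips should report "ok" or "encode-error" for every input, never "decode-error". For comparison, utf-8 takes the "encode-error" path for lone surrogates under strict error handling.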
History
Date                 User    Action  Args
2011-11-03 12:13:47  pitrou  set     recipients: + pitrou, loewis, ezio.melotti
2011-11-03 12:13:47  pitrou  set     messageid: <1320322427.22.0.464090396955.issue13333@psf.upfronthosting.co.za>
2011-11-03 12:13:46  pitrou  link    issue13333 messages
2011-11-03 12:13:46  pitrou  create