Message146919
The utf-7 codec happily encodes lone surrogates, but it won't decode them:
>>> "\ud801".encode("utf-7")
b'+2AE-'
>>> "\ud801\ud801".encode("utf-7")
b'+2AHYAQ-'
>>> "\ud801".encode("utf-7").decode("utf-7")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/cpython/default/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-4: second surrogate missing at end of shift sequence
>>> "\ud801\ud801".encode("utf-7").decode("utf-7")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/cpython/default/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-6: second surrogate missing
I don't know which behaviour is better, but round-tripping is certainly a desirable property of any codec.
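
For anyone who wants to check the round-trip property on their own build, here is a minimal sketch. The helper name `roundtrips` is made up for illustration and is not part of the codecs API; it only assumes that encode and decode failures both surface as `UnicodeError` subclasses. The result for lone surrogates may differ between Python versions, so no output is shown.

```python
def roundtrips(text, codec, errors="strict"):
    """Return True if encoding and then decoding `text` with `codec` gives `text` back.

    Hypothetical helper for illustration only.
    """
    try:
        return text.encode(codec, errors).decode(codec, errors) == text
    except UnicodeError:
        # Catches both UnicodeEncodeError and UnicodeDecodeError.
        return False


if __name__ == "__main__":
    # ascii() keeps the lone surrogates printable on any terminal encoding.
    for sample in ("abc", "\ud801", "\ud801\ud801"):
        print(ascii(sample), roundtrips(sample, "utf-7"))
```

A codec that refuses lone surrogates on both sides (as utf-8 does under the default "strict" handler) simply reports False here, which is consistent even if it is not a round trip; the asymmetric case described above is the surprising one.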

Date | User | Action | Args
2011-11-03 12:13:47 | pitrou | set | recipients: + pitrou, loewis, ezio.melotti
2011-11-03 12:13:47 | pitrou | set | messageid: <1320322427.22.0.464090396955.issue13333@psf.upfronthosting.co.za>
2011-11-03 12:13:46 | pitrou | link | issue13333 messages
2011-11-03 12:13:46 | pitrou | create |