Message248450
While trying to implement a UTF-7 codec in Python, I found some warts in error handling.
1. Non-ASCII bytes.
No errors:
>>> 'a€b'.encode('utf-7')
b'a+IKw-b'
>>> b'a+IKw-b'.decode('utf-7')
'a€b'
The terminating '-' at the end of the string is optional.
>>> b'a+IKw'.decode('utf-7')
'a€'
Sometimes it is also optional in the middle of the string (if the following character is not a BASE64 character).
>>> b'a+IKw;b'.decode('utf-7')
'a€;b'
But if the following byte is not ASCII, it is accepted as well, which looks like a bug.
>>> b'a+IKw\xffb'.decode('utf-7')
'a€ÿb'
In all other cases a non-ASCII byte causes an error:
>>> b'a\xffb'.decode('utf-7')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/serhiy/py/cpython/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode byte 0xff in position 1: unexpected special character
>>> b'a\xffb'.decode('utf-7', 'replace')
'a�b'
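Until the decoder itself rejects such input, a caller could check for non-ASCII bytes before decoding. The following is only a minimal sketch of that idea; `strict_utf7_decode` is a hypothetical helper name, not something in the stdlib, and it only adds the check for the 'strict' error handler.

```python
def strict_utf7_decode(data, errors='strict'):
    """Decode UTF-7, rejecting any non-ASCII byte up front.

    Hypothetical helper: the extra check applies only when errors='strict';
    other error handlers are passed straight through to the codec.
    """
    if errors == 'strict':
        for pos, byte in enumerate(data):
            if byte > 0x7f:
                # Reuse the reason the codec reports for non-ASCII bytes
                # outside a shift sequence.
                raise UnicodeDecodeError('utf-7', bytes(data), pos, pos + 1,
                                         'unexpected special character')
    return data.decode('utf-7', errors)
```

With this wrapper, b'a+IKw\xffb' is rejected at position 5 regardless of how the underlying decoder treats a non-ASCII byte that follows a shift sequence.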
2. Ending lone high surrogate.
Lone surrogates are silently accepted by the utf-7 codec.
>>> '\ud8e4\U0001d121'.encode('utf-7')
b'+2OTYNN0h-'
>>> '\U0001d121\ud8e4'.encode('utf-7')
b'+2DTdIdjk-'
>>> b'+2OTYNN0h-'.decode('utf-7')
'\ud8e4𝄡'
>>> b'+2OTYNN0h'.decode('utf-7')
'\ud8e4𝄡'
>>> b'+2DTdIdjk-'.decode('utf-7')
'𝄡\ud8e4'
Except at the end of an unterminated shift sequence:
>>> b'+2DTdIdjk'.decode('utf-7')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/serhiy/py/cpython/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-8: unterminated shift sequence
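To see where the lone surrogate sits in each payload, the BASE64 text between '+' and '-' can be unpacked into UTF-16BE code units by hand. This is an illustrative sketch only; `utf16_units` and `lone_surrogates` are hypothetical helper names and do not mirror how the C decoder is written.

```python
import base64
import struct

def utf16_units(payload):
    """Return the UTF-16BE code units carried by a UTF-7 shift payload.

    `payload` is the BASE64 text between '+' and '-' (hypothetical helper).
    """
    padded = payload + '=' * (-len(payload) % 4)   # UTF-7 omits '=' padding
    raw = base64.b64decode(padded)
    raw = raw[:len(raw) // 2 * 2]                  # drop a trailing partial unit
    return list(struct.unpack('>%dH' % (len(raw) // 2), raw))

def lone_surrogates(units):
    """Return indices of surrogate code units that are not in a valid pair."""
    lone = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: valid only if a low surrogate follows.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            lone.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate without a preceding high surrogate.
            lone.append(i)
        i += 1
    return lone

print([hex(u) for u in utf16_units('2OTYNN0h')])   # lone surrogate first
print([hex(u) for u in utf16_units('2DTdIdjk')])   # lone surrogate last
```

For '2DTdIdjk' the lone high surrogate is the final code unit, which is the case that fails when the terminating '-' is missing.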
3. Incorrect shift sequence.
Strange behavior happens when a shift sequence ends with wrong bits.
>>> b'a+IKx-b'.decode('utf-7', 'ignore')
'a€b'
>>> b'a+IKx-b'.decode('utf-7', 'replace')
'a€�b'
>>> b'a+IKx-b'.decode('utf-7', 'backslashreplace')
'a€\\x2b\\x49\\x4b\\x78\\x2db'
The decoder first decodes as many characters as it can, and then passes the whole shift sequence (including the already decoded bytes) to the error handler. I am not sure this is a bug, but it differs from the common behavior of other decoders.
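The "wrong bits" can be made visible by counting what is left over after the last full 16-bit unit. Three BASE64 characters carry 18 bits, so 2 bits remain, and as I read it those are expected to be zero. A rough sketch, with `leftover_bits` as a hypothetical helper name:

```python
# Standard BASE64 alphabet; the index of a character is its 6-bit value.
B64 = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'

def leftover_bits(payload):
    """Return (count, value) of the bits after the last full 16-bit unit."""
    bits = 0
    for ch in payload:
        bits = (bits << 6) | B64.index(ch)
    count = len(payload) * 6 % 16
    return count, bits & ((1 << count) - 1)

print(leftover_bits('IKw'))   # payload from the well-formed example
print(leftover_bits('IKx'))   # payload from the malformed example
```

Both payloads decode to the same 16-bit unit (U+20AC); they differ only in the two trailing bits, which are zero for 'IKw' and non-zero for 'IKx'.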
Date | User | Action | Args
2015-08-12 07:36:05 | serhiy.storchaka | set | recipients: + serhiy.storchaka, lemburg, loewis, vstinner, ezio.melotti
2015-08-12 07:36:05 | serhiy.storchaka | set | messageid: <1439364965.73.0.526700604341.issue24848@psf.upfronthosting.co.za>
2015-08-12 07:36:05 | serhiy.storchaka | link | issue24848 messages
2015-08-12 07:36:04 | serhiy.storchaka | create |