Message 248450 - Python tracker


Author	serhiy.storchaka
Recipients	ezio.melotti, lemburg, loewis, serhiy.storchaka, vstinner
Date	2015-08-12.07:36:04
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1439364965.73.0.526700604341.issue24848@psf.upfronthosting.co.za>
In-reply-to

Content

While trying to implement the UTF-7 codec in Python, I found some warts in error handling.

1. Non-ASCII bytes.

No errors:
>>> 'a€b'.encode('utf-7')
b'a+IKw-b'
>>> b'a+IKw-b'.decode('utf-7')
'a€b'

The terminating '-' at the end of the string is optional.
>>> b'a+IKw'.decode('utf-7')
'a€'

Sometimes it is also optional in the middle of the string (if the following character is not part of the BASE64 alphabet).
>>> b'a+IKw;b'.decode('utf-7')
'a€;b'

But if the following byte is not ASCII, it is accepted as well, and this looks like a bug.
>>> b'a+IKw\xffb'.decode('utf-7')
'a€ÿb'

In all other cases a non-ASCII byte causes an error:
>>> b'a\xffb'.decode('utf-7')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/serhiy/py/cpython/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode byte 0xff in position 1: unexpected special character
>>> b'a\xffb'.decode('utf-7', 'replace')
'a�b'
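
The cases above depend on how the byte that follows the base64 run is classified. Here is a minimal sketch of that classification, assuming the modified base64 alphabet of RFC 2152 (letters, digits, '+' and '/'). The name `classify_after_shift` is hypothetical, and this is only an illustration of the expected rules, not how the C decoder is structured.

```python
import string

# Modified base64 alphabet used inside a UTF-7 shift sequence
# (assumption: letters, digits, '+' and '/', as in RFC 2152).
BASE64_BYTES = frozenset(
    (string.ascii_letters + string.digits + "+/").encode("ascii")
)


def classify_after_shift(byte):
    """Classify a byte value (0-255) seen while inside a shift sequence."""
    if byte >= 0x80:
        # Non-ASCII is never valid UTF-7, so a strict decoder would
        # report an error here instead of passing the byte through.
        return "error"
    if byte in BASE64_BYTES:
        return "base64"
    if byte == ord("-"):
        # Explicit terminator, absorbed by the decoder.
        return "explicit-end"
    # Any other ASCII byte ends the sequence implicitly and is kept as data.
    return "implicit-end"


for sample in (b"w", b"-", b";", b"\xff"):
    print(sample, classify_after_shift(sample[0]))
```

Under these rules b'a+IKw;b' is fine (';' ends the sequence implicitly), while the b'\xff' in b'a+IKw\xffb' would be rejected just like the one in b'a\xffb'.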

2. Ending lone high surrogate.

Lone surrogates are silently accepted by the utf-7 codec.

>>> '\ud8e4\U0001d121'.encode('utf-7')
b'+2OTYNN0h-'
>>> '\U0001d121\ud8e4'.encode('utf-7')
b'+2DTdIdjk-'
>>> b'+2OTYNN0h-'.decode('utf-7')
'\ud8e4𝄡'
>>> b'+2OTYNN0h'.decode('utf-7')
'\ud8e4𝄡'
>>> b'+2DTdIdjk-'.decode('utf-7')
'𝄡\ud8e4'

Except at the end of an unterminated shift sequence:
>>> b'+2DTdIdjk'.decode('utf-7')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/serhiy/py/cpython/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-8: unterminated shift sequence
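
To see which code units are involved, the base64 payload of a shift sequence can be unpacked by hand. The sketch below assumes the payload is plain base64 of big-endian UTF-16 code units. The helper names `shift_payload_to_code_units` and `lone_surrogates` are hypothetical and are not part of the codec.

```python
import base64
import struct


def shift_payload_to_code_units(payload):
    """Decode the base64 payload of a shift sequence (without '+' and '-')
    into a list of UTF-16 code units, dropping any leftover bits."""
    # UTF-7 omits '=' padding; restore it for the base64 module.
    padded = payload + b"=" * (-len(payload) % 4)
    raw = base64.b64decode(padded)
    usable = len(raw) - len(raw) % 2  # keep whole 16-bit units only
    return list(struct.unpack(">%dH" % (usable // 2), raw[:usable]))


def lone_surrogates(units):
    """Return the indexes of surrogates that are not part of a valid pair."""
    lone = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # valid high/low pair
                continue
            lone.append(i)  # high surrogate without a low one
        elif 0xDC00 <= unit <= 0xDFFF:
            lone.append(i)  # low surrogate without a preceding high one
        i += 1
    return lone


for payload in (b"2OTYNN0h", b"2DTdIdjk"):
    units = shift_payload_to_code_units(payload)
    print(payload, [hex(u) for u in units], lone_surrogates(units))
```

In both payloads the lone unit is the high surrogate 0xd8e4. It comes first in b'2OTYNN0h' and last in b'2DTdIdjk', and only the second case fails when the terminating '-' is missing.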

3. Incorrect shift sequence.

Strange behavior happens when a shift sequence ends with the wrong bits.
>>> b'a+IKx-b'.decode('utf-7', 'ignore')
'a€b'
>>> b'a+IKx-b'.decode('utf-7', 'replace')
'a€�b'
>>> b'a+IKx-b'.decode('utf-7', 'backslashreplace')
'a€\\x2b\\x49\\x4b\\x78\\x2db'

The decoder first decodes as many characters as it can, and then passes the whole shift sequence (including the already decoded bytes) to the error handler. I'm not sure this is a bug, but it differs from the common behavior of other decoders.
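
The "wrong bits" are the bits left over after the last complete 16-bit code unit. The sketch below computes them, assuming the standard base64 alphabet order. The helper name `leftover_bits` is hypothetical.

```python
import string

# Standard base64 alphabet order: A-Z, a-z, 0-9, '+', '/'.
B64_ALPHABET = (
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
)


def leftover_bits(payload):
    """Return (count, value) of the bits that remain after the last
    whole 16-bit code unit in a base64 payload given as a str."""
    acc = 0
    for ch in payload:
        acc = (acc << 6) | B64_ALPHABET.index(ch)
    count = (6 * len(payload)) % 16
    return count, acc & ((1 << count) - 1)


print(leftover_bits("IKw"))  # two padding bits, both zero
print(leftover_bits("IKx"))  # two padding bits, value 1 (non-zero)
```

Both payloads carry the same 16 bits (0x20AC, '€'). They differ only in the two trailing padding bits, which are zero in 'IKw' and non-zero in 'IKx'.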
