
Author serhiy.storchaka
Recipients ezio.melotti, lemburg, loewis, serhiy.storchaka, vstinner
Date 2015-08-12.07:36:04
Message-id <1439364965.73.0.526700604341.issue24848@psf.upfronthosting.co.za>
Content
While trying to implement a UTF-7 codec in Python, I found some warts in error handling.

1. Non-ASCII bytes.

No errors:
>>> 'a€b'.encode('utf-7')
b'a+IKw-b'
>>> b'a+IKw-b'.decode('utf-7')
'a€b'
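
For reference, the `IKw` payload above can be reproduced by hand: as I read RFC 2152, the shifted part is the UTF-16-BE bytes of the text in Base64 with the `=` padding dropped. A minimal sketch (the helper name `shift_payload` is made up for illustration and is not part of the codecs module):

```python
import base64

def shift_payload(text):
    """Return the Base64 payload of a UTF-7 shift sequence for text.

    Hypothetical helper for illustration only: encode the text as
    UTF-16-BE, Base64-encode it, and strip the '=' padding.
    """
    raw = text.encode('utf-16-be')
    return base64.b64encode(raw).rstrip(b'=')

payload = shift_payload('€')   # U+20AC -> bytes 20 AC
print(payload)
# Wrapping the payload in '+' ... '-' gives the same bytes as the codec.
print(b'a+' + payload + b'-b' == 'a€b'.encode('utf-7'))
```

The last two bits of the final Base64 character are padding here, which matters for case 3 below.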

The terminating '-' at the end of the string is optional:
>>> b'a+IKw'.decode('utf-7')
'a€'

It is also optional in the middle of the string, if the following character is not a Base64 character:
>>> b'a+IKw;b'.decode('utf-7')
'a€;b'

But if the following byte is not ASCII, it is accepted as well, and this looks like a bug:
>>> b'a+IKw\xffb'.decode('utf-7')
'a€ÿb'

In all other cases a non-ASCII byte causes an error:
>>> b'a\xffb'.decode('utf-7')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/serhiy/py/cpython/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode byte 0xff in position 1: unexpected special character
>>> b'a\xffb'.decode('utf-7', 'replace')
'a�b'
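
Until the decoder is consistent, a caller who wants every non-ASCII byte rejected could check the input up front. This is only a sketch of a workaround, not a proposed fix; `decode_utf7_ascii_only` is a hypothetical name:

```python
def decode_utf7_ascii_only(data):
    """Decode UTF-7, rejecting any byte >= 0x80 before decoding.

    Hypothetical wrapper, not part of the codecs module. The pre-check
    always raises, regardless of what the decoder itself would do.
    """
    for pos, byte in enumerate(data):
        if byte > 0x7f:
            raise UnicodeDecodeError('utf-7', data, pos, pos + 1,
                                     'non-ASCII byte in UTF-7 input')
    return data.decode('utf-7')

try:
    decode_utf7_ascii_only(b'a+IKw\xffb')
except UnicodeDecodeError as exc:
    print(exc.start, exc.reason)
```

With this wrapper the `\xff` directly after the shift sequence is reported at position 5, the same way a `\xff` outside a shift sequence is.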

2. Ending lone high surrogate.

Lone surrogates are silently accepted by the utf-7 codec:

>>> '\ud8e4\U0001d121'.encode('utf-7')
b'+2OTYNN0h-'
>>> '\U0001d121\ud8e4'.encode('utf-7')
b'+2DTdIdjk-'
>>> b'+2OTYNN0h-'.decode('utf-7')
'\ud8e4𝄡'
>>> b'+2OTYNN0h'.decode('utf-7')
'\ud8e4𝄡'
>>> b'+2DTdIdjk-'.decode('utf-7')
'𝄡\ud8e4'

Except at the end of an unterminated shift sequence:
>>> b'+2DTdIdjk'.decode('utf-7')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/serhiy/py/cpython/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-8: unterminated shift sequence
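
To see which results above contain a lone surrogate, the decoded string can be scanned directly. A small sketch (the function name `lone_surrogates` is an assumption of mine):

```python
def lone_surrogates(text):
    """Return the indexes of surrogate code points in a str.

    In a Python 3 str every code point in U+D800..U+DFFF is unpaired,
    because a valid pair is stored as a single non-BMP character.
    """
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]

print(lone_surrogates('\ud8e4\U0001d121'))   # [0]
print(lone_surrogates('\U0001d121\ud8e4'))   # [1]
print(lone_surrogates('\U0001d121'))         # []
```

The failing case is the one where the lone high surrogate sits at index 1, that is, at the very end of the unterminated shift sequence.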

3. Incorrect shift sequence.

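Strange behavior happens when a shift sequence ends with wrong bits, i.e. the bits left over after the last complete UTF-16 code unit are not zero ('IKx' instead of 'IKw'):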
>>> b'a+IKx-b'.decode('utf-7', 'ignore')
'a€b'
>>> b'a+IKx-b'.decode('utf-7', 'replace')
'a€�b'
>>> b'a+IKx-b'.decode('utf-7', 'backslashreplace')
'a€\\x2b\\x49\\x4b\\x78\\x2db'

The decoder first decodes as many characters as it can, and then passes the whole shift sequence (including the already decoded bytes) to the error handler. I am not sure this is a bug, but it differs from the common behavior of other decoders.
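
The "wrong bits" in this case can be made explicit by looking at what is left of the Base64 payload after the last whole 16-bit unit. A rough sketch, assuming the payload uses the standard Base64 alphabet (`leftover_bits` is a hypothetical helper, not the decoder's actual logic):

```python
import string

# Standard Base64 alphabet, as also used by modified Base64 in UTF-7.
B64_ALPHABET = (string.ascii_uppercase + string.ascii_lowercase +
                string.digits + '+/')

def leftover_bits(payload):
    """Return (count, value) of the bits after the last whole UTF-16 unit.

    payload is the text between '+' and '-'. Raises ValueError if it
    contains a character outside the Base64 alphabet.
    """
    acc = 0
    for ch in payload:
        acc = (acc << 6) | B64_ALPHABET.index(ch)
    count = (6 * len(payload)) % 16
    return count, acc & ((1 << count) - 1)

print(leftover_bits('IKw'))   # (2, 0): two padding bits, both zero
print(leftover_bits('IKx'))   # (2, 1): padding bits are not zero
```

So in `b'a+IKx-b'` the first 16 bits still decode to '€', and only the two trailing bits are invalid, yet the whole sequence from '+' to '-' is handed to the error handler.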