Warts in UTF-7 error handling #69036
While trying to implement a UTF-7 codec in Python, I found some warts in error handling.
1. Non-ASCII bytes after a shift sequence.
No errors:
>>> 'a€b'.encode('utf-7')
b'a+IKw-b'
>>> b'a+IKw-b'.decode('utf-7')
'a€b'
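For readers unfamiliar with the format, here is a minimal sketch of how a shift sequence such as `+IKw-` comes about: the text is encoded as UTF-16BE, Base64-encoded, and stripped of `=` padding. The helper name `encode_shift_sequence` is hypothetical and not part of the codec.

```python
import base64

def encode_shift_sequence(text):
    """Hypothetical helper: encode text as one UTF-7 shift sequence."""
    utf16 = text.encode('utf-16-be')                 # UTF-7 works on UTF-16BE code units
    payload = base64.b64encode(utf16).rstrip(b'=')   # modified Base64 drops '=' padding
    return b'+' + payload + b'-'                     # '+' opens the sequence, '-' closes it

print(encode_shift_sequence('€'))   # b'+IKw-'
print('a€b'.encode('utf-7'))        # b'a+IKw-b'
```

The euro sign is U+20AC, so the two bytes 0x20 0xAC give 16 data bits, and the three Base64 characters `IKw` carry those plus two zero padding bits.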
The terminating '-' at the end of the string is optional.
>>> b'a+IKw'.decode('utf-7')
'a€'
It is also optional in the middle of the string if the following character is not a Base64 character.
>>> b'a+IKw;b'.decode('utf-7')
'a€;b'
But if the following byte is not ASCII, it is accepted as well, and this looks like a bug.
>>> b'a+IKw\xffb'.decode('utf-7')
'a€ÿb'
In all other cases a non-ASCII byte causes an error:
>>> b'a\xffb'.decode('utf-7')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/serhiy/py/cpython/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode byte 0xff in position 1: unexpected special character
>>> b'a\xffb'.decode('utf-7', 'replace')
'a�b'
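The behavior this report expects for the byte that follows Base64 data can be summarized in a small sketch. The function name and the returned labels are hypothetical; the sketch describes the expected rule, not the decoder's actual code.

```python
# The 64 characters that may appear inside a shift sequence.
BASE64_CHARS = frozenset(
    b'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'
)

def classify_after_shift(byte):
    """Hypothetical helper: how a byte inside a shift sequence should be treated."""
    if byte in BASE64_CHARS:
        return 'continue'        # still part of the shift sequence
    if byte == ord('-'):
        return 'explicit-end'    # terminator, absorbed by the decoder
    if byte < 0x80:
        return 'implicit-end'    # ends the sequence and is decoded as itself
    return 'error'               # non-ASCII bytes are never valid in UTF-7

for b in (ord('w'), ord('-'), ord(';'), 0xFF):
    print(hex(b), classify_after_shift(b))
```

Under this rule `b'a+IKw;b'` is fine, while the `0xff` in `b'a+IKw\xffb'` falls into the error branch instead of being passed through.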
2. Lone surrogates.
Lone surrogates are silently accepted by the UTF-7 codec.
>>> '\ud8e4\U0001d121'.encode('utf-7')
b'+2OTYNN0h-'
>>> '\U0001d121\ud8e4'.encode('utf-7')
b'+2DTdIdjk-'
>>> b'+2OTYNN0h-'.decode('utf-7')
'\ud8e4𝄡'
>>> b'+2OTYNN0h'.decode('utf-7')
'\ud8e4𝄡'
>>> b'+2DTdIdjk-'.decode('utf-7')
'𝄡\ud8e4'
Except at the end of an unterminated shift sequence:
>>> b'+2DTdIdjk'.decode('utf-7')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/serhiy/py/cpython/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-8: unterminated shift sequence
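To see where the lone surrogate sits in each payload, the Base64 data can be unpacked into UTF-16 code units by hand. This is only an illustrative sketch; both helper names are hypothetical, and it bypasses the utf-7 codec entirely.

```python
import base64

def shift_payload_to_code_units(payload):
    """Hypothetical helper: Base64 payload of a shift sequence -> UTF-16 code units."""
    padded = payload + '=' * (-len(payload) % 4)   # restore the '=' padding UTF-7 omits
    raw = base64.b64decode(padded)
    # Read big-endian 16-bit units; a trailing odd byte (if any) is ignored.
    return [int.from_bytes(raw[i:i + 2], 'big') for i in range(0, len(raw) - 1, 2)]

def lone_surrogate_indexes(units):
    """Return the indexes of surrogates that are not part of a high+low pair."""
    lone, i = [], 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:               # high surrogate
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2                             # well-formed pair, skip both
                continue
            lone.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:             # low surrogate without a high one
            lone.append(i)
        i += 1
    return lone

for payload in ('2OTYNN0h', '2DTdIdjk'):
    units = shift_payload_to_code_units(payload)
    print(payload, [hex(u) for u in units], lone_surrogate_indexes(units))
```

In `2OTYNN0h` the lone surrogate is the first code unit, while in `2DTdIdjk` it is the last one, which is the case that raises when the terminating '-' is missing.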
3. Incorrect shift sequence.
Strange behavior happens when a shift sequence ends with wrong bits.
>>> b'a+IKx-b'.decode('utf-7', 'ignore')
'a€b'
>>> b'a+IKx-b'.decode('utf-7', 'replace')
'a€�b'
>>> b'a+IKx-b'.decode('utf-7', 'backslashreplace')
'a€\\x2b\\x49\\x4b\\x78\\x2db'
The decoder first decodes as many characters as it can, and then passes the whole shift sequence (including the already decoded bytes) to the error handler. I am not sure this is a bug, but it differs from the common behavior of other decoders.
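The "wrong bits" can be made visible with a short sketch that computes the bits left over after the last complete 16-bit code unit. The helper name is hypothetical, and the sketch assumes the rule that leftover bits must be zero.

```python
B64_ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'

def leftover_bits(payload):
    """Hypothetical helper: (count, value) of bits after the last full 16-bit unit."""
    value = 0
    for ch in payload:
        value = (value << 6) | B64_ALPHABET.index(ch)   # each character carries 6 bits
    count = (6 * len(payload)) % 16                     # bits that do not fill a code unit
    return count, value & ((1 << count) - 1)

print(leftover_bits('IKw'))   # (2, 0)
print(leftover_bits('IKx'))   # (2, 1)
```

Both payloads decode the same 16 data bits (U+20AC), which is why '€' is produced before the error handler is called; only the two padding bits differ.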
There is a reason for the behavior in case 2. The input is likely truncated data, and it is safer to raise an exception than to silently produce a lone surrogate. The current UTF-7 encoder always adds '-' after an ending shift sequence. I suppose this is not a bug. However, there are three more bugs.
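4. The decoder can emit a lone surrogate before the replacement character when an encoded astral character is in error.
>>> b'+2DTdI-'.decode('utf-7', 'replace')
'\ud834�'
The high surrogate '\ud834' is part of an incomplete astral character and shouldn't be emitted when the encoded astral character is in error.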
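5. A '+' followed by a character that is neither a Base64 character nor '-', or by the end of the input, is accepted as an empty shift sequence.
>>> b'a+,b'.decode('utf-7')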
'a,b'
>>> b'a+'.decode('utf-7')
'a'
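6. With the 'replace' error handler, an invalid byte at the start of the input followed by more data produces 'ý' instead of the replacement character.
>>> b'\xff'.decode('utf-7', 'replace')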
'�'
>>> b'a\xff'.decode('utf-7', 'replace')
'a�'
>>> b'a\xffb'.decode('utf-7', 'replace')
'a�b'
>>> b'\xffb'.decode('utf-7', 'replace')
'ýb'
This bug is reproduced only in 3.4+.
The following patch fixes bugs 1 and 4 and adds more tests. Corner cases 2 and 3 are likely not bugs. I am doubtful about fixing bug 5: iconv accepts such ill-formed sequences. In any case, I think a fix for it can be applied only to the default branch. I have no idea how to fix bug 6. I am afraid it may be a bug in _PyUnicodeWriter and could therefore affect other decoders.
The updated patch also fixes a bug in _PyUnicodeWriter. Another affected encoding is "unicode-escape":
>>> br'\u;'.decode('unicode-escape', 'replace')
'ý;'
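One observation about the symptom, offered as an unconfirmed reading rather than the diagnosed cause: the wrong character 'ý' is U+00FD, which is exactly the low byte of the replacement character U+FFFD. That is what a write into storage sized for one byte per character would produce. A minimal sketch of the arithmetic:

```python
REPLACEMENT = '\ufffd'

# Keep only the low 8 bits of the code point, as a 1-byte-per-character
# buffer would (assumption: this mirrors the symptom, not the actual C code).
truncated = chr(ord(REPLACEMENT) & 0xFF)

print(hex(ord(truncated)))   # 0xfd
print(truncated)             # ý
```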
New changeset 3c13567ea642 by Serhiy Storchaka in branch '3.4':
New changeset a61fa2b08f87 by Serhiy Storchaka in branch '3.5':
New changeset 037253b7cd6d by Serhiy Storchaka in branch 'default':
New changeset c6eaa722e2c1 by Serhiy Storchaka in branch '2.7':
http://buildbot.python.org/all/builders/x86%20XP-4%202.7/builds/3431/steps/test/logs/stdio
======================================================================
Traceback (most recent call last):
  File "d:\cygwin\home\db3l\buildarea\2.7.bolen-windows\build\lib\test\test_codecs.py", line 709, in test_errors
    self.assertEqual(raw.decode('utf-7', 'replace'), expected)
AssertionError: u'a\u20ac\ufffd' != u'a\u20ac\ufffdb'
- a\u20ac\ufffd
+ a\u20ac\ufffdb
?               +
======================================================================
Traceback (most recent call last):
  File "d:\cygwin\home\db3l\buildarea\2.7.bolen-windows\build\lib\test\test_codecs.py", line 743, in test_lone_surrogates
    self.assertEqual(raw.decode('utf-7', 'replace'), expected)
AssertionError: u'a\ufffd' != u'a\ufffdb'
- a\ufffd
+ a\ufffdb
?         +
I have no idea why the tests fail, and only on this buildbot.
test_codecs always crashes on Python 3.6 when Python is compiled in debug mode:
test_errors (test.test_codecs.UTF7Test) ... python: Objects/unicodeobject.c:1263: _copy_characters: Assertion `ch <= to_maxchar' failed.
Current thread 0x00007f1489057700 (most recent call first):
Oops, ignore my comment, I forgot to recompile Python. One "make" and the bug is gone :-)
The test failure is random. With build 3435 the tests pass; with all the others they fail. The same happens on another buildbot: http://buildbot.python.org/all/builders/x86%20Windows7%202.7/ . Builds 3345 and 3347 are green, the others are red.
The difference between 2.7 and 3.x is that 2.7 uses isalnum() in IS_BASE64, while 3.x tests concrete character ranges. Therefore, depending on the platform and locale, 2.7 can accept wrong bytes as Base64 characters and return an incorrect result. The following patch makes the 2.7 code the same as 3.x. The tests are changed so that they fail with high probability on unpatched code ('\xe1' is alphanumeric in almost all 8-bit locales).
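The difference between the two checks can be sketched in Python. Both function names are hypothetical. The second function only simulates a Latin-1 locale by decoding the byte as Latin-1 and using `str.isalnum()`; the extra test for '+' and '/' is an assumption about the shape of the old macro.

```python
def is_base64_explicit(byte):
    """Range-based check, in the spirit of the 3.x approach described above."""
    return (ord('A') <= byte <= ord('Z')
            or ord('a') <= byte <= ord('z')
            or ord('0') <= byte <= ord('9')
            or byte in (ord('+'), ord('/')))

def is_base64_alnum_latin1(byte):
    """Simulation of an isalnum()-based check under a Latin-1 locale (assumption)."""
    ch = bytes([byte]).decode('latin-1')
    return ch.isalnum() or ch in '+/'

for b in (ord('A'), ord('-'), 0xE1):
    print(hex(b), is_base64_explicit(b), is_base64_alnum_latin1(b))
```

The byte 0xE1 ('á' in Latin-1) passes the alnum-based check but not the range-based one, which is the kind of input that can make the result locale-dependent.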
The patch looks good to me.
New changeset ff1366ff2761 by Serhiy Storchaka in branch '2.7':