Issue 24848: Warts in UTF-7 error handling

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/69036

classification

Title:	Warts in UTF-7 error handling
Type:	behavior	Stage:	resolved
Components:	Unicode	Versions:	Python 3.6, Python 3.4, Python 3.5, Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	serhiy.storchaka	Nosy List:	ezio.melotti, lemburg, loewis, pitrou, python-dev, serhiy.storchaka, vstinner
Priority:	normal	Keywords:	patch

Created on 2015-08-12 07:36 by serhiy.storchaka, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
utf7_error_handling.patch	serhiy.storchaka, 2015-08-21 20:52		review
utf7_error_handling-2.patch	serhiy.storchaka, 2015-09-27 22:14		review
decode_utf7_locale-2.7.patch	serhiy.storchaka, 2015-10-08 15:55		review

Messages (12)
msg248450 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-08-12 07:36
While trying to implement the UTF-7 codec in Python, I found some warts in error handling.

1. Non-ASCII bytes.

No errors:

>>> 'a€b'.encode('utf-7')
b'a+IKw-b'
>>> b'a+IKw-b'.decode('utf-7')
'a€b'

The terminating '-' at the end of the string is optional.

>>> b'a+IKw'.decode('utf-7')
'a€'

And sometimes it is optional in the middle of the string (if the following character is not used in BASE64).

>>> b'a+IKw;b'.decode('utf-7')
'a€;b'

But if the following character is not ASCII, it is accepted as well, and this looks like a bug.

>>> b'a+IKw\xffb'.decode('utf-7')
'a€ÿb'

In all other cases a non-ASCII byte causes an error:

>>> b'a\xffb'.decode('utf-7')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/serhiy/py/cpython/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode byte 0xff in position 1: unexpected special character
>>> b'a\xffb'.decode('utf-7', 'replace')
'a�b'

2. Ending lone high surrogate.

Lone surrogates are silently accepted by the utf-7 codec.

>>> '\ud8e4\U0001d121'.encode('utf-7')
b'+2OTYNN0h-'
>>> '\U0001d121\ud8e4'.encode('utf-7')
b'+2DTdIdjk-'
>>> b'+2OTYNN0h-'.decode('utf-7')
'\ud8e4𝄡'
>>> b'+2OTYNN0h'.decode('utf-7')
'\ud8e4𝄡'
>>> b'+2DTdIdjk-'.decode('utf-7')
'𝄡\ud8e4'

Except at the end of an unterminated shift sequence:

>>> b'+2DTdIdjk'.decode('utf-7')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/serhiy/py/cpython/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-8: unterminated shift sequence

3. Incorrect shift sequence.

Strange behavior happens when a shift sequence ends with wrong bits.

>>> b'a+IKx-b'.decode('utf-7', 'ignore')
'a€b'
>>> b'a+IKx-b'.decode('utf-7', 'replace')
'a€�b'
>>> b'a+IKx-b'.decode('utf-7', 'backslashreplace')
'a€\\x2b\\x49\\x4b\\x78\\x2db'

The decoder first decodes as many characters as it can, and then passes the whole shift sequence (including the already decoded bytes) to the error handler. I am not sure this is a bug, but it differs from the common behavior of other decoders.
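To make the byte strings in the report easier to read, here is a minimal sketch of how a shift sequence is put together, assuming the usual UTF-7 scheme of unpadded Base64 over UTF-16-BE. The helper names `shift_sequence` and `leftover_bits` are hypothetical and not part of any codec API. The second helper shows what "wrong bits" means in case 3: `IKw` and `IKx` carry the same 16-bit unit, but `IKx` leaves a non-zero bit in the two padding bits.

```python
import base64

# The 64 characters of RFC 2152 "set B" (the standard Base64 alphabet).
B64_ALPHABET = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def shift_sequence(text):
    """Return '+', the unpadded Base64 of text in UTF-16-BE, then '-'."""
    # 'surrogatepass' lets lone surrogates through, matching the encoder
    # behavior shown in the report. This helper is illustrative only.
    raw = text.encode("utf-16-be", "surrogatepass")
    return b"+" + base64.b64encode(raw).rstrip(b"=") + b"-"


def leftover_bits(b64_chars):
    """Return (count, value) of the bits left after the last whole 16-bit unit."""
    value = 0
    for ch in b64_chars:  # iterating over bytes yields ints
        value = (value << 6) | B64_ALPHABET.index(ch)
    extra = (6 * len(b64_chars)) % 16
    return extra, value & ((1 << extra) - 1)


print(shift_sequence("€"))      # b'+IKw-'
print(leftover_bits(b"IKw"))    # (2, 0): padding bits are zero
print(leftover_bits(b"IKx"))    # (2, 1): padding bits are not zero
```

The same helper reproduces the lone-surrogate example from case 2: `shift_sequence('\ud8e4\U0001d121')` gives `b'+2OTYNN0h-'`.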
msg248980 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-08-21 20:52
There is a reason for the behavior in case 2. This is likely truncated data, and it is safer to raise an exception than to silently produce a lone surrogate. The current UTF-7 encoder always adds '-' after an ending shift sequence. I suppose this is not a bug.

However, there are three more bugs.

4. The decoder can emit a lone high surrogate before the replacement character in case of an error.

>>> b'+2DTdI-'.decode('utf-7', 'replace')
'\ud834�'

The high surrogate is part of an incomplete astral character and shouldn't be emitted when there is an error in the encoded astral character.

5. According to RFC 2152: "A "+" character followed immediately by any character other than members of set B or "-" is an ill-formed sequence." But this is accepted by the current decoder as an empty shift sequence that is decoded to an empty string.

>>> b'a+,b'.decode('utf-7')
'a,b'
>>> b'a+'.decode('utf-7')
'a'

6. The replacement character '\ufffd' can be replaced with the character 'ý' ('\xfd'):

>>> b'\xff'.decode('utf-7', 'replace')
'�'
>>> b'a\xff'.decode('utf-7', 'replace')
'a�'
>>> b'a\xffb'.decode('utf-7', 'replace')
'a�b'
>>> b'\xffb'.decode('utf-7', 'replace')
'ýb'

This bug is reproduced only in 3.4+.

The following patch fixes bugs 1 and 4 and adds more tests.

Corner cases 2 and 3 are likely not bugs.

I have doubts about fixing bug 5. iconv accepts such ill-formed sequences. In any case, I think a fix for this bug can be applied only to the default branch.

I have no idea how to fix bug 6. I'm afraid it can be a bug in _PyUnicodeWriter and therefore can affect other decoders.
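The RFC 2152 rule quoted in bug 5 can be checked independently of any decoder. Below is a rough sketch of such a scanner. The function name `ill_formed_shift_starts` is hypothetical, and treating a "+" at the very end of the input as ill-formed is an assumption that follows the `b'a+'` example above rather than the RFC wording.

```python
import string

# RFC 2152 "set B": the Base64 alphabet, as byte values.
SET_B = frozenset((string.ascii_letters + string.digits + "+/").encode("ascii"))


def ill_formed_shift_starts(data):
    """Return offsets of '+' bytes not followed by a set B byte or '-'."""
    bad = []
    i, n = 0, len(data)
    while i < n:
        if data[i] != ord("+"):
            i += 1
            continue
        nxt = data[i + 1] if i + 1 < n else None
        if nxt == ord("-"):
            i += 2                      # "+-" encodes a literal "+"
        elif nxt in SET_B:
            i += 1                      # enter the shift sequence
            while i < n and data[i] in SET_B:
                i += 1                  # '+' inside a sequence is plain Base64
            if i < n and data[i] == ord("-"):
                i += 1                  # absorb the terminating '-'
        else:
            bad.append(i)               # assumption: end of input counts too
            i += 1
    return bad


print(ill_formed_shift_starts(b"a+,b"))     # [1]
print(ill_formed_shift_starts(b"a+IKw-b"))  # []
```

This only locates the sequences the RFC calls ill-formed. Whether a decoder should reject them is the open question in this message.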
msg251729 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-09-27 22:14
The updated patch also fixes a bug in _PyUnicodeWriter. Another affected encoding is "unicode-escape":

>>> br'\u;'.decode('unicode-escape', 'replace')
'ý;'
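A small probe along the following lines could be used to check whether an interpreter still shows the 'ý' symptom. This is only a sketch: the table `PROBES` and the helper names are hypothetical, and the inputs are the ones quoted in this issue.

```python
# Hypothetical probe table: (malformed input, codec name), taken from this issue.
PROBES = [
    (b"\xff", "utf-7"),
    (b"a\xff", "utf-7"),
    (b"a\xffb", "utf-7"),
    (b"\xffb", "utf-7"),
    (br"\u;", "unicode-escape"),
]


def replacement_is_intact(raw, encoding):
    """True if 'replace' produced U+FFFD and did not truncate it to U+00FD."""
    decoded = raw.decode(encoding, "replace")
    return "\ufffd" in decoded and "\xfd" not in decoded


def failing_probes(probes=PROBES):
    """Return the probes whose replacement character came out damaged."""
    return [(raw, enc) for raw, enc in probes if not replacement_is_intact(raw, enc)]


print(failing_probes())
```

On an interpreter that includes the fix, the list should be empty. On an affected 3.4+ build, the probes `b'\xffb'` and `br'\u;'` would be expected to show up.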
msg252103 - (view)	Author: Roundup Robot (python-dev)	Date: 2015-10-02 10:15
New changeset 3c13567ea642 by Serhiy Storchaka in branch '3.4': Issue #24848: Fixed bugs in UTF-7 decoding of misformed data: https://hg.python.org/cpython/rev/3c13567ea642
New changeset a61fa2b08f87 by Serhiy Storchaka in branch '3.5': Issue #24848: Fixed bugs in UTF-7 decoding of misformed data: https://hg.python.org/cpython/rev/a61fa2b08f87
New changeset 037253b7cd6d by Serhiy Storchaka in branch 'default': Issue #24848: Fixed bugs in UTF-7 decoding of misformed data: https://hg.python.org/cpython/rev/037253b7cd6d
New changeset c6eaa722e2c1 by Serhiy Storchaka in branch '2.7': Issue #24848: Fixed bugs in UTF-7 decoding of misformed data: https://hg.python.org/cpython/rev/c6eaa722e2c1
msg252109 - (view)	Author: STINNER Victor (vstinner) *	Date: 2015-10-02 12:09
http://buildbot.python.org/all/builders/x86%20XP-4%202.7/builds/3431/steps/test/logs/stdio

======================================================================
FAIL: test_errors (test.test_codecs.UTF7Test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "d:\cygwin\home\db3l\buildarea\2.7.bolen-windows\build\lib\test\test_codecs.py", line 709, in test_errors
    self.assertEqual(raw.decode('utf-7', 'replace'), expected)
AssertionError: u'a\u20ac\ufffd' != u'a\u20ac\ufffdb'
- a\u20ac\ufffd
+ a\u20ac\ufffdb
?              +

======================================================================
FAIL: test_lone_surrogates (test.test_codecs.UTF7Test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "d:\cygwin\home\db3l\buildarea\2.7.bolen-windows\build\lib\test\test_codecs.py", line 743, in test_lone_surrogates
    self.assertEqual(raw.decode('utf-7', 'replace'), expected)
AssertionError: u'a\ufffd' != u'a\ufffdb'
- a\ufffd
+ a\ufffdb
?        +
msg252147 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-10-02 18:29
I have no idea why the tests fail, and only on this buildbot.
msg252173 - (view)	Author: STINNER Victor (vstinner) *	Date: 2015-10-02 22:52
> I have no idea why the tests fail, and only on this buildbot.

test_codecs always crashes on Python 3.6 when Python is compiled in debug mode:

test_errors (test.test_codecs.UTF7Test) ... python: Objects/unicodeobject.c:1263: _copy_characters: Assertion `ch <= to_maxchar' failed.
Fatal Python error: Aborted

Current thread 0x00007f1489057700 (most recent call first):
  File "/home/haypo/prog/python/default/Lib/encodings/utf_7.py", line 12 in decode
  File "/home/haypo/prog/python/default/Lib/test/test_codecs.py", line 1021 in test_errors
  File "/home/haypo/prog/python/default/Lib/unittest/case.py", line 600 in run
  File "/home/haypo/prog/python/default/Lib/unittest/case.py", line 648 in __call__
  File "/home/haypo/prog/python/default/Lib/unittest/suite.py", line 122 in run
  File "/home/haypo/prog/python/default/Lib/unittest/suite.py", line 84 in __call__
  File "/home/haypo/prog/python/default/Lib/unittest/suite.py", line 122 in run
  File "/home/haypo/prog/python/default/Lib/unittest/suite.py", line 84 in __call__
  File "/home/haypo/prog/python/default/Lib/unittest/suite.py", line 122 in run
  File "/home/haypo/prog/python/default/Lib/unittest/suite.py", line 84 in __call__
  File "/home/haypo/prog/python/default/Lib/unittest/runner.py", line 176 in run
  File "/home/haypo/prog/python/default/Lib/test/support/__init__.py", line 1775 in _run_suite
  File "/home/haypo/prog/python/default/Lib/test/support/__init__.py", line 1809 in run_unittest
  File "/home/haypo/prog/python/default/Lib/test/libregrtest/runtest.py", line 159 in test_runner
  File "/home/haypo/prog/python/default/Lib/test/libregrtest/runtest.py", line 160 in runtest_inner
  File "/home/haypo/prog/python/default/Lib/test/libregrtest/runtest.py", line 124 in runtest
  File "/home/haypo/prog/python/default/Lib/test/libregrtest/main.py", line 285 in run_tests_sequential
  File "/home/haypo/prog/python/default/Lib/test/libregrtest/main.py", line 344 in run_tests
  File "/home/haypo/prog/python/default/Lib/test/libregrtest/main.py", line 380 in main
  File "/home/haypo/prog/python/default/Lib/test/libregrtest/main.py", line 421 in main
  File "/home/haypo/prog/python/default/Lib/test/libregrtest/main.py", line 443 in main_in_temp_cwd
  File "/home/haypo/prog/python/default/Lib/test/__main__.py", line 3 in <module>
  File "/home/haypo/prog/python/default/Lib/runpy.py", line 85 in _run_code
  File "/home/haypo/prog/python/default/Lib/runpy.py", line 170 in _run_module_as_main
Abandon (core dumped)
msg252174 - (view)	Author: STINNER Victor (vstinner) *	Date: 2015-10-02 22:54
Oops, ignore my comment, I forgot to recompile Python. "make" and the bug is done :-)
msg252265 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-10-04 10:21
The test failure is random. With build 3435 the tests pass; with all other builds they fail. The same happens on another buildbot: http://buildbot.python.org/all/builders/x86%20Windows7%202.7/ . Builds 3345 and 3347 are green, the others are red.
msg252549 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-10-08 15:55
The difference between 2.7 and 3.x is that 2.7 uses isalnum() in IS_BASE64, while 3.x tests concrete ranges. Therefore, depending on platform and locale, 2.7 can accept wrong bytes as BASE64 characters and return an incorrect result. The following patch makes the 2.7 code the same as 3.x. The tests are changed to fail with high probability on unpatched code ('\xe1' is alphanumeric in almost all 8-bit locales).
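The difference between the two kinds of membership test can be illustrated in Python, though only by analogy: the real code is a C macro, and the two functions below are hypothetical stand-ins. `is_base64_alnum_like` imitates a locale-dependent isalnum() by interpreting the byte as Latin-1, which is an assumption chosen to reproduce the '\xe1' case and does not model any particular platform.

```python
def is_base64_ranges(c):
    """Locale-independent test on a byte value, using explicit ASCII ranges."""
    return (
        ord("A") <= c <= ord("Z")
        or ord("a") <= c <= ord("z")
        or ord("0") <= c <= ord("9")
        or c in (ord("+"), ord("/"))
    )


def is_base64_alnum_like(c):
    """Stand-in for an isalnum()-based test under a Latin-1 style 8-bit locale."""
    # chr(c) maps the byte to the same code point, so chr(0xE1) is 'á',
    # which str.isalnum() treats as alphanumeric.
    return chr(c).isalnum() or c in (ord("+"), ord("/"))


# Bytes accepted by the alnum-style test but not by the range test.
extra = [c for c in range(256) if is_base64_alnum_like(c) and not is_base64_ranges(c)]

print(sum(is_base64_ranges(c) for c in range(256)))  # 64
print(0xE1 in extra)                                 # True
```

Every byte in `extra` is one that a shift sequence could wrongly absorb as Base64 data under such a locale, which is consistent with the random buildbot failures described above.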
msg252562 - (view)	Author: STINNER Victor (vstinner) *	Date: 2015-10-08 17:34
The patch looks good to me.
msg252694 - (view)	Author: Roundup Robot (python-dev)	Date: 2015-10-10 06:33
New changeset ff1366ff2761 by Serhiy Storchaka in branch '2.7': Issue #24848: Fixed yet one bug in UTF-7 decoder. Testing for BASE64 character https://hg.python.org/cpython/rev/ff1366ff2761

History
Date	User	Action	Args
2022-04-11 14:58:19	admin	set	github: 69036
2015-11-10 21:15:12	serhiy.storchaka	set	status: open -> closed assignee: serhiy.storchaka resolution: fixed stage: patch review -> resolved
2015-10-10 06:33:55	python-dev	set	messages: + msg252694
2015-10-08 17:34:58	vstinner	set	messages: + msg252562
2015-10-08 15:56:00	serhiy.storchaka	set	files: + decode_utf7_locale-2.7.patch messages: + msg252549
2015-10-04 10:21:51	serhiy.storchaka	set	messages: + msg252265
2015-10-02 22:54:07	vstinner	set	messages: + msg252174
2015-10-02 22:52:15	vstinner	set	messages: + msg252173
2015-10-02 18:29:46	serhiy.storchaka	set	messages: + msg252147
2015-10-02 12:09:53	vstinner	set	messages: + msg252109
2015-10-02 10:15:43	python-dev	set	nosy: + python-dev messages: + msg252103
2015-09-27 22:15:20	serhiy.storchaka	set	nosy: + pitrou
2015-09-27 22:14:23	serhiy.storchaka	set	files: + utf7_error_handling-2.patch messages: + msg251729
2015-08-21 20:52:39	serhiy.storchaka	set	files: + utf7_error_handling.patch keywords: + patch messages: + msg248980 stage: patch review
2015-08-12 07:43:59	serhiy.storchaka	link	issue22598 dependencies
2015-08-12 07:36:05	serhiy.storchaka	create