classification
Title: Crash during decoding using UTF-16/32 and custom error handler
Type: crash
Stage: resolved
Components: Interpreter Core
Versions: Python 3.7, Python 3.6
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: benjamin.peterson, ezio.melotti, lemburg, ned.deily, serhiy.storchaka, sibiryakov, terry.reedy, vstinner, xiang.zhang
Priority: normal
Keywords: patch

Created on 2018-01-17 15:12 by sibiryakov, last changed 2018-01-31 23:33 by ned.deily. This issue is now closed.

Files
File name Uploaded Description
decode_crash.py sibiryakov, 2018-01-17 15:12 The source code to replicate the issue
test_string.bin sibiryakov, 2018-01-17 15:13 Content file needed to run the crash script
valgrind.log sibiryakov, 2018-01-17 15:27 Valgrind log
issue32583.patch xiang.zhang, 2018-01-21 15:56
Pull Requests
URL Status Linked
PR 5325 closed xiang.zhang, 2018-01-25 18:24
PR 5459 merged xiang.zhang, 2018-01-31 13:01
Messages (8)
msg310188 - Author: Alexander Sibiryakov (sibiryakov) Date: 2018-01-17 15:27
The CPython interpreter gets SIGSEGV or SIGABRT during the run. The script attempts to decode a binary file using the UTF-16-LE encoding and a custom error handler. The error handler is poorly built and doesn't respect the Unicode standard: it miscalculates the new position at which the decoder should continue. This somehow interferes with the internal C code doing memory allocation. The result is invalid writes outside of the allocated block.
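
For comparison, a decoding error handler conventionally returns a replacement string together with the position at which decoding should resume, usually exc.end. A minimal sketch, assuming nothing about the handler in decode_crash.py (the name replace_and_skip and the sample bytes are made up for illustration):

```python
import codecs

def replace_and_skip(exc):
    """Hypothetical handler: emit U+FFFD and resume after the bad bytes."""
    if not isinstance(exc, UnicodeDecodeError):
        # This sketch only handles decoding errors.
        raise exc
    # exc.start and exc.end delimit the undecodable bytes in exc.object.
    # Resuming at exc.end skips the whole faulty sequence.
    return ('\ufffd', exc.end)

codecs.register_error('replace_and_skip', replace_and_skip)

# 'a', an unpaired high surrogate (0xD8D8), then 'b', all UTF-16-LE.
data = b'a\x00\xd8\xd8b\x00'
text = data.decode('utf-16-le', 'replace_and_skip')
print(ascii(text))
```

Because it resumes at exc.end, this handler should behave like the built-in 'replace' handler for decoding.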

Here is what the crash looks like with Python 3.7.0a4+ (heads/master:44a70e9, Jan 17 2018, 12:18:45) run under Valgrind 3.11.0. Please see the full Valgrind output in the attached valgrind.log.

==24836== Invalid write of size 4
==24836==    at 0x4C6B17: ucs4lib_utf16_decode (codecs.h:540)
==24836==    by 0x4C6B17: PyUnicode_DecodeUTF16Stateful (unicodeobject.c:5600)
==24836==    by 0x55AAD3: _codecs_utf_16_le_decode_impl (_codecsmodule.c:363)
==24836==    by 0x55AB6C: _codecs_utf_16_le_decode (_codecsmodule.c.h:371)
==24836==    by 0x4315D6: _PyMethodDef_RawFastCallKeywords (call.c:651)
==24836==    by 0x431840: _PyCFunction_FastCallKeywords (call.c:730)
==24836==    by 0x4ED159: call_function (ceval.c:4580)
==24836==    by 0x4ED159: _PyEval_EvalFrameDefault (ceval.c:3134)
==24836==    by 0x4E302D: PyEval_EvalFrameEx (ceval.c:545)
==24836==    by 0x4E3A42: _PyEval_EvalCodeWithName (ceval.c:3971)
==24836==    by 0x430EDD: _PyFunction_FastCallDict (call.c:376)
==24836==    by 0x4336B0: PyObject_Call (call.c:226)
==24836==    by 0x433839: PyEval_CallObjectWithKeywords (call.c:826)
==24836==    by 0x4FEAA6: _PyCodec_DecodeInternal (codecs.c:471)
==24836==  Address 0x6cf4bf8 is 0 bytes after a block of size 339,112 alloc'd
==24836==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==24836==    by 0x467635: _PyMem_RawMalloc (obmalloc.c:75)
==24836==    by 0x467B7D: _PyMem_DebugRawAlloc (obmalloc.c:2033)
==24836==    by 0x467C1F: _PyMem_DebugRawMalloc (obmalloc.c:2062)
==24836==    by 0x467C40: _PyMem_DebugMalloc (obmalloc.c:2202)
==24836==    by 0x468BFF: PyObject_Malloc (obmalloc.c:616)
==24836==    by 0x493902: PyUnicode_New (unicodeobject.c:1293)
==24836==    by 0x4BEA4F: _PyUnicodeWriter_PrepareInternal (unicodeobject.c:13456)
==24836==    by 0x4C6D39: _PyUnicodeWriter_WriteCharInline (unicodeobject.c:13494)
==24836==    by 0x4C6D39: PyUnicode_DecodeUTF16Stateful (unicodeobject.c:5637)
==24836==    by 0x55AAD3: _codecs_utf_16_le_decode_impl (_codecsmodule.c:363)
==24836==    by 0x55AB6C: _codecs_utf_16_le_decode (_codecsmodule.c.h:371)
==24836==    by 0x4315D6: _PyMethodDef_RawFastCallKeywords (call.c:651)
msg310289 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-01-19 21:37
As written, decode_crash.py crashes on Windows also.  Passing 'replace' instead of 'w3lib_replace' results in no crash and lots of boxes and blanks.
msg310357 - Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2018-01-20 19:28
The problem is that the UTF-16 decoder almost always assumes that two bytes decode to one Unicode character, so when allocating memory it assumes that (bytes_number+1)/2 Unicode slots are enough; there is even a comment in the code saying so. And unicode_decode_call_errorhandler_writer only allocates more memory when the error handler returns a replacement string longer than one character. It doesn't handle a handler that paces by one byte, in which case one byte maps to one Unicode character. So it's possible for the decoder to write out of bounds.

This example reliably crashes on my Mac with a debug build; it writes past the end of the internal Unicode buffer:

>>> import codecs
>>> def pace_by_one(exc):
...     return ('\ufffd', exc.start+1)
...
>>> codecs.register_error('pace_by_one', pace_by_one)
>>> b'\xd8\xd8\xd8\xd8\xd8\xd8\x00\x00\x00'.decode('utf-16-le', 'pace_by_one')
Debug memory block at address p=0x10210c260: API 'o'
    100 bytes originally requested
    The 7 pad bytes at p-7 are FORBIDDENBYTE, as expected.
    The 8 pad bytes at tail=0x10210c2c4 are not all FORBIDDENBYTE (0xfb):
        at tail+0: 0x00 *** OUCH
        at tail+1: 0x00 *** OUCH
        at tail+2: 0xfb
        at tail+3: 0xfb
        at tail+4: 0xfb
        at tail+5: 0xfb
        at tail+6: 0xfb
        at tail+7: 0xfb
    The block was made by call #30672 to debug malloc/realloc.
    Data at p: 00 00 00 00 00 00 00 00 ... fd ff fd ff fd ff d8 00

Fatal Python error: bad trailing pad byte

Current thread 0x00007fffab9b4340 (most recent call first):
  File "/Users/angwer/Repositories/cpython/Lib/encodings/utf_16_le.py", line 16 in decode
  File "<stdin>", line 1 in <module>
[1]    63997 abort      ~/Repositories/cpython/python.exe
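
The overrun in the session above follows from simple counting. A rough sketch, assuming the slot estimate is exactly the (bytes_number+1)/2 described in this message (the helper estimated_slots is hypothetical and does not mirror the real C code). It should only be run on an interpreter that already contains the fix for this issue, since an affected build aborts as shown above:

```python
import codecs

def estimated_slots(n_bytes):
    """Slots reserved up front, per the description: (n + 1) / 2."""
    return (n_bytes + 1) // 2

def pace_by_one(exc):
    # Same handler as in the session above: one replacement character,
    # resuming one byte after the start of the bad sequence.
    return ('\ufffd', exc.start + 1)

codecs.register_error('pace_by_one', pace_by_one)

data = b'\xd8\xd8\xd8\xd8\xd8\xd8\x00\x00\x00'
decoded = data.decode('utf-16-le', 'pace_by_one')

# Compare the up-front estimate with the number of characters produced.
print('estimated slots:', estimated_slots(len(data)))
print('characters produced:', len(decoded))
```

Each handler call consumes one byte but emits one character, so the output can outgrow an estimate that budgets two bytes per character.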

I'll try to make a fix tomorrow.
msg310359 - Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2018-01-20 19:55
Another way to crash:

>>> import codecs
>>> def replace_with_longer(exc):
...     exc.object = b'\xa0\x00' * 100
...     return ('\ufffd', exc.end)
...
>>> codecs.register_error('replace_with_longer', replace_with_longer)
>>> b'\xd8\xd8'.decode('utf-16-le', 'replace_with_longer')
Debug memory block at address p=0x10b3b8c40: API 'o'
    92 bytes originally requested
    The 7 pad bytes at p-7 are FORBIDDENBYTE, as expected.
    The 8 pad bytes at tail=0x10b3b8c9c are not all FORBIDDENBYTE (0xfb):
        at tail+0: 0xa0 *** OUCH
        at tail+1: 0x00 *** OUCH
        at tail+2: 0xa0 *** OUCH
        at tail+3: 0x00 *** OUCH
        at tail+4: 0xa0 *** OUCH
        at tail+5: 0x00 *** OUCH
        at tail+6: 0xa0 *** OUCH
        at tail+7: 0x00 *** OUCH
    The block was made by call #11529390970613309440 to debug malloc/realloc.
    Data at p: 00 00 00 00 00 00 00 00 ... 00 00 00 00 fd ff a0 00

Fatal Python error: bad trailing pad byte

Current thread 0x00007fffab9b4340 (most recent call first):
  File "/Users/angwer/Repositories/cpython/Lib/encodings/utf_16_le.py", line 16 in decode
  File "<stdin>", line 1 in <module>
[1]    64081 abort      ~/Repositories/cpython/python.exe
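
The second crash can be read the same way: the output buffer was presumably sized from the 2 original input bytes, while the handler swaps in 200 bytes of new input through exc.object. A hedged sketch that only measures the mismatch (variable names are illustrative). It should only be run on an interpreter that already contains the fix for this issue, since an affected build aborts as shown above:

```python
import codecs

def replace_with_longer(exc):
    # Same handler as in the session above: replace the input being
    # decoded with a much longer byte string, then resume at exc.end.
    exc.object = b'\xa0\x00' * 100
    return ('\ufffd', exc.end)

codecs.register_error('replace_with_longer', replace_with_longer)

original = b'\xd8\xd8'
result = original.decode('utf-16-le', 'replace_with_longer')

# The decoder continues with the swapped-in bytes, so the result is far
# longer than anything the 2-byte original could account for.
print('original bytes:', len(original))
print('characters produced:', len(result))
```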
msg310376 - Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2018-01-21 15:56
I wrote a draft patch, without tests yet; I'll add them later. Reviews are appreciated. I also checked the Windows code page equivalents and the encoders; it looks to me like they don't suffer from the problem.
msg311327 - Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2018-01-31 12:48
New changeset 2c7fd46e11333ef5e5cce34212f7d087694f3658 by Xiang Zhang in branch 'master':
bpo-32583: Fix possible crashing in builtin Unicode decoders (#5325)
https://github.com/python/cpython/commit/2c7fd46e11333ef5e5cce34212f7d087694f3658
msg311329 - Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2018-01-31 13:34
New changeset ea94fce6960d90fffeeda131e31024617912d231 by Xiang Zhang in branch '3.6':
[3.6] bpo-32583: Fix possible crashing in builtin Unicode decoders (GH-5325) (#5459)
https://github.com/python/cpython/commit/ea94fce6960d90fffeeda131e31024617912d231
msg311388 - Author: Ned Deily (ned.deily) * (Python committer) Date: 2018-01-31 23:33
New changeset 86fdad093b863db7ef6a3a00c9cff724c09442e7 by Ned Deily (Xiang Zhang) in branch '3.7':
bpo-32583: Fix possible crashing in builtin Unicode decoders (#5325)
https://github.com/python/cpython/commit/86fdad093b863db7ef6a3a00c9cff724c09442e7
History
Date User Action Args
2018-01-31 23:33:09 ned.deily set nosy: + ned.deily; messages: + msg311388
2018-01-31 13:35:16 xiang.zhang set status: open -> closed; resolution: fixed; stage: patch review -> resolved
2018-01-31 13:34:19 xiang.zhang set messages: + msg311329
2018-01-31 13:01:25 xiang.zhang set pull_requests: + pull_request5285
2018-01-31 12:48:12 xiang.zhang set messages: + msg311327
2018-01-25 18:24:58 xiang.zhang set pull_requests: + pull_request5170
2018-01-21 15:56:38 xiang.zhang set files: + issue32583.patch; keywords: + patch; messages: + msg310376; stage: needs patch -> patch review
2018-01-20 19:55:51 xiang.zhang set messages: + msg310359
2018-01-20 19:28:46 xiang.zhang set stage: patch review -> needs patch
2018-01-20 19:28:21 xiang.zhang set nosy: + xiang.zhang; messages: + msg310357
2018-01-20 13:32:40 serhiy.storchaka set nosy: + serhiy.storchaka; stage: test needed -> patch review; components: + Interpreter Core; versions: + Python 3.6, - Python 3.5
2018-01-19 21:37:09 terry.reedy set nosy: + terry.reedy, ezio.melotti, lemburg, benjamin.peterson, vstinner; messages: + msg310289; stage: test needed
2018-01-17 15:27:50 sibiryakov set files: + valgrind.log; messages: + msg310188
2018-01-17 15:13:32 sibiryakov set files: + test_string.bin
2018-01-17 15:12:30 sibiryakov create