Crash during decoding using UTF-16/32 and custom error handler #76764

Closed
sibiryakov mannequin opened this issue Jan 17, 2018 · 8 comments
Labels
3.7 (EOL): end of life; interpreter-core: Objects, Python, Grammar, and Parser dirs; type-crash: a hard crash of the interpreter, possibly with a core dump

Comments

@sibiryakov
Mannequin

sibiryakov mannequin commented Jan 17, 2018

BPO 32583
Nosy @malemburg, @terryjreedy, @vstinner, @benjaminp, @ned-deily, @ezio-melotti, @serhiy-storchaka, @zhangyangyu, @sibiryakov
PRs
  • bpo-32583: Fix possible crashing in builtin Unicode decoders #5325
  • [3.6] bpo-32583: Fix possible crashing in builtin Unicode decoders (GH-5325) #5459
Files
  • decode_crash.py: The source code to replicate the issue
  • test_string.bin: Content file needed to run the crash script
  • valgrind.log: Valgrind log
  • issue32583.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2018-01-31.13:35:16.894>
    created_at = <Date 2018-01-17.15:12:30.618>
    labels = ['interpreter-core', '3.7', 'type-crash']
    title = 'Crash during decoding using UTF-16/32 and custom error handler'
    updated_at = <Date 2018-01-31.23:33:09.923>
    user = 'https://github.com/sibiryakov'

    bugs.python.org fields:

    activity = <Date 2018-01-31.23:33:09.923>
    actor = 'ned.deily'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-01-31.13:35:16.894>
    closer = 'xiang.zhang'
    components = ['Interpreter Core']
    creation = <Date 2018-01-17.15:12:30.618>
    creator = 'sibiryakov'
    dependencies = []
    files = ['47391', '47392', '47393', '47399']
    hgrepos = []
    issue_num = 32583
    keywords = ['patch']
    message_count = 8.0
    messages = ['310188', '310289', '310357', '310359', '310376', '311327', '311329', '311388']
    nosy_count = 9.0
    nosy_names = ['lemburg', 'terry.reedy', 'vstinner', 'benjamin.peterson', 'ned.deily', 'ezio.melotti', 'serhiy.storchaka', 'xiang.zhang', 'sibiryakov']
    pr_nums = ['5325', '5459']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'crash'
    url = 'https://bugs.python.org/issue32583'
    versions = ['Python 3.6', 'Python 3.7']

    @sibiryakov sibiryakov mannequin added 3.7 (EOL) end of life type-crash A hard crash of the interpreter, possibly with a core dump labels Jan 17, 2018
    @sibiryakov
    Mannequin Author

    sibiryakov mannequin commented Jan 17, 2018

    The CPython interpreter dies with SIGSEGV or SIGABRT while running the attached script. The script decodes a binary file as UTF-16-LE with a custom error handler. The error handler is poorly built: it miscalculates the position at which the decoder should resume, so it does not respect the Unicode standard. This somehow interferes with the memory allocation done by the internal C code, and the result is invalid writes past the end of the allocated block.

    Here is what it looks like with Python 3.7.0a4+ (heads/master:44a70e9, Jan 17 2018, 12:18:45) run under Valgrind 3.11.0. The full Valgrind output is in the attached valgrind.log.

    ==24836== Invalid write of size 4
    ==24836== at 0x4C6B17: ucs4lib_utf16_decode (codecs.h:540)
    ==24836== by 0x4C6B17: PyUnicode_DecodeUTF16Stateful (unicodeobject.c:5600)
    ==24836== by 0x55AAD3: _codecs_utf_16_le_decode_impl (_codecsmodule.c:363)
    ==24836== by 0x55AB6C: _codecs_utf_16_le_decode (_codecsmodule.c.h:371)
    ==24836== by 0x4315D6: _PyMethodDef_RawFastCallKeywords (call.c:651)
    ==24836== by 0x431840: _PyCFunction_FastCallKeywords (call.c:730)
    ==24836== by 0x4ED159: call_function (ceval.c:4580)
    ==24836== by 0x4ED159: _PyEval_EvalFrameDefault (ceval.c:3134)
    ==24836== by 0x4E302D: PyEval_EvalFrameEx (ceval.c:545)
    ==24836== by 0x4E3A42: _PyEval_EvalCodeWithName (ceval.c:3971)
    ==24836== by 0x430EDD: _PyFunction_FastCallDict (call.c:376)
    ==24836== by 0x4336B0: PyObject_Call (call.c:226)
    ==24836== by 0x433839: PyEval_CallObjectWithKeywords (call.c:826)
    ==24836== by 0x4FEAA6: _PyCodec_DecodeInternal (codecs.c:471)
    ==24836== Address 0x6cf4bf8 is 0 bytes after a block of size 339,112 alloc'd
    ==24836== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==24836== by 0x467635: _PyMem_RawMalloc (obmalloc.c:75)
    ==24836== by 0x467B7D: _PyMem_DebugRawAlloc (obmalloc.c:2033)
    ==24836== by 0x467C1F: _PyMem_DebugRawMalloc (obmalloc.c:2062)
    ==24836== by 0x467C40: _PyMem_DebugMalloc (obmalloc.c:2202)
    ==24836== by 0x468BFF: PyObject_Malloc (obmalloc.c:616)
    ==24836== by 0x493902: PyUnicode_New (unicodeobject.c:1293)
    ==24836== by 0x4BEA4F: _PyUnicodeWriter_PrepareInternal (unicodeobject.c:13456)
    ==24836== by 0x4C6D39: _PyUnicodeWriter_WriteCharInline (unicodeobject.c:13494)
    ==24836== by 0x4C6D39: PyUnicode_DecodeUTF16Stateful (unicodeobject.c:5637)
    ==24836== by 0x55AAD3: _codecs_utf_16_le_decode_impl (_codecsmodule.c:363)
    ==24836== by 0x55AB6C: _codecs_utf_16_le_decode (_codecsmodule.c.h:371)
    ==24836== by 0x4315D6: _PyMethodDef_RawFastCallKeywords (call.c:651)

    @terryjreedy
    Member

    As written, decode_crash.py crashes on Windows also. Passing 'replace' instead of 'w3lib_replace' results in no crash and lots of boxes and blanks.
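
For contrast with the crashing handler, here is a minimal sketch of a custom handler that resumes at exc.end, which is also where the built-in 'replace' handler resumes when decoding. The handler name resume_at_end and the sample bytes are illustrative assumptions; they are not taken from decode_crash.py or test_string.bin.

```python
import codecs

def resume_at_end(exc):
    # Hypothetical handler: substitute U+FFFD and skip the whole
    # offending range instead of stepping back into it.
    if isinstance(exc, UnicodeDecodeError):
        return ('\ufffd', exc.end)
    raise exc

codecs.register_error('resume_at_end', resume_at_end)

# Illustrative input: unpaired surrogates followed by a truncated code unit.
data = b'\xd8\xd8\xd8\xd8\xd8\xd8\x00\x00\x00'

custom = data.decode('utf-16-le', 'resume_at_end')
builtin = data.decode('utf-16-le', 'replace')

# With a one-character replacement and a resume position of exc.end,
# the output stays within one character per two input bytes.
slots = (len(data) + 1) // 2
print(len(custom), slots, custom == builtin)
```

A handler of this shape consumes at least as many bytes as the decoder expected for each character it emits, so it should not outgrow the output buffer.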

    @serhiy-storchaka serhiy-storchaka added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Jan 20, 2018
    @zhangyangyu
    Member

    The problem is that the UTF-16 decoder assumes that every two bytes decode to at most one Unicode character, so it allocates only (bytes_number+1)/2 character slots up front; there is even a comment in the code saying so. unicode_decode_call_errorhandler_writer allocates more memory only when the error handler returns a replacement string longer than one character. It does not account for a handler that advances the position by a single byte, in which case each input byte can produce one Unicode character. The decoder can therefore write out of bounds.

    This example reliably crashes a debug build on my Mac; it writes past the end of the internal unicode buffer:

    >>> import codecs
    >>> def pace_by_one(exc):
    ...     return ('\ufffd', exc.start+1)
    ...
    >>> codecs.register_error('pace_by_one', pace_by_one)
    >>> b'\xd8\xd8\xd8\xd8\xd8\xd8\x00\x00\x00'.decode('utf-16-le', 'pace_by_one')
    Debug memory block at address p=0x10210c260: API 'o'
        100 bytes originally requested
        The 7 pad bytes at p-7 are FORBIDDENBYTE, as expected.
        The 8 pad bytes at tail=0x10210c2c4 are not all FORBIDDENBYTE (0xfb):
            at tail+0: 0x00 *** OUCH
            at tail+1: 0x00 *** OUCH
            at tail+2: 0xfb
            at tail+3: 0xfb
            at tail+4: 0xfb
            at tail+5: 0xfb
            at tail+6: 0xfb
            at tail+7: 0xfb
        The block was made by call python/cpython#74857 to debug malloc/realloc.
        Data at p: 00 00 00 00 00 00 00 00 ... fd ff fd ff fd ff d8 00

    Fatal Python error: bad trailing pad byte

    Current thread 0x00007fffab9b4340 (most recent call first):
    File "/Users/angwer/Repositories/cpython/Lib/encodings/utf_16_le.py", line 16 in decode
    File "<stdin>", line 1 in <module>
    [1] 63997 abort ~/Repositories/cpython/python.exe

    I'll try to make a fix tomorrow.
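
The arithmetic behind the overflow can be checked from Python. This is a sketch only, and it assumes an interpreter that already contains the fix for this issue, since unpatched builds may crash on it. The variable slots mirrors the (bytes_number+1)/2 estimate described above; it is an illustration, not an API of the decoder.

```python
import codecs

def pace_by_one(exc):
    # Advance only one byte past the start of the bad sequence.
    return ('\ufffd', exc.start + 1)

codecs.register_error('pace_by_one', pace_by_one)

data = b'\xd8\xd8\xd8\xd8\xd8\xd8\x00\x00\x00'

# Initial estimate: one character per two input bytes.
slots = (len(data) + 1) // 2

# Assumption: this runs on an interpreter that includes the fix.
decoded = data.decode('utf-16-le', 'pace_by_one')

# More characters come out than the initial estimate allowed for.
print(slots, len(decoded))
```

Because the handler re-enters the input one byte later, several errors are reported for overlapping byte pairs, and each one adds a replacement character.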

    @zhangyangyu
    Member

    Another way to crash:

    >>> import codecs
    >>> def replace_with_longer(exc):
    ...     exc.object = b'\xa0\x00' * 100
    ...     return ('\ufffd', exc.end)
    ...
    >>> codecs.register_error('replace_with_longer', replace_with_longer)
    >>> b'\xd8\xd8'.decode('utf-16-le', 'replace_with_longer')
    Debug memory block at address p=0x10b3b8c40: API 'o'
        92 bytes originally requested
        The 7 pad bytes at p-7 are FORBIDDENBYTE, as expected.
        The 8 pad bytes at tail=0x10b3b8c9c are not all FORBIDDENBYTE (0xfb):
            at tail+0: 0xa0 *** OUCH
            at tail+1: 0x00 *** OUCH
            at tail+2: 0xa0 *** OUCH
            at tail+3: 0x00 *** OUCH
            at tail+4: 0xa0 *** OUCH
            at tail+5: 0x00 *** OUCH
            at tail+6: 0xa0 *** OUCH
            at tail+7: 0x00 *** OUCH
        The block was made by call #11529390970613309440 to debug malloc/realloc.
        Data at p: 00 00 00 00 00 00 00 00 ... 00 00 00 00 fd ff a0 00

    Fatal Python error: bad trailing pad byte

    Current thread 0x00007fffab9b4340 (most recent call first):
    File "/Users/angwer/Repositories/cpython/Lib/encodings/utf_16_le.py", line 16 in decode
    File "<stdin>", line 1 in <module>
    [1] 64081 abort ~/Repositories/cpython/python.exe
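
The second crash follows from the same size estimate: the handler swaps exc.object for a longer byte string, and the decoder continues on the new input while the output buffer was sized for the original two bytes. The sketch below compares the two sizes. It assumes an interpreter that already contains the fix for this issue; the variable slots is illustrative and not part of the decoder.

```python
import codecs

def replace_with_longer(exc):
    # Swap the input for a much longer buffer, then resume at exc.end.
    exc.object = b'\xa0\x00' * 100
    return ('\ufffd', exc.end)

codecs.register_error('replace_with_longer', replace_with_longer)

data = b'\xd8\xd8'

# Estimate based on the original input only.
slots = (len(data) + 1) // 2

# Assumption: this runs on an interpreter that includes the fix.
decoded = data.decode('utf-16-le', 'replace_with_longer')

# The decoded text is sized by the replaced input, not the original one.
print(slots, len(decoded))
```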

    @zhangyangyu
    Member

    I wrote a draft patch, without tests for now; I'll add them later. Reviews are appreciated. I also checked the Windows code page equivalent and the encoders; as far as I can tell, they don't suffer from this problem.

    @zhangyangyu
    Member

    New changeset 2c7fd46 by Xiang Zhang in branch 'master':
    bpo-32583: Fix possible crashing in builtin Unicode decoders (GH-5325)
    2c7fd46

    @zhangyangyu
    Member

    New changeset ea94fce by Xiang Zhang in branch '3.6':
    [3.6] bpo-32583: Fix possible crashing in builtin Unicode decoders (GH-5325) (GH-5459)
    ea94fce

    @ned-deily
    Member

    New changeset 86fdad0 by Ned Deily (Xiang Zhang) in branch '3.7':
    bpo-32583: Fix possible crashing in builtin Unicode decoders (GH-5325)
    86fdad0

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022