
tokenizer crash/misbehavior -- heap use-after-free #69575

Closed

BrianCain mannequin opened this issue Oct 13, 2015 · 8 comments

Labels: interpreter-core (Objects, Python, Grammar, and Parser dirs), type-crash (A hard crash of the interpreter, possibly with a core dump)

Comments


BrianCain mannequin commented Oct 13, 2015

BPO 25388
Nosy @terryjreedy, @benjaminp, @serhiy-storchaka
Files
  • vuln.patch: test case illustrating problem
  • asan.txt: Output from test run with ASan enabled
  • issue25388.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2015-11-14.19:24:42.763>
    created_at = <Date 2015-10-13.03:15:15.870>
    labels = ['interpreter-core', 'type-crash']
    title = 'tokenizer crash/misbehavior -- heap use-after-free'
    updated_at = <Date 2015-11-14.19:24:42.762>
    user = 'https://bugs.python.org/BrianCain'

    bugs.python.org fields:

    activity = <Date 2015-11-14.19:24:42.762>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2015-11-14.19:24:42.763>
    closer = 'serhiy.storchaka'
    components = ['Interpreter Core']
    creation = <Date 2015-10-13.03:15:15.870>
    creator = 'Brian.Cain'
    dependencies = []
    files = ['40764', '40765', '40965']
    hgrepos = []
    issue_num = 25388
    keywords = ['patch']
    message_count = 8.0
    messages = ['252905', '252906', '253114', '253879', '254033', '254034', '254225', '254656']
    nosy_count = 5.0
    nosy_names = ['terry.reedy', 'benjamin.peterson', 'Brian.Cain', 'python-dev', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'crash'
    url = 'https://bugs.python.org/issue25388'
    versions = ['Python 2.7', 'Python 3.4', 'Python 3.5', 'Python 3.6']


    BrianCain mannequin commented Oct 13, 2015

    This issue is similar to (but I believe distinct from) the one reported earlier as http://bugs.python.org/issue24022. Tokenizer failures strike me as difficult to exploit, but risky nonetheless.

    Attached is a test case that illustrates the problem and the output from ASan when it encounters the failure.

    All of the versions below that I tested failed in one way or another (segfault, assertion failure, printing enormous blank output to console). Some fail frequently and some exhibit this failure only occasionally.

    Python 3.4.3 (default, Mar 26 2015, 22:03:40)
    Python 2.7.9 (default, Apr 2 2015, 15:33:21) [GCC 4.9.2] on linux2
    Python 3.6.0a0 (default:2a8a39640aa2+, Jul 9 2015, 12:28:50) [GCC 4.9.2] on linux
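
    A minimal sketch of such a test case (a reconstruction, not the attached vuln.patch; the payload bytes are the ones from the minimal reproducer given later in this thread):

        import subprocess
        import sys
        import tempfile

        # Undecodable source: a NUL byte on the first line and a stray
        # non-UTF-8 byte (\xfd) on the second.
        PAYLOAD = b"\x7f\x00\n\xfd\n"

        with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as f:
            f.write(PAYLOAD)
            path = f.name

        # A well-behaved interpreter rejects this with a SyntaxError and
        # exits with status 1; a negative returncode means the child died
        # on a signal (e.g. -11 for SIGSEGV).
        proc = subprocess.run([sys.executable, path],
                              stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        print("returncode:", proc.returncode)
        print("stderr:", proc.stderr[:200])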

    BrianCain mannequin added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Oct 13, 2015

    BrianCain mannequin commented Oct 13, 2015

    ASan output (attached as asan.txt)

    BrianCain mannequin changed the title from "tokenizer crash/misbehavior" to "tokenizer crash/misbehavior -- heap use-after-free" Oct 13, 2015
    BrianCain mannequin added the type-crash (A hard crash of the interpreter, possibly with a core dump) label Oct 13, 2015
    terryjreedy (Member) commented

    According to https://docs.python.org/3/reference/lexical_analysis.html#lexical-analysis, the encoding of a source file (in Python 3) defaults to UTF-8* and a decoding error is (should be) reported as a SyntaxError. Since b"\x7f\x00\x00\n''s\x01\xfd\n'S" is not valid UTF-8, I expect a UnicodeDecodeError converted to a SyntaxError.

    • compile(bytes, filename, mode) defaults to latin1 instead. It has no decoding problem, but quits with "ValueError: source code string cannot contain null bytes". On 2.7, I might expect that as the error.

    I expect '''self.assertIn(b"Non-UTF-8", res.err)''' to always fail because error messages are strings, not bytes. That aside, have you ever seen that particular text (as a string) in a SyntaxError message?

    Why do you think the crash is during the tokenizing phase? I could not see anything in the ASan report.
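
    The two behaviors described above are easy to check interactively; a sketch, with output as observed on interpreters of that era:

        src = b"\x7f\x00\x00\n''s\x01\xfd\n'S"

        # compile() on bytes: no decoding problem (the latin-1 fallback
        # described above), but the NUL bytes are rejected.
        try:
            compile(src, "vuln.py", "exec")
        except ValueError as e:
            print(e)  # source code string cannot contain null bytes

        # Decoding the same bytes as UTF-8, as the script runner would,
        # fails on the \xfd byte -- this is the error one would expect to
        # surface as a SyntaxError.
        try:
            src.decode("utf-8")
        except UnicodeDecodeError as e:
            print(e)  # 'utf-8' codec can't decode byte 0xfd in position 8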

    serhiy-storchaka (Member) commented

    Stack trace:

    #0 ascii_decode (start=0xa72f2008 "", end=0xfffff891 <error: Cannot access memory at address 0xfffff891>, dest=<optimized out>) at Objects/unicodeobject.c:4795
    #1 0x08100c0f in PyUnicode_DecodeUTF8Stateful (s=s@entry=0xa72f2008 "", size=size@entry=1490081929, errors=errors@entry=0x81f4303 "replace", consumed=consumed@entry=0x0)
    at Objects/unicodeobject.c:4871
    #2 0x081029c7 in PyUnicode_DecodeUTF8 (s=0xa72f2008 "", size=1490081929, errors=errors@entry=0x81f4303 "replace") at Objects/unicodeobject.c:4743
    #3 0x0815179a in err_input (err=0xbfffec04) at Python/pythonrun.c:1352
    #4 0x081525cf in PyParser_ASTFromFileObject (arena=0x8348118, errcode=0x0, flags=<optimized out>, ps2=0x0, ps1=0x0, start=257, enc=0x0, filename=0xb7950e00, fp=0x8347fb0)
    at Python/pythonrun.c:1163
    #5 PyRun_FileExFlags (fp=0x8347fb0, filename_str=0xb79e2eb8 "vuln.py", start=257, globals=0xb79e3d8c, locals=0xb79e3d8c, closeit=1, flags=0xbfffecec) at Python/pythonrun.c:916
    #6 0x08152744 in PyRun_SimpleFileExFlags (fp=0x8347fb0, filename=<optimized out>, closeit=1, flags=0xbfffecec) at Python/pythonrun.c:396
    #7 0x08063919 in run_file (p_cf=0xbfffecec, filename=0x82eda10 L"vuln.py", fp=0x8347fb0) at Modules/main.c:318
    #8 Py_Main (argc=argc@entry=2, argv=argv@entry=0x82ed008) at Modules/main.c:768
    #9 0x0805f345 in main (argc=2, argv=0xbfffee44) at ./Programs/python.c:69

    At #2 PyUnicode_DecodeUTF8 is called with s="" and size=1490081929. size is err->offset, and err->offset is set only in parsetok() in Parser/parsetok.c. This is the tokenizer bug.

    Minimal reproducer:

    ./python -c 'with open("vuln.py", "wb") as f: f.write(b"\x7f\x00\n\xfd\n")'
    ./python vuln.py

    The crash is gone if the code at the end of decoding_fgets() that tests for UTF-8 is commented out.

    serhiy-storchaka self-assigned this Nov 3, 2015

    BrianCain mannequin commented Nov 4, 2015

    Sorry, the report would have been clearer if I'd included a build with symbols and a stack trace.

    The test was inspired by the test from bpo-24022 (https://hg.python.org/cpython/rev/03b2259c6cd3); it sounds like it should not have been.

    But indeed it seems like you've reproduced this issue, and you agree it's a bug?
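
    For reference, an ASan build with symbols that reproduces the report below can be configured roughly like this (a sketch; exact flags vary by toolchain and clang version):

        ./configure CC=clang CFLAGS="-fsanitize=address -O1 -g" \
                    LDFLAGS="-fsanitize=address"
        make -j
        ./python -c 'with open("vuln.py", "wb") as f: f.write(b"\x7f\x00\n\xfd\n")'
        ./python vuln.py   # ASan aborts with a heap-use-after-free report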


    BrianCain mannequin commented Nov 4, 2015

    Here is a more useful ASan report:

    =================================================================
    ==12168==ERROR: AddressSanitizer: heap-use-after-free on address 0x62500001e110 at pc 0x000000697238 bp 0x7fff412b9240 sp 0x7fff412b9238
    READ of size 1 at 0x62500001e110 thread T0
    #0 0x697237 in tok_nextc /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:911:20
    #1 0x68c63b in tok_get /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:1460:13
    #2 0x689d93 in PyTokenizer_Get /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:1809:18
    #3 0x67fec3 in parsetok /home/brian/src/fuzzpy/cpython/Parser/parsetok.c:208:16
    #4 0x6837d4 in PyParser_ParseFileObject /home/brian/src/fuzzpy/cpython/Parser/parsetok.c:134:12
    #5 0x52f50c in PyParser_ASTFromFileObject /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:1150:15
    #6 0x532e16 in PyRun_FileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:916:11
    #7 0x52c3f8 in PyRun_SimpleFileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:396:13
    #8 0x52a460 in PyRun_AnyFileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:80:16
    #9 0x5cb04a in run_file /home/brian/src/fuzzpy/cpython/Modules/main.c:318:11
    #10 0x5c5a42 in Py_Main /home/brian/src/fuzzpy/cpython/Modules/main.c:768:19
    #11 0x4fbace in main /home/brian/src/fuzzpy/cpython/./Programs/python.c:69:11
    #12 0x7fe8a9a4aa3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x20a3f)
    #13 0x431548 in _start (/home/brian/src/fuzzpy/cpython/python+0x431548)

    0x62500001e110 is located 16 bytes inside of 8224-byte region [0x62500001e100,0x625000020120)
    freed by thread T0 here:
    #0 0x4cdef0 in realloc /home/brian/src/fuzzpy/llvm_src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:61
    #1 0x501280 in _PyMem_RawRealloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:84:12
    #2 0x4fc68d in _PyMem_DebugRealloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:1921:18
    #3 0x4fdf42 in PyMem_Realloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:343:12
    #4 0x69a338 in tok_nextc /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:1050:34
    #5 0x68a2c9 in tok_get /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:1357:17
    #6 0x689d93 in PyTokenizer_Get /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:1809:18
    #7 0x67fec3 in parsetok /home/brian/src/fuzzpy/cpython/Parser/parsetok.c:208:16
    #8 0x6837d4 in PyParser_ParseFileObject /home/brian/src/fuzzpy/cpython/Parser/parsetok.c:134:12
    #9 0x52f50c in PyParser_ASTFromFileObject /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:1150:15
    #10 0x532e16 in PyRun_FileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:916:11
    #11 0x52c3f8 in PyRun_SimpleFileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:396:13
    #12 0x52a460 in PyRun_AnyFileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:80:16
    #13 0x5cb04a in run_file /home/brian/src/fuzzpy/cpython/Modules/main.c:318:11
    #14 0x5c5a42 in Py_Main /home/brian/src/fuzzpy/cpython/Modules/main.c:768:19
    #15 0x4fbace in main /home/brian/src/fuzzpy/cpython/./Programs/python.c:69:11
    #16 0x7fe8a9a4aa3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x20a3f)

    previously allocated by thread T0 here:
    #0 0x4cdb88 in malloc /home/brian/src/fuzzpy/llvm_src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:40
    #1 0x501030 in _PyMem_RawMalloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:62:12
    #2 0x5074db in _PyMem_DebugAlloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:1838:22
    #3 0x4fc213 in _PyMem_DebugMalloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:1861:12
    #4 0x4fdbfa in PyMem_Malloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:325:12
    #5 0x68791d in PyTokenizer_FromFile /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:861:29
    #6 0x68359e in PyParser_ParseFileObject /home/brian/src/fuzzpy/cpython/Parser/parsetok.c:126:16
    #7 0x52f50c in PyParser_ASTFromFileObject /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:1150:15
    #8 0x532e16 in PyRun_FileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:916:11
    #9 0x52c3f8 in PyRun_SimpleFileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:396:13
    #10 0x52a460 in PyRun_AnyFileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:80:16
    #11 0x5cb04a in run_file /home/brian/src/fuzzpy/cpython/Modules/main.c:318:11
    #12 0x5c5a42 in Py_Main /home/brian/src/fuzzpy/cpython/Modules/main.c:768:19
    #13 0x4fbace in main /home/brian/src/fuzzpy/cpython/./Programs/python.c:69:11
    #14 0x7fe8a9a4aa3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x20a3f)

    SUMMARY: AddressSanitizer: heap-use-after-free /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:911:20 in tok_nextc
    Shadow bytes around the buggy address:
    0x0c4a7fffbbd0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    0x0c4a7fffbbe0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    0x0c4a7fffbbf0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    0x0c4a7fffbc00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    0x0c4a7fffbc10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    =>0x0c4a7fffbc20: fd fd[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd
    0x0c4a7fffbc30: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
    0x0c4a7fffbc40: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
    0x0c4a7fffbc50: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
    0x0c4a7fffbc60: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
    0x0c4a7fffbc70: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
    Shadow byte legend (one shadow byte represents 8 application bytes):
    Addressable: 00
    Partially addressable: 01 02 03 04 05 06 07
    Heap left redzone: fa
    Heap right redzone: fb
    Freed heap region: fd
    Stack left redzone: f1
    Stack mid redzone: f2
    Stack right redzone: f3
    Stack partial redzone: f4
    Stack after return: f5
    Stack use after scope: f8
    Global redzone: f9
    Global init order: f6
    Poisoned by user: f7
    Container overflow: fc
    Array cookie: ac
    Intra object redzone: bb
    ASan internal: fe
    Left alloca redzone: ca
    Right alloca redzone: cb
    ==12168==ABORTING

    serhiy-storchaka (Member) commented

    Yes, there is a bug. When decoding_fgets() encounters non-UTF-8 bytes, it fails and frees the input buffer in error_ret(). But since tok->cur != tok->inp, the next call of tok_nextc() reads freed memory.

            if (tok->cur != tok->inp) {
                return Py_CHARMASK(*tok->cur++); /* Fast path */
            }

    If Python does not crash here, a new buffer is allocated and assigned to tok->buf, then PyTokenizer_Get returns an error and parsetok() calculates the position of the error:

            err_ret->offset = (int)(tok->cur - tok->buf);
    

    but tok->cur still points inside the old freed buffer, so the offset becomes a huge integer. err_input() then tries to decode the part of the string before the error with the "replace" error handler, but since the position was wrongly calculated, it reads outside the allocated memory.

    The proposed patch fixes the issue. It sets tok->done and the pointers in case of a decoding error, so they are now in a consistent state. It also removes some duplicated or dead code.
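
    A sketch of that approach (paraphrasing the description above, not quoting issue25388.patch itself): on a decoding error, error_ret() NULLs every pointer into the freed buffer and records E_DECODE, so the fast path in tok_nextc() can never dereference freed memory and parsetok() computes an offset of 0 instead of garbage.

        /* Sketch of the approach, not the exact patch. */
        static char *
        error_ret(struct tok_state *tok)
        {
            tok->decoding_erred = 1;
            if (tok->fp != NULL && tok->buf != NULL)   /* see PyTokenizer_Free */
                PyMem_FREE(tok->buf);
            /* Leave the tokenizer in a consistent state: no pointer may
               keep referring into the buffer that was just freed. */
            tok->buf = tok->cur = tok->end = tok->inp = tok->start = NULL;
            tok->done = E_DECODE;
            return NULL;               /* as if it were EOF */
        }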


    python-dev mannequin commented Nov 14, 2015

    New changeset 73da4fd7542b by Serhiy Storchaka in branch '3.4':
    Issue bpo-25388: Fixed tokenizer crash when processing undecodable source code
    https://hg.python.org/cpython/rev/73da4fd7542b

    New changeset e4a69eb34ad7 by Serhiy Storchaka in branch '3.5':
    Issue bpo-25388: Fixed tokenizer crash when processing undecodable source code
    https://hg.python.org/cpython/rev/e4a69eb34ad7

    New changeset ea0c4b811eae by Serhiy Storchaka in branch 'default':
    Issue bpo-25388: Fixed tokenizer crash when processing undecodable source code
    https://hg.python.org/cpython/rev/ea0c4b811eae

    New changeset 8e472cc258ec by Serhiy Storchaka in branch '2.7':
    Issue bpo-25388: Fixed tokenizer hang when processing undecodable source code
    https://hg.python.org/cpython/rev/8e472cc258ec

    ezio-melotti transferred this issue from another repository Apr 10, 2022