
tokenizer crash/misbehavior -- heap use-after-free #69575

Closed

BrianCain mannequin opened this issue Oct 13, 2015 · 8 comments

Labels: interpreter-core (Objects, Python, Grammar, and Parser dirs), type-crash (A hard crash of the interpreter, possibly with a core dump)

Comments


BrianCain mannequin commented Oct 13, 2015

BPO 25388
Nosy @terryjreedy, @benjaminp, @serhiy-storchaka
Files
  • vuln.patch: test case illustrating problem
  • asan.txt: Output from test run with ASan enabled
  • issue25388.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2015-11-14.19:24:42.763>
    created_at = <Date 2015-10-13.03:15:15.870>
    labels = ['interpreter-core', 'type-crash']
    title = 'tokenizer crash/misbehavior -- heap use-after-free'
    updated_at = <Date 2015-11-14.19:24:42.762>
    user = 'https://bugs.python.org/BrianCain'

    bugs.python.org fields:

    activity = <Date 2015-11-14.19:24:42.762>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2015-11-14.19:24:42.763>
    closer = 'serhiy.storchaka'
    components = ['Interpreter Core']
    creation = <Date 2015-10-13.03:15:15.870>
    creator = 'Brian.Cain'
    dependencies = []
    files = ['40764', '40765', '40965']
    hgrepos = []
    issue_num = 25388
    keywords = ['patch']
    message_count = 8.0
    messages = ['252905', '252906', '253114', '253879', '254033', '254034', '254225', '254656']
    nosy_count = 5.0
    nosy_names = ['terry.reedy', 'benjamin.peterson', 'Brian.Cain', 'python-dev', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'crash'
    url = 'https://bugs.python.org/issue25388'
    versions = ['Python 2.7', 'Python 3.4', 'Python 3.5', 'Python 3.6']


    BrianCain mannequin commented Oct 13, 2015

    This issue is similar to (but I believe distinct from) the one reported earlier as http://bugs.python.org/issue24022. Tokenizer failures strike me as difficult to exploit, but risky nonetheless.

    Attached is a test case that illustrates the problem and the output from ASan when it encounters the failure.

    All of the versions below that I tested failed in one way or another (segfault, assertion failure, printing enormous blank output to console). Some fail frequently and some exhibit this failure only occasionally.

    Python 3.4.3 (default, Mar 26 2015, 22:03:40)
    Python 2.7.9 (default, Apr 2 2015, 15:33:21) [GCC 4.9.2] on linux2
    Python 3.6.0a0 (default:2a8a39640aa2+, Jul 9 2015, 12:28:50) [GCC 4.9.2] on linux
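
    A minimal sketch of such a test case (a reconstruction, not the attached vuln.patch; the payload bytes are the ones from the minimal reproducer given later in this thread):

        import subprocess
        import sys
        import tempfile

        # Undecodable source: a NUL byte on the first line and a stray
        # non-UTF-8 byte (\xfd) on the second.
        PAYLOAD = b"\x7f\x00\n\xfd\n"

        with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as f:
            f.write(PAYLOAD)
            path = f.name

        # A well-behaved interpreter rejects this with a SyntaxError and
        # exits with status 1; a negative returncode means the child died
        # on a signal (e.g. -11 for SIGSEGV).
        proc = subprocess.run([sys.executable, path],
                              stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        print("returncode:", proc.returncode)
        print("stderr:", proc.stderr[:200])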

    BrianCain mannequin added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Oct 13, 2015

    BrianCain mannequin commented Oct 13, 2015

    ASan output (attached as asan.txt)

    BrianCain mannequin changed the title from "tokenizer crash/misbehavior" to "tokenizer crash/misbehavior -- heap use-after-free" Oct 13, 2015
    BrianCain mannequin added the type-crash (A hard crash of the interpreter, possibly with a core dump) label Oct 13, 2015
    terryjreedy (Member) commented

    According to https://docs.python.org/3/reference/lexical_analysis.html#lexical-analysis, the encoding of a source file (in Python 3) defaults to UTF-8* and a decoding error is (should be) reported as a SyntaxError. Since b"\x7f\x00\x00\n''s\x01\xfd\n'S" is not valid UTF-8, I expect a UnicodeDecodeError converted to a SyntaxError.

    • compile(bytes, filename, mode) defaults to latin1 instead. It has no decoding problem, but quits with "ValueError: source code string cannot contain null bytes". On 2.7, I might expect that as the error.

    I expect '''self.assertIn(b"Non-UTF-8", res.err)''' to always fail because error messages are strings, not bytes. That aside, have you ever seen that particular text (as a string) in a SyntaxError message?

    Why do you think the crash is during the tokenizing phase? I could not see anything in the ASan report.
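
    The two behaviors described above are easy to check interactively; a sketch, with output as observed on interpreters of that era:

        src = b"\x7f\x00\x00\n''s\x01\xfd\n'S"

        # compile() on bytes: no decoding problem (the latin-1 fallback
        # described above), but the NUL bytes are rejected.
        try:
            compile(src, "vuln.py", "exec")
        except ValueError as e:
            print(e)  # source code string cannot contain null bytes

        # Decoding the same bytes as UTF-8, as the script runner would,
        # fails on the \xfd byte -- this is the error one would expect to
        # surface as a SyntaxError.
        try:
            src.decode("utf-8")
        except UnicodeDecodeError as e:
            print(e)  # 'utf-8' codec can't decode byte 0xfd in position 8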

    serhiy-storchaka (Member) commented

    Stack trace:

    #0 ascii_decode (start=0xa72f2008 "", end=0xfffff891 <error: Cannot access memory at address 0xfffff891>, dest=<optimized out>) at Objects/unicodeobject.c:4795
    #1 0x08100c0f in PyUnicode_DecodeUTF8Stateful (s=s@entry=0xa72f2008 "", size=size@entry=1490081929, errors=errors@entry=0x81f4303 "replace", consumed=consumed@entry=0x0)
    at Objects/unicodeobject.c:4871
    #2 0x081029c7 in PyUnicode_DecodeUTF8 (s=0xa72f2008 "", size=1490081929, errors=errors@entry=0x81f4303 "replace") at Objects/unicodeobject.c:4743
    #3 0x0815179a in err_input (err=0xbfffec04) at Python/pythonrun.c:1352
    #4 0x081525cf in PyParser_ASTFromFileObject (arena=0x8348118, errcode=0x0, flags=<optimized out>, ps2=0x0, ps1=0x0, start=257, enc=0x0, filename=0xb7950e00, fp=0x8347fb0)
    at Python/pythonrun.c:1163
    #5 PyRun_FileExFlags (fp=0x8347fb0, filename_str=0xb79e2eb8 "vuln.py", start=257, globals=0xb79e3d8c, locals=0xb79e3d8c, closeit=1, flags=0xbfffecec) at Python/pythonrun.c:916
    #6 0x08152744 in PyRun_SimpleFileExFlags (fp=0x8347fb0, filename=<optimized out>, closeit=1, flags=0xbfffecec) at Python/pythonrun.c:396
    #7 0x08063919 in run_file (p_cf=0xbfffecec, filename=0x82eda10 L"vuln.py", fp=0x8347fb0) at Modules/main.c:318
    #8 Py_Main (argc=argc@entry=2, argv=argv@entry=0x82ed008) at Modules/main.c:768
    #9 0x0805f345 in main (argc=2, argv=0xbfffee44) at ./Programs/python.c:69

    At #2 PyUnicode_DecodeUTF8 is called with s="" and size=1490081929. size is err->offset, and err->offset is set only in parsetok() in Parser/parsetok.c. This is the tokenizer bug.

    Minimal reproducer:

    ./python -c 'with open("vuln.py", "wb") as f: f.write(b"\x7f\x00\n\xfd\n")'
    ./python vuln.py

    The crash is gone if the code at the end of decoding_fgets() that tests for UTF-8 is commented out.

    serhiy-storchaka self-assigned this Nov 3, 2015

    BrianCain mannequin commented Nov 4, 2015

    Sorry, the report would have been clearer if I'd included a build with symbols and a stack trace.

    The test was inspired by the test from bpo-24022 (https://hg.python.org/cpython/rev/03b2259c6cd3); it sounds like it should not have been.

    But indeed it seems like you've reproduced this issue, and you agree it's a bug?
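
    For reference, an ASan build with symbols that reproduces the report below can be configured roughly like this (a sketch; exact flags vary by toolchain and clang version):

        ./configure CC=clang CFLAGS="-fsanitize=address -O1 -g" \
                    LDFLAGS="-fsanitize=address"
        make -j
        ./python -c 'with open("vuln.py", "wb") as f: f.write(b"\x7f\x00\n\xfd\n")'
        ./python vuln.py   # ASan aborts with a heap-use-after-free report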


    BrianCain mannequin commented Nov 4, 2015

    Here is a more useful ASan report:

    =================================================================
    ==12168==ERROR: AddressSanitizer: heap-use-after-free on address 0x62500001e110 at pc 0x000000697238 bp 0x7fff412b9240 sp 0x7fff412b9238
    READ of size 1 at 0x62500001e110 thread T0
    #0 0x697237 in tok_nextc /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:911:20
    #1 0x68c63b in tok_get /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:1460:13
    #2 0x689d93 in PyTokenizer_Get /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:1809:18
    #3 0x67fec3 in parsetok /home/brian/src/fuzzpy/cpython/Parser/parsetok.c:208:16
    #4 0x6837d4 in PyParser_ParseFileObject /home/brian/src/fuzzpy/cpython/Parser/parsetok.c:134:12
    #5 0x52f50c in PyParser_ASTFromFileObject /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:1150:15
    #6 0x532e16 in PyRun_FileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:916:11
    #7 0x52c3f8 in PyRun_SimpleFileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:396:13
    #8 0x52a460 in PyRun_AnyFileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:80:16
    #9 0x5cb04a in run_file /home/brian/src/fuzzpy/cpython/Modules/main.c:318:11
    #10 0x5c5a42 in Py_Main /home/brian/src/fuzzpy/cpython/Modules/main.c:768:19
    #11 0x4fbace in main /home/brian/src/fuzzpy/cpython/./Programs/python.c:69:11
    #12 0x7fe8a9a4aa3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x20a3f)
    #13 0x431548 in _start (/home/brian/src/fuzzpy/cpython/python+0x431548)

    0x62500001e110 is located 16 bytes inside of 8224-byte region [0x62500001e100,0x625000020120)
    freed by thread T0 here:
    #0 0x4cdef0 in realloc /home/brian/src/fuzzpy/llvm_src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:61
    #1 0x501280 in _PyMem_RawRealloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:84:12
    #2 0x4fc68d in _PyMem_DebugRealloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:1921:18
    #3 0x4fdf42 in PyMem_Realloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:343:12
    #4 0x69a338 in tok_nextc /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:1050:34
    #5 0x68a2c9 in tok_get /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:1357:17
    #6 0x689d93 in PyTokenizer_Get /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:1809:18
    #7 0x67fec3 in parsetok /home/brian/src/fuzzpy/cpython/Parser/parsetok.c:208:16
    #8 0x6837d4 in PyParser_ParseFileObject /home/brian/src/fuzzpy/cpython/Parser/parsetok.c:134:12
    #9 0x52f50c in PyParser_ASTFromFileObject /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:1150:15
    #10 0x532e16 in PyRun_FileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:916:11
    #11 0x52c3f8 in PyRun_SimpleFileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:396:13
    #12 0x52a460 in PyRun_AnyFileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:80:16
    #13 0x5cb04a in run_file /home/brian/src/fuzzpy/cpython/Modules/main.c:318:11
    #14 0x5c5a42 in Py_Main /home/brian/src/fuzzpy/cpython/Modules/main.c:768:19
    #15 0x4fbace in main /home/brian/src/fuzzpy/cpython/./Programs/python.c:69:11
    #16 0x7fe8a9a4aa3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x20a3f)

    previously allocated by thread T0 here:
    #0 0x4cdb88 in malloc /home/brian/src/fuzzpy/llvm_src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:40
    #1 0x501030 in _PyMem_RawMalloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:62:12
    #2 0x5074db in _PyMem_DebugAlloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:1838:22
    #3 0x4fc213 in _PyMem_DebugMalloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:1861:12
    #4 0x4fdbfa in PyMem_Malloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:325:12
    #5 0x68791d in PyTokenizer_FromFile /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:861:29
    #6 0x68359e in PyParser_ParseFileObject /home/brian/src/fuzzpy/cpython/Parser/parsetok.c:126:16
    #7 0x52f50c in PyParser_ASTFromFileObject /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:1150:15
    #8 0x532e16 in PyRun_FileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:916:11
    #9 0x52c3f8 in PyRun_SimpleFileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:396:13
    #10 0x52a460 in PyRun_AnyFileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:80:16
    #11 0x5cb04a in run_file /home/brian/src/fuzzpy/cpython/Modules/main.c:318:11
    #12 0x5c5a42 in Py_Main /home/brian/src/fuzzpy/cpython/Modules/main.c:768:19
    #13 0x4fbace in main /home/brian/src/fuzzpy/cpython/./Programs/python.c:69:11
    #14 0x7fe8a9a4aa3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x20a3f)

    SUMMARY: AddressSanitizer: heap-use-after-free /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:911:20 in tok_nextc
    Shadow bytes around the buggy address:
    0x0c4a7fffbbd0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    0x0c4a7fffbbe0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    0x0c4a7fffbbf0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    0x0c4a7fffbc00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    0x0c4a7fffbc10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    =>0x0c4a7fffbc20: fd fd[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd
    0x0c4a7fffbc30: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
    0x0c4a7fffbc40: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
    0x0c4a7fffbc50: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
    0x0c4a7fffbc60: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
    0x0c4a7fffbc70: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
    Shadow byte legend (one shadow byte represents 8 application bytes):
    Addressable: 00
    Partially addressable: 01 02 03 04 05 06 07
    Heap left redzone: fa
    Heap right redzone: fb
    Freed heap region: fd
    Stack left redzone: f1
    Stack mid redzone: f2
    Stack right redzone: f3
    Stack partial redzone: f4
    Stack after return: f5
    Stack use after scope: f8
    Global redzone: f9
    Global init order: f6
    Poisoned by user: f7
    Container overflow: fc
    Array cookie: ac
    Intra object redzone: bb
    ASan internal: fe
    Left alloca redzone: ca
    Right alloca redzone: cb
    ==12168==ABORTING

    serhiy-storchaka (Member) commented

    Yes, there is a bug. When decoding_fgets() encounters non-UTF-8 bytes, it fails and frees the input buffer in error_ret(). But since tok->cur != tok->inp, the next call of tok_nextc() reads freed memory.

            if (tok->cur != tok->inp) {
                return Py_CHARMASK(*tok->cur++); /* Fast path */
            }

    If Python does not crash here, a new buffer is allocated and assigned to tok->buf, then PyTokenizer_Get returns an error and parsetok() calculates the position of the error:

            err_ret->offset = (int)(tok->cur - tok->buf);
    

    but tok->cur still points inside the old freed buffer, so the offset becomes a huge integer. err_input() then tries to decode the part of the string before the error with the "replace" error handler, but since the position was wrongly calculated, it reads outside the allocated memory.

    The proposed patch fixes the issue. It sets tok->done and the pointers in case of a decoding error, so they are now in a consistent state. It also removes some duplicated or dead code.
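
    A sketch of that approach (paraphrasing the description above, not quoting issue25388.patch itself): on a decoding error, error_ret() NULLs every pointer into the freed buffer and records E_DECODE, so the fast path in tok_nextc() can never dereference freed memory and parsetok() computes an offset of 0 instead of garbage.

        /* Sketch of the approach, not the exact patch. */
        static char *
        error_ret(struct tok_state *tok)
        {
            tok->decoding_erred = 1;
            if (tok->fp != NULL && tok->buf != NULL)   /* see PyTokenizer_Free */
                PyMem_FREE(tok->buf);
            /* Leave the tokenizer in a consistent state: no pointer may
               keep referring into the buffer that was just freed. */
            tok->buf = tok->cur = tok->end = tok->inp = tok->start = NULL;
            tok->done = E_DECODE;
            return NULL;               /* as if it were EOF */
        }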


    python-dev mannequin commented Nov 14, 2015

    New changeset 73da4fd7542b by Serhiy Storchaka in branch '3.4':
    Issue bpo-25388: Fixed tokenizer crash when processing undecodable source code
    https://hg.python.org/cpython/rev/73da4fd7542b

    New changeset e4a69eb34ad7 by Serhiy Storchaka in branch '3.5':
    Issue bpo-25388: Fixed tokenizer crash when processing undecodable source code
    https://hg.python.org/cpython/rev/e4a69eb34ad7

    New changeset ea0c4b811eae by Serhiy Storchaka in branch 'default':
    Issue bpo-25388: Fixed tokenizer crash when processing undecodable source code
    https://hg.python.org/cpython/rev/ea0c4b811eae

    New changeset 8e472cc258ec by Serhiy Storchaka in branch '2.7':
    Issue bpo-25388: Fixed tokenizer hang when processing undecodable source code
    https://hg.python.org/cpython/rev/8e472cc258ec

    ezio-melotti transferred this issue from another repository Apr 10, 2022