This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

classification
Title: tokenizer crash/misbehavior -- heap use-after-free
Type: crash Stage: resolved
Components: Interpreter Core Versions: Python 2.7, Python 3.4, Python 3.5, Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: Brian.Cain, benjamin.peterson, python-dev, serhiy.storchaka, terry.reedy
Priority: normal Keywords: patch

Created on 2015-10-13 03:15 by Brian.Cain, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name         Uploaded                            Description
vuln.patch        Brian.Cain, 2015-10-13 03:15        test case illustrating the problem
asan.txt          Brian.Cain, 2015-10-13 03:15        output from a test run with ASan enabled
issue25388.patch  serhiy.storchaka, 2015-11-06 21:34
Messages (8)
msg252905 - (view) Author: Brian Cain (Brian.Cain) * Date: 2015-10-13 03:15
This issue is similar to (but I believe distinct from) the one reported earlier as http://bugs.python.org/issue24022.  Tokenizer failures strike me as difficult to exploit, but risky nonetheless.

Attached is a test case that illustrates the problem and the output from ASan when it encounters the failure.

All of the versions below that I tested failed in one way or another (segfault, assertion failure, printing enormous amounts of blank output to the console). Some fail frequently and some exhibit this failure only occasionally.

Python 3.4.3 (default, Mar 26 2015, 22:03:40) 
Python 2.7.9 (default, Apr  2 2015, 15:33:21) [GCC 4.9.2] on linux2
Python 3.6.0a0 (default:2a8a39640aa2+, Jul  9 2015, 12:28:50) [GCC 4.9.2] on linux
msg252906 - (view) Author: Brian Cain (Brian.Cain) * Date: 2015-10-13 03:15
ASan output
msg253114 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2015-10-17 02:36
According to https://docs.python.org/3/reference/lexical_analysis.html#lexical-analysis, the encoding of a source file (in Python 3) defaults to UTF-8* and a decoding error is (should be) reported as a SyntaxError. Since b"\x7f\x00\x00\n''s\x01\xfd\n'S" is not valid UTF-8, I expect a UnicodeDecodeError converted to a SyntaxError.

* compile(bytes, filename, mode) defaults to latin-1 instead. It has no decoding problem, but quits with "ValueError: source code string cannot contain null bytes". On 2.7, I might expect that as the error.

I expect '''self.assertIn(b"Non-UTF-8", res.err)''' to always fail because error messages are strings, not bytes. That aside, have you ever seen that particular text (as a string) in a SyntaxError message?

Why do you think the crash is during the tokenizing phase?  I could not see anything in the ASan report.
msg253879 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-11-01 21:47
Stack trace:

#0  ascii_decode (start=0xa72f2008 "", end=0xfffff891 <error: Cannot access memory at address 0xfffff891>, dest=<optimized out>) at Objects/unicodeobject.c:4795
#1  0x08100c0f in PyUnicode_DecodeUTF8Stateful (s=s@entry=0xa72f2008 "", size=size@entry=1490081929, errors=errors@entry=0x81f4303 "replace", consumed=consumed@entry=0x0)
    at Objects/unicodeobject.c:4871
#2  0x081029c7 in PyUnicode_DecodeUTF8 (s=0xa72f2008 "", size=1490081929, errors=errors@entry=0x81f4303 "replace") at Objects/unicodeobject.c:4743
#3  0x0815179a in err_input (err=0xbfffec04) at Python/pythonrun.c:1352
#4  0x081525cf in PyParser_ASTFromFileObject (arena=0x8348118, errcode=0x0, flags=<optimized out>, ps2=0x0, ps1=0x0, start=257, enc=0x0, filename=0xb7950e00, fp=0x8347fb0)
    at Python/pythonrun.c:1163
#5  PyRun_FileExFlags (fp=0x8347fb0, filename_str=0xb79e2eb8 "vuln.py", start=257, globals=0xb79e3d8c, locals=0xb79e3d8c, closeit=1, flags=0xbfffecec) at Python/pythonrun.c:916
#6  0x08152744 in PyRun_SimpleFileExFlags (fp=0x8347fb0, filename=<optimized out>, closeit=1, flags=0xbfffecec) at Python/pythonrun.c:396
#7  0x08063919 in run_file (p_cf=0xbfffecec, filename=0x82eda10 L"vuln.py", fp=0x8347fb0) at Modules/main.c:318
#8  Py_Main (argc=argc@entry=2, argv=argv@entry=0x82ed008) at Modules/main.c:768
#9  0x0805f345 in main (argc=2, argv=0xbfffee44) at ./Programs/python.c:69

At frame #2, PyUnicode_DecodeUTF8() is called with s="" and size=1490081929. size comes from err->offset, and err->offset is set only in parsetok() in Parser/parsetok.c. This is the tokenizer bug.

Minimal reproducer:

./python -c 'with open("vuln.py", "wb") as f: f.write(b"\x7f\x00\n\xfd\n")'
./python vuln.py

The crash is gone if we comment out the code at the end of decoding_fgets() that tests for valid UTF-8.
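
For context, the check in question at the end of decoding_fgets() (Parser/tokenizer.c) looks roughly like the paraphrased sketch below; this is not the verbatim source of that era. When a line is not valid UTF-8, it raises SyntaxError and returns through error_ret(), which frees tok->buf:

    /* Paraphrased sketch of the UTF-8 check at the end of decoding_fgets().
       valid_utf8() returns the length of the UTF-8 sequence at c, or 0. */
    if (line && !tok->encoding) {
        unsigned char *c;
        int length;
        for (c = (unsigned char *)line; *c; c += length) {
            if (!(length = valid_utf8(c))) {
                PyErr_Format(PyExc_SyntaxError,
                             "Non-UTF-8 code starting with '\\x%.2x' in file %U "
                             "on line %i, but no encoding declared",
                             *c, tok->filename, tok->lineno + 1);
                return error_ret(tok);   /* frees tok->buf (see msg254225) */
            }
        }
    }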
msg254033 - (view) Author: Brian Cain (Brian.Cain) * Date: 2015-11-04 00:32
Sorry, the report would have been clearer if I'd included a build with symbols and a stack trace.

The test was inspired by the test from issue24022 (https://hg.python.org/cpython/rev/03b2259c6cd3); it sounds like it should not have been.

But indeed it seems like you've reproduced this issue, and you agree it's a bug?
msg254034 - (view) Author: Brian Cain (Brian.Cain) * Date: 2015-11-04 00:47
Here is a more useful ASan report:

=================================================================
==12168==ERROR: AddressSanitizer: heap-use-after-free on address 0x62500001e110 at pc 0x000000697238 bp 0x7fff412b9240 sp 0x7fff412b9238
READ of size 1 at 0x62500001e110 thread T0
    #0 0x697237 in tok_nextc /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:911:20
    #1 0x68c63b in tok_get /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:1460:13
    #2 0x689d93 in PyTokenizer_Get /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:1809:18
    #3 0x67fec3 in parsetok /home/brian/src/fuzzpy/cpython/Parser/parsetok.c:208:16
    #4 0x6837d4 in PyParser_ParseFileObject /home/brian/src/fuzzpy/cpython/Parser/parsetok.c:134:12
    #5 0x52f50c in PyParser_ASTFromFileObject /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:1150:15
    #6 0x532e16 in PyRun_FileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:916:11
    #7 0x52c3f8 in PyRun_SimpleFileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:396:13
    #8 0x52a460 in PyRun_AnyFileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:80:16
    #9 0x5cb04a in run_file /home/brian/src/fuzzpy/cpython/Modules/main.c:318:11
    #10 0x5c5a42 in Py_Main /home/brian/src/fuzzpy/cpython/Modules/main.c:768:19
    #11 0x4fbace in main /home/brian/src/fuzzpy/cpython/./Programs/python.c:69:11
    #12 0x7fe8a9a4aa3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x20a3f)
    #13 0x431548 in _start (/home/brian/src/fuzzpy/cpython/python+0x431548)

0x62500001e110 is located 16 bytes inside of 8224-byte region [0x62500001e100,0x625000020120)
freed by thread T0 here:
    #0 0x4cdef0 in realloc /home/brian/src/fuzzpy/llvm_src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:61
    #1 0x501280 in _PyMem_RawRealloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:84:12
    #2 0x4fc68d in _PyMem_DebugRealloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:1921:18
    #3 0x4fdf42 in PyMem_Realloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:343:12
    #4 0x69a338 in tok_nextc /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:1050:34
    #5 0x68a2c9 in tok_get /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:1357:17
    #6 0x689d93 in PyTokenizer_Get /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:1809:18
    #7 0x67fec3 in parsetok /home/brian/src/fuzzpy/cpython/Parser/parsetok.c:208:16
    #8 0x6837d4 in PyParser_ParseFileObject /home/brian/src/fuzzpy/cpython/Parser/parsetok.c:134:12
    #9 0x52f50c in PyParser_ASTFromFileObject /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:1150:15
    #10 0x532e16 in PyRun_FileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:916:11
    #11 0x52c3f8 in PyRun_SimpleFileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:396:13
    #12 0x52a460 in PyRun_AnyFileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:80:16
    #13 0x5cb04a in run_file /home/brian/src/fuzzpy/cpython/Modules/main.c:318:11
    #14 0x5c5a42 in Py_Main /home/brian/src/fuzzpy/cpython/Modules/main.c:768:19
    #15 0x4fbace in main /home/brian/src/fuzzpy/cpython/./Programs/python.c:69:11
    #16 0x7fe8a9a4aa3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x20a3f)

previously allocated by thread T0 here:
    #0 0x4cdb88 in malloc /home/brian/src/fuzzpy/llvm_src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:40
    #1 0x501030 in _PyMem_RawMalloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:62:12
    #2 0x5074db in _PyMem_DebugAlloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:1838:22
    #3 0x4fc213 in _PyMem_DebugMalloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:1861:12
    #4 0x4fdbfa in PyMem_Malloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:325:12
    #5 0x68791d in PyTokenizer_FromFile /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:861:29
    #6 0x68359e in PyParser_ParseFileObject /home/brian/src/fuzzpy/cpython/Parser/parsetok.c:126:16
    #7 0x52f50c in PyParser_ASTFromFileObject /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:1150:15
    #8 0x532e16 in PyRun_FileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:916:11
    #9 0x52c3f8 in PyRun_SimpleFileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:396:13
    #10 0x52a460 in PyRun_AnyFileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:80:16
    #11 0x5cb04a in run_file /home/brian/src/fuzzpy/cpython/Modules/main.c:318:11
    #12 0x5c5a42 in Py_Main /home/brian/src/fuzzpy/cpython/Modules/main.c:768:19
    #13 0x4fbace in main /home/brian/src/fuzzpy/cpython/./Programs/python.c:69:11
    #14 0x7fe8a9a4aa3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x20a3f)

SUMMARY: AddressSanitizer: heap-use-after-free /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:911:20 in tok_nextc
Shadow bytes around the buggy address:
  0x0c4a7fffbbd0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4a7fffbbe0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4a7fffbbf0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4a7fffbc00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4a7fffbc10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x0c4a7fffbc20: fd fd[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c4a7fffbc30: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c4a7fffbc40: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c4a7fffbc50: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c4a7fffbc60: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c4a7fffbc70: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Heap right redzone:      fb
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack partial redzone:   f4
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==12168==ABORTING
msg254225 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-11-06 21:34
Yes, there is a bug. When decoding_fgets() encounters non-UTF-8 bytes, it fails and frees the input buffer in error_ret(). But since tok->cur != tok->inp, the next call of tok_nextc() reads freed memory:

        if (tok->cur != tok->inp) {
            return Py_CHARMASK(*tok->cur++); /* Fast path */
        }

If Python does not crash here, a new buffer is allocated and assigned to tok->buf. PyTokenizer_Get() then returns an error, and parsetok() calculates the position of the error:

            err_ret->offset = (int)(tok->cur - tok->buf);

but tok->cur still points inside the old, freed buffer, so the offset becomes a huge integer. err_input() then tries to decode the part of the string before the error with the "replace" error handler, but since the position was wrongly calculated, it reads outside of allocated memory.

The proposed patch fixes the issue. It sets tok->done and the buffer pointers in case of a decoding error, so they are now in a consistent state. It also removes some duplicated or dead code.
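
The shape of that fix, as a sketch consistent with the description above (not the verbatim changeset): error_ret() leaves the tok_state with no pointers into the freed buffer, so the fast path in tok_nextc() is never taken on freed memory and parsetok() cannot compute a garbage offset.

    /* Sketch of the fix (not the verbatim patch): on a decoding error,
       free the buffer, null out every pointer into it, and record the
       reason in tok->done so callers see a consistent EOF-like state. */
    static char *
    error_ret(struct tok_state *tok)
    {
        tok->decoding_erred = 1;
        if (tok->fp != NULL && tok->buf != NULL)   /* see PyTokenizer_Free() */
            PyMem_FREE(tok->buf);
        tok->buf = tok->cur = tok->end = tok->inp = tok->start = NULL;
        tok->done = E_DECODE;
        return NULL;                /* as if it were EOF */
    }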
msg254656 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-11-14 13:15
New changeset 73da4fd7542b by Serhiy Storchaka in branch '3.4':
Issue #25388: Fixed tokenizer crash when processing undecodable source code
https://hg.python.org/cpython/rev/73da4fd7542b

New changeset e4a69eb34ad7 by Serhiy Storchaka in branch '3.5':
Issue #25388: Fixed tokenizer crash when processing undecodable source code
https://hg.python.org/cpython/rev/e4a69eb34ad7

New changeset ea0c4b811eae by Serhiy Storchaka in branch 'default':
Issue #25388: Fixed tokenizer crash when processing undecodable source code
https://hg.python.org/cpython/rev/ea0c4b811eae

New changeset 8e472cc258ec by Serhiy Storchaka in branch '2.7':
Issue #25388: Fixed tokenizer hang when processing undecodable source code
https://hg.python.org/cpython/rev/8e472cc258ec
History
Date User Action Args
2022-04-11 14:58:22  admin             set  github: 69575
2015-11-14 19:24:42  serhiy.storchaka  set  status: open -> closed
    resolution: fixed
    stage: patch review -> resolved
2015-11-14 13:15:01  python-dev        set  nosy: + python-dev
    messages: + msg254656
2015-11-06 21:34:38  serhiy.storchaka  set  files: + issue25388.patch
    messages: + msg254225
    stage: patch review
2015-11-04 00:47:46  Brian.Cain        set  messages: + msg254034
2015-11-04 00:32:26  Brian.Cain        set  messages: + msg254033
2015-11-03 10:57:40  serhiy.storchaka  set  assignee: serhiy.storchaka
2015-11-01 21:47:17  serhiy.storchaka  set  nosy: + serhiy.storchaka, benjamin.peterson
    messages: + msg253879
2015-10-17 02:36:08  terry.reedy       set  nosy: + terry.reedy
    messages: + msg253114
    versions: + Python 3.5
2015-10-13 03:29:42  Brian.Cain        set  type: crash
2015-10-13 03:16:44  Brian.Cain        set  title: tokenizer crash/misbehavior -> tokenizer crash/misbehavior -- heap use-after-free
2015-10-13 03:16:00  Brian.Cain        set  files: + asan.txt
    messages: + msg252906
2015-10-13 03:15:15  Brian.Cain        create