This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

classification
Title: tokenizer crash/misbehavior -- heap use-after-free
Type: crash Stage: resolved
Components: Interpreter Core Versions: Python 2.7, Python 3.4, Python 3.5, Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: Brian.Cain, benjamin.peterson, python-dev, serhiy.storchaka, terry.reedy
Priority: normal Keywords: patch

Created on 2015-10-13 03:15 by Brian.Cain, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name         Uploaded                            Description
vuln.patch        Brian.Cain, 2015-10-13 03:15        test case illustrating the problem
asan.txt          Brian.Cain, 2015-10-13 03:15        output from a test run with ASan enabled
issue25388.patch  serhiy.storchaka, 2015-11-06 21:34
Messages (8)
msg252905 - (view) Author: Brian Cain (Brian.Cain) * Date: 2015-10-13 03:15
This issue is similar to (but I believe distinct from) the one reported earlier as http://bugs.python.org/issue24022.  Tokenizer failures strike me as difficult to exploit, but risky nonetheless.

Attached is a test case that illustrates the problem and the output from ASan when it encounters the failure.

All of the versions below that I tested failed in one way or another (segfault, assertion failure, printing enormous amounts of blank output to the console). Some fail frequently and some exhibit this failure only occasionally.

Python 3.4.3 (default, Mar 26 2015, 22:03:40) 
Python 2.7.9 (default, Apr  2 2015, 15:33:21) [GCC 4.9.2] on linux2
Python 3.6.0a0 (default:2a8a39640aa2+, Jul  9 2015, 12:28:50) [GCC 4.9.2] on linux
msg252906 - (view) Author: Brian Cain (Brian.Cain) * Date: 2015-10-13 03:15
ASan output
msg253114 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2015-10-17 02:36
According to https://docs.python.org/3/reference/lexical_analysis.html#lexical-analysis, the encoding of a source file (in Python 3) defaults to UTF-8* and a decoding error is (should be) reported as a SyntaxError. Since b"\x7f\x00\x00\n''s\x01\xfd\n'S" is not valid UTF-8, I expect a UnicodeDecodeError converted to a SyntaxError.

* compile(bytes, filename, mode) defaults to latin-1 instead. It has no decoding problem, but quits with "ValueError: source code string cannot contain null bytes". On 2.7, I might expect that as the error.

I expect '''self.assertIn(b"Non-UTF-8", res.err)''' to always fail because error messages are strings, not bytes. That aside, have you ever seen that particular text (as a string) in a SyntaxError message?

Why do you think the crash is during the tokenizing phase?  I could not see anything in the ASan report.
msg253879 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-11-01 21:47
Stack trace:

#0  ascii_decode (start=0xa72f2008 "", end=0xfffff891 <error: Cannot access memory at address 0xfffff891>, dest=<optimized out>) at Objects/unicodeobject.c:4795
#1  0x08100c0f in PyUnicode_DecodeUTF8Stateful (s=s@entry=0xa72f2008 "", size=size@entry=1490081929, errors=errors@entry=0x81f4303 "replace", consumed=consumed@entry=0x0)
    at Objects/unicodeobject.c:4871
#2  0x081029c7 in PyUnicode_DecodeUTF8 (s=0xa72f2008 "", size=1490081929, errors=errors@entry=0x81f4303 "replace") at Objects/unicodeobject.c:4743
#3  0x0815179a in err_input (err=0xbfffec04) at Python/pythonrun.c:1352
#4  0x081525cf in PyParser_ASTFromFileObject (arena=0x8348118, errcode=0x0, flags=<optimized out>, ps2=0x0, ps1=0x0, start=257, enc=0x0, filename=0xb7950e00, fp=0x8347fb0)
    at Python/pythonrun.c:1163
#5  PyRun_FileExFlags (fp=0x8347fb0, filename_str=0xb79e2eb8 "vuln.py", start=257, globals=0xb79e3d8c, locals=0xb79e3d8c, closeit=1, flags=0xbfffecec) at Python/pythonrun.c:916
#6  0x08152744 in PyRun_SimpleFileExFlags (fp=0x8347fb0, filename=<optimized out>, closeit=1, flags=0xbfffecec) at Python/pythonrun.c:396
#7  0x08063919 in run_file (p_cf=0xbfffecec, filename=0x82eda10 L"vuln.py", fp=0x8347fb0) at Modules/main.c:318
#8  Py_Main (argc=argc@entry=2, argv=argv@entry=0x82ed008) at Modules/main.c:768
#9  0x0805f345 in main (argc=2, argv=0xbfffee44) at ./Programs/python.c:69

At frame #2, PyUnicode_DecodeUTF8() is called with s="" and size=1490081929. size comes from err->offset, and err->offset is set only in parsetok() in Parser/parsetok.c. This is the tokenizer bug.

Minimal reproducer:

./python -c 'with open("vuln.py", "wb") as f: f.write(b"\x7f\x00\n\xfd\n")'
./python vuln.py

The crash is gone if we comment out the code at the end of decoding_fgets() that tests for valid UTF-8.
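
For context, the check in question at the end of decoding_fgets() (Parser/tokenizer.c) looks roughly like the paraphrased sketch below; this is not the verbatim source of that era. When a line is not valid UTF-8, it raises SyntaxError and returns through error_ret(), which frees tok->buf:

    /* Paraphrased sketch of the UTF-8 check at the end of decoding_fgets().
       valid_utf8() returns the length of the UTF-8 sequence at c, or 0. */
    if (line && !tok->encoding) {
        unsigned char *c;
        int length;
        for (c = (unsigned char *)line; *c; c += length) {
            if (!(length = valid_utf8(c))) {
                PyErr_Format(PyExc_SyntaxError,
                             "Non-UTF-8 code starting with '\\x%.2x' in file %U "
                             "on line %i, but no encoding declared",
                             *c, tok->filename, tok->lineno + 1);
                return error_ret(tok);   /* frees tok->buf (see msg254225) */
            }
        }
    }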
msg254033 - (view) Author: Brian Cain (Brian.Cain) * Date: 2015-11-04 00:32
Sorry, the report would have been clearer if I'd included a build with symbols and a stack trace.

The test was inspired by the test from issue24022 (https://hg.python.org/cpython/rev/03b2259c6cd3); it sounds like it should not have been.

But indeed it seems like you've reproduced this issue, and you agree it's a bug?
msg254034 - (view) Author: Brian Cain (Brian.Cain) * Date: 2015-11-04 00:47
Here is a more useful ASan report:

=================================================================
==12168==ERROR: AddressSanitizer: heap-use-after-free on address 0x62500001e110 at pc 0x000000697238 bp 0x7fff412b9240 sp 0x7fff412b9238
READ of size 1 at 0x62500001e110 thread T0
    #0 0x697237 in tok_nextc /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:911:20
    #1 0x68c63b in tok_get /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:1460:13
    #2 0x689d93 in PyTokenizer_Get /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:1809:18
    #3 0x67fec3 in parsetok /home/brian/src/fuzzpy/cpython/Parser/parsetok.c:208:16
    #4 0x6837d4 in PyParser_ParseFileObject /home/brian/src/fuzzpy/cpython/Parser/parsetok.c:134:12
    #5 0x52f50c in PyParser_ASTFromFileObject /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:1150:15
    #6 0x532e16 in PyRun_FileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:916:11
    #7 0x52c3f8 in PyRun_SimpleFileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:396:13
    #8 0x52a460 in PyRun_AnyFileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:80:16
    #9 0x5cb04a in run_file /home/brian/src/fuzzpy/cpython/Modules/main.c:318:11
    #10 0x5c5a42 in Py_Main /home/brian/src/fuzzpy/cpython/Modules/main.c:768:19
    #11 0x4fbace in main /home/brian/src/fuzzpy/cpython/./Programs/python.c:69:11
    #12 0x7fe8a9a4aa3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x20a3f)
    #13 0x431548 in _start (/home/brian/src/fuzzpy/cpython/python+0x431548)

0x62500001e110 is located 16 bytes inside of 8224-byte region [0x62500001e100,0x625000020120)
freed by thread T0 here:
    #0 0x4cdef0 in realloc /home/brian/src/fuzzpy/llvm_src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:61
    #1 0x501280 in _PyMem_RawRealloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:84:12
    #2 0x4fc68d in _PyMem_DebugRealloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:1921:18
    #3 0x4fdf42 in PyMem_Realloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:343:12
    #4 0x69a338 in tok_nextc /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:1050:34
    #5 0x68a2c9 in tok_get /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:1357:17
    #6 0x689d93 in PyTokenizer_Get /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:1809:18
    #7 0x67fec3 in parsetok /home/brian/src/fuzzpy/cpython/Parser/parsetok.c:208:16
    #8 0x6837d4 in PyParser_ParseFileObject /home/brian/src/fuzzpy/cpython/Parser/parsetok.c:134:12
    #9 0x52f50c in PyParser_ASTFromFileObject /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:1150:15
    #10 0x532e16 in PyRun_FileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:916:11
    #11 0x52c3f8 in PyRun_SimpleFileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:396:13
    #12 0x52a460 in PyRun_AnyFileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:80:16
    #13 0x5cb04a in run_file /home/brian/src/fuzzpy/cpython/Modules/main.c:318:11
    #14 0x5c5a42 in Py_Main /home/brian/src/fuzzpy/cpython/Modules/main.c:768:19
    #15 0x4fbace in main /home/brian/src/fuzzpy/cpython/./Programs/python.c:69:11
    #16 0x7fe8a9a4aa3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x20a3f)

previously allocated by thread T0 here:
    #0 0x4cdb88 in malloc /home/brian/src/fuzzpy/llvm_src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:40
    #1 0x501030 in _PyMem_RawMalloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:62:12
    #2 0x5074db in _PyMem_DebugAlloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:1838:22
    #3 0x4fc213 in _PyMem_DebugMalloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:1861:12
    #4 0x4fdbfa in PyMem_Malloc /home/brian/src/fuzzpy/cpython/Objects/obmalloc.c:325:12
    #5 0x68791d in PyTokenizer_FromFile /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:861:29
    #6 0x68359e in PyParser_ParseFileObject /home/brian/src/fuzzpy/cpython/Parser/parsetok.c:126:16
    #7 0x52f50c in PyParser_ASTFromFileObject /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:1150:15
    #8 0x532e16 in PyRun_FileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:916:11
    #9 0x52c3f8 in PyRun_SimpleFileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:396:13
    #10 0x52a460 in PyRun_AnyFileExFlags /home/brian/src/fuzzpy/cpython/Python/pythonrun.c:80:16
    #11 0x5cb04a in run_file /home/brian/src/fuzzpy/cpython/Modules/main.c:318:11
    #12 0x5c5a42 in Py_Main /home/brian/src/fuzzpy/cpython/Modules/main.c:768:19
    #13 0x4fbace in main /home/brian/src/fuzzpy/cpython/./Programs/python.c:69:11
    #14 0x7fe8a9a4aa3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x20a3f)

SUMMARY: AddressSanitizer: heap-use-after-free /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:911:20 in tok_nextc
Shadow bytes around the buggy address:
  0x0c4a7fffbbd0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4a7fffbbe0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4a7fffbbf0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4a7fffbc00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4a7fffbc10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x0c4a7fffbc20: fd fd[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c4a7fffbc30: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c4a7fffbc40: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c4a7fffbc50: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c4a7fffbc60: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c4a7fffbc70: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Heap right redzone:      fb
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack partial redzone:   f4
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==12168==ABORTING
msg254225 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-11-06 21:34
Yes, there is a bug. When decoding_fgets() encounters non-UTF-8 bytes, it fails and frees the input buffer in error_ret(). But since tok->cur != tok->inp, the next call of tok_nextc() reads freed memory:

        if (tok->cur != tok->inp) {
            return Py_CHARMASK(*tok->cur++); /* Fast path */
        }

If Python does not crash here, a new buffer is allocated and assigned to tok->buf. PyTokenizer_Get() then returns an error, and parsetok() calculates the position of the error:

            err_ret->offset = (int)(tok->cur - tok->buf);

but tok->cur still points inside the old, freed buffer, so the offset becomes a huge integer. err_input() then tries to decode the part of the string before the error with the "replace" error handler, but since the position was wrongly calculated, it reads outside of allocated memory.

The proposed patch fixes the issue. It sets tok->done and the buffer pointers in case of a decoding error, so they are now in a consistent state. It also removes some duplicated or dead code.
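
The shape of that fix, as a sketch consistent with the description above (not the verbatim changeset): error_ret() leaves the tok_state with no pointers into the freed buffer, so the fast path in tok_nextc() is never taken on freed memory and parsetok() cannot compute a garbage offset.

    /* Sketch of the fix (not the verbatim patch): on a decoding error,
       free the buffer, null out every pointer into it, and record the
       reason in tok->done so callers see a consistent EOF-like state. */
    static char *
    error_ret(struct tok_state *tok)
    {
        tok->decoding_erred = 1;
        if (tok->fp != NULL && tok->buf != NULL)   /* see PyTokenizer_Free() */
            PyMem_FREE(tok->buf);
        tok->buf = tok->cur = tok->end = tok->inp = tok->start = NULL;
        tok->done = E_DECODE;
        return NULL;                /* as if it were EOF */
    }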
msg254656 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-11-14 13:15
New changeset 73da4fd7542b by Serhiy Storchaka in branch '3.4':
Issue #25388: Fixed tokenizer crash when processing undecodable source code
https://hg.python.org/cpython/rev/73da4fd7542b

New changeset e4a69eb34ad7 by Serhiy Storchaka in branch '3.5':
Issue #25388: Fixed tokenizer crash when processing undecodable source code
https://hg.python.org/cpython/rev/e4a69eb34ad7

New changeset ea0c4b811eae by Serhiy Storchaka in branch 'default':
Issue #25388: Fixed tokenizer crash when processing undecodable source code
https://hg.python.org/cpython/rev/ea0c4b811eae

New changeset 8e472cc258ec by Serhiy Storchaka in branch '2.7':
Issue #25388: Fixed tokenizer hang when processing undecodable source code
https://hg.python.org/cpython/rev/8e472cc258ec
History
Date User Action Args
2022-04-11 14:58:22  admin             set  github: 69575
2015-11-14 19:24:42  serhiy.storchaka  set  status: open -> closed
    resolution: fixed
    stage: patch review -> resolved
2015-11-14 13:15:01  python-dev        set  nosy: + python-dev
    messages: + msg254656
2015-11-06 21:34:38  serhiy.storchaka  set  files: + issue25388.patch
    messages: + msg254225
    stage: patch review
2015-11-04 00:47:46  Brian.Cain        set  messages: + msg254034
2015-11-04 00:32:26  Brian.Cain        set  messages: + msg254033
2015-11-03 10:57:40  serhiy.storchaka  set  assignee: serhiy.storchaka
2015-11-01 21:47:17  serhiy.storchaka  set  nosy: + serhiy.storchaka, benjamin.peterson
    messages: + msg253879
2015-10-17 02:36:08  terry.reedy       set  nosy: + terry.reedy
    messages: + msg253114
    versions: + Python 3.5
2015-10-13 03:29:42  Brian.Cain        set  type: crash
2015-10-13 03:16:44  Brian.Cain        set  title: tokenizer crash/misbehavior -> tokenizer crash/misbehavior -- heap use-after-free
2015-10-13 03:16:00  Brian.Cain        set  files: + asan.txt
    messages: + msg252906
2015-10-13 03:15:15  Brian.Cain        create