tokenizer crash/misbehavior -- heap use-after-free #69575
Comments
This issue is similar to (but I believe distinct from) the one reported earlier as http://bugs.python.org/issue24022. Tokenizer failures strike me as difficult to exploit, but risky nonetheless. Attached are a test case that illustrates the problem and the output from ASan when it encounters the failure. All of the versions I tested (e.g. Python 3.4.3 (default, Mar 26 2015, 22:03:40)) failed in one way or another: segfault, assertion failure, or printing enormous blank output to the console. Some fail frequently and some exhibit this failure only occasionally.
According to https://docs.python.org/3/reference/lexical_analysis.html#lexical-analysis, the encoding of a source file (in Python 3) defaults to UTF-8, and a decoding error is (or should be) reported as a SyntaxError. Since b"\x7f\x00\x00\n''s\x01\xfd\n'S" is not valid UTF-8, I expect a UnicodeDecodeError converted to a SyntaxError.
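As a minimal sketch of that expected behavior (my illustration via the CPython C API, not part of the original report; Py_CompileString treats its input as UTF-8-encoded source):

    /* Sketch: compiling non-UTF-8 source should raise SyntaxError,
       not crash.  The byte \xfd can never appear in well-formed UTF-8. */
    #include <Python.h>
    #include <stdio.h>

    int main(void)
    {
        Py_Initialize();
        PyObject *code = Py_CompileString("'\xfd'", "<test>", Py_file_input);
        if (code == NULL && PyErr_ExceptionMatches(PyExc_SyntaxError))
            printf("SyntaxError, as expected\n");
        Py_XDECREF(code);
        Py_Finalize();
        return 0;
    }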
I expect `self.assertIn(b"Non-UTF-8", res.err)` to always fail because error messages are strings, not bytes. That aside, have you ever seen that particular text (as a string) in a SyntaxError message? Why do you think the crash is during the tokenizing phase? I could not see anything in the ASan report.
Stack trace:

    #0 ascii_decode (start=0xa72f2008 "", end=0xfffff891 <error: Cannot access memory at address 0xfffff891>, dest=<optimized out>) at Objects/unicodeobject.c:4795

At #2, PyUnicode_DecodeUTF8 is called with s="" and size=1490081929. size is err->offset, and err->offset is set only in parsetok() in Parser/parsetok.c. This is the tokenizer bug.

Minimal reproducer:

    ./python -c 'with open("vuln.py", "wb") as f: f.write(b"\x7f\x00\n\xfd\n")'
    ./python vuln.py

The crash is gone if the code at the end of decoding_fgets() that tests for UTF-8 is commented out.
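For context, a simplified sketch (assumed from the analysis above, not quoted verbatim from CPython) of how parsetok() derives the error offset from tokenizer pointers; once the old buffer has been freed and replaced, tok->cur is stale and the subtraction yields garbage:

    /* Simplified sketch of the offset computation in parsetok().
       If error_ret() freed tok->buf and a fresh buffer was later
       installed, tok->cur still points into the old allocation, so
       this subtraction produces the huge size (1490081929) that is
       then passed to PyUnicode_DecodeUTF8 in the stack trace above. */
    err_ret->offset = (int)(tok->cur - tok->buf);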
Sorry, the report would have been clearer if I'd included a build with symbols and a stack trace. The test was inspired by the test from bpo-24022 (https://hg.python.org/cpython/rev/03b2259c6cd3); it sounds like it should not have been. But indeed it seems you've reproduced this issue, and you agree it's a bug?
Here is a more useful ASan report:

    =================================================================
    0x62500001e110 is located 16 bytes inside of 8224-byte region [0x62500001e100,0x625000020120)
    previously allocated by thread T0 here:
    SUMMARY: AddressSanitizer: heap-use-after-free /home/brian/src/fuzzpy/cpython/Parser/tokenizer.c:911:20 in tok_nextc
Yes, there is a bug. When decoding_fgets() encounters non-UTF-8 bytes, it fails and frees the input buffer in error_ret(). But since tok->cur != tok->inp, the next call of tok_nextc() reads the freed memory:

    if (tok->cur != tok->inp) {
        return Py_CHARMASK(*tok->cur++); /* Fast path */
    }

If Python does not crash here, a new buffer is allocated and assigned to tok->buf, then PyTokenizer_Get returns an error and parsetok() calculates the position of the error; but tok->cur still points inside the old, freed buffer, so the offset becomes a very large integer. err_input() then tries to decode the part of the string before the error position with the "replace" error handler, but since the position was calculated wrongly, it reads outside the allocated memory.

The proposed patch fixes the issue. It sets tok->done and the pointers in the case of a decoding error, so they are now in a consistent state. It also removes some duplicated or dead code.
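A sketch of what a fix along those lines could look like (illustrative, inferred from the description above; not the actual committed patch): the error path leaves no dangling pointers behind and records the decoding error in tok->done, so later calls see a consistent end-of-input state.

    /* Sketch of a consistent decoding-error path for the tokenizer.
       After freeing the buffer, every pointer that could later be
       dereferenced or subtracted is cleared, and tok->done records
       the decoding error so tok_nextc() reports EOF-with-error. */
    static char *
    error_ret(struct tok_state *tok)
    {
        tok->decoding_erred = 1;
        if (tok->fp != NULL && tok->buf != NULL)
            PyMem_FREE(tok->buf);
        tok->buf = tok->cur = tok->inp = tok->start = tok->end = NULL;
        tok->done = E_DECODE;
        return NULL;            /* as if it were EOF */
    }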
New changeset 73da4fd7542b by Serhiy Storchaka in branch '3.4':
New changeset e4a69eb34ad7 by Serhiy Storchaka in branch '3.5':
New changeset ea0c4b811eae by Serhiy Storchaka in branch 'default':
New changeset 8e472cc258ec by Serhiy Storchaka in branch '2.7':