ASAN/UBSAN: heap-buffer-overflow in pegen.c #85130
ASAN/UBSAN has detected a heap-buffer-overflow in pegen.c:

```
==1625693==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x606000026b71 at pc 0x00000073574d bp 0x7fff297284f0 sp 0x7fff297284e0
0x606000026b71 is located 0 bytes to the right of 49-byte region [0x606000026b40,0x606000026b71)
SUMMARY: AddressSanitizer: heap-buffer-overflow Objects/unicodeobject.c:4941 in ascii_decode
```
Lysandros, could you take a look?
Christian, what test are you running to get the error?
Yup, I'm on it.
Note that although we could just exit if the length of the line is smaller than the column offset before calling https://github.com/python/cpython/blob/master/Parser/pegen.c#L148 (I assume that is the problem), it is more important to understand how we are reaching that situation.
Agreed. For that I'd probably need the input Christian is getting the error with. I'm currently trying things out, but haven't gotten the error yet.
Seems that just this (at top level):

is able to produce a situation where the column offset is bigger than the line length: the col offset is 11 and the line length is 10.
Correcting my previous post (deleted): You can get an assert failure without ASAN with this patch:

```diff
diff --git a/Parser/pegen.c b/Parser/pegen.c
index e29910bf86..65fa44921f 100644
--- a/Parser/pegen.c
+++ b/Parser/pegen.c
@@ -145,6 +145,9 @@ byte_offset_to_character_offset(PyObject *line, int col_offset)
     if (!str) {
         return 0;
     }
+    assert(col_offset <= PyUnicode_GET_LENGTH(line));
     PyObject *text = PyUnicode_DecodeUTF8(str, col_offset, "replace");
     if (!text) {
         return 0;
```
I theorize that this happens when the error points to the end of the line. I assume that the difference between col_offset and PyUnicode_GET_LENGTH(line) may still be bigger than 1 if the line contains some Unicode stuff, as with:

but I think that would be the same situation.
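That the difference can be much bigger than 1 is easy to see from Python itself: col_offset is a UTF-8 byte offset, while PyUnicode_GET_LENGTH counts code points. A small sketch (the string is borrowed from the REPL session later in this thread, not the original reproducer):

```python
# col_offset in pegen counts UTF-8 *bytes*, PyUnicode_GET_LENGTH counts
# *code points*; for non-ASCII lines the two scales diverge by more than 1.
line = '"Ṕýţĥòñ" +'   # non-ASCII line from the REPL session in this thread

chars = len(line)                    # code points: 10
nbytes = len(line.encode("utf-8"))   # UTF-8 bytes: 17

print(chars, nbytes)  # 10 17: an end-of-line byte offset is 7 past len(line)
```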
I'm using an ASAN/UBSAN build of Python:

```shell
$ make clean
$ ./configure --with-address-sanitizer --with-undefined-behavior-sanitizer
$ cat > asan-suppression.txt << EOF
# Python/initconfig.c
leak:_PyWideStringList_Copy
EOF
$ LSAN_OPTIONS="suppressions=asan-suppression.txt,print_suppressions=0" make
$ LSAN_OPTIONS="suppressions=asan-suppression.txt,print_suppressions=0" ./python -m test test_eof
```
I cannot reproduce this on my Ubuntu 20.04:

```shell
➜ cpython git:(master) ✗ LSAN_OPTIONS="suppressions=asan-suppression.txt,print_suppressions=0" ./python -m test -v test_eof
----------------------------------------------------------------------
Ran 5 tests in 0.413s

OK
== Tests result: SUCCESS ==
1 test OK.
Total duration: 654 ms
```
Running Pablo's examples in the REPL does not reproduce the errors either:

```shell
➜ cpython git:(master) ✗ LSAN_OPTIONS="suppressions=asan-suppression.txt,print_suppressions=0" ./python
Modules/posixmodule.c:14682:9: runtime error: left shift of 34 by 26 places cannot be represented in type 'int'
Python 3.10.0a0 (heads/master:e2fb8a2c42, Jun 12 2020, 16:59:27)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> "Ṕýţĥòñ" +
  File "<stdin>", line 1
    "Ṕýţĥòñ" +
              ^
SyntaxError: invalid syntax
>>> yield from
  File "<stdin>", line 1
    yield from
              ^
SyntaxError: invalid syntax
>>> ^D
➜ cpython git:(master) ✗
```
Did you clean and compile with ASAN/UBSAN? Older compilers may not detect the issue. My system has GCC 10:
Lysandros, are you using a debug build? The assert() won't trigger if you haven't run configure with --with-pydebug. I would recommend: make distclean && ./configure --with-pydebug && make
Let me try that. I'm currently using gcc 9.3. |
I now configured Python with ./configure --with-pydebug --with-address-sanitizer --with-undefined-behavior-sanitizer.
BTW I'm not talking about the assert not triggering. I'm only saying that ASAN/UBSAN do not report an error when running |
Oh, I was confused by this:
I thought you meant that you could not reproduce the crash with my patch either. For the ASAN error, you may need the newer gcc and other stuff. I was not able to reproduce this on my work macOS, but will try tonight on my Arch Linux box, which has the same versions as Christian.
Is building from source the only way to get gcc-10.1? Because currently my internet connection is too slow to get all of gcc downloaded and
If your package manager does not have the package and you cannot find a PPA, you would need to build it from source, or maybe use Docker with a distribution that has GCC 10, like Arch.
That's because of Line 404 in e2fb8a2.
What happens with:
Exact same thing. The offset is 16 at the start and gets decreased to 15 in the line I linked to in my previous post. Then col_offset gets converted to col_number, which is 9, which seems correct. Although it is correct, it feels correct by accident, since the if-branch on Line 154 in e2fb8a2
Well, I think the solution here is to clip col_offset to the line length, and maybe add an assert that the difference is not bigger than 1.
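A rough Python model of the clipping idea (an assumption about the intent, not the actual C code in Parser/pegen.c): clamp the offset to the buffer length before decoding the prefix, so the out-of-bounds read ASAN flagged cannot happen. Note that this sketch clamps against the byte length, which also sidesteps the code-point/byte mismatch discussed later in the thread.

```python
def byte_offset_to_character_offset(line: str, col_offset: int) -> int:
    """Python sketch of pegen's byte-to-character offset conversion
    with the proposed clamping applied (hypothetical, not the C code)."""
    data = line.encode("utf-8")
    # Clip col_offset so we never slice past the end of the buffer; the
    # unclamped read past the end is the heap-buffer-overflow ASAN reported.
    col_offset = min(col_offset, len(data))
    # Decode only the prefix and count its characters.
    return len(data[:col_offset].decode("utf-8", errors="replace"))

print(byte_offset_to_character_offset("x = 5 +", 100))  # clamped to 7
```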
I'm guessing that some parts of the conversion code were only there to circumvent issues in displaying the error messages that weren't pegen's fault. These were fixed by Guido in #64271, so I think we can delete some of them. For example, this patch works just fine for me:

```diff
diff --git a/Parser/pegen.c b/Parser/pegen.c
index e29910bf86..2c348178fb 100644
--- a/Parser/pegen.c
+++ b/Parser/pegen.c
@@ -150,10 +150,6 @@ byte_offset_to_character_offset(PyObject *line, int col_offset)
         return 0;
     }
     Py_ssize_t size = PyUnicode_GET_LENGTH(text);
-    str = PyUnicode_AsUTF8(text);
-    if (str != NULL && (int)strlen(str) == col_offset) {
-        size = strlen(str);
-    }
     Py_DECREF(text);
     return size;
 }
@@ -400,9 +396,6 @@ _PyPegen_raise_error_known_location(Parser *p, PyObject *errtype,
     if (!error_line) {
         Py_ssize_t size = p->tok->inp - p->tok->buf;
-        if (size && p->tok->buf[size-1] == '\n') {
-            size--;
-        }
         error_line = PyUnicode_DecodeUTF8(p->tok->buf, size, "replace");
         if (!error_line) {
             goto error;
```
Can you run the test suite and pyflakes before and after, to double-check that we don't introduce any regressions? If everything looks fine, open a PR :)
The exact same errors before and after! I'll wait on the PR though till someone has checked whether this indeed solves the heap-buffer-overflow. I'm not at all confident that it does, and I can't test.
Ok, I was able to reproduce:

```shell
❯ gcc --version
❯ make distclean
❯ LSAN_OPTIONS="suppressions=asan-suppression.txt,print_suppressions=0" ./python -m test test_eof
0:00:00 load avg: 1.82 Run tests sequentially
0:00:00 load avg: 1.82 [1/1] test_eof
test test_eof failed -- Traceback (most recent call last):
  File "/home/pablogsal/github/python/master/Lib/test/test_eof.py", line 54, in test_line_continuation_EOF_from_file_bpo2180
    self.assertIn(b'unexpected EOF while parsing', err)
AssertionError: b'unexpected EOF while parsing' not found in b'Parser/tokenizer.c:978:50: runtime error: pointer index expression with base 0x625000016900 overflowed to 0xbebebebebebee6be\n
=====================
test_eof failed
== Tests result: FAILURE ==
1 test failed:
Total duration: 359 ms
```

With this patch:

```diff
diff --git a/Parser/pegen.c b/Parser/pegen.c
index e29910bf86..a9f24ca5fa 100644
--- a/Parser/pegen.c
+++ b/Parser/pegen.c
@@ -145,15 +145,15 @@ byte_offset_to_character_offset(PyObject *line, int col_offset)
     if (!str) {
         return 0;
     }
+    Py_ssize_t linesize = PyUnicode_GET_LENGTH(line);
+    if (col_offset > linesize) {
+        col_offset = (int)linesize;
+    }
     PyObject *text = PyUnicode_DecodeUTF8(str, col_offset, "replace");
     if (!text) {
         return 0;
     }
     Py_ssize_t size = PyUnicode_GET_LENGTH(text);
-    str = PyUnicode_AsUTF8(text);
-    if (str != NULL && (int)strlen(str) == col_offset) {
-        size = strlen(str);
-    }
     Py_DECREF(text);
     return size;
 }
@@ -400,9 +400,6 @@ _PyPegen_raise_error_known_location(Parser *p, PyObject *errtype,
     if (!error_line) {
         Py_ssize_t size = p->tok->inp - p->tok->buf;
-        if (size && p->tok->buf[size-1] == '\n') {
-            size--;
-        }
         error_line = PyUnicode_DecodeUTF8(p->tok->buf, size, "replace");
         if (!error_line) {
             goto error;
```

the test passes:

```shell
❯ LSAN_OPTIONS="suppressions=asan-suppression.txt,print_suppressions=0" ./python -m test test_eof
== Tests result: SUCCESS ==
1 test OK.
Total duration: 500 ms
```
This line is wrong though, since PyUnicode_GET_LENGTH returns the length in code points and PyUnicode_DecodeUTF8 expects the number of bytes. For non-ASCII input this would push the caret further to the left. For example:

```shell
$ ./python.exe
Python 3.10.0a0 (heads/master-dirty:e2fb8a2c42, Jun 12 2020, 20:22:52)
[Clang 11.0.0 (clang-1100.0.33.8)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> "Ṕýţĥòñ" +
  File "<stdin>", line 1
    "Ṕýţĥòñ" +
        ^
SyntaxError: invalid syntax
```
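The effect described above can be reproduced in pure Python (a hypothetical model of the patch's arithmetic, not the C code itself): clamping a byte offset to the code-point length truncates the decoded prefix, so the computed caret column lands too far to the left.

```python
line = '"Ṕýţĥòñ" +'
raw = line.encode("utf-8")

col_offset = len(raw)                  # error at end of line: byte offset 17
clamped = min(col_offset, len(line))   # wrongly clamped to 10 (code points)

# Decoding only 10 of the 17 bytes drops characters from the prefix,
# so the caret would be drawn several columns too far to the left.
prefix = raw[:clamped].decode("utf-8", errors="replace")
print(len(prefix))  # 5, not 10
```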
Whoops! Thanks for pointing that out. In any case, the patch still shows that the ASAN error goes away by limiting the access to the string, so I suppose we could turn it into a proper patch.
FYI, since PR 20875/20919, MSVC (x64) has warned C4244 (conversion from 'Py_ssize_t' to 'int', possible loss of data).
Hummmm, that is because lineno, col_offset, end_lineno and end_col_offset are now Py_ssize_t, so we probably need to adapt the other assignments. Will make a PR soon.
No need. Already got something ready.