ASAN/UBSAN: heap-buffer-overflow in pegen.c #85130
ASAN/UBSAN has detected a heap-buffer-overflow in pegen.c:

```
==1625693==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x606000026b71 at pc 0x00000073574d bp 0x7fff297284f0 sp 0x7fff297284e0
0x606000026b71 is located 0 bytes to the right of 49-byte region [0x606000026b40,0x606000026b71)
SUMMARY: AddressSanitizer: heap-buffer-overflow Objects/unicodeobject.c:4941 in ascii_decode
```
Lysandros, could you take a look?
Christian, what test are you running to get the error?
Yup, I'm on it.
Note that although we could just exit if the length of the line is smaller than the column offset before calling https://github.com/python/cpython/blob/master/Parser/pegen.c#L148 (I assume that is the problem), it is more important to understand how we are reaching that situation.
Agreed. For that I'd probably need the input Christian is getting the error with. I'm currently trying things out, but haven't gotten the error yet.
Seems that just this (at top level):

is able to produce a situation where the column offset is bigger than the line length: the col offset is 11 and the line length is 10.
Correcting my previous post (deleted): You can get an assert failure without ASAN with this patch:

```diff
diff --git a/Parser/pegen.c b/Parser/pegen.c
index e29910bf86..65fa44921f 100644
--- a/Parser/pegen.c
+++ b/Parser/pegen.c
@@ -145,6 +145,9 @@ byte_offset_to_character_offset(PyObject *line, int col_offset)
     if (!str) {
         return 0;
     }
+    assert(col_offset <= PyUnicode_GET_LENGTH(line));
     PyObject *text = PyUnicode_DecodeUTF8(str, col_offset, "replace");
     if (!text) {
         return 0;
```
I theorize that this happens when the error points to the end of the line. I assume that the difference between col_offset and PyUnicode_GET_LENGTH(line) may still be bigger than 1 if the line contains some Unicode stuff, as with:

but I think that would be the same situation.
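That the difference can be much bigger than 1 is easy to see from Python itself: col_offset is a UTF-8 byte offset, while PyUnicode_GET_LENGTH counts code points. A small sketch (the string is borrowed from the REPL session later in this thread, not the original reproducer):

```python
# col_offset in pegen counts UTF-8 *bytes*, PyUnicode_GET_LENGTH counts
# *code points*; for non-ASCII lines the two scales diverge by more than 1.
line = '"Ṕýţĥòñ" +'   # non-ASCII line from the REPL session in this thread

chars = len(line)                    # code points: 10
nbytes = len(line.encode("utf-8"))   # UTF-8 bytes: 17

print(chars, nbytes)  # 10 17: an end-of-line byte offset is 7 past len(line)
```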
I'm using an ASAN/UBSAN build of Python:

```shell
$ make clean
$ ./configure --with-address-sanitizer --with-undefined-behavior-sanitizer
$ cat > asan-suppression.txt << EOF
# Python/initconfig.c
leak:_PyWideStringList_Copy
EOF
$ LSAN_OPTIONS="suppressions=asan-suppression.txt,print_suppressions=0" make
$ LSAN_OPTIONS="suppressions=asan-suppression.txt,print_suppressions=0" ./python -m test test_eof
```
I cannot reproduce this on my Ubuntu 20.04:

```shell
➜ cpython git:(master) ✗ LSAN_OPTIONS="suppressions=asan-suppression.txt,print_suppressions=0" ./python -m test -v test_eof
----------------------------------------------------------------------
Ran 5 tests in 0.413s

OK
== Tests result: SUCCESS ==
1 test OK.
Total duration: 654 ms
```
Running Pablo's examples in the REPL does not reproduce the errors either:

```shell
➜ cpython git:(master) ✗ LSAN_OPTIONS="suppressions=asan-suppression.txt,print_suppressions=0" ./python
Modules/posixmodule.c:14682:9: runtime error: left shift of 34 by 26 places cannot be represented in type 'int'
Python 3.10.0a0 (heads/master:e2fb8a2c42, Jun 12 2020, 16:59:27)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> "Ṕýţĥòñ" +
  File "<stdin>", line 1
    "Ṕýţĥòñ" +
              ^
SyntaxError: invalid syntax
>>> yield from
  File "<stdin>", line 1
    yield from
              ^
SyntaxError: invalid syntax
>>> ^D
➜ cpython git:(master) ✗
```
Did you clean and compile with ASAN/UBSAN? Older compilers may not detect the issue. My system has GCC 10:
Lysandros, are you using a debug build? The assert() won't trigger if you haven't run configure with --with-pydebug. I would recommend: make distclean && ./configure --with-pydebug && make
Let me try that. I'm currently using gcc 9.3. |
I now configured Python with ./configure --with-pydebug --with-address-sanitizer --with-undefined-behavior-sanitizer.
BTW I'm not talking about the assert not triggering. I'm only saying that ASAN/UBSAN do not report an error when running |
Oh, I was confused by this:
I thought you meant that you could not reproduce the crash with my patch either. For the ASAN error, you may need the newer gcc and other stuff. I was not able to reproduce this on my work macOS, but will try tonight on my Arch Linux box, which has the same versions as Christian.
Is building from source the only way to get gcc-10.1? Because currently my internet connection is too slow to get all of gcc downloaded and
If your package manager does not have the package and you cannot find a PPA, you would need to build it from source, or maybe use Docker with a distribution that has GCC 10, like Arch.
That's because of Line 404 in e2fb8a2.
What happens with:
Exact same thing. The offset is 16 at the start and gets decreased to 15 in the line I linked to in my previous post. Then col_offset gets converted to col_number, which is 9, which seems correct. Although it is correct, it feels correct by accident, since the if-branch on Line 154 in e2fb8a2
Well, I think the solution here is to clip col_offset to the line length, and maybe add an assert that the difference is not bigger than 1.
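A rough Python model of the clipping idea (an assumption about the intent, not the actual C code in Parser/pegen.c): clamp the offset to the buffer length before decoding the prefix, so the out-of-bounds read ASAN flagged cannot happen. Note that this sketch clamps against the byte length, which also sidesteps the code-point/byte mismatch discussed later in the thread.

```python
def byte_offset_to_character_offset(line: str, col_offset: int) -> int:
    """Python sketch of pegen's byte-to-character offset conversion
    with the proposed clamping applied (hypothetical, not the C code)."""
    data = line.encode("utf-8")
    # Clip col_offset so we never slice past the end of the buffer; the
    # unclamped read past the end is the heap-buffer-overflow ASAN reported.
    col_offset = min(col_offset, len(data))
    # Decode only the prefix and count its characters.
    return len(data[:col_offset].decode("utf-8", errors="replace"))

print(byte_offset_to_character_offset("x = 5 +", 100))  # clamped to 7
```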
I'm guessing that some parts of the conversion code were only there to circumvent issues in displaying the error messages that weren't pegen's fault. These were fixed by Guido in #64271, so I think we can delete some of them. For example, this patch works just fine for me:

```diff
diff --git a/Parser/pegen.c b/Parser/pegen.c
index e29910bf86..2c348178fb 100644
--- a/Parser/pegen.c
+++ b/Parser/pegen.c
@@ -150,10 +150,6 @@ byte_offset_to_character_offset(PyObject *line, int col_offset)
         return 0;
     }
     Py_ssize_t size = PyUnicode_GET_LENGTH(text);
-    str = PyUnicode_AsUTF8(text);
-    if (str != NULL && (int)strlen(str) == col_offset) {
-        size = strlen(str);
-    }
     Py_DECREF(text);
     return size;
 }
@@ -400,9 +396,6 @@ _PyPegen_raise_error_known_location(Parser *p, PyObject *errtype,
     if (!error_line) {
         Py_ssize_t size = p->tok->inp - p->tok->buf;
-        if (size && p->tok->buf[size-1] == '\n') {
-            size--;
-        }
         error_line = PyUnicode_DecodeUTF8(p->tok->buf, size, "replace");
         if (!error_line) {
             goto error;
```
Can you run the test suite and pyflakes before and after, to double-check that we don't introduce any regressions? If everything looks fine, open a PR :)
The exact same errors before and after! I'll wait on the PR though till someone has checked whether this indeed solves the heap-buffer-overflow. I'm not at all confident that it does, and I can't test.
Ok, I was able to reproduce:

```shell
❯ gcc --version
❯ make distclean
❯ LSAN_OPTIONS="suppressions=asan-suppression.txt,print_suppressions=0" ./python -m test test_eof
0:00:00 load avg: 1.82 Run tests sequentially
0:00:00 load avg: 1.82 [1/1] test_eof
test test_eof failed -- Traceback (most recent call last):
  File "/home/pablogsal/github/python/master/Lib/test/test_eof.py", line 54, in test_line_continuation_EOF_from_file_bpo2180
    self.assertIn(b'unexpected EOF while parsing', err)
AssertionError: b'unexpected EOF while parsing' not found in b'Parser/tokenizer.c:978:50: runtime error: pointer index expression with base 0x625000016900 overflowed to 0xbebebebebebee6be\n
=====================
test_eof failed
== Tests result: FAILURE ==
1 test failed:
Total duration: 359 ms
```

With this patch:

```diff
diff --git a/Parser/pegen.c b/Parser/pegen.c
index e29910bf86..a9f24ca5fa 100644
--- a/Parser/pegen.c
+++ b/Parser/pegen.c
@@ -145,15 +145,15 @@ byte_offset_to_character_offset(PyObject *line, int col_offset)
     if (!str) {
         return 0;
     }
+    Py_ssize_t linesize = PyUnicode_GET_LENGTH(line);
+    if (col_offset > linesize) {
+        col_offset = (int)linesize;
+    }
     PyObject *text = PyUnicode_DecodeUTF8(str, col_offset, "replace");
     if (!text) {
         return 0;
     }
     Py_ssize_t size = PyUnicode_GET_LENGTH(text);
-    str = PyUnicode_AsUTF8(text);
-    if (str != NULL && (int)strlen(str) == col_offset) {
-        size = strlen(str);
-    }
     Py_DECREF(text);
     return size;
 }
@@ -400,9 +400,6 @@ _PyPegen_raise_error_known_location(Parser *p, PyObject *errtype,
     if (!error_line) {
         Py_ssize_t size = p->tok->inp - p->tok->buf;
-        if (size && p->tok->buf[size-1] == '\n') {
-            size--;
-        }
         error_line = PyUnicode_DecodeUTF8(p->tok->buf, size, "replace");
         if (!error_line) {
             goto error;
```

the test passes:

```shell
❯ LSAN_OPTIONS="suppressions=asan-suppression.txt,print_suppressions=0" ./python -m test test_eof
== Tests result: SUCCESS ==
1 test OK.
Total duration: 500 ms
```
This line is wrong though, since PyUnicode_GET_LENGTH returns the length in code points and PyUnicode_DecodeUTF8 expects the number of bytes. For non-ASCII input this would push the caret further to the left. For example:

```shell
$ ./python.exe
Python 3.10.0a0 (heads/master-dirty:e2fb8a2c42, Jun 12 2020, 20:22:52)
[Clang 11.0.0 (clang-1100.0.33.8)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> "Ṕýţĥòñ" +
  File "<stdin>", line 1
    "Ṕýţĥòñ" +
        ^
SyntaxError: invalid syntax
```
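The effect described above can be reproduced in pure Python (a hypothetical model of the patch's arithmetic, not the C code itself): clamping a byte offset to the code-point length truncates the decoded prefix, so the computed caret column lands too far to the left.

```python
line = '"Ṕýţĥòñ" +'
raw = line.encode("utf-8")

col_offset = len(raw)                  # error at end of line: byte offset 17
clamped = min(col_offset, len(line))   # wrongly clamped to 10 (code points)

# Decoding only 10 of the 17 bytes drops characters from the prefix,
# so the caret would be drawn several columns too far to the left.
prefix = raw[:clamped].decode("utf-8", errors="replace")
print(len(prefix))  # 5, not 10
```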
Whoops! Thanks for pointing that out. In any case, the patch still shows that the ASAN error goes away by limiting the access to the string, so I suppose we could turn it into a proper patch.
FYI, since PR 20875/20919, MSVC (x64) has warned C4244 (conversion from 'Py_ssize_t' to 'int', possible loss of data).
Hummmm, that is because lineno, col_offset, end_lineno and end_col_offset are now Py_ssize_t, so we probably need to adapt the other assignments. Will make a PR soon.
No need. Already got something ready.