This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: ASAN/UBSAN: heap-buffer-overflow in pegen.c
Type: security
Stage: resolved
Components: Interpreter Core
Versions: Python 3.10, Python 3.9

process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: christian.heimes, lys.nikolaou, miss-islington, pablogsal
Priority: high
Keywords: patch

Created on 2020-06-12 10:42 by christian.heimes, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
PR  Status  Author, Date
PR 20842 closed pablogsal, 2020-06-12 21:18
PR 20875 merged pablogsal, 2020-06-15 00:30
PR 20919 merged miss-islington, 2020-06-16 15:49
PR 20968 closed lys.nikolaou, 2020-06-18 21:19
PR 20970 merged lys.nikolaou, 2020-06-18 21:45
PR 21001 merged miss-islington, 2020-06-20 12:57
Messages (37)
msg371351 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2020-06-12 10:42
ASAN/UBSAN has detected a heap-buffer-overflow in pegen.c

==1625693==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x606000026b71 at pc 0x00000073574d bp 0x7fff297284f0 sp 0x7fff297284e0
READ of size 1 at 0x606000026b71 thread T0
    #0 0x73574c in ascii_decode Objects/unicodeobject.c:4941
    #1 0x82bd0f in unicode_decode_utf8 Objects/unicodeobject.c:4999
    #2 0xf35859 in byte_offset_to_character_offset Parser/pegen.c:148
    #3 0xf35859 in _PyPegen_raise_error_known_location Parser/pegen.c:412
    #4 0xf36482 in _PyPegen_raise_error Parser/pegen.c:373
    #5 0xf39e1d in tokenizer_error Parser/pegen.c:321
    #6 0xf39e1d in _PyPegen_fill_token Parser/pegen.c:638
    #7 0xf3ca0f in _PyPegen_expect_token Parser/pegen.c:753
    #8 0xf4cc7a in _tmp_15_rule Parser/parser.c:16184
    #9 0xf3c799 in _PyPegen_lookahead (/home/heimes/dev/python/cpython/python+0xf3c799)
    #10 0xfafb4a in compound_stmt_rule Parser/parser.c:1860
    #11 0xfb7fc2 in statement_rule Parser/parser.c:1224
    #12 0xfb7fc2 in _loop1_11_rule Parser/parser.c:15954
    #13 0xfb7fc2 in statements_rule Parser/parser.c:1183
    #14 0xfbbce7 in file_rule Parser/parser.c:716
    #15 0xfbbce7 in _PyPegen_parse Parser/parser.c:24401
    #16 0xf3f868 in _PyPegen_run_parser Parser/pegen.c:1077
    #17 0xf4044f in _PyPegen_run_parser_from_file_pointer Parser/pegen.c:1137
    #18 0xa27f36 in PyRun_FileExFlags Python/pythonrun.c:1057
    #19 0xa2826a in PyRun_SimpleFileExFlags Python/pythonrun.c:400
    #20 0x479b1b in pymain_run_file Modules/main.c:369
    #21 0x479b1b in pymain_run_python Modules/main.c:553
    #22 0x47bd59 in Py_RunMain Modules/main.c:632
    #23 0x47bd59 in pymain_main Modules/main.c:662
    #24 0x47bd59 in Py_BytesMain Modules/main.c:686
    #25 0x7f59aa5cd041 in __libc_start_main (/lib64/libc.so.6+0x27041)
    #26 0x47643d in _start (/home/heimes/dev/python/cpython/python+0x47643d)

0x606000026b71 is located 0 bytes to the right of 49-byte region [0x606000026b40,0x606000026b71)
allocated by thread T0 here:
    #0 0x7f59ab303667 in __interceptor_malloc (/lib64/libasan.so.6+0xb0667)
    #1 0x749c7d in PyUnicode_New Objects/unicodeobject.c:1437
    #2 0x872f15 in _PyUnicode_Init Objects/unicodeobject.c:15535
    #3 0x9fe0ab in pycore_init_types Python/pylifecycle.c:599
    #4 0x9fe0ab in pycore_interp_init Python/pylifecycle.c:724
    #5 0xa07c69 in pyinit_config Python/pylifecycle.c:765
    #6 0xa07c69 in pyinit_core Python/pylifecycle.c:926
    #7 0xa09b17 in Py_InitializeFromConfig Python/pylifecycle.c:1136
    #8 0x4766c2 in pymain_init Modules/main.c:66
    #9 0x47bd12 in pymain_main Modules/main.c:653
    #10 0x47bd12 in Py_BytesMain Modules/main.c:686
    #11 0x7f59aa5cd041 in __libc_start_main (/lib64/libc.so.6+0x27041)

SUMMARY: AddressSanitizer: heap-buffer-overflow Objects/unicodeobject.c:4941 in ascii_decode
Shadow bytes around the buggy address:
  0x0c0c7fffcd10: fa fa fa fa 00 00 00 00 00 00 00 00 fa fa fa fa
  0x0c0c7fffcd20: 00 00 00 00 00 00 00 07 fa fa fa fa 00 00 00 00
  0x0c0c7fffcd30: 00 00 00 00 fa fa fa fa 00 00 00 00 00 00 00 05
  0x0c0c7fffcd40: fa fa fa fa 00 00 00 00 00 00 00 00 fa fa fa fa
  0x0c0c7fffcd50: 00 00 00 00 00 00 00 00 fa fa fa fa 00 00 00 00
=>0x0c0c7fffcd60: 00 00 00 01 fa fa fa fa 00 00 00 00 00 00[01]fa
  0x0c0c7fffcd70: fa fa fa fa 00 00 00 00 00 00 00 00 fa fa fa fa
  0x0c0c7fffcd80: 00 00 00 00 00 00 05 fa fa fa fa fa 00 00 00 00
  0x0c0c7fffcd90: 00 00 00 fa fa fa fa fa 00 00 00 00 00 00 00 00
  0x0c0c7fffcda0: fa fa fa fa fd fd fd fd fd fd fd fd fa fa fa fa
  0x0c0c7fffcdb0: fd fd fd fd fd fd fd fd fa fa fa fa 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==1625693==ABORTING
msg371358 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-06-12 12:34
Lysandros, could you take a look?
msg371359 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-06-12 12:38
Christian, what test are you running to get the error?
msg371360 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-06-12 12:51
> Lysandros, could you take a look?

Yup, I'm on it.
msg371361 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-06-12 12:56
Note that although we could just exit early if the length of the line is smaller than the column offset before calling https://github.com/python/cpython/blob/master/Parser/pegen.c#L148 (I assume that is the problem), it is more important to understand how we are reaching that situation.
msg371362 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-06-12 13:04
Agreed. For that I'd probably need the input Christian is getting the error with. I'm currently trying things out, but haven't gotten the error yet.
msg371363 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-06-12 13:05
Seems that just this (at top level):

>> yield from

is able to produce a situation where the column offset is bigger than the line length:

Col offset is 11 and line is 10
msg371365 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-06-12 13:08
Correcting my previous post (deleted):

You can get an assert failure without ASAN with this patch:

diff --git a/Parser/pegen.c b/Parser/pegen.c
index e29910bf86..65fa44921f 100644
--- a/Parser/pegen.c
+++ b/Parser/pegen.c
@@ -145,6 +145,9 @@ byte_offset_to_character_offset(PyObject *line, int col_offset)
     if (!str) {
         return 0;
     }
+    assert(col_offset <= PyUnicode_GET_LENGTH(line));
     PyObject *text = PyUnicode_DecodeUTF8(str, col_offset, "replace");
     if (!text) {
         return 0;
msg371366 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-06-12 13:12
I theorize that this happens when the error points to the end of the line. I assume that the difference between col_offset and PyUnicode_GET_LENGTH(line) may still be bigger than 1 if the line contains some Unicode characters, as with:

>> "Ṕýţĥòñ" +

but I think that would be the same situation.
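A minimal standalone sketch of why that gap can be larger than 1 (my own illustration, not code from the report, and assuming, as Lysandros confirms further down, that col_offset is a byte offset into the UTF-8 line): PyUnicode_GET_LENGTH() counts code points, while the offset counts UTF-8 bytes, so the two diverge by several positions on a non-ASCII line. Compile against CPython's headers (e.g. the flags from python3-config).

#include <Python.h>
#include <stdio.h>

int main(void)
{
    Py_Initialize();
    /* The non-ASCII example line from the message above. */
    PyObject *line = PyUnicode_FromString("\"Ṕýţĥòñ\" +");
    Py_ssize_t nbytes = 0;
    (void)PyUnicode_AsUTF8AndSize(line, &nbytes);          /* UTF-8 length: 17 bytes */
    Py_ssize_t ncodepoints = PyUnicode_GET_LENGTH(line);   /* code points: 10 */
    printf("bytes=%zd code points=%zd\n", nbytes, ncodepoints);
    Py_DECREF(line);
    Py_FinalizeEx();
    return 0;
}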
msg371368 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2020-06-12 13:53
I'm using an ASAN UBSAN build of Python:

$ make clean
$ ./configure --with-address-sanitizer --with-undefined-behavior-sanitizer
$ cat > asan-suppression.txt << EOF
# Python/initconfig.c
leak:_PyWideStringList_Copy
EOF
$ LSAN_OPTIONS="suppressions=asan-suppression.txt,print_suppressions=0"  make
$ LSAN_OPTIONS="suppressions=asan-suppression.txt,print_suppressions=0" ./python -m test test_eof
msg371369 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-06-12 14:03
I cannot reproduce this on my Ubuntu 20.04:

➜  cpython git:(master) ✗ LSAN_OPTIONS="suppressions=asan-suppression.txt,print_suppressions=0" ./python -m test -v test_eof
Modules/posixmodule.c:14682:9: runtime error: left shift of 34 by 26 places cannot be represented in type 'int'
== CPython 3.10.0a0 (heads/master:e2fb8a2c42, Jun 12 2020, 16:59:27) [GCC 9.3.0]
== Linux-5.4.0-33-generic-x86_64-with-glibc2.31 little-endian
== cwd: /home/lysnikolaou/repos/cpython/build/test_python_71258
== CPU count: 4
== encodings: locale=UTF-8, FS=utf-8
0:00:00 load avg: 2.11 Run tests sequentially
0:00:00 load avg: 2.11 [1/1] test_eof
test_EOFC (test.test_eof.EOFTestCase) ... ok
test_EOFS (test.test_eof.EOFTestCase) ... ok
test_eof_with_line_continuation (test.test_eof.EOFTestCase) ... ok
test_line_continuation_EOF (test.test_eof.EOFTestCase)
A continuation at the end of input must be an error; bpo2180. ... ok
test_line_continuation_EOF_from_file_bpo2180 (test.test_eof.EOFTestCase)
Ensure tok_nextc() does not add too many ending newlines. ... Modules/posixmodule.c:14682:9: runtime error: left shift of 34 by 26 places cannot be represented in type 'int'
ok

----------------------------------------------------------------------

Ran 5 tests in 0.413s

OK

== Tests result: SUCCESS ==

1 test OK.

Total duration: 654 ms
Tests result: SUCCESS
msg371370 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-06-12 14:05
Running Pablo's examples in the REPL does not reproduce the errors either:

➜  cpython git:(master) ✗ LSAN_OPTIONS="suppressions=asan-suppression.txt,print_suppressions=0" ./python                    
Modules/posixmodule.c:14682:9: runtime error: left shift of 34 by 26 places cannot be represented in type 'int'
Python 3.10.0a0 (heads/master:e2fb8a2c42, Jun 12 2020, 16:59:27) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> "Ṕýţĥòñ" +
  File "<stdin>", line 1
    "Ṕýţĥòñ" +
              ^
SyntaxError: invalid syntax
>>> yield from
  File "<stdin>", line 1
    yield from
              ^
SyntaxError: invalid syntax
>>> ^D
➜  cpython git:(master) ✗
msg371371 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2020-06-12 14:07
Did you clean and compile with asan/ubsan?

Older compilers may not detect the issue. My system has GCC 10:
libubsan-10.1.1-1.fc32.x86_64
gcc-10.1.1-1.fc32.x86_64
libasan-10.1.1-1.fc32.x86_64
msg371372 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-06-12 14:12
Lysandros, are you using a debug build? The assert() won't trigger if you haven't run configure with --with-pydebug. I would recommend:

make distclean && ./configure --with-pydebug && make
msg371373 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-06-12 14:15
Let me try that. I'm currently using gcc 9.3.
msg371374 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-06-12 14:17
I now configured Python with ./configure --with-pydebug --with-address-sanitizer --with-undefined-behavior-sanitizer.
msg371375 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-06-12 14:18
> Lysandros, are you using a debug build? The assert() won't trigger if you haven't run configure with --with-pydebug.

BTW I'm not talking about the assert not triggering. I'm only saying that ASAN/UBSAN do not report an error when running `yield from` in the REPL.
msg371376 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-06-12 14:32
Oh, I was confused by this:

> Running Pablo's examples in the REPL does not reproduce the errors either:


I thought you meant that you could not reproduce the crash with my patch either. For ASAN you may need the newer gcc and related libraries. I was not able to reproduce this on my work macOS machine, but I will try tonight on my Arch Linux box, which has the same versions as Christian's.
msg371378 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-06-12 14:36
Is the only way to get gcc-10.1 to build it from source? Because currently my internet connection is too slow to download all of gcc, and `apt install gcc-10` installs 10.0.1, which I'm getting totally unrelated errors with.
msg371379 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-06-12 14:41
> Is the only way to get gcc-10.1 to build it from source? Because currently my internet connection is too slow to download all of gcc, and `apt install gcc-10` installs 10.0.1, which I'm getting totally unrelated errors with.

If your package manager does not have the package and you cannot find a PPA, you would need to build it from source, or maybe use Docker with a distribution that has GCC 10, like Arch.
msg371381 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-06-12 15:04
> Note that although we could just exit if the length of the line is smaller than the column offset before calling https://github.com/python/cpython/blob/master/Parser/pegen.c#L148 (I assume that is the problem) is more important to understand how are we reaching that situation.

That's because of https://github.com/python/cpython/blob/e2fb8a2c42ee60c72a40d93da69e9efc4e359023/Parser/pegen.c#L404, which decreases the size of the string to be decoded by 1 if the last character is a newline (or, equivalently, if the error offset points past the end of the line). Thus, PyUnicode_DecodeUTF8 returns an object that is one character shorter than the col_offset and that's how we get to the situation you mentioned.
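To make the off-by-one concrete, here is a small self-contained sketch (my reconstruction, not code from the thread) of the interaction described above: the tokenizer buffer for `yield from` ends in '\n', the newline is trimmed before decoding, and the reported col_offset of 11 then points past the 10-character error_line.

#include <Python.h>
#include <string.h>

int main(void)
{
    Py_Initialize();
    const char *buf = "yield from\n";            /* tokenizer buffer, trailing newline included */
    Py_ssize_t size = (Py_ssize_t)strlen(buf);   /* 11 */
    if (size && buf[size - 1] == '\n') {
        size--;                                   /* 10: the trim done in _PyPegen_raise_error_known_location */
    }
    PyObject *error_line = PyUnicode_DecodeUTF8(buf, size, "replace");
    Py_ssize_t col_offset = 11;                   /* the offset reported for "yield from" */
    printf("col_offset=%zd, error_line length=%zd\n",
           col_offset, PyUnicode_GET_LENGTH(error_line));   /* 11 vs 10 */
    Py_DECREF(error_line);
    Py_FinalizeEx();
    return 0;
}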
msg371382 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-06-12 15:15
>  returns an object that is one character shorter than the col_offset and that's how we get to the situation you mentioned.

What happens with:

>>> "Ṕýţĥòñ" +
msg371384 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-06-12 15:29
> What happens with:

>>> "Ṕýţĥòñ" +

Exact same thing. The offset is 16 at the start and gets decreased to 15 in the line I linked to in my previous post. And then col_offset gets converted to col_number which is 9, which seems correct.

Although it is correct, it feels correct by accident, since the if-branch on https://github.com/python/cpython/blob/e2fb8a2c42ee60c72a40d93da69e9efc4e359023/Parser/pegen.c#L154 only evaluates to false due to the decrease of size and I don't know if that was intended.
msg371388 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-06-12 15:47
Well, I think the solution here is to clip the col_offset to the line length, and maybe add an assert that the difference is not bigger than 1.
msg371389 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-06-12 16:01
I'm guessing that some parts of the conversion code were only there to circumvent issues in displaying the error messages that weren't pegen's fault. These were fixed by Guido in GH-20072, so I think we can delete some of them. For example, this patch works just fine for me.

diff --git a/Parser/pegen.c b/Parser/pegen.c
index e29910bf86..2c348178fb 100644
--- a/Parser/pegen.c
+++ b/Parser/pegen.c
@@ -150,10 +150,6 @@ byte_offset_to_character_offset(PyObject *line, int col_offset)
         return 0;
     }
     Py_ssize_t size = PyUnicode_GET_LENGTH(text);
-    str = PyUnicode_AsUTF8(text);
-    if (str != NULL && (int)strlen(str) == col_offset) {
-        size = strlen(str);
-    }
     Py_DECREF(text);
     return size;
 }
@@ -400,9 +396,6 @@ _PyPegen_raise_error_known_location(Parser *p, PyObject *errtype,

     if (!error_line) {
         Py_ssize_t size = p->tok->inp - p->tok->buf;
-        if (size && p->tok->buf[size-1] == '\n') {
-            size--;
-        }
         error_line = PyUnicode_DecodeUTF8(p->tok->buf, size, "replace");
         if (!error_line) {
             goto error;
msg371391 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-06-12 16:14
Can you run the pyflakes test suite before and after the patch to double-check that we don't introduce any regression? If everything looks fine, open a PR :)
msg371393 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-06-12 16:28
> Can you run the pyflakes test suite before and after the patch to double-check that we don't introduce any regression? If everything looks fine, open a PR :)

The exact same errors before and after! I'll wait on the PR, though, until someone has checked whether this indeed solves the heap-buffer-overflow. I'm not at all confident that it does, and I can't test it myself.
msg371395 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-06-12 16:41
Ok, I was able to reproduce:

❯ gcc --version
gcc (GCC) 10.1.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

make distclean
./configure --with-address-sanitizer --with-undefined-behavior-sanitizer
LSAN_OPTIONS="suppressions=asan-suppression.txt,print_suppressions=0"  make -j

❯ LSAN_OPTIONS="suppressions=asan-suppression.txt,print_suppressions=0" ./python -m test test_eof
0:00:00 load avg: 1.82 Run tests sequentially
0:00:00 load avg: 1.82 [1/1] test_eof
test test_eof failed -- Traceback (most recent call last):
  File "/home/pablogsal/github/python/master/Lib/test/test_eof.py", line 54, in test_line_continuation_EOF_from_file_bpo2180
    self.assertIn(b'unexpected EOF while parsing', err)
AssertionError: b'unexpected EOF while parsing' not found in b'Parser/tokenizer.c:978:50: runtime error: pointer index expression with base 0x625000016900 overflowed to 0xbebebebebebee6be\n=====================
============================================\n==27549==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x606000027c51 at pc 0x5612210ca7d4 bp 0x7fffe6e9ff70 sp 0x7fffe6e9ff60\nREAD of size 1 at 0x60600
0027c51 thread T0\n    #0 0x5612210ca7d3 in ascii_decode Objects/unicodeobject.c:4941\n    #1 0x5612211c9f4a in unicode_decode_utf8 Objects/unicodeobject.c:4999\n    #2 0x5612219201bd in byte_offset_to_characte
r_offset Parser/pegen.c:148\n    #3 0x5612219201bd in _PyPegen_raise_error_known_location Parser/pegen.c:412\n    #4 0x561221920e4d in _PyPegen_raise_error Parser/pegen.c:373\n    #5 0x561221924981 in tokenizer
_error Parser/pegen.c:321\n    #6 0x561221924981 in _PyPegen_fill_token Parser/pegen.c:638\n    #7 0x56122192777f in _PyPegen_expect_token Parser/pegen.c:753\n    #8 0x56122193817a in _tmp_15_rule Parser/parser
.c:16184\n    #9 0x5612219274f9 in _PyPegen_lookahead (/home/pablogsal/github/python/master/python+0x1c344f9)\n    #10 0x56122199de2c in compound_stmt_rule Parser/parser.c:1860\n    #11 0x5612219a65c2 in statem
ent_rule Parser/parser.c:1224\n    #12 0x5612219a65c2 in _loop1_11_rule Parser/parser.c:15954\n    #13 0x5612219a65c2 in statements_rule Parser/parser.c:1183\n    #14 0x5612219aa4b7 in file_rule Parser/parser.c
:716\n    #15 0x5612219aa4b7 in _PyPegen_parse Parser/parser.c:24401\n    #16 0x56122192a768 in _PyPegen_run_parser Parser/pegen.c:1077\n    #17 0x56122192b3ef in _PyPegen_run_parser_from_file_pointer Parser/pe
gen.c:1137\n    #18 0x5612213da3c6 in PyRun_FileExFlags Python/pythonrun.c:1057\n    #19 0x5612213da72c in PyRun_SimpleFileExFlags Python/pythonrun.c:400\n    #20 0x561220df0dbb in pymain_run_file Modules/main.
c:369\n    #21 0x561220df0dbb in pymain_run_python Modules/main.c:553\n    #22 0x561220df3154 in Py_RunMain Modules/main.c:632\n    #23 0x561220df3154 in pymain_main Modules/main.c:662\n    #24 0x561220df3154 i
n Py_BytesMain Modules/main.c:686\n    #25 0x7f981bf9a001 in __libc_start_main (/usr/lib/libc.so.6+0x27001)\n    #26 0x561220ded48d in _start (/home/pablogsal/github/python/master/python+0x10fa48d)\n\n0x6060000
27c51 is located 0 bytes to the right of 49-byte region [0x606000027c20,0x606000027c51)\nallocated by thread T0 here:\n    #0 0x7f981ccce459 in __interceptor_malloc /build/gcc/src/gcc/libsanitizer/asan/asan_mal
loc_linux.cpp:145\n    #1 0x5612210dfa1d in PyUnicode_New Objects/unicodeobject.c:1437\n    #2 0x56122121324b in _PyUnicode_Init Objects/unicodeobject.c:15535\n    #3 0x5612213ae5c3 in pycore_init_types Python/
pylifecycle.c:599\n    #4 0x5612213ae5c3 in pycore_interp_init Python/pylifecycle.c:724\n    #5 0x5612213b8b4b in pyinit_config Python/pylifecycle.c:765\n    #6 0x5612213b8b4b in pyinit_core Python/pylifecycle.
c:926\n    #7 0x5612213bab6c in Py_InitializeFromConfig Python/pylifecycle.c:1136\n    #8 0x561220ded752 in pymain_init Modules/main.c:66\n    #9 0x561220df310a in pymain_main Modules/main.c:653\n    #10 0x5612
20df310a in Py_BytesMain Modules/main.c:686\n    #11 0x7f981bf9a001 in __libc_start_main (/usr/lib/libc.so.6+0x27001)\n\nSUMMARY: AddressSanitizer: heap-buffer-overflow Objects/unicodeobject.c:4941 in ascii_dec
ode\nShadow bytes around the buggy address:\n  0x0c0c7fffcf30: 00 00 00 00 00 00 00 00 fa fa fa fa 00 00 00 00\n  0x0c0c7fffcf40: 00 00 00 07 fa fa fa fa 00 00 00 00 00 00 00 00\n  0x0c0c7fffcf50: fa fa fa fa 0
0 00 00 00 00 00 00 05 fa fa fa fa\n  0x0c0c7fffcf60: 00 00 00 00 00 00 00 00 fa fa fa fa 00 00 00 00\n  0x0c0c7fffcf70: 00 00 00 00 fa fa fa fa 00 00 00 00 00 00 00 01\n=>0x0c0c7fffcf80: fa fa fa fa 00 00 00 0
0 00 00[01]fa fa fa fa fa\n  0x0c0c7fffcf90: 00 00 00 00 00 00 00 00 fa fa fa fa 00 00 00 00\n  0x0c0c7fffcfa0: 00 00 05 fa fa fa fa fa 00 00 00 00 00 00 00 fa\n  0x0c0c7fffcfb0: fa fa fa fa 00 00 00 00 00 00 0
0 00 fa fa fa fa\n  0x0c0c7fffcfc0: fd fd fd fd fd fd fd fd fa fa fa fa fd fd fd fd\n  0x0c0c7fffcfd0: fd fd fd fd fa fa fa fa 00 00 00 00 00 00 00 fa\nShadow byte legend (one shadow byte represents 8 applicati
on bytes):\n  Addressable:           00\n  Partially addressable: 01 02 03 04 05 06 07 \n  Heap left redzone:       fa\n  Freed heap region:       fd\n  Stack left redzone:      f1\n  Stack mid redzone:       f
2\n  Stack right redzone:     f3\n  Stack after return:      f5\n  Stack use after scope:   f8\n  Global redzone:          f9\n  Global init order:       f6\n  Poisoned by user:        f7\n  Container overflow:
      fc\n  Array cookie:            ac\n  Intra object redzone:    bb\n  ASan internal:           fe\n  Left alloca redzone:     ca\n  Right alloca redzone:    cb\n  Shadow gap:              cc\n==27549==ABORT
ING\n'

test_eof failed

== Tests result: FAILURE ==

1 test failed:
    test_eof

Total duration: 359 ms
Tests result: FAILURE


----------

With this patch

diff --git a/Parser/pegen.c b/Parser/pegen.c
index e29910bf86..a9f24ca5fa 100644
--- a/Parser/pegen.c
+++ b/Parser/pegen.c
@@ -145,15 +145,15 @@ byte_offset_to_character_offset(PyObject *line, int col_offset)
     if (!str) {
         return 0;
     }
+    Py_ssize_t linesize = PyUnicode_GET_LENGTH(line);
+    if (col_offset > linesize) {
+        col_offset = (int)linesize;
+    }
     PyObject *text = PyUnicode_DecodeUTF8(str, col_offset, "replace");
     if (!text) {
         return 0;
     }
     Py_ssize_t size = PyUnicode_GET_LENGTH(text);
-    str = PyUnicode_AsUTF8(text);
-    if (str != NULL && (int)strlen(str) == col_offset) {
-        size = strlen(str);
-    }
     Py_DECREF(text);
     return size;
 }
@@ -400,9 +400,6 @@ _PyPegen_raise_error_known_location(Parser *p, PyObject *errtype,

     if (!error_line) {
         Py_ssize_t size = p->tok->inp - p->tok->buf;
-        if (size && p->tok->buf[size-1] == '\n') {
-            size--;
-        }
         error_line = PyUnicode_DecodeUTF8(p->tok->buf, size, "replace");
         if (!error_line) {
             goto error;


❯ LSAN_OPTIONS="suppressions=asan-suppression.txt,print_suppressions=0" ./python -m test test_eof
0:00:00 load avg: 1.39 Run tests sequentially
0:00:00 load avg: 1.39 [1/1] test_eof

== Tests result: SUCCESS ==

1 test OK.

Total duration: 500 ms
Tests result: SUCCESS
msg371399 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-06-12 17:28
> +    Py_ssize_t linesize = PyUnicode_GET_LENGTH(line);

This line is wrong though, since PyUnicode_GET_LENGTH returns the length in code points and PyUnicode_DecodeUTF8 expects the number of bytes. For non-ascii input this would push the caret further to the left. For example:

$ ./python.exe
Python 3.10.0a0 (heads/master-dirty:e2fb8a2c42, Jun 12 2020, 20:22:52)
[Clang 11.0.0 (clang-1100.0.33.8)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> "Ṕýţĥòñ" +
  File "<stdin>", line 1
    "Ṕýţĥòñ" +
         ^
SyntaxError: invalid syntax
msg371405 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-06-12 18:50
> This line is wrong though, since PyUnicode_GET_LENGTH returns the length in code points

Whoops! Thanks for pointing that out. In any case, the patch still shows that the ASAN error goes away by limiting the access to the string, so I suppose we could turn it into a proper patch.
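A hedged sketch of what clipping by byte length (rather than code-point length) could look like; the shape follows byte_offset_to_character_offset() from the diffs above, but this is an illustration of the idea, not necessarily the exact change that landed in GH-20875.

#include <Python.h>
#include <string.h>

/* Sketch only: clamp the byte-based col_offset against the UTF-8 byte length
 * of the line (strlen of the encoded form), not PyUnicode_GET_LENGTH(),
 * which counts code points. */
static Py_ssize_t
byte_offset_to_character_offset_sketch(PyObject *line, int col_offset)
{
    const char *str = PyUnicode_AsUTF8(line);
    if (!str) {
        return 0;
    }
    Py_ssize_t len = (Py_ssize_t)strlen(str);   /* byte length, not code points */
    if (col_offset > len) {
        col_offset = (int)len;                  /* clip instead of reading past the buffer */
    }
    PyObject *text = PyUnicode_DecodeUTF8(str, col_offset, "replace");
    if (!text) {
        return 0;
    }
    Py_ssize_t size = PyUnicode_GET_LENGTH(text);  /* character offset for the caret */
    Py_DECREF(text);
    return size;
}

int main(void)
{
    Py_Initialize();
    PyObject *line = PyUnicode_FromString("\"Ṕýţĥòñ\" +");
    /* col_offset 16 is the byte position of the '+', as noted above. */
    printf("%zd\n", byte_offset_to_character_offset_sketch(line, 16));   /* prints 9 */
    Py_DECREF(line);
    Py_FinalizeEx();
    return 0;
}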
msg371675 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-06-16 15:49
New changeset 51c5896b6205911d29ac07f167ec7f3cf1cb600d by Pablo Galindo in branch 'master':
bpo-40958: Avoid buffer overflow in the parser when indexing the current line (GH-20875)
https://github.com/python/cpython/commit/51c5896b6205911d29ac07f167ec7f3cf1cb600d
msg371686 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-06-16 17:37
New changeset 7795ae8f05a5b1134a576a372f64699176cac5fb by Miss Islington (bot) in branch '3.9':
bpo-40958: Avoid buffer overflow in the parser when indexing the current line (GH-20875) (GH-20919)
https://github.com/python/cpython/commit/7795ae8f05a5b1134a576a372f64699176cac5fb
msg371831 - (view) Author: neonene (neonene) * Date: 2020-06-18 18:32
FYI, since PR 20875/20919, MSVC (x64) has been emitting warning C4244 (conversion from 'Py_ssize_t' to 'int', possible loss of data).
parser.c alone gets more than 700 of these warnings.
msg371837 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-06-18 21:13
Hummmm, that is because lineno, col_offset, end_lineno and end_col_offset are now Py_ssize_t, so we probably need to adapt the other assignments. Will make a PR soon.
msg371838 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-06-18 21:16
> Will make a PR soon

No need. Already got something ready.
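A minimal sketch (hypothetical names, not the actual GH-20970 diff) of how C4244 arises once those fields are Py_ssize_t, and the two usual remedies: widen the receiving variable, or cast explicitly where an int really is required.

#include <Python.h>
#include <assert.h>
#include <limits.h>

static int lossy(Py_ssize_t lineno) {
    return lineno;          /* MSVC x64: C4244, 64-bit Py_ssize_t narrowed to 32-bit int */
}

static Py_ssize_t widened(Py_ssize_t lineno) {
    return lineno;          /* preferred fix: keep the receiving type Py_ssize_t */
}

static int narrowed_explicitly(Py_ssize_t lineno) {
    assert(lineno <= INT_MAX);
    return (int)lineno;     /* explicit cast where an int genuinely is required */
}

int main(void) {
    printf("%d %zd %d\n", lossy(3), widened(3), narrowed_explicitly(3));
    return 0;
}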
msg371936 - (view) Author: miss-islington (miss-islington) Date: 2020-06-20 12:57
New changeset 861efc6e8fe7f030b1e193989b13287b31385939 by Lysandros Nikolaou in branch 'master':
bpo-40958: Avoid 'possible loss of data' warning on Windows (GH-20970)
https://github.com/python/cpython/commit/861efc6e8fe7f030b1e193989b13287b31385939
msg371945 - (view) Author: miss-islington (miss-islington) Date: 2020-06-20 17:35
New changeset c9f83c173b0cc62d6fcdc363e9ab05f6664ff8f3 by Miss Islington (bot) in branch '3.9':
bpo-40958: Avoid 'possible loss of data' warning on Windows (GH-20970)
https://github.com/python/cpython/commit/c9f83c173b0cc62d6fcdc363e9ab05f6664ff8f3
History
Date User Action Args
2022-04-11 14:59:32  admin  set  github: 85130
2020-06-20 17:35:07  miss-islington  set  messages: + msg371945
2020-06-20 12:57:39  miss-islington  set  pull_requests: + pull_request20176
2020-06-20 12:57:30  miss-islington  set  nosy: + miss-islington; messages: + msg371936
2020-06-19 15:11:05  neonene  set  nosy: + christian.heimes, - neonene
2020-06-18 21:45:53  lys.nikolaou  set  pull_requests: + pull_request20148
2020-06-18 21:19:09  lys.nikolaou  set  pull_requests: + pull_request20146
2020-06-18 21:16:33  lys.nikolaou  set  messages: + msg371838
2020-06-18 21:13:56  pablogsal  set  messages: + msg371837
2020-06-18 18:32:08  neonene  set  nosy: + neonene, - christian.heimes, miss-islington; messages: + msg371831
2020-06-16 17:37:25  pablogsal  set  messages: + msg371686
2020-06-16 15:51:14  pablogsal  set  status: open -> closed; resolution: fixed; stage: patch review -> resolved
2020-06-16 15:49:57  miss-islington  set  nosy: + miss-islington; pull_requests: + pull_request20098
2020-06-16 15:49:51  pablogsal  set  messages: + msg371675
2020-06-15 00:30:46  pablogsal  set  pull_requests: + pull_request20063
2020-06-12 21:18:12  pablogsal  set  keywords: + patch; stage: patch review; pull_requests: + pull_request20035
2020-06-12 18:50:14  pablogsal  set  messages: + msg371405
2020-06-12 17:28:31  lys.nikolaou  set  messages: + msg371399
2020-06-12 16:41:10  pablogsal  set  messages: + msg371395
2020-06-12 16:28:22  lys.nikolaou  set  messages: + msg371393
2020-06-12 16:14:55  pablogsal  set  messages: + msg371391
2020-06-12 16:01:55  lys.nikolaou  set  messages: + msg371389
2020-06-12 15:47:44  pablogsal  set  messages: + msg371388
2020-06-12 15:29:33  lys.nikolaou  set  messages: + msg371384
2020-06-12 15:15:38  pablogsal  set  messages: + msg371382
2020-06-12 15:04:54  lys.nikolaou  set  messages: + msg371381
2020-06-12 14:41:32  pablogsal  set  messages: + msg371379
2020-06-12 14:36:48  lys.nikolaou  set  messages: + msg371378
2020-06-12 14:32:04  pablogsal  set  messages: + msg371376
2020-06-12 14:18:14  lys.nikolaou  set  messages: + msg371375
2020-06-12 14:17:10  lys.nikolaou  set  messages: + msg371374
2020-06-12 14:15:51  lys.nikolaou  set  messages: + msg371373
2020-06-12 14:12:32  pablogsal  set  messages: + msg371372
2020-06-12 14:07:04  christian.heimes  set  messages: + msg371371
2020-06-12 14:05:52  lys.nikolaou  set  messages: + msg371370
2020-06-12 14:03:39  lys.nikolaou  set  messages: + msg371369
2020-06-12 13:53:26  christian.heimes  set  messages: + msg371368
2020-06-12 13:12:57  pablogsal  set  messages: + msg371366
2020-06-12 13:08:51  pablogsal  set  messages: + msg371365
2020-06-12 13:08:36  pablogsal  set  messages: - msg371364
2020-06-12 13:08:21  pablogsal  set  messages: + msg371364
2020-06-12 13:05:00  pablogsal  set  messages: + msg371363
2020-06-12 13:04:14  lys.nikolaou  set  messages: + msg371362
2020-06-12 12:56:09  pablogsal  set  messages: + msg371361
2020-06-12 12:51:59  lys.nikolaou  set  messages: + msg371360
2020-06-12 12:38:15  pablogsal  set  messages: + msg371359
2020-06-12 12:34:49  pablogsal  set  messages: + msg371358
2020-06-12 12:31:01  pablogsal  set  nosy: + lys.nikolaou
2020-06-12 10:42:31  christian.heimes  create