
ASAN/UBSAN: heap-buffer-overflow in pegen.c #85130

Closed
tiran opened this issue Jun 12, 2020 · 37 comments
Labels
3.9 only security fixes 3.10 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-security A security issue

Comments

@tiran
Member

tiran commented Jun 12, 2020

BPO 40958
Nosy @tiran, @lysnikolaou, @pablogsal, @miss-islington
PRs
  • bpo-40958: Avoid buffer overflow in the parser when indexing the current line #20842
  • bpo-40958: Avoid buffer overflow in the parser when indexing the current line #20875
  • [3.9] bpo-40958: Avoid buffer overflow in the parser when indexing the current line (GH-20875) #20919
  • bpo-40958: Avoid 'possible loss of data' build warning on Windows #20968
  • bpo-40958: Avoid 'possible loss of data' warning on Windows #20970
  • [3.9] bpo-40958: Avoid 'possible loss of data' warning on Windows (GH-20970) #21001
    Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2020-06-16.15:51:14.172>
    created_at = <Date 2020-06-12.10:42:31.032>
    labels = ['type-security', 'interpreter-core', '3.9', '3.10']
    title = 'ASAN/UBSAN: heap-buffer-overflow in pegen.c'
    updated_at = <Date 2020-06-20.17:35:07.180>
    user = 'https://github.com/tiran'

    bugs.python.org fields:

    activity = <Date 2020-06-20.17:35:07.180>
    actor = 'miss-islington'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-06-16.15:51:14.172>
    closer = 'pablogsal'
    components = ['Interpreter Core']
    creation = <Date 2020-06-12.10:42:31.032>
    creator = 'christian.heimes'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 40958
    keywords = ['patch']
    message_count = 37.0
    messages = ['371351', '371358', '371359', '371360', '371361', '371362', '371363', '371365', '371366', '371368', '371369', '371370', '371371', '371372', '371373', '371374', '371375', '371376', '371378', '371379', '371381', '371382', '371384', '371388', '371389', '371391', '371393', '371395', '371399', '371405', '371675', '371686', '371831', '371837', '371838', '371936', '371945']
    nosy_count = 4.0
    nosy_names = ['christian.heimes', 'lys.nikolaou', 'pablogsal', 'miss-islington']
    pr_nums = ['20842', '20875', '20919', '20968', '20970', '21001']
    priority = 'high'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'security'
    url = 'https://bugs.python.org/issue40958'
    versions = ['Python 3.9', 'Python 3.10']

    @tiran
    Member Author

    tiran commented Jun 12, 2020

    ASAN/UBSAN has detected a heap-buffer-overflow in pegen.c

    ==1625693==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x606000026b71 at pc 0x00000073574d bp 0x7fff297284f0 sp 0x7fff297284e0
    READ of size 1 at 0x606000026b71 thread T0
    #0 0x73574c in ascii_decode Objects/unicodeobject.c:4941
    #1 0x82bd0f in unicode_decode_utf8 Objects/unicodeobject.c:4999
    #2 0xf35859 in byte_offset_to_character_offset Parser/pegen.c:148
    #3 0xf35859 in _PyPegen_raise_error_known_location Parser/pegen.c:412
    #4 0xf36482 in _PyPegen_raise_error Parser/pegen.c:373
    #5 0xf39e1d in tokenizer_error Parser/pegen.c:321
    #6 0xf39e1d in _PyPegen_fill_token Parser/pegen.c:638
    #7 0xf3ca0f in _PyPegen_expect_token Parser/pegen.c:753
    #8 0xf4cc7a in _tmp_15_rule Parser/parser.c:16184
    #9 0xf3c799 in _PyPegen_lookahead (/home/heimes/dev/python/cpython/python+0xf3c799)
    #10 0xfafb4a in compound_stmt_rule Parser/parser.c:1860
    #11 0xfb7fc2 in statement_rule Parser/parser.c:1224
    #12 0xfb7fc2 in _loop1_11_rule Parser/parser.c:15954
    #13 0xfb7fc2 in statements_rule Parser/parser.c:1183
    #14 0xfbbce7 in file_rule Parser/parser.c:716
    #15 0xfbbce7 in _PyPegen_parse Parser/parser.c:24401
    #16 0xf3f868 in _PyPegen_run_parser Parser/pegen.c:1077
    #17 0xf4044f in _PyPegen_run_parser_from_file_pointer Parser/pegen.c:1137
    #18 0xa27f36 in PyRun_FileExFlags Python/pythonrun.c:1057
    #19 0xa2826a in PyRun_SimpleFileExFlags Python/pythonrun.c:400
    #20 0x479b1b in pymain_run_file Modules/main.c:369
    #21 0x479b1b in pymain_run_python Modules/main.c:553
    #22 0x47bd59 in Py_RunMain Modules/main.c:632
    #23 0x47bd59 in pymain_main Modules/main.c:662
    #24 0x47bd59 in Py_BytesMain Modules/main.c:686
    #25 0x7f59aa5cd041 in __libc_start_main (/lib64/libc.so.6+0x27041)
    #26 0x47643d in _start (/home/heimes/dev/python/cpython/python+0x47643d)

    0x606000026b71 is located 0 bytes to the right of 49-byte region [0x606000026b40,0x606000026b71)
    allocated by thread T0 here:
    #0 0x7f59ab303667 in __interceptor_malloc (/lib64/libasan.so.6+0xb0667)
    #1 0x749c7d in PyUnicode_New Objects/unicodeobject.c:1437
    #2 0x872f15 in _PyUnicode_Init Objects/unicodeobject.c:15535
    #3 0x9fe0ab in pycore_init_types Python/pylifecycle.c:599
    #4 0x9fe0ab in pycore_interp_init Python/pylifecycle.c:724
    #5 0xa07c69 in pyinit_config Python/pylifecycle.c:765
    #6 0xa07c69 in pyinit_core Python/pylifecycle.c:926
    #7 0xa09b17 in Py_InitializeFromConfig Python/pylifecycle.c:1136
    #8 0x4766c2 in pymain_init Modules/main.c:66
    #9 0x47bd12 in pymain_main Modules/main.c:653
    #10 0x47bd12 in Py_BytesMain Modules/main.c:686
    #11 0x7f59aa5cd041 in __libc_start_main (/lib64/libc.so.6+0x27041)

    SUMMARY: AddressSanitizer: heap-buffer-overflow Objects/unicodeobject.c:4941 in ascii_decode
    Shadow bytes around the buggy address:
    0x0c0c7fffcd10: fa fa fa fa 00 00 00 00 00 00 00 00 fa fa fa fa
    0x0c0c7fffcd20: 00 00 00 00 00 00 00 07 fa fa fa fa 00 00 00 00
    0x0c0c7fffcd30: 00 00 00 00 fa fa fa fa 00 00 00 00 00 00 00 05
    0x0c0c7fffcd40: fa fa fa fa 00 00 00 00 00 00 00 00 fa fa fa fa
    0x0c0c7fffcd50: 00 00 00 00 00 00 00 00 fa fa fa fa 00 00 00 00
    =>0x0c0c7fffcd60: 00 00 00 01 fa fa fa fa 00 00 00 00 00 00[01]fa
    0x0c0c7fffcd70: fa fa fa fa 00 00 00 00 00 00 00 00 fa fa fa fa
    0x0c0c7fffcd80: 00 00 00 00 00 00 05 fa fa fa fa fa 00 00 00 00
    0x0c0c7fffcd90: 00 00 00 fa fa fa fa fa 00 00 00 00 00 00 00 00
    0x0c0c7fffcda0: fa fa fa fa fd fd fd fd fd fd fd fd fa fa fa fa
    0x0c0c7fffcdb0: fd fd fd fd fd fd fd fd fa fa fa fa 00 00 00 00
    Shadow byte legend (one shadow byte represents 8 application bytes):
    Addressable: 00
    Partially addressable: 01 02 03 04 05 06 07
    Heap left redzone: fa
    Freed heap region: fd
    Stack left redzone: f1
    Stack mid redzone: f2
    Stack right redzone: f3
    Stack after return: f5
    Stack use after scope: f8
    Global redzone: f9
    Global init order: f6
    Poisoned by user: f7
    Container overflow: fc
    Array cookie: ac
    Intra object redzone: bb
    ASan internal: fe
    Left alloca redzone: ca
    Right alloca redzone: cb
    Shadow gap: cc
    ==1625693==ABORTING

    @tiran tiran added 3.9 only security fixes 3.10 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-security A security issue labels Jun 12, 2020
    @pablogsal
    Member

    Lysandros, could you take a look?

    @pablogsal
    Member

    Christian, what test are you running to get the error?

    @lysnikolaou
    Contributor

    Lysandros, could you take a look?

    Yup, I'm on it.

    @pablogsal
    Member

    Note that although we could just exit if the length of the line is smaller than the column offset before calling https://github.com/python/cpython/blob/master/Parser/pegen.c#L148 (I assume that is the problem), it is more important to understand how we are reaching that situation.

    @lysnikolaou
    Contributor

    Agreed. For that I'd probably need the input Christian is getting the error with. I'm currently trying things out, but haven't gotten the error yet.

    @pablogsal
    Member

    Seems that just this (at top level):

    > yield from

    is able to produce a situation where the column offset is bigger than the line length:

    Col offset is 11 and line is 10

    @pablogsal
    Member

    Correcting my previous post (deleted):

    You can get an assert failure without ASAN with this patch:

    diff --git a/Parser/pegen.c b/Parser/pegen.c
    index e29910bf86..65fa44921f 100644
    --- a/Parser/pegen.c
    +++ b/Parser/pegen.c
    @@ -145,6 +145,9 @@ byte_offset_to_character_offset(PyObject *line, int col_offset)
         if (!str) {
             return 0;
         }
    +    assert(col_offset <= PyUnicode_GET_LENGTH(line));
         PyObject *text = PyUnicode_DecodeUTF8(str, col_offset, "replace");
         if (!text) {
             return 0;

    @pablogsal
    Member

    I theorize that this happens when the error points to the end of the line. I assume that the difference between col_offset and PyUnicode_GET_LENGTH(line) may still be bigger than 1 if the line contains some Unicode stuff, as with:

    > "Ṕýţĥòñ" +

    but I think that would be the same situation.

    @tiran
    Member Author

    tiran commented Jun 12, 2020

    I'm using an ASAN UBSAN build of Python:

    $ make clean
    $ ./configure --with-address-sanitizer --with-undefined-behavior-sanitizer
    $ cat > asan-suppression.txt << EOF
    # Python/initconfig.c
    leak:_PyWideStringList_Copy
    EOF
    $ LSAN_OPTIONS="suppressions=asan-suppression.txt,print_suppressions=0"  make
    $ LSAN_OPTIONS="suppressions=asan-suppression.txt,print_suppressions=0" ./python -m test test_eof

    @lysnikolaou
    Contributor

    I cannot reproduce this on my Ubuntu 20.04:

    ➜ cpython git:(master) ✗ LSAN_OPTIONS="suppressions=asan-suppression.txt,print_suppressions=0" ./python -m test -v test_eof
    Modules/posixmodule.c:14682:9: runtime error: left shift of 34 by 26 places cannot be represented in type 'int'
    == CPython 3.10.0a0 (heads/master:e2fb8a2c42, Jun 12 2020, 16:59:27) [GCC 9.3.0]
    == Linux-5.4.0-33-generic-x86_64-with-glibc2.31 little-endian
    == cwd: /home/lysnikolaou/repos/cpython/build/test_python_71258
    == CPU count: 4
    == encodings: locale=UTF-8, FS=utf-8
    0:00:00 load avg: 2.11 Run tests sequentially
    0:00:00 load avg: 2.11 [1/1] test_eof
    test_EOFC (test.test_eof.EOFTestCase) ... ok
    test_EOFS (test.test_eof.EOFTestCase) ... ok
    test_eof_with_line_continuation (test.test_eof.EOFTestCase) ... ok
    test_line_continuation_EOF (test.test_eof.EOFTestCase)
    A continuation at the end of input must be an error; bpo-2180. ... ok
    test_line_continuation_EOF_from_file_bpo2180 (test.test_eof.EOFTestCase)
    Ensure tok_nextc() does not add too many ending newlines. ... Modules/posixmodule.c:14682:9: runtime error: left shift of 34 by 26 places cannot be represented in type 'int'
    ok

    ----------------------------------------------------------------------

    Ran 5 tests in 0.413s

    OK

    == Tests result: SUCCESS ==

    1 test OK.

    Total duration: 654 ms
    Tests result: SUCCESS

    @lysnikolaou
    Contributor

    Running Pablo's examples in the REPL does not reproduce the errors either:

    ➜  cpython git:(master) ✗ LSAN_OPTIONS="suppressions=asan-suppression.txt,print_suppressions=0" ./python                    
    Modules/posixmodule.c:14682:9: runtime error: left shift of 34 by 26 places cannot be represented in type 'int'
    Python 3.10.0a0 (heads/master:e2fb8a2c42, Jun 12 2020, 16:59:27) 
    [GCC 9.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> "Ṕýţĥòñ" +
      File "<stdin>", line 1
        "Ṕýţĥòñ" +
                  ^
    SyntaxError: invalid syntax
    >>> yield from
      File "<stdin>", line 1
        yield from
                  ^
    SyntaxError: invalid syntax
    >>> ^D
    ➜  cpython git:(master) ✗

    @tiran
    Member Author

    tiran commented Jun 12, 2020

    Did you clean and compile with asan/ubsan?

    Older compilers may not detect the issue. My system has GCC 10:
    libubsan-10.1.1-1.fc32.x86_64
    gcc-10.1.1-1.fc32.x86_64
    libasan-10.1.1-1.fc32.x86_64

    @pablogsal
    Member

    Lysandros, are you using a debug build? The assert() won't trigger if you haven't run configure with --with-pydebug. I would recommend:

    make distclean && ./configure --with-pydebug && make

    @lysnikolaou
    Contributor

    Let me try that. I'm currently using gcc 9.3.

    @lysnikolaou
    Contributor

    I now configured Python with ./configure --with-pydebug --with-address-sanitizer --with-undefined-behavior-sanitizer.

    @lysnikolaou
    Contributor

    Lysandros, are you using a debug build? The assert() won't trigger if you haven't run configure with --with-pydebug.

    BTW I'm not talking about the assert not triggering. I'm only saying that ASAN/UBSAN do not report an error when running yield from in the REPL.

    @pablogsal
    Member

    Oh, I was confused by this:

    Running Pablo's examples in the REPL does not reproduce the errors either:

    I thought you meant that you could not reproduce the crash with my patch either. For ASAN, you may need the newer gcc and related libraries. I was not able to reproduce this on my work macOS machine, but I will try tonight on my Arch Linux box, which has the same versions as Christian.

    @lysnikolaou
    Contributor

    Is building from source the only way to get gcc-10.1? Because my internet connection is currently too slow to download all of gcc, and apt install gcc-10 installs 10.0.1, which I'm getting totally unrelated errors with.

    @pablogsal
    Member

    Is building from source the only way to get gcc-10.1? Because my internet connection is currently too slow to download all of gcc, and apt install gcc-10 installs 10.0.1, which I'm getting totally unrelated errors with.

    If your package manager does not have the package and you cannot find a PPA, you would need to build it from source, or maybe use Docker with a distribution that has GCC 10, like Arch.

    @lysnikolaou
    Contributor

    Note that although we could just exit if the length of the line is smaller than the column offset before calling https://github.com/python/cpython/blob/master/Parser/pegen.c#L148 (I assume that is the problem), it is more important to understand how we are reaching that situation.

    That's because of this line:

    size--;

    which decreases the size of the string to be decoded by 1 if the last character is a newline (or, equivalently, if the error offset points past the end of the line). Thus, PyUnicode_DecodeUTF8 returns an object that is one character shorter than col_offset, and that's how we get to the situation you mentioned.

    @pablogsal
    Member

    returns an object that is one character shorter than the col_offset and that's how we get to the situation you mentioned.

    What happens with:

    >> "Ṕýţĥòñ" +

    @lysnikolaou
    Contributor

    What happens with:

    >> "Ṕýţĥòñ" +

    Exact same thing. The offset is 16 at the start and gets decreased to 15 in the line I linked to in my previous post. Then col_offset gets converted to a col_number of 9, which seems correct.

    Although it is correct, it feels correct by accident, since the if-branch on

    if (str != NULL && (int)strlen(str) == col_offset) {

    only evaluates to false due to the decrease of size, and I don't know if that was intended.

    @pablogsal
    Member

    Well, I think the solution here is to clip col_offset to the line length, and maybe add an assert that the difference is not bigger than 1.

    @lysnikolaou
    Contributor

    I'm guessing that some parts of the conversion code were only there to circumvent issues in displaying the error messages that weren't pegen's fault. These were fixed by Guido in #64271, so I think we can delete some of them. For example, this patch works just fine for me.

    diff --git a/Parser/pegen.c b/Parser/pegen.c
    index e29910bf86..2c348178fb 100644
    --- a/Parser/pegen.c
    +++ b/Parser/pegen.c
    @@ -150,10 +150,6 @@ byte_offset_to_character_offset(PyObject *line, int col_offset)
             return 0;
         }
         Py_ssize_t size = PyUnicode_GET_LENGTH(text);
    -    str = PyUnicode_AsUTF8(text);
    -    if (str != NULL && (int)strlen(str) == col_offset) {
    -        size = strlen(str);
    -    }
         Py_DECREF(text);
         return size;
     }
    @@ -400,9 +396,6 @@ _PyPegen_raise_error_known_location(Parser *p, PyObject *errtype,
         if (!error_line) {
             Py_ssize_t size = p->tok->inp - p->tok->buf;
    -        if (size && p->tok->buf[size-1] == '\n') {
    -            size--;
    -        }
             error_line = PyUnicode_DecodeUTF8(p->tok->buf, size, "replace");
             if (!error_line) {
                 goto error;

    @pablogsal
    Member

    Can you run the pyflakes test suite before and after, to double-check that we don't introduce any regressions? If everything looks fine, open a PR :)

    @lysnikolaou
    Contributor

    Can you run the pyflakes test suite before and after, to double-check that we don't introduce any regressions? If everything looks fine, open a PR :)

    The exact same errors before and after! I'll wait on the PR though until someone has checked whether this indeed solves the heap-buffer-overflow. I'm not at all confident that it does, and I can't test.

    @pablogsal
    Member

    Ok, I was able to reproduce:

    ❯ gcc --version
    gcc (GCC) 10.1.0
    Copyright (C) 2020 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions. There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

    make distclean
    ./configure --with-address-sanitizer --with-undefined-behavior-sanitizer
    LSAN_OPTIONS="suppressions=asan-suppression.txt,print_suppressions=0" make -j

    ❯ LSAN_OPTIONS="suppressions=asan-suppression.txt,print_suppressions=0" ./python -m test test_eof
    0:00:00 load avg: 1.82 Run tests sequentially
    0:00:00 load avg: 1.82 [1/1] test_eof
    test test_eof failed -- Traceback (most recent call last):
      File "/home/pablogsal/github/python/master/Lib/test/test_eof.py", line 54, in test_line_continuation_EOF_from_file_bpo2180
        self.assertIn(b'unexpected EOF while parsing', err)
    AssertionError: b'unexpected EOF while parsing' not found in:

    Parser/tokenizer.c:978:50: runtime error: pointer index expression with base 0x625000016900 overflowed to 0xbebebebebebee6be
    =================================================================
    ==27549==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x606000027c51 at pc 0x5612210ca7d4 bp 0x7fffe6e9ff70 sp 0x7fffe6e9ff60
    READ of size 1 at 0x606000027c51 thread T0
    #0 0x5612210ca7d3 in ascii_decode Objects/unicodeobject.c:4941
    #1 0x5612211c9f4a in unicode_decode_utf8 Objects/unicodeobject.c:4999
    #2 0x5612219201bd in byte_offset_to_character_offset Parser/pegen.c:148
    #3 0x5612219201bd in _PyPegen_raise_error_known_location Parser/pegen.c:412
    #4 0x561221920e4d in _PyPegen_raise_error Parser/pegen.c:373
    #5 0x561221924981 in tokenizer_error Parser/pegen.c:321
    #6 0x561221924981 in _PyPegen_fill_token Parser/pegen.c:638
    #7 0x56122192777f in _PyPegen_expect_token Parser/pegen.c:753
    #8 0x56122193817a in _tmp_15_rule Parser/parser.c:16184
    #9 0x5612219274f9 in _PyPegen_lookahead (/home/pablogsal/github/python/master/python+0x1c344f9)
    #10 0x56122199de2c in compound_stmt_rule Parser/parser.c:1860
    #11 0x5612219a65c2 in statement_rule Parser/parser.c:1224
    #12 0x5612219a65c2 in _loop1_11_rule Parser/parser.c:15954
    #13 0x5612219a65c2 in statements_rule Parser/parser.c:1183
    #14 0x5612219aa4b7 in file_rule Parser/parser.c:716
    #15 0x5612219aa4b7 in _PyPegen_parse Parser/parser.c:24401
    #16 0x56122192a768 in _PyPegen_run_parser Parser/pegen.c:1077
    #17 0x56122192b3ef in _PyPegen_run_parser_from_file_pointer Parser/pegen.c:1137
    #18 0x5612213da3c6 in PyRun_FileExFlags Python/pythonrun.c:1057
    #19 0x5612213da72c in PyRun_SimpleFileExFlags Python/pythonrun.c:400
    #20 0x561220df0dbb in pymain_run_file Modules/main.c:369
    #21 0x561220df0dbb in pymain_run_python Modules/main.c:553
    #22 0x561220df3154 in Py_RunMain Modules/main.c:632
    #23 0x561220df3154 in pymain_main Modules/main.c:662
    #24 0x561220df3154 in Py_BytesMain Modules/main.c:686
    #25 0x7f981bf9a001 in __libc_start_main (/usr/lib/libc.so.6+0x27001)
    #26 0x561220ded48d in _start (/home/pablogsal/github/python/master/python+0x10fa48d)

    0x606000027c51 is located 0 bytes to the right of 49-byte region [0x606000027c20,0x606000027c51)
    allocated by thread T0 here:
    #0 0x7f981ccce459 in __interceptor_malloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:145
    #1 0x5612210dfa1d in PyUnicode_New Objects/unicodeobject.c:1437
    #2 0x56122121324b in _PyUnicode_Init Objects/unicodeobject.c:15535
    #3 0x5612213ae5c3 in pycore_init_types Python/pylifecycle.c:599
    #4 0x5612213ae5c3 in pycore_interp_init Python/pylifecycle.c:724
    #5 0x5612213b8b4b in pyinit_config Python/pylifecycle.c:765
    #6 0x5612213b8b4b in pyinit_core Python/pylifecycle.c:926
    #7 0x5612213bab6c in Py_InitializeFromConfig Python/pylifecycle.c:1136
    #8 0x561220ded752 in pymain_init Modules/main.c:66
    #9 0x561220df310a in pymain_main Modules/main.c:653
    #10 0x561220df310a in Py_BytesMain Modules/main.c:686
    #11 0x7f981bf9a001 in __libc_start_main (/usr/lib/libc.so.6+0x27001)

    SUMMARY: AddressSanitizer: heap-buffer-overflow Objects/unicodeobject.c:4941 in ascii_decode
    Shadow bytes around the buggy address:
    0x0c0c7fffcf30: 00 00 00 00 00 00 00 00 fa fa fa fa 00 00 00 00
    0x0c0c7fffcf40: 00 00 00 07 fa fa fa fa 00 00 00 00 00 00 00 00
    0x0c0c7fffcf50: fa fa fa fa 00 00 00 00 00 00 00 05 fa fa fa fa
    0x0c0c7fffcf60: 00 00 00 00 00 00 00 00 fa fa fa fa 00 00 00 00
    0x0c0c7fffcf70: 00 00 00 00 fa fa fa fa 00 00 00 00 00 00 00 01
    =>0x0c0c7fffcf80: fa fa fa fa 00 00 00 00 00 00[01]fa fa fa fa fa
    0x0c0c7fffcf90: 00 00 00 00 00 00 00 00 fa fa fa fa 00 00 00 00
    0x0c0c7fffcfa0: 00 00 05 fa fa fa fa fa 00 00 00 00 00 00 00 fa
    0x0c0c7fffcfb0: fa fa fa fa 00 00 00 00 00 00 00 00 fa fa fa fa
    0x0c0c7fffcfc0: fd fd fd fd fd fd fd fd fa fa fa fa fd fd fd fd
    0x0c0c7fffcfd0: fd fd fd fd fa fa fa fa 00 00 00 00 00 00 00 fa
    Shadow byte legend (one shadow byte represents 8 application bytes):
    Addressable: 00
    Partially addressable: 01 02 03 04 05 06 07
    Heap left redzone: fa
    Freed heap region: fd
    Stack left redzone: f1
    Stack mid redzone: f2
    Stack right redzone: f3
    Stack after return: f5
    Stack use after scope: f8
    Global redzone: f9
    Global init order: f6
    Poisoned by user: f7
    Container overflow: fc
    Array cookie: ac
    Intra object redzone: bb
    ASan internal: fe
    Left alloca redzone: ca
    Right alloca redzone: cb
    Shadow gap: cc
    ==27549==ABORTING

    test_eof failed

    == Tests result: FAILURE ==

    1 test failed:
    test_eof

    Total duration: 359 ms
    Tests result: FAILURE

    ----------

    With this patch

    diff --git a/Parser/pegen.c b/Parser/pegen.c
    index e29910bf86..a9f24ca5fa 100644
    --- a/Parser/pegen.c
    +++ b/Parser/pegen.c
    @@ -145,15 +145,15 @@ byte_offset_to_character_offset(PyObject *line, int col_offset)
         if (!str) {
             return 0;
         }
    +    Py_ssize_t linesize = PyUnicode_GET_LENGTH(line);
    +    if (col_offset > linesize) {
    +        col_offset = (int)linesize;
    +    }
         PyObject *text = PyUnicode_DecodeUTF8(str, col_offset, "replace");
         if (!text) {
             return 0;
         }
         Py_ssize_t size = PyUnicode_GET_LENGTH(text);
    -    str = PyUnicode_AsUTF8(text);
    -    if (str != NULL && (int)strlen(str) == col_offset) {
    -        size = strlen(str);
    -    }
         Py_DECREF(text);
         return size;
     }
    @@ -400,9 +400,6 @@ _PyPegen_raise_error_known_location(Parser *p, PyObject *errtype,
         if (!error_line) {
             Py_ssize_t size = p->tok->inp - p->tok->buf;
    -        if (size && p->tok->buf[size-1] == '\n') {
    -            size--;
    -        }
             error_line = PyUnicode_DecodeUTF8(p->tok->buf, size, "replace");
             if (!error_line) {
                 goto error;

    ❯ LSAN_OPTIONS="suppressions=asan-suppression.txt,print_suppressions=0" ./python -m test test_eof
    0:00:00 load avg: 1.39 Run tests sequentially
    0:00:00 load avg: 1.39 [1/1] test_eof

    == Tests result: SUCCESS ==

    1 test OK.

    Total duration: 500 ms
    Tests result: SUCCESS

    @lysnikolaou
    Contributor

    Py_ssize_t linesize = PyUnicode_GET_LENGTH(line);

    This line is wrong though, since PyUnicode_GET_LENGTH returns the length in code points, while PyUnicode_DecodeUTF8 expects a number of bytes. For non-ASCII input this would push the caret further to the left. For example:

    $ ./python.exe
    Python 3.10.0a0 (heads/master-dirty:e2fb8a2c42, Jun 12 2020, 20:22:52)
    [Clang 11.0.0 (clang-1100.0.33.8)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> "Ṕýţĥòñ" +
      File "<stdin>", line 1
        "Ṕýţĥòñ" +
             ^
    SyntaxError: invalid syntax

    @pablogsal
    Member

    This line is wrong though, since PyUnicode_GET_LENGTH returns the length in code points

    Whoops! Thanks for pointing that out. In any case, the patch still shows that the ASAN error goes away by limiting the access to the string, so I suppose we could turn it into a proper patch.

    @pablogsal
    Member

    New changeset 51c5896 by Pablo Galindo in branch 'master':
    bpo-40958: Avoid buffer overflow in the parser when indexing the current line (GH-20875)
    51c5896

    @pablogsal
    Member

    New changeset 7795ae8 by Miss Islington (bot) in branch '3.9':
    bpo-40958: Avoid buffer overflow in the parser when indexing the current line (GH-20875) (GH-20919)
    7795ae8

    @neonene
    Mannequin

    neonene mannequin commented Jun 18, 2020

    FYI, since PR 20875/20919, MSVC (x64) has warned with C4244 (conversion from 'Py_ssize_t' to 'int', possible loss of data).
    parser.c especially gets more than 700 of them.

    @pablogsal
    Member

    Hmmm, that is because lineno, col_offset, end_lineno and end_col_offset are now Py_ssize_t, so we probably need to adapt the other assignments. Will make a PR soon.

    @lysnikolaou
    Contributor

    Will make a PR soon

    No need. Already got something ready.

    @miss-islington
    Contributor

    New changeset 861efc6 by Lysandros Nikolaou in branch 'master':
    bpo-40958: Avoid 'possible loss of data' warning on Windows (GH-20970)
    861efc6

    @miss-islington
    Contributor

    New changeset c9f83c1 by Miss Islington (bot) in branch '3.9':
    bpo-40958: Avoid 'possible loss of data' warning on Windows (GH-20970)
    c9f83c1

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022