repl segfaults on non utf-8 input #91273

Closed
jooon mannequin opened this issue Mar 25, 2022 · 8 comments
Labels
3.10 only security fixes 3.11 only security fixes type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@jooon
Mannequin

jooon mannequin commented Mar 25, 2022

BPO 47117
Nosy @pablogsal, @miss-islington, @tirkarthi, @jooon
PRs
  • bpo-47117: Don't crash if we fail to decode characters when the tokenizer buffers are uninitialized #32129
  • [3.10] bpo-47117: Don't crash if we fail to decode characters when the tokenizer buffers are uninitialized (GH-32129) #32130
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2022-03-26.18:26:31.861>
    created_at = <Date 2022-03-25.10:06:05.710>
    labels = ['3.10', 'type-crash', '3.11']
    title = 'repl segfaults on non utf-8 input'
    updated_at = <Date 2022-03-26.18:26:31.861>
    user = 'https://github.com/jooon'

    bugs.python.org fields:

    activity = <Date 2022-03-26.18:26:31.861>
    actor = 'pablogsal'
    assignee = 'none'
    closed = True
    closed_date = <Date 2022-03-26.18:26:31.861>
    closer = 'pablogsal'
    components = []
    creation = <Date 2022-03-25.10:06:05.710>
    creator = 'jooon'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 47117
    keywords = ['patch']
    message_count = 8.0
    messages = ['415992', '416004', '416005', '416006', '416070', '416072', '416079', '416080']
    nosy_count = 4.0
    nosy_names = ['pablogsal', 'miss-islington', 'xtreak', 'jooon']
    pr_nums = ['32129', '32130']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'crash'
    url = 'https://bugs.python.org/issue47117'
    versions = ['Python 3.10', 'Python 3.11']

    @jooon
    Mannequin Author

    jooon mannequin commented Mar 25, 2022

    Some non-UTF-8 bytes segfault the Python REPL in 3.10 and later on Linux. Example:

    $ python3.10
    Python 3.10.4 (main, Mar 24 2022, 14:20:44) [GCC 9.4.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> �
    Segmentation fault (core dumped)

    The same input is handled correctly in Python 3.9 and earlier:

    $ python3.9
    Python 3.9.12 (main, Mar 24 2022, 14:21:53) 
    [GCC 9.4.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> �
      File "<stdin>", line 0
        
    SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xb6 in position 0: invalid start byte
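The byte named in the 3.9 error message can be checked outside the REPL. This is a minimal sketch that assumes only the standard `bytes.decode` behavior; the helper name `decode_error_reason` is hypothetical. The byte 0xb6 has the bit pattern 10110110, which marks a UTF-8 continuation byte, so it cannot start a character.

```python
# 0xb6 is a UTF-8 continuation byte (bit pattern 10xxxxxx),
# so it can never be the first byte of an encoded character.
raw = b"\xb6"


def decode_error_reason(data: bytes) -> str:
    """Return the codec's reason for rejecting data, or "" if it decodes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.reason
    return ""


# Mask the top two bits and compare against the continuation-byte prefix.
is_continuation = (raw[0] & 0b1100_0000) == 0b1000_0000
reason = decode_error_reason(raw)
print(is_continuation, reason)
```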

    How to reproduce:

    In Gnome on Ubuntu 20.04 with the Swedish keyboard layout, holding left alt and pressing the ö key enters the byte 0xb6 into the terminal.

    I have only been able to make it crash the REPL. I can't make it crash the parser in other ways, for instance by trying to eval the byte.
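The "eval the byte" attempt might look like the sketch below. The helper `try_compile` is hypothetical, and the exact exception type is treated as an assumption: a `SyntaxError` wrapping the decode error is expected, but the code accepts any of the related error types rather than rely on that.

```python
def try_compile(source: bytes) -> str:
    """Compile source in eval mode and report the outcome as a short string."""
    try:
        compile(source, "<string>", "eval")
    except (SyntaxError, UnicodeError, ValueError) as exc:
        # Report the exception class name instead of letting it propagate.
        return type(exc).__name__
    return "ok"


print(try_compile(b"1 + 1"))  # valid source compiles
print(try_compile(b"\xb6"))   # the lone non-UTF-8 byte is rejected with an exception
```

The point is that this path fails with an ordinary Python exception rather than a hard crash, which matches the observation that only the interactive prompt was affected.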

    @jooon jooon mannequin added 3.10 only security fixes 3.11 only security fixes type-crash A hard crash of the interpreter, possibly with a core dump labels Mar 25, 2022
    @tirkarthi
    Member

    This looks similar to https://bugs.python.org/issue46206

    @jooon
    Mannequin Author

    jooon mannequin commented Mar 25, 2022

    Yes, I think they are the same. I can reproduce the emoji crash, and it is much easier to reproduce: no Swedish keyboard layout is needed.

    1. Copy _😀
    2. Start Python with a non-Unicode locale: LC_ALL=C python3.10
    3. Paste in _😀
    4. Press backspace once. It will look like the 2-character-wide emoji is replaced by a 1-character-wide space.
    5. Press return
    6. Crash
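One plausible reading of these steps is sketched below. It assumes that in a byte-oriented locale such as C, backspace removes only the final byte of the pasted emoji rather than the whole character. The thread does not confirm this. Under that assumption the line buffer ends with a truncated UTF-8 sequence.

```python
# "_😀" encodes to five bytes: one for "_" and four for U+1F600.
pasted = "_😀".encode("utf-8")

# Assumption: a single backspace in the C locale drops only the last byte.
after_backspace = pasted[:-1]


def decode_error_reason(data: bytes) -> str:
    """Return the codec's reason for rejecting data, or "" if it decodes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.reason
    return ""


print(pasted)
print(after_backspace)
print(decode_error_reason(after_backspace))
```

What remains after the underscore is a three-byte prefix of a four-byte character, which the UTF-8 codec rejects.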

    @jooon
    Mannequin Author

    jooon mannequin commented Mar 25, 2022

    The backtrace is very similar too:

    (gdb) run
    Starting program: /home/jon/.pyenv/versions/3.10.4/bin/python3.10 
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
    Python 3.10.4 (main, Mar 24 2022, 14:20:44) [GCC 9.4.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> _ 
    
    Program received signal SIGSEGV, Segmentation fault.
    __strchr_avx2 () at ../sysdeps/x86_64/multiarch/strchr-avx2.S:57
    57	../sysdeps/x86_64/multiarch/strchr-avx2.S: No such file or directory.
    (gdb) bt
    #0  __strchr_avx2 () at ../sysdeps/x86_64/multiarch/strchr-avx2.S:57
    #1  0x00005555557d4a7a in get_error_line (lineno=lineno@entry=0, p=<optimized out>, p=<optimized out>) at Parser/pegen.c:443
    #2  0x00005555557d541b in _PyPegen_raise_error_known_location (p=0x7ffff7885ed0, 
        errtype=0x5555558fe420 <_PyExc_SyntaxError>, lineno=0, col_offset=0, end_lineno=0, end_col_offset=-1, 
        errmsg=0x5555558a2dd3 "(%s) %U", va=0x7fffffffd410) at Parser/pegen.c:499
    #3  0x00005555557d5646 in _PyPegen_raise_error (p=p@entry=0x7ffff7885ed0, errtype=<optimized out>, 
        errmsg=errmsg@entry=0x5555558a2dd3 "(%s) %U") at Parser/pegen.c:422
    #4  0x00005555557d5839 in raise_decode_error (p=p@entry=0x7ffff7885ed0) at Parser/pegen.c:271
    #5  0x00005555557d6193 in initialize_token (token_type=60, end=0x0, start=<optimized out>, token=0x7ffff7a55d10, 
        p=0x7ffff7885ed0) at Parser/pegen.c:720
    #6  _PyPegen_fill_token (p=p@entry=0x7ffff7885ed0) at Parser/pegen.c:793
    #7  0x00005555557fec00 in statement_newline_rule (p=0x7ffff7885ed0) at Parser/parser.c:1080
    #8  interactive_rule (p=0x7ffff7885ed0) at Parser/parser.c:1002
    #9  _PyPegen_parse (p=p@entry=0x7ffff7885ed0) at Parser/parser.c:34508
    #10 0x00005555557d6c60 in _PyPegen_run_parser (p=0x7ffff7885ed0) at Parser/pegen.c:1342
    #11 0x00005555557d718f in _PyPegen_run_parser_from_file_pointer (fp=fp@entry=0x7ffff7e29980 <_IO_2_1_stdin_>, 
        start_rule=start_rule@entry=256, filename_ob=filename_ob@entry=0x7ffff7a85670, enc=enc@entry=0x7ffff7a7c1a0 "utf-8", 
        ps1=<optimized out>, ps1@entry=0x1e000000160 <error: Cannot access memory at address 0x1e000000160>, 
        ps2=ps2@entry=0xe0000001a0 <error: Cannot access memory at address 0xe0000001a0>, flags=0x7fffffffd7f8, 
        errcode=0x7fffffffd724, arena=0x7ffff792cc70) at Parser/pegen.c:1448
    #12 0x000055555575661c in _PyParser_ASTFromFile (fp=fp@entry=0x7ffff7e29980 <_IO_2_1_stdin_>, 
        filename_ob=filename_ob@entry=0x7ffff7a85670, enc=enc@entry=0x7ffff7a7c1a0 "utf-8", mode=mode@entry=256, 
        ps1=0x1e000000160 <error: Cannot access memory at address 0x1e000000160>, ps1@entry=0x7ffff7acf960 ">>> ", 
        ps2=0xe0000001a0 <error: Cannot access memory at address 0xe0000001a0>, ps2@entry=0x7ffff7af02e0 "... ", 
        flags=<optimized out>, errcode=<optimized out>, arena=<optimized out>) at Parser/peg_api.c:26
    #13 0x00005555556cad97 in PyRun_InteractiveOneObjectEx (fp=fp@entry=0x7ffff7e29980 <_IO_2_1_stdin_>, filename=filename@entry=0x7ffff7a85670, flags=flags@entry=0x7fffffffd7f8) at Python/pythonrun.c:257
    #14 0x00005555556cba26 in _PyRun_InteractiveLoopObject (fp=fp@entry=0x7ffff7e29980 <_IO_2_1_stdin_>, filename=filename@entry=0x7ffff7a85670, flags=flags@entry=0x7fffffffd7f8) at Python/pythonrun.c:148
    #15 0x00005555556cc5ce in _PyRun_AnyFileObject (flags=<optimized out>, closeit=<optimized out>, filename=0x7ffff7a85670, fp=<optimized out>) at Python/pythonrun.c:84
    #16 PyRun_AnyFileExFlags (fp=0x7ffff7e29980 <_IO_2_1_stdin_>, filename=filename@entry=0x555555802103 "<stdin>", closeit=closeit@entry=0, flags=flags@entry=0x7fffffffd7f8) at Python/pythonrun.c:116
    #17 0x00005555555bb5c7 in pymain_run_stdin (config=0x555555932ce0) at Modules/main.c:502
    #18 pymain_run_python (exitcode=exitcode@entry=0x7fffffffd930) at Modules/main.c:590
    #19 0x00005555555bba1f in Py_RunMain () at Modules/main.c:666
    #20 pymain_main (args=0x7fffffffd8f0) at Modules/main.c:696
    #21 Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:720
    #22 0x00007ffff7c610b3 in __libc_start_main (main=0x5555555aedb0 <main>, argc=1, argv=0x7fffffffda58, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffda48)
        at ../csu/libc-start.c:308
    #23 0x00005555555ba57e in _start () at ./Include/internal/pycore_pyerrors.h:14
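A rough way to check a given interpreter without a terminal is sketched below. It assumes that piping bytes into `python -i` exercises the same interactive code path as typing at a prompt, which the thread does not confirm, so a clean exit here does not prove that the interpreter is unaffected. The helper name `run_repl_with_bytes` is hypothetical.

```python
import subprocess
import sys


def run_repl_with_bytes(data: bytes, timeout: float = 30.0) -> int:
    """Feed raw bytes to an interactive interpreter and return its exit status.

    On POSIX a negative return code means the child was killed by a signal,
    for example -11 for SIGSEGV.
    """
    proc = subprocess.run(
        [sys.executable, "-i", "-q"],  # -i forces interactive mode, -q hides the banner
        input=data,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return proc.returncode


status = run_repl_with_bytes(b"\xb6\n")
if status < 0:
    print(f"interpreter was killed by signal {-status}")
else:
    print(f"interpreter exited with status {status}")
```

Running the child in a separate process keeps a possible crash from taking down the script that performs the check.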

    @pablogsal
    Member

    Ah yes, we have been defeated by half an emoji :)

    @miss-islington
    Contributor

    New changeset 26cca80 by Pablo Galindo Salgado in branch 'main':
    bpo-47117: Don't crash if we fail to decode characters when the tokenizer buffers are uninitialized (GH-32129)
    26cca80

    @pablogsal
    Member

    New changeset 27ee431 by Pablo Galindo Salgado in branch '3.10':
    [3.10] bpo-47117: Don't crash if we fail to decode characters when the tokenizer buffers are uninitialized (GH-32129) (GH-32130)
    27ee431

    @pablogsal
    Member

    Thanks for the report, Jon!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022