repl segfaults on non utf-8 input #91273

jooon · 2022-03-25T10:06:06Z

BPO	47117
Nosy	@pablogsal, @miss-islington, @tirkarthi, @jooon
PRs	bpo-47117: Don't crash if we fail to decode characters when the tokenizer buffers are uninitialized #32129; [3.10] bpo-47117: Don't crash if we fail to decode characters when the tokenizer buffers are uninitialized (GH-32129) #32130

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2022-03-26.18:26:31.861>
created_at = <Date 2022-03-25.10:06:05.710>
labels = ['3.10', 'type-crash', '3.11']
title = 'repl segfaults on non utf-8 input'
updated_at = <Date 2022-03-26.18:26:31.861>
user = 'https://github.com/jooon'

bugs.python.org fields:

activity = <Date 2022-03-26.18:26:31.861>
actor = 'pablogsal'
assignee = 'none'
closed = True
closed_date = <Date 2022-03-26.18:26:31.861>
closer = 'pablogsal'
components = []
creation = <Date 2022-03-25.10:06:05.710>
creator = 'jooon'
dependencies = []
files = []
hgrepos = []
issue_num = 47117
keywords = ['patch']
message_count = 8.0
messages = ['415992', '416004', '416005', '416006', '416070', '416072', '416079', '416080']
nosy_count = 4.0
nosy_names = ['pablogsal', 'miss-islington', 'xtreak', 'jooon']
pr_nums = ['32129', '32130']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'crash'
url = 'https://bugs.python.org/issue47117'
versions = ['Python 3.10', 'Python 3.11']

jooon · 2022-03-25T10:06:06Z

Some non-UTF-8 bytes segfault the Python REPL in 3.10 and later on Linux. Example:

$ python3.10
Python 3.10.4 (main, Mar 24 2022, 14:20:44) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> �
Segmentation fault (core dumped)

The same input is handled correctly in Python 3.9 and earlier:

$ python3.9
Python 3.9.12 (main, Mar 24 2022, 14:21:53) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> �
  File "<stdin>", line 0
    
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xb6 in position 0: invalid start byte
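
For reference, a minimal sketch of why the 3.9 message says "invalid start byte": 0xb6 has the bit pattern 10xxxxxx, which UTF-8 reserves for continuation bytes, so it cannot begin a character. The variable names are illustrative only; this decodes the byte in plain Python and does not touch the REPL input path that crashes.

```python
# The byte reported in the SyntaxError above.
byte = 0xB6

# UTF-8 continuation bytes look like 10xxxxxx, so they can never start a character.
is_continuation = (byte & 0b1100_0000) == 0b1000_0000

try:
    bytes([byte]).decode("utf-8")
    reason = None
except UnicodeDecodeError as exc:
    # The codec reports why the byte was rejected.
    reason = exc.reason

print(is_continuation, reason)
```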

How to reproduce:

In GNOME on Ubuntu 20.04 with the Swedish keyboard layout, holding left Alt and pressing the ö key enters the byte 0xb6 into the terminal.

I have only been able to make it crash the REPL. I can't make it crash the parser directly, for instance by trying to eval the byte.
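
A sketch of the kind of non-interactive check meant here, assuming the byte is passed to eval() as a bytes object. The helper name eval_raw_bytes is hypothetical. On the versions discussed in this issue this path is expected to raise an ordinary SyntaxError rather than crash, which matches the observation that only the REPL is affected.

```python
def eval_raw_bytes(data: bytes):
    """Return the SyntaxError raised by eval(data), or None if none was raised."""
    try:
        eval(data)
    except SyntaxError as exc:
        return exc
    return None


# 0xb6 alone is not valid UTF-8 source code.
err = eval_raw_bytes(b"\xb6")
print(type(err).__name__, err)
```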

tirkarthi · 2022-03-25T14:45:24Z

This looks similar to https://bugs.python.org/issue46206

jooon · 2022-03-25T14:59:12Z

Yes, I think they are the same. I can reproduce the emoji crash, and it is much easier to reproduce than my original report, with no need for a Swedish keyboard layout:

Copy _😀
Start Python with a non-Unicode locale: LC_ALL=C python3.10
Paste in _😀
Press backspace once. It will look like the 2 character wide emoji is replaced by a 1 character wide space.
Press return
Crash
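
The steps above can be approximated in plain Python. This is only a sketch under one assumption that the issue does not confirm: that in the C locale the backspace removes a single byte rather than the whole 4-byte emoji. If so, the line that reaches the interpreter ends in a truncated UTF-8 sequence.

```python
# "_" followed by U+1F600 (the emoji from the steps above), encoded as UTF-8.
pasted = "_\U0001F600".encode("utf-8")

# Assumption: backspace in the C locale deletes one byte, not one character.
after_backspace = pasted[:-1]

try:
    after_backspace.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # The remaining three bytes of the emoji no longer form a complete character.
    error = exc

print(len(pasted), after_backspace, error.reason if error else None)
```

This only shows that the remaining bytes are undecodable; the crash itself happens later, in the REPL's error reporting.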

jooon · 2022-03-25T15:07:15Z

Very similar backtrace too:

(gdb) run
Starting program: /home/jon/.pyenv/versions/3.10.4/bin/python3.10 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Python 3.10.4 (main, Mar 24 2022, 14:20:44) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> _ 

Program received signal SIGSEGV, Segmentation fault.
__strchr_avx2 () at ../sysdeps/x86_64/multiarch/strchr-avx2.S:57
57	../sysdeps/x86_64/multiarch/strchr-avx2.S: No such file or directory.
(gdb) bt
#0  __strchr_avx2 () at ../sysdeps/x86_64/multiarch/strchr-avx2.S:57
#1  0x00005555557d4a7a in get_error_line (lineno=lineno@entry=0, p=<optimized out>, p=<optimized out>) at Parser/pegen.c:443
#2  0x00005555557d541b in _PyPegen_raise_error_known_location (p=0x7ffff7885ed0, 
    errtype=0x5555558fe420 <_PyExc_SyntaxError>, lineno=0, col_offset=0, end_lineno=0, end_col_offset=-1, 
    errmsg=0x5555558a2dd3 "(%s) %U", va=0x7fffffffd410) at Parser/pegen.c:499
#3  0x00005555557d5646 in _PyPegen_raise_error (p=p@entry=0x7ffff7885ed0, errtype=<optimized out>, 
    errmsg=errmsg@entry=0x5555558a2dd3 "(%s) %U") at Parser/pegen.c:422
#4  0x00005555557d5839 in raise_decode_error (p=p@entry=0x7ffff7885ed0) at Parser/pegen.c:271
#5  0x00005555557d6193 in initialize_token (token_type=60, end=0x0, start=<optimized out>, token=0x7ffff7a55d10, 
    p=0x7ffff7885ed0) at Parser/pegen.c:720
#6  _PyPegen_fill_token (p=p@entry=0x7ffff7885ed0) at Parser/pegen.c:793
#7  0x00005555557fec00 in statement_newline_rule (p=0x7ffff7885ed0) at Parser/parser.c:1080
#8  interactive_rule (p=0x7ffff7885ed0) at Parser/parser.c:1002
#9  _PyPegen_parse (p=p@entry=0x7ffff7885ed0) at Parser/parser.c:34508
#10 0x00005555557d6c60 in _PyPegen_run_parser (p=0x7ffff7885ed0) at Parser/pegen.c:1342
#11 0x00005555557d718f in _PyPegen_run_parser_from_file_pointer (fp=fp@entry=0x7ffff7e29980 <_IO_2_1_stdin_>, 
    start_rule=start_rule@entry=256, filename_ob=filename_ob@entry=0x7ffff7a85670, enc=enc@entry=0x7ffff7a7c1a0 "utf-8", 
    ps1=<optimized out>, ps1@entry=0x1e000000160 <error: Cannot access memory at address 0x1e000000160>, 
    ps2=ps2@entry=0xe0000001a0 <error: Cannot access memory at address 0xe0000001a0>, flags=0x7fffffffd7f8, 
    errcode=0x7fffffffd724, arena=0x7ffff792cc70) at Parser/pegen.c:1448
#12 0x000055555575661c in _PyParser_ASTFromFile (fp=fp@entry=0x7ffff7e29980 <_IO_2_1_stdin_>, 
    filename_ob=filename_ob@entry=0x7ffff7a85670, enc=enc@entry=0x7ffff7a7c1a0 "utf-8", mode=mode@entry=256, 
    ps1=0x1e000000160 <error: Cannot access memory at address 0x1e000000160>, ps1@entry=0x7ffff7acf960 ">>> ", 
    ps2=0xe0000001a0 <error: Cannot access memory at address 0xe0000001a0>, ps2@entry=0x7ffff7af02e0 "... ", 
    flags=<optimized out>, errcode=<optimized out>, arena=<optimized out>) at Parser/peg_api.c:26
#13 0x00005555556cad97 in PyRun_InteractiveOneObjectEx (fp=fp@entry=0x7ffff7e29980 <_IO_2_1_stdin_>, filename=filename@entry=0x7ffff7a85670, flags=flags@entry=0x7fffffffd7f8) at Python/pythonrun.c:257
#14 0x00005555556cba26 in _PyRun_InteractiveLoopObject (fp=fp@entry=0x7ffff7e29980 <_IO_2_1_stdin_>, filename=filename@entry=0x7ffff7a85670, flags=flags@entry=0x7fffffffd7f8) at Python/pythonrun.c:148
#15 0x00005555556cc5ce in _PyRun_AnyFileObject (flags=<optimized out>, closeit=<optimized out>, filename=0x7ffff7a85670, fp=<optimized out>) at Python/pythonrun.c:84
#16 PyRun_AnyFileExFlags (fp=0x7ffff7e29980 <_IO_2_1_stdin_>, filename=filename@entry=0x555555802103 "<stdin>", closeit=closeit@entry=0, flags=flags@entry=0x7fffffffd7f8) at Python/pythonrun.c:116
#17 0x00005555555bb5c7 in pymain_run_stdin (config=0x555555932ce0) at Modules/main.c:502
#18 pymain_run_python (exitcode=exitcode@entry=0x7fffffffd930) at Modules/main.c:590
#19 0x00005555555bba1f in Py_RunMain () at Modules/main.c:666
#20 pymain_main (args=0x7fffffffd8f0) at Modules/main.c:696
#21 Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:720
#22 0x00007ffff7c610b3 in __libc_start_main (main=0x5555555aedb0 <main>, argc=1, argv=0x7fffffffda58, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffda48)
    at ../csu/libc-start.c:308
#23 0x00005555555ba57e in _start () at ./Include/internal/pycore_pyerrors.h:14

pablogsal · 2022-03-26T15:54:44Z

Ah yes, we have been defeated by half an emoji :)

miss-islington · 2022-03-26T16:29:16Z

New changeset 26cca80 by Pablo Galindo Salgado in branch 'main':
bpo-47117: Don't crash if we fail to decode characters when the tokenizer buffers are uninitialized (GH-32129)
26cca80

pablogsal · 2022-03-26T18:26:13Z

New changeset 27ee431 by Pablo Galindo Salgado in branch '3.10':
[3.10] bpo-47117: Don't crash if we fail to decode characters when the tokenizer buffers are uninitialized (GH-32129) (GH-32130)
27ee431

pablogsal · 2022-03-26T18:26:32Z

Thanks for the report, Jon!

jooon mannequin added the 3.10 (only security fixes), 3.11 (only security fixes), and type-crash (A hard crash of the interpreter, possibly with a core dump) labels Mar 25, 2022

pablogsal closed this as completed Mar 26, 2022

ezio-melotti transferred this issue from another repository Apr 10, 2022
