Author Neui
Recipients Neui
Date 2019-02-01.16:55:37
Message-id <1549040138.25.0.646545491344.issue35883@roundup.psfhosted.org>
In-reply-to
Content
When an invalid Unicode character is passed in argv (command-line arguments), Python abort()s with a fatal error about a character not being in range (ValueError: character U+7fffbeba is not in range [U+0000; U+10ffff]).

I am wondering whether this behaviour should change so that such characters are replaced with U+FFFD REPLACEMENT CHARACTER (like .decode(..., 'replace')), or handled by something similar or better (see https://docs.python.org/3/library/codecs.html#error-handlers ).
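
For reference, here is a minimal sketch of what the standard codec error handlers do when decoding bytes. It only illustrates the handlers named in the linked documentation; it is not how the interpreter decodes argv, and the sample byte string is made up.

```python
# Illustrative input: two ASCII bytes followed by a byte that is invalid in UTF-8.
raw = b"ls\xff"

# 'replace' substitutes U+FFFD REPLACEMENT CHARACTER for the bad byte (lossy).
replaced = raw.decode("utf-8", errors="replace")

# 'backslashreplace' keeps the byte value visible as an escape sequence.
slashed = raw.decode("utf-8", errors="backslashreplace")

# 'surrogateescape' maps the bad byte to a lone surrogate (U+DCFF here),
# so the original bytes can be recovered when encoding again.
escaped = raw.decode("utf-8", errors="surrogateescape")
round_trip = escaped.encode("utf-8", errors="surrogateescape")

print(ascii(replaced), ascii(slashed), ascii(escaped), round_trip == raw)
```

Of these, only 'surrogateescape' is reversible, which matters if the argument has to be handed on to another program unchanged.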

The reason is that other applications can pass the invalid character along, since it is just data (GDB, for example, accepts it as an argument to the program being debugged), whereas in Python this becomes a limitation, since the script (if specified) never runs.

My main motivation is the command-not-found Debian package, which receives the mistyped command as a command-line argument. If that argument contains an invalid Unicode character, the script just fails rather than saying that it couldn't find the command or a similar one. If this doesn't change, the package has to either accept this as a limitation, pass the command some other way, or be rewritten in a language other than Python.
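
As a rough sketch of the "pass the command some other way" option, the caller could write the mistyped command to the script's standard input as raw bytes, and the script could decode it leniently. The function name read_command and the stdin protocol are assumptions for illustration only; this is not how command-not-found actually works.

```python
import io


def read_command(stream):
    """Read one line of raw bytes and decode it without failing on bad input.

    `stream` is any binary file object, e.g. sys.stdin.buffer (hypothetical protocol).
    """
    raw = stream.readline().rstrip(b"\n")
    # Undecodable bytes become U+FFFD instead of raising an exception.
    return raw.decode("utf-8", errors="replace")


# Demonstration with an in-memory stream standing in for stdin.
demo = read_command(io.BytesIO(b"gi\xfft\n"))
print(ascii(demo) + ": command not found")
```

A real script would call read_command(sys.stdin.buffer); since nothing passes through argv, the interpreter starts normally even if the command contains invalid data.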

# Requires bash 4.2+
# Specifying a script omits the first two lines
$ python3.6 $'\U7fffbeba'
Failed checking if argv[0] is an import path entry
ValueError: character U+7fffbeba is not in range [U+0000; U+10ffff]
Fatal Python error: no mem for sys.argv
ValueError: character U+7fffbeba is not in range [U+0000; U+10ffff]

Current thread 0x00007fd212eaf740 (most recent call first):
Aborted (core dumped)

$ python3.6 --version
Python 3.6.7

$ uname -a
Linux nopea 4.15.0-39-generic #42-Ubuntu SMP Tue Oct 23 15:48:01 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.1 LTS
Release:	18.04
Codename:	bionic

GDB backtrace just before the error is raised (note that argc=2 in makeargvobject since the first argument is a script):
#0  find_maxchar_surrogates (begin=begin@entry=0xa847a0 L'\x7fffbeba' <repeats 12 times>, end=end@entry=0xa847d0 L"", maxchar=maxchar@entry=0x7fffffffde94, 
    num_surrogates=num_surrogates@entry=0x7fffffffde98) at ../Objects/unicodeobject.c:1626
#1  0x00000000004cee4b in PyUnicode_FromUnicode (u=u@entry=0xa847a0 L'\x7fffbeba' <repeats 12 times>, size=12) at ../Objects/unicodeobject.c:2017
#2  0x00000000004db856 in PyUnicode_FromWideChar (w=0xa847a0 L'\x7fffbeba' <repeats 12 times>, size=<optimized out>, size@entry=-1) at ../Objects/unicodeobject.c:2502
#3  0x000000000043253d in makeargvobject (argc=argc@entry=2, argv=argv@entry=0xa82268) at ../Python/sysmodule.c:2145
#4  0x0000000000433228 in PySys_SetArgvEx (argc=2, argv=0xa82268, updatepath=1) at ../Python/sysmodule.c:2264
#5  0x00000000004332c1 in PySys_SetArgv (argc=<optimized out>, argv=<optimized out>) at ../Python/sysmodule.c:2277
#6  0x000000000043a5bd in Py_Main (argc=argc@entry=3, argv=argv@entry=0xa82260) at ../Modules/main.c:733
#7  0x0000000000421149 in main (argc=3, argv=0x7fffffffe178) at ../Programs/python.c:69
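
To make the proposal concrete, here is a hedged Python sketch of the range check that fails in frame #0, together with the suggested replacement behaviour. It is a model only, not CPython's C implementation; the function from_wide and its errors parameter are hypothetical names.

```python
MAX_CODE_POINT = 0x10FFFF  # highest valid Unicode code point


def from_wide(code_units, errors="strict"):
    """Build a str from integer wide-character values (a model, not the C code)."""
    chars = []
    for unit in code_units:
        if 0 <= unit <= MAX_CODE_POINT:
            chars.append(chr(unit))
        elif errors == "replace":
            # Proposed behaviour: substitute U+FFFD for the out-of-range value.
            chars.append("\ufffd")
        else:
            # Current behaviour: reject the whole string.
            raise ValueError(
                f"character U+{unit:x} is not in range [U+0000; U+10ffff]"
            )
    return "".join(chars)


# The argument from the backtrace: twelve wide characters of value 0x7fffbeba.
bad_argument = [0x7FFFBEBA] * 12
print(ascii(from_wide(bad_argument, errors="replace")))
```

With errors="strict" the same call raises the ValueError shown in the transcript above; with errors="replace" the script would still start and receive twelve U+FFFD characters.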

Similar issues:
https://bugs.python.org/issue25631 "Segmentation fault with invalid Unicode command-line arguments in embedded Python" (actually 'fixed' since it now abort()s)
https://bugs.python.org/issue2128 "sys.argv is wrong for unicode strings"
History
Date                 User  Action  Args
2019-02-01 16:55:40  Neui  set     recipients: + Neui
2019-02-01 16:55:38  Neui  set     messageid: <1549040138.25.0.646545491344.issue35883@roundup.psfhosted.org>
2019-02-01 16:55:38  Neui  link    issue35883 messages
2019-02-01 16:55:37  Neui  create