classification
Title: Change invalid unicode characters to replacement characters in argv
Type: crash Stage:
Components: Interpreter Core Versions: Python 3.10, Python 3.9, Python 3.8, Python 3.7, Python 3.6, Python 3.5
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Neui, SilentGhost, eryksun, jberg, ncoghlan
Priority: normal Keywords:

Created on 2019-02-01 16:55 by Neui, last changed 2020-05-24 19:34 by jberg.

Messages (11)
msg334703 - (view) Author: (Neui) Date: 2019-02-01 16:55
When an invalid Unicode character is passed in argv (command-line arguments), Python abort()s with a fatal error about a character being out of range (ValueError: character U+7fffbeba is not in range [U+0000; U+10ffff]).

I am wondering if this behaviour should change to replace those with U+FFFD REPLACEMENT CHARACTER (like .decode(..., 'replace')) or even with something similar/better (see https://docs.python.org/3/library/codecs.html#error-handlers )
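
For reference, here is a minimal sketch of how the two most relevant error handlers treat such input. It operates on a bytes literal rather than on real argv, the byte string is an assumed example (a 6-byte sequence that is not valid UTF-8), and the variable names are illustrative only.

```python
# An invalid UTF-8 byte string (obsolete 6-byte form), used here as sample input.
raw = b"\xfd\xbf\xbf\xbb\xba\xba"

# 'replace' substitutes U+FFFD for undecodable bytes; the original bytes are lost.
replaced = raw.decode("utf-8", "replace")

# 'surrogateescape' maps each undecodable byte 0xNN to the lone surrogate U+DCNN,
# so the original bytes can be recovered later with the same handler.
escaped = raw.decode("utf-8", "surrogateescape")
recovered = escaped.encode("utf-8", "surrogateescape")

print(ascii(replaced))
print(ascii(escaped))
print(recovered == raw)
```

The trade-off is that 'replace' yields a clean but lossy string, while 'surrogateescape' keeps the data intact at the cost of a string that cannot be encoded without the same handler.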

The reason is that other applications can handle the invalid character, since it is just data (GDB, for example, passes it on as an argument to the program being debugged), whereas in Python it becomes a limitation, since the script (if specified) never runs.

My main motivation is the command-not-found Debian package, which receives the mistyped command as a command-line argument. If that argument contains an invalid Unicode character, the script just fails rather than saying it couldn't find the command or a similar one. If this doesn't get changed, the package either has to accept this as a limitation, pass the command some other way, or be rewritten in a language other than Python.

# Requires bash 4.2+
# Specifying a script omits the first two lines
$ python3.6 $'\U7fffbeba'
Failed checking if argv[0] is an import path entry
ValueError: character U+7fffbeba is not in range [U+0000; U+10ffff]
Fatal Python error: no mem for sys.argv
ValueError: character U+7fffbeba is not in range [U+0000; U+10ffff]

Current thread 0x00007fd212eaf740 (most recent call first):
Aborted (core dumped)

$ python3.6 --version
Python 3.6.7

$ uname -a
Linux nopea 4.15.0-39-generic #42-Ubuntu SMP Tue Oct 23 15:48:01 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.1 LTS
Release:	18.04
Codename:	bionic

GDB backtrace just before throwing the error: (note that it's argc=2 since first argument is a script)
#0  find_maxchar_surrogates (begin=begin@entry=0xa847a0 L'\x7fffbeba' <repeats 12 times>, end=end@entry=0xa847d0 L"", maxchar=maxchar@entry=0x7fffffffde94, 
    num_surrogates=num_surrogates@entry=0x7fffffffde98) at ../Objects/unicodeobject.c:1626
#1  0x00000000004cee4b in PyUnicode_FromUnicode (u=u@entry=0xa847a0 L'\x7fffbeba' <repeats 12 times>, size=12) at ../Objects/unicodeobject.c:2017
#2  0x00000000004db856 in PyUnicode_FromWideChar (w=0xa847a0 L'\x7fffbeba' <repeats 12 times>, size=<optimized out>, size@entry=-1) at ../Objects/unicodeobject.c:2502
#3  0x000000000043253d in makeargvobject (argc=argc@entry=2, argv=argv@entry=0xa82268) at ../Python/sysmodule.c:2145
#4  0x0000000000433228 in PySys_SetArgvEx (argc=2, argv=0xa82268, updatepath=1) at ../Python/sysmodule.c:2264
#5  0x00000000004332c1 in PySys_SetArgv (argc=<optimized out>, argv=<optimized out>) at ../Python/sysmodule.c:2277
#6  0x000000000043a5bd in Py_Main (argc=argc@entry=3, argv=argv@entry=0xa82260) at ../Modules/main.c:733
#7  0x0000000000421149 in main (argc=3, argv=0x7fffffffe178) at ../Programs/python.c:69

Similar issues:
https://bugs.python.org/issue25631 "Segmentation fault with invalid Unicode command-line arguments in embedded Python" (actually 'fixed' since it now abort()s)
https://bugs.python.org/issue2128 "sys.argv is wrong for unicode strings"
msg334705 - (view) Author: SilentGhost (SilentGhost) * (Python triager) Date: 2019-02-01 17:10
I'm on 4.15.0-44-generic and I cannot reproduce the crash. I get "python3: can't open file '������': [Errno 2] No such file or directory"

Could you try this on a different machine / installation?
msg334707 - (view) Author: SilentGhost (SilentGhost) * (Python triager) Date: 2019-02-01 17:22
Hm, this seems to be due to how the terminal emulator handles those special characters, actually. I can reproduce in another terminal.
msg334712 - (view) Author: (Neui) Date: 2019-02-01 19:33
I'd say that the terminal is not really relevant here, but rather the locale settings, because Python uses wide-string functions. On my Ubuntu machine, prefixing the command with LC_ALL=C produces the same output you got. I also get that output when running it in Cygwin (and MSYS2), although setting LC_ALL seems to have no effect there.
msg334732 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2019-02-01 23:49
In Unix, Python 3.6 decodes the char * command line arguments via mbstowcs. In Linux, I see the following misbehavior of mbstowcs when decoding an invalid 6-byte UTF-8 sequence (a form that RFC 3629 no longer permits):

    >>> import ctypes
    >>> mbstowcs = ctypes.CDLL(None, use_errno=True).mbstowcs
    >>> arg = bytes(x + 128 for x in [1 + 124, 63, 63, 59, 58, 58])
    >>> mbstowcs(None, arg, 0)
    1
    >>> buf = (ctypes.c_int * 2)()
    >>> mbstowcs(buf, arg, 2)
    1
    >>> hex(buf[0])
    '0x7fffbeba'

This shouldn't be an issue in 3.7, at least not with the default UTF-8 mode configuration. With this mode, Py_DecodeLocale calls _Py_DecodeUTF8Ex using the surrogateescape error handler [1].

[1]: https://github.com/python/cpython/blob/v3.7.2/Python/fileutils.c#L456
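
To show where 0x7fffbeba comes from, here is a rough pure-Python sketch of the permissive multi-byte decoding that predates RFC 3629 (sequences of up to 6 bytes). The function name `decode_legacy_utf8` is hypothetical and this is not glibc's actual code; it only illustrates the bit arithmetic for a single sequence.

```python
def decode_legacy_utf8(seq: bytes) -> int:
    """Decode one sequence in the old 1-to-6-byte UTF-8 scheme, without range checks."""
    first = seq[0]
    # Count the leading 1 bits of the first byte: that is the sequence length.
    length = 0
    mask = 0x80
    while first & mask:
        length += 1
        mask >>= 1
    if length == 0:
        return first  # plain ASCII byte
    if length == 1 or length > 6:
        raise ValueError("invalid lead byte")
    if len(seq) != length:
        raise ValueError("wrong number of bytes for this lead byte")
    # Payload bits of the lead byte are those below the first 0 bit.
    value = first & (mask - 1)
    for byte in seq[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value


arg = bytes(x + 128 for x in [1 + 124, 63, 63, 59, 58, 58])
code_point = decode_legacy_utf8(arg)
print(hex(code_point), code_point > 0x10FFFF)
```

A decoder that follows RFC 3629 would reject the lead byte 0xFD outright instead of producing a value above U+10FFFF.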
msg369811 - (view) Author: Johannes Berg (jberg) Date: 2020-05-24 17:26
Pretty sure this is still an issue; I see it on current git master.

This seems to work around it?

https://p.sipsolutions.net/603927f1537226b3.txt

Basically, it seems that mbstowcs() and mbrtowc() on glibc with utf-8 just blindly decode even invalid UTF-8 to a too large wchar_t, rather than failing.
msg369812 - (view) Author: Johannes Berg (jberg) Date: 2020-05-24 17:37
A simple test case is something like

  ./python -c 'import sys; print(sys.argv[1].encode(sys.getfilesystemencoding(), "surrogateescape"))' "$(echo -ne '\xfa\xbd\x83\x96\x80')"


Which you'd probably expect to print

  b'\xfa\xbd\x83\x96\x80'

i.e. the same bytes that were passed in, but currently that fails.
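
The expected round trip can be sketched without involving the command line at all. The helper name `arg_to_bytes` is made up for illustration, and the sample string is what a surrogateescape decode of those bytes would look like if argv decoding had succeeded.

```python
import sys


def arg_to_bytes(arg: str) -> bytes:
    """Recover the original bytes of an argument decoded with surrogateescape."""
    return arg.encode(sys.getfilesystemencoding(), "surrogateescape")


# Assumed sys.argv[1] after a successful surrogateescape decode of the test bytes.
decoded_arg = "\udcfa\udcbd\udc83\udc96\udc80"
print(arg_to_bytes(decoded_arg))
```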
msg369813 - (view) Author: Johannes Berg (jberg) Date: 2020-05-24 17:40
In fact that python one-liner works with just about everything else that you can throw at it, just not something that "looks like utf-8 but isn't".

And of course adding LC_CTYPE=ascii or something like that fixes it, as you'd expect. Then the "surrogateescape" works fine, since mbstowcs() won't try to decode it as utf-8.
msg369814 - (view) Author: Johannes Berg (jberg) Date: 2020-05-24 17:44
And wrt. _Py_DecodeUTF8Ex(): it doesn't seem to help. That's probably because my build defines neither __ANDROID__ nor __APPLE__, so regardless of whether current_locale is non-zero, we end up in decode_current_locale(), where the impedance mismatch happens.

Setting PYTHONUTF8=1 in the environment works too, in that case we do get into _Py_DecodeUTF8Ex().
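
A small sketch of that workaround, assuming a POSIX system and Python 3.7 or later: it starts a child interpreter with PYTHONUTF8=1 and passes the invalid bytes as an argument. The function name `run_child_utf8_mode` and the child one-liner are illustrative, not part of any patch discussed here.

```python
import os
import subprocess
import sys

RAW_ARG = b"\xfa\xbd\x83\x96\x80"  # invalid UTF-8 that "looks like" UTF-8

# Child program: write the original bytes of argv[1] to stdout.
CHILD = "import os, sys; sys.stdout.buffer.write(os.fsencode(sys.argv[1]))"


def run_child_utf8_mode(raw_arg: bytes) -> bytes:
    """Run a child interpreter in UTF-8 mode and return what it saw as argv[1]."""
    env = dict(os.environ, PYTHONUTF8="1")
    # Convert every element to bytes so the argument list is uniformly typed.
    args = [os.fsencode(sys.executable), b"-c", os.fsencode(CHILD), raw_arg]
    result = subprocess.run(args, env=env, stdout=subprocess.PIPE, check=True)
    return result.stdout


print(run_child_utf8_mode(RAW_ARG))
```

Inside the child, `sys.flags.utf8_mode` can be inspected to confirm that UTF-8 mode is actually active.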
msg369819 - (view) Author: Johannes Berg (jberg) Date: 2020-05-24 19:28
Like I said above, it could be argued that the bug is in glibc, and then

https://p.sipsolutions.net/6a4e9fce82dbbfa0.txt

could be used as a simple LD_PRELOAD wrapper to work around this, just to illustrate the problem from that side.


Arguably, that puts glibc in violation of RFC 3629, since it says:


3.  UTF-8 definition

[...]

   In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
   accessible range) are encoded using sequences of 1 to 4 octets.

[...]

      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

[...]

   Implementations of the decoding algorithm above MUST protect against
   decoding invalid sequences.

[...]

Here's a simple test program:

https://p.sipsolutions.net/ac091b4ea4b7f742.txt
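
For comparison, Python's own strict UTF-8 codec rejects such sequences. The sketch below uses it as a validity check; the helper name `is_valid_utf8` is hypothetical, and it is meant only to show the behaviour one would expect from a decoder that follows the RFC.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8", "strict")
    except UnicodeDecodeError:
        return False
    return True


samples = {
    "two-byte e-acute": "\u00e9".encode("utf-8"),
    "5-byte form": b"\xfa\xbd\x83\x96\x80",
    "overlong NUL": b"\xc0\x80",
    "above U+10FFFF": b"\xf4\x90\x80\x80",
}
for label, data in samples.items():
    print(label, is_valid_utf8(data))
```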
msg369820 - (view) Author: Johannes Berg (jberg) Date: 2020-05-24 19:34
I've also filed https://sourceware.org/bugzilla/show_bug.cgi?id=26034 for glibc, because that's where the issue really seems to be?

But perhaps Python should be forgiving of glibc errors here.
History
Date                 User         Action  Args
2020-05-24 19:34:28  jberg        set     messages: + msg369820
2020-05-24 19:28:05  jberg        set     messages: + msg369819
2020-05-24 17:44:03  jberg        set     messages: + msg369814
2020-05-24 17:40:20  jberg        set     messages: + msg369813
2020-05-24 17:37:55  jberg        set     messages: + msg369812; versions: + Python 3.5, Python 3.8, Python 3.9, Python 3.10
2020-05-24 17:26:36  jberg        set     nosy: + jberg; messages: + msg369811
2019-02-01 23:49:22  eryksun      set     nosy: + eryksun; messages: + msg334732
2019-02-01 19:48:23  SilentGhost  set     nosy: + ncoghlan; versions: + Python 3.7
2019-02-01 19:33:53  Neui         set     messages: + msg334712
2019-02-01 17:22:34  SilentGhost  set     messages: + msg334707
2019-02-01 17:10:26  SilentGhost  set     type: behavior -> crash; messages: + msg334705; nosy: + SilentGhost
2019-02-01 16:55:38  Neui         create