Python launcher does not support unicode characters #60422

turncc · 2012-10-13T14:24:39Z

BPO	16218
Nosy	@jcea, @pitrou, @vstinner, @tjguk, @jkloth, @ezio-melotti, @asvetlov, @skrah, @serhiy-storchaka, @koobs
Files	pythonrun_filename_decoding.patch, pythonrun_filename_decoding_2.patch, pythonrun_filename_decoding_test.patch (Fix the test), pythonrun_filename_decoding_test_2.patch, test_non_ascii.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2013-01-03.01:08:38.180>
created_at = <Date 2012-10-13.14:24:38.762>
labels = ['interpreter-core', 'type-bug']
title = 'Python launcher does not support unicode characters'
updated_at = <Date 2016-06-22.19:18:10.177>
user = 'https://bugs.python.org/turncc'

bugs.python.org fields:

activity = <Date 2016-06-22.19:18:10.177>
actor = 'serhiy.storchaka'
assignee = 'none'
closed = True
closed_date = <Date 2013-01-03.01:08:38.180>
closer = 'vstinner'
components = ['Interpreter Core']
creation = <Date 2012-10-13.14:24:38.762>
creator = 'turncc'
dependencies = []
files = ['27630', '27707', '27846', '27854', '27888']
hgrepos = []
issue_num = 16218
keywords = ['patch', '3.3regression']
message_count = 65.0
messages = ['172807', '173359', '173373', '173374', '173376', '173382', '173724', '174408', '174409', '174427', '174430', '174433', '174521', '174529', '174531', '174549', '174560', '174568', '174571', '174573', '174577', '174581', '174587', '174588', '174590', '174595', '174603', '174604', '174606', '174611', '174620', '174841', '174842', '174844', '174864', '174865', '174871', '174874', '174876', '174877', '174878', '174881', '174898', '174899', '174901', '174944', '175185', '175255', '175270', '175273', '175274', '175290', '175295', '175414', '175435', '175436', '175437', '176872', '178118', '178171', '178173', '178234', '178869', '178871', '179564']
nosy_count = 13.0
nosy_names = ['jcea', 'pitrou', 'vstinner', 'tim.golden', 'jkloth', 'ezio.melotti', 'asvetlov', 'skrah', 'gklein', 'python-dev', 'serhiy.storchaka', 'koobs', 'turncc']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue16218'
versions = ['Python 3.3', 'Python 3.4']

turncc · 2012-10-13T14:24:38Z

If there are non-ASCII characters in the py.exe arguments, execution fails. For example, the script file name or path may contain non-ASCII characters.

tjguk · 2012-10-19T19:48:28Z

Confirming that this doesn't happen on 2.7

py -2 £.py succeeds
py -3 £.py gives:

python: failed to set __main__.__loader__

serhiy-storchaka · 2012-10-20T07:25:19Z

I can reproduce this on Linux (3.3+ only):

$ name=$(printf "\xff")
$ echo "print('Hello, world')" >$name
$ ./python $name
python: failed to set __main__.__loader__

The issue is in the PyRun_SimpleFileExFlags() function, which receives the file name as a raw char *. The documentation says the name is in the filesystem encoding (sys.getfilesystemencoding()), but set_main_loader() then decodes it from UTF-8.
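
The mismatch can be illustrated in pure Python. This is only a minimal sketch, not the C code touched by the patch: it assumes a UTF-8 filesystem encoding with the surrogateescape error handler, and the variable names are made up for the illustration.

```python
# Raw file name bytes, the same single byte as in the shell example above.
raw_name = b"\xff"

# Strict UTF-8 decoding rejects the lone 0xff byte. This mirrors, roughly,
# what goes wrong when the name is decoded from UTF-8.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Decoding with surrogateescape keeps the byte as a lone surrogate,
# so the original bytes can be recovered later.
decoded = raw_name.decode("utf-8", "surrogateescape")
round_trip = decoded.encode("utf-8", "surrogateescape")

print(strict_ok)               # False
print(ascii(decoded))          # '\udcff'
print(round_trip == raw_name)  # True
```

The point of the sketch is only that the decoding step has to use the same encoding and error handler as the rest of the interpreter uses for file names.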

serhiy-storchaka · 2012-10-20T07:55:03Z

Here is a patch which fixes the filename decoding error in PyRun_SimpleFileExFlags().

vstinner · 2012-10-20T09:16:09Z

The patch looks correct, but a test is missing.

serhiy-storchaka · 2012-10-20T10:31:12Z

Where do we have tests for launching Python? I can't find any. runpy is not affected.

serhiy-storchaka · 2012-10-24T23:54:23Z

Test added.

python-dev · 2012-11-01T12:52:16Z

New changeset 02d25098ad57 by Andrew Svetlov in branch '3.3':
Issue bpo-16218: Support non ascii characters in python launcher.
http://hg.python.org/cpython/rev/02d25098ad57

New changeset 1267d64c14b3 by Andrew Svetlov in branch 'default':
Merge issue bpo-16218: Support non ascii characters in python launcher.
http://hg.python.org/cpython/rev/1267d64c14b3

asvetlov · 2012-11-01T12:52:51Z

Fixed. Thanks, Serhiy.

vsajip · 2012-11-01T16:23:15Z

I'm not especially familiar with this code, but just trying to understand - how come filename_obj isn't decref'd on normal exit?

asvetlov · 2012-11-01T16:37:01Z

Vinay, it's processed in
PyObject_CallFunction(loader_type, "sN", "__main__", filename_obj)
Please note the "sN" format instead of "sO".
"N" means a PyObject* is passed but, unlike with "sO", that object is not increfed.

vsajip · 2012-11-01T17:21:44Z

> Please note "sN" format instead of "sO".

I see. Thanks.

skrah · 2012-11-02T14:34:45Z

Some of the buildbots are failing with the new test:

======================================================================
FAIL: test_non_utf8 (test.test_cmd_line_script.CmdLineTest)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "/export/home/buildbot/64bits/3.x.cea-indiana-amd64/build/Lib/test/test_cmd_line_script.py", line 373, in test_non_utf8
    importlib.machinery.SourceFileLoader)
  File "/export/home/buildbot/64bits/3.x.cea-indiana-amd64/build/Lib/test/test_cmd_line_script.py", line 126, in _check_script
    rc, out, err = assert_python_ok(*run_args)
  File "/export/home/buildbot/64bits/3.x.cea-indiana-amd64/build/Lib/test/script_helper.py", line 54, in assert_python_ok
    return _assert_python(True, *args, **env_vars)
  File "/export/home/buildbot/64bits/3.x.cea-indiana-amd64/build/Lib/test/script_helper.py", line 46, in _assert_python
    "stderr follows:\n%s" % (rc, err.decode('ascii', 'ignore')))
AssertionError: Process return code is 1, stderr follows:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-20: ordinal not in range(128)

Ran 23 tests in 8.959s

jcea · 2012-11-02T14:51:27Z

Reopening bug.

Quite a few buildbots are failing with this patch. Please, commit a new version or revert.

asvetlov · 2012-11-02T14:57:41Z

I see. Sorry, my fault.
Give me weekend to figure out why it fails.
Thanks.

serhiy-storchaka · 2012-11-02T18:12:07Z

I was not able to reproduce this error; I got other errors. The issue is not in the Python interpreter: the test is broken. Here is a patch that might solve the issue on some platforms (it needs to be tested on Windows).

I guess that all command line tests fail when the path to the temporary directory contains non-ASCII characters.

skrah · 2012-11-02T19:33:26Z

Serhiy, your original example from msg173373 still fails on
FreeBSD:

$ name=$(printf "\xff")
$ echo "print('Hello, world')" >$name
$ ./python $name
UnicodeEncodeError: 'ascii' codec can't encode character '\xff' in position 0: ordinal not in range(128)
[41257 refs]

serhiy-storchaka · 2012-11-02T20:30:00Z

> Serhiy, your original example from msg173373 still fails on
> FreeBSD:

Thank you for the report. I have no idea what happened (note that the
error is on encoding, not decoding). Can you please show me the results of
sys.getdefaultencoding(), sys.getfilesystemencoding(),
locale.getpreferredencoding(True), locale.getpreferredencoding(False),
and the output of the locale command?

skrah · 2012-11-02T20:40:04Z

This is it:

>>> 
>>> sys.getdefaultencoding()
'utf-8'
>>> sys.getfilesystemencoding()
'ascii'
>>> locale.getpreferredencoding(True)
'US-ASCII'
>>> locale.getpreferredencoding(False)
'US-ASCII'
>>>

$ locale
LANG=
LC_CTYPE="C"
LC_COLLATE="C"
LC_TIME="C"
LC_NUMERIC="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=

asvetlov · 2012-11-02T20:51:46Z

Perhaps we have to skip the tests if the filesystem encoding doesn't support wide characters.
I'm not sure about the way: should we skip if sys.getfilesystemencoding() is not utf8, or is it better to try to encode the path and skip if that fails?
I think the latter is better.
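
The second option could look roughly like the sketch below. It is only an illustration under assumptions: `can_encode_path`, `NonAsciiScriptTest` and `SCRIPT_NAME` are hypothetical names, not the helpers used in the actual test suite.

```python
import sys
import unittest


def can_encode_path(path, encoding=None):
    """Return True if path can be encoded strictly with the given encoding.

    Defaults to the filesystem encoding reported by the interpreter.
    """
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    try:
        path.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


class NonAsciiScriptTest(unittest.TestCase):
    # Hypothetical script name containing the pound sign.
    SCRIPT_NAME = "\xa3.py"

    def test_non_ascii_name(self):
        if not can_encode_path(self.SCRIPT_NAME):
            self.skipTest("filesystem encoding cannot encode %a"
                          % self.SCRIPT_NAME)
        # A real test would create and run the script here.
        self.assertTrue(can_encode_path(self.SCRIPT_NAME))
```

Trying the encoding rather than comparing the codec name against "utf8" also covers single-byte locales such as ISO8859-15, which can represent the character.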

skrah · 2012-11-02T21:03:59Z

On FreeBSD both Serhiy's original test case as well as the unit test work
if the locale is ISO8859-15:

>>> sys.getdefaultencoding()
'utf-8'
>>> sys.getfilesystemencoding()
'iso8859-15'
>>> locale.getpreferredencoding(True)
'ISO8859-15'
>>> locale.getpreferredencoding(False)
'ISO8859-15'

Naturally, if the locale is utf-8 the test works as well.

vstinner · 2012-11-05T08:16:55Z

The test is still failing on Mac OS X:

======================================================================
FAIL: test_non_ascii (test.test_cmd_line_script.CmdLineTest)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "/Volumes/bay2/buildslave/cpython/3.x.snakebite-mountainlion-amd64/build/Lib/test/test_cmd_line_script.py", line 380, in test_non_ascii
    rc, stdout, stderr = assert_python_ok(script_name)
  File "/Volumes/bay2/buildslave/cpython/3.x.snakebite-mountainlion-amd64/build/Lib/test/script_helper.py", line 54, in assert_python_ok
    return _assert_python(True, *args, **env_vars)
  File "/Volumes/bay2/buildslave/cpython/3.x.snakebite-mountainlion-amd64/build/Lib/test/script_helper.py", line 46, in _assert_python
    "stderr follows:\n%s" % (rc, err.decode('ascii', 'ignore')))
AssertionError: Process return code is 2, stderr follows:
/Volumes/bay2/buildslave/cpython/3.x.snakebite-mountainlion-amd64/build/python.exe: can't open file './@test_63568_tmp.py': [Errno 2] No such file or directory

http://buildbot.python.org/all/builders/AMD64%20Mountain%20Lion%20%5BSB%5D%203.x/builds/410/steps/test/logs/stdio

--

If I remember correctly, the command line is always decoded from UTF-8/surrogateescape on Mac OS X. That's why we have the function _Py_DecodeUTF8_surrogateescape() (for bootstrap reasons).

Code like the following should not work if the locale encoding is not UTF-8 on Mac OS X:
---

arg = _Py_DecodeUTF8_surrogateescape(...);
filename = _Py_wchar2char(arg);
fp = fopen(filename, "r");

run_file() uses a different strategy:

        unicode = PyUnicode_FromWideChar(filename, wcslen(filename));
        if (unicode != NULL) {
            bytes = PyUnicode_EncodeFSDefault(unicode);
            Py_DECREF(unicode);
        }
        if (bytes != NULL)
            filename_str = PyBytes_AsString(bytes);
        else {
            PyErr_Clear();
            filename_str = "<encoding error>";
        }

run_file() looks to be right. Py_Main() should use similar code.

We should probably not encode and then decode the filename in each function, but this is another problem.
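
The decode-then-encode mismatch described above can be sketched in Python. This is an illustration only: the ISO8859-15 locale encoding and the file name are assumptions chosen for the example, not values taken from the failing buildbot.

```python
# Bytes of the file name as passed on the command line (UTF-8 here).
on_disk = "\xa3.py".encode("utf-8")

# Step 1: the command line is decoded from UTF-8 with surrogateescape.
arg = on_disk.decode("utf-8", "surrogateescape")

# Step 2: the text is encoded again with the locale encoding
# (assumed to be ISO8859-15 for this sketch) before opening the file.
locale_encoding = "iso8859-15"
reencoded = arg.encode(locale_encoding, "surrogateescape")

print(on_disk)               # b'\xc2\xa3.py'
print(reencoded)             # b'\xa3.py'
print(reencoded == on_disk)  # False
```

Because the two byte strings differ, opening the file with the re-encoded name would look for a file that does not exist.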

serhiy-storchaka · 2012-11-05T08:50:38Z

> The issue is about Windows and UTF-8 is never used as filesystem encoding
> on Windows.

The issue exists on Linux as I reported in msg173373.

vstinner · 2012-11-05T12:12:49Z

"It skipped on locales which does not support "£" (cp1006, cp1250, cp1251, cp737, cp852, cp855, cp866, cp874, cp949, euc_kr, gb2312, gbk, hz, iso2022_kr, iso8859_10, iso8859_11, iso8859_16, iso8859_2, iso8859_4, iso8859_5, iso8859_6, johab, koi8_r, koi8_u, mac_arabic, mac_farsi, ptcp154, tis_620). But the bug is actual on such locales."

This issue is not specific to this test: I created issue bpo-16414 to improve the situation.

vstinner · 2012-11-05T12:14:42Z

>> It tests nothing on utf-8 locale (test passed even when bug is not fixed).
> The issue is about Windows and UTF-8 is never used as filesystem encoding on Windows.
The issue exists on Linux as I reported in msg173373.

I don't understand your problem. Non-ASCII filenames were already supported with the UTF-8 locale encoding. The new test checks that there is no regression with the UTF-8 locale encoding. The test passes without the fix because it was not supported.

serhiy-storchaka · 2012-11-05T12:34:25Z

> Non-ASCII filenames were already supported with UTF-8 locale encoding.

Test the example in msg173373. It fails without the fix.

vstinner · 2012-11-05T22:20:11Z

I created the issue bpo-16416 to fix the Mac OS X case.

serhiy-storchaka · 2012-11-08T19:14:12Z

I think something like CommonTest.test_nonascii_abspath() in Lib/test/test_genericpath.py should be used here.

koobs · 2012-11-10T01:16:23Z

If there's not another revision of the test patch in the wings, can 56df0d4f0011 also be applied to 3.3, as tests are still failing on at least koobs-freebsd and koobs-freebsd-clang buildbots.

vstinner · 2012-11-10T10:36:16Z

>> Non-ASCII filenames were already supported with UTF-8 locale encoding.
> Test the example in msg173373. It fails without fix.

Oh, I didn't understand that, sorry. I created bpo-16444 to also test the UTF-8 locale encoding with undecodable filenames (undecodable from UTF-8 in *strict* mode, not by os.fsencode() which uses surrogateescape).

python-dev · 2012-11-10T11:07:36Z

New changeset 6b8a8bc6ba9c by Victor Stinner in branch 'default':
Issue bpo-16444, bpo-16218: Use TESTFN_UNDECODABLE on UNIX
http://hg.python.org/cpython/rev/6b8a8bc6ba9c

vstinner · 2012-11-10T11:08:21Z

"If there's not another revision of the test patch in the wings, can 56df0d4f0011 also be applied to 3.3, as tests are still failing on at least koobs-freebsd and koobs-freebsd-clang buildbots."

I just applied the patch of the issue bpo-16444. I will check 3.4 buildbots, and then backport to older Python versions (at least 3.3).

pitrou · 2012-11-10T17:20:59Z

> If there's not another revision of the test patch in the wings, can
> 56df0d4f0011 also be applied to 3.3, as tests are still failing on at
> least koobs-freebsd and koobs-freebsd-clang buildbots.

Let me insist on what koobs just said. The Windows buildbots are still
broken on 3.3, so this either needs fixing or reverting.

jcea · 2012-11-10T20:26:41Z

The OpenIndiana 3.3 and 3.x buildbots have been broken for a week too.

I suggest reverting this patch and using the custom buildbots to "debug it" before committing again. It has been a week and counting; it is about time.

Feel free to hammer my OpenIndiana custom buildbots.

python-dev · 2012-11-12T00:24:16Z

New changeset 6017f09ead53 by Victor Stinner in branch '3.3':
Issue bpo-16218, bpo-16444: Backport improvment on tests for non-ASCII characters
http://hg.python.org/cpython/rev/6017f09ead53

koobs · 2012-11-12T10:58:37Z

Back to green for all branches on FreeBSD, thank you Victor

skrah · 2012-11-12T11:07:23Z

The "Mountain Lion" bots still fail. :)

vstinner · 2012-11-12T11:14:46Z

> Back to green for all branches on FreeBSD, thank you Victor

FreeBSD buildbots are green because I disabled the test on undecodable bytes! See issue bpo-16455 which proposes a fix for FreeBSD and OpenIndiana.

> The "Mountain Lion" bots still fail. :)

Yeah I know, see the issue bpo-16416 which has a patch. I plan to commit it to 3.4, wait for buildbots, and then backport to 3.3.

--

Python 3.3 handles non-ASCII almost everywhere. Python 3.4 will probably handle non-ASCII everywhere.

Handling *undecodable* bytes is really hard. We cannot use the same code for UNIX and Windows. If we store data as bytes, it solves the issue, but we don't support any Unicode character on Windows anymore. If we store data as Unicode, it's the opposite (ok for Windows, decode error on UNIX).

vstinner · 2012-12-04T02:32:43Z

New changeset c25635b137cc by Victor Stinner in branch 'default':
Issue bpo-16455: On FreeBSD and Solaris, if the locale is C, the
http://hg.python.org/cpython/rev/c25635b137cc

This changeset should fix this issue on FreeBSD and Solaris: see the issue bpo-16455 for more information.

asvetlov · 2012-12-25T11:35:19Z

Victor, have you finished all the work for this issue?
Can it be closed?

vstinner · 2012-12-25T23:04:19Z

The issue is now fixed on all platforms for Python 3.4. Please keep the
issue open until all changes are backported to Python 3.3 or even Python
3.2.

asvetlov · 2012-12-25T23:20:59Z

I'll assign the issue to you then. Is that OK?

vstinner · 2012-12-26T16:24:00Z

Status of the different issues:

bpo-16416, Mac OS X: 3.2, 3.3, 3.4
bpo-16455, FreeBSD and Solaris: 3.4
bpo-16218, set_main_loader: 3.3, 3.4
bpo-16218, test_cmd_line_script: 3.4 (3.3 has an old copy of the test)
bpo-16414, add support.TESTFN_NONASCII: 3.4
bpo-16444, use support.TESTFN_NONASCII: 3.4

python-dev · 2013-01-03T00:59:41Z

New changeset 41658a4fb3cc by Victor Stinner in branch '3.2':
Issue bpo-16218, bpo-16414, bpo-16444: Backport FS_NONASCII, TESTFN_UNDECODABLE,
http://hg.python.org/cpython/rev/41658a4fb3cc

New changeset 4d40c1ce8566 by Victor Stinner in branch '3.3':
(Merge 3.2) Issue bpo-16218, bpo-16414, bpo-16444: Backport FS_NONASCII,
http://hg.python.org/cpython/rev/4d40c1ce8566

vstinner · 2013-01-03T01:08:38Z

> I'll assign the issue to you then. Is that OK?

Sure.

I backported all changesets related to this issue to Python 3.2 and 3.3. So I can finally close this issue.

asvetlov · 2013-01-10T16:29:28Z

Thanks!

turncc mannequin added OS-windows type-crash A hard crash of the interpreter, possibly with a core dump labels Oct 13, 2012

serhiy-storchaka added interpreter-core (Objects, Python, Grammar, and Parser dirs) and removed OS-windows labels Oct 20, 2012

ezio-melotti added type-bug An unexpected behavior, bug, or error and removed type-crash A hard crash of the interpreter, possibly with a core dump labels Oct 20, 2012

asvetlov closed this as completed Nov 1, 2012

jcea reopened this Nov 2, 2012

asvetlov self-assigned this Nov 2, 2012

asvetlov assigned vstinner and unassigned asvetlov Dec 25, 2012

vstinner closed this as completed Jan 3, 2013

vstinner removed their assignment Jan 3, 2013

ezio-melotti transferred this issue from another repository Apr 10, 2022
