Python launcher does not support unicode characters #60422
Comments
If there are non-ASCII characters in the py.exe arguments, execution fails. The script file name or path may contain non-ASCII characters. |
Confirming that this doesn't happen on 2.7: py -2 £.py succeeds. The error reported is:
python: failed to set __main__.__loader__ |
I can reproduce this on Linux (3.3+ only):
$ name=$(printf "\xff")
$ echo "print('Hello, world')" >$name
$ ./python $name
python: failed to set __main__.__loader__
The issue is in the PyRun_SimpleFileExFlags() function, which receives the file name as a raw char * (according to the documentation, encoded with the filesystem encoding, sys.getfilesystemencoding()), but set_main_loader() then decodes this name from UTF-8. |
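The decoding mismatch described here can be reproduced in pure Python. This is only a minimal sketch, not the C code path itself: it assumes the one-byte file name 0xff from the shell example, and the helper names decode_strict_utf8 and decode_like_filesystem are hypothetical. It contrasts a strict UTF-8 decode with a decode that uses the surrogateescape error handler, which Python applies to file names on POSIX systems.

```python
RAW_NAME = b"\xff"  # the one-byte file name from the shell example


def decode_strict_utf8(raw: bytes) -> str:
    """Strict UTF-8 decode (assumed stand-in for the failing step)."""
    return raw.decode("utf-8")


def decode_like_filesystem(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode with surrogateescape so undecodable bytes become lone surrogates."""
    return raw.decode(encoding, "surrogateescape")


try:
    decode_strict_utf8(RAW_NAME)
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The escaped form can be turned back into the original bytes.
escaped = decode_like_filesystem(RAW_NAME)
round_tripped = escaped.encode("utf-8", "surrogateescape")

print(strict_ok)
print(ascii(escaped))
print(round_tripped == RAW_NAME)
```

The strict decode rejects the name, while the surrogateescape decode keeps enough information to recover the original bytes when the file is opened.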
Here is a patch which fixes filename decoding error in PyRun_SimpleFileExFlags(). |
The patch looks correct, but a test is missing. |
Where do we have tests for launching Python? I can't find any. runpy is not affected. |
Test added. |
New changeset 02d25098ad57 by Andrew Svetlov in branch '3.3':
New changeset 1267d64c14b3 by Andrew Svetlov in branch 'default': |
Fixed. Thanks, Serhiy. |
I'm not especially familiar with this code, but just trying to understand - how come filename_obj isn't decref'd on normal exit? |
Vinay, it's processed in |
I see. Thanks. |
Some of the buildbots are failing with the new test:
======================================================================
Traceback (most recent call last):
File "/export/home/buildbot/64bits/3.x.cea-indiana-amd64/build/Lib/test/test_cmd_line_script.py", line 373, in test_non_utf8
importlib.machinery.SourceFileLoader)
File "/export/home/buildbot/64bits/3.x.cea-indiana-amd64/build/Lib/test/test_cmd_line_script.py", line 126, in _check_script
rc, out, err = assert_python_ok(*run_args)
File "/export/home/buildbot/64bits/3.x.cea-indiana-amd64/build/Lib/test/script_helper.py", line 54, in assert_python_ok
return _assert_python(True, *args, **env_vars)
File "/export/home/buildbot/64bits/3.x.cea-indiana-amd64/build/Lib/test/script_helper.py", line 46, in _assert_python
"stderr follows:\n%s" % (rc, err.decode('ascii', 'ignore')))
AssertionError: Process return code is 1, stderr follows:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-20: ordinal not in range(128)
Ran 23 tests in 8.959s |
Reopening bug. Quite a few buildbots are failing with this patch. Please, commit a new version or revert. |
I see. Sorry, my fault. |
I was not able to reproduce this error; I got other errors. The issue is not in the Python interpreter: the test is broken. Here is a patch that might solve the issue on some platforms (it needs testing on Windows). I guess all command-line tests fail when the path to the temporary directory contains non-ASCII characters. |
Serhiy, your original example from msg173373 still fails on
$ name=$(printf "\xff")
$ echo "print('Hello, world')" >$name
$ ./python $name
UnicodeEncodeError: 'ascii' codec can't encode character '\xff' in position 0: ordinal not in range(128)
[41257 refs] |
Thank you for a report. I have not any ideas what happened (note that |
This is it: >>>
>>> sys.getdefaultencoding()
'utf-8'
>>> sys.getfilesystemencoding()
'ascii'
>>> locale.getpreferredencoding(True)
'US-ASCII'
>>> locale.getpreferredencoding(False)
'US-ASCII'
>>>
$ locale
LANG=
LC_CTYPE="C"
LC_COLLATE="C"
LC_TIME="C"
LC_NUMERIC="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL= |
Perhaps we have to skip tests if filesystem encoding doesn't support wide characters. |
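One way such a skip could look is sketched below. This is an illustration under assumptions, not the test that was committed: the helper encodable(), the class NonAsciiScriptTest and the choice of the pound sign as file name are all hypothetical. It only uses unittest.skipUnless and sys.getfilesystemencoding() from the standard library.

```python
import sys
import unittest


def encodable(name: str, encoding: str) -> bool:
    """Return True if `name` can be encoded strictly with `encoding`."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


NON_ASCII_NAME = "\xa3"  # pound sign, used here only as an example name


class NonAsciiScriptTest(unittest.TestCase):
    @unittest.skipUnless(
        encodable(NON_ASCII_NAME, sys.getfilesystemencoding()),
        "filesystem encoding cannot represent the test file name",
    )
    def test_name_is_encodable(self):
        # Placeholder body: a real test would create and run the script.
        encoded = NON_ASCII_NAME.encode(sys.getfilesystemencoding())
        self.assertTrue(encoded)
```

With a C locale, where the filesystem encoding is 'ascii' as in the session above, the test would be reported as skipped instead of failing.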
On FreeBSD both Serhiy's original test case as well as the unit test work:
>>> sys.getdefaultencoding()
'utf-8'
>>> sys.getfilesystemencoding()
'iso8859-15'
>>> locale.getpreferredencoding(True)
'ISO8859-15'
>>> locale.getpreferredencoding(False)
'ISO8859-15'
Naturally, if the locale is utf-8 the test works as well. |
The test is still failing on Mac OS X:
======================================================================
Traceback (most recent call last):
File "/Volumes/bay2/buildslave/cpython/3.x.snakebite-mountainlion-amd64/build/Lib/test/test_cmd_line_script.py", line 380, in test_non_ascii
rc, stdout, stderr = assert_python_ok(script_name)
File "/Volumes/bay2/buildslave/cpython/3.x.snakebite-mountainlion-amd64/build/Lib/test/script_helper.py", line 54, in assert_python_ok
return _assert_python(True, *args, **env_vars)
File "/Volumes/bay2/buildslave/cpython/3.x.snakebite-mountainlion-amd64/build/Lib/test/script_helper.py", line 46, in _assert_python
"stderr follows:\n%s" % (rc, err.decode('ascii', 'ignore')))
AssertionError: Process return code is 2, stderr follows:
/Volumes/bay2/buildslave/cpython/3.x.snakebite-mountainlion-amd64/build/python.exe: can't open file './@test_63568_tmp.py': [Errno 2] No such file or directory
--
If I remember correctly, the command line is always decoded from UTF-8/surrogateescape on Mac OS X. That's why we have the function _Py_DecodeUTF8_surrogateescape() (for bootstrap reasons). Such an example should not work if the locale encoding is not UTF-8 on Mac OS X:
    arg = _Py_DecodeUTF8_surrogateescape(...);
    filename = _Py_wchar2char(arg);
    fp = fopen(filename, "r");
run_file() uses a different strategy:
    unicode = PyUnicode_FromWideChar(filename, wcslen(filename));
    if (unicode != NULL) {
        bytes = PyUnicode_EncodeFSDefault(unicode);
        Py_DECREF(unicode);
    }
    if (bytes != NULL)
        filename_str = PyBytes_AsString(bytes);
    else {
        PyErr_Clear();
        filename_str = "<encoding error>";
    }
run_file() looks right. Py_Main() should use similar code. We should probably not encode and then decode the filename in each function, but this is another problem. |
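The effect of decoding with one encoding and encoding with another can be sketched with Python codecs. This is an illustration only and does not call the C functions quoted above: the helper reencode() is hypothetical, and ISO8859-15 is picked merely as an example of a non-UTF-8 locale encoding.

```python
def reencode(raw: bytes, decode_with: str, encode_with: str) -> bytes:
    """Decode command-line bytes with one codec, then encode for open() with another."""
    text = raw.decode(decode_with, "surrogateescape")
    return text.encode(encode_with, "surrogateescape")


# Bytes of a file name as stored on disk: the pound sign encoded as UTF-8.
on_disk = "\xa3".encode("utf-8")

# Same codec in both directions: the bytes passed to open() match the file.
same_codec = reencode(on_disk, "utf-8", "utf-8")

# Decoded as UTF-8 but encoded with a different locale encoding:
# the resulting bytes name a different file.
mixed_codec = reencode(on_disk, "utf-8", "iso8859-15")

print(same_codec == on_disk)
print(mixed_codec == on_disk)
```

When the two codecs differ, the bytes handed to fopen() no longer match the name on disk, which is consistent with the "No such file or directory" error in the traceback.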
The issue exists on Linux as I reported in msg173373. |
"It skipped on locales which does not support "£" (cp1006, cp1250, cp1251, cp737, cp852, cp855, cp866, cp874, cp949, euc_kr, gb2312, gbk, hz, iso2022_kr, iso8859_10, iso8859_11, iso8859_16, iso8859_2, iso8859_4, iso8859_5, iso8859_6, johab, koi8_r, koi8_u, mac_arabic, mac_farsi, ptcp154, tis_620). But the bug is actual on such locales." This issue is not specific to this test: I create the issue bpo-16414 to improve the situation. |
I don't understand your problem. Non-ASCII filenames were already supported with UTF-8 locale encoding. The new test checks that there is no regression with UTF-8 locale encoding. The test pass without the fix because it was not supported. |
Test the example in msg173373. It fails without fix. |
I created the issue bpo-16416 to fix the Mac OS X case. |
I think something like CommonTest.test_nonascii_abspath() in Lib/test/test_genericpath.py should be used here. |
If there's not another revision of the test patch in the wings, can 56df0d4f0011 also be applied to 3.3, as tests are still failing on at least koobs-freebsd and koobs-freebsd-clang buildbots. |
Oh, I didn't understand that, sorry. I created bpo-16444 to test also UTF-8 locale encoding with undecodable filenames (undecodable from UTF-8 in *strict* mode, not by os.fsencode() which uses surrogateescape). |
New changeset 6b8a8bc6ba9c by Victor Stinner in branch 'default': |
"If there's not another revision of the test patch in the wings, can 56df0d4f0011 also be applied to 3.3, as tests are still failing on at least koobs-freebsd and koobs-freebsd-clang buildbots." I just applied the patch of the issue bpo-16444. I will check 3.4 buildbots, and then backport to older Python versions (at least 3.3). |
Let me insist on what koobs just said. The Windows buildbots are still |
The OpenIndiana 3.3 and 3.x buildbots have been broken for a week too. I suggest reverting this patch and using the custom buildbots to "debug it" before committing again. A week, and counting; it is about time. Feel free to hammer my OpenIndiana custom buildbots. |
New changeset 6017f09ead53 by Victor Stinner in branch '3.3': |
Back to green for all branches on FreeBSD, thank you Victor |
The "Mountain Lion" bots still fail. :) |
FreeBSD buildbots are green because I disabled the test on undecodable bytes! See issue bpo-16455 which proposes a fix for FreeBSD and OpenIndiana.
Yeah I know, see the issue bpo-16416 which has a patch. I plan to commit it to 3.4, wait for buildbots, and then backport to 3.3.
--
Python 3.3 handles non-ASCII almost everywhere. Python 3.4 will probably handle non-ASCII everywhere. Handling *undecodable* bytes is really hard. We cannot use the same code for UNIX and Windows. If we store data as bytes, it solves the issue, but we don't support any Unicode character on Windows anymore. If we store data as Unicode, it's the opposite (ok for Windows, decode error on UNIX). |
This changeset should fix this issue on FreeBSD and Solaris: see the issue bpo-16455 for more information. |
Victor, have you finished all the work for this issue? |
The issue is now fixed on all platforms for Python 3.4. Please keep the |
I'll assign the issue to you then. Is that OK? |
New changeset 41658a4fb3cc by Victor Stinner in branch '3.2':
New changeset 4d40c1ce8566 by Victor Stinner in branch '3.3': |
Sure. I backported all changesets related to this issue to Python 3.2 and 3.3. So I can finally close this issue. |
Thanks! |