Python launcher does not support unicode characters #60422

Closed
turncc mannequin opened this issue Oct 13, 2012 · 65 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error

Comments

@turncc
Mannequin

turncc mannequin commented Oct 13, 2012

BPO 16218
Nosy @jcea, @pitrou, @vstinner, @tjguk, @jkloth, @ezio-melotti, @asvetlov, @skrah, @serhiy-storchaka, @koobs
Files
  • pythonrun_filename_decoding.patch
  • pythonrun_filename_decoding_2.patch
  • pythonrun_filename_decoding_test.patch: Fix the test
  • pythonrun_filename_decoding_test_2.patch
  • test_non_ascii.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2013-01-03.01:08:38.180>
    created_at = <Date 2012-10-13.14:24:38.762>
    labels = ['interpreter-core', 'type-bug']
    title = 'Python launcher does not support unicode characters'
    updated_at = <Date 2016-06-22.19:18:10.177>
    user = 'https://bugs.python.org/turncc'

    bugs.python.org fields:

    activity = <Date 2016-06-22.19:18:10.177>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = True
    closed_date = <Date 2013-01-03.01:08:38.180>
    closer = 'vstinner'
    components = ['Interpreter Core']
    creation = <Date 2012-10-13.14:24:38.762>
    creator = 'turncc'
    dependencies = []
    files = ['27630', '27707', '27846', '27854', '27888']
    hgrepos = []
    issue_num = 16218
    keywords = ['patch', '3.3regression']
    message_count = 65.0
    messages = ['172807', '173359', '173373', '173374', '173376', '173382', '173724', '174408', '174409', '174427', '174430', '174433', '174521', '174529', '174531', '174549', '174560', '174568', '174571', '174573', '174577', '174581', '174587', '174588', '174590', '174595', '174603', '174604', '174606', '174611', '174620', '174841', '174842', '174844', '174864', '174865', '174871', '174874', '174876', '174877', '174878', '174881', '174898', '174899', '174901', '174944', '175185', '175255', '175270', '175273', '175274', '175290', '175295', '175414', '175435', '175436', '175437', '176872', '178118', '178171', '178173', '178234', '178869', '178871', '179564']
    nosy_count = 13.0
    nosy_names = ['jcea', 'pitrou', 'vstinner', 'tim.golden', 'jkloth', 'ezio.melotti', 'asvetlov', 'skrah', 'gklein', 'python-dev', 'serhiy.storchaka', 'koobs', 'turncc']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue16218'
    versions = ['Python 3.3', 'Python 3.4']

    @turncc
    Mannequin Author

    turncc mannequin commented Oct 13, 2012

    If the py.exe arguments contain non-ASCII characters, execution fails. This affects scripts whose file name or path contains non-ASCII characters.

    @turncc turncc mannequin added OS-windows type-crash A hard crash of the interpreter, possibly with a core dump labels Oct 13, 2012
    @tjguk
    Member

    tjguk commented Oct 19, 2012

    Confirming that this doesn't happen on 2.7

    py -2 £.py succeeds
    py -3 £.py gives:

    python: failed to set __main__.__loader__
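For a reproduction that does not depend on the shell or on py.exe, a short script can create a file named £.py in a temporary directory and run it with the current interpreter. This is only a sketch: the helper name run_non_ascii_script is made up for illustration, and it assumes the filesystem encoding can represent "£".

```python
import os
import subprocess
import sys
import tempfile


def run_non_ascii_script(name="\xa3.py"):
    """Hypothetical helper: run a script whose file name contains a pound sign."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, name)
        with open(path, "w", encoding="utf-8") as f:
            f.write("print('Hello, world')\n")
        # On the affected builds described in this thread, the child process
        # reports an error instead of printing the greeting.
        return subprocess.run(
            [sys.executable, path],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )


result = run_non_ascii_script()
print(result.returncode, result.stdout, result.stderr)
```

On an interpreter that includes the fix, the return code should be 0 and stdout should contain the greeting.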

    @serhiy-storchaka
    Member

    I can reproduce this on Linux (3.3+ only):

    $ name=$(printf "\xff")
    $ echo "print('Hello, world')" >$name
    $ ./python $name
    python: failed to set __main__.__loader__

    The issue is in the PyRun_SimpleFileExFlags() function, which receives the file name as a raw char * (the documentation says it is encoded with the filesystem encoding, sys.getfilesystemencoding()), but set_main_loader() then decodes this name from UTF-8.
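The mismatch can be seen from Python itself. The sketch below is illustrative only, not the interpreter's code: it takes the one-byte name from the shell example above and shows that a strict UTF-8 decode fails, while decoding with the surrogateescape error handler (the handler os.fsdecode() uses on POSIX) preserves the byte and round-trips.

```python
raw_name = b"\xff"  # the file name created by printf "\xff"

# A strict UTF-8 decode rejects the byte.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps the undecodable byte to a lone surrogate (U+DCFF)...
decoded = raw_name.decode("utf-8", "surrogateescape")

# ...and encoding with the same handler restores the original byte.
round_trip = decoded.encode("utf-8", "surrogateescape")

print(strict_ok, ascii(decoded), round_trip)
```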

    @serhiy-storchaka serhiy-storchaka added interpreter-core (Objects, Python, Grammar, and Parser dirs) and removed OS-windows labels Oct 20, 2012
    @serhiy-storchaka
    Member

    Here is a patch which fixes filename decoding error in PyRun_SimpleFileExFlags().

    @vstinner
    Member

    The patch looks correct, but a test is missing.

    @ezio-melotti ezio-melotti added type-bug An unexpected behavior, bug, or error and removed type-crash A hard crash of the interpreter, possibly with a core dump labels Oct 20, 2012
    @serhiy-storchaka
    Member

    Where do we have tests for launching Python? I can't find any. runpy is not affected.

    @serhiy-storchaka
    Member

    Test added.

    @python-dev
    Mannequin

    python-dev mannequin commented Nov 1, 2012

    New changeset 02d25098ad57 by Andrew Svetlov in branch '3.3':
    Issue bpo-16218: Support non ascii characters in python launcher.
    http://hg.python.org/cpython/rev/02d25098ad57

    New changeset 1267d64c14b3 by Andrew Svetlov in branch 'default':
    Merge issue bpo-16218: Support non ascii characters in python launcher.
    http://hg.python.org/cpython/rev/1267d64c14b3

    @asvetlov
    Contributor

    asvetlov commented Nov 1, 2012

    Fixed. Thanks, Serhiy.

    @asvetlov asvetlov closed this as completed Nov 1, 2012
    @vsajip
    Member

    vsajip commented Nov 1, 2012

    I'm not especially familiar with this code, but just trying to understand - how come filename_obj isn't decref'd on normal exit?

    @asvetlov
    Contributor

    asvetlov commented Nov 1, 2012

    Vinay, it's processed in
    PyObject_CallFunction(loader_type, "sN", "__main__", filename_obj)
    Please note the "sN" format instead of "sO".
    "N" means a PyObject* is passed, but unlike with "sO" that object is not increfed.

    @vsajip
    Member

    vsajip commented Nov 1, 2012

    Please note the "sN" format instead of "sO".

    I see. Thanks.

    @skrah
    Mannequin

    skrah mannequin commented Nov 2, 2012

    Some of the buildbots are failing with the new test:

    ======================================================================
    FAIL: test_non_utf8 (test.test_cmd_line_script.CmdLineTest)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/export/home/buildbot/64bits/3.x.cea-indiana-amd64/build/Lib/test/test_cmd_line_script.py", line 373, in test_non_utf8
        importlib.machinery.SourceFileLoader)
      File "/export/home/buildbot/64bits/3.x.cea-indiana-amd64/build/Lib/test/test_cmd_line_script.py", line 126, in _check_script
        rc, out, err = assert_python_ok(*run_args)
      File "/export/home/buildbot/64bits/3.x.cea-indiana-amd64/build/Lib/test/script_helper.py", line 54, in assert_python_ok
        return _assert_python(True, *args, **env_vars)
      File "/export/home/buildbot/64bits/3.x.cea-indiana-amd64/build/Lib/test/script_helper.py", line 46, in _assert_python
        "stderr follows:\n%s" % (rc, err.decode('ascii', 'ignore')))
    AssertionError: Process return code is 1, stderr follows:
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-20: ordinal not in range(128)

    Ran 23 tests in 8.959s

    @jcea
    Member

    jcea commented Nov 2, 2012

    Reopening bug.

    Quite a few buildbots are failing with this patch. Please, commit a new version or revert.

    @jcea jcea reopened this Nov 2, 2012
    @asvetlov
    Contributor

    asvetlov commented Nov 2, 2012

    I see. Sorry, my fault.
    Give me weekend to figure out why it fails.
    Thanks.

    @asvetlov asvetlov self-assigned this Nov 2, 2012
    @serhiy-storchaka
    Member

    I was not able to reproduce this error; I got other errors. The issue is not in the Python interpreter; the test is broken. Here is a patch that might solve the issue on some platforms (it needs testing on Windows).

    I guess all command line tests fail when the path to the temporary directory contains non-ASCII characters.

    @skrah
    Mannequin

    skrah mannequin commented Nov 2, 2012

    Serhiy, your original example from msg173373 still fails on
    FreeBSD:

    $ name=$(printf "\xff")
    $ echo "print('Hello, world')" >$name
    $ ./python $name
    UnicodeEncodeError: 'ascii' codec can't encode character '\xff' in position 0: ordinal not in range(128)
    [41257 refs]

    @serhiy-storchaka
    Member

    Serhiy, your original example from msg173373 still fails on
    FreeBSD:

    Thank you for the report. I have no idea what happened (note that the
    error is on encoding, not decoding). Can you please show me the results of
    sys.getdefaultencoding(), sys.getfilesystemencoding(),
    locale.getpreferredencoding(True), locale.getpreferredencoding(False),
    and the output of the locale command?

    @skrah
    Mannequin

    skrah mannequin commented Nov 2, 2012

    This is it:

    >>> 
    >>> sys.getdefaultencoding()
    'utf-8'
    >>> sys.getfilesystemencoding()
    'ascii'
    >>> locale.getpreferredencoding(True)
    'US-ASCII'
    >>> locale.getpreferredencoding(False)
    'US-ASCII'
    >>> 
    $ locale
    LANG=
    LC_CTYPE="C"
    LC_COLLATE="C"
    LC_TIME="C"
    LC_NUMERIC="C"
    LC_MONETARY="C"
    LC_MESSAGES="C"
    LC_ALL=
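The Python-side values shown above can be collected in one go with a small script. This is a convenience sketch only; the function name encoding_report and the dictionary keys are made up for illustration.

```python
import locale
import sys


def encoding_report():
    """Hypothetical helper: gather the encodings discussed in this thread."""
    return {
        "default": sys.getdefaultencoding(),
        "filesystem": sys.getfilesystemencoding(),
        "preferred(True)": locale.getpreferredencoding(True),
        "preferred(False)": locale.getpreferredencoding(False),
    }


report = encoding_report()
for key, value in report.items():
    print("%-17s %s" % (key, value))
```

The output of the locale command still has to be captured separately from the shell.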

    @asvetlov
    Contributor

    asvetlov commented Nov 2, 2012

    Perhaps we have to skip the tests if the filesystem encoding doesn't support wide characters.
    I'm not sure about the way: should we skip if sys.getfilesystemencoding() is not utf8, or is it better to try to encode the path and skip if that fails?
    I think the latter is better.
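The second option ("try to encode the path and skip if that fails") could look roughly like the sketch below. The helper name fs_can_encode is hypothetical, not an existing test.support function; the encoding is a parameter so that the check can be tried against codecs mentioned in this thread.

```python
import sys


def fs_can_encode(name, encoding=None):
    """Hypothetical helper: can `name` be encoded strictly with `encoding`?

    Defaults to the filesystem encoding of the running interpreter.
    """
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    try:
        name.encode(encoding)  # strict error handler
    except UnicodeEncodeError:
        return False
    return True


pound = "\xa3"  # the pound sign used in the original report
for codec in ("ascii", "iso8859-15", "cp1251", "utf-8"):
    print(codec, fs_can_encode(pound, codec))
```

A test could then be guarded with unittest.skipUnless(fs_can_encode(script_name), "...") instead of comparing the encoding name against "utf8".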

    @skrah
    Mannequin

    skrah mannequin commented Nov 2, 2012

    On FreeBSD both Serhiy's original test case as well as the unit test work
    if the locale is ISO8859-15:

    >>> sys.getdefaultencoding()
    'utf-8'
    >>> sys.getfilesystemencoding()
    'iso8859-15'
    >>> locale.getpreferredencoding(True)
    'ISO8859-15'
    >>> locale.getpreferredencoding(False)
    'ISO8859-15'

    Naturally, if the locale is utf-8 the test works as well.

    @vstinner
    Member

    vstinner commented Nov 5, 2012

    The test is still failing on Mac OS X:

    ======================================================================
    FAIL: test_non_ascii (test.test_cmd_line_script.CmdLineTest)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/Volumes/bay2/buildslave/cpython/3.x.snakebite-mountainlion-amd64/build/Lib/test/test_cmd_line_script.py", line 380, in test_non_ascii
        rc, stdout, stderr = assert_python_ok(script_name)
      File "/Volumes/bay2/buildslave/cpython/3.x.snakebite-mountainlion-amd64/build/Lib/test/script_helper.py", line 54, in assert_python_ok
        return _assert_python(True, *args, **env_vars)
      File "/Volumes/bay2/buildslave/cpython/3.x.snakebite-mountainlion-amd64/build/Lib/test/script_helper.py", line 46, in _assert_python
        "stderr follows:\n%s" % (rc, err.decode('ascii', 'ignore')))
    AssertionError: Process return code is 2, stderr follows:
    /Volumes/bay2/buildslave/cpython/3.x.snakebite-mountainlion-amd64/build/python.exe: can't open file './@test_63568_tmp.py': [Errno 2] No such file or directory

    http://buildbot.python.org/all/builders/AMD64%20Mountain%20Lion%20%5BSB%5D%203.x/builds/410/steps/test/logs/stdio

    --

    If I remember correctly, the command line is always decoded from UTF-8/surrogateescape on Mac OS X. That's why we have the function _Py_DecodeUTF8_surrogateescape() (for bootstrap reasons).

    An example like this should not work if the locale encoding is not UTF-8 on Mac OS X:

    arg = _Py_DecodeUTF8_surrogateescape(...);
    filename = _Py_wchar2char(arg);
    fp = fopen(filename, "r");

    run_file() uses a different strategy:

            unicode = PyUnicode_FromWideChar(filename, wcslen(filename));
            if (unicode != NULL) {
                bytes = PyUnicode_EncodeFSDefault(unicode);
                Py_DECREF(unicode);
            }
            if (bytes != NULL)
                filename_str = PyBytes_AsString(bytes);
            else {
                PyErr_Clear();
                filename_str = "<encoding error>";
            }

    run_file() looks to be right. Py_Main() should use similar code.

    We should probably not encode and then decode the filename in each function, but this is another problem.
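A rough Python analogue of the run_file() strategy quoted above is sketched below. It is an approximation under stated assumptions: the function name encode_filename is hypothetical, the encoding is passed explicitly so the result does not depend on the current locale, and surrogateescape is used as the error handler (as os.fsencode() does on POSIX).

```python
def encode_filename(name, encoding, errors="surrogateescape"):
    """Hypothetical helper: encode a file name, with a placeholder on failure.

    Mirrors the shape of the C snippet: try to encode, and fall back to
    "<encoding error>" when the name cannot be represented.
    """
    try:
        return name.encode(encoding, errors)
    except UnicodeEncodeError:
        return b"<encoding error>"


print(encode_filename("\xa3.py", "utf-8"))       # encodable in UTF-8
print(encode_filename("\xa3.py", "iso8859-15"))  # encodable in ISO 8859-15
print(encode_filename("\xa3.py", "ascii"))       # falls back to the placeholder
print(encode_filename("\udcff", "ascii"))        # escaped byte is restored
```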

    @serhiy-storchaka
    Member

    The issue is about Windows and UTF-8 is never used as filesystem encoding
    on Windows.

    The issue exists on Linux as I reported in msg173373.

    @vstinner
    Member

    vstinner commented Nov 5, 2012

    "It is skipped on locales which do not support "£" (cp1006, cp1250, cp1251, cp737, cp852, cp855, cp866, cp874, cp949, euc_kr, gb2312, gbk, hz, iso2022_kr, iso8859_10, iso8859_11, iso8859_16, iso8859_2, iso8859_4, iso8859_5, iso8859_6, johab, koi8_r, koi8_u, mac_arabic, mac_farsi, ptcp154, tis_620). But the bug still applies on such locales."

    This issue is not specific to this test: I created issue bpo-16414 to improve the situation.

    @vstinner
    Member

    vstinner commented Nov 5, 2012

    >> It tests nothing on utf-8 locale (test passed even when bug is not fixed).
    > The issue is about Windows and UTF-8 is never used as filesystem encoding on Windows.
    The issue exists on Linux as I reported in msg173373.

    I don't understand your problem. Non-ASCII filenames were already supported with the UTF-8 locale encoding. The new test checks that there is no regression with the UTF-8 locale encoding. The test passes without the fix because it was not supported.

    @serhiy-storchaka
    Member

    Non-ASCII filenames were already supported with UTF-8 locale encoding.

    Test the example in msg173373. It fails without fix.

    @vstinner
    Member

    vstinner commented Nov 5, 2012

    I created the issue bpo-16416 to fix the Mac OS X case.

    @serhiy-storchaka
    Member

    I think something like CommonTest.test_nonascii_abspath() in Lib/test/test_genericpath.py should be used here.

    @koobs

    koobs commented Nov 10, 2012

    If there's not another revision of the test patch in the wings, can 56df0d4f0011 also be applied to 3.3, as tests are still failing on at least koobs-freebsd and koobs-freebsd-clang buildbots.

    @vstinner
    Member

    > Non-ASCII filenames were already supported with UTF-8 locale encoding.

    Test the example in msg173373. It fails without fix.

    Oh, I didn't understand that, sorry. I created bpo-16444 to test also UTF-8 locale encoding with undecodable filenames (undecodable from UTF-8 in *strict* mode, not by os.fsencode() which uses surrogateescape).
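The distinction between "undecodable in strict mode" and "decodable with surrogateescape" suggests how a test could choose its file name. The sketch below is an assumption about the general approach, not the actual test.support code: the helper pick_undecodable and the candidate list are made up for illustration.

```python
def pick_undecodable(candidates, encoding):
    """Hypothetical helper: first byte string that a strict decode rejects."""
    for raw in candidates:
        try:
            raw.decode(encoding)  # strict error handler
        except UnicodeDecodeError:
            return raw
    return None  # every candidate is decodable in this encoding


candidates = [b"abc", b"\xc2\xa3", b"\xff"]

print(pick_undecodable(candidates, "utf-8"))    # only b"\xff" is invalid UTF-8
print(pick_undecodable(candidates, "ascii"))    # the first non-ASCII candidate
print(pick_undecodable(candidates, "latin-1"))  # latin-1 decodes every byte
```

When no candidate is undecodable, as with latin-1 here, a test built on such a name would have to be skipped.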

    @python-dev
    Mannequin

    python-dev mannequin commented Nov 10, 2012

    New changeset 6b8a8bc6ba9c by Victor Stinner in branch 'default':
    Issue bpo-16444, bpo-16218: Use TESTFN_UNDECODABLE on UNIX
    http://hg.python.org/cpython/rev/6b8a8bc6ba9c

    @vstinner
    Member

    "If there's not another revision of the test patch in the wings, can 56df0d4f0011 also be applied to 3.3, as tests are still failing on at least koobs-freebsd and koobs-freebsd-clang buildbots."

    I just applied the patch of the issue bpo-16444. I will check 3.4 buildbots, and then backport to older Python versions (at least 3.3).

    @pitrou
    Member

    pitrou commented Nov 10, 2012

    If there's not another revision of the test patch in the wings, can
    56df0d4f0011 also be applied to 3.3, as tests are still failing on at
    least koobs-freebsd and koobs-freebsd-clang buildbots.

    Let me insist on what koobs just said. The Windows buildbots are still
    broken on 3.3, so this either needs fixing or reverting.

    @jcea
    Member

    jcea commented Nov 10, 2012

    The OpenIndiana 3.3 and 3.x buildbots have been broken for a week too.

    I suggest reverting this patch and using the custom buildbots to "debug it" before committing again. It has been a week and counting; it is about time.

    Feel free to hammer my OpenIndiana custom buildbots.

    @python-dev
    Mannequin

    python-dev mannequin commented Nov 12, 2012

    New changeset 6017f09ead53 by Victor Stinner in branch '3.3':
    Issue bpo-16218, bpo-16444: Backport improvment on tests for non-ASCII characters
    http://hg.python.org/cpython/rev/6017f09ead53

    @koobs

    koobs commented Nov 12, 2012

    Back to green for all branches on FreeBSD, thank you Victor

    @skrah
    Mannequin

    skrah mannequin commented Nov 12, 2012

    The "Mountain Lion" bots still fail. :)

    @vstinner
    Member

    Back to green for all branches on FreeBSD, thank you Victor

    FreeBSD buildbots are green because I disabled the test on undecodable bytes! See issue bpo-16455 which proposes a fix for FreeBSD and OpenIndiana.

    The "Mountain Lion" bots still fail. :)

    Yeah I know, see the issue bpo-16416 which has a patch. I plan to commit it to 3.4, wait for buildbots, and then backport to 3.3.

    --

    Python 3.3 handles non-ASCII almost everywhere. Python 3.4 will probably handle non-ASCII everywhere.

    Handling *undecodable* bytes is really hard. We cannot use the same code for UNIX and Windows. If we store data as bytes, it solves the issue, but we don't support any Unicode character on Windows anymore. If we store data as Unicode, it's the opposite (ok for Windows, decode error on UNIX).

    @vstinner
    Member

    vstinner commented Dec 4, 2012

    New changeset c25635b137cc by Victor Stinner in branch 'default':
    Issue bpo-16455: On FreeBSD and Solaris, if the locale is C, the
    http://hg.python.org/cpython/rev/c25635b137cc

    This changeset should fix this issue on FreeBSD and Solaris: see the issue bpo-16455 for more information.

    @asvetlov
    Contributor

    Victor, have you finished all the work for this issue?
    Can it be closed?

    @vstinner
    Member

    The issue is now fixed on all platforms for Python 3.4. Please keep the
    issue open until all changes are backported to Python 3.3 or even Python
    3.2.

    @asvetlov
    Contributor

    I'll assign the issue to you then. Is that OK?

    @asvetlov asvetlov assigned vstinner and unassigned asvetlov Dec 25, 2012
    @vstinner
    Member

    Status of the different issues:

    bpo-16416, Mac OS X: 3.2, 3.3, 3.4
    bpo-16455, FreeBSD and Solaris: 3.4
    bpo-16218, set_main_loader: 3.3, 3.4
    bpo-16218, test_cmd_line_script: 3.4 (3.3 has an old copy of the test)
    bpo-16414, add support.TESTFN_NONASCII: 3.4
    bpo-16444, use support.TESTFN_NONASCII: 3.4

    @python-dev
    Mannequin

    python-dev mannequin commented Jan 3, 2013

    New changeset 41658a4fb3cc by Victor Stinner in branch '3.2':
    Issue bpo-16218, bpo-16414, bpo-16444: Backport FS_NONASCII, TESTFN_UNDECODABLE,
    http://hg.python.org/cpython/rev/41658a4fb3cc

    New changeset 4d40c1ce8566 by Victor Stinner in branch '3.3':
    (Merge 3.2) Issue bpo-16218, bpo-16414, bpo-16444: Backport FS_NONASCII,
    http://hg.python.org/cpython/rev/4d40c1ce8566

    @vstinner
    Member

    vstinner commented Jan 3, 2013

    I'll assign the issue to you then. Is that OK?

    Sure.

    I backported all changesets related to this issue to Python 3.2 and 3.3. So I can finally close this issue.

    @vstinner vstinner closed this as completed Jan 3, 2013
    @vstinner vstinner removed their assignment Jan 3, 2013
    @asvetlov
    Contributor

    Thanks!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022