Decode command line arguments from ASCII on FreeBSD and Solaris if the locale is C #60659

Closed
vstinner opened this issue Nov 11, 2012 · 11 comments

Comments

@vstinner
Member

BPO 16455
Nosy @jcea, @vstinner, @ezio-melotti
Files
  • workaround_codeset.patch
  • force_ascii.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2013-01-03.01:07:03.240>
    created_at = <Date 2012-11-11.22:14:15.060>
    labels = ['expert-unicode']
    title = 'Decode command line arguments from ASCII on FreeBSD and Solaris if the locale is C'
    updated_at = <Date 2013-01-03.01:07:03.239>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2013-01-03.01:07:03.239>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2013-01-03.01:07:03.240>
    closer = 'vstinner'
    components = ['Unicode']
    creation = <Date 2012-11-11.22:14:15.060>
    creator = 'vstinner'
    dependencies = []
    files = ['27965', '27970']
    hgrepos = []
    issue_num = 16455
    keywords = ['patch']
    message_count = 11.0
    messages = ['175401', '175408', '175410', '175446', '176434', '176436', '176869', '176870', '176871', '178864', '178866']
    nosy_count = 4.0
    nosy_names = ['jcea', 'vstinner', 'ezio.melotti', 'python-dev']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue16455'
    versions = ['Python 3.2', 'Python 3.3', 'Python 3.4']

    @vstinner
    Member Author

    On FreeBSD and OpenIndiana, sys.getfilesystemencoding() is 'ascii' when the locale is not set, whereas the locale encoding is ISO-8859-1.

    This inconsistency causes several problems. For example, os.fsencode(sys.argv[1]) fails if the argument is not ASCII, because sys.argv is decoded from the locale encoding (by _Py_char2wchar()).
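
The mismatch can be reproduced on any platform by simulating the two codecs involved. This is a minimal sketch, assuming an illustrative argument of b"caf\xe9" (not taken from the report) and assuming that os.fsencode() behaves like str.encode(fs_encoding, "surrogateescape") on POSIX:

```python
raw = b"caf\xe9"  # illustrative command line bytes (assumption)

# The argument is decoded from the locale encoding, which behaves like ISO-8859-1.
decoded_latin1 = raw.decode("iso8859-1")

# Encoding back with the announced filesystem encoding (ASCII) then fails,
# because U+00E9 is neither ASCII nor an escaped surrogate.
try:
    decoded_latin1.encode("ascii", "surrogateescape")
    roundtrip_ok = True
except UnicodeEncodeError:
    roundtrip_ok = False

# When both directions use ASCII + surrogateescape, the bytes round-trip.
decoded_ascii = raw.decode("ascii", "surrogateescape")
restored = decoded_ascii.encode("ascii", "surrogateescape")
```

The round trip only works when decoding and encoding use the same codec, which is why the two settings need to agree.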

    sys.getfilesystemencoding() is 'ascii' because nl_langinfo(CODESET) is used to get the locale encoding, and nl_langinfo(CODESET) announces ASCII (or an alias of this encoding).

    Python should detect this case and set sys.getfilesystemencoding() to 'iso8859-1' if the locale encoding is 'iso8859-1' whereas nl_langinfo(CODESET) announces ASCII. We can for example decode b'\xe9' with mbstowcs() and check if it fails or if the result is U+00E9.

    @vstinner
    Member Author

    The attached patch works around the CODESET issue on OpenIndiana and FreeBSD. If the LC_CTYPE locale is "C" and nl_langinfo(CODESET) returns ASCII (or an alias of this encoding), b"\xE9" is decoded from the locale encoding: if the result is U+00E9, the patched Python uses ISO-8859-1. (If decoding fails, the locale encoding really is ASCII and the workaround is not used.)

    If the result is different (b'\xe9' is not decoded from the locale encoding to U+00E9), a ValueError is raised. I wrote this test to detect bugs. I hope that our buildbots will validate the code. We may choose a different behaviour (e.g. keep ASCII).
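
The decision described in this comment can be sketched in Python. This is not the patch itself (which is C code): `guess_c_locale_encoding`, `is_ascii_alias` and the `decode_probe` callable are hypothetical names, with `decode_probe` standing in for mbstowcs().

```python
import codecs


def is_ascii_alias(codeset):
    """Return True if `codeset` names the ASCII codec or one of its aliases."""
    try:
        return codecs.lookup(codeset).name == "ascii"
    except LookupError:
        return False


def guess_c_locale_encoding(lc_ctype, codeset, decode_probe):
    """Hypothetical rendering of the workaround described above.

    decode_probe stands in for mbstowcs(): it takes bytes and returns a str,
    or raises UnicodeDecodeError when the locale cannot decode them.
    """
    if lc_ctype != "C" or not is_ascii_alias(codeset):
        return codeset  # the workaround does not apply
    try:
        decoded = decode_probe(b"\xe9")
    except UnicodeDecodeError:
        return "ascii"  # the locale encoding really is ASCII
    if decoded == "\xe9":
        return "iso8859-1"
    raise ValueError("unexpected decoding of 0xE9: %r" % decoded)
```

Passing `lambda b: b.decode("latin-1")` as the probe imitates the FreeBSD behaviour, and `lambda b: b.decode("ascii")` imitates a strict ASCII locale.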

    Example on FreeBSD 8.2, original Python 3.4:

    $ ./python
    >>> import sys, locale
    >>> sys.getfilesystemencoding()
    'ascii'
    >>> locale.getpreferredencoding()
    'US-ASCII'

    Example on FreeBSD 8.2, patched Python 3.4:

    $ ./python 
    >>> import sys, locale
    >>> sys.getfilesystemencoding()
    'iso8859-1'
    >>> locale.getpreferredencoding()
    'iso8859-1'
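
Note that 'ascii' and 'US-ASCII' in the unpatched output are two spellings of the same codec, so comparing the raw strings is misleading. A small sketch of a comparison that normalizes the names through the codec registry (`same_codec` and `encodings_agree` are hypothetical helper names):

```python
import codecs
import locale
import sys


def same_codec(name_a, name_b):
    """Return True if both names resolve to the same Python codec."""
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name


def encodings_agree():
    """Compare the filesystem encoding with the preferred locale encoding."""
    return same_codec(sys.getfilesystemencoding(),
                      locale.getpreferredencoding())
```

With this helper, `same_codec('ascii', 'US-ASCII')` is true, while `same_codec('ascii', 'iso8859-1')` is false.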

    @vstinner
    Member Author

    Some tests are failing with the patch:

    ======================================================================
    FAIL: test_undecodable_env (test.test_subprocess.POSIXProcessTestCase)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/usr/home/haypo/prog/python/default/Lib/test/test_subprocess.py", line 1606, in test_undecodable_env
        self.assertEqual(stdout.decode('ascii'), ascii(value))
    AssertionError: "'abc\\xff'" != "'abc\\udcff'"
    - 'abc\xff'
    ?      ^
    + 'abc\udcff'
    ?      ^^^

    ======================================================================
    FAIL: test_strcoll_with_diacritic (test.test_locale.TestEnUSCollation)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/usr/home/haypo/prog/python/default/Lib/test/test_locale.py", line 364, in test_strcoll_with_diacritic
        self.assertLess(locale.strcoll('\xe0', 'b'), 0)
    AssertionError: 126 not less than 0

    ======================================================================
    FAIL: test_strxfrm_with_diacritic (test.test_locale.TestEnUSCollation)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/usr/home/haypo/prog/python/default/Lib/test/test_locale.py", line 367, in test_strxfrm_with_diacritic
        self.assertLess(locale.strxfrm('\xe0'), locale.strxfrm('b'))
    AssertionError: '\xe0' not less than 'b'
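
The two strings compared in the test_undecodable_env failure can be reproduced directly. This sketch assumes the child process decoded the environment variable with ISO-8859-1 while the test expected ASCII with surrogateescape; that is an interpretation of the traceback, not something stated in the report.

```python
raw = b"abc\xff"  # undecodable byte used by the test, per the traceback

# What the test expects: byte 0xFF escaped as the lone surrogate U+DCFF.
expected = ascii(raw.decode("ascii", "surrogateescape"))

# What an ISO-8859-1 decoder produces instead: the character U+00FF.
actual = ascii(raw.decode("iso8859-1"))
```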

    @vstinner
    Member Author

    Hijacking locale.getpreferredencoding() may be dangerous. I attached a
    new patch, force_ascii.patch, which uses a different approach: be stricter
    than mbstowcs() and force the ASCII encoding when:

    • the LC_CTYPE locale is C
    • nl_langinfo(CODESET) is ASCII or an alias of ASCII
    • mbstowcs() is able to decode non-ASCII characters
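
The three conditions above can be sketched as a single predicate. This is a Python illustration, not the content of force_ascii.patch: `should_force_ascii` and the `decode_probe` callable (a stand-in for mbstowcs()) are hypothetical, and probing every byte from 0x80 to 0xFF is an assumption about how "able to decode non-ASCII characters" might be checked.

```python
import codecs


def should_force_ascii(lc_ctype, codeset, decode_probe):
    """Return True when all three conditions listed above hold."""
    if lc_ctype != "C":
        return False
    try:
        if codecs.lookup(codeset).name != "ascii":
            return False
    except LookupError:
        return False
    # Probe every non-ASCII byte; a strict ASCII locale rejects all of them.
    for byte in range(0x80, 0x100):
        try:
            decode_probe(bytes([byte]))
        except UnicodeDecodeError:
            continue
        return True  # the locale decoded a non-ASCII byte
    return False
```

On a Unix system the first two arguments could be taken from `locale.setlocale(locale.LC_CTYPE)` and `locale.nl_langinfo(locale.CODESET)`; the probe has no direct pure-Python equivalent, so it is left as a parameter here.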


    @jcea
    Member

    jcea commented Nov 26, 2012

    Victor, any progress on this?

    @vstinner
    Member Author

    > Victor, any progress on this?

    We have two options, I don't know which one is the best (safer). Does
    the terminal handle non-ASCII characters with a C locale on FreeBSD or
    Solaris?

    @python-dev
    Mannequin

    python-dev mannequin commented Dec 4, 2012

    New changeset c25635b137cc by Victor Stinner in branch 'default':
    Issue bpo-16455: On FreeBSD and Solaris, if the locale is C, the
    http://hg.python.org/cpython/rev/c25635b137cc

    @vstinner vstinner changed the title from "sys.getfilesystemencoding() is not the locale encoding on FreeBSD and OpenSolaris when the locale is not set" to "Decode command line arguments from ASCII on FreeBSD and Solaris if the locale is C" on Dec 4, 2012
    @vstinner
    Member Author

    vstinner commented Dec 4, 2012

    > We have two options, I don't know which one is the best (safer).

    Forcing ASCII is safer. Python should announce that it does not "understand" non-ASCII bytes on the command line. I also chose this option because isalpha(0xe9) returns 0 (even if mbstowcs(0xe9) returns L"\xe9"): FreeBSD doesn't consider U+00E9 a letter in the C locale, so Python should also treat this byte as raw data.

    @vstinner
    Member Author

    vstinner commented Dec 4, 2012

    > New changeset c25635b137cc by Victor Stinner in branch 'default':
    > Issue bpo-16455: On FreeBSD and Solaris, if the locale is C, the
    > http://hg.python.org/cpython/rev/c25635b137cc

    This changeset should fix bpo-16218 on FreeBSD and Solaris (these operating systems should now correctly decode undecodable command line arguments).

    @python-dev
    Mannequin

    python-dev mannequin commented Jan 3, 2013

    New changeset c256764e2b3f by Victor Stinner in branch '3.2':
    Issue bpo-16455: On FreeBSD and Solaris, if the locale is C, the
    http://hg.python.org/cpython/rev/c256764e2b3f

    New changeset 5bb289e4fb35 by Victor Stinner in branch '3.3':
    (Merge 3.2) Issue bpo-16455: On FreeBSD and Solaris, if the locale is C, the
    http://hg.python.org/cpython/rev/5bb289e4fb35

    @vstinner
    Member Author

    vstinner commented Jan 3, 2013

    I backported the fix to Python 3.2 and 3.3 because I consider it important enough.

    @vstinner vstinner closed this as completed Jan 3, 2013
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022