PEP 540: Add a new UTF-8 mode #73426

Closed
vstinner opened this issue Jan 11, 2017 · 37 comments
Labels
3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) stdlib Python modules in the Lib dir topic-unicode type-feature A feature request or enhancement

Comments

@vstinner
Member

BPO 29240
Nosy @vstinner, @ezio-melotti, @methane, @eryksun
PRs
  • bpo-29240: PEP 540: Add a new UTF-8 mode #855
  • bpo-29240, bpo-32030: pymain_set_sys_argv() now copies argv #4838
  • bpo-29240: Don't define decode_locale() on macOS #4895
  • bpo-29240, bpo-32030: Py_Main() re-reads config if encoding changes #4899
  • bpo-29240: Skip test_readline.test_nonascii() #4968
  • bpo-29240: readline now ignores the UTF-8 Mode #5145
  • bpo-29240: Ignore UTF-8 Mode in time module #5148
  • bpo-29240: Fix locale encodings in UTF-8 Mode #5170
  • bpo-31900: Fix localeconv() encoding for LC_NUMERIC #4174
  • bpo-29240: Skip test_readline.test_nonascii() on C locale #5203
  • [3.6] bpo-29240: Skip test_readline.test_nonascii() on C locale (GH-5203) #5204
  • bpo-29240: PyUnicode_DecodeLocale() uses UTF-8 on Android #5272
Files
  • test_all_locales.py
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2018-01-17.14:28:41.827>
    created_at = <Date 2017-01-11.11:19:52.086>
    labels = ['interpreter-core', 'type-feature', 'library', 'expert-unicode', '3.7']
    title = 'PEP 540: Add a new UTF-8 mode'
    updated_at = <Date 2022-02-08.11:52:04.506>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2022-02-08.11:52:04.506>
    actor = 'yan12125'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-01-17.14:28:41.827>
    closer = 'vstinner'
    components = ['Interpreter Core', 'Library (Lib)', 'Unicode']
    creation = <Date 2017-01-11.11:19:52.086>
    creator = 'vstinner'
    dependencies = []
    files = ['47385']
    hgrepos = []
    issue_num = 29240
    keywords = ['patch']
    message_count = 37.0
    messages = ['285214', '285215', '285216', '285275', '285276', '285277', '285278', '285280', '285296', '285298', '285325', '285332', '285357', '285407', '285482', '307694', '307695', '308182', '308183', '308198', '308213', '308217', '308430', '308448', '308915', '308916', '309782', '309798', '309958', '309959', '310029', '310092', '310097', '310177', '310443', '310444', '412665']
    nosy_count = 4.0
    nosy_names = ['vstinner', 'ezio.melotti', 'methane', 'eryksun']
    pr_nums = ['855', '4838', '4895', '4899', '4968', '5145', '5148', '5170', '4174', '5203', '5204', '5272']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue29240'
    versions = ['Python 3.7']

    @vstinner
    Member Author

    This issue tracks the implementation of the PEP-540.

    Attached pep540_cli.py script can be used to play with it.

    @vstinner vstinner added 3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) stdlib Python modules in the Lib dir topic-unicode type-feature A feature request or enhancement labels Jan 11, 2017
    @vstinner
    Member Author

    PEP-540.patch: first draft

    Changes:

    • Add sys.flags.utf8mode
    • Add -X utf8 command line option
    • Add PYTHONUTF8 environment variable
    • sys.stdin, sys.stdout and sys.stderr encoding and errors are modified in UTF-8 mode
    • open() default encoding and errors are modified in UTF-8 mode
    • Add Lib/test/test_utf8mode.py
    • Skip a few tests relying on the locale encoding if the UTF-8 mode is enabled
    • Document changes

    Allowed options:

    • Disable UTF-8 mode: -X utf8=0 or PYTHONUTF8=0
    • Enable UTF-8 mode: -X utf8=1 or PYTHONUTF8=1
    • Enable UTF-8 Strict mode: -X utf8=strict or PYTHONUTF8=strict
    • Other -X utf8 and PYTHONUTF8 values cause a fatal error
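A minimal sketch of the option values listed above, assuming the numeric modes 0, 1 and 2 that appear in the "UTF-8 mode: N" output later in this thread. `parse_utf8_option` is a hypothetical helper written for illustration, not code from the patch, and it follows this first draft rather than whatever finally shipped.

```python
def parse_utf8_option(value):
    """Map a -X utf8 / PYTHONUTF8 value to the draft's mode number.

    Hypothetical helper: 0 = disabled, 1 = enabled, 2 = strict.
    """
    if value is None:
        # A bare "-X utf8" with no "=value" enables the mode.
        return 1
    modes = {"0": 0, "1": 1, "strict": 2}
    try:
        return modes[value]
    except KeyError:
        # The draft treats any other value as a fatal startup error;
        # a plain exception stands in for that here.
        raise ValueError(f"invalid UTF-8 mode value: {value!r}") from None


print(parse_utf8_option("strict"))
```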

    Priorities (highest to lowest):

    • open() encoding and errors arguments
    • PYTHONIOENCODING
    • UTF-8 mode
    • os.device_encoding()
    • locale encoding
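The priority list above can be read as "the first source that provides a value wins". The following is only a sketch of that reading: `pick_encoding` and all of its parameters are hypothetical names, and the real logic lives in C and differs between open() and the standard streams.

```python
def pick_encoding(explicit=None, pythonioencoding=None, utf8_mode=False,
                  device_encoding=None, locale_encoding="ascii"):
    """Return the first available encoding, highest priority first."""
    candidates = (
        explicit,                         # open(..., encoding=...)
        pythonioencoding,                 # PYTHONIOENCODING
        "utf-8" if utf8_mode else None,   # UTF-8 mode
        device_encoding,                  # os.device_encoding()
        locale_encoding,                  # locale encoding (fallback)
    )
    for candidate in candidates:
        if candidate:
            return candidate
    raise LookupError("no encoding available")


# UTF-8 mode overrides the locale, but an explicit argument still wins.
print(pick_encoding(utf8_mode=True, locale_encoding="iso-8859-1"))
print(pick_encoding(explicit="cp1252", utf8_mode=True))
```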

    TODO:

    • re-encode sys.argv from the locale encoding to UTF-8 in Py_Main() when the UTF-8 mode is enabled
    • support strict mode in Py_DecodeLocale() and Py_EncodeLocale()

    @vstinner
    Member Author

    Examples with pep540_cli.py.

    Python 3.5:

    $ python3 pep540_cli.py 
    sys.argv: ['pep540_cli.py']
    stdin: UTF-8/strict
    stdout: UTF-8/strict
    stderr: UTF-8/backslashreplace
    open(): UTF-8/strict
    
    $ LC_ALL=C python3 pep540_cli.py 
    sys.argv: ['pep540_cli.py']
    stdin: ANSI_X3.4-1968/surrogateescape
    stdout: ANSI_X3.4-1968/surrogateescape
    stderr: ANSI_X3.4-1968/backslashreplace
    open(): ANSI_X3.4-1968/strict

    Patched Python 3.7:

    $ ./python pep540_cli.py 
    UTF-8 mode: 0
    sys.argv: ['pep540_cli.py']
    stdin: UTF-8/strict
    stdout: UTF-8/strict
    stderr: UTF-8/backslashreplace
    open(): UTF-8/strict
    
    $ LC_ALL=C ./python pep540_cli.py 
    UTF-8 mode: 1
    sys.argv: ['pep540_cli.py']
    stdin: utf-8/surrogateescape
    stdout: utf-8/surrogateescape
    stderr: utf-8/backslashreplace
    open(): utf-8/surrogateescape
    
    $ ./python -X utf8 pep540_cli.py 
    UTF-8 mode: 1
    sys.argv: ['pep540_cli.py']
    stdin: utf-8/surrogateescape
    stdout: utf-8/surrogateescape
    stderr: utf-8/backslashreplace
    open(): utf-8/surrogateescape
    
    $ ./python -X utf8=strict pep540_cli.py 
    UTF-8 mode: 2
    sys.argv: ['pep540_cli.py']
    stdin: utf-8/strict
    stdout: utf-8/strict
    stderr: utf-8/backslashreplace
    open(): utf-8/strict
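For readers without the attachment, output of this shape can be produced with a few lines of Python. This is a hypothetical reconstruction, not the attached pep540_cli.py; it assumes Python 3.7 or later, where the flag is exposed as `sys.flags.utf8_mode`, and falls back to None on older interpreters.

```python
import sys
import tempfile


def describe(stream):
    """Return "encoding/errors" for a text stream (None if unavailable)."""
    encoding = getattr(stream, "encoding", None)
    errors = getattr(stream, "errors", None)
    return f"{encoding}/{errors}"


def report():
    """Collect the same kind of lines as shown in the examples above."""
    lines = [
        f"UTF-8 mode: {getattr(sys.flags, 'utf8_mode', None)}",
        f"sys.argv: {sys.argv!r}",
    ]
    for name in ("stdin", "stdout", "stderr"):
        lines.append(f"{name}: {describe(getattr(sys, name))}")
    # Open a throwaway text file to see the defaults chosen by open().
    with tempfile.TemporaryFile("w+") as fp:
        lines.append(f"open(): {describe(fp)}")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```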

    @vstinner
    Member Author

    PEP-540-2.patch: Patch version 2, updated to the latest version of the PEP-540. It has no more FIXME/TODO and has more unit tests. The main change is that the strict mode doesn't use strict anymore for OS data, but keeps surrogateescape. See the PEP for the rationale (especially the "Use the strict error handler for operating system data" alternative).

    @vstinner
    Member Author

    Oops, I introduced an obvious bug in my latest refactoring. It's now fixed in the patch version 3: PEP-540-3.patch.

    @vstinner
    Member Author

    Hum, PEP-540-3.patch doesn't work if the locale encoding is different than ASCII and UTF-8. argv must be reencoded:

    $ LC_ALL=fr_FR ./python -X utf8 -c 'import sys; print(ascii(sys.argv))' $(echo -ne "\xff")
    ['-c', '\xff']

    The result should not depend on the locale; it should be the same as:

    $ LC_ALL=fr_FR.utf8 ./python -X utf8 -c 'import sys; print(ascii(sys.argv))' $(echo -ne "\xff")
    ['-c', '\udcff']
    
    $ LC_ALL=C ./python -X utf8 -c 'import sys; print(ascii(sys.argv))' $(echo -ne "\xff")
    ['-c', '\udcff']
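The two results above can be reproduced in pure Python by decoding the same byte with two codecs. This sketch assumes the fr_FR locale uses ISO-8859-1, which is common but not guaranteed.

```python
raw = b"\xff"

# UTF-8 cannot decode a lone 0xFF, so surrogateescape maps it to U+DCFF.
as_utf8 = raw.decode("utf-8", "surrogateescape")

# ISO-8859-1 (assumed encoding of the fr_FR locale) maps 0xFF to U+00FF.
as_latin1 = raw.decode("iso-8859-1")

print(ascii(as_utf8))    # '\udcff'
print(ascii(as_latin1))  # '\xff'

# The escaped form encodes back to the original byte.
assert as_utf8.encode("utf-8", "surrogateescape") == raw
```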

    @vstinner
    Member Author

    I only tested the PEP-540 implementation on Linux.

    The PEP and its implementation should be adjusted for Windows, especially Windows-only env vars like PYTHONLEGACYWINDOWSFSENCODING.

    Changes may also be needed for Mac OS X and Android, which always use UTF-8. Currently, the locale encoding is still used on these platforms (e.g. by open()). Is it possible to have a locale encoding other than UTF-8 on Android, for example?

    @methane
    Member

    methane commented Jan 11, 2017

    Hum, PEP-540-3.patch doesn't work if the locale encoding is different than ASCII and UTF-8. argv must be reencoded:

    I want to skip reencoding.
    In UTF-8 mode, arbitrary bytes on the command line (e.g. a broken filename passed by xargs) should be able to round-trip through UTF-8/surrogateescape.

    I don't trust wcstombs/mbstowcs. It may not guarantee round tripping of arbitrary bytes.

    Can -X utf8 option be processed before Py_Main()?
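The round-trip property mentioned above holds for the Python codec itself, whatever mbstowcs/wcstombs do. A small check, with `roundtrips` as a hypothetical helper name:

```python
def roundtrips(data):
    """True if bytes survive decode + encode with UTF-8/surrogateescape."""
    text = data.decode("utf-8", "surrogateescape")
    return text.encode("utf-8", "surrogateescape") == data


broken_name = b"report-\xff\xfe.txt"   # not valid UTF-8
every_byte = bytes(range(256))         # all possible byte values

print(roundtrips(broken_name), roundtrips(every_byte))

# The strict error handler rejects the same input outright.
try:
    broken_name.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict fails:", exc.reason)
```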

    @vstinner
    Member Author

    Can -X utf8 option be processed before Py_Main()?

    I'm trying to implement that, but it's hard to factor out the common code. I will probably have to duplicate the code handling -E, -X utf8, PYTHONMALLOC and PYTHONUTF8 for wchar_t* (UCS4 or UTF-16) and char* (bytes).

    @vstinner
    Member Author

    Hum, test_utf8mode lacks a unit test for the -E command line option:
    PYTHONUTF8 should be ignored if -E is used.
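One way to check this from the outside, sketched with subprocess and assuming Python 3.7 or later (where `sys.flags.utf8_mode` exists). `child_utf8_mode` is a hypothetical helper, not the test added to test_utf8mode. The sketch compares two child runs instead of expecting a fixed value, because the default mode depends on the interpreter version and the locale.

```python
import os
import subprocess
import sys


def child_utf8_mode(*args, **env_overrides):
    """Run a child interpreter and return its sys.flags.utf8_mode as text."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)      # start from a known state
    env.update(env_overrides)
    cmd = [sys.executable, *args, "-c",
           "import sys; print(sys.flags.utf8_mode)"]
    return subprocess.check_output(cmd, env=env,
                                   universal_newlines=True).strip()


baseline = child_utf8_mode("-E")                  # variable unset
with_env = child_utf8_mode("-E", PYTHONUTF8="1")  # variable set, but -E given

# With -E the environment variable must make no difference.
print(baseline, with_env)
```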

    @vstinner
    Member Author

    Patch version 4:

    • Handle PYTHONLEGACYWINDOWSFSENCODING: this env var now disables the UTF-8 mode and takes priority over -X utf8 and PYTHONUTF8
    • Add a unit test for the PYTHONUTF8 env var and the -E cmdline option
    • Add a unit test for the POSIX locale
    • Fix initstdio() to correctly handle an empty PYTHONIOENCODING: this bug affects Python 3.6 as well and is not directly related to the PEP-540
    • Correctly handle PYTHONUTF8 set to an empty string (ignore it)
    • Skip a unit test in test_utf8mode which failed with the POSIX locale

    Note: This patch still has the sys.argv encoding bug with locale encodings different than ASCII and UTF-8.

    @vstinner
    Member Author

    encodings.py: enhanced version of pep540_cli.py that also reports the locale and filesystem encodings. It is a script to test the implementation of the PEP-540 (and PEP-538).

    @methane
    Member

    methane commented Jan 13, 2017

    How about having locale.getpreferredencoding() return 'utf-8' in UTF-8 mode?
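As released in Python 3.7 and later, the documentation describes exactly this behaviour: in UTF-8 Mode, locale.getpreferredencoding() reports UTF-8. A quick way to observe it, assuming such an interpreter; the spelling of the name ('UTF-8' or 'utf-8') varies between versions, so the sketch normalises it.

```python
import subprocess
import sys

code = "import locale; print(locale.getpreferredencoding(False))"

# Force UTF-8 Mode in a child interpreter, whatever the current locale is.
reported = subprocess.check_output(
    [sys.executable, "-X", "utf8", "-c", code],
    universal_newlines=True,
).strip()

# Normalise case and the optional hyphen before comparing.
normalised = reported.lower().replace("-", "")
print(reported, normalised == "utf8")
```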

    @vstinner
    Member Author

    Oh, I just noticed that os.environ uses the hardcoded error handler "surrogateescape": it should be replaced with sys.getfilesystemencodeerrors() to support UTF-8 Strict mode.
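A sketch of the suggested change, in pure Python: ask the interpreter for the error handler instead of hardcoding "surrogateescape". The two helper names are hypothetical; os.environ uses its own internal encode and decode functions. `sys.getfilesystemencodeerrors()` exists since Python 3.6.

```python
import sys


def encode_environ_value(value):
    """Encode an environment string with the filesystem codec settings."""
    return value.encode(sys.getfilesystemencoding(),
                        sys.getfilesystemencodeerrors())


def decode_environ_value(raw):
    """Decode environment bytes with the same encoding and error handler."""
    return raw.decode(sys.getfilesystemencoding(),
                      sys.getfilesystemencodeerrors())


print(sys.getfilesystemencoding(), sys.getfilesystemencodeerrors())
print(decode_environ_value(encode_environ_value("/tmp/example")))
```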

    @eryksun
    Contributor

    eryksun commented Jan 14, 2017

    it should be replaced with sys.getfilesystemencodeerrors()
    to support UTF-8 Strict mode.

    I did that in the patch for bpo-28188. The focus of the patch is to add bytes support on Windows for os.putenv and os.environb, but I also tried to maximize consistency (at least parallel structure) between the POSIX and Windows implementations.

    @vstinner vstinner changed the title Implementation of the PEP 540: Add a new UTF-8 mode [WIP] Implementation of the PEP 540: Add a new UTF-8 mode Jun 28, 2017
    @vstinner vstinner changed the title [WIP] Implementation of the PEP 540: Add a new UTF-8 mode PEP 540: Add a new UTF-8 mode Dec 5, 2017
    @vstinner
    Member Author

    vstinner commented Dec 5, 2017

    I rebased my PR on master.

    @vstinner
    Member Author

    vstinner commented Dec 5, 2017

    I removed old patches in favor of the now up to date PR 855.

    @vstinner
    Member Author

    The PEP-538 has two open issues: bpo-30672 and bpo-32238.

    I recently refactored the Py_Main() code so it should be simpler to implement the PEP-540: see bpo-32030.

    @vstinner
    Member Author

    Oh, the PYTHONCOERCECLOCALE env var is read very early in main() by _Py_CoerceLegacyLocale(), which ignores the -E command line option:

     * Ignoring -E and -I is safe from a security perspective, as we only use
     * the setting to turn *off* the implicit locale coercion, and anyone with
     * access to the process environment already has the ability to set
     * `LC_ALL=C` to override the C level locale settings anyway.
    

    @vstinner
    Member Author

    New changeset 91106cd by Victor Stinner in branch 'master':
    bpo-29240: PEP-540: Add a new UTF-8 Mode (#855)
    91106cd

    @vstinner
    Member Author

    New changeset d5dda98 by Victor Stinner in branch 'master':
    pymain_set_sys_argv() now copies argv (#4838)
    d5dda98

    @vstinner
    Member Author

    test_readline failed. It seems to be related to my commit:

    http://buildbot.python.org/all/#/builders/87/builds/360

    ======================================================================
    FAIL: test_nonascii (test.test_readline.TestReadline)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/usr/home/buildbot/python/3.x.koobs-freebsd10/build/Lib/test/test_readline.py", line 219, in test_nonascii
        self.assertIn(b"text 't\\xeb'\r\n", output)
    AssertionError: b"text 't\\xeb'\r\n" not found in bytearray(b"^A^B^B^B^B^B^B^B\t\tx\t\r\n[\\303\\257nserted]|t\x07\x08\x08\x08\x08\x08\x08\x08\x07\x07xrted]|t\x08\x08\x08\x08\x08\x08\x08\x07\r\nresult \'[\\xefnsexrted]|t\'\r\nhistory \'[\\xefnsexrted]|t\'\r\n")

    @vstinner
    Member Author

    New changeset d2b0231 by Victor Stinner in branch 'master':
    bpo-29240: Don't define decode_locale() on macOS (#4895)
    d2b0231

    @vstinner
    Member Author

    New changeset 9454060 by Victor Stinner in branch 'master':
    bpo-29240, bpo-32030: Py_Main() re-reads config if encoding changes (#4899)
    9454060

    @vstinner
    Member Author

    New changeset 424315f by Victor Stinner in branch 'master':
    bpo-29240: Skip test_readline.test_nonascii() (#4968)
    424315f

    @vstinner
    Member Author

    IMHO test_readline should be fixed by ignoring the UTF-8 mode in Py_EncodeLocale/Py_DecodeLocale, but only when called from the Python readline module. We may need new functions, something like: Py_EncodeCurrentLocale/Py_DecodeCurrentLocale.

    I will work on a patch when I am back from holiday. In the meantime, I skipped the test to repair the FreeBSD 3.x buildbots.

    @vstinner
    Member Author

    New changeset 2cba6b8 by Victor Stinner in branch 'master':
    bpo-29240: readline now ignores the UTF-8 Mode (#5145)
    2cba6b8

    @vstinner
    Member Author

    New changeset cb3ae55 by Victor Stinner in branch 'master':
    bpo-29240: Ignore UTF-8 Mode in time module (#5148)
    cb3ae55

    @vstinner
    Member Author

    Attached test_all_locales.py is a test suite for locale functions: os.strerror(), locale.localeconv(), time.strftime(). I tested it on Linux Fedora 27, FreeBSD 11.0 and macOS 10.13.2.

    The test should always pass on Python 2.7. On Python 3.6 and the master branch with PR 5170, 2 tests on numeric localeconv() fail because Python uses the wrong encoding: see bpo-31900. master with PR 5170 now has fewer encoding bugs than Python 3.6.
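For context, the three functions named above all return text that the C library produces in the locale encoding, which Python must decode. The sketch below only shows the calls involved; it is not the attached test_all_locales.py, and `locale_strings` is a hypothetical helper that uses whatever locale the environment selects.

```python
import errno
import locale
import os
import time


def locale_strings():
    """Collect a few locale-dependent strings decoded by Python."""
    try:
        # Use the user's locale; keep the current one if it is unavailable.
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        pass
    conv = locale.localeconv()
    return {
        "strerror": os.strerror(errno.ENOENT),
        "decimal_point": conv["decimal_point"],
        "thousands_sep": conv["thousands_sep"],
        "month": time.strftime("%B"),
    }


for name, value in locale_strings().items():
    print(f"{name}: {ascii(value)}")
```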

    @vstinner
    Member Author

    New changeset 7ed7aea by Victor Stinner in branch 'master':
    bpo-29240: Fix locale encodings in UTF-8 Mode (#5170)
    7ed7aea

    @vstinner
    Member Author

    New changeset 7ed7aea by Victor Stinner in branch 'master':
    bpo-29240: Fix locale encodings in UTF-8 Mode (#5170)

    Oh, this change broke test_nonascii() of test_readline() on FreeBSD.

    Previously, readline used the ASCII/surrogateescape encoding for the POSIX locale. Now, mbstowcs() / wcstombs() is called directly, with the surrogateescape error handler.

    @vstinner
    Member Author

    New changeset c495e79 by Victor Stinner in branch 'master':
    Skip test_readline.test_nonascii() on C locale (#5203)
    c495e79

    @vstinner
    Member Author

    New changeset c2740e8 by Victor Stinner (Miss Islington (bot)) in branch '3.6':
    Skip test_readline.test_nonascii() on C locale (GH-5203) (#5204)
    c2740e8

    @vstinner
    Member Author

    test_readline passes again on all buildbots, especially on the FreeBSD 3.6 and 3.x buildbots.

    There are no more known issues, the implementation of the PEP-540 (UTF-8 Mode) is now complete!

    @vstinner
    Member Author

    New changeset 9089a26 by Victor Stinner in branch 'master':
    bpo-29240: PyUnicode_DecodeLocale() uses UTF-8 on Android (bpo-5272)
    9089a26

    @vstinner
    Member Author

    I partially reverted the commit 7ed7aea: on Android, UTF-8 is now always used, again. Paul Peny (aka pmpp) confirmed to me that my commit broke Python on Android, at least with API 19 (locales don't work properly before API 21).

    @vstinner
    Member Author

    vstinner commented Feb 6, 2022

    New changeset 91106cd by Victor Stinner in branch 'master':
    bpo-29240: PEP-540: Add a new UTF-8 Mode (#855)
    91106cd

    Oh, this change broke the mbcs alias on Windows and the test_codecs and test_site tests (2 tests!) missed the bug :-( I fixed it in:

    New changeset 04dd60e by Victor Stinner in branch 'main':
    bpo-46659: Update the test on the mbcs codec alias (GH-31168)
    04dd60e

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022