classification
Title: PEP 540: Add a new UTF-8 mode
Type: enhancement
Stage: resolved
Components: Interpreter Core, Library (Lib), Unicode
Versions: Python 3.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: eryksun, ezio.melotti, inada.naoki, vstinner, yan12125
Priority: normal
Keywords: patch

Created on 2017-01-11 11:19 by vstinner, last changed 2018-01-22 18:09 by vstinner. This issue is now closed.

Files
File name Uploaded Description
test_all_locales.py vstinner, 2018-01-15 09:38
Pull Requests
URL Status Linked
PR 855 merged vstinner, 2017-03-27 22:03
PR 4838 merged vstinner, 2017-12-13 14:04
PR 4895 merged vstinner, 2017-12-15 21:18
PR 4899 merged vstinner, 2017-12-16 03:10
PR 4968 merged vstinner, 2017-12-21 22:51
PR 5145 merged vstinner, 2018-01-10 17:59
PR 5148 merged vstinner, 2018-01-10 22:22
PR 5170 merged vstinner, 2018-01-13 00:23
PR 4174 vstinner, 2018-01-15 11:17
PR 5203 merged vstinner, 2018-01-16 15:46
PR 5204 merged python-dev, 2018-01-16 16:34
PR 5272 merged vstinner, 2018-01-22 16:48
Messages (36)
msg285214 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-11 11:19
This issue tracks the implementation of the PEP 540.

Attached pep540_cli.py script can be used to play with it.
msg285215 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-11 11:27
pep540.patch: first draft

Changes:

* Add sys.flags.utf8mode
* Add -X utf8 command line option
* Add PYTHONUTF8 environment variable
* sys.stdin, sys.stdout and sys.stderr encoding and errors are modified in UTF-8 mode
* open() default encoding and errors are modified in UTF-8 mode
* Add Lib/test/test_utf8mode.py
* Skip a few tests relying on the locale encoding if the UTF-8 mode is enabled
* Document changes

Allowed options:

* Disable UTF-8 mode: -X utf8=0 or PYTHONUTF8=0
* Enable UTF-8 mode: -X utf8=1 or PYTHONUTF8=1
* Enable UTF-8 Strict mode: -X utf8=strict or PYTHONUTF8=strict
* Other -X utf8 and PYTHONUTF8 values cause a fatal error
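
The accepted values above can be modeled with a small parser. This is a minimal sketch of the first draft described here, not the C code from the patch: `parse_utf8_option` is a hypothetical helper name, the numeric codes 0/1/2 follow the "UTF-8 mode" values printed by pep540_cli.py later in this issue, and a Python `ValueError` stands in for the interpreter's fatal error.

```python
def parse_utf8_option(value):
    """Map a -X utf8 / PYTHONUTF8 value to a UTF-8 mode number.

    Hypothetical helper, for illustration only.
    0 = disabled, 1 = enabled, 2 = strict.
    """
    modes = {"0": 0, "1": 1, "strict": 2}
    if value is None:
        # A bare "-X utf8" (no "=value") enables the UTF-8 mode.
        return 1
    try:
        return modes[value]
    except KeyError:
        # The patched interpreter aborts with a fatal error at this point.
        raise ValueError(f"invalid UTF-8 mode value: {value!r}") from None


for raw in (None, "0", "1", "strict"):
    print(repr(raw), "->", parse_utf8_option(raw))
```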

Priorities (highest to lowest):

* open() encoding and errors arguments
* PYTHONIOENCODING
* UTF-8 mode
* os.device_encoding()
* locale encoding
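
Read as a single fallback chain, the list above picks the first source that provides an encoding. The sketch below is a simplified model under that assumption: `pick_encoding` and its parameters are hypothetical names, and it ignores the fact that an explicit argument only applies to open() while PYTHONIOENCODING only applies to the standard streams.

```python
def pick_encoding(explicit=None, pythonioencoding=None, utf8_mode=False,
                  device_encoding=None, locale_encoding="ANSI_X3.4-1968"):
    """Return the first encoding source that applies, highest priority first.

    Hypothetical helper: a simplified model of the priority list, not
    CPython's actual start-up logic.
    """
    candidates = (
        explicit,                        # open(..., encoding=...)
        pythonioencoding,                # PYTHONIOENCODING value
        "utf-8" if utf8_mode else None,  # UTF-8 mode
        device_encoding,                 # os.device_encoding(fd) result
        locale_encoding,                 # locale encoding (last resort)
    )
    for candidate in candidates:
        if candidate:
            return candidate
    raise LookupError("no encoding source available")


print(pick_encoding())
print(pick_encoding(utf8_mode=True))
print(pick_encoding(pythonioencoding="latin-1", utf8_mode=True))
```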

TODO:

* re-encode sys.argv from the local encoding to UTF-8 in Py_Main() when the UTF-8 mode is enabled
* support strict mode in Py_DecodeLocale() and Py_EncodeLocale()
msg285216 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-11 11:32
Examples with pep540_cli.py.

Python 3.5:

$ python3 pep540_cli.py 
sys.argv: ['pep540_cli.py']
stdin: UTF-8/strict
stdout: UTF-8/strict
stderr: UTF-8/backslashreplace
open(): UTF-8/strict

$ LC_ALL=C python3 pep540_cli.py 
sys.argv: ['pep540_cli.py']
stdin: ANSI_X3.4-1968/surrogateescape
stdout: ANSI_X3.4-1968/surrogateescape
stderr: ANSI_X3.4-1968/backslashreplace
open(): ANSI_X3.4-1968/strict


Patched Python 3.7:


$ ./python pep540_cli.py 
UTF-8 mode: 0
sys.argv: ['pep540_cli.py']
stdin: UTF-8/strict
stdout: UTF-8/strict
stderr: UTF-8/backslashreplace
open(): UTF-8/strict

$ LC_ALL=C ./python pep540_cli.py 
UTF-8 mode: 1
sys.argv: ['pep540_cli.py']
stdin: utf-8/surrogateescape
stdout: utf-8/surrogateescape
stderr: utf-8/backslashreplace
open(): utf-8/surrogateescape

$ ./python -X utf8 pep540_cli.py 
UTF-8 mode: 1
sys.argv: ['pep540_cli.py']
stdin: utf-8/surrogateescape
stdout: utf-8/surrogateescape
stderr: utf-8/backslashreplace
open(): utf-8/surrogateescape

$ ./python -X utf8=strict pep540_cli.py 
UTF-8 mode: 2
sys.argv: ['pep540_cli.py']
stdin: utf-8/strict
stdout: utf-8/strict
stderr: utf-8/backslashreplace
open(): utf-8/strict
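
The pep540_cli.py script itself is not reproduced in this issue. A report in the same shape can be produced with the sketch below, which is an assumption about what such a script might look like and not the attached file. It reads `sys.flags.utf8_mode`, the attribute name used by released Python 3.7 and later (the first draft above calls it `sys.flags.utf8mode`), and falls back to None on older versions.

```python
import os
import sys


def describe(stream):
    """Format a text stream as "encoding/errors", like the output above."""
    encoding = getattr(stream, "encoding", None)
    errors = getattr(stream, "errors", None)
    return f"{encoding}/{errors}"


def encoding_report():
    """Collect the UTF-8 mode flag and the stream encodings of this process."""
    report = {
        # sys.flags.utf8_mode exists in Python 3.7+; None on older versions.
        "UTF-8 mode": getattr(sys.flags, "utf8_mode", None),
        "sys.argv": sys.argv,
    }
    for name in ("stdin", "stdout", "stderr"):
        report[name] = describe(getattr(sys, name))
    # Open a file without an explicit encoding to observe open() defaults.
    with open(os.devnull) as fp:
        report["open()"] = describe(fp)
    return report


if __name__ == "__main__":
    for key, value in encoding_report().items():
        print(f"{key}: {value}")
```

Running it under different settings (for example with LC_ALL=C, or with -X utf8) shows how the reported encodings and error handlers change.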
msg285275 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-11 22:04
pep540-2.patch: Patch version 2, updated to the latest version of the PEP 540. It has no more FIXME/TODO and has more unit tests. The main change is that the strict mode doesn't use strict anymore for OS data, but keeps surrogateescape. See the PEP for the rationale (especially the "Use the strict error handler for operating system data" alternative).
msg285276 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-11 22:13
Oops, I introduced an obvious bug in my latest refactoring. It's now fixed in the patch version 3: pep540-3.patch.
msg285277 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-11 23:00
Hum, pep540-3.patch doesn't work if the locale encoding is different than ASCII and UTF-8. argv must be reencoded:

$ LC_ALL=fr_FR ./python -X utf8 -c 'import sys; print(ascii(sys.argv))' $(echo -ne "\xff")
['-c', '\xff']

The result should not depend on the locale; it should be the same as:

$ LC_ALL=fr_FR.utf8 ./python -X utf8 -c 'import sys; print(ascii(sys.argv))' $(echo -ne "\xff")
['-c', '\udcff']

$ LC_ALL=C ./python -X utf8 -c 'import sys; print(ascii(sys.argv))' $(echo -ne "\xff")
['-c', '\udcff']
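
The difference between the three results can be reproduced in pure Python by decoding the same byte with each codec. This sketch assumes that the fr_FR locale uses ISO-8859-1, which is common but depends on the system.

```python
raw_arg = b"\xff"  # the byte produced by: echo -ne "\xff"

# Decode the same byte with the encoding each locale would select.
decoded = {
    encoding: raw_arg.decode(encoding, "surrogateescape")
    for encoding in ("utf-8", "ascii", "iso-8859-1")
}

# UTF-8 and ASCII cannot decode 0xff, so surrogateescape maps it to the
# lone surrogate U+DCFF. ISO-8859-1 (assumed fr_FR encoding) decodes it
# as the ordinary character U+00FF.
for encoding, text in decoded.items():
    print(encoding, ascii(text))
```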
msg285278 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-11 23:01
I only tested the PEP 540 implementation on Linux.

The PEP and its implementation should be adjusted for Windows, especially for Windows-only env vars like PYTHONLEGACYWINDOWSFSENCODING.

Changes may also be needed for Mac OS X and Android, which always use UTF-8. Currently, the locale encoding is still used on these platforms (e.g. by open()). Is it possible to have a locale encoding different than UTF-8 on Android, for example?
msg285280 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-01-11 23:57
> Hum, pep540-3.patch doesn't work if the locale encoding is different than ASCII and UTF-8. argv must be reencoded:

I want to skip reencoding.
In UTF-8 mode, arbitrary bytes on the command line (e.g. a broken filename passed by xargs) should be able to round-trip through UTF-8/surrogateescape.

I don't trust wcstombs/mbstowcs. They may not guarantee round-tripping of arbitrary bytes.

Can -X utf8 option be processed before Py_Main()?
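
The round-trip property mentioned above can be checked directly with the codec. This is a small sketch using made-up sample file names; `roundtrip` is a helper defined here for illustration.

```python
def roundtrip(data):
    """Decode bytes as UTF-8/surrogateescape, then encode them back."""
    text = data.decode("utf-8", "surrogateescape")
    return text.encode("utf-8", "surrogateescape")


samples = [
    b"plain.txt",
    "caf\u00e9.txt".encode("utf-8"),  # valid UTF-8
    b"caf\xe9.txt",                   # Latin-1 bytes, invalid as UTF-8
    bytes(range(1, 256)),             # every non-NUL byte value
]
for data in samples:
    assert roundtrip(data) == data
print("all samples round-trip unchanged")
```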
msg285296 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-12 09:18
> Can -X utf8 option be processed before Py_Main()?

I'm trying to implement that, but it's hard to factorize the code. I will probably have to duplicate the code handling -E, -X utf8, PYTHONMALLOC and PYTHONUTF8 for wchar_t* (UCS4 or UTF-16) and char* (bytes).
msg285298 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-12 10:32
Hum, test_utf8mode lacks a unit test for the -E command line option:
PYTHONUTF8 should be ignored if -E is used.
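
A check along these lines can be sketched with child interpreters. This is not the test added to test_utf8mode; it assumes Python 3.7 or later (for `sys.flags.utf8_mode`), and `utf8_mode` is a hypothetical helper. The baseline is measured instead of assumed to be 0, because the mode can also be enabled by the locale.

```python
import os
import subprocess
import sys

CODE = "import sys; print(sys.flags.utf8_mode)"


def utf8_mode(*options, pythonutf8=None):
    """Run a child interpreter and return its sys.flags.utf8_mode as text."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)
    if pythonutf8 is not None:
        env["PYTHONUTF8"] = pythonutf8
    cmd = [sys.executable, *options, "-c", CODE]
    result = subprocess.run(cmd, env=env, stdout=subprocess.PIPE,
                            universal_newlines=True, check=True)
    return result.stdout.strip()


baseline = utf8_mode()                    # PYTHONUTF8 not set at all
with_env = utf8_mode(pythonutf8="1")      # PYTHONUTF8=1 is honoured
with_e = utf8_mode("-E", pythonutf8="1")  # -E: PYTHONUTF8=1 is ignored
print(baseline, with_env, with_e)
```

With -E, the child should report the same value as the baseline run, whatever the PYTHONUTF8 variable says.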
msg285325 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-12 13:31
Patch version 4:

* Handle PYTHONLEGACYWINDOWSFSENCODING: this env var now disables the UTF-8 mode and has the priority over -X utf8 and PYTHONUTF8
* Add a unit test on the PYTHONUTF8 env var and the -E cmdline option
* Add a unit test on the POSIX locale
* Fix initstdio() to correctly handle an empty PYTHONIOENCODING: this bug affects Python 3.6 as well and is not directly related to the PEP 540
* Fix to correctly handle PYTHONUTF8 set to an empty string (ignore it)
* Skip a unit test in test_utf8mode which failed with the POSIX locale

Note: This patch still has the sys.argv encoding bug with locale encodings different than ASCII and UTF-8.
msg285332 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-12 16:45
encodings.py: enhanced version of pep540_cli.py which adds the locale and filesystem encodings. It is a script to test the implementation of the PEP 540 (and PEP 538).
msg285357 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-01-13 00:54
How about locale.getpreferredencoding() returns 'utf-8' in utf8 mode?
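
For reference, released Python versions (3.7 and later) document that locale.getpreferredencoding() returns UTF-8 when the UTF-8 mode is enabled. The sketch below checks this in a child interpreter started with -X utf8. The exact spelling of the returned name differs between versions, so it is normalized with codecs.lookup() before comparing; on Python older than 3.7 the option is unknown and the locale encoding is reported instead.

```python
import codecs
import subprocess
import sys

CODE = "import locale; print(locale.getpreferredencoding(False))"

result = subprocess.run([sys.executable, "-X", "utf8", "-c", CODE],
                        stdout=subprocess.PIPE, universal_newlines=True,
                        check=True)
reported = result.stdout.strip()
# Normalize the spelling ("UTF-8", "utf-8", "utf8", ...) before comparing.
normalized = codecs.lookup(reported).name
print(reported, "->", normalized)
```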
msg285407 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-13 15:27
Oh, I just noticed that os.environ uses the hardcoded error handler "surrogateescape": it should be replaced with sys.getfilesystemencodeerrors() to support UTF-8 Strict mode.
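
The suggested replacement amounts to asking the interpreter for its filesystem error handler instead of hardcoding one. A minimal sketch of that idea follows; `encode_env_value` and `decode_env_value` are hypothetical names, not the functions used by the os module.

```python
import sys


def encode_env_value(value):
    """Encode a str with the filesystem encoding and its error handler."""
    return value.encode(sys.getfilesystemencoding(),
                        sys.getfilesystemencodeerrors())


def decode_env_value(data):
    """Inverse of encode_env_value()."""
    return data.decode(sys.getfilesystemencoding(),
                       sys.getfilesystemencodeerrors())


print(sys.getfilesystemencoding(), sys.getfilesystemencodeerrors())
print(encode_env_value("PATH"))
```

sys.getfilesystemencodeerrors() is available since Python 3.6, so the error handler follows whatever the interpreter selected at startup.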
msg285482 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2017-01-14 14:05
> it should be replaced with sys.getfilesystemencodeerrors() 
> to support UTF-8 Strict mode.

I did that in the patch for issue 28188. The focus of the patch is to add bytes support on Windows for os.putenv and os.environb, but I also tried to maximize consistency (at least parallel structure) between the POSIX and Windows implementations.
msg307694 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-05 22:12
I rebased my PR on master.
msg307695 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-05 22:12
I removed old patches in favor of the now up to date PR 855.
msg308182 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-13 01:21
The PEP 538 has two open issues: bpo-30672 and bpo-32238.

I recently refactored the Py_Main() code so it should be simpler to implement the PEP 540: see bpo-32030.
msg308183 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-13 01:25
Oh, the PYTHONCOERCECLOCALE env var is read very early in main() by _Py_CoerceLegacyLocale(), which ignores the -E command line option. The comment in the code explains why:

     * Ignoring -E and -I is safe from a security perspective, as we only use
     * the setting to turn *off* the implicit locale coercion, and anyone with
     * access to the process environment already has the ability to set
     * `LC_ALL=C` to override the C level locale settings anyway.
msg308198 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-13 11:29
New changeset 91106cd9ff2f321c0f60fbaa09fd46c80aa5c266 by Victor Stinner in branch 'master':
bpo-29240: PEP 540: Add a new UTF-8 Mode (#855)
https://github.com/python/cpython/commit/91106cd9ff2f321c0f60fbaa09fd46c80aa5c266
msg308213 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-13 16:31
New changeset d5dda98fa80405db82e2eb36ac48671b4c8c0983 by Victor Stinner in branch 'master':
pymain_set_sys_argv() now copies argv (#4838)
https://github.com/python/cpython/commit/d5dda98fa80405db82e2eb36ac48671b4c8c0983
msg308217 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-13 16:46
test_readline failed. It seems to be related to my commit:

http://buildbot.python.org/all/#/builders/87/builds/360

======================================================================
FAIL: test_nonascii (test.test_readline.TestReadline)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/home/buildbot/python/3.x.koobs-freebsd10/build/Lib/test/test_readline.py", line 219, in test_nonascii
    self.assertIn(b"text 't\\xeb'\r\n", output)
AssertionError: b"text 't\\xeb'\r\n" not found in bytearray(b"^A^B^B^B^B^B^B^B\t\tx\t\r\n[\\303\\257nserted]|t\x07\x08\x08\x08\x08\x08\x08\x08\x07\x07xrted]|t\x08\x08\x08\x08\x08\x08\x08\x07\r\nresult \'[\\xefnsexrted]|t\'\r\nhistory \'[\\xefnsexrted]|t\'\r\n")
msg308430 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-15 22:06
New changeset d2b02310acbfe6c978a8ad3cd3ac8b3f12927442 by Victor Stinner in branch 'master':
bpo-29240: Don't define decode_locale() on macOS (#4895)
https://github.com/python/cpython/commit/d2b02310acbfe6c978a8ad3cd3ac8b3f12927442
msg308448 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-16 03:54
New changeset 9454060e84a669dde63824d9e2fcaf295e34f687 by Victor Stinner in branch 'master':
bpo-29240, bpo-32030: Py_Main() re-reads config if encoding changes (#4899)
https://github.com/python/cpython/commit/9454060e84a669dde63824d9e2fcaf295e34f687
msg308915 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-21 23:09
New changeset 424315fa865b43f67e36a40647107379adf031da by Victor Stinner in branch 'master':
bpo-29240: Skip test_readline.test_nonascii() (#4968)
https://github.com/python/cpython/commit/424315fa865b43f67e36a40647107379adf031da
msg308916 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-21 23:11
IMHO test_readline should be fixed by ignoring the UTF-8 mode in Py_EncodeLocale/Py_DecodeLocale, but only when called from the Python readline module. We may need new functions, something like: Py_EncodeCurrentLocale/Py_DecodeCurrentLocale.

I will work on a patch when I am back from holiday. In the meantime, I skipped the test to repair the FreeBSD 3.x buildbots.
msg309782 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-10 21:46
New changeset 2cba6b85797ba60d67389126f184aad5c9e02ff3 by Victor Stinner in branch 'master':
bpo-29240: readline now ignores the UTF-8 Mode (#5145)
https://github.com/python/cpython/commit/2cba6b85797ba60d67389126f184aad5c9e02ff3
msg309798 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-11 09:38
New changeset cb3ae5588bd7733e76dc09277bb7626652d9bb64 by Victor Stinner in branch 'master':
bpo-29240: Ignore UTF-8 Mode in time module (#5148)
https://github.com/python/cpython/commit/cb3ae5588bd7733e76dc09277bb7626652d9bb64
msg309958 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-15 09:38
Attached test_all_locales.py is a test suite for locale functions: os.strerror(), locale.localeconv(), time.strftime(). I tested it on Linux Fedora 27, FreeBSD 11.0 and macOS 10.13.2.

The test should always pass on Python 2.7. On Python 3.6 and the master branch with PR 5170, 2 tests on numeric localeconv() fail because Python uses the wrong encoding: see bpo-31900. master with PR 5170 now has fewer encoding bugs than Python 3.6.
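
The attached script is not reproduced here. As a rough idea of what exercising these three functions under a given locale involves, here is a sketch that is not test_all_locales.py: `probe_locale` is a hypothetical helper, and the locale names in the loop are assumptions that may not be installed on a given system.

```python
import errno
import locale
import os
import time


def probe_locale(name):
    """Call the three locale-dependent functions under locale *name*.

    Returns None when the locale is not available on this system.
    """
    saved = locale.setlocale(locale.LC_ALL)
    try:
        try:
            locale.setlocale(locale.LC_ALL, name)
        except locale.Error:
            return None
        return {
            "strerror": os.strerror(errno.ENOENT),
            "thousands_sep": locale.localeconv()["thousands_sep"],
            "month": time.strftime("%B", time.gmtime(0)),
        }
    finally:
        # Always restore the locale that was active before the probe.
        locale.setlocale(locale.LC_ALL, saved)


for name in ("C", "fr_FR.UTF-8", "fr_FR.ISO8859-1"):
    print(name, probe_locale(name))
```

Each returned value is a str, so a wrongly chosen encoding shows up as mojibake or as a decoding error in the result.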
msg309959 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-15 09:45
New changeset 7ed7aead9503102d2ed316175f198104e0cd674c by Victor Stinner in branch 'master':
bpo-29240: Fix locale encodings in UTF-8 Mode (#5170)
https://github.com/python/cpython/commit/7ed7aead9503102d2ed316175f198104e0cd674c
msg310029 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-16 00:08
> New changeset 7ed7aead9503102d2ed316175f198104e0cd674c by Victor Stinner in branch 'master':
> bpo-29240: Fix locale encodings in UTF-8 Mode (#5170)

Oh, this change broke test_nonascii() of test_readline on FreeBSD.

Previously, readline used the ASCII/surrogateescape encoding for the POSIX locale. Now, mbstowcs() / wcstombs() is called directly, with the surrogateescape error handler.
msg310092 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-16 16:34
New changeset c495e799ed376af91ae2ddf6c4bcc592490fe294 by Victor Stinner in branch 'master':
Skip test_readline.test_nonascii() on C locale (#5203)
https://github.com/python/cpython/commit/c495e799ed376af91ae2ddf6c4bcc592490fe294
msg310097 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-16 17:27
New changeset c2740e8a263e76427a8102a89f4b491a3089b2a1 by Victor Stinner (Miss Islington (bot)) in branch '3.6':
Skip test_readline.test_nonascii() on C locale (GH-5203) (#5204)
https://github.com/python/cpython/commit/c2740e8a263e76427a8102a89f4b491a3089b2a1
msg310177 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-17 14:28
test_readline passes again on all buildbots, especially on the FreeBSD 3.6 and 3.x buildbots.

There are no more known issues, the implementation of the PEP 540 (UTF-8 Mode) is now complete!
msg310443 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-22 18:07
New changeset 9089a265918754d95e105a7c4c409ac9352c87bb by Victor Stinner in branch 'master':
bpo-29240: PyUnicode_DecodeLocale() uses UTF-8 on Android (#5272)
https://github.com/python/cpython/commit/9089a265918754d95e105a7c4c409ac9352c87bb
msg310444 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-22 18:09
I partially reverted the commit 7ed7aead9503102d2ed316175f198104e0cd674c: on Android, UTF-8 is now always used, again. Paul Peny (aka pmpp) confirmed to me that my commit broke Python on Android, at least with API 19 (locales don't work properly before API 21).
History
Date User Action Args
2018-01-22 18:09:52 vstinner set messages: + msg310444
2018-01-22 18:07:35 vstinner set messages: + msg310443
2018-01-22 16:48:36 vstinner set pull_requests: + pull_request5116
2018-01-17 14:28:41 vstinner set status: open -> closed
    resolution: fixed
    messages: + msg310177
    stage: patch review -> resolved
2018-01-16 17:27:36 vstinner set messages: + msg310097
2018-01-16 16:34:45 python-dev set pull_requests: + pull_request5058
2018-01-16 16:34:37 vstinner set messages: + msg310092
2018-01-16 15:46:05 vstinner set pull_requests: + pull_request5057
2018-01-16 00:08:03 vstinner set messages: + msg310029
2018-01-15 11:17:23 vstinner set pull_requests: + pull_request5043
2018-01-15 09:45:56 vstinner set messages: + msg309959
2018-01-15 09:38:41 vstinner set files: + test_all_locales.py
    messages: + msg309958
2018-01-13 00:23:31 vstinner set pull_requests: + pull_request5024
2018-01-11 09:38:07 vstinner set messages: + msg309798
2018-01-10 22:22:15 vstinner set pull_requests: + pull_request5005
2018-01-10 21:46:18 vstinner set messages: + msg309782
2018-01-10 17:59:22 vstinner set pull_requests: + pull_request5003
2017-12-21 23:11:01 vstinner set messages: + msg308916
2017-12-21 23:09:28 vstinner set messages: + msg308915
2017-12-21 22:51:06 vstinner set pull_requests: + pull_request4860
2017-12-16 03:54:25 vstinner set messages: + msg308448
2017-12-16 03:10:30 vstinner set pull_requests: + pull_request4793
2017-12-15 22:06:23 vstinner set messages: + msg308430
2017-12-15 21:18:45 vstinner set pull_requests: + pull_request4787
2017-12-13 16:46:03 vstinner set messages: + msg308217
2017-12-13 16:31:18 vstinner set messages: + msg308213
2017-12-13 14:04:28 vstinner set stage: patch review
    pull_requests: + pull_request4727
2017-12-13 11:29:11 vstinner set messages: + msg308198
2017-12-13 01:25:01 vstinner set messages: + msg308183
2017-12-13 01:21:47 vstinner set messages: + msg308182
2017-12-05 22:12:54 vstinner set messages: + msg307695
2017-12-05 22:12:31 vstinner set files: - encodings.py
2017-12-05 22:12:14 vstinner set files: - pep540_cli.py
2017-12-05 22:12:14 vstinner set files: - pep540.patch
2017-12-05 22:12:13 vstinner set files: - pep540-2.patch
2017-12-05 22:12:12 vstinner set files: - pep540-3.patch
2017-12-05 22:12:11 vstinner set files: - pep540-4.patch
2017-12-05 22:12:00 vstinner set messages: + msg307694
2017-12-05 22:11:45 vstinner set title: [WIP] Implementation of the PEP 540: Add a new UTF-8 mode -> PEP 540: Add a new UTF-8 mode
2017-06-28 01:00:39 vstinner set title: Implementation of the PEP 540: Add a new UTF-8 mode -> [WIP] Implementation of the PEP 540: Add a new UTF-8 mode
2017-03-27 22:03:35 vstinner set pull_requests: + pull_request757
2017-03-27 22:03:20 vstinner set pull_requests: - pull_request15
2017-01-14 14:05:46 eryksun set nosy: + eryksun
    messages: + msg285482
2017-01-13 15:27:23 vstinner set messages: + msg285407
2017-01-13 00:54:08 inada.naoki set messages: + msg285357
2017-01-12 16:45:20 vstinner set files: + encodings.py
    messages: + msg285332
2017-01-12 13:31:42 vstinner set files: + pep540-4.patch
    messages: + msg285325
2017-01-12 10:32:24 vstinner set messages: + msg285298
2017-01-12 10:19:41 yan12125 set nosy: + yan12125
2017-01-12 09:18:36 vstinner set messages: + msg285296
2017-01-11 23:57:12 inada.naoki set messages: + msg285280
2017-01-11 23:01:39 vstinner set messages: + msg285278
2017-01-11 23:00:07 vstinner set messages: + msg285277
2017-01-11 22:13:06 vstinner set files: + pep540-3.patch
    messages: + msg285276
2017-01-11 22:04:22 vstinner set files: + pep540-2.patch
    messages: + msg285275
2017-01-11 16:25:18 inada.naoki set nosy: + inada.naoki
2017-01-11 11:32:58 vstinner set messages: + msg285216
2017-01-11 11:27:22 vstinner set files: + pep540.patch
    keywords: + patch
    messages: + msg285215
2017-01-11 11:19:52 vstinner create