classification
Title: PEP 540: Add a new UTF-8 mode
Type: enhancement
Stage:
Components: Interpreter Core, Library (Lib), Unicode
Versions: Python 3.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Chi Hsuan Yen, eryksun, ezio.melotti, inada.naoki, vstinner
Priority: normal
Keywords: patch

Created on 2017-01-11 11:19 by vstinner, last changed 2017-12-05 22:12 by vstinner.

Pull Requests
URL Status Linked Edit
PR 855 open vstinner, 2017-03-27 22:03
Messages (17)
msg285214 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-11 11:19
This issue tracks the implementation of PEP 540.

The attached pep540_cli.py script can be used to experiment with it.
msg285215 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-11 11:27
pep540.patch: first draft

Changes:

* Add sys.flags.utf8mode
* Add -X utf8 command line option
* Add PYTHONUTF8 environment variable
* The encoding and error handler of sys.stdin, sys.stdout and sys.stderr are modified in UTF-8 mode
* The default encoding and error handler of open() are modified in UTF-8 mode
* Add Lib/test/test_utf8mode.py
* Skip a few tests relying on the locale encoding if the UTF-8 mode is enabled
* Document changes

Allowed options:

* Disable UTF-8 mode: -X utf8=0 or PYTHONUTF8=0
* Enable UTF-8 mode: -X utf8=1 or PYTHONUTF8=1
* Enable UTF-8 Strict mode: -X utf8=strict or PYTHONUTF8=strict
* Other -X utf8 and PYTHONUTF8 values cause a fatal error
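
One way to try the options above is to start a child interpreter and read the flag back. This is only a sketch: the helper name child_utf8_mode is made up for illustration, and the flag is looked up under both sys.flags.utf8mode (the name used by this patch) and sys.flags.utf8_mode (the name in released Python versions), since which one exists depends on the interpreter being tested. The strict value is left out because an interpreter without this patch may reject it.

```python
import subprocess
import sys

# Snippet run by the child: try the attribute name from this draft patch
# (utf8mode) first, then fall back to utf8_mode (released Python versions).
CHILD_CODE = (
    "import sys; f = sys.flags; "
    "print(getattr(f, 'utf8mode', getattr(f, 'utf8_mode', None)))"
)


def child_utf8_mode(*interpreter_args, env=None):
    """Return the UTF-8 mode flag reported by a child interpreter, as text.

    Hypothetical helper, not part of the patch.
    """
    cmd = [sys.executable, *interpreter_args, "-c", CHILD_CODE]
    result = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, env=env, check=True
    )
    return result.stdout.decode("ascii").strip()


if __name__ == "__main__":
    print("-X utf8=0 ->", child_utf8_mode("-X", "utf8=0"))
    print("-X utf8=1 ->", child_utf8_mode("-X", "utf8=1"))
```

The same helper accepts an env mapping, so PYTHONUTF8 can be tried by passing a copy of os.environ with that variable set.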

Priorities (highest to lowest):

* open() encoding and errors arguments
* PYTHONIOENCODING
* UTF-8 mode
* os.device_encoding()
* locale encoding
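
The priority order above can be read as "first available source wins". The function below is a simplified illustration, not the patch's code: resolve_text_encoding and its parameters are hypothetical, and it folds into one place decisions that the real implementation makes separately for open() and for the standard streams.

```python
def resolve_text_encoding(explicit=None, pythonioencoding=None, utf8_mode=False,
                          device_encoding=None, locale_encoding="ascii"):
    """Return (encoding, source) for the first source that provides a value.

    Hypothetical helper: each argument stands for one entry of the list,
    from highest to lowest priority.
    """
    candidates = [
        ("open() argument", explicit),
        ("PYTHONIOENCODING", pythonioencoding),
        ("UTF-8 mode", "utf-8" if utf8_mode else None),
        ("os.device_encoding()", device_encoding),
        ("locale encoding", locale_encoding),
    ]
    for source, encoding in candidates:
        if encoding:
            return encoding, source
    raise LookupError("no encoding available")


if __name__ == "__main__":
    # UTF-8 mode beats the locale, but loses to PYTHONIOENCODING.
    print(resolve_text_encoding(utf8_mode=True, locale_encoding="ascii"))
    print(resolve_text_encoding(pythonioencoding="latin-1", utf8_mode=True))
```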

TODO:

* re-encode sys.argv from the locale encoding to UTF-8 in Py_Main() when the UTF-8 mode is enabled
* support strict mode in Py_DecodeLocale() and Py_EncodeLocale()
msg285216 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-11 11:32
Examples with pep540_cli.py.

Python 3.5:

$ python3 pep540_cli.py 
sys.argv: ['pep540_cli.py']
stdin: UTF-8/strict
stdout: UTF-8/strict
stderr: UTF-8/backslashreplace
open(): UTF-8/strict

$ LC_ALL=C python3 pep540_cli.py 
sys.argv: ['pep540_cli.py']
stdin: ANSI_X3.4-1968/surrogateescape
stdout: ANSI_X3.4-1968/surrogateescape
stderr: ANSI_X3.4-1968/backslashreplace
open(): ANSI_X3.4-1968/strict


Patched Python 3.7:


$ ./python pep540_cli.py 
UTF-8 mode: 0
sys.argv: ['pep540_cli.py']
stdin: UTF-8/strict
stdout: UTF-8/strict
stderr: UTF-8/backslashreplace
open(): UTF-8/strict

$ LC_ALL=C ./python pep540_cli.py 
UTF-8 mode: 1
sys.argv: ['pep540_cli.py']
stdin: utf-8/surrogateescape
stdout: utf-8/surrogateescape
stderr: utf-8/backslashreplace
open(): utf-8/surrogateescape

$ ./python -X utf8 pep540_cli.py 
UTF-8 mode: 1
sys.argv: ['pep540_cli.py']
stdin: utf-8/surrogateescape
stdout: utf-8/surrogateescape
stderr: utf-8/backslashreplace
open(): utf-8/surrogateescape

$ ./python -X utf8=strict pep540_cli.py 
UTF-8 mode: 2
sys.argv: ['pep540_cli.py']
stdin: utf-8/strict
stdout: utf-8/strict
stderr: utf-8/backslashreplace
open(): utf-8/strict
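
The attached script is not reproduced in this thread. For readers who want output in the same shape, here is a stand-in written from the examples above. It is an assumption about what such a script could look like, not the original pep540_cli.py, and the function names are invented.

```python
import os
import sys
import tempfile


def stream_info(stream):
    """Format a text stream as 'encoding/errors' (None if the stream is missing)."""
    encoding = getattr(stream, "encoding", None)
    errors = getattr(stream, "errors", None)
    return "%s/%s" % (encoding, errors)


def open_info():
    """Report the defaults open() picks when no encoding or errors is given."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path) as fp:
            return stream_info(fp)
    finally:
        os.unlink(path)


def report():
    """Build the report lines, mirroring the layout of the examples above."""
    lines = []
    flags = sys.flags
    # utf8mode is the name in this patch; utf8_mode is used by released versions.
    mode = getattr(flags, "utf8mode", getattr(flags, "utf8_mode", None))
    if mode is not None:
        lines.append("UTF-8 mode: %s" % mode)
    lines.append("sys.argv: %s" % ascii(sys.argv))
    lines.append("stdin: %s" % stream_info(sys.stdin))
    lines.append("stdout: %s" % stream_info(sys.stdout))
    lines.append("stderr: %s" % stream_info(sys.stderr))
    lines.append("open(): %s" % open_info())
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```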
msg285275 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-11 22:04
pep540-2.patch: Patch version 2, updated to the latest version of PEP 540. It has no remaining FIXME/TODO and has more unit tests. The main change is that the strict mode no longer uses the strict error handler for OS data, but keeps surrogateescape. See the PEP for the rationale (especially the "Use the strict error handler for operating system data" alternative).
msg285276 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-11 22:13
Oops, I introduced an obvious bug in my latest refactoring. It's now fixed in the patch version 3: pep540-3.patch.
msg285277 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-11 23:00
Hum, pep540-3.patch doesn't work if the locale encoding is something other than ASCII or UTF-8. argv must be re-encoded:

$ LC_ALL=fr_FR ./python -X utf8 -c 'import sys; print(ascii(sys.argv))' $(echo -ne "\xff")
['-c', '\xff']

The result should not depend on the locale; it should be the same as:

$ LC_ALL=fr_FR.utf8 ./python -X utf8 -c 'import sys; print(ascii(sys.argv))' $(echo -ne "\xff")
['-c', '\udcff']

$ LC_ALL=C ./python -X utf8 -c 'import sys; print(ascii(sys.argv))' $(echo -ne "\xff")
['-c', '\udcff']
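
The difference between the three results comes down to which codec decodes the raw byte. The sketch below reproduces it in pure Python, assuming that the fr_FR locale uses ISO-8859-1 (typical on Linux, but not stated in this issue). decode_argument is a hypothetical helper, not interpreter code.

```python
RAW_ARG = b"\xff"  # the byte produced by: echo -ne "\xff"


def decode_argument(raw, encoding):
    """Decode a command line argument the way os data is decoded: surrogateescape."""
    return raw.decode(encoding, "surrogateescape")


if __name__ == "__main__":
    # ISO-8859-1 maps every byte to a character, so 0xff becomes U+00FF.
    print(ascii(decode_argument(RAW_ARG, "iso8859-1")))  # '\xff'
    # 0xff is invalid in UTF-8 and ASCII, so it is escaped as U+DCFF.
    print(ascii(decode_argument(RAW_ARG, "utf-8")))      # '\udcff'
    print(ascii(decode_argument(RAW_ARG, "ascii")))      # '\udcff'
```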
msg285278 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-11 23:01
I only tested the PEP 540 implementation on Linux.

The PEP and its implementation should be adjusted for Windows, especially for Windows-only env vars like PYTHONLEGACYWINDOWSFSENCODING.

Changes may also be needed for Mac OS X and Android, which always use UTF-8. Currently, the locale encoding is still used on these platforms (e.g. by open()). Is it possible to have a locale encoding other than UTF-8 on Android, for example?
msg285280 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-01-11 23:57
> Hum, pep540-3.patch doesn't work if the locale encoding is something other than ASCII or UTF-8. argv must be re-encoded:

I want to skip reencoding.
In UTF-8 mode, arbitrary bytes on the command line (e.g. a broken filename passed by xargs) should be able to round-trip through UTF-8/surrogateescape.

I don't trust wcstombs/mbstowcs. They may not guarantee round-tripping of arbitrary bytes.

Can -X utf8 option be processed before Py_Main()?
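
The round-trip property mentioned here can be checked directly in Python, independently of wcstombs/mbstowcs. This is a small sketch with a hypothetical helper name, roundtrips. It only exercises the Python codec, not the C functions.

```python
ALL_BYTES = bytes(range(256))


def roundtrips(raw, encoding="utf-8"):
    """Return True if raw survives decode + encode with surrogateescape unchanged."""
    text = raw.decode(encoding, "surrogateescape")
    return text.encode(encoding, "surrogateescape") == raw


if __name__ == "__main__":
    # Every byte value, including those that are invalid in UTF-8.
    print(roundtrips(ALL_BYTES))
    # A mix of valid UTF-8 ("café") and undecodable bytes.
    print(roundtrips(b"caf\xc3\xa9 \xff\xfe"))
```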
msg285296 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-12 09:18
> Can -X utf8 option be processed before Py_Main()?

I'm trying to implement that, but it's hard to factor out the common code. I will probably have to duplicate the code handling -E, -X utf8, PYTHONMALLOC and PYTHONUTF8 for wchar_t* (UCS4 or UTF-16) and char* (bytes).
msg285298 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-12 10:32
Hum, test_utf8mode lacks a unit test for the -E command line option:
PYTHONUTF8 should be ignored if -E is used.
msg285325 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-12 13:31
Patch version 4:

* Handle PYTHONLEGACYWINDOWSFSENCODING: this env var now disables the UTF-8 mode and takes priority over -X utf8 and PYTHONUTF8
* Add a unit test for the PYTHONUTF8 env var combined with the -E cmdline option
* Add a unit test for the POSIX locale
* Fix initstdio() to correctly handle an empty PYTHONIOENCODING: this bug affects Python 3.6 as well and is not directly related to PEP 540
* Correctly handle PYTHONUTF8 set to an empty string (ignore it)
* Skip a unit test in test_utf8mode which failed with the POSIX locale

Note: This patch still has the sys.argv encoding bug with locale encodings other than ASCII and UTF-8.
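
To illustrate the empty PYTHONIOENCODING case from the list above: the variable has the form encodingname:errorhandler, where both parts are optional, so an empty value should simply mean "nothing set". The function below is a Python sketch of that parsing rule with a hypothetical name. The actual fix lives in the C function initstdio() and may differ in detail.

```python
def parse_pythonioencoding(value):
    """Split 'encoding:errors' into (encoding, errors); missing parts become None.

    Hypothetical helper for illustration only.
    """
    if not value:
        # Unset or empty variable: fall back to the defaults for both parts.
        return None, None
    encoding, _, errors = value.partition(":")
    return encoding or None, errors or None


if __name__ == "__main__":
    for raw in ("", "utf-8", ":replace", "latin-1:strict"):
        print(repr(raw), "->", parse_pythonioencoding(raw))
```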
msg285332 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-12 16:45
encodings.py: enhanced version of pep540_cli.py that also reports the locale and filesystem encodings. It is a script to test the implementation of PEP 540 (and PEP 538).
msg285357 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-01-13 00:54
How about having locale.getpreferredencoding() return 'utf-8' in UTF-8 mode?
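
A quick way to see what a given interpreter does for this question is to ask a child process started with -X utf8=1. The helper below is hypothetical. It normalizes the reported name through codecs.lookup() so that spellings such as 'UTF-8' and 'utf-8' compare equal. As far as I know, released Python versions that ship a UTF-8 mode report UTF-8 here, but that is an observation about later releases, not something established in this thread.

```python
import codecs
import subprocess
import sys


def preferred_encoding_in_child(*interpreter_args):
    """Return the normalized locale.getpreferredencoding(False) of a child interpreter."""
    code = "import locale; print(locale.getpreferredencoding(False))"
    cmd = [sys.executable, *interpreter_args, "-c", code]
    out = subprocess.run(cmd, stdout=subprocess.PIPE, check=True).stdout
    name = out.decode("ascii").strip()
    try:
        # Normalize aliases, e.g. 'UTF-8' -> 'utf-8', 'ANSI_X3.4-1968' -> 'ascii'.
        return codecs.lookup(name).name
    except LookupError:
        return name.lower()


if __name__ == "__main__":
    print("default    ->", preferred_encoding_in_child())
    print("-X utf8=1  ->", preferred_encoding_in_child("-X", "utf8=1"))
```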
msg285407 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-13 15:27
Oh, I just noticed that os.environ uses the hardcoded error handler "surrogateescape": it should be replaced with sys.getfilesystemencodeerrors() to support UTF-8 Strict mode.
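
The change suggested here amounts to taking the error handler from sys.getfilesystemencodeerrors() instead of a literal. The sketch below shows the idea with a hypothetical factory, make_environ_codec. It is not the code from os.py, and the explicit arguments exist only so both handlers can be compared side by side.

```python
import sys


def make_environ_codec(encoding=None, errors=None):
    """Return (encode, decode) functions for environment variable names and values.

    Hypothetical helper: by default it follows the filesystem encoding and
    error handler rather than a hardcoded "surrogateescape".
    """
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    if errors is None:
        errors = sys.getfilesystemencodeerrors()

    def encode(value):
        return value.encode(encoding, errors)

    def decode(value):
        return value.decode(encoding, errors)

    return encode, decode


if __name__ == "__main__":
    encode, decode = make_environ_codec("utf-8", "surrogateescape")
    print(ascii(decode(b"\xff")))      # undecodable byte is escaped
    print(encode(decode(b"\xff")))     # and restored on the way back

    strict_encode, strict_decode = make_environ_codec("utf-8", "strict")
    try:
        strict_decode(b"\xff")
    except UnicodeDecodeError as exc:
        print("strict handler rejects the byte:", exc.reason)
```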
msg285482 - (view) Author: Eryk Sun (eryksun) * Date: 2017-01-14 14:05
> it should be replaced with sys.getfilesystemencodeerrors() 
> to support UTF-8 Strict mode.

I did that in the patch for issue 28188. The focus of the patch is to add bytes support on Windows for os.putenv and os.environb, but I also tried to maximize consistency (at least parallel structure) between the POSIX and Windows implementations.
msg307694 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-05 22:12
I rebased my PR on master.
msg307695 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-05 22:12
I removed the old patches in favor of the now up-to-date PR 855.
History
Date                 User           Action  Args
2017-12-05 22:12:54  vstinner       set     messages: + msg307695
2017-12-05 22:12:31  vstinner       set     files: - encodings.py
2017-12-05 22:12:14  vstinner       set     files: - pep540_cli.py
2017-12-05 22:12:14  vstinner       set     files: - pep540.patch
2017-12-05 22:12:13  vstinner       set     files: - pep540-2.patch
2017-12-05 22:12:12  vstinner       set     files: - pep540-3.patch
2017-12-05 22:12:11  vstinner       set     files: - pep540-4.patch
2017-12-05 22:12:00  vstinner       set     messages: + msg307694
2017-12-05 22:11:45  vstinner       set     title: [WIP] Implementation of the PEP 540: Add a new UTF-8 mode -> PEP 540: Add a new UTF-8 mode
2017-06-28 01:00:39  vstinner       set     title: Implementation of the PEP 540: Add a new UTF-8 mode -> [WIP] Implementation of the PEP 540: Add a new UTF-8 mode
2017-03-27 22:03:35  vstinner       set     pull_requests: + pull_request757
2017-03-27 22:03:20  vstinner       set     pull_requests: - pull_request15
2017-01-14 14:05:46  eryksun        set     nosy: + eryksun; messages: + msg285482
2017-01-13 15:27:23  vstinner       set     messages: + msg285407
2017-01-13 00:54:08  inada.naoki    set     messages: + msg285357
2017-01-12 16:45:20  vstinner       set     files: + encodings.py; messages: + msg285332
2017-01-12 13:31:42  vstinner       set     files: + pep540-4.patch; messages: + msg285325
2017-01-12 10:32:24  vstinner       set     messages: + msg285298
2017-01-12 10:19:41  Chi Hsuan Yen  set     nosy: + Chi Hsuan Yen
2017-01-12 09:18:36  vstinner       set     messages: + msg285296
2017-01-11 23:57:12  inada.naoki    set     messages: + msg285280
2017-01-11 23:01:39  vstinner       set     messages: + msg285278
2017-01-11 23:00:07  vstinner       set     messages: + msg285277
2017-01-11 22:13:06  vstinner       set     files: + pep540-3.patch; messages: + msg285276
2017-01-11 22:04:22  vstinner       set     files: + pep540-2.patch; messages: + msg285275
2017-01-11 16:25:18  inada.naoki    set     nosy: + inada.naoki
2017-01-11 11:32:58  vstinner       set     messages: + msg285216
2017-01-11 11:27:22  vstinner       set     files: + pep540.patch; keywords: + patch; messages: + msg285215
2017-01-11 11:19:52  vstinner       create