msg285214 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-01-11 11:19 |
This issue tracks the implementation of PEP 540.
The attached pep540_cli.py script can be used to experiment with it.
|
msg285215 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-01-11 11:27 |
pep540.patch: first draft
Changes:
* Add sys.flags.utf8mode
* Add -X utf8 command line option
* Add PYTHONUTF8 environment variable
* The encoding and errors of sys.stdin, sys.stdout and sys.stderr are modified in UTF-8 mode
* The default encoding and errors of open() are modified in UTF-8 mode
* Add Lib/test/test_utf8mode.py
* Skip a few tests relying on the locale encoding if the UTF-8 mode is enabled
* Document changes
Allowed options:
* Disable UTF-8 mode: -X utf8=0 or PYTHONUTF8=0
* Enable UTF-8 mode: -X utf8=1 or PYTHONUTF8=1
* Enable UTF-8 Strict mode: -X utf8=strict or PYTHONUTF8=strict
* Other -X utf8 and PYTHONUTF8 values cause a fatal error
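The allowed values above can be modeled as a small lookup. This is only an illustrative sketch, not the parser from pep540.patch: the name `parse_utf8_option` and the integer constants are assumptions, chosen to match the "UTF-8 mode: 0/1/2" values printed by pep540_cli.py later in this thread, and `ValueError` stands in for the fatal error.

```python
# Hypothetical model of the option values listed above; not CPython's parser.
UTF8_DISABLED = 0
UTF8_ENABLED = 1
UTF8_STRICT = 2

_VALUES = {"0": UTF8_DISABLED, "1": UTF8_ENABLED, "strict": UTF8_STRICT}


def parse_utf8_option(value):
    """Map a -X utf8 / PYTHONUTF8 value to a mode number.

    None models a bare "-X utf8" (no "=value"), which enables the mode.
    Any unknown value raises ValueError, standing in for the fatal error.
    """
    if value is None:
        return UTF8_ENABLED
    try:
        return _VALUES[value]
    except KeyError:
        raise ValueError(f"invalid UTF-8 mode value: {value!r}") from None
```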
Priorities (highest to lowest):
* open() encoding and errors arguments
* PYTHONIOENCODING
* UTF-8 mode
* os.device_encoding()
* locale encoding
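The priority order above amounts to "take the first source that is set". A minimal sketch of that rule follows. It flattens settings that apply to different objects (open() arguments versus PYTHONIOENCODING for the standard streams) into one function purely for illustration; the function name `pick_encoding` and its parameters are hypothetical and do not come from the patch.

```python
def pick_encoding(explicit=None, pythonioencoding=None, utf8_mode=False,
                  device_encoding=None, locale_encoding="ascii"):
    """Return the first configured encoding, highest priority first."""
    candidates = (
        explicit,                        # open(..., encoding=...)
        pythonioencoding,                # PYTHONIOENCODING
        "utf-8" if utf8_mode else None,  # UTF-8 mode
        device_encoding,                 # os.device_encoding(fd)
    )
    # The locale encoding is the last resort when nothing else is set.
    return next((c for c in candidates if c), locale_encoding)
```

For example, `pick_encoding(utf8_mode=True, locale_encoding="iso-8859-1")` picks UTF-8, while passing `explicit="latin-1"` overrides everything else.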
TODO:
* re-encode sys.argv from the locale encoding to UTF-8 in Py_Main() when the UTF-8 mode is enabled
* support strict mode in Py_DecodeLocale() and Py_EncodeLocale()
|
msg285216 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-01-11 11:32 |
Examples with pep540_cli.py.
Python 3.5:
$ python3 pep540_cli.py
sys.argv: ['pep540_cli.py']
stdin: UTF-8/strict
stdout: UTF-8/strict
stderr: UTF-8/backslashreplace
open(): UTF-8/strict
$ LC_ALL=C python3 pep540_cli.py
sys.argv: ['pep540_cli.py']
stdin: ANSI_X3.4-1968/surrogateescape
stdout: ANSI_X3.4-1968/surrogateescape
stderr: ANSI_X3.4-1968/backslashreplace
open(): ANSI_X3.4-1968/strict
Patched Python 3.7:
$ ./python pep540_cli.py
UTF-8 mode: 0
sys.argv: ['pep540_cli.py']
stdin: UTF-8/strict
stdout: UTF-8/strict
stderr: UTF-8/backslashreplace
open(): UTF-8/strict
$ LC_ALL=C ./python pep540_cli.py
UTF-8 mode: 1
sys.argv: ['pep540_cli.py']
stdin: utf-8/surrogateescape
stdout: utf-8/surrogateescape
stderr: utf-8/backslashreplace
open(): utf-8/surrogateescape
$ ./python -X utf8 pep540_cli.py
UTF-8 mode: 1
sys.argv: ['pep540_cli.py']
stdin: utf-8/surrogateescape
stdout: utf-8/surrogateescape
stderr: utf-8/backslashreplace
open(): utf-8/surrogateescape
$ ./python -X utf8=strict pep540_cli.py
UTF-8 mode: 2
sys.argv: ['pep540_cli.py']
stdin: utf-8/strict
stdout: utf-8/strict
stderr: utf-8/backslashreplace
open(): utf-8/strict
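The attached pep540_cli.py is not reproduced in this thread. A script producing output of the same shape could look like the sketch below; it is a reconstruction under assumptions, not the attached file. The helper names `describe` and `collect_report` are hypothetical. The draft patch calls the flag `sys.flags.utf8mode`, while released Python 3.7+ exposes `sys.flags.utf8_mode`, so both names are probed.

```python
import os
import sys


def describe(stream):
    """Format a text stream as "encoding/errors", like the output above."""
    if stream is None:  # some environments have no standard streams
        return "<none>"
    encoding = getattr(stream, "encoding", None)
    errors = getattr(stream, "errors", None)
    return f"{encoding}/{errors}"


def collect_report():
    """Gather the values shown in the examples above into a dict."""
    report = {}
    # Probe both spellings of the flag; older Pythons have neither.
    for name in ("utf8_mode", "utf8mode"):
        if hasattr(sys.flags, name):
            report["UTF-8 mode"] = getattr(sys.flags, name)
            break
    report["sys.argv"] = sys.argv
    report["stdin"] = describe(sys.stdin)
    report["stdout"] = describe(sys.stdout)
    report["stderr"] = describe(sys.stderr)
    # open() without encoding/errors arguments exposes the defaults.
    with open(os.devnull) as f:
        report["open()"] = describe(f)
    return report


if __name__ == "__main__":
    for key, value in collect_report().items():
        print(f"{key}: {value}")
```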
|
msg285275 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-01-11 22:04 |
pep540-2.patch: Patch version 2, updated to the latest version of PEP 540. It has no more FIXME/TODO and has more unit tests. The main change is that the strict mode no longer uses the strict error handler for OS data, but keeps surrogateescape. See the PEP for the rationale (especially the "Use the strict error handler for operating system data" alternative).
|
msg285276 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-01-11 22:13 |
Oops, I introduced an obvious bug in my latest refactoring. It's now fixed in the patch version 3: pep540-3.patch.
|
msg285277 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-01-11 23:00 |
Hum, pep540-3.patch doesn't work if the locale encoding is different than ASCII and UTF-8. argv must be reencoded:
$ LC_ALL=fr_FR ./python -X utf8 -c 'import sys; print(ascii(sys.argv))' $(echo -ne "\xff")
['-c', '\xff']
The result should not depend on the locale; it should be the same as:
$ LC_ALL=fr_FR.utf8 ./python -X utf8 -c 'import sys; print(ascii(sys.argv))' $(echo -ne "\xff")
['-c', '\udcff']
$ LC_ALL=C ./python -X utf8 -c 'import sys; print(ascii(sys.argv))' $(echo -ne "\xff")
['-c', '\udcff']
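The two different results come down to which codec decodes the raw byte 0xff. A minimal sketch, assuming that the fr_FR locale uses ISO-8859-1 (common on Linux, but an assumption here):

```python
raw = b"\xff"  # the byte produced by: echo -ne "\xff"

# Decoded with the locale encoding (assumed to be ISO-8859-1 for fr_FR).
via_locale = raw.decode("iso-8859-1")

# Decoded as UTF-8 with surrogateescape, the result expected in UTF-8 mode.
via_utf8 = raw.decode("utf-8", "surrogateescape")

print(ascii(via_locale))  # '\xff'
print(ascii(via_utf8))    # '\udcff'
```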
|
msg285278 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-01-11 23:01 |
I only tested the PEP 540 implementation on Linux.
The PEP and its implementation should be adjusted for Windows, especially for Windows-only env vars like PYTHONLEGACYWINDOWSFSENCODING.
Changes may also be needed for Mac OS X and Android, which always use UTF-8. Currently, the locale encoding is still used on these platforms (e.g. by open()). Is it possible to have a locale encoding other than UTF-8 on Android, for example?
|
msg285280 - (view) |
Author: Inada Naoki (methane) * |
Date: 2017-01-11 23:57 |
> Hum, pep540-3.patch doesn't work if the locale encoding is different than ASCII and UTF-8. argv must be reencoded:
I want to skip reencoding.
In UTF-8 mode, arbitrary bytes on the command line (e.g. a broken filename passed by xargs) should be able to round-trip through UTF-8/surrogateescape.
I don't trust wcstombs/mbstowcs. It may not guarantee round tripping of arbitrary bytes.
Can -X utf8 option be processed before Py_Main()?
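The round-trip property mentioned above can be checked at the codec level. This sketch only exercises Python's own UTF-8 codec with the surrogateescape error handler; it says nothing about wcstombs/mbstowcs, and the helper name `roundtrip` is hypothetical.

```python
def roundtrip(data: bytes) -> bytes:
    """Decode bytes to str and back using UTF-8/surrogateescape."""
    text = data.decode("utf-8", "surrogateescape")
    return text.encode("utf-8", "surrogateescape")


# Every possible byte value, including ones that are invalid in UTF-8.
every_byte = bytes(range(256))
assert roundtrip(every_byte) == every_byte

# A "broken" filename: Latin-1 encoded text that is not valid UTF-8.
broken_name = b"caf\xe9.txt"
assert roundtrip(broken_name) == broken_name
```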
|
msg285296 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-01-12 09:18 |
> Can -X utf8 option be processed before Py_Main()?
I'm trying to implement that, but it's hard to factor out the common code. I will probably have to duplicate the code handling -E, -X utf8, PYTHONMALLOC and PYTHONUTF8 for wchar_t* (UCS4 or UTF-16) and char* (bytes).
|
msg285298 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-01-12 10:32 |
Hum, test_utf8mode lacks a unit test for the -E command line option:
PYTHONUTF8 should be ignored if -E is used.
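Such a check could be sketched as follows. This is not the test that was added to test_utf8mode; the helper name `child_utf8_mode` is hypothetical, and the code assumes Python 3.7 or later, where the flag is spelled `sys.flags.utf8_mode`. Because the UTF-8 mode can also be enabled by the locale, the sketch compares two runs under -E instead of expecting a fixed value.

```python
import os
import subprocess
import sys


def child_utf8_mode(pythonutf8, *options):
    """Run a child interpreter and return its sys.flags.utf8_mode as text."""
    env = dict(os.environ)
    env["PYTHONUTF8"] = pythonutf8
    code = "import sys; print(sys.flags.utf8_mode)"
    cmd = [sys.executable, *options, "-c", code]
    result = subprocess.run(cmd, env=env, stdout=subprocess.PIPE,
                            universal_newlines=True, check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    # With -E, the value of PYTHONUTF8 must not change the outcome.
    assert child_utf8_mode("1", "-E") == child_utf8_mode("0", "-E")
    # Without -E, PYTHONUTF8=1 enables the mode.
    assert child_utf8_mode("1") == "1"
```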
|
msg285325 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-01-12 13:31 |
Patch version 4:
* Handle PYTHONLEGACYWINDOWSFSENCODING: this env var now disables the UTF-8 mode and has the priority over -X utf8 and PYTHONUTF8
* Add a unit test for the PYTHONUTF8 env var and the -E cmdline option
* Add a unit test for the POSIX locale
* Fix initstdio() to correctly handle an empty PYTHONIOENCODING: this bug affects Python 3.6 as well and is not directly related to PEP 540
* Correctly handle PYTHONUTF8 set to an empty string (ignore it)
* Skip a unit test in test_utf8mode which failed with the POSIX locale
Note: This patch still has the sys.argv encoding bug with locale encodings different than ASCII and UTF-8.
|
msg285332 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-01-12 16:45 |
encodings.py: enhanced version of pep540_cli.py which also reports the locale and filesystem encodings. It is a script to test the implementation of PEP 540 (and PEP 538).
|
msg285357 - (view) |
Author: Inada Naoki (methane) * |
Date: 2017-01-13 00:54 |
How about making locale.getpreferredencoding() return 'utf-8' in UTF-8 mode?
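Released Python versions (3.7 and later) document that locale.getpreferredencoding() returns UTF-8 in UTF-8 mode; whether the patches at this point of the thread already did so is not stated. A small sketch to observe the behavior on an installed interpreter, with the hypothetical helper name `preferred_encoding_in_utf8_mode`:

```python
import subprocess
import sys


def preferred_encoding_in_utf8_mode():
    """Return locale.getpreferredencoding(False) from a child run with -X utf8."""
    code = "import locale; print(locale.getpreferredencoding(False))"
    cmd = [sys.executable, "-X", "utf8", "-c", code]
    result = subprocess.run(cmd, stdout=subprocess.PIPE,
                            universal_newlines=True, check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    # The spelling varies by version ("UTF-8" or "utf-8").
    print(preferred_encoding_in_utf8_mode())
```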
|
msg285407 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-01-13 15:27 |
Oh, I just noticed that os.environ uses the hardcoded error handler "surrogateescape": it should be replaced with sys.getfilesystemencodeerrors() to support UTF-8 Strict mode.
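The idea is to ask the interpreter for the error handler instead of hardcoding it. A minimal sketch with hypothetical helper names (`encode_env_value`, `decode_env_value`); it is not the code of the os module, only an illustration of using sys.getfilesystemencoding() together with sys.getfilesystemencodeerrors() (available since Python 3.6):

```python
import sys


def encode_env_value(value: str) -> bytes:
    """Encode text the way filesystem/OS data is encoded by this interpreter."""
    return value.encode(sys.getfilesystemencoding(),
                        sys.getfilesystemencodeerrors())


def decode_env_value(data: bytes) -> str:
    """Decode OS data with the interpreter's filesystem encoding and errors."""
    return data.decode(sys.getfilesystemencoding(),
                       sys.getfilesystemencodeerrors())


if __name__ == "__main__":
    print(sys.getfilesystemencoding(), sys.getfilesystemencodeerrors())
    print(decode_env_value(encode_env_value("HOME=/tmp")))
```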
|
msg285482 - (view) |
Author: Eryk Sun (eryksun) * |
Date: 2017-01-14 14:05 |
> it should be replaced with sys.getfilesystemencodeerrors()
> to support UTF-8 Strict mode.
I did that in the patch for issue 28188. The focus of the patch is to add bytes support on Windows for os.putenv and os.environb, but I also tried to maximize consistency (at least parallel structure) between the POSIX and Windows implementations.
|
msg307694 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-12-05 22:12 |
I rebased my PR on master.
|
msg307695 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-12-05 22:12 |
I removed old patches in favor of the now up to date PR 855.
|
msg308182 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-12-13 01:21 |
The PEP 538 has two open issues: bpo-30672 and bpo-32238.
I recently refactored the Py_Main() code so it should be simpler to implement the PEP 540: see bpo-32030.
|
msg308183 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-12-13 01:25 |
Oh, the PYTHONCOERCECLOCALE env var is read very early in main() by _Py_CoerceLegacyLocale(), and it ignores the -E command line option.
* Ignoring -E and -I is safe from a security perspective, as we only use
* the setting to turn *off* the implicit locale coercion, and anyone with
* access to the process environment already has the ability to set
* `LC_ALL=C` to override the C level locale settings anyway.
|
msg308198 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-12-13 11:29 |
New changeset 91106cd9ff2f321c0f60fbaa09fd46c80aa5c266 by Victor Stinner in branch 'master':
bpo-29240: PEP 540: Add a new UTF-8 Mode (#855)
https://github.com/python/cpython/commit/91106cd9ff2f321c0f60fbaa09fd46c80aa5c266
|
msg308213 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-12-13 16:31 |
New changeset d5dda98fa80405db82e2eb36ac48671b4c8c0983 by Victor Stinner in branch 'master':
pymain_set_sys_argv() now copies argv (#4838)
https://github.com/python/cpython/commit/d5dda98fa80405db82e2eb36ac48671b4c8c0983
|
msg308217 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-12-13 16:46 |
test_readline failed. It seems to be related to my commit:
http://buildbot.python.org/all/#/builders/87/builds/360
======================================================================
FAIL: test_nonascii (test.test_readline.TestReadline)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/home/buildbot/python/3.x.koobs-freebsd10/build/Lib/test/test_readline.py", line 219, in test_nonascii
self.assertIn(b"text 't\\xeb'\r\n", output)
AssertionError: b"text 't\\xeb'\r\n" not found in bytearray(b"^A^B^B^B^B^B^B^B\t\tx\t\r\n[\\303\\257nserted]|t\x07\x08\x08\x08\x08\x08\x08\x08\x07\x07xrted]|t\x08\x08\x08\x08\x08\x08\x08\x07\r\nresult \'[\\xefnsexrted]|t\'\r\nhistory \'[\\xefnsexrted]|t\'\r\n")
|
msg308430 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-12-15 22:06 |
New changeset d2b02310acbfe6c978a8ad3cd3ac8b3f12927442 by Victor Stinner in branch 'master':
bpo-29240: Don't define decode_locale() on macOS (#4895)
https://github.com/python/cpython/commit/d2b02310acbfe6c978a8ad3cd3ac8b3f12927442
|
msg308448 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-12-16 03:54 |
New changeset 9454060e84a669dde63824d9e2fcaf295e34f687 by Victor Stinner in branch 'master':
bpo-29240, bpo-32030: Py_Main() re-reads config if encoding changes (#4899)
https://github.com/python/cpython/commit/9454060e84a669dde63824d9e2fcaf295e34f687
|
msg308915 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-12-21 23:09 |
New changeset 424315fa865b43f67e36a40647107379adf031da by Victor Stinner in branch 'master':
bpo-29240: Skip test_readline.test_nonascii() (#4968)
https://github.com/python/cpython/commit/424315fa865b43f67e36a40647107379adf031da
|
msg308916 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-12-21 23:11 |
IMHO test_readline should be fixed by ignoring the UTF-8 mode in Py_EncodeLocale/Py_DecodeLocale, but only when called from the Python readline module. We may need new functions, something like: Py_EncodeCurrentLocale/Py_DecodeCurrentLocale.
I will work on a patch when I am back from holiday. In the meantime, I skipped the test to repair the FreeBSD 3.x buildbots.
|
msg309782 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2018-01-10 21:46 |
New changeset 2cba6b85797ba60d67389126f184aad5c9e02ff3 by Victor Stinner in branch 'master':
bpo-29240: readline now ignores the UTF-8 Mode (#5145)
https://github.com/python/cpython/commit/2cba6b85797ba60d67389126f184aad5c9e02ff3
|
msg309798 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2018-01-11 09:38 |
New changeset cb3ae5588bd7733e76dc09277bb7626652d9bb64 by Victor Stinner in branch 'master':
bpo-29240: Ignore UTF-8 Mode in time module (#5148)
https://github.com/python/cpython/commit/cb3ae5588bd7733e76dc09277bb7626652d9bb64
|
msg309958 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2018-01-15 09:38 |
Attached test_all_locales.py is a test suite for locale functions: os.strerror(), locale.localeconv(), time.strftime(). I tested it on Linux Fedora 27, FreeBSD 11.0 and macOS 10.13.2.
The test should always pass on Python 2.7. On Python 3.6 and the master branch with PR 5170, 2 tests on numeric localeconv() fail because Python uses the wrong encoding: see bpo-31900. master with PR 5170 now has fewer encoding bugs than Python 3.6.
|
msg309959 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2018-01-15 09:45 |
New changeset 7ed7aead9503102d2ed316175f198104e0cd674c by Victor Stinner in branch 'master':
bpo-29240: Fix locale encodings in UTF-8 Mode (#5170)
https://github.com/python/cpython/commit/7ed7aead9503102d2ed316175f198104e0cd674c
|
msg310029 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2018-01-16 00:08 |
> New changeset 7ed7aead9503102d2ed316175f198104e0cd674c by Victor Stinner in branch 'master':
> bpo-29240: Fix locale encodings in UTF-8 Mode (#5170)
Oh, this change broke test_nonascii() of test_readline() on FreeBSD.
Previously, readline used the ASCII/surrogateescape encoding for the POSIX locale. Now, mbstowcs() / wcstombs() is called directly, with the surrogateescape error handler.
|
msg310092 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2018-01-16 16:34 |
New changeset c495e799ed376af91ae2ddf6c4bcc592490fe294 by Victor Stinner in branch 'master':
Skip test_readline.test_nonascii() on C locale (#5203)
https://github.com/python/cpython/commit/c495e799ed376af91ae2ddf6c4bcc592490fe294
|
msg310097 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2018-01-16 17:27 |
New changeset c2740e8a263e76427a8102a89f4b491a3089b2a1 by Victor Stinner (Miss Islington (bot)) in branch '3.6':
Skip test_readline.test_nonascii() on C locale (GH-5203) (#5204)
https://github.com/python/cpython/commit/c2740e8a263e76427a8102a89f4b491a3089b2a1
|
msg310177 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2018-01-17 14:28 |
test_readline passes again on all buildbots, especially on the FreeBSD 3.6 and 3.x buildbots.
There are no more known issues; the implementation of PEP 540 (UTF-8 Mode) is now complete!
|
msg310443 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2018-01-22 18:07 |
New changeset 9089a265918754d95e105a7c4c409ac9352c87bb by Victor Stinner in branch 'master':
bpo-29240: PyUnicode_DecodeLocale() uses UTF-8 on Android (#5272)
https://github.com/python/cpython/commit/9089a265918754d95e105a7c4c409ac9352c87bb
|
msg310444 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2018-01-22 18:09 |
I partially reverted the commit 7ed7aead9503102d2ed316175f198104e0cd674c: on Android, UTF-8 is now always used, again. Paul Peny (aka pmpp) confirmed to me that my commit broke Python on Android, at least with API 19 (locales don't work properly before API 21).
|
msg412665 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2022-02-06 20:51 |
> New changeset 91106cd9ff2f321c0f60fbaa09fd46c80aa5c266 by Victor Stinner in branch 'master':
> bpo-29240: PEP 540: Add a new UTF-8 Mode (#855)
> https://github.com/python/cpython/commit/91106cd9ff2f321c0f60fbaa09fd46c80aa5c266
Oh, this change broke the mbcs alias on Windows and the test_codecs and test_site tests (2 tests!) missed the bug :-( I fixed it in:
New changeset 04dd60e50cd3da48fd19cdab4c0e4cc600d6af30 by Victor Stinner in branch 'main':
bpo-46659: Update the test on the mbcs codec alias (GH-31168)
https://github.com/python/cpython/commit/04dd60e50cd3da48fd19cdab4c0e4cc600d6af30
|
|
Date | User | Action | Args |
2022-04-11 14:58:41 | admin | set | github: 73426 |
2022-02-08 11:52:04 | yan12125 | set | nosy:
- yan12125
|
2022-02-06 20:51:32 | vstinner | set | messages:
+ msg412665 |
2018-01-22 18:09:52 | vstinner | set | messages:
+ msg310444 |
2018-01-22 18:07:35 | vstinner | set | messages:
+ msg310443 |
2018-01-22 16:48:36 | vstinner | set | pull_requests:
+ pull_request5116 |
2018-01-17 14:28:41 | vstinner | set | status: open -> closed resolution: fixed messages:
+ msg310177
stage: patch review -> resolved |
2018-01-16 17:27:36 | vstinner | set | messages:
+ msg310097 |
2018-01-16 16:34:45 | python-dev | set | pull_requests:
+ pull_request5058 |
2018-01-16 16:34:37 | vstinner | set | messages:
+ msg310092 |
2018-01-16 15:46:05 | vstinner | set | pull_requests:
+ pull_request5057 |
2018-01-16 00:08:03 | vstinner | set | messages:
+ msg310029 |
2018-01-15 11:17:23 | vstinner | set | pull_requests:
+ pull_request5043 |
2018-01-15 09:45:56 | vstinner | set | messages:
+ msg309959 |
2018-01-15 09:38:41 | vstinner | set | files:
+ test_all_locales.py
messages:
+ msg309958 |
2018-01-13 00:23:31 | vstinner | set | pull_requests:
+ pull_request5024 |
2018-01-11 09:38:07 | vstinner | set | messages:
+ msg309798 |
2018-01-10 22:22:15 | vstinner | set | pull_requests:
+ pull_request5005 |
2018-01-10 21:46:18 | vstinner | set | messages:
+ msg309782 |
2018-01-10 17:59:22 | vstinner | set | pull_requests:
+ pull_request5003 |
2017-12-21 23:11:01 | vstinner | set | messages:
+ msg308916 |
2017-12-21 23:09:28 | vstinner | set | messages:
+ msg308915 |
2017-12-21 22:51:06 | vstinner | set | pull_requests:
+ pull_request4860 |
2017-12-16 03:54:25 | vstinner | set | messages:
+ msg308448 |
2017-12-16 03:10:30 | vstinner | set | pull_requests:
+ pull_request4793 |
2017-12-15 22:06:23 | vstinner | set | messages:
+ msg308430 |
2017-12-15 21:18:45 | vstinner | set | pull_requests:
+ pull_request4787 |
2017-12-13 16:46:03 | vstinner | set | messages:
+ msg308217 |
2017-12-13 16:31:18 | vstinner | set | messages:
+ msg308213 |
2017-12-13 14:04:28 | vstinner | set | stage: patch review pull_requests:
+ pull_request4727 |
2017-12-13 11:29:11 | vstinner | set | messages:
+ msg308198 |
2017-12-13 01:25:01 | vstinner | set | messages:
+ msg308183 |
2017-12-13 01:21:47 | vstinner | set | messages:
+ msg308182 |
2017-12-05 22:12:54 | vstinner | set | messages:
+ msg307695 |
2017-12-05 22:12:31 | vstinner | set | files:
- encodings.py |
2017-12-05 22:12:14 | vstinner | set | files:
- pep540_cli.py |
2017-12-05 22:12:14 | vstinner | set | files:
- pep540.patch |
2017-12-05 22:12:13 | vstinner | set | files:
- pep540-2.patch |
2017-12-05 22:12:12 | vstinner | set | files:
- pep540-3.patch |
2017-12-05 22:12:11 | vstinner | set | files:
- pep540-4.patch |
2017-12-05 22:12:00 | vstinner | set | messages:
+ msg307694 |
2017-12-05 22:11:45 | vstinner | set | title: [WIP] Implementation of the PEP 540: Add a new UTF-8 mode -> PEP 540: Add a new UTF-8 mode |
2017-06-28 01:00:39 | vstinner | set | title: Implementation of the PEP 540: Add a new UTF-8 mode -> [WIP] Implementation of the PEP 540: Add a new UTF-8 mode |
2017-03-27 22:03:35 | vstinner | set | pull_requests:
+ pull_request757 |
2017-03-27 22:03:20 | vstinner | set | pull_requests:
- pull_request15 |
2017-01-14 14:05:46 | eryksun | set | nosy:
+ eryksun messages:
+ msg285482
|
2017-01-13 15:27:23 | vstinner | set | messages:
+ msg285407 |
2017-01-13 00:54:08 | methane | set | messages:
+ msg285357 |
2017-01-12 16:45:20 | vstinner | set | files:
+ encodings.py
messages:
+ msg285332 |
2017-01-12 13:31:42 | vstinner | set | files:
+ pep540-4.patch
messages:
+ msg285325 |
2017-01-12 10:32:24 | vstinner | set | messages:
+ msg285298 |
2017-01-12 10:19:41 | yan12125 | set | nosy:
+ yan12125
|
2017-01-12 09:18:36 | vstinner | set | messages:
+ msg285296 |
2017-01-11 23:57:12 | methane | set | messages:
+ msg285280 |
2017-01-11 23:01:39 | vstinner | set | messages:
+ msg285278 |
2017-01-11 23:00:07 | vstinner | set | messages:
+ msg285277 |
2017-01-11 22:13:06 | vstinner | set | files:
+ pep540-3.patch
messages:
+ msg285276 |
2017-01-11 22:04:22 | vstinner | set | files:
+ pep540-2.patch
messages:
+ msg285275 |
2017-01-11 16:25:18 | methane | set | nosy:
+ methane
|
2017-01-11 11:32:58 | vstinner | set | messages:
+ msg285216 |
2017-01-11 11:27:22 | vstinner | set | files:
+ pep540.patch keywords:
+ patch messages:
+ msg285215
|
2017-01-11 11:19:52 | vstinner | create | |