classification
Title: Tracking issue for adjustments to binary/text boundary handling
Type: enhancement Stage:
Components: Versions:
process
Status: open Resolution:
Dependencies: 1602 6135 9951 15216 17620 17909 18814 19977 20284 22264 22286 22385 27781 Superseder:
Assigned To: ncoghlan Nosy List: Drekin, barry, berker.peksag, bkabrda, brett.cannon, encukou, martin.panter, ncoghlan, rkuska, steve.dower, vstinner
Priority: normal Keywords:

Created on 2014-10-05 04:01 by ncoghlan, last changed 2016-09-10 10:23 by ncoghlan.

Files
File name Uploaded Description
test_cmd_line_unicode.py ncoghlan, 2015-11-16 12:32 Test case for boundary handling
Messages (14)
msg228536 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-10-05 04:01
See PEP 478 for the PEP level items targeting 3.5: http://www.python.org/dev/peps/pep-0478/

This is a tracking issue to help me keep track of some lower level items that didn't make the release PEP:

* Improved Windows console Unicode support (see
https://pypi.python.org/pypi/win_unicode_console for details)
* Changing the encoding and error handling of an existing stream
(http://bugs.python.org/issue15216)
* Allowing "backslashreplace" to be used on input (http://bugs.python.org/issue22286)
* Adding "codecs.convert_surrogates" (http://bugs.python.org/issue18814)
* Adding "wsgiref.util.dump_wsgistr" and "wsgiref.util.load_wsgistr" (http://bugs.python.org/issue22264)
* Adding "bytes.hex", "bytearray.hex" and "memoryview.hex" (http://bugs.python.org/issue9951)
* Adding a binary data formatting mini-language (depends on 9951, likely needs to be escalated to a full PEP for design discussion visibility) (http://bugs.python.org/issue22385)
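
For reference, a minimal sketch of what the ".hex" methods from issue 9951 look like in use, assuming Python 3.5 or later (the variable names are purely illustrative):

```python
# Hex rendering of binary data via the .hex() methods (Python 3.5+).
payload = b"\x00\xffPy"

hex_from_bytes = payload.hex()
hex_from_bytearray = bytearray(payload).hex()
hex_from_memoryview = memoryview(payload).hex()

# bytes.fromhex() is the inverse operation.
round_tripped = bytes.fromhex(hex_from_bytes)

print(hex_from_bytes)
```

All three types produce the same lowercase hex string, which is the baseline a formatting mini-language (issue 22385) would build on.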

Going back and updating http://www.python.org/dev/peps/pep-0467/ based on the last round of feedback is also on my personal todo list for 3.5.
msg228537 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-10-05 04:09
PEP 461 binary interpolation implementation issue: http://bugs.python.org/issue20284
msg228541 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-10-05 07:21
Assigning to myself, since there's nothing specifically to *do* for this bug, it's just to make it easier to track the status of the various other RFEs it depends on.
msg236120 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2015-02-17 02:09
Slavek et al - you folks may be interested in this one, as it tracks several issues that I consider relevant to the Python 2 -> 3 migration effort.

Redoing the list in a way that should render the strike-throughs for closed issues:

* Improved Windows console Unicode support (see
https://pypi.python.org/pypi/win_unicode_console for details)
* Changing the encoding and error handling of an existing stream
(issue 15216)
* Allowing "backslashreplace" to be used on input (issue 22286)
* Adding "codecs.convert_surrogates" (issue 18814)
* Adding "wsgiref.util.dump_wsgistr" and "wsgiref.util.load_wsgistr" (issue 22264)
* Adding "bytes.hex", "bytearray.hex" and "memoryview.hex" (issue 9951)
* Adding a binary data formatting mini-language (depends on issue 9951, likely needs to be escalated to a full PEP for design discussion visibility) (issue 22385)
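
To illustrate the "backslashreplace on input" item (issue 22286), here is a small sketch assuming Python 3.5 or later, where the handler is accepted for decoding as well as encoding. The sample bytes are made up for the example:

```python
# Decoding with "backslashreplace": undecodable bytes become visible
# escape sequences instead of raising UnicodeDecodeError.
raw = b"caf\xe9 ok"  # Latin-1 encoded, so not valid UTF-8

decoded = raw.decode("utf-8", errors="backslashreplace")

# Strict decoding of the same data fails.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(decoded)
```

This is mainly useful for diagnostics, since the result is lossy in the sense that a literal backslash sequence in the input can no longer be told apart from an escaped byte.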
msg237108 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2015-03-03 06:25
PEP 461 landed, restoring binary interpolation support: https://hg.python.org/cpython/rev/8d802fb6ae32
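
A quick sketch of the restored behaviour, assuming Python 3.5 or later (the header and key/value data are invented for illustration):

```python
# PEP 461: %-interpolation works on bytes again (Python 3.5+).
header = b"Content-Length: %d\r\n" % 42
pair = b"%s=%s" % (b"key", b"value")
hexed = b"%x" % 255

print(header, pair, hexed)
```

Only bytes-like operands are accepted for "%s" and "%b", so text has to be encoded explicitly before interpolation.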

There are also some relevant discussions around standardising the C.UTF-8 locale currently available on some Linux systems:

Fedora RFE: https://bugzilla.redhat.com/show_bug.cgi?id=902094
glibc RFE: https://sourceware.org/bugzilla/show_bug.cgi?id=17318
glibc-alpha discussion: https://sourceware.org/ml/libc-alpha/2015-02/msg00247.html
msg242943 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2015-05-12 05:09
I just went through the still-open issues referenced from here, and recommended deferring further consideration of all of the remaining items to 3.6:

* utilities for clearing out surrogates from strings: issue 18814
* treating "wsgistr" as a serialisation format: issue 22264
* defining a formatting mini-language for hex output: issue 22385
* providing a way to change the encoding of an existing stream: issue 15216

I also added two new dependencies to this tracking issue:

* Improved Unicode handling in the Windows console: issue 17620
* Using sys.stdin consistently at the default interactive prompt: issue 1602
msg249439 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2015-08-31 23:56
For historical purposes, also linking the change in issue #19977 to enable surrogateescape by default on stdin and stdout when the OS claims the locale encoding is ASCII.
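
As a reminder of what the "surrogateescape" error handler does, the sketch below round-trips a non-ASCII byte through an ASCII decode. It only uses the codec machinery directly and does not touch sys.stdin or sys.stdout; the sample data is illustrative:

```python
# "surrogateescape" smuggles undecodable bytes through str as lone
# surrogates in the range U+DC80..U+DCFF, so they can be restored later.
raw = b"abc\xff"

text = raw.decode("ascii", errors="surrogateescape")
restored = text.encode("ascii", errors="surrogateescape")

# Encoding the same string strictly fails, because a lone surrogate
# is not encodable without the matching error handler.
try:
    text.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(ascii(text))
```

The strict failure at the end is the same kind of error that shows up whenever an escaped string is re-encoded without passing "surrogateescape".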
msg251404 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2015-09-23 05:16
The Fedora RFE at https://bugzilla.redhat.com/show_bug.cgi?id=902094 to provide a C.UTF-8 locale by default has been addressed for Fedora 24 (the current Fedora Rawhide).

This means the "LANG=C.UTF-8 python3" replacement for the ASCII-centric "LANG=C python3" will become more widely available over the course of 2016.
msg254728 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2015-11-16 12:32
In discussing the Windows aspects of the bytes/text boundary handling issues with Brett & Steve recently, I realised I hadn't clearly defined what "fixed" looked like from my perspective.

The attached test case is an initial attempt at that. It currently fails on a UTF-8 Linux system: the "test_dash_c_unicode" case fails when the interpreter is misconfigured with "LANG=C". The problem there is that when we encode the -c command line argument back to bytes, we don't pass "surrogateescape".

I'd be interested in knowing how much of this already passes on a Windows system.

There's also a currently missing test case, which is to pass the info to the subprocess via stdin - "assert_python_ok()" doesn't currently support that, so implementing it will either require a new flag, or direct invocation of spawn_python().
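
A rough sketch of the stdin variant, using the plain subprocess module rather than the test.support helpers mentioned above. It assumes the child honours PYTHONIOENCODING (so no -E or -I), and the child script and sample text are made up for the example:

```python
import os
import subprocess
import sys

# Child process: echo stdin back to stdout as text.
child_code = "import sys; sys.stdout.write(sys.stdin.read())"

# Force UTF-8 standard streams in the child, independent of the locale.
env = dict(os.environ, PYTHONIOENCODING="utf-8")

text = "caf\u00e9 \u2603"
result = subprocess.run(
    [sys.executable, "-c", child_code],
    input=text.encode("utf-8"),  # bytes written to the child's stdin
    stdout=subprocess.PIPE,
    env=env,
    check=True,
)
echoed = result.stdout.decode("utf-8")
```

A real test case would also need to cover the misconfigured "LANG=C" scenario, which this sketch deliberately sidesteps by pinning the stream encoding.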
msg254741 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2015-11-16 18:27
Right now all of the tests fail on Windows by default (cp437 for me).

If I change the default IO encoding to utf-8 (hacked into pylifecycle.c, since PYTHONIOENCODING is ignored by subprocesses using -E), the four "Misconfigured" tests crash at the os.fsencode() call (as "mbcs:strict" cannot encode the characters - this may be a real issue, haven't dug into it yet).

Adding more hacks to get past this point brings me back into the ASCII encoding performed by the test, and I'm not sure whether that's just an incorrect assumption for Windows or not.


Separate issue: if I run "chcp 437" before the tests, the output is garbage. If I run "chcp 65001" then it shows the characters in the font correctly. The std streams encoding is taken from this value, but it doesn't map back to UTF-8, which is probably another issue. If I add a separate check in fileutils.c at _Py_device_encoding then I get UTF-8 enabled streams when the console is set for cp65001.

However, there are still a number of places that use GetACP() to determine the locale and encoding to use, which is incorrect for Unicode-aware programs. In particular, this should not happen:

>>> f=open('test.txt', 'w')
>>> f.encoding
'cp1252'

There's no good reason for the default encoding to not be UTF-8 these days, but this is a much bigger change. It's probably worth doing for 3.6, but may need more discussion...
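
Until the default changes, the usual workaround is to pass the encoding explicitly. A minimal sketch (the temporary directory and file contents are only for illustration):

```python
import locale
import os
import tempfile

# What open() falls back to when no encoding is given (platform dependent).
default_encoding = locale.getpreferredencoding(False)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.txt")
    # An explicit encoding makes the result independent of the locale.
    with open(path, "w", encoding="utf-8") as f:
        explicit_encoding = f.encoding
        f.write("snowman: \u2603")
    with open(path, "rb") as f:
        raw = f.read()

print(default_encoding, explicit_encoding)
```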
msg254774 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2015-11-17 00:38
Thanks. I suspect some of the Windows problems are indeed due to bogus assumptions in my draft tests, but at the same time, folks should be able to invoke subprocesses with Unicode values without needing extensive knowledge of platform specific Unicode handling arcana (whether that's *nix or Windows).

I've added Victor to the nosy list as well, since he'd previously expressed interest in implementing a cross-platform "force UTF-8" mode for 3.6 (akin to the default behaviour on Mac OS X), and I suspect these proposed test cases will be relevant to such a capability.
msg254803 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2015-11-17 15:40
The thing about bogus assumptions is that Python should paper over those anyway. I can guarantee there's production code out there with the same assumptions.

How do we make this work? No idea in the context of the bytes/str filename convention differences.
msg274952 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-09-08 02:08
Likely to be resolved, or at least significantly updated, for 3.6 due to PEP 528 and PEP 529:


* Using sys.stdin consistently at the default interactive prompt: issue 1602
* Improved Unicode handling in the Windows console: issue 17620
* Allowing text encoding and error handling to be specified in subprocess module APIs: issue 6135
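
For the subprocess item, a sketch of the "encoding" and "errors" parameters as they exist in Python 3.6 and later (the child script here is invented for the example and writes one deliberately invalid byte):

```python
import subprocess
import sys

# Child process: emit two ASCII bytes followed by an invalid UTF-8 byte.
child_code = "import sys; sys.stdout.buffer.write(bytes([0x6f, 0x6b, 0xff]))"

result = subprocess.run(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    encoding="utf-8",          # decode the pipe as UTF-8 ...
    errors="surrogateescape",  # ... while preserving undecodable bytes
    check=True,
)
output = result.stdout
```

With these parameters the caller chooses the codec and error handler instead of inheriting whatever the locale dictates.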

New change landing in 3.6:

* Changing the Windows default encoding to UTF-8 to better match bytes handling conventions on *nix systems: issue 27781


Likely deferred to 3.7:

* providing a way to change the encoding of an existing stream: issue 15216
* utilities for clearing out surrogates from strings: issue 18814
* treating "wsgistr" as a serialisation format: issue 22264
* defining a formatting mini-language for hex output: issue 22385
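
For context on the first deferred item, this is roughly how changing the encoding of an existing stream looks with TextIOWrapper.reconfigure(), assuming Python 3.7 or later where the "encoding" and "errors" arguments are accepted. An in-memory buffer stands in for a real stream:

```python
import io

# Wrap an in-memory binary buffer with a deliberately restrictive encoding.
buffer = io.BytesIO()
stream = io.TextIOWrapper(buffer, encoding="ascii", errors="strict")

# Switch encoding and error handler before any data has been written.
stream.reconfigure(encoding="utf-8", errors="backslashreplace")

stream.write("caf\u00e9")
stream.flush()
written = buffer.getvalue()
new_encoding = stream.encoding
```

The same call can be applied to sys.stdout, although changing the encoding of a stream that has already been read from is not supported.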
msg275616 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-09-10 10:23
Added another issue to the tracking list: 

* Automatically decode binary data in json.loads: issue #17909
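
A short sketch of the resulting behaviour, assuming Python 3.6 or later, where json.loads() accepts bytes encoded as UTF-8, UTF-16 or UTF-32 (the sample document is illustrative):

```python
import json

document = {"name": "caf\u00e9"}
serialised = json.dumps(document, ensure_ascii=False)

# The same JSON text in two different binary encodings.
utf8_bytes = serialised.encode("utf-8")
utf16_bytes = serialised.encode("utf-16")

# json.loads() detects the encoding of binary input automatically.
from_utf8 = json.loads(utf8_bytes)
from_utf16 = json.loads(utf16_bytes)
```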
History
Date User Action Args
2016-09-10 10:23:39 ncoghlan set dependencies: + Autodetecting JSON encoding
messages: + msg275616
2016-09-08 02:08:48 ncoghlan set dependencies: + subprocess seems to use local encoding and give no choice, Change sys.getfilesystemencoding() on Windows to UTF-8
messages: + msg274952
2015-11-17 15:40:59 steve.dower set messages: + msg254803
2015-11-17 00:38:26 ncoghlan set nosy: + vstinner
messages: + msg254774
2015-11-16 18:27:01 steve.dower set messages: + msg254741
2015-11-16 12:32:58 ncoghlan set files: + test_cmd_line_unicode.py
nosy: + brett.cannon, steve.dower
messages: + msg254728
2015-09-23 05:16:49 ncoghlan set messages: + msg251404
2015-08-31 23:56:33 ncoghlan set dependencies: + Use "surrogateescape" error handler for sys.stdin and sys.stdout on UNIX for the C locale
messages: + msg249439
2015-07-21 07:12:57 ethan.furman set nosy: - ethan.furman
2015-05-13 18:43:20 Drekin set nosy: + Drekin
2015-05-13 17:25:17 ethan.furman set nosy: + ethan.furman
2015-05-12 06:16:57 berker.peksag set nosy: + berker.peksag
2015-05-12 05:09:39 ncoghlan set dependencies: + windows console doesn't print or input Unicode, Python interactive console doesn't use sys.stdin for input
messages: + msg242943
2015-03-03 06:25:09 ncoghlan set messages: + msg237108
2015-02-17 02:09:43 ncoghlan set nosy: + encukou, bkabrda, rkuska
messages: + msg236120
2014-10-05 11:59:05 barry set nosy: + barry
2014-10-05 07:21:32 ncoghlan set assignee: ncoghlan
type: enhancement
messages: + msg228541
2014-10-05 06:57:56 martin.panter set nosy: + martin.panter
2014-10-05 04:09:50 ncoghlan set dependencies: + introduce bytes.hex method (also for bytearray and memoryview), Add encoding & errors parameters to TextIOWrapper.reconfigure(), Add utilities to "clean" surrogate code points from strings, patch to implement PEP 461 (%-interpolation for bytes), Add wsgiref.util.dump_wsgistr & load_wsgistr, Allow backslashreplace error handler to be used on input, Define a binary output formatting mini-language for *.hex()
messages: + msg228537
2014-10-05 04:01:43 ncoghlan create