Tracking issue for adjustments to binary/text boundary handling #66745

ncoghlan · 2014-10-05T04:01:43Z

BPO	22555
Nosy	@warsaw, @brettcannon, @ncoghlan, @vstinner, @encukou, @berkerpeksag, @vadmium, @zooba, @rkuska
Dependencies	bpo-1602: windows console doesn't print or input Unicode; bpo-6135: subprocess seems to use local encoding and give no choice; bpo-9951: introduce bytes.hex method (also for bytearray and memoryview); bpo-15216: Add encoding & errors parameters to TextIOWrapper.reconfigure(); bpo-17909: Autodetecting JSON encoding; bpo-19977: Use "surrogateescape" error handler for sys.stdin and sys.stdout on UNIX for the C locale; bpo-20284: patch to implement PEP 461 (%-interpolation for bytes); bpo-22286: Allow backslashreplace error handler to be used on input; bpo-27781: Change sys.getfilesystemencoding() on Windows to UTF-8
Files	test_cmd_line_unicode.py: Test case for boundary handling

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/ncoghlan'
closed_at = <Date 2018-06-09.11:03:02.545>
created_at = <Date 2014-10-05.04:01:43.336>
labels = ['type-feature']
title = 'Tracking issue for adjustments to binary/text boundary handling'
updated_at = <Date 2018-06-09.16:09:48.196>
user = 'https://github.com/ncoghlan'

bugs.python.org fields:

activity = <Date 2018-06-09.16:09:48.196>
actor = 'vstinner'
assignee = 'ncoghlan'
closed = True
closed_date = <Date 2018-06-09.11:03:02.545>
closer = 'ncoghlan'
components = []
creation = <Date 2014-10-05.04:01:43.336>
creator = 'ncoghlan'
dependencies = ['1602', '6135', '9951', '15216', '17909', '19977', '20284', '22286', '27781']
files = ['41054']
hgrepos = []
issue_num = 22555
keywords = []
message_count = 18.0
messages = ['228536', '228537', '228541', '236120', '237108', '242943', '249439', '251404', '254728', '254741', '254774', '254803', '274952', '275616', '319140', '319142', '319143', '319149']
nosy_count = 11.0
nosy_names = ['barry', 'brett.cannon', 'ncoghlan', 'vstinner', 'petr.viktorin', 'berker.peksag', 'martin.panter', 'bkabrda', 'Drekin', 'steve.dower', 'rkuska']
pr_nums = []
priority = 'normal'
resolution = 'out of date'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue22555'
versions = []

ncoghlan · 2014-10-05T04:01:42Z

See PEP-478 for the PEP level items targeting 3.5: http://www.python.org/dev/peps/pep-0478/

This is a tracking issue to help me keep track of some lower level items that didn't make the release PEP:

Improved Windows console Unicode support (see https://pypi.python.org/pypi/win_unicode_console for details)
Changing the encoding and error handling of an existing stream (http://bugs.python.org/issue15216)
Allowing "backslashreplace" to be used on input (http://bugs.python.org/issue22286)
Adding "codecs.convert_surrogates" (http://bugs.python.org/issue18814)
Adding "wsgiref.util.dump_wsgistr" and "wsgiref.util.load_wsgistr" (http://bugs.python.org/issue22264)
Adding "bytes.hex", "bytearray.hex" and "memoryview.hex" (http://bugs.python.org/issue9951)
Adding a binary data formatting mini-language (depends on 9951, likely needs to be escalated to a full PEP for design discussion visibility) (http://bugs.python.org/issue22385)
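For anyone skimming the list, here is a minimal sketch of what the "hex" methods from issue 9951 look like in use. It assumes Python 3.5 or later, and the sample payload is made up purely for illustration.

```python
# Hypothetical sample payload, chosen only for illustration.
payload = b"\x00\xffPy"

# The same method is available on all three binary types.
hex_from_bytes = payload.hex()
hex_from_bytearray = bytearray(payload).hex()
hex_from_memoryview = memoryview(payload).hex()

print(hex_from_bytes)  # 00ff5079

# bytes.fromhex() reverses the conversion.
round_tripped = bytes.fromhex(hex_from_bytes)
```

The output is a plain str of lowercase hex digits, so it can be logged or embedded in text without any further decoding step.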

Going back and updating http://www.python.org/dev/peps/pep-0467/ based on the last round of feedback is also on my personal todo list for 3.5.

ncoghlan · 2014-10-05T04:09:51Z

PEP-461 binary interpolation implementation issue: http://bugs.python.org/issue20284

ncoghlan · 2014-10-05T07:21:32Z

Assigning to myself, since there's nothing specifically to *do* for this bug, it's just to make it easier to track the status of the various other RFEs it depends on.

ncoghlan · 2015-02-17T02:09:42Z

Slavek et al - you folks may be interested in this one, as it tracks several issues that I consider relevant to the Python 2 -> 3 migration effort.

Redoing the list in a way that should render the strike-throughs for closed issues:

Improved Windows console Unicode support (see https://pypi.python.org/pypi/win_unicode_console for details)
Changing the encoding and error handling of an existing stream (bpo-15216)
Allowing "backslashreplace" to be used on input (bpo-22286)
Adding "codecs.convert_surrogates" (bpo-18814)
Adding "wsgiref.util.dump_wsgistr" and "wsgiref.util.load_wsgistr" (bpo-22264)
Adding "bytes.hex", "bytearray.hex" and "memoryview.hex" (bpo-9951)
Adding a binary data formatting mini-language (depends on bpo-9951, likely needs to be escalated to a full PEP for design discussion visibility) (bpo-22385)
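As a rough illustration of the "backslashreplace on input" item (bpo-22286), the sketch below decodes bytes that are not valid in the chosen encoding. It assumes Python 3.5 or later, and the input bytes are an invented example.

```python
# Latin-1 encoded bytes, which are not valid UTF-8 (illustrative input only).
raw = b"caf\xe9 ok"

# Undecodable bytes are kept visible as escape sequences instead of raising
# UnicodeDecodeError or being silently replaced.
decoded = raw.decode("utf-8", errors="backslashreplace")

print(decoded)  # caf\xe9 ok
```

Unlike "surrogateescape", this is a one-way conversion intended for display and debugging; the result cannot be reliably encoded back to the original bytes.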

ncoghlan · 2015-03-03T06:25:09Z

PEP-461 landed, restoring binary interpolation support: https://hg.python.org/cpython/rev/8d802fb6ae32
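A small sketch of the restored binary interpolation, assuming Python 3.5 or later. The header name and values are made up for illustration.

```python
# %-interpolation works directly on bytes, with no intermediate str step.
header = b"%s: %d bytes" % (b"Content-Length", 42)
print(header)  # b'Content-Length: 42 bytes'

# Numeric format codes produce ASCII output.
hex_field = b"%04x" % 255
print(hex_field)  # b'00ff'
```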

There are also some relevant RFEs and discussions around standardising the C.UTF-8 locale currently available on some Linux systems:

Fedora RFE: https://bugzilla.redhat.com/show_bug.cgi?id=902094
glibc RFE: https://sourceware.org/bugzilla/show_bug.cgi?id=17318
glibc-alpha discussion: https://sourceware.org/ml/libc-alpha/2015-02/msg00247.html

ncoghlan · 2015-05-12T05:09:39Z

I just went through the still-open issues referenced from here, and recommended deferring further consideration of all of the remaining items to 3.6:

utilities for clearing out surrogates from strings: bpo-18814
treating "wsgistr" as a serialisation format: bpo-22264
defining a formatting mini-language for hex output: bpo-22385
providing a way to change the encoding of an existing stream: bpo-15216

I also added two new dependencies to this tracking issue:

Improved Unicode handling in the Windows console: bpo-17620
Using sys.stdin consistently at the default interactive prompt: bpo-1602
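For context on the bpo-15216 item, here is a minimal sketch of changing the encoding of an existing stream. It assumes the TextIOWrapper.reconfigure() parameters available in Python 3.7 and later, and uses an in-memory buffer rather than a real standard stream.

```python
import io

# An in-memory binary buffer stands in for an already-open stream.
buffer = io.BytesIO()
stream = io.TextIOWrapper(buffer, encoding="ascii", errors="strict")

# Switch the encoding and error handler before any text is written.
stream.reconfigure(encoding="utf-8", errors="backslashreplace")
stream.write("naïve café")
stream.flush()

written = buffer.getvalue()
print(written)  # b'na\xc3\xafve caf\xc3\xa9'
```

The same call can be applied to sys.stdout, although the effect there depends on how the interpreter and terminal are configured.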

ncoghlan · 2015-08-31T23:56:33Z

For historical purposes, also linking the change in issue bpo-19977 to enable surrogateescape by default on stdin and stdout when the OS claims the locale encoding is ASCII.
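To show why that error handler matters at the boundary, here is a minimal sketch of a surrogateescape round trip. The input bytes are an arbitrary example, not taken from the linked issue.

```python
# \xff is not valid ASCII, so a strict decode would raise UnicodeDecodeError.
raw = b"abc\xff"

# surrogateescape smuggles the undecodable byte through as a lone surrogate.
text = raw.decode("ascii", errors="surrogateescape")
print(ascii(text))  # 'abc\udcff'

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("ascii", errors="surrogateescape")
```

The round trip is lossless, which is what lets data pass from stdin to stdout unchanged even when the declared encoding is wrong.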

ncoghlan · 2015-09-23T05:16:50Z

The Fedora RFE at https://bugzilla.redhat.com/show_bug.cgi?id=902094 to provide a C.UTF-8 locale by default has been addressed for Fedora 24 (the current Fedora Rawhide).

This means the "LANG=C.UTF-8 python3" replacement for the ASCII-centric "LANG=C python3" will become more widely available over the course of 2016.

ncoghlan · 2015-11-16T12:32:57Z

In discussing the Windows aspects of the bytes/text boundary handling issues with Brett & Steve recently, I realised I hadn't clearly defined what "fixed" looked like from my perspective.

The attached test case is an initial attempt at that. It currently fails on a UTF-8 Linux system, with the "test_dash_c_unicode" case failing when the interpreter is misconfigured with "LANG=C" - the problem there is that when we encode from the -c command line argument back to bytes, we don't pass "surrogateescape".

I'd be interested in knowing how much of this already passes on a Windows system.

There's also a currently missing test case, which is to pass the info to the subprocess via stdin - "assert_python_ok()" doesn't currently support that, so implementing it will either require a new flag, or direct invocation of spawn_python().
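The stdin case could be sketched roughly as below. This is a standalone illustration using subprocess.run() (Python 3.5 or later) rather than the test.support helpers mentioned above, and it pins the child's stream encoding with PYTHONIOENCODING, which is an assumption of this sketch and not part of the attached test case.

```python
import os
import subprocess
import sys

# Arbitrary non-ASCII sample text, chosen only for illustration.
text = "caf\u00e9 \u2602"

# The child echoes stdin to stdout unchanged.
child_code = "import sys; sys.stdout.write(sys.stdin.read())"

# Pin the child's standard streams to UTF-8 regardless of the locale.
env = dict(os.environ, PYTHONIOENCODING="utf-8")

result = subprocess.run(
    [sys.executable, "-c", child_code],
    input=text.encode("utf-8"),
    stdout=subprocess.PIPE,
    env=env,
    check=True,
)
echoed = result.stdout.decode("utf-8")
```

A fuller test would repeat this with a deliberately misconfigured locale to check that the data still survives the round trip.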

zooba · 2015-11-16T18:27:01Z

Right now all of the tests fail on Windows by default (cp437 for me).

If I change the default IO encoding to utf-8 (hacked into pylifecycle.c, since PYTHONIOENCODING is ignored by subprocesses using -E), the four "Misconfigured" tests crash at the os.fsencode() call (as "mbcs:strict" cannot encode the characters - this may be a real issue, haven't dug into it yet).

Adding more hacks to get past this point brings me back into the ASCII encoding performed by the test, and I'm not sure whether that's just an incorrect assumption for Windows or not.

Separate issue: if I run "chcp 437" before the tests, the output is garbage. If I run "chcp 65001" then it shows the characters in the font correctly. The std streams encoding is taken from this value, but it doesn't map back to UTF-8, which is probably another issue. If I add a separate check in fileutils.c at _Py_device_encoding then I get UTF-8 enabled streams when the console is set for cp65001.

However, there are still a number of places that use GetACP() to determine the locale and encoding to use, which is incorrect for Unicode-aware programs. In particular, this should not happen:

>>> f=open('test.txt', 'w')
>>> f.encoding
'cp1252'

There's no good reason for the default encoding to not be UTF-8 these days, but this is a much bigger change. It's probably worth doing for 3.6, but may need more discussion...
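Until the default changes, the usual workaround is to pass the encoding explicitly. A minimal sketch, using a temporary directory and a made-up file name:

```python
import locale
import os
import tempfile

# What open() would pick if no encoding were given (platform dependent).
default_encoding = locale.getpreferredencoding(False)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.txt")

    # An explicit encoding makes the result independent of the locale.
    with open(path, "w", encoding="utf-8") as f:
        f.write("snowman: \u2603")
        chosen_encoding = f.encoding

    with open(path, "rb") as f:
        raw = f.read()

print(chosen_encoding)  # utf-8
```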

ncoghlan · 2015-11-17T00:38:26Z

Thanks. I suspect some of the Windows problems are indeed due to bogus assumptions in my draft tests, but at the same time, folks should be able to invoke subprocesses with Unicode values without needing extensive knowledge of platform specific Unicode handling arcana (whether that's *nix or Windows).

I've added Victor to the nosy list as well, since he'd previously expressed interest in implementing a cross-platform "force UTF-8" mode for 3.6 (akin to the default behaviour on Mac OS X), and I suspect these proposed test cases will be relevant to such a capability.

zooba · 2015-11-17T15:40:59Z

The thing about bogus assumptions is that Python should paper over those anyway. I can guarantee there's production code out there with the same assumptions.

How do we make this work? No idea in the context of the bytes/str filename convention differences.

ncoghlan · 2016-09-08T02:08:48Z

Likely to be resolved, or at least significantly updated, for 3.6 due to PEP-528 and PEP-529:

Using sys.stdin consistently at the default interactive prompt: bpo-1602
Improved Unicode handling in the Windows console: bpo-17620
Allowing text encoding and error handling to be specified in subprocess module APIs: bpo-6135

New change landing in 3.6:

Changing the Windows default encoding to UTF-8 to better match bytes handling conventions on *nix systems: bpo-27781

Likely deferred to 3.7:

providing a way to change the encoding of an existing stream: bpo-15216
utilities for clearing out surrogates from strings: bpo-18814
treating "wsgistr" as a serialisation format: bpo-22264
defining a formatting mini-language for hex output: bpo-22385
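For the subprocess item (bpo-6135), here is a minimal sketch of the encoding and errors parameters, assuming Python 3.6 or later where they are accepted by subprocess.run(). The child command is invented for illustration and deliberately emits a byte that is not valid UTF-8.

```python
import subprocess
import sys

# The child writes raw bytes, including one that is invalid in UTF-8.
child_code = r"import sys; sys.stdout.buffer.write(b'ok \xff')"

result = subprocess.run(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    encoding="utf-8",           # decode the pipe as UTF-8 ...
    errors="surrogateescape",   # ... keeping undecodable bytes recoverable
    check=True,
)

print(ascii(result.stdout))  # 'ok \udcff'
```

With these parameters result.stdout is a str, so the caller no longer has to decode the pipe by hand or rely on the locale encoding.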

ncoghlan · 2016-09-10T10:23:39Z

Added another issue to the tracking list:

Automatically decode binary data in json.loads: issue bpo-17909
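A quick sketch of what that looks like, assuming Python 3.6 or later, where json.loads() accepts bytes and detects UTF-8, UTF-16 or UTF-32 input. The sample document is made up.

```python
import json

document = {"name": "café"}

# Serialise once as text, then encode it two different ways.
as_text = json.dumps(document, ensure_ascii=False)
as_utf8 = as_text.encode("utf-8")
as_utf16 = as_text.encode("utf-16")  # includes a byte order mark

# No explicit decode step is needed before parsing.
from_utf8 = json.loads(as_utf8)
from_utf16 = json.loads(as_utf16)
```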

ncoghlan · 2018-06-09T11:03:02Z

With PEPs 538 and 540 merged for Python 3.7 (so we'll almost always use UTF-8 instead of ASCII when the platform nominates the C or POSIX locale as the currently active one), and Windows previously switching to assuming UTF-8 instead of mbcs for binary interfaces in Python 3.6, I think this tracking issue has served its purpose.
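One way to observe the PEP 540 behaviour is sketched below. It assumes Python 3.7 or later and forces UTF-8 mode with the -X utf8 option in a child interpreter, after removing environment variables that would otherwise influence the result. The C locale setting is only there to mimic a misconfigured environment.

```python
import os
import subprocess
import sys

# Mimic a misconfigured environment and drop any overriding variables.
env = dict(os.environ, LC_ALL="C")
env.pop("PYTHONIOENCODING", None)
env.pop("PYTHONUTF8", None)

child_code = "import sys; print(sys.flags.utf8_mode, sys.stdout.encoding)"

result = subprocess.run(
    [sys.executable, "-X", "utf8", "-c", child_code],
    stdout=subprocess.PIPE,
    env=env,
    check=True,
)

# Two fields: the UTF-8 mode flag and the encoding chosen for stdout.
report = result.stdout.decode("ascii").strip()
print(report)
```

Running the same child without -X utf8 shows what the interpreter picks on its own, which varies by platform and Python version.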

Of the issues previously mentioned here, the following are still open:

Improved Unicode handling in the Windows console: bpo-17620
Utilities for clearing out surrogates from strings: bpo-18814
Treating "wsgistr" as a serialisation format: bpo-22264
Defining a formatting mini-language for hex output: bpo-22385

I don't think any of those share enough characteristics to be worth continuing to track as a group, so I'm closing this meta-issue as out of date :)

ncoghlan · 2018-06-09T11:18:51Z

Correction: I just rejected my proposed wsgiref additions in bpo-22264 as failing to make a sufficient case for their practical utility, so that one is closed as well :)

ncoghlan · 2018-06-09T11:26:20Z

Adding a link to the first post in a series of articles from Victor Stinner regarding the evolution over time of the text encoding assumptions in Python 3's operating system interfaces:

https://vstinner.github.io/python30-listdir-undecodable-filenames.html

That way if anyone does stumble across this meta-issue, they'll have an easier time discovering that more readable version of the history involved :)

vstinner · 2018-06-09T16:09:48Z

https://vstinner.github.io/python30-listdir-undecodable-filenames.html

Oh, thanks for mentioning my series of articles.

It's also nice to see that we are now able to close this 4-year-old issue!

ncoghlan self-assigned this Oct 5, 2014

ncoghlan added the type-feature A feature request or enhancement label Oct 5, 2014

ncoghlan closed this as completed Jun 9, 2018

ezio-melotti transferred this issue from another repository Apr 10, 2022
