msg228536 - (view) |
Author: Alyssa Coghlan (ncoghlan) * |
Date: 2014-10-05 04:01 |
See PEP 478 for the PEP level items targeting 3.5: http://www.python.org/dev/peps/pep-0478/
This is a tracking issue to help me keep track of some lower level items that didn't make the release PEP:
* Improved Windows console Unicode support (see
https://pypi.python.org/pypi/win_unicode_console for details)
* Changing the encoding and error handling of an existing stream
(http://bugs.python.org/issue15216)
* Allowing "backslashreplace" to be used on input (http://bugs.python.org/issue22286)
* Adding "codecs.convert_surrogates" (http://bugs.python.org/issue18814)
* Adding "wsgiref.util.dump_wsgistr" and "wsgiref.util.load_wsgistr" (http://bugs.python.org/issue22264)
* Adding "bytes.hex", "bytearray.hex" and "memoryview.hex" (http://bugs.python.org/issue9951)
* Adding a binary data formatting mini-language (depends on 9951, likely needs to be escalated to a full PEP for design discussion visibility) (http://bugs.python.org/issue22385)
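For readers unfamiliar with the "hex" item in the list above, here is a minimal sketch of what it looks like in use. It assumes Python 3.5 or later, where these methods are available; the sample payload and variable names are made up for illustration.

```python
# Illustrative payload; any bytes-like data would do.
payload = bytes([0xDE, 0xAD, 0xBE, 0xEF])

# The same method is available on all three binary types.
hex_from_bytes = payload.hex()
hex_from_bytearray = bytearray(payload).hex()
hex_from_view = memoryview(payload).hex()

# bytes.fromhex() is the inverse operation.
round_tripped = bytes.fromhex(hex_from_bytes)

print(hex_from_bytes)
```

The formatting mini-language item goes beyond this (grouping, separators and so on); the sketch only covers the plain conversion.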
Going back and updating http://www.python.org/dev/peps/pep-0467/ based on the last round of feedback is also on my personal todo list for 3.5.
|
msg228537 - (view) |
Author: Alyssa Coghlan (ncoghlan) * |
Date: 2014-10-05 04:09 |
PEP 461 binary interpolation implementation issue: http://bugs.python.org/issue20284
|
msg228541 - (view) |
Author: Alyssa Coghlan (ncoghlan) * |
Date: 2014-10-05 07:21 |
Assigning to myself, since there's nothing specifically to *do* for this bug, it's just to make it easier to track the status of the various other RFEs it depends on.
|
msg236120 - (view) |
Author: Alyssa Coghlan (ncoghlan) * |
Date: 2015-02-17 02:09 |
Slavek et al - you folks may be interested in this one, as it tracks several issues that I consider relevant to the Python 2 -> 3 migration effort.
Redoing the list in a way that should render the strike-throughs for closed issues:
* Improved Windows console Unicode support (see
https://pypi.python.org/pypi/win_unicode_console for details)
* Changing the encoding and error handling of an existing stream
(issue 15216)
* Allowing "backslashreplace" to be used on input (issue 22286)
* Adding "codecs.convert_surrogates" (issue 18814)
* Adding "wsgiref.util.dump_wsgistr" and "wsgiref.util.load_wsgistr" (issue 22264)
* Adding "bytes.hex", "bytearray.hex" and "memoryview.hex" (issue 9951)
* Adding a binary data formatting mini-language (depends on issue 9951, likely needs to be escalated to a full PEP for design discussion visibility) (issue 22385)
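To illustrate the "backslashreplace on input" item from the list above, a small sketch, assuming Python 3.5 or later (where the handler can be used for decoding as well as encoding). The sample bytes are made up.

```python
# Latin-1 encoded bytes that are not valid ASCII.
raw = b"caf\xe9"

# On input, undecodable bytes become a visible escape sequence
# instead of raising UnicodeDecodeError.
decoded = raw.decode("ascii", errors="backslashreplace")

print(decoded)
```

Unlike "surrogateescape", the result is not intended to be turned back into the original bytes; it is aimed at logging and debugging output.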
|
msg237108 - (view) |
Author: Alyssa Coghlan (ncoghlan) * |
Date: 2015-03-03 06:25 |
PEP 461 landed, restoring binary interpolation support: https://hg.python.org/cpython/rev/8d802fb6ae32
There are also some relevant discussions around standardising the C.UTF-8 locale currently available on some Linux systems:
Fedora RFE: https://bugzilla.redhat.com/show_bug.cgi?id=902094
glibc RFE: https://sourceware.org/bugzilla/show_bug.cgi?id=17318
glibc-alpha discussion: https://sourceware.org/ml/libc-alpha/2015-02/msg00247.html
|
msg242943 - (view) |
Author: Alyssa Coghlan (ncoghlan) * |
Date: 2015-05-12 05:09 |
I just went through the still-open issues referenced from here, and recommended deferring further consideration of all of the remaining items to 3.6:
* utilities for clearing out surrogates from strings: issue 18814
* treating "wsgistr" as a serialisation format: issue 22264
* defining a formatting mini-language for hex output: issue 22385
* providing a way to change the encoding of an existing stream: issue 15216
I also added two new dependencies to this tracking issue:
* Improved Unicode handling in the Windows console: issue 17620
* Using sys.stdin consistently at the default interactive prompt: issue 1602
|
msg249439 - (view) |
Author: Alyssa Coghlan (ncoghlan) * |
Date: 2015-08-31 23:56 |
For historical purposes, also linking the change in issue #19977 to enable surrogateescape by default on stdin and stdout when the OS claims the locale encoding is ASCII.
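As background on what "surrogateescape" does at the bytes/text boundary, a minimal sketch. It only demonstrates the error handler itself, not the stdin/stdout configuration change from that issue, and the file name is a made-up example.

```python
# Bytes that cannot be decoded as ASCII (hypothetical file name).
raw = b"name-\xff.txt"

# surrogateescape smuggles the undecodable byte through as a lone
# surrogate code point (0xFF becomes U+DCFF).
text = raw.decode("ascii", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("ascii", errors="surrogateescape")

# A strict encoder refuses the lone surrogate.
try:
    text.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```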
|
msg251404 - (view) |
Author: Alyssa Coghlan (ncoghlan) * |
Date: 2015-09-23 05:16 |
The Fedora RFE at https://bugzilla.redhat.com/show_bug.cgi?id=902094 to provide a C.UTF-8 locale by default has been addressed for Fedora 24 (the current Fedora Rawhide).
This means the "LANG=C.UTF-8 python3" replacement for the ASCII-centric "LANG=C python3" will become more widely available over the course of 2016.
|
msg254728 - (view) |
Author: Alyssa Coghlan (ncoghlan) * |
Date: 2015-11-16 12:32 |
In discussing the Windows aspects of the bytes/text boundary handling issues with Brett & Steve recently, I realised I hadn't clearly defined what "fixed" looked like from my perspective.
The attached test case is an initial attempt at that. It currently fails on a UTF-8 Linux system: the "test_dash_c_unicode" case fails when the interpreter is misconfigured with "LANG=C". The problem there is that when we encode the -c command line argument back to bytes, we don't pass "surrogateescape".
I'd be interested in knowing how much of this already passes on a Windows system.
There's also a currently missing test case, which is to pass the info to the subprocess via stdin - "assert_python_ok()" doesn't currently support that, so implementing it will either require a new flag, or direct invocation of spawn_python().
|
msg254741 - (view) |
Author: Steve Dower (steve.dower) * |
Date: 2015-11-16 18:27 |
Right now all of the tests fail on Windows by default (cp437 for me).
If I change the default IO encoding to utf-8 (hacked into pylifecycle.c, since PYTHONIOENCODING is ignored by subprocesses using -E), the four "Misconfigured" tests crash at the os.fsencode() call (as "mbcs:strict" cannot encode the characters - this may be a real issue, haven't dug into it yet).
Adding more hacks to get past this point brings me back into the ASCII encoding performed by the test, and I'm not sure whether that's just an incorrect assumption for Windows or not.
Separate issue: if I run "chcp 437" before the tests, the output is garbage. If I run "chcp 65001" then it shows the characters in the font correctly. The std streams encoding is taken from this value, but it doesn't map back to UTF-8, which is probably another issue. If I add a separate check in fileutils.c at _Py_device_encoding then I get UTF-8 enabled streams when the console is set for cp65001.
However, there are still a number of places that use GetACP() to determine the locale and encoding to use, which is incorrect for Unicode-aware programs. In particular, this should not happen:
>>> f=open('test.txt', 'w')
>>> f.encoding
'cp1252'
There's no good reason for the default encoding to not be UTF-8 these days, but this is a much bigger change. It's probably worth doing for 3.6, but may need more discussion...
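Until the default changes, code that needs a predictable result can pass the encoding explicitly. A minimal sketch follows; the temporary directory, file name and sample text are illustrative only.

```python
import locale
import os
import tempfile

# What open() falls back to when no encoding is given (platform dependent).
default_encoding = locale.getpreferredencoding(False)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.txt")

    # An explicit encoding makes the result independent of the locale.
    with open(path, "w", encoding="utf-8") as f:
        explicit_encoding = f.encoding
        f.write("na\u00efve caf\u00e9")

    # Read the raw bytes back to confirm what was written.
    with open(path, "rb") as f:
        raw = f.read()

print(default_encoding, explicit_encoding)
```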
|
msg254774 - (view) |
Author: Alyssa Coghlan (ncoghlan) * |
Date: 2015-11-17 00:38 |
Thanks. I suspect some of the Windows problems are indeed due to bogus assumptions in my draft tests, but at the same time, folks should be able to invoke subprocesses with Unicode values without needing extensive knowledge of platform specific Unicode handling arcana (whether that's *nix or Windows).
I've added Victor to the nosy list as well, since he'd previously expressed interest in implementing a cross-platform "force UTF-8" mode for 3.6 (akin to the default behaviour on Mac OS X), and I suspect these proposed test cases will be relevant to such a capability.
|
msg254803 - (view) |
Author: Steve Dower (steve.dower) * |
Date: 2015-11-17 15:40 |
The thing about bogus assumptions is that Python should paper over those anyway. I can guarantee there's production code out there with the same assumptions.
How do we make this work? No idea in the context of the bytes/str filename convention differences.
|
msg274952 - (view) |
Author: Alyssa Coghlan (ncoghlan) * |
Date: 2016-09-08 02:08 |
Likely to be resolved, or at least significantly updated, for 3.6 due to PEP 528 and PEP 529:
* Using sys.stdin consistently at the default interactive prompt: issue 1602
* Improved Unicode handling in the Windows console: issue 17620
* Allowing text encoding and error handling to be specified in subprocess module APIs: issue 6135
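A sketch of the subprocess item, assuming the "encoding" and "errors" keyword arguments as they are available in Python 3.6 and later. Setting PYTHONIOENCODING for the child is a choice made for this example so both sides agree on UTF-8.

```python
import os
import subprocess
import sys

# Make the child write UTF-8 regardless of the parent's locale.
child_env = dict(os.environ, PYTHONIOENCODING="utf-8")

result = subprocess.run(
    [sys.executable, "-c", "print('caf\\u00e9')"],
    stdout=subprocess.PIPE,
    env=child_env,
    encoding="utf-8",   # decode the pipe as UTF-8 text
    errors="strict",    # fail loudly on undecodable output
    check=True,
)

text = result.stdout.strip()
```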
New change landing in 3.6:
* Changing the Windows default encoding to UTF-8 to better match bytes handling conventions on *nix systems: issue 27781
Likely deferred to 3.7:
* providing a way to change the encoding of an existing stream: issue 15216
* utilities for clearing out surrogates from strings: issue 18814
* treating "wsgistr" as a serialisation format: issue 22264
* defining a formatting mini-language for hex output: issue 22385
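For the "change the encoding of an existing stream" item, a sketch assuming the TextIOWrapper.reconfigure() method as available in Python 3.7 and later. The in-memory buffer stands in for a real stream such as sys.stdout.

```python
import io

# In-memory stand-in for a real binary stream.
buffer = io.BytesIO()
stream = io.TextIOWrapper(buffer, encoding="ascii", errors="strict")
stream.write("plain ")

# Switch encoding and error handler on the already-open stream.
# Pending output is flushed with the old settings first.
stream.reconfigure(encoding="utf-8", errors="backslashreplace")
stream.write("caf\u00e9")
stream.flush()

data = buffer.getvalue()
```

Note that this works for a stream being written to; changing the encoding after reading has started is not supported.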
|
msg275616 - (view) |
Author: Alyssa Coghlan (ncoghlan) * |
Date: 2016-09-10 10:23 |
Added another issue to the tracking list:
* Automatically decode binary data in json.loads: issue #17909
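A small sketch of that behaviour, assuming Python 3.6 or later, where json.loads accepts bytes and bytearray input and detects UTF-8, UTF-16 or UTF-32. The sample document is made up.

```python
import json

doc = {"name": "caf\u00e9"}

# UTF-8 encoded JSON, passed as bytes without decoding first.
utf8_result = json.loads(json.dumps(doc).encode("utf-8"))

# UTF-16 (with BOM) is detected as well.
utf16_payload = json.dumps(doc, ensure_ascii=False).encode("utf-16")
utf16_result = json.loads(utf16_payload)
```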
|
msg319140 - (view) |
Author: Alyssa Coghlan (ncoghlan) * |
Date: 2018-06-09 11:03 |
With PEPs 538 and 540 merged for Python 3.7 (so we'll almost always use UTF-8 instead of ASCII when the platform nominates the C or POSIX locale as the currently active one), and Windows previously switching to assuming UTF-8 instead of mbcs for binary interfaces in Python 3.6, I think this tracking issue has served its purpose.
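One way to observe the PEP 540 part of this is to start a child interpreter with UTF-8 mode forced on. This is a sketch assuming Python 3.7 or later; the "-X utf8" option and "sys.flags.utf8_mode" come from PEP 540, while the variable names are illustrative.

```python
import subprocess
import sys

# Ask the child to report whether UTF-8 mode is active and which
# encoding it uses for file names.
code = "import sys; print(sys.flags.utf8_mode, sys.getfilesystemencoding())"

result = subprocess.run(
    [sys.executable, "-X", "utf8", "-c", code],
    stdout=subprocess.PIPE,
    encoding="utf-8",
    check=True,
)

mode, fs_encoding = result.stdout.split()
print(mode, fs_encoding)
```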
Of the issues previously mentioned here, the following are still open:
* Improved Unicode handling in the Windows console: issue 17620
* Utilities for clearing out surrogates from strings: issue 18814
* Treating "wsgistr" as a serialisation format: issue 22264
* Defining a formatting mini-language for hex output: issue 22385
I don't think any of those share enough characteristics to be worth continuing to track as a group, so I'm closing this meta-issue as out of date :)
|
msg319142 - (view) |
Author: Alyssa Coghlan (ncoghlan) * |
Date: 2018-06-09 11:18 |
Correction: I just rejected my proposed wsgiref additions in issue 22264 as failing to make a sufficient case for their practical utility, so that one is closed as well :)
|
msg319143 - (view) |
Author: Alyssa Coghlan (ncoghlan) * |
Date: 2018-06-09 11:26 |
Adding a link to the first post in a series of articles from Victor Stinner regarding the evolution over time of the text encoding assumptions in Python 3's operating system interfaces:
https://vstinner.github.io/python30-listdir-undecodable-filenames.html
That way if anyone does stumble across this meta-issue, they'll have an easier time discovering that more readable version of the history involved :)
|
msg319149 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2018-06-09 16:09 |
> https://vstinner.github.io/python30-listdir-undecodable-filenames.html
Oh, thanks for mentioning my series of articles.
It's also nice to see that we are now able to close this 4-year-old issue!
|
Date | User | Action | Args |
2022-04-11 14:58:08 | admin | set | github: 66745 |
2018-06-09 16:09:48 | vstinner | set | messages:
+ msg319149 |
2018-06-09 11:26:20 | ncoghlan | set | messages:
+ msg319143 |
2018-06-09 11:18:50 | ncoghlan | set | messages:
+ msg319142 |
2018-06-09 11:03:02 | ncoghlan | set | status: open -> closed messages:
+ msg319140
dependencies:
- Python interactive console doesn't use sys.stdin for input, Add utilities to "clean" surrogate code points from strings, Add wsgiref.util.dump_wsgistr & load_wsgistr, Define a binary output formatting mini-language for *.hex() resolution: out of date stage: resolved |
2016-09-10 10:23:39 | ncoghlan | set | dependencies:
+ Autodetecting JSON encoding messages:
+ msg275616 |
2016-09-08 02:08:48 | ncoghlan | set | dependencies:
+ subprocess seems to use local encoding and give no choice, Change sys.getfilesystemencoding() on Windows to UTF-8 messages:
+ msg274952 |
2015-11-17 15:40:59 | steve.dower | set | messages:
+ msg254803 |
2015-11-17 00:38:26 | ncoghlan | set | nosy:
+ vstinner messages:
+ msg254774
|
2015-11-16 18:27:01 | steve.dower | set | messages:
+ msg254741 |
2015-11-16 12:32:58 | ncoghlan | set | files:
+ test_cmd_line_unicode.py nosy:
+ brett.cannon, steve.dower messages:
+ msg254728
|
2015-09-23 05:16:49 | ncoghlan | set | messages:
+ msg251404 |
2015-08-31 23:56:33 | ncoghlan | set | dependencies:
+ Use "surrogateescape" error handler for sys.stdin and sys.stdout on UNIX for the C locale messages:
+ msg249439 |
2015-07-21 07:12:57 | ethan.furman | set | nosy:
- ethan.furman
|
2015-05-13 18:43:20 | Drekin | set | nosy:
+ Drekin
|
2015-05-13 17:25:17 | ethan.furman | set | nosy:
+ ethan.furman
|
2015-05-12 06:16:57 | berker.peksag | set | nosy:
+ berker.peksag
|
2015-05-12 05:09:39 | ncoghlan | set | dependencies:
+ windows console doesn't print or input Unicode, Python interactive console doesn't use sys.stdin for input messages:
+ msg242943 |
2015-03-03 06:25:09 | ncoghlan | set | messages:
+ msg237108 |
2015-02-17 02:09:43 | ncoghlan | set | nosy:
+ petr.viktorin, bkabrda, rkuska messages:
+ msg236120
|
2014-10-05 11:59:05 | barry | set | nosy:
+ barry
|
2014-10-05 07:21:32 | ncoghlan | set | assignee: ncoghlan type: enhancement messages:
+ msg228541 |
2014-10-05 06:57:56 | martin.panter | set | nosy:
+ martin.panter
|
2014-10-05 04:09:50 | ncoghlan | set | dependencies:
+ introduce bytes.hex method (also for bytearray and memoryview), Add encoding & errors parameters to TextIOWrapper.reconfigure(), Add utilities to "clean" surrogate code points from strings, patch to implement PEP 461 (%-interpolation for bytes), Add wsgiref.util.dump_wsgistr & load_wsgistr, Allow backslashreplace error handler to be used on input, Define a binary output formatting mini-language for *.hex() messages:
+ msg228537 |
2014-10-05 04:01:43 | ncoghlan | create | |