Tracking issue for adjustments to binary/text boundary handling #66745

Closed
ncoghlan opened this issue Oct 5, 2014 · 18 comments
Assignees
Labels
type-feature A feature request or enhancement

Comments

@ncoghlan
Contributor

ncoghlan commented Oct 5, 2014

BPO 22555
Nosy @warsaw, @brettcannon, @ncoghlan, @vstinner, @encukou, @berkerpeksag, @vadmium, @zooba, @rkuska
Dependencies
  • bpo-1602: windows console doesn't print or input Unicode
  • bpo-6135: subprocess seems to use local encoding and give no choice
  • bpo-9951: introduce bytes.hex method (also for bytearray and memoryview)
  • bpo-15216: Add encoding & errors parameters to TextIOWrapper.reconfigure()
  • bpo-17909: Autodetecting JSON encoding
  • bpo-19977: Use "surrogateescape" error handler for sys.stdin and sys.stdout on UNIX for the C locale
  • bpo-20284: patch to implement PEP 461 (%-interpolation for bytes)
  • bpo-22286: Allow backslashreplace error handler to be used on input
  • bpo-27781: Change sys.getfilesystemencoding() on Windows to UTF-8
Files
  • test_cmd_line_unicode.py: Test case for boundary handling
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/ncoghlan'
    closed_at = <Date 2018-06-09.11:03:02.545>
    created_at = <Date 2014-10-05.04:01:43.336>
    labels = ['type-feature']
    title = 'Tracking issue for adjustments to binary/text boundary handling'
    updated_at = <Date 2018-06-09.16:09:48.196>
    user = 'https://github.com/ncoghlan'

    bugs.python.org fields:

    activity = <Date 2018-06-09.16:09:48.196>
    actor = 'vstinner'
    assignee = 'ncoghlan'
    closed = True
    closed_date = <Date 2018-06-09.11:03:02.545>
    closer = 'ncoghlan'
    components = []
    creation = <Date 2014-10-05.04:01:43.336>
    creator = 'ncoghlan'
    dependencies = ['1602', '6135', '9951', '15216', '17909', '19977', '20284', '22286', '27781']
    files = ['41054']
    hgrepos = []
    issue_num = 22555
    keywords = []
    message_count = 18.0
    messages = ['228536', '228537', '228541', '236120', '237108', '242943', '249439', '251404', '254728', '254741', '254774', '254803', '274952', '275616', '319140', '319142', '319143', '319149']
    nosy_count = 11.0
    nosy_names = ['barry', 'brett.cannon', 'ncoghlan', 'vstinner', 'petr.viktorin', 'berker.peksag', 'martin.panter', 'bkabrda', 'Drekin', 'steve.dower', 'rkuska']
    pr_nums = []
    priority = 'normal'
    resolution = 'out of date'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue22555'
    versions = []

    @ncoghlan
    Contributor Author

    ncoghlan commented Oct 5, 2014

    See PEP-478 for the PEP level items targeting 3.5: http://www.python.org/dev/peps/pep-0478/

    This is a tracking issue to help me keep track of some lower level items that didn't make the release PEP:

    Going back and updating http://www.python.org/dev/peps/pep-0467/ based on the last round of feedback is also on my personal todo list for 3.5.

    @ncoghlan
    Contributor Author

    ncoghlan commented Oct 5, 2014

    PEP-461 binary interpolation implementation issue: http://bugs.python.org/issue20284

    @ncoghlan
    Contributor Author

    ncoghlan commented Oct 5, 2014

    Assigning to myself, since there's nothing specifically to *do* for this bug, it's just to make it easier to track the status of the various other RFEs it depends on.

    @ncoghlan ncoghlan self-assigned this Oct 5, 2014
    @ncoghlan ncoghlan added the type-feature A feature request or enhancement label Oct 5, 2014
    @ncoghlan
    Contributor Author

    Slavek et al - you folks may be interested in this one, as it tracks several issues that I consider relevant to the Python 2 -> 3 migration effort.

    Redoing the list in a way that should render the strike-throughs for closed issues:

    • Improved Windows console Unicode support (see
      https://pypi.python.org/pypi/win_unicode_console for details)
    • Changing the encoding and error handling of an existing stream
      (bpo-15216)
    • Allowing "backslashreplace" to be used on input (bpo-22286)
    • Adding "codecs.convert_surrogates" (bpo-18814)
    • Adding "wsgiref.util.dump_wsgistr" and "wsgiref.util.load_wsgistr" (bpo-22264)
    • Adding "bytes.hex", "bytearray.hex" and "memoryview.hex" (bpo-9951)
    • Adding a binary data formatting mini-language (depends on bpo-9951, likely needs to be escalated to a full PEP for design discussion visibility) (bpo-22385)
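
As a rough illustration of two of the items listed above, here is a minimal sketch assuming Python 3.5 or later, where `bytes.hex` (bpo-9951) and input-side "backslashreplace" (bpo-22286) are available. The sample payload is made up for the example.

```python
# Sample payload: valid UTF-8 for "café " followed by a stray 0xff byte.
payload = b"caf\xc3\xa9 \xff"

# bytes.hex() renders the raw bytes as a hexadecimal string.
hex_dump = payload.hex()

# On input, "backslashreplace" keeps undecodable bytes visible as \xNN
# escapes instead of raising UnicodeDecodeError.
text = payload.decode("utf-8", errors="backslashreplace")

print(hex_dump)
print(text)
```

The escaped form is lossy in the sense that it cannot be told apart from a literal backslash sequence in the input, so it is mainly useful for diagnostics rather than round-tripping.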

    @ncoghlan
    Contributor Author

    ncoghlan commented Mar 3, 2015

    PEP-461 landed, restoring binary interpolation support: https://hg.python.org/cpython/rev/8d802fb6ae32
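
For readers unfamiliar with PEP 461, a small sketch of what binary interpolation looks like, assuming Python 3.5 or later. The header name and body are illustrative values only.

```python
# Illustrative values; any bytes objects and integers would do.
header_name = b"Content-Length"
body = b"hello"

# %-interpolation works directly on bytes: %s accepts bytes-like objects
# and %d accepts integers, with no intermediate str round trip.
header = b"%s: %d\r\n" % (header_name, len(body))

print(header)
```

This is mostly of interest for wire protocols that mix ASCII-compatible framing with binary payloads.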

    There are also some relevant discussions around standardising the C.UTF-8 locale currently available on some Linux systems:

    Fedora RFE: https://bugzilla.redhat.com/show_bug.cgi?id=902094
    glibc RFE: https://sourceware.org/bugzilla/show_bug.cgi?id=17318
    glibc-alpha discussion: https://sourceware.org/ml/libc-alpha/2015-02/msg00247.html

    @ncoghlan
    Contributor Author

    I just went through the still-open issues referenced from here, and recommended deferring further consideration of all of the remaining items to 3.6:

    • utilities for clearing out surrogates from strings: bpo-18814
    • treating "wsgistr" as a serialisation format: bpo-22264
    • defining a formatting mini-language for hex output: bpo-22385
    • providing a way to change the encoding of an existing stream: bpo-15216

    I also added two new dependencies to this tracking issue:

    • Improved Unicode handling in the Windows console: bpo-17620
    • Using sys.stdin consistently at the default interactive prompt: bpo-1602
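
To make the "change the encoding of an existing stream" item (bpo-15216) more concrete, here is a minimal sketch assuming Python 3.7 or later, where `TextIOWrapper.reconfigure()` accepts `encoding` and `errors`. The in-memory `BytesIO` buffer stands in for a real stream such as `sys.stdout` and is an assumption of this example.

```python
import io

# An in-memory binary buffer stands in for a real underlying stream.
raw = io.BytesIO()

# The wrapper starts out with a restrictive ASCII configuration.
stream = io.TextIOWrapper(raw, encoding="ascii", errors="strict")

# Switch the existing wrapper to UTF-8 without recreating it.
stream.reconfigure(encoding="utf-8", errors="backslashreplace")

stream.write("caf\u00e9")
stream.flush()

written = raw.getvalue()
print(stream.encoding, written)
```

The encoding of a stream cannot be changed once data has been read from it, so in practice a call like this belongs near program start-up.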

    @ncoghlan
    Contributor Author

    For historical purposes, also linking the change in issue bpo-19977 to enable surrogateescape by default on stdin and stdout when the OS claims the locale encoding is ASCII.
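
A short sketch of what the "surrogateescape" error handler mentioned above does at the bytes/text boundary. The file name is a made-up example.

```python
# A byte string that is not valid ASCII (or valid UTF-8).
raw_name = b"report-\xff.txt"

# Decoding with surrogateescape smuggles the bad byte through as a lone
# surrogate code point (U+DCFF) instead of raising an error.
as_text = raw_name.decode("ascii", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = as_text.encode("ascii", errors="surrogateescape")

# A strict encode refuses the lone surrogate, which is how such data
# tends to surface as an error far away from where it entered.
try:
    as_text.encode("utf-8")
except UnicodeEncodeError:
    strict_failed = True
else:
    strict_failed = False

print(ascii(as_text), round_tripped, strict_failed)
```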

    @ncoghlan
    Contributor Author

    The Fedora RFE at https://bugzilla.redhat.com/show_bug.cgi?id=902094 to provide a C.UTF-8 locale by default has been addressed for Fedora 24 (the current Fedora Rawhide).

    This means the "LANG=C.UTF-8 python3" replacement for the ASCII-centric "LANG=C python3" will become more widely available over the course of 2016.

    @ncoghlan
    Contributor Author

    In discussing the Windows aspects of the bytes/text boundary handling issues with Brett & Steve recently, I realised I hadn't clearly defined what "fixed" looked like from my perspective.

    The attached test case is an initial attempt at that. It currently fails on a UTF-8 Linux system, with the "test_dash_c_unicode" case failing when the interpreter is misconfigured with "LANG=C" - the problem there is that when we encode from the -c command line argument back to bytes, we don't pass "surrogateescape".

    I'd be interested in knowing how much of this already passes on a Windows system.

    There's also a currently missing test case, which is to pass the info to the subprocess via stdin - "assert_python_ok()" doesn't currently support that, so implementing it will either require a new flag, or direct invocation of spawn_python().

    @zooba
    Member

    zooba commented Nov 16, 2015

    Right now all of the tests fail on Windows by default (cp437 for me).

    If I change the default IO encoding to utf-8 (hacked into pylifecycle.c, since PYTHONIOENCODING is ignored by subprocesses using -E), the four "Misconfigured" tests crash at the os.fsencode() call (as "mbcs:strict" cannot encode the characters - this may be a real issue, haven't dug into it yet).

    Adding more hacks to get past this point brings me back into the ASCII encoding performed by the test, and I'm not sure whether that's just an incorrect assumption for Windows or not.

    Separate issue: if I run "chcp 437" before the tests, the output is garbage. If I run "chcp 65001" then it shows the characters in the font correctly. The std streams encoding is taken from this value, but it doesn't map back to UTF-8, which is probably another issue. If I add a separate check in fileutils.c at _Py_device_encoding then I get UTF-8 enabled streams when the console is set for cp65001.

    However, there are still a number of places that use GetACP() to determine the locale and encoding to use, which is incorrect for Unicode-aware programs. In particular, this should not happen:

    >>> f=open('test.txt', 'w')
    >>> f.encoding
    'cp1252'

    There's no good reason for the default encoding to not be UTF-8 these days, but this is a much bigger change. It's probably worth doing for 3.6, but may need more discussion...
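
Until the default changes, code that needs predictable results can pass the encoding explicitly. A minimal sketch follows; the temporary directory and file name are only there to keep the example self-contained.

```python
import locale
import os
import tempfile

# What open() falls back to when no encoding is given (platform dependent).
default_encoding = locale.getpreferredencoding(False)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.txt")

    # Passing encoding= explicitly sidesteps the locale-derived default.
    with open(path, "w", encoding="utf-8") as f:
        f.write("caf\u00e9")
        explicit_encoding = f.encoding

    # Read the file back in binary mode to see what actually hit the disk.
    with open(path, "rb") as f:
        on_disk = f.read()

print(default_encoding, explicit_encoding, on_disk)
```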

    @ncoghlan
    Contributor Author

    Thanks. I suspect some of the Windows problems are indeed due to bogus assumptions in my draft tests, but at the same time, folks should be able to invoke subprocesses with Unicode values without needing extensive knowledge of platform specific Unicode handling arcana (whether that's *nix or Windows).

    I've added Victor to the nosy list as well, since he'd previously expressed interest in implementing a cross-platform "force UTF-8" mode for 3.6 (akin to the default behaviour on Mac OS X), and I suspect these proposed test cases will be relevant to such a capability.

    @zooba
    Member

    zooba commented Nov 17, 2015

    The thing about bogus assumptions is that Python should paper over those anyway. I can guarantee there's production code out there with the same assumptions.

    How do we make this work? No idea in the context of the bytes/str filename convention differences.

    @ncoghlan
    Contributor Author

    ncoghlan commented Sep 8, 2016

    Likely to be resolved, or at least significantly updated, for 3.6 due to PEP-528 and PEP-529:

    • Using sys.stdin consistently at the default interactive prompt: bpo-1602
    • Improved Unicode handling in the Windows console: bpo-17620
    • Allowing text encoding and error handling to be specified in subprocess module APIs: bpo-6135

    New change landing in 3.6:

    • Changing the Windows default encoding to UTF-8 to better match bytes handling conventions on *nix systems: bpo-27781

    Likely deferred to 3.7:

    • providing a way to change the encoding of an existing stream: bpo-15216
    • utilities for clearing out surrogates from strings: bpo-18814
    • treating "wsgistr" as a serialisation format: bpo-22264
    • defining a formatting mini-language for hex output: bpo-22385
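
For the subprocess item in the lists above (bpo-6135), a minimal sketch assuming Python 3.6 or later, where `subprocess.run()` accepts `encoding` and `errors`. The child command is a made-up one-liner that writes a mix of valid and invalid UTF-8 to its stdout.

```python
import subprocess
import sys

# Hypothetical child program: emits valid UTF-8 plus one stray 0xff byte.
child_code = "import sys; sys.stdout.buffer.write(b'caf\\xc3\\xa9 \\xff')"

# encoding/errors control how the pipe is decoded, independently of the
# parent's locale settings.
result = subprocess.run(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    encoding="utf-8",
    errors="backslashreplace",
    check=True,
)

child_output = result.stdout
print(child_output)
```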

    @ncoghlan
    Contributor Author

    Added another issue to the tracking list:

    • Automatically decode binary data in json.loads: issue bpo-17909
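
A brief sketch of the behaviour this item refers to, assuming Python 3.6 or later, where `json.loads()` accepts `bytes` and detects UTF-8, UTF-16 or UTF-32 input. The sample document is illustrative.

```python
import json

document = {"name": "caf\u00e9"}

# Serialise once, then encode the text in two different encodings.
serialised = json.dumps(document, ensure_ascii=False)
utf8_payload = serialised.encode("utf-8")
utf16_payload = serialised.encode("utf-16")

# json.loads() works directly on the bytes and detects the encoding.
from_utf8 = json.loads(utf8_payload)
from_utf16 = json.loads(utf16_payload)

print(from_utf8, from_utf16)
```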

    @ncoghlan
    Contributor Author

    ncoghlan commented Jun 9, 2018

    With PEPs 538 and 540 merged for Python 3.7 (so we'll almost always use UTF-8 instead of ASCII when the platform nominates the C or POSIX locale as the currently active one), and Windows previously switching to assuming UTF-8 instead of mbcs for binary interfaces in Python 3.6, I think this tracking issue has served its purpose.
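
One way to see which assumptions a given interpreter ended up with is to inspect them at runtime. This is only a diagnostic sketch; `sys.flags.utf8_mode` exists from Python 3.7 onwards, so it is read defensively here.

```python
import sys

# Encoding used to convert between str and bytes for OS interfaces.
fs_encoding = sys.getfilesystemencoding()

# Encoding of the standard output stream (may be None if stdout was replaced).
stdout_encoding = getattr(sys.stdout, "encoding", None)

# 1 when UTF-8 mode is active, 0 otherwise; None on versions before 3.7.
utf8_mode = getattr(sys.flags, "utf8_mode", None)

print(fs_encoding, stdout_encoding, utf8_mode)
```

Running it once normally and once with `LANG=C` set gives a quick comparison of how the interpreter reacts to the C locale.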

    Of the issues previously mentioned here, the following are still open:

    • Improved Unicode handling in the Windows console: bpo-17620
    • Utilities for clearing out surrogates from strings: bpo-18814
    • Treating "wsgistr" as a serialisation format: bpo-22264
    • Defining a formatting mini-language for hex output: bpo-22385

    I don't think any of those share enough characteristics to be worth continuing to track as a group, so I'm closing this meta-issue as out of date :)

    @ncoghlan ncoghlan closed this as completed Jun 9, 2018
    @ncoghlan
    Contributor Author

    ncoghlan commented Jun 9, 2018

    Correction: I just rejected my proposed wsgiref additions in bpo-22264 as failing to make a sufficient case for their practical utility, so that one is closed as well :)

    @ncoghlan
    Contributor Author

    ncoghlan commented Jun 9, 2018

    Adding a link to the first post in a series of articles from Victor Stinner regarding the evolution over time of the text encoding assumptions in Python 3's operating system interfaces:

    https://vstinner.github.io/python30-listdir-undecodable-filenames.html

    That way if anyone does stumble across this meta-issue, they'll have an easier time discovering that more readable version of the history involved :)

    @vstinner
    Member

    vstinner commented Jun 9, 2018

    https://vstinner.github.io/python30-listdir-undecodable-filenames.html

    Oh, thanks for mentioning my series of articles.

    It's also nice to see that we are now able to close this 4-year-old issue!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022