Add cp65001 codec #57425

Closed
vstinner opened this issue Oct 18, 2011 · 10 comments

Comments

@vstinner
Member

BPO 13216
Nosy @loewis, @amauryfa, @vstinner
Files
  • cp65001.py

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2011-10-26.23:43:05.986>
    created_at = <Date 2011-10-18.22:44:30.671>
    labels = ['expert-unicode']
    title = 'Add cp65001 codec'
    updated_at = <Date 2011-10-26.23:44:52.551>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2011-10-26.23:44:52.551>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2011-10-26.23:43:05.986>
    closer = 'vstinner'
    components = ['Unicode']
    creation = <Date 2011-10-18.22:44:30.671>
    creator = 'vstinner'
    dependencies = []
    files = ['23453']
    hgrepos = []
    issue_num = 13216
    keywords = []
    message_count = 10.0
    messages = ['145871', '145872', '145891', '145894', '145901', '145922', '145932', '146463', '146464', '146466']
    nosy_count = 4.0
    nosy_names = ['loewis', 'amaury.forgeotdarc', 'vstinner', 'python-dev']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue13216'
    versions = ['Python 3.3']

    @vstinner
    Member Author

    Thanks to bpo-12281, it is now trivial to implement any Windows code page in Python. I don't know if existing code pages (e.g. cp932) should use codecs.code_page_encode/.code_page_decode on Windows, or continue to use the (portable) Python code.

    Users want the code page 65001, even if I consider that it is useless to set the ANSI code page to 65001 in a console (see issue bpo-1602), but that's a different story. Attached patch implements this code page.
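
As a rough illustration of what the code page functions from bpo-12281 make possible, the sketch below encodes text through `codecs.code_page_encode` with code page 65001 when that function exists (it is only present on Windows) and otherwise falls back to the portable `utf-8` codec. The helper name `encode_cp_utf8` is hypothetical and is not taken from the attached patch.

```python
import codecs

CP_UTF8 = 65001  # Windows code page number for UTF-8


def encode_cp_utf8(text, errors="strict"):
    """Encode text with Windows code page 65001, or with utf-8 elsewhere."""
    if hasattr(codecs, "code_page_encode"):
        # Windows only: returns a (bytes, characters_consumed) tuple.
        data, _consumed = codecs.code_page_encode(CP_UTF8, text, errors)
        return data
    # Portable fallback for platforms without the code page functions.
    return text.encode("utf-8", errors)


print(encode_cp_utf8("caf\u00e9"))
```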

    @vstinner
    Member Author

    > Users want the code page 65001

    See issues bpo-6058, bpo-7441 and bpo-10920.

    @loewis
    Mannequin

    loewis mannequin commented Oct 19, 2011

    We shouldn't use the MS codec if we have our own, as they may differ.

    As for the 65001 bug: is that actually solved by this codec?

    @vstinner
    Member Author

    > We shouldn't use the MS codec if we have our own, as they may differ.

    Ok, I agree. MS codec has a nice replacement behaviour (search for a similar
    glyph): cp1252 encodes Ł to b'L' for example. Our codec raises a
    UnicodeEncodeError on u'\u0141'.encode('cp1252').
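
The behaviour of Python's own cp1252 codec described above can be checked on any platform with a short sketch. It only exercises the portable Python codec, not the Microsoft one, so the best-fit mapping to b'L' is not reproduced here.

```python
text = "\u0141"  # LATIN CAPITAL LETTER L WITH STROKE, absent from cp1252

try:
    text.encode("cp1252")
    strict_result = "encoded"
except UnicodeEncodeError:
    # Python's table-based codec has no entry for U+0141.
    strict_result = "UnicodeEncodeError"

# With the 'replace' error handler the character becomes a question mark.
replaced = text.encode("cp1252", errors="replace")

print(strict_result, replaced)
```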

    > As for the 65001 bug: is that actually solved by this codec?

    Sorry, which bug?

    See tests using CP_UTF8 in test_codecs. Depending on the Windows version, you
    don't get the same behaviour on surrogates. Before Windows Vista, surrogates
    were always encoded, whereas you can now choose the behaviour using the Python
    error handler:

            if self.vista_or_later():
                tests.append(('\udc80', 'strict', None)) # None=UnicodeEncodeError
                tests.append(('\udc80', 'ignore', b''))
                tests.append(('\udc80', 'replace', b'?'))
            else:
                tests.append(('\udc80', 'strict', b'\xed\xb2\x80'))
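
For comparison, Python's portable utf-8 codec treats the same lone surrogate in a similar way, which can be sketched without Windows. This is only an analogy using the built-in `utf-8` codec and its `surrogatepass` error handler, not the CP_UTF8 code path tested above; the helper `try_encode` is a hypothetical name.

```python
lone_surrogate = "\udc80"


def try_encode(text, errors):
    """Return the encoded bytes, or None if encoding raises an error."""
    try:
        return text.encode("utf-8", errors)
    except UnicodeEncodeError:
        return None


results = {
    errors: try_encode(lone_surrogate, errors)
    for errors in ("strict", "ignore", "replace", "surrogatepass")
}

for errors, encoded in results.items():
    print(errors, encoded)
```

With `surrogatepass` the surrogate is encoded as three bytes, the same byte sequence the pre-Vista branch of the test expects.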

    @vstinner
    Member Author

    > I consider that it is useless to set the ANSI code page to 65001 in a console

    I did more tests on the Windows console, focused on output, see:
    http://bugs.python.org/issue1602#msg145898

    I was wrong, it *is* useful to change the code page to 65001. Even if we have fully Unicode compliant sys.stdout and sys.stderr, setting the code page to CP_UTF8 (65001) does still improve Unicode support in some cases:

    • if the output (stdout and/or stderr) is redirected
    • if you encode Unicode to the console code page to use directly sys.stdout.buffer and sys.stderr.buffer
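
The second case can be sketched as follows: the program encodes text itself and writes the bytes to a binary stream. The helper `write_text_as_bytes` is hypothetical, and an `io.BytesIO` object stands in for `sys.stdout.buffer` so that the result can be inspected; the encoding passed in is assumed to match the console code page.

```python
import io


def write_text_as_bytes(binary_stream, text, encoding="utf-8", errors="replace"):
    """Encode text explicitly and write the bytes to a binary stream."""
    data = text.encode(encoding, errors)
    binary_stream.write(data)
    return len(data)


# In a real program the stream would be sys.stdout.buffer or sys.stderr.buffer.
demo = io.BytesIO()
written = write_text_as_bytes(demo, "\u0141\n")
print(written, demo.getvalue())
```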

    @loewis
    Mannequin

    loewis mannequin commented Oct 19, 2011

    >> As for the 65001 bug: is that actually solved by this codec?

    > Sorry, which bug?

    bpo-6501 and friends (isn't it interesting that the issue of code page
    65001 is reported as bug 6501?)

    @vstinner
    Member Author

    >> Sorry, which bug?

    > bpo-6501 and friends

    Hum, this particular issue, bpo-6501, doesn't concern the code page 65001. The typical use case (issues bpo-7441 and bpo-10920) is:
    ------------
    C:\victor\cpython>chcp 65001
    Active code page: 65001

    C:\victor\cpython>pcbuild\python_d.exe
    Fatal Python error: Py_Initialize: can't initialize sys standard streams
    LookupError: unknown encoding: cp65001
    ------------

    The console and console output code pages may be changed by something else.

    The current workaround is to set the PYTHONIOENCODING environment variable to utf-8, but as explained in msg132831, the workaround is not applicable if Python is embedded or if the program has been frozen by cx-freeze ("cx-freeze deliberately sets Py_IgnoreEnvironmentFlag").
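
A minimal sketch of that workaround, assuming a regular (not embedded, not frozen) interpreter: a child Python process is started with PYTHONIOENCODING set and reports the encoding of its standard output. The helper name `child_stdout_encoding` is hypothetical.

```python
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set and return
    the encoding it reports for sys.stdout."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print(child_stdout_encoding("utf-8"))
```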

    --

    The issue bpo-6501 was a bug in io.device_encoding(). I fixed it in Python 3.3 and I have been waiting for 5 months for Graham Dumpleton before backporting the fix. The issue also suggests not failing if the encoding cannot be found (I dislike this idea).

    @python-dev
    Mannequin

    python-dev mannequin commented Oct 26, 2011

    New changeset 0eac706d82d1 by Victor Stinner in branch 'default':
    Fix the issue number of my cp65001 commit: 13247 => issue bpo-13216
    http://hg.python.org/cpython/rev/0eac706d82d1

    @vstinner
    Member Author

    New changeset 2cad20e2e588 by Victor Stinner in branch 'default':
    Close bpo-13247: Add cp65001 codec, the Windows UTF-8 (CP_UTF8)
    http://hg.python.org/cpython/rev/2cad20e2e588

    @vstinner
    Member Author

    Lib/encodings/cp65001.py uses a little trick to mark the codec as specific to Windows:
    -----------------

    if not hasattr(codecs, 'code_page_encode'):
        raise LookupError("cp65001 encoding is only available on Windows")
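
The effect of this check can be probed from user code with a small sketch. Both helper names are hypothetical. The result of looking up `cp65001` depends on the platform and the Python version, so no particular output is claimed for it.

```python
import codecs
import sys


def has_windows_code_page_api():
    """True when the Windows-only code page functions are available."""
    return hasattr(codecs, "code_page_encode")


def codec_name_or_none(encoding):
    """Return the normalized codec name, or None if the lookup fails."""
    try:
        return codecs.lookup(encoding).name
    except LookupError:
        return None


print(sys.platform, has_windows_code_page_api())
print(codec_name_or_none("cp65001"))  # platform- and version-dependent
```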

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    abeir pushed a commit to abeir/depot_tools that referenced this issue Apr 24, 2024
    The fix_encoding module within depot_tools was included back in the python2[1] days as a be-all encoding-fix boilerplate that is called across depot_tools scripts.
    
    However, now that depot_tools has officially deprecated support for py2 and supports >= 3.8[2], the boilerplate is not needed anymore.
    
    * `fix_win_codec()`[3] The 'cp65001' codec issue this fixes is fixed in python 3.3[4].
    * `fix_default_encoding()`[5] python3 defaults to utf8.
    * `fix_win_sys_argv()`[6] sys.argv unicode issue is fixed in python3[7].
    * `fix_win_console()`[8] Fixed[9].
    
    TODO: <Get performance changes in windows>.
    
    [1] https://codereview.chromium.org/6721029
    [2] https://crrev.com/371aa997c04791d21e222ed43a1a0d55b450dd53/README.md
    [3] https://source.chromium.org/chromium/chromium/tools/depot_tools/+/main:fix_encoding.py;l=123-132;drc=cfa826c9845122d445dce4f51f556381865dbed3
    [4] python/cpython#57425 (comment)
    [5] https://source.chromium.org/chromium/chromium/tools/depot_tools/+/main:fix_encoding.py;l=29-66;drc=cfa826c9845122d445dce4f51f556381865dbed3
    [6] https://crsrc.org/d/fix_encoding.py;l=73-120;drc=cfa826c9845122d445dce4f51f556381865dbed3
    [7] python/cpython#46381 (comment)
    [8] https://source.chromium.org/chromium/chromium/tools/depot_tools/+/main:fix_encoding.py;l=315-344;drc=cfa826c9845122d445dce4f51f556381865dbed3
    [9] python/cpython#45943 (comment)
    
    Bug: 1501984
    Change-Id: I1d512a4b1bfe14e680ac0aa08027849b999cc638
    abeir pushed a commit to abeir/depot_tools that referenced this issue Apr 24, 2024
    The fix_encoding module within depot_tools was included back in the python2[1] days as a be-all encoding-fix boilerplate that is called across depot_tools scripts.
    
    However, now that depot_tools has officially deprecated support for py2 and supports >= 3.8[2], the boilerplate is not needed anymore.
    
    * `fix_win_codec()`[3] The 'cp65001' codec issue this fixes is fixed in python 3.3[4].
    * `fix_default_encoding()`[5] python3 defaults to utf8.
    * `fix_win_sys_argv()`[6] sys.argv unicode issue is fixed in python3[7].
    * `fix_win_console()`[8] Fixed[9].
    
    Benchmarking on windows:
    * Baseline (http://gpaste/6701096112750592): ~1min 41sec.
    
    [1] https://codereview.chromium.org/6721029
    [2] https://crrev.com/371aa997c04791d21e222ed43a1a0d55b450dd53/README.md
    [3] https://source.chromium.org/chromium/chromium/tools/depot_tools/+/main:fix_encoding.py;l=123-132;drc=cfa826c9845122d445dce4f51f556381865dbed3
    [4] python/cpython#57425 (comment)
    [5] https://source.chromium.org/chromium/chromium/tools/depot_tools/+/main:fix_encoding.py;l=29-66;drc=cfa826c9845122d445dce4f51f556381865dbed3
    [6] https://crsrc.org/d/fix_encoding.py;l=73-120;drc=cfa826c9845122d445dce4f51f556381865dbed3
    [7] python/cpython#46381 (comment)
    [8] https://source.chromium.org/chromium/chromium/tools/depot_tools/+/main:fix_encoding.py;l=315-344;drc=cfa826c9845122d445dce4f51f556381865dbed3
    [9] python/cpython#45943 (comment)
    
    Bug: 1501984
    Change-Id: I1d512a4b1bfe14e680ac0aa08027849b999cc638
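
Two of the claims in the commit message above, that Python 3 defaults to UTF-8 and that `sys.argv` holds Unicode strings, can be spot-checked with a short sketch. It only inspects the running interpreter and says nothing about depot_tools itself.

```python
import sys

# Python 3 reports utf-8 as the default str <-> bytes encoding.
default_encoding = sys.getdefaultencoding()

# Command-line arguments are already text (str) objects in Python 3.
argv_is_text = all(isinstance(arg, str) for arg in sys.argv)

print(default_encoding, argv_is_text)
```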