Implement incremental decoder for cp65001 #64773

Closed
vstinner opened this issue Feb 9, 2014 · 9 comments
Labels
type-feature A feature request or enhancement

Comments

@vstinner
Member

vstinner commented Feb 9, 2014

BPO 20574
Nosy @loewis, @vstinner, @larryhastings, @serhiy-storchaka
Files
  • incremental_cp_utf8.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2015-03-18.13:22:37.172>
    created_at = <Date 2014-02-09.13:18:25.151>
    labels = ['type-feature']
    title = 'Implement incremental decoder for cp65001'
    updated_at = <Date 2015-03-18.13:22:37.171>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2015-03-18.13:22:37.171>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2015-03-18.13:22:37.172>
    closer = 'vstinner'
    components = []
    creation = <Date 2014-02-09.13:18:25.151>
    creator = 'vstinner'
    dependencies = []
    files = ['34008']
    hgrepos = []
    issue_num = 20574
    keywords = ['patch']
    message_count = 9.0
    messages = ['210759', '210764', '210783', '213905', '213906', '213907', '213908', '213923', '213926']
    nosy_count = 5.0
    nosy_names = ['loewis', 'vstinner', 'larry', 'python-dev', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue20574'
    versions = ['Python 3.5']

    @vstinner
    Member Author

    vstinner commented Feb 9, 2014

    (Follow-up to bpo-20538 and bpo-20571.) The attached patch implements incremental decoders for multibyte code pages on Windows, in particular for CP_UTF8, known as "cp65001" in Python.

    Code pages 932, 936, 949, 950 and 1361 have had an incremental decoder since:
    ---
    changeset: 38817:549c547700af
    branch: legacy-trunk
    user: Martin v. Löwis <martin@v.loewis.de>
    date: Wed Jun 14 05:21:04 2006 +0000
    files: Doc/api/concrete.tex Include/unicodeobject.h Lib/encodings/mbcs.py Misc/NEWS Modules/_codecsmodule.c Objects/unicodeobject.c
    description:
    Patch bpo-1455898: Incremental mode for "mbcs" codec.
    ---

    Python currently uses IsDBCSLeadByteEx():
    http://msdn.microsoft.com/en-us/library/windows/desktop/dd318667%28v=vs.85%29.aspx

    And CharPrevA():
    http://msdn.microsoft.com/en-us/library/windows/desktop/ms647471%28v=vs.85%29.aspx

    But IsDBCSLeadByteEx() only supports code pages 932, 936, 949, 950 and 1361.
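
Without a lead-byte query such as IsDBCSLeadByteEx(), a decoder has to find an incomplete trailing sequence some other way. The sketch below shows one possible approach for UTF-8, written in Python for illustration. It is not the method used by the attached patch (the patch is not reproduced in this thread), and `incomplete_tail_length` is a hypothetical helper name.

```python
def incomplete_tail_length(data: bytes) -> int:
    """Return how many trailing bytes begin a UTF-8 sequence that is not yet complete.

    Illustration only: it classifies lead bytes by their high bits and does
    not validate the sequence; validation is left to the real decoder.
    """
    # A UTF-8 sequence is at most 4 bytes, so an incomplete tail is at most 3 bytes.
    for back in range(1, min(3, len(data)) + 1):
        byte = data[-back]
        if byte & 0xC0 == 0x80:
            continue  # continuation byte: keep walking back to the lead byte
        if byte >= 0xF0:
            needed = 4
        elif byte >= 0xE0:
            needed = 3
        elif byte >= 0xC0:
            needed = 2
        else:
            needed = 1  # ASCII byte, always complete
        # The tail is incomplete only if fewer bytes are present than the lead byte announces.
        return back if back < needed else 0
    return 0


euro = "\u20ac".encode("utf-8")  # three bytes: e2 82 ac
tail_full = incomplete_tail_length(b"abc" + euro)
tail_cut = incomplete_tail_length(b"abc" + euro[:2])
```

An incremental decoder could hold back that many bytes and prepend them to the next chunk.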

    Python has supported code page 65001 (codec "cp65001") since Python 3.3. New tests on incremental decoders were added in Python 3.4: I added a skip for cp65001 since it was not supported (bpo-20571). This issue implements the incremental decoder and so removes the skip.
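
To show what an incremental decoder provides, here is a minimal sketch that decodes a byte string in small chunks so that multibyte sequences are split across chunk boundaries. It uses "utf-8" as a portable stand-in, on the assumption that cp65001 uses the same byte format; `decode_in_chunks` is a hypothetical helper, not part of the patch.

```python
import codecs

# Assumption: "utf-8" stands in for cp65001, which may not be available everywhere.
ENCODING = "utf-8"


def decode_in_chunks(data: bytes, chunk_size: int) -> str:
    """Decode data piece by piece, the way a text stream reader would."""
    decoder = codecs.getincrementaldecoder(ENCODING)()
    parts = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        # An incomplete trailing sequence is buffered instead of raising an error.
        parts.append(decoder.decode(chunk))
    # final=True flushes the buffer and raises if bytes are still missing.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


text = "h\u00e9llo \u20ac"
data = text.encode(ENCODING)
result = decode_in_chunks(data, 2)
```

A one-shot `data[:2].decode(ENCODING)` raises UnicodeDecodeError on the same input, because the two-byte sequence for "é" is cut in half; the incremental decoder waits for the second byte instead.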

    I prefer to wait for Python 3.5 rather than rush to add this new feature after 3.4 beta 3. cp65001 is mostly used for output (sys.stdout/sys.stderr) on Windows, not for input.

    @vstinner vstinner added the type-feature A feature request or enhancement label Feb 9, 2014
    @serhiy-storchaka
    Member

    Nice.

    Could you please also add test_partial for CP65001 (if this will make sense)?

    What is the performance regression of this patch?

    I considered this issue a bug. If the performance regression is not too big, I think the patch can be applied to 3.3+. Otherwise, a warning should be added that CP65001 does not work with input text streams.

    @vstinner
    Member Author

    vstinner commented Feb 9, 2014

    It might be faster, or (more likely) it has zero impact on performance.

    @python-dev
    Mannequin

    python-dev mannequin commented Mar 17, 2014

    New changeset 08f9b881f78c by Victor Stinner in branch 'default':
    Issue bpo-20574: Implement incremental decoder for cp65001 code
    http://hg.python.org/cpython/rev/08f9b881f78c

    @python-dev
    Mannequin

    python-dev mannequin commented Mar 17, 2014

    New changeset 85b87789f048 by Victor Stinner in branch 'default':
    Issue bpo-20574: Add more tests for cp65001
    http://hg.python.org/cpython/rev/85b87789f048

    @vstinner
    Member Author

    > Could you please also add test_partial for CP65001 (if this will make sense)?

    I added CP65001Test, which inherits from UTF8Test and so runs all UTF-8 tests on the cp65001 codec. I'm surprised that the tests pass.
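
The inheritance pattern described here can be sketched as follows. This is not the code from test_codecs.py: the class names, the sample string and the `run_checks` helper are hypothetical, and the subclass uses the "utf8" alias as a stand-in because cp65001 may not be available on every platform.

```python
import codecs
import unittest


class UTF8Checks(unittest.TestCase):
    # Subclasses override only this attribute; every test method is inherited.
    encoding = "utf-8"
    sample = "h\u00e9llo \u20ac"

    def test_roundtrip(self):
        encoded = self.sample.encode(self.encoding)
        self.assertEqual(encoded.decode(self.encoding), self.sample)

    def test_partial(self):
        # Feed one byte at a time, so every multibyte sequence is split.
        decoder = codecs.getincrementaldecoder(self.encoding)()
        out = "".join(
            decoder.decode(bytes([b])) for b in self.sample.encode(self.encoding)
        )
        out += decoder.decode(b"", final=True)
        self.assertEqual(out, self.sample)


class AliasChecks(UTF8Checks):
    # Stand-in for a cp65001 subclass: same tests, different codec name.
    encoding = "utf8"


def run_checks() -> unittest.TestResult:
    """Run both classes and return the collected result."""
    loader = unittest.defaultTestLoader
    suite = unittest.TestSuite()
    for case in (UTF8Checks, AliasChecks):
        suite.addTests(loader.loadTestsFromTestCase(case))
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Because the subclass inherits every test method, any behavioural difference between the two codecs shows up as a failure in the subclass only.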

    @vstinner
    Member Author

    I don't feel the need to backport the new feature, so I'm closing the issue.

    @python-dev
    Mannequin

    python-dev mannequin commented Mar 18, 2014

    New changeset f6794a0fb2b3 by Victor Stinner in branch 'default':
    Issue bpo-20574: Remove duplicated test failing on Windows XP
    http://hg.python.org/cpython/rev/f6794a0fb2b3

    @vstinner
    Member Author

    I removed the test because there were two classes testing the same codec and those tests were failing. I need to refactor the tests, so I am reopening the issue.

    http://buildbot.python.org/all/builders/x86%20XP-4%203.x/builds/10291/steps/test/logs/stdio

    ======================================================================
    FAIL: test_lone_surrogates (test.test_codecs.CP65001Test)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\test\test_codecs.py", line 773, in test_lone_surrogates
        super().test_lone_surrogates()
      File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\test\test_codecs.py", line 349, in test_lone_surrogates
        self.assertRaises(UnicodeEncodeError, "\ud800".encode, self.encoding)
    AssertionError: UnicodeEncodeError not raised by encode
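
The failing assertion expects a lone surrogate to be rejected by the encoder. A minimal sketch of that expectation, using the portable "utf-8" codec rather than cp65001 (an assumption made so the example runs on any platform); `encodes_strictly` is a hypothetical helper:

```python
def encodes_strictly(text: str, encoding: str = "utf-8") -> bool:
    """Return True if text encodes with the default (strict) error handler."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


lone = "\ud800"  # a high surrogate with no matching low surrogate

# Strict UTF-8 refuses to encode a lone surrogate.
strict_ok = encodes_strictly(lone)

# The "surrogatepass" error handler writes it out as three bytes instead.
passed_through = lone.encode("utf-8", "surrogatepass")
```

The traceback above shows the cp65001 codec on the Windows XP buildbot not raising here, which is why the inherited test failed.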

    @vstinner vstinner reopened this Mar 18, 2014
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022