Implement incremental decoder for cp65001 #64773

vstinner · 2014-02-09T13:18:25Z

BPO	20574
Nosy	@loewis, @vstinner, @larryhastings, @serhiy-storchaka
Files	incremental_cp_utf8.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2015-03-18.13:22:37.172>
created_at = <Date 2014-02-09.13:18:25.151>
labels = ['type-feature']
title = 'Implement incremental decoder for cp65001'
updated_at = <Date 2015-03-18.13:22:37.171>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2015-03-18.13:22:37.171>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2015-03-18.13:22:37.172>
closer = 'vstinner'
components = []
creation = <Date 2014-02-09.13:18:25.151>
creator = 'vstinner'
dependencies = []
files = ['34008']
hgrepos = []
issue_num = 20574
keywords = ['patch']
message_count = 9.0
messages = ['210759', '210764', '210783', '213905', '213906', '213907', '213908', '213923', '213926']
nosy_count = 5.0
nosy_names = ['loewis', 'vstinner', 'larry', 'python-dev', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue20574'
versions = ['Python 3.5']

vstinner · 2014-02-09T13:18:24Z

(Follow up of issue bpo-20538 and bpo-20571.) Attached patch implements incremental decoders for multibyte code pages (on Windows), especially for CP_UTF8 aka "cp65001" in Python.

Code pages 932, 936, 949, 950 and 1361 already have an incremental decoder since:
---
changeset: 38817:549c547700af
branch: legacy-trunk
user: Martin v. Löwis <martin@v.loewis.de>
date: Wed Jun 14 05:21:04 2006 +0000
files: Doc/api/concrete.tex Include/unicodeobject.h Lib/encodings/mbcs.py Misc/NEWS Modules/_codecsmodule.c Objects/unicodeobject.c
description:
Patch bpo-1455898: Incremental mode for "mbcs" codec.
---

Python currently uses IsDBCSLeadByteEx():
http://msdn.microsoft.com/en-us/library/windows/desktop/dd318667%28v=vs.85%29.aspx

And CharPrevA():
http://msdn.microsoft.com/en-us/library/windows/desktop/ms647471%28v=vs.85%29.aspx

But IsDBCSLeadByteEx() only supports code pages 932, 936, 949, 950 and 1361.

Python has supported code page 65001 (codec "cp65001") since Python 3.3. New tests on incremental decoders were added in Python 3.4: I added a skip for cp65001 since it was not supported (bpo-20571). This issue implements the incremental decoder and so removes the skip.

I prefer to wait for Python 3.5 (rather than rush to add this new feature after 3.4 beta 3). cp65001 is mostly used for output (sys.stdout/sys.stderr) on Windows, not for input.
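To illustrate what an incremental decoder provides, here is a minimal sketch. It assumes the portable "utf-8" codec as a stand-in for "cp65001" (which is only available on Windows), and `decode_chunks` is a hypothetical helper, not part of the attached patch. The decoder buffers an incomplete multibyte sequence at the end of a chunk and emits the character once the remaining bytes arrive.

```python
import codecs

# Assumption: "utf-8" stands in for "cp65001" so the sketch runs on any platform.
ENCODING = "utf-8"


def decode_chunks(chunks, encoding=ENCODING):
    """Decode an iterable of byte chunks, buffering incomplete sequences."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True makes the decoder raise UnicodeDecodeError if bytes of an
    # unfinished sequence are still buffered at the end of the input.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


data = "h\u00e9llo \u20ac".encode(ENCODING)
# Split in the middle of the two-byte sequence that encodes U+00E9.
chunks = [data[:2], data[2:]]
text = decode_chunks(chunks)
```

A non-incremental decode of `data[:2]` alone would fail in strict mode, which is why a text stream reading input in arbitrary chunks needs the incremental variant.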

serhiy-storchaka · 2014-02-09T13:50:56Z

Nice.

Could you please also add test_partial for CP65001 (if this will make sense)?

What is the performance regression of this patch?

I consider this issue a bug. If the performance regression is not too big, I think the patch can be applied to 3.3+. Otherwise a warning should be added that CP65001 doesn't work with input text streams.

vstinner · 2014-02-09T19:47:43Z

It might be faster, or (more likely) has zero impact on performance.

python-dev · 2014-03-17T22:12:29Z

New changeset 08f9b881f78c by Victor Stinner in branch 'default':
Issue bpo-20574: Implement incremental decoder for cp65001 code
http://hg.python.org/cpython/rev/08f9b881f78c

python-dev · 2014-03-17T22:17:43Z

New changeset 85b87789f048 by Victor Stinner in branch 'default':
Issue bpo-20574: Add more tests for cp65001
http://hg.python.org/cpython/rev/85b87789f048

vstinner · 2014-03-17T22:27:50Z

> Could you please also add test_partial for CP65001 (if this will make sense)?

I added CP65001Test, which inherits from UTF8Test and so runs all UTF-8 tests on the cp65001 codec. I'm surprised that the tests pass.
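The inheritance pattern described above can be sketched roughly as follows. The class names `UTF8LikeTest` and `AliasCodecTest` are hypothetical stand-ins, not the real classes in test_codecs.py, and the subclass uses the "utf_8" alias instead of "cp65001" so that the sketch runs on any platform. The base class reads the codec name from a class attribute, so a subclass only has to override that attribute to rerun every test against another codec.

```python
import codecs
import unittest


class UTF8LikeTest(unittest.TestCase):
    # Hypothetical stand-in for the real UTF8Test.
    encoding = "utf-8"

    def test_roundtrip(self):
        text = "h\u00e9llo \u20ac"
        encoded = text.encode(self.encoding)
        self.assertEqual(encoded.decode(self.encoding), text)

    def test_partial(self):
        # Feed the three bytes of U+20AC one at a time: nothing is emitted
        # until the sequence is complete.
        decoder = codecs.getincrementaldecoder(self.encoding)()
        data = "\u20ac".encode(self.encoding)
        out = [decoder.decode(bytes([b])) for b in data]
        self.assertEqual(out, ["", "", "\u20ac"])


class AliasCodecTest(UTF8LikeTest):
    # Assumption: "utf_8" plays the role that "cp65001" plays in the issue.
    encoding = "utf_8"
```

One consequence of this design is that every inherited test must hold for the subclass codec too, so a behavioural difference between the two codecs surfaces as a failure in the subclass.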

vstinner · 2014-03-17T22:28:27Z

I don't feel the need to backport the new feature, so I'm closing the issue.

python-dev · 2014-03-18T00:40:32Z

New changeset f6794a0fb2b3 by Victor Stinner in branch 'default':
Issue bpo-20574: Remove duplicated test failing on Windows XP
http://hg.python.org/cpython/rev/f6794a0fb2b3

vstinner · 2014-03-18T00:51:37Z

I removed the test because there were two classes testing the same codec and those tests were failing. I need to refactor the tests, and so I am reopening the issue.

http://buildbot.python.org/all/builders/x86%20XP-4%203.x/builds/10291/steps/test/logs/stdio

======================================================================
FAIL: test_lone_surrogates (test.test_codecs.CP65001Test)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\test\test_codecs.py", line 773, in test_lone_surrogates
    super().test_lone_surrogates()
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\test\test_codecs.py", line 349, in test_lone_surrogates
    self.assertRaises(UnicodeEncodeError, "\ud800".encode, self.encoding)
AssertionError: UnicodeEncodeError not raised by encode
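For reference, the behaviour that the failing assertion expects can be reproduced with a short sketch. It assumes the built-in "utf-8" codec rather than "cp65001", so it only shows what the inherited test requires, not what the Windows XP buildbot did: in strict mode, encoding a lone surrogate raises UnicodeEncodeError, and the "surrogatepass" error handler is needed to encode it anyway.

```python
lone = "\ud800"  # a lone high surrogate, not a valid Unicode scalar value

# Strict mode: the built-in UTF-8 codec refuses to encode a lone surrogate.
try:
    lone.encode("utf-8")
except UnicodeEncodeError as exc:
    error = exc
else:
    error = None

# The "surrogatepass" handler encodes the surrogate as a three-byte sequence.
escaped = lone.encode("utf-8", "surrogatepass")
```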

vstinner added the type-feature A feature request or enhancement label Feb 9, 2014

vstinner closed this as completed Mar 17, 2014

vstinner reopened this Mar 18, 2014

vstinner closed this as completed Mar 18, 2015

ezio-melotti transferred this issue from another repository Apr 10, 2022
