classification
Title: Implement incremental decoder for cp65001
Type: enhancement Stage:
Components: Versions: Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: haypo, larry, loewis, python-dev, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2014-02-09 13:18 by haypo, last changed 2015-03-18 13:22 by haypo. This issue is now closed.

Files
File name Uploaded Description Edit
incremental_cp_utf8.patch haypo, 2014-02-09 13:18 review
Messages (9)
msg210759 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-02-09 13:18
(Follow-up to issues #20538 and #20571.) The attached patch implements incremental decoders for multibyte code pages (on Windows), especially for CP_UTF8, aka "cp65001" in Python.

Code pages 932, 936, 949, 950 and 1361 have had an incremental decoder since:
---
changeset:   38817:549c547700af
branch:      legacy-trunk
user:        Martin v. Löwis <martin@v.loewis.de>
date:        Wed Jun 14 05:21:04 2006 +0000
files:       Doc/api/concrete.tex Include/unicodeobject.h Lib/encodings/mbcs.py Misc/NEWS Modules/_codecsmodule.c Objects/unicodeobject.c
description:
Patch #1455898: Incremental mode for "mbcs" codec.
---

Python currently uses IsDBCSLeadByteEx():
http://msdn.microsoft.com/en-us/library/windows/desktop/dd318667%28v=vs.85%29.aspx

And CharPrevA():
http://msdn.microsoft.com/en-us/library/windows/desktop/ms647471%28v=vs.85%29.aspx

But IsDBCSLeadByteEx() only supports code pages 932, 936, 949, 950 and 1361.

Python has supported code page 65001 (codec "cp65001") since Python 3.3. New tests for incremental decoders were added in Python 3.4; I added a skip for cp65001 since it was not supported (#20571). This issue implements the incremental decoder and so removes the skip.

I prefer to wait for Python 3.5 (rather than rush to add this new feature after 3.4 beta 3). cp65001 is mostly used for output (sys.stdout/sys.stderr) on Windows, not for input.
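
For readers unfamiliar with incremental decoders, here is a minimal sketch of what the feature provides. It is not taken from the patch. It uses the "utf-8" codec as a stand-in, on the assumption that "cp65001" may not be available on the machine running the example, and the helper name decode_in_chunks is hypothetical.

```python
import codecs

# "utf-8" is a stand-in for "cp65001" (assumption: cp65001 may be unavailable here).
data = "h\u00e9llo \u20ac".encode("utf-8")


def decode_in_chunks(payload, chunk_size, encoding="utf-8"):
    """Decode payload piece by piece, as a text stream reader would."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = []
    for start in range(0, len(payload), chunk_size):
        chunk = payload[start:start + chunk_size]
        # An incomplete multibyte sequence is buffered, not reported as an error.
        parts.append(decoder.decode(chunk))
    # final=True flushes the decoder and raises if bytes are still pending.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


result = decode_in_chunks(data, 1)

# A one-shot decode of a truncated sequence fails instead of waiting for more input.
try:
    b"\xe2\x82".decode("utf-8")
    truncated_fails = False
except UnicodeDecodeError:
    truncated_fails = True
```

Feeding one byte at a time is the worst case for a multibyte codec, which is why it is a common way to exercise an incremental decoder.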
msg210764 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-02-09 13:50
Nice.

Could you please also add test_partial for CP65001 (if that makes sense)?

What is the performance regression of this patch?

I consider this issue a bug. If the performance regression is not too big, I think the patch can be applied to 3.3+. Otherwise a warning should be added that CP65001 does not work with input text streams.
msg210783 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-02-09 19:47
It might be faster or, more likely, have zero impact on performance.
msg213905 - (view) Author: Roundup Robot (python-dev) Date: 2014-03-17 22:12
New changeset 08f9b881f78c by Victor Stinner in branch 'default':
Issue #20574: Implement incremental decoder for cp65001 code
http://hg.python.org/cpython/rev/08f9b881f78c
msg213906 - (view) Author: Roundup Robot (python-dev) Date: 2014-03-17 22:17
New changeset 85b87789f048 by Victor Stinner in branch 'default':
Issue #20574: Add more tests for cp65001
http://hg.python.org/cpython/rev/85b87789f048
msg213907 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-03-17 22:27
> Could you please also add test_partial for CP65001 (if this will make sense)?

I added CP65001Test, which inherits from UTF8Test and so runs all UTF-8 tests on the cp65001 codec. I'm surprised that the tests pass.
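
The inheritance pattern described above can be sketched roughly as follows. This is an illustration only, not the code in test_codecs.py. The class names UTF8RoundTripTest and AliasCodecTest are hypothetical, and the "utf8" alias stands in for "cp65001" so that the sketch runs on any platform.

```python
import codecs
import unittest


class UTF8RoundTripTest(unittest.TestCase):
    # Hypothetical base class: every test reads the codec name from self.encoding.
    encoding = "utf-8"

    def test_partial(self):
        text = "\u20ac\U0001f600a"
        decoder = codecs.getincrementaldecoder(self.encoding)()
        out = ""
        # Feed the encoded text one byte at a time.
        for byte in text.encode(self.encoding):
            out += decoder.decode(bytes([byte]))
        out += decoder.decode(b"", final=True)
        self.assertEqual(out, text)


class AliasCodecTest(UTF8RoundTripTest):
    # Overriding one attribute reruns every inherited test against another codec.
    # "utf8" is a stand-in for "cp65001" (assumption: cp65001 may be unavailable).
    encoding = "utf8"
```

The subclass adds no test methods of its own; it only swaps the codec under test.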
msg213908 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-03-17 22:28
I don't feel the need to backport the new feature, so I'm closing the issue.
msg213923 - (view) Author: Roundup Robot (python-dev) Date: 2014-03-18 00:40
New changeset f6794a0fb2b3 by Victor Stinner in branch 'default':
Issue #20574: Remove duplicated test failing on Windows XP
http://hg.python.org/cpython/rev/f6794a0fb2b3
msg213926 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-03-18 00:51
I removed the test because there were two classes testing the same codec and those tests were failing. I need to refactor the tests, so I am reopening the issue.

http://buildbot.python.org/all/builders/x86%20XP-4%203.x/builds/10291/steps/test/logs/stdio

======================================================================
FAIL: test_lone_surrogates (test.test_codecs.CP65001Test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\test\test_codecs.py", line 773, in test_lone_surrogates
    super().test_lone_surrogates()
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\test\test_codecs.py", line 349, in test_lone_surrogates
    self.assertRaises(UnicodeEncodeError, "\ud800".encode, self.encoding)
AssertionError: UnicodeEncodeError not raised by encode
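
The assertion in the traceback checks that encoding a lone surrogate is rejected. The sketch below shows that expectation against the "utf-8" codec, used as a stand-in on the assumption that "cp65001" may not be available. The helper name encodes_lone_surrogate is hypothetical. The traceback suggests that on the Windows XP buildbot the cp65001 encoder accepted the surrogate where the test expected an error.

```python
def encodes_lone_surrogate(encoding, errors="strict"):
    """Return True if a lone surrogate can be encoded, False if it is rejected."""
    try:
        "\ud800".encode(encoding, errors)
    except UnicodeEncodeError:
        return False
    return True


# The strict UTF-8 encoder rejects lone surrogates, which is what the test expects.
strict_ok = encodes_lone_surrogate("utf-8")

# The "surrogatepass" error handler explicitly allows them.
surrogatepass_ok = encodes_lone_surrogate("utf-8", "surrogatepass")
```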
History
Date User Action Args
2015-03-18 13:22:37 haypo set status: open -> closed; resolution: fixed
2014-03-18 00:51:37 haypo set status: closed -> open; resolution: fixed -> (no value); messages: + msg213926
2014-03-18 00:40:31 python-dev set messages: + msg213923
2014-03-17 22:28:26 haypo set status: open -> closed; resolution: fixed; messages: + msg213908
2014-03-17 22:27:50 haypo set messages: + msg213907
2014-03-17 22:17:43 python-dev set messages: + msg213906
2014-03-17 22:12:28 python-dev set nosy: + python-dev; messages: + msg213905
2014-02-09 19:47:43 haypo set messages: + msg210783
2014-02-09 13:50:56 serhiy.storchaka set messages: + msg210764
2014-02-09 13:18:25 haypo create