Implement incremental decoder for cp65001 #64773
Comments
(Follow-up of issues bpo-20538 and bpo-20571.)
The attached patch implements incremental decoders for multibyte code pages (on Windows), especially for CP_UTF8, aka "cp65001" in Python.
Code pages 932, 936, 949, 950 and 1361 already have an incremental decoder. Python currently uses IsDBCSLeadByteEx() and CharPrevA() for this, but IsDBCSLeadByteEx() only supports code pages 932, 936, 949, 950 and 1361.
Python has supported code page 65001 (codec "cp65001") since Python 3.3. New tests on incremental decoders were added in Python 3.4. I added a skip for cp65001 since it was not supported (bpo-20571). This issue implements the incremental decoder and so removes the skip.
I prefer to wait for Python 3.5 rather than rush to add this new feature after 3.4 beta 3. cp65001 is mostly used for output (sys.stdout/sys.stderr) on Windows, not for input.
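For readers unfamiliar with incremental decoders, here is a minimal sketch of what the patch has to make possible: decoding a byte stream that is split in the middle of a multibyte sequence. It uses the portable "utf-8" codec as a stand-in, on the assumption that cp65001 should behave the same way; the helper name `decode_in_chunks` is hypothetical and not part of the patch.

```python
import codecs


def decode_in_chunks(data: bytes, encoding: str = "utf-8", chunk_size: int = 1) -> str:
    """Decode `data` piece by piece with an incremental decoder."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        # An incomplete trailing sequence is buffered inside the decoder,
        # so this call may return an empty string.
        parts.append(decoder.decode(chunk))
    # final=True flushes the buffer and raises if a sequence is still incomplete.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


text = "h\u00e9llo \u20ac"
# Feeding one byte at a time splits every multibyte character.
assert decode_in_chunks(text.encode("utf-8")) == text
```

A codec without incremental support cannot do this reliably, because a chunk ending inside a character looks like invalid input when decoded on its own.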
Nice. Could you please also add test_partial for CP65001 (if this makes sense)? What is the performance regression of this patch? I consider this issue a bug, and if the performance regression is not too big, I think the patch can be applied to 3.3+. Otherwise a warning should be added that CP65001 does not work with input text streams.
It might be faster, or (more likely) has zero impact on performance.
New changeset 08f9b881f78c by Victor Stinner in branch 'default':
New changeset 85b87789f048 by Victor Stinner in branch 'default':
I added CP65001Test, which inherits from UTF8Test and so runs all UTF-8 tests on the cp65001 codec. I'm surprised that the tests pass.
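The inheritance trick described above can be sketched as follows. This is only an illustration of the pattern, not the code from test_codecs.py: the class names, the single test method and the `run_case` helper are hypothetical, and the "utf_8" alias stands in for "cp65001", which is only available on Windows.

```python
import io
import unittest


class BaseRoundTripTest(unittest.TestCase):
    # Hypothetical stand-in for a base class such as UTF8Test.
    encoding = "utf-8"

    def test_roundtrip(self):
        text = "abc\u20ac"
        encoded = text.encode(self.encoding)
        self.assertEqual(encoded.decode(self.encoding), text)


class AliasRoundTripTest(BaseRoundTripTest):
    # Overriding one attribute reruns every inherited test on another codec.
    # "utf_8" is a placeholder here; the real subclass would use "cp65001".
    encoding = "utf_8"


def run_case(case: type) -> unittest.TestResult:
    """Run all tests of one TestCase class quietly and return the result."""
    suite = unittest.TestLoader().loadTestsFromTestCase(case)
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)


result = run_case(AliasRoundTripTest)
assert result.wasSuccessful()
```

The drawback of this design is that every inherited test must hold for the subclass codec, so any platform-specific difference surfaces as a failure in a test written for the base codec.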
I don't feel the need to backport the new feature, so I'm closing the issue.
New changeset f6794a0fb2b3 by Victor Stinner in branch 'default':
I removed the test because there were two classes testing the same codec and those tests were failing. I need to refactor the tests, so I am reopening the issue.
http://buildbot.python.org/all/builders/x86%20XP-4%203.x/builds/10291/steps/test/logs/stdio
======================================================================
Traceback (most recent call last):
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\test\test_codecs.py", line 773, in test_lone_surrogates
    super().test_lone_surrogates()
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\test\test_codecs.py", line 349, in test_lone_surrogates
    self.assertRaises(UnicodeEncodeError, "\ud800".encode, self.encoding)
AssertionError: UnicodeEncodeError not raised by encode
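The failing assertion expects the codec to reject a lone surrogate in strict mode. The sketch below shows that expectation with the portable "utf-8" codec; the helper `encodes_lone_surrogate` is hypothetical. Judging from the traceback, cp65001 on that buildbot encoded the surrogate without raising, which is why the inherited test failed.

```python
def encodes_lone_surrogate(encoding: str, errors: str = "strict") -> bool:
    """Return True if the codec encodes U+D800 without raising."""
    try:
        "\ud800".encode(encoding, errors)
    except UnicodeEncodeError:
        return False
    return True


# Strict UTF-8 rejects lone surrogates, which is what the test asserts.
assert not encodes_lone_surrogate("utf-8")
# The "surrogatepass" error handler explicitly allows them.
assert encodes_lone_surrogate("utf-8", "surrogatepass")
```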