classification
Title: Incremental encoders of CJK codecs reset the codec at each call to encode()
Type: Stage:
Components: Interpreter Core Versions: Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: arigo, gz, haypo, hyeshik.chang, kennyluck, lemburg, loewis, python-dev
Priority: normal Keywords: patch

Created on 2011-05-17 23:10 by haypo, last changed 2012-02-01 01:25 by kennyluck. This issue is now closed.

Files
File name Uploaded Description Edit
cjk_no_reset.patch haypo, 2011-05-17 23:10 review
Messages (10)
msg136194 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-05-17 23:10
Stateful CJK codecs reset the codec at each call to encode() producing a valid but overlong output:

>>> import codecs
>>> encoder = codecs.getincrementalencoder('hz')()
>>> encoder.encode('\u804a') + encoder.encode('\u804a')
b'聊聊'
>>> '\u804a\u804a'.encode('hz')
b'聊聊'

Multibyte encodings: HZ and all encodings of the ISO 2022 family (e.g. iso-2022-jp).

Attached patch fixes this issue. I don't like how I added the tests, these tests may be moved somewhere else, but HZ codec doesn't have tests today (I opened issue #12057 for that), and ISO 2022 codecs don't have specific tests (test_multibytecodec is "Unit test for multibytecodec itself"). We should maybe also add tests specific to ISO 2022 first?

I hesitate to reset the codec on .encode(text, final=True), but UTF-8-SIG or UTF-16 don't reset the codec if final=True. io.TextIOWrapper only calls encoder.reset() on file.seek(0). On a seek to another position, it calls encoder.setstate(0).

See also issues #12016 and #12057.
msg136534 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-05-22 15:02
I think it's better to use a StringIO instance for the tests.

Regarding resetting the incremental codec every time .encode() is called: Hye-Shik will have to comment. Perhaps there's an internal reason why they do this.
msg136727 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-05-24 09:55
> I think it's better to use a StringIO instance for the tests.

For which test excatly? An encoder produces bytes, I don't the relation with StringIO.
msg136730 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-05-24 10:17
STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
>> I think it's better to use a StringIO instance for the tests.
> 
> For which test excatly? An encoder produces bytes, I don't the relation with StringIO.

Sorry, BytesIO in Python3-speak. In Python2 you'd use StringIO.
msg136777 - (view) Author: Martin (gz) Date: 2011-05-24 18:13
Does Victor Stinner have a psychic link with Armin Rigo? :)

https://bitbucket.org/pypy/pypy/src/7f593e7877d4/pypy/module/_multibytecodec/app_multibytecodec.py

"""
# My theory is that they are not widely used on CPython either, because
# I found two bugs just by looking at their .c source: they always call
# encreset() after a piece of data, even though I think it's wrong ---
# it should be called only once at the end; and mbiencoder_reset() calls
# decreset() instead of encreset().
"""

The answer to Armin's theory is that they're bugs but not ones users are likely to notice?
msg136780 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-05-24 19:04
Le mardi 24 mai 2011  18:13 +0000, Martin a crit :
> Martin <gzlist@googlemail.com> added the comment:
> 
> Does Victor Stinner have a psychic link with Armin Rigo? :)
> 
> https://bitbucket.org/pypy/pypy/src/7f593e7877d4/pypy/module/_multibytecodec/app_multibytecodec.py
> 
> """
> # My theory is that they are not widely used on CPython either, because
> # I found two bugs just by looking at their .c source

Sorry, I only found one bug, and while testing HZ, not while reading the
source code.

> ... and mbiencoder_reset() calls decreset() instead of encreset()

This is a new bug that you should be fixed. Armin did not reported the
bug upstream (in this bug tracker)?

> The answer to Armin's theory is that they're bugs but not ones users are likely to notice?

Ok, I will apply my fix.
msg136788 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2011-05-24 20:08
Hi :-)  I did not report the two issues I found so far because I didn't finish the PyPy implementation of CJK yet, and I'm very new to anything related to codecs; additionally I didn't check Python 3.x, as I was just following the 2.7 sources.  Can someone confirm that the two bugs I suspect are really bugs?  And should I open another report to help tracking the 2nd bug?
msg136791 - (view) Author: Roundup Robot (python-dev) Date: 2011-05-24 20:24
New changeset bd17396895fb by Victor Stinner in branch '3.1':
Issue #12100: Don't reset incremental encoders of CJK codecs at each call to
http://hg.python.org/cpython/rev/bd17396895fb

New changeset 7f2ab2f95a04 by Victor Stinner in branch '3.2':
(Merge 3.1) Issue #12100: Don't reset incremental encoders of CJK codecs at
http://hg.python.org/cpython/rev/7f2ab2f95a04

New changeset cb9867dab15e by Victor Stinner in branch 'default':
(Merge 3.2) Issue #12100: Don't reset incremental encoders of CJK codecs at
http://hg.python.org/cpython/rev/cb9867dab15e
msg136792 - (view) Author: Roundup Robot (python-dev) Date: 2011-05-24 20:29
New changeset e789b4cda872 by Victor Stinner in branch '2.7':
Issue #12100: Don't reset incremental encoders of CJK codecs at each call to
http://hg.python.org/cpython/rev/e789b4cda872
msg136910 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-05-25 22:01
The initial problem (reset() at each call to .encode()) is fixed in Python 2.7, 3.1, 3.2 and 3.3.

I opened a new issue, #12171, for the second problem noticed by Armin (decreset vs encreset).
History
Date User Action Args
2012-02-01 01:25:06kennylucksetnosy: + kennyluck
2011-05-25 22:01:10hayposetstatus: open -> closed
resolution: fixed
messages: + msg136910
2011-05-24 20:29:21python-devsetmessages: + msg136792
2011-05-24 20:24:36python-devsetnosy: + python-dev
messages: + msg136791
2011-05-24 20:08:14arigosetmessages: + msg136788
2011-05-24 19:04:14hayposetmessages: + msg136780
2011-05-24 18:13:35gzsetnosy: + arigo, gz
messages: + msg136777
2011-05-24 10:17:15lemburgsetmessages: + msg136730
title: Incremental encoders of CJK codecs reset the codec at each call to encode() -> Incremental encoders of CJK codecs reset the codec at each call to encode()
2011-05-24 09:55:35hayposetmessages: + msg136727
2011-05-22 15:02:13lemburgsetmessages: + msg136534
2011-05-22 13:44:17hayposetnosy: + loewis
2011-05-21 00:33:45haypolinkissue12057 dependencies
2011-05-17 23:10:05haypocreate