Reset method of the incremental encoders of CJK codecs calls the decoder reset function #56380

vstinner · 2011-05-24T21:30:10Z

BPO	12171
Nosy	@malemburg, @doerwalter, @arigo, @amauryfa, @hyeshik, @vstinner, @bz2
Files	cjk_reset_result.patch cjk_encreset.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2011-05-30.20:57:33.119>
created_at = <Date 2011-05-24.21:30:09.519>
labels = ['library', 'expert-unicode']
title = 'Reset method of the incremental encoders of CJK codecs\tcalls the decoder reset function'
updated_at = <Date 2011-05-30.20:57:33.118>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2011-05-30.20:57:33.118>
actor = 'python-dev'
assignee = 'none'
closed = True
closed_date = <Date 2011-05-30.20:57:33.119>
closer = 'python-dev'
components = ['Library (Lib)', 'Unicode']
creation = <Date 2011-05-24.21:30:09.519>
creator = 'vstinner'
dependencies = []
files = ['22096', '22119']
hgrepos = []
issue_num = 12171
keywords = ['patch']
message_count = 10.0
messages = ['136794', '136796', '136831', '136908', '136915', '136918', '136919', '136922', '136946', '137326']
nosy_count = 8.0
nosy_names = ['lemburg', 'doerwalter', 'arigo', 'amaury.forgeotdarc', 'hyeshik.chang', 'vstinner', 'gz', 'python-dev']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue12171'
versions = ['Python 3.3']

vstinner · 2011-05-24T21:30:09Z

HZ and ISO-2022 family codecs may generate an escape sequence at the end of a stream. For example, the HZ codec uses '~{' to switchs from ASCII to GB2312, and '~}' resets the encoding to ASCII. At the end of a stream, the encoding should be reset to ASCII. '\u804a'.encode('hz') returns b'~{AD~}', which is correct.

Incremental encoders generate also the escape sequence if the last call to encode() is done using final=True.

It would be nice to be able to generate the escape sequence without the final flag, because sometimes you don't know which call to encode() is the last one. For example if you write data in a file, you may want to write the escape sequence at the end when the file is closed.

I propose to change the reset() method of incremental encoders: they may return a bytes object to close the stream. For example, the reset() method of the HZ codec may returns b'~}' if the encoder was using GB2312 (if it emited previously b'~{').

So the 3 following code should returns b'~{AD~}':

'\u804a'.encode('hz')
encoder = codecs.lookup('hz').incrementalencoder(); encoder.encode('\u804a', final=True)
encoder = codecs.lookup('hz').incrementalencoder(); encoder.encode('\u804a') + encoder.reset()

For backward compatibility, reset() returns None if there is no pending buffer or any escape sequence.

--

This proposition comes from bpo-12000: Armin Rigo noticed that the reset method of the incremental encoders of CJK codecs calls the decoder reset function. Extract of Modules/cjkcodecs/multibytecodec.c:

static PyObject *
mbiencoder_reset(MultibyteIncrementalEncoderObject *self)
{
    if (self->codec->decreset != NULL &&
        self->codec->decreset(&self->state, self->codec->config) != 0)
        return NULL;
    self->pendingsize = 0;

    Py_RETURN_NONE;
}

I suppose that codec->encreset() is not called here because we need an output buffer, and there is no such buffer. Or it's just a copy-paste failure.

--

I am not sure that it is really useful to emit b'~}' at the end of a HZ stream, but the change is simple and if you don't care of b'~}': just ignore the result of reset() (as everybody does today, so it doesn't hurt).

--

Only HZ and ISO-2022 encodings implement the reset method of their incremental encoder. For example, encodings of the UTF family don't need to emit bytes at reset.

For the maximum length of reset() output: HZ may generates 2 bytes (b'~}') and ISO-2022 may generates 4 bytes (b'\x0F' + b'\x1F(B').

--

See also the issue bpo-12100.

amauryfa · 2011-05-24T21:48:20Z

Do we need an additional method? It seems that this reset() could also be written encoder.encode('', final=True)

malemburg · 2011-05-25T08:23:15Z

Amaury Forgeot d'Arc wrote:

Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:

Do we need an additional method? It seems that this reset() could also be written encoder.encode('', final=True)

+1

I think that's a much more natural way to implement "finalize the
encoding output without adding any data to it".

vstinner · 2011-05-25T21:52:43Z

Le mercredi 25 mai 2011 à 08:23 +0000, Marc-Andre Lemburg a écrit :

> Do we need an additional method? It seems that this reset() could
> also be written encoder.encode('', final=True)

+1

I think that's a much more natural way to implement "finalize the
encoding output without adding any data to it".

And so, reset() should discard the output? I can easily adapt my patch
to discard the output (but still call encreset() instead of decreset()).

vstinner · 2011-05-25T22:31:05Z

cjk_encreset.patch: Fix multibytecodec.MultibyteIncrementalEncoder.reset(), call encreset() instead of decreset(). Improve also incremental encoder tests: reset the encoder using .encode('', final=True) and check the output.

I also prefer to reuse the existing API instead of changing reset() API just for a corner case.

bz2 · 2011-05-25T22:48:37Z

New cjk_encreset.patch looks good to me.

As with bpo-12100 this likely wasn't reported before because decoders are robust against escape sequence oddities.

malemburg · 2011-05-25T23:42:29Z

STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
> Le mercredi 25 mai 2011 à 08:23 +0000, Marc-Andre Lemburg a écrit :
>>> Do we need an additional method? It seems that this reset() could
>>> also be written encoder.encode('', final=True)
>>
>> +1
>>
>> I think that's a much more natural way to implement "finalize the
>> encoding output without adding any data to it".
> 
> And so, reset() should discard the output? I can easily adapt my patch
> to discard the output (but still call encreset() instead of decreset()).

I'm not sure what you mean by "discard the output".

Calling .reset() should still add the closing sequence to the output
buffer, if needed.

The purpose of .reset() is flush all data and put the codec into a
clean state (comparable to the state you get when you start using
it).

vstinner · 2011-05-26T00:14:09Z

I'm not sure what you mean by "discard the output".

Calling .reset() should still add the closing sequence to the output
buffer, if needed.

Incremental encoders don't have any buffer.

Python 3.3a0 (default:3c7792ec4547+, May 26 2011, 00:24:51) 
>>> import codecs

>>> enc=codecs.lookup('hz').incrementalencoder()
>>> enc.encode('\u8000')
b'~{R+'
>>> enc.reset()

>>> enc.encode('\u8000')
b'~{R+'
>>> enc.encode('', final=True)
b'~}'

>>> import io
>>> buffer=io.BytesIO()
>>> enc=codecs.lookup('hz').streamwriter(buffer)
>>> enc.write('\u8000')
>>> buffer.getvalue()
b'~{R+'
>>> enc.reset()
>>> buffer.getvalue()
b'~{R+~}'

IncrementalEncoder.reset() may mention that the output is discarded:

.. method:: reset()

  Reset the encoder to the initial state. The output is discarded: 
  use the ``encode('', final=True)`` method to reset the encoder 
  and to get the output.

doerwalter · 2011-05-26T11:43:09Z

+1 on the documentation changes.

python-dev · 2011-05-30T20:57:33Z

New changeset 61aaec67b521 by Victor Stinner in branch 'default':
Close bpo-12171: IncrementalEncoder.reset() of CJK codecs (multibytecodec) calls
http://hg.python.org/cpython/rev/61aaec67b521

vstinner added stdlib Python modules in the Lib dir topic-unicode labels May 24, 2011

malemburg changed the title ~~Reset method of the incremental encoders of CJK codecs calls the decoder reset function~~ Reset method of the incremental encoders of CJK codecs calls the decoder reset function May 25, 2011

python-dev mannequin closed this as completed May 30, 2011

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reset method of the incremental encoders of CJK codecs calls the decoder reset function #56380

Reset method of the incremental encoders of CJK codecs calls the decoder reset function #56380

vstinner commented May 24, 2011

vstinner commented May 24, 2011

amauryfa commented May 24, 2011

malemburg commented May 25, 2011

vstinner commented May 25, 2011

vstinner commented May 25, 2011

bz2 mannequin commented May 25, 2011

malemburg commented May 25, 2011

vstinner commented May 26, 2011

doerwalter commented May 26, 2011

python-dev mannequin commented May 30, 2011

Reset method of the incremental encoders of CJK codecs calls the decoder reset function #56380

Reset method of the incremental encoders of CJK codecs calls the decoder reset function #56380

Comments

vstinner commented May 24, 2011

vstinner commented May 24, 2011

amauryfa commented May 24, 2011

malemburg commented May 25, 2011

vstinner commented May 25, 2011

vstinner commented May 25, 2011

bz2 mannequin commented May 25, 2011

malemburg commented May 25, 2011

vstinner commented May 26, 2011

doerwalter commented May 26, 2011

python-dev mannequin commented May 30, 2011