
classification
Title: BufferedIncrementalEncoder violates IncrementalEncoder interface
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.11, Python 3.10, Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: doerwalter, lemburg, loewis, martin.panter, serhiy.storchaka
Priority: normal Keywords:

Created on 2014-01-28 16:34 by serhiy.storchaka, last changed 2022-04-11 14:57 by admin.

Messages (4)
msg209563 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2014-01-28 16:34
The documentation of IncrementalEncoder.getstate() says:

"""
Return the current state of the encoder which must be an integer. The implementation should make sure that 0 is the most common state. (States that are more complicated than integers can be converted into an integer by marshaling/pickling the state and encoding the bytes of the resulting string into an integer).
"""

But the implementation of BufferedIncrementalEncoder.getstate() is

    def getstate(self):
        return self.buffer or 0

self.buffer is "unencoded input that is kept between calls to encode()", e.g. a string.
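
The mismatch can be observed directly. The following is a minimal sketch, assuming the idna codec is still built on codecs.BufferedIncrementalEncoder and that getstate() is unchanged from the code quoted above (true for the versions listed on this issue; later versions may differ). The variable names are illustrative only.

```python
import codecs

# The incremental idna encoder subclasses codecs.BufferedIncrementalEncoder.
encoder = codecs.getincrementalencoder("idna")()

# Nothing has been buffered yet, so "self.buffer or 0" evaluates to 0.
fresh_state = encoder.getstate()

# No '.' has been seen, so the label is held back instead of being encoded.
output = encoder.encode("x")

# The buffer is now the non-empty string 'x', and that string is returned
# as the state instead of an integer.
buffered_state = encoder.getstate()

print(repr(fresh_state), repr(output), repr(buffered_state))
```

Only the empty-buffer case happens to satisfy the documented "must be an integer" requirement.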
msg209791 - Author: Walter Dörwald (doerwalter) (Python committer) Date: 2014-01-31 14:18
I dug up an ancient email about that subject:

>>> However, I've discovered that BufferedIncrementalEncoder.getstate()
>>> doesn't match the specification (i.e. it returns the buffer, not an
>>> int). However this class is unused (and probably useless, because it
>>> doesn't make sense to delay encoding the input). The simplest solution
>>> would be to simply drop the class.
>>
>> Sounds like a plan; go right ahead!
>
> Oops, there *is* one codec that uses it: The idna encoder. It buffers
> the input until a '.' is encountered (or encode() is called with
> final==True) and then encodes this part.
>
> Either the idna encoder encodes the unencoded input as an int, or we drop
> the specification that encoder.getstate() must return an int, or we
> change it to mirror the decoder specification (i.e. return a
> (buffered_input, additional_state_info) tuple).
>
> (A more radical solution would be to completely drop the incremental
> codecs for idna).
>
> Maybe we should wait and see how the implementation of writing turns out?
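
The first option in the quoted email (encoding the unencoded input as an int) could look roughly like the sketch below. This is not code from the standard library: buffer_to_int, int_to_buffer, IntStateMixin and IntStateIdnaEncoder are hypothetical names, and the UTF-8 packing with a sentinel byte is just one possible scheme.

```python
import codecs


def buffer_to_int(buffer: str) -> int:
    """Pack pending text into an int; the empty buffer maps to 0."""
    if not buffer:
        return 0
    # A leading 0x01 sentinel byte keeps leading NUL bytes from being lost
    # when the integer is converted back to bytes.
    data = b"\x01" + buffer.encode("utf-8", "surrogatepass")
    return int.from_bytes(data, "big")


def int_to_buffer(state: int) -> str:
    """Inverse of buffer_to_int()."""
    if state == 0:
        return ""
    raw = state.to_bytes((state.bit_length() + 7) // 8, "big")
    return raw[1:].decode("utf-8", "surrogatepass")  # drop the sentinel


class IntStateMixin:
    """Hypothetical mixin: expose self.buffer as an integer state."""

    def getstate(self):
        return buffer_to_int(self.buffer)

    def setstate(self, state):
        self.buffer = int_to_buffer(state)


class IntStateIdnaEncoder(IntStateMixin, codecs.getincrementalencoder("idna")):
    pass


first = IntStateIdnaEncoder()
first.encode("x")                  # 'x' is buffered, nothing is emitted
state = first.getstate()           # now an int, and 0 only when nothing is pending

second = IntStateIdnaEncoder()
second.setstate(state)             # restores the pending 'x'
resumed = second.encode(".")
print(type(state).__name__, resumed)
```

With this scheme the size of the integer grows with the amount of buffered text, so it is unbounded in principle.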

And indeed the incremental encoder for idna behaves strangely:

>>> import io
>>> b = io.BytesIO()
>>> s = io.TextIOWrapper(b, 'idna')
>>> s.write('x')
1
>>> s.tell()
0
>>> b.getvalue()
b''
>>> s.write('.')
1
>>> s.tell()
2
>>> b.getvalue()
b'x.'
>>> b = io.BytesIO()
>>> s = io.TextIOWrapper(b, 'idna')
>>> s.write('x')
1
>>> s.seek(s.tell())
0
>>> s.write('.')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/walter/.local/lib/python3.3/codecs.py", line 218, in encode
    (result, consumed) = self._buffer_encode(data, self.errors, final)
  File "/Users/walter/.local/lib/python3.3/encodings/idna.py", line 246, in _buffer_encode
    result.extend(ToASCII(label))
  File "/Users/walter/.local/lib/python3.3/encodings/idna.py", line 73, in ToASCII
    raise UnicodeError("label empty or too long")
UnicodeError: label empty or too long

The cleanest solution would probably be to switch to a (buffered_input, additional_state_info) state.

However, I don't know what changes this would require in the seek/tell implementations.
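
For illustration, a tuple-based state could be sketched as follows. TupleStateIdnaEncoder is a hypothetical subclass, not an existing or proposed API, and the constant 0 used as additional_state_info is an assumption (this encoder has no extra state to record).

```python
import codecs


class TupleStateIdnaEncoder(codecs.getincrementalencoder("idna")):
    """Hypothetical variant whose state mirrors the decoder convention."""

    def getstate(self):
        # (buffered_input, additional_state_info)
        return (self.buffer, 0)

    def setstate(self, state):
        buffered_input, _additional_state_info = state
        self.buffer = buffered_input


encoder = TupleStateIdnaEncoder()
encoder.encode("x")                # 'x' is buffered, nothing is emitted
saved = encoder.getstate()

restored = TupleStateIdnaEncoder()
restored.setstate(saved)           # the pending 'x' survives the round trip
result = restored.encode(".")
print(saved, result)
```

The sketch only covers the encoder side. Any caller that passes a plain integer to setstate() would fail with this variant, so the seek/tell code would have to change along with it.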
msg222473 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2014-07-07 16:47
IncrementalNewlineDecoder requires the decoder state to be an integer (the C implementation requires an unsigned integer of at most 63 bits). TextIOWrapper requires the decoder state to be an unsigned integer of at most 64 bits (only 63 bits if universal newlines mode is enabled).
msg234164 - Author: Martin Panter (martin.panter) (Python committer) Date: 2015-01-17 11:12
For what it’s worth, both io.TextIOWrapper and _pyio.TextIOWrapper appear to only ever call IncrementalEncoder.setstate(0). And the newline _decoder_ is not relevant because it doesn’t use any _encoder_.
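
The effect of setstate(0) on a buffering encoder can be shown without TextIOWrapper. This is a minimal sketch at the encoder level; it assumes the idna encoder keeps pending input in its buffer attribute, as BufferedIncrementalEncoder does, and the variable names are illustrative.

```python
import codecs

encoder = codecs.getincrementalencoder("idna")()
encoder.encode("x")                # 'x' is buffered, nothing is emitted
pending_before = encoder.buffer

# setstate(0) is the call that TextIOWrapper is reported to make.
encoder.setstate(0)
pending_after = encoder.buffer     # the pending 'x' has been discarded

try:
    # With the 'x' gone, the '.' now terminates an empty label.
    encoder.encode(".")
    error = None
except UnicodeError as exc:
    error = exc

print(repr(pending_before), repr(pending_after), type(error).__name__)
```

This matches the failure in the seek(tell()) session above: the buffered label is dropped, and the following '.' is rejected as an empty label.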
History
Date                 User              Action  Args
2022-04-11 14:57:57  admin             set     github: 64619
2021-12-09 22:10:24  iritkatriel       set     components: + Library (Lib)
                                               versions: + Python 3.9, Python 3.10, Python 3.11, - Python 2.7, Python 3.3, Python 3.4
2015-01-17 11:12:55  martin.panter     set     nosy: + martin.panter
                                               messages: + msg234164
2014-07-07 16:47:41  serhiy.storchaka  set     messages: + msg222473
2014-01-31 14:18:42  doerwalter        set     messages: + msg209791
2014-01-28 16:34:45  serhiy.storchaka  create