classification
Title: BUG in codecs.BufferedIncrementalDecoder
Type:
Stage:
Components: Library (Lib)
Versions: Python 3.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: doerwalter, jamercee, serhiy.storchaka
Priority: normal
Keywords: patch

Created on 2019-10-15 08:21 by jamercee, last changed 2019-10-17 08:59 by doerwalter.

Files
File name Uploaded Description
codecs.patch jamercee, 2019-10-15 08:27 Patch to codecs.py
Messages (6)
msg354707 - Author: Jim Carroll (jamercee) * Date: 2019-10-15 08:21
While working with codecs.iterdecode(), I encountered "can't concat int to bytes". The source of the problem is that BufferedIncrementalDecoder stores its internal buffer as bytes (i.e. b""), but decode() can be passed either a byte string or, in the case of iterdecode(), an int. The solution is to test for this in decode() and, if passed an int, coerce it to bytes (see attached patch).

Platform: Python 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)] on win32

Code to demonstrate the issue:

>>> import codecs
>>> source = ''.join([chr(x) for x in range(256)])
>>> enc = b''.join(codecs.iterencode(source, 'utf-8'))
>>> list(''.join(codecs.iterdecode(enc, 'utf-8')))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python37\lib\codecs.py", line 1048, in iterdecode
    output = decoder.decode(input)
  File "C:\Python37\lib\codecs.py", line 321, in decode
    data = self.buffer + input
TypeError: can't concat int to bytes
msg354711 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-10-15 10:42
The first argument of iterdecode() should be an iterable of bytes objects, not a bytes object. Try codecs.iterdecode([enc], 'utf-8').
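The suggestion above can be checked against the original reproduction. This is a minimal sketch that reuses the names source and enc from the report; wrapping enc in a one-element list is the only change.

```python
import codecs

# Same setup as in the report: every code point from 0 to 255, UTF-8 encoded.
source = ''.join(chr(x) for x in range(256))
enc = b''.join(codecs.iterencode(source, 'utf-8'))

# [enc] is an iterable that yields one bytes object, which is what
# iterdecode() expects. Passing enc itself would yield ints instead.
decoded = ''.join(codecs.iterdecode([enc], 'utf-8'))
```

With the list wrapper, decoded should round-trip back to source.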
msg354716 - Author: Jim Carroll (jamercee) * Date: 2019-10-15 11:59
According to the documentation (https://docs.python.org/3.7/library/codecs.html#codecs.iterdecode), the first parameter is a bytes object to decode (not an iterable of bytes). This is also consistent with its companion iterencode(), which accepts a str object, not an iterable of chars.

Seems logical that one should be able to pass the output from iterencode() as the direct input to iterdecode() without having to convert, no?
msg354731 - Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2019-10-15 14:04
The documentation might be unclear here. But the argument iterator of

   iterdecode(iterator, encoding, errors='strict', **kwargs)

*is* supposed to be an iterable over bytes objects.

In fact iterencode() transforms an iterator over strings into an iterator over bytes and iterdecode() transforms an iterator over bytes into an iterator over strings.

Since iterating over strings iterates over the characters, it's possible to pass a string to iterencode(). However it's not possible to pass a bytes object to iterdecode() since iterating over a bytes object yields integers:

>>> import codecs
>>> list(codecs.iterencode(['spam'], 'utf-8'))
[b'spam']
>>> list(codecs.iterencode('spam', 'utf-8'))
[b's', b'p', b'a', b'm']
>>> list(codecs.iterdecode([b'spam'], 'utf-8'))
['spam']
>>> list(codecs.iterdecode(b'spam', 'utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 1048, in iterdecode
    output = decoder.decode(input)
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 321, in decode
    data = self.buffer + input
TypeError: can't concat int to bytes
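The asymmetry described in this message can be seen without iterdecode() at all. The sketch below is only an illustration: it uses codecs.getincrementaldecoder() to feed one int to a UTF-8 incremental decoder directly, then shows that slicing (unlike indexing or iterating) keeps each item a bytes object. The variable names are made up for the example.

```python
import codecs

text = 'spam'
data = text.encode('utf-8')

# Iterating a str yields 1-character str objects ...
str_items = list(text)
# ... but iterating a bytes object yields ints.
bytes_items = list(data)

# Feeding one of those ints to an incremental decoder fails, which is what
# happens inside iterdecode() when it is given a bare bytes object.
decoder = codecs.getincrementaldecoder('utf-8')()
try:
    decoder.decode(bytes_items[0])
    failed = False
except TypeError:
    failed = True

# Slicing keeps each item a bytes object of length 1.
one_byte_chunks = [data[i:i + 1] for i in range(len(data))]
decoded = list(codecs.iterdecode(one_byte_chunks, 'utf-8'))
```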
msg354755 - Author: Jim Carroll (jamercee) * Date: 2019-10-15 23:17
I understand.

BTW, I did a deep dive on the CPython codebase, and the only references to codecs.iterencode()/iterdecode() are in ./Lib/test/test_codecs.py. I suspect the functions are not used by many people.

The patch I proposed was a three-line change that would allow passing either an int or bytes. Not sure if that sways any opinions on this topic.
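The attached codecs.patch is not reproduced in this thread, so the following is not the patch itself. It is a rough sketch of the same coercion idea applied outside the standard library, using a hypothetical helper named as_bytes_chunks that turns each int into a one-byte bytes object before the data reaches iterdecode().

```python
import codecs

def as_bytes_chunks(iterable):
    """Hypothetical helper: yield bytes objects, coercing ints to 1-byte bytes."""
    for item in iterable:
        if isinstance(item, int):
            # bytes([n]) builds a single byte with value n.
            # bytes(n) would instead build n zero bytes, which is not wanted.
            yield bytes([item])
        else:
            yield item

# 'é' encodes to two bytes in UTF-8, so the decoder must buffer the first
# byte until the second one arrives.
original = 'h\u00e9llo'
enc = original.encode('utf-8')
decoded = ''.join(codecs.iterdecode(as_bytes_chunks(enc), 'utf-8'))
```

A helper like this leaves codecs.py untouched and still lets a plain bytes object be passed through.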

If we decide to just stick with existing functionality, a small clarification to the docs might be in order?
msg354836 - Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2019-10-17 08:59
codecs.iterencode()/iterdecode() are just shallow 10-line wrappers around incremental codecs (which are used as the basis of io streams).

Note that the doc string for iterencode() contains:

   Encodes the input strings from the iterator using an IncrementalEncoder.

i.e. "strings" (plural) should give a hint that iterator is an iterator over strings.

But maybe this could be made clearer.

And https://docs.python.org/3/library/codecs.html#codecs.iterencode and https://docs.python.org/3/library/codecs.html#codecs.iterdecode could indeed be clearer about what iterator should be. An example might also help.
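One possible shape for such an example is sketched below. It is not taken from the documentation; the input strings and the split position are arbitrary choices meant to show that the argument is an iterable of chunks and that a chunk boundary may fall inside a multi-byte character.

```python
import codecs

# iterencode(): an iterable of str chunks in, an iterator of bytes chunks out.
str_chunks = ['sp', 'am \u20ac']          # the euro sign is 3 bytes in UTF-8
encoded = list(codecs.iterencode(str_chunks, 'utf-8'))

# Re-split the byte stream so that the boundary falls inside the euro sign.
raw = b''.join(encoded)
byte_chunks = [raw[:6], raw[6:]]

# iterdecode(): an iterable of bytes chunks in, an iterator of str chunks out.
# The incremental decoder holds the incomplete character until it is complete.
decoded = list(codecs.iterdecode(byte_chunks, 'utf-8'))
```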
History
Date User Action Args
2019-10-17 08:59:44 doerwalter set messages: + msg354836
2019-10-15 23:17:59 jamercee set messages: + msg354755
2019-10-15 14:04:43 doerwalter set nosy: + doerwalter
                               messages: + msg354731
2019-10-15 11:59:23 jamercee set messages: + msg354716
2019-10-15 10:42:54 serhiy.storchaka set nosy: + serhiy.storchaka
                                     messages: + msg354711
2019-10-15 08:27:39 jamercee set files: + codecs.patch
2019-10-15 08:26:27 jamercee set files: - codecs.patch
2019-10-15 08:21:30 jamercee create