Issue 38482: BUG in codecs.BufferedIncrementalDecoder

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/82663

classification

Title:	BUG in codecs.BufferedIncrementalDecoder
Type:		Stage:
Components:	Library (Lib)	Versions:	Python 3.7

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	doerwalter, jamercee, serhiy.storchaka
Priority:	normal	Keywords:	patch

Created on 2019-10-15 08:21 by jamercee, last changed 2022-04-11 14:59 by admin.

Files
File name	Uploaded	Description
codecs.patch	jamercee, 2019-10-15 08:27	Patch to codecs.py

Messages (6)
msg354707	Author: Jim Carroll (jamercee) *	Date: 2019-10-15 08:21
While working with codecs.iterdecode(), I encountered "can't concat int to bytes". The source of the problem is that BufferedIncrementalDecoder stores its internal buffer as bytes (i.e. b""), but decode() can be passed either a byte string or, in the case of iterdecode(), an int. The solution is to test for this in decode() and, if passed an int, coerce it to bytes (see attached patch).

Platform: Python 3.7.4 (tags/v3.7.4:e09359112e, Jul 8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)] on win32

Code to demonstrate the issue:

>>> import codecs
>>> source = ''.join([chr(x) for x in range(256)])
>>> enc = b''.join(codecs.iterencode(source, 'utf-8'))
>>> list(''.join(codecs.iterdecode(enc, 'utf-8')))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python37\lib\codecs.py", line 1048, in iterdecode
    output = decoder.decode(input)
  File "C:\Python37\lib\codecs.py", line 321, in decode
    data = self.buffer + input
TypeError: can't concat int to bytes
msg354711	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2019-10-15 10:42
The first argument of iterdecode() should be an iterable of bytes objects, not a bytes object. Try codecs.iterdecode([enc], 'utf-8')
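
A minimal sketch of that suggestion, reusing the variable names from the original report. Wrapping enc in a one-element list is the only change; nothing here goes beyond the documented codecs API.

```python
import codecs

# Same input as in the original report: code points 0..255.
source = ''.join(chr(x) for x in range(256))
enc = b''.join(codecs.iterencode(source, 'utf-8'))

# Wrap the bytes object in a list so that the decoder receives bytes
# chunks rather than the ints produced by iterating over bytes directly.
decoded = ''.join(codecs.iterdecode([enc], 'utf-8'))

print(decoded == source)
```

With the list wrapper the round trip should succeed, because decode() is then called with a bytes object instead of an int.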
msg354716	Author: Jim Carroll (jamercee) *	Date: 2019-10-15 11:59
According to the documentation (https://docs.python.org/3.7/library/codecs.html#codecs.iterdecode), the first parameter is a bytes object to decode (not an iterable of bytes). This is also consistent with its companion iterencode(), which accepts a str object, not an iterable of chars. It seems logical that one should be able to pass the output from iterencode() as the direct input to iterdecode() without having to convert, no?
msg354731	Author: Walter Dörwald (doerwalter) *	Date: 2019-10-15 14:04
The documentation might be unclear here. But the argument iterator of iterdecode(iterator, encoding, errors='strict', **kwargs) *is* supposed to be an iterable over bytes objects. In fact, iterencode() transforms an iterator over strings into an iterator over bytes, and iterdecode() transforms an iterator over bytes into an iterator over strings.

Since iterating over strings iterates over the characters, it's possible to pass a string to iterencode(). However, it's not possible to pass a bytes object to iterdecode(), since iterating over a bytes object yields integers:

>>> import codecs
>>> list(codecs.iterencode(['spam'], 'utf-8'))
[b'spam']
>>> list(codecs.iterencode('spam', 'utf-8'))
[b's', b'p', b'a', b'm']
>>> list(codecs.iterdecode([b'spam'], 'utf-8'))
['spam']
>>> list(codecs.iterdecode(b'spam', 'utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 1048, in iterdecode
    output = decoder.decode(input)
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 321, in decode
    data = self.buffer + input
TypeError: can't concat int to bytes
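
To illustrate why the chunks need to be bytes objects, here is a small sketch that splits a UTF-8 byte string in the middle of a two-byte character. The sample text and the split position are chosen purely for illustration; the incremental decoder is expected to hold back the incomplete byte until the next chunk arrives.

```python
import codecs

# Iterating over a bytes object yields ints, not length-1 bytes objects.
ints = list(b'spam')            # [115, 112, 97, 109]

text = 'Dörwald'
data = text.encode('utf-8')     # 'ö' encodes to two bytes: b'\xc3\xb6'

# Split between the two bytes of 'ö'. The first chunk ends with an
# incomplete sequence, which the decoder buffers until the next chunk.
chunks = [data[:2], data[2:]]   # b'D\xc3' and b'\xb6rwald'
pieces = list(codecs.iterdecode(chunks, 'utf-8'))

print(ints)
print(pieces)                   # ['D', 'örwald']
```

The chunk boundaries do not have to line up with character boundaries, which is the point of going through an incremental decoder instead of calling bytes.decode() on each chunk.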
msg354755	Author: Jim Carroll (jamercee) *	Date: 2019-10-15 23:17
I understand. By the way, I did a deep dive into the CPython codebase, and the only references to codecs.iterencode()/iterdecode() are in ./Lib/tests/test_codecs.py. I suspect these functions are not used by many people. The patch I proposed was a three-line change that would allow passing either an int or bytes... not sure if that sways any opinions on this topic. If we decide to just stick with the existing functionality, a small clarification to the docs might be in order?
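
The contents of codecs.patch are not reproduced in this thread, so the following is not that patch. It is only a sketch of the same idea (accept either ints or bytes) done outside the standard library, using two hypothetical helpers, as_bytes_chunks() and iterdecode_lenient(), that do not exist in codecs.

```python
import codecs

def as_bytes_chunks(iterable):
    """Hypothetical helper: yield bytes chunks, coercing ints to single bytes."""
    for item in iterable:
        if isinstance(item, int):
            # bytes([65]) == b'A'; note that bytes(65) would instead
            # create 65 zero bytes, which is not what is wanted here.
            yield bytes([item])
        else:
            yield item

def iterdecode_lenient(data, encoding, errors='strict'):
    """Hypothetical wrapper: like codecs.iterdecode(), but tolerates ints."""
    return codecs.iterdecode(as_bytes_chunks(data), encoding, errors)

# A bytes object can now be passed directly; it is fed one byte at a time.
result = ''.join(iterdecode_lenient('Dörwald'.encode('utf-8'), 'utf-8'))
print(result)
```

Feeding one byte per call works but is slower than passing larger chunks, so wrapping the bytes object in a list is still the simpler choice when the whole input is already in memory.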
msg354836	Author: Walter Dörwald (doerwalter) *	Date: 2019-10-17 08:59
codecs.iterencode()/iterdecode() are just shallow 10-line wrappers around incremental codecs (which are used as the basis of io streams). Note that the doc string for iterencode() contains:

    Encodes the input strings from the iterator using an IncrementalEncoder.

i.e. "strings" (plural) should give a hint that iterator is an iterator over strings. But maybe this could be made clearer. And https://docs.python.org/3/library/codecs.html#codecs.iterencode and https://docs.python.org/3/library/codecs.html#codecs.iterdecode could indeed be clearer about what iterator should be. An example might also help.
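
As a rough picture of what such a shallow wrapper looks like, here is a sketch built on codecs.getincrementaldecoder(). The function name iterdecode_sketch is made up for this example, and the body is an approximation of the idea, not a copy of the standard library source.

```python
import codecs

def iterdecode_sketch(iterator, encoding, errors='strict'):
    """Illustrative stand-in for codecs.iterdecode() (name is hypothetical)."""
    decoder = codecs.getincrementaldecoder(encoding)(errors)
    for chunk in iterator:
        # Each chunk must be a bytes object; the decoder appends it to
        # its internal bytes buffer, which is where an int would fail.
        output = decoder.decode(chunk)
        if output:
            yield output
    # Flush whatever is left in the buffer at the end of the input.
    output = decoder.decode(b'', True)
    if output:
        yield output

chunks = [b'sp', b'am']
sketch_result = list(iterdecode_sketch(chunks, 'utf-8'))
stdlib_result = list(codecs.iterdecode(chunks, 'utf-8'))
print(sketch_result, stdlib_result)
```

Seen this way, the iterator argument is simply the sequence of chunks fed to decode(), which is why each element has to be a bytes object.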

History
Date	User	Action	Args
2022-04-11 14:59:21	admin	set	github: 82663
2019-10-17 08:59:44	doerwalter	set	messages: + msg354836
2019-10-15 23:17:59	jamercee	set	messages: + msg354755
2019-10-15 14:04:43	doerwalter	set	nosy: + doerwalter messages: + msg354731
2019-10-15 11:59:23	jamercee	set	messages: + msg354716
2019-10-15 10:42:54	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg354711
2019-10-15 08:27:39	jamercee	set	files: + codecs.patch
2019-10-15 08:26:27	jamercee	set	files: - codecs.patch
2019-10-15 08:21:30	jamercee	create