Issue 12508: codecs.StreamReader doesn't pass final=1 to the UTF-8 codec

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/56717

classification

Title:	codecs.StreamReader doesn't pass final=1 to the UTF-8 codec
Type:	behavior	Stage:
Components:	Unicode	Versions:	Python 3.10, Python 3.9, Python 3.8

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	ezio.melotti, lemburg, spatz123, vstinner
Priority:	high	Keywords:

Created on 2011-07-06 22:45 by spatz123, last changed 2022-04-11 14:57 by admin.

Files
File name	Uploaded	Description
fffd.py	spatz123, 2011-07-06 22:45
fffd-2.py	serhiy.storchaka, 2012-06-15 08:57	Demo script. 2+3 portable, added io.open.

Messages (6)
msg139954 - (view)	Author: Saul Spatz (spatz123)	Date: 2011-07-06 22:45
The attached script prints 'A\ufffdBC\ufffd' and then 'A\ufffdBC', although it seems to me that both lines should be the same. The first line is correct, I think, since the <F4> at the end is a maximal subpart of an ill-formed subsequence, according to the definition in the Unicode standard.
msg139956 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-07-06 22:53
I confirm, there is a bug in codecs.StreamReader.
msg139957 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-07-06 22:55
You should use the io module, it doesn't have the bug :)
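As a rough illustration of the io-based workaround suggested above, the sketch below decodes the same bytes used later in this thread through io.TextIOWrapper. The in-memory io.BytesIO source and the variable names are assumptions made for the example; the attached scripts are not reproduced here.

```python
import io

# Same bytes as in the thread: an invalid byte (F5) in the middle and a
# truncated 4-byte sequence start (F4) at the very end.
raw = b"A\xf5BC\xf4"

# Assumption for illustration: an in-memory stream stands in for a real file.
stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", errors="replace")

# Reading to EOF flushes the decoder, so the trailing F4 is replaced too.
text = stream.read()
print(ascii(text))
```

With a real file, io.open(path, encoding="utf-8", errors="replace") would play the same role as the wrapper built here.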
msg143473 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-09-04 04:20
IIUC this happens because StreamReader calls codecs.utf_8_decode without passing final=1 [0], so when the decoder finds the trailing F4 it doesn't decode it yet because it waits for the other 3 bytes (F4 is the start byte of a 4-byte UTF-8 sequence):

>>> b = b'A\xf5BC\xf4'
>>> chars, decnum = codecs.utf_8_decode(b, 'replace', 0)  # final=0
>>> chars, decnum
('A�BC', 4)  # F4 not decoded yet
>>> b = b[decnum:]
>>> b
b'\xf4'  # F4 still here
>>> chars, decnum = codecs.utf_8_decode(b, 'replace', 0)
>>> chars, decnum
('', 0)  # additional calls keep waiting for the other 3 bytes
>>> chars, decnum = codecs.utf_8_decode(b, 'replace', 1)  # final=1
>>> chars, decnum
('�', 1)  # when final=1 is passed F4 is decoded, but it never happens

While passing 1 makes the attached script work as expected, it breaks several other tests in test_codecs (apparently not all the decoders accept the 'final' argument). Also, passing 1 should be done only for the last call: read can be called several times with a specific size, and it shouldn't use final=1 until the last call, to avoid errors mid-stream.

[0]: see Lib/codecs.py:482
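The point about using final=1 only on the last call can be sketched as follows. This is not the code in Lib/codecs.py and not a proposed patch; decode_stream is a hypothetical helper that only shows how leftover bytes might be carried between calls while final is set for the last chunk alone.

```python
import codecs

def decode_stream(chunks, errors="replace"):
    """Hypothetical helper: decode UTF-8 chunks, passing final=True only last."""
    chunks = list(chunks)
    pending = b""   # bytes the decoder has not consumed yet
    pieces = []
    for index, chunk in enumerate(chunks):
        is_last = index == len(chunks) - 1
        data = pending + chunk
        # utf_8_decode returns (decoded_text, number_of_bytes_consumed).
        text, consumed = codecs.utf_8_decode(data, errors, is_last)
        pieces.append(text)
        pending = data[consumed:]
    return "".join(pieces)

# The trailing F4 is replaced only because the last call uses final=True.
truncated = decode_stream([b"A\xf5B", b"C\xf4"])

# A sequence split across chunks (the euro sign) is still decoded whole,
# because the first call uses final=False and keeps the partial bytes.
split = decode_stream([b"\xe2\x82", b"\xac"])

print(ascii(truncated), ascii(split))
```

If final=True were passed on every call, the split euro sign in the second example would be replaced mid-stream, which is the failure mode the message warns about.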
msg144373 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2011-09-21 11:20
The final parameter is an extension to the decoder API signature, so it's not surprising that not all codecs implement it. The ones that do should use it for all calls, since that way the actual number of bytes consumed is correctly reported back to the StreamReader instance.

Note: The parameter name "final" is a bit misleading. Previously, the number of bytes consumed by the decoder was always reported as len(buffer), since the C API for decoders did not provide a way to report back the number of bytes consumed. This changed when stateful decoders were added to the C API, since these do allow reporting back the consumed bytes. A more appropriate name for the parameter would have been "report_bytes_consumed".
msg144376 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-09-21 12:23
AFAIU final means:
* final=0: I'm passing in a few bytes, but there are more to come, so if the last byte(s) don't make sense on their own (e.g. a start byte whose continuation bytes are missing), wait for the others before raising an error;
* final=1: these are the last bytes, so if the last byte(s) don't make sense, raise an error (or ignore/replace) because there won't be other bytes that might turn them into a well-formed byte sequence.
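The two meanings of final described above can be observed with the standard incremental decoder API. The snippet is a small sketch using the bytes from earlier in this thread; the variable names are chosen for the example only.

```python
import codecs

# Incremental UTF-8 decoder that substitutes U+FFFD for ill-formed input.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

# final=False: more bytes may follow, so the trailing F4 is held back.
first = decoder.decode(b"A\xf5BC\xf4", final=False)

# final=True: nothing else is coming, so the buffered F4 is replaced.
rest = decoder.decode(b"", final=True)

print(ascii(first), ascii(rest))

# With errors="strict", the same final=True call raises instead of replacing.
strict = codecs.getincrementaldecoder("utf-8")(errors="strict")
strict.decode(b"\xf4", final=False)          # buffered, no error yet
try:
    strict.decode(b"", final=True)
    raised = False
except UnicodeDecodeError:
    raised = True
```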

History
Date	User	Action	Args
2022-04-11 14:57:19	admin	set	github: 56717
2020-11-11 19:26:15	vstinner	set	title: Codecs Anomaly -> codecs.StreamReader doesn't pass final=1 to the UTF-8 codec
2020-11-11 18:22:23	iritkatriel	set	versions: + Python 3.8, Python 3.9, Python 3.10, - Python 2.7, Python 3.2, Python 3.3
2012-06-15 08:57:38	serhiy.storchaka	set	files: + fffd-2.py versions: + Python 2.7, Python 3.3
2011-09-21 12:23:35	ezio.melotti	set	messages: + msg144376
2011-09-21 11:20:51	lemburg	set	messages: + msg144373
2011-09-17 16:47:47	ezio.melotti	set	nosy: + lemburg
2011-09-04 04:20:13	ezio.melotti	set	messages: + msg143473
2011-07-06 23:26:17	rhettinger	set	priority: normal -> high
2011-07-06 22:55:17	vstinner	set	messages: + msg139957
2011-07-06 22:53:14	vstinner	set	nosy: + vstinner messages: + msg139956
2011-07-06 22:52:30	ezio.melotti	set	nosy: + ezio.melotti
2011-07-06 22:45:47	spatz123	create