classification
Title: codecs.StreamReader doesn't pass final=1 to the UTF-8 codec
Type: behavior
Stage:
Components: Unicode
Versions: Python 3.10, Python 3.9, Python 3.8
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: ezio.melotti, lemburg, spatz123, vstinner
Priority: high
Keywords:

Created on 2011-07-06 22:45 by spatz123, last changed 2022-04-11 14:57 by admin.

Files
File name Uploaded Description
fffd.py spatz123, 2011-07-06 22:45
fffd-2.py serhiy.storchaka, 2012-06-15 08:57 Demo script. 2+3 portable, added io.open.
Messages (6)
msg139954 - (view) Author: Saul Spatz (spatz123) Date: 2011-07-06 22:45
The attached script produces the output 

'A\ufffdBC\ufffd'
'A\ufffdBC'

although it seems to me that both lines should be the same.  The first line is correct, I think, since the <F4> at the end is a maximal subpart of an ill-formed subsequence, according to the definition in the Unicode standard.
msg139956 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-07-06 22:53
I confirm, there is a bug in codecs.StreamReader.
msg139957 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-07-06 22:55
You should use the io module, it doesn't have the bug :)
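
For comparison, here is a minimal sketch that reads the same bytes through both modules. The byte string is taken from the interactive session later in this thread; the variable names are illustrative and this is not necessarily what the attached fffd.py does. The codecs.StreamReader result depends on whether the interpreter in use still has the behavior reported here, so the script prints it rather than assuming it.

```python
import codecs
import io

# Invalid start byte F5 in the middle, truncated 4-byte start byte F4 at the end.
data = b'A\xf5BC\xf4'

# io.TextIOWrapper flushes its incremental decoder with final=True at EOF,
# so the trailing F4 is turned into U+FFFD.
with io.TextIOWrapper(io.BytesIO(data), encoding='utf-8', errors='replace') as f:
    text_io = f.read()

# codecs.StreamReader, for comparison; per this report it may drop the trailing F4.
reader = codecs.getreader('utf-8')(io.BytesIO(data), errors='replace')
text_codecs = reader.read()

print(ascii(text_io))
print(ascii(text_codecs))
```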
msg143473 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-09-04 04:20
IIUC this happens because StreamReader calls codecs.utf_8_decode without passing final=1 [0], so when the decoder finds the trailing F4 it doesn't decode it yet because it waits for the other 3 bytes (F4 is the start byte of a 4-byte UTF-8 sequence):

>>> b = b'A\xf5BC\xf4'
>>> chars, decnum = codecs.utf_8_decode(b, 'replace', 0)  # final=0
>>> chars, decnum
('A�BC', 4)  # F4 not decoded yet
>>> b = b[decnum:]
>>> b
b'\xf4'  # F4 still here
>>> chars, decnum = codecs.utf_8_decode(b, 'replace', 0)
>>> chars, decnum
('', 0)  # additional calls keep waiting for the other 3 bytes
>>> chars, decnum = codecs.utf_8_decode(b, 'replace', 1)  # final=1
>>> chars, decnum
('�', 1)  # when final=1 is passed F4 is decoded, but it never happens

While passing 1 makes the attached script work as expected, it breaks several other tests in test_codecs (apparently not all the decoders accept the 'final' argument).
Also passing 1 should be done only for the last call: read can be called several times with a specific size, and it shouldn't use final=1 until the last call to avoid errors mid-stream.

[0]: see Lib/codecs.py:482
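
The "final only on the last call" idea can be sketched with the incremental decoder API. This is not a patch for StreamReader; decode_chunks is a hypothetical helper, and the chunk boundaries are chosen only to show a multi-byte character split across two reads.

```python
import codecs

def decode_chunks(chunks, encoding='utf-8', errors='replace'):
    """Decode an iterable of byte chunks, passing final=True only at the end."""
    decoder = codecs.getincrementaldecoder(encoding)(errors=errors)
    # final defaults to False: an incomplete trailing sequence is buffered.
    parts = [decoder.decode(chunk) for chunk in chunks]
    # One last call with final=True flushes whatever is still buffered.
    parts.append(decoder.decode(b'', final=True))
    return ''.join(parts)

# The euro sign (E2 82 AC) is split across the two chunks and still decodes
# cleanly; only the truncated F4 at the very end is replaced.
result = decode_chunks([b'A\xe2\x82', b'\xacBC\xf4'])
print(ascii(result))
```

Passing final=True on every call instead would treat the split E2 82 at the end of the first chunk as an error mid-stream.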
msg144373 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-09-21 11:20
The final parameter is an extension to the decoder API signature,
so it's not surprising that not all codecs implement it.

The ones that do should use it for all calls, since that way
the actual consumed number of bytes is correctly reported
back to the StreamReader instance.

Note: The parameter name "final" is a bit misleading. What happens
is that the number of bytes consumed by the decoder was previously
always reported as len(buffer), since the C API for decoders did
not provide a way to report back the number of bytes consumed.
This was changed when stateful decoders were added to the C API,
since these do allow reporting back the consumed bytes. A more
appropriate name for the parameter would have been
"report_bytes_consumed".
msg144376 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-09-21 12:23
AFAIU final means:
  * final=0: I'm passing in a few bytes, but there are more to come, so if the last byte(s) don't make sense on their own (e.g. it's a start byte but the continuation bytes are missing), wait for the others before raising an error;
  * final=1: these are the last bytes, so if the last byte(s) don't make sense, raise an error (or ignore/replace) because there won't be other bytes that might turn them into a well-formed byte sequence.
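
A small sketch of the two cases above, using the 'strict' error handler so the difference shows up as an exception rather than a replacement character. The input bytes are an illustrative choice (a truncated euro sign), not taken from the attached scripts.

```python
import codecs

partial = b'\xe2\x82'  # first two bytes of the three-byte euro sign (E2 82 AC)

# final=0: nothing is decoded and nothing is consumed; the decoder waits
# for more input.
waiting = codecs.utf_8_decode(partial, 'strict', False)

# final=1: the same bytes are now an error, because no further input can
# complete the sequence.
try:
    codecs.utf_8_decode(partial, 'strict', True)
    raised = False
except UnicodeDecodeError:
    raised = True

print(waiting, raised)
```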
History
Date User Action Args
2022-04-11 14:57:19 admin set github: 56717
2020-11-11 19:26:15 vstinner set title: Codecs Anomaly -> codecs.StreamReader doesn't pass final=1 to the UTF-8 codec
2020-11-11 18:22:23 iritkatriel set versions: + Python 3.8, Python 3.9, Python 3.10, - Python 2.7, Python 3.2, Python 3.3
2012-06-15 08:57:38 serhiy.storchaka set files: + fffd-2.py
versions: + Python 2.7, Python 3.3
2011-09-21 12:23:35 ezio.melotti set messages: + msg144376
2011-09-21 11:20:51 lemburg set messages: + msg144373
2011-09-17 16:47:47 ezio.melotti set nosy: + lemburg
2011-09-04 04:20:13 ezio.melotti set messages: + msg143473
2011-07-06 23:26:17 rhettinger set priority: normal -> high
2011-07-06 22:55:17 vstinner set messages: + msg139957
2011-07-06 22:53:14 vstinner set nosy: + vstinner
messages: + msg139956
2011-07-06 22:52:30 ezio.melotti set nosy: + ezio.melotti
2011-07-06 22:45:47 spatz123 create