Message 143473 - Python tracker



Author	ezio.melotti
Recipients	ezio.melotti, spatz123, vstinner
Date	2011-09-04.04:20:13
Message-id	<1315110014.35.0.231303484438.issue12508@psf.upfronthosting.co.za>
In-reply-to

Content
IIUC this happens because StreamReader calls codecs.utf_8_decode without passing final=1 [0], so when the decoder finds the trailing F4 it doesn't decode it yet, because it waits for the other 3 bytes (F4 is the start byte of a 4-byte UTF-8 sequence):

>>> import codecs
>>> b = b'A\xf5BC\xf4'
>>> chars, decnum = codecs.utf_8_decode(b, 'replace', 0)  # final=0
>>> chars, decnum
('A�BC', 4)  # F4 not decoded yet
>>> b = b[decnum:]
>>> b
b'\xf4'  # F4 still here
>>> chars, decnum = codecs.utf_8_decode(b, 'replace', 0)
>>> chars, decnum
('', 0)  # additional calls keep waiting for the other 3 bytes
>>> chars, decnum = codecs.utf_8_decode(b, 'replace', 1)  # final=1
>>> chars, decnum
('�', 1)  # with final=1 F4 is decoded, but StreamReader never passes final=1

While passing 1 makes the attached script work as expected, it breaks several other tests in test_codecs (apparently not all the decoders accept the 'final' argument).
Also, 1 should be passed only on the last call: read can be called several times with a specific size, and it shouldn't use final=1 until the last call, to avoid errors mid-stream.
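
To illustrate the "final only on the last call" point, here is a rough sketch using the incremental decoder API (codecs.getincrementaldecoder) rather than StreamReader itself. The helper names decode_chunks and decode_chunks_always_final are made up for this example and are not part of Lib/codecs.py, and this is not a proposed patch, just a demonstration of the two behaviours:

```python
import codecs

def decode_chunks(chunks, errors="replace"):
    """Decode UTF-8 byte chunks, passing final=True only once, at the end.

    Hypothetical helper for illustration only.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors)
    # final defaults to False, so incomplete trailing bytes are buffered
    parts = [decoder.decode(chunk) for chunk in chunks]
    # flush whatever is still buffered (e.g. a lone trailing F4)
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)

def decode_chunks_always_final(chunks, errors="replace"):
    """Same input, but final=1 on every chunk (hypothetical, shows the problem)."""
    return "".join(codecs.utf_8_decode(chunk, errors, True)[0] for chunk in chunks)

# The bytes from this report: the trailing F4 is handled once final=True is passed.
print(ascii(decode_chunks([b"A\xf5", b"BC", b"\xf4"])))

# A valid 3-byte character (U+20AC) split across two reads survives only
# when final is withheld until the end.
euro_split = [b"\xe2\x82", b"\xac"]
print(ascii(decode_chunks(euro_split)))
print(ascii(decode_chunks_always_final(euro_split)))
```

The second helper shows why final=1 can't simply be passed on every read: a character that straddles two reads gets treated as truncated input and is replaced instead of being completed by the next chunk.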

[0]: see Lib/codecs.py:482

History
Date	User	Action	Args
2011-09-04 04:20:14	ezio.melotti	set	recipients: + ezio.melotti, vstinner, spatz123
2011-09-04 04:20:14	ezio.melotti	set	messageid: <1315110014.35.0.231303484438.issue12508@psf.upfronthosting.co.za>
2011-09-04 04:20:13	ezio.melotti	link	issue12508 messages
2011-09-04 04:20:13	ezio.melotti	create