Message 159363 - Python tracker


Author	pitrou
Recipients	Arfrever, Henri.Salo, Huzaifa.Sidhpurwala, asvetlov, benjamin.peterson, ezio.melotti, loewis, pitrou, serhiy.storchaka, vstinner
Date	2012-04-26.11:54:12
Message-id	<1335441165.3421.1.camel@localhost.localdomain>
In-reply-to	<1335440186.13.0.485206103926.issue14579@psf.upfronthosting.co.za>

Content
> I ran tests of utf16_error_handling-3.2_4.patch on Python 3.1. Two tests are failing:
>  - b'\x00\xd8'.decode('utf-16le', 'replace')='\ufffd\ufffd' != '\ufffd'
>  - b'\xd8\x00'.decode('utf-16be', 'replace')='\ufffd\ufffd' != '\ufffd'
> 
> I don't think that the test is correct: UTF-16 should resynchronize as
> early as possible (ignore the first invalid byte and restart at the
> following byte), so '\ufffd\ufffd' is the correct answer.

UTF-16 code units are 16-bit words, not bytes, so '\ufffd' sounds correct to
me. You resynchronize on the word boundary: the invalid word is skipped.
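A minimal sketch of the word-boundary behaviour described above, assuming a current CPython 3 where the UTF-16 decoders treat a lone surrogate as one invalid 16-bit unit (the Python 3.1 build quoted above evidently returned two replacement characters instead). The trailing 'A' is my addition, there only to show where decoding resumes:

```python
# 0xD800 is a high surrogate with no low surrogate after it: one invalid
# 16-bit code unit, followed by a valid 'A'.
le_data = b'\x00\xd8' + 'A'.encode('utf-16le')
be_data = b'\xd8\x00' + 'A'.encode('utf-16be')

# With 'replace', the whole 2-byte unit becomes a single U+FFFD and
# decoding resumes at the next 16-bit boundary.
result_le = le_data.decode('utf-16le', 'replace')
result_be = be_data.decode('utf-16be', 'replace')

print(ascii(result_le))  # '\ufffdA'
print(ascii(result_be))  # '\ufffdA'
```

If the decoder restarted one byte later instead, the remaining bytes would be read out of alignment and the 'A' would not survive.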

>  - with UTF-8 decoder: (b'\xC3' +
> '\xe9'.encode('utf-8')).decode('utf-8', 'replace') returns '\ufffd\xe9'

That's because UTF-8 operates on bytes: the invalid byte is skipped.
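For comparison, the UTF-8 case from the quoted message can be reproduced as follows. This is only an illustration of the byte-level resynchronization; the variable names are mine:

```python
# b'\xc3' is a UTF-8 lead byte that must be followed by a continuation byte.
# Here it is followed by another lead byte, so the first b'\xc3' is invalid.
data = b'\xC3' + '\xe9'.encode('utf-8')   # b'\xc3\xc3\xa9'

# Only the single invalid byte is replaced; decoding restarts at the very
# next byte, where b'\xc3\xa9' is a valid encoding of U+00E9.
result = data.decode('utf-8', 'replace')

print(ascii(result))  # '\ufffd\xe9'
```

Here the unit of resynchronization is one byte, because the UTF-8 code unit is one byte.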

History
Date	User	Action	Args
2012-04-26 11:54:13	pitrou	set	recipients: + pitrou, loewis, vstinner, benjamin.peterson, ezio.melotti, Arfrever, asvetlov, Henri.Salo, Huzaifa.Sidhpurwala, serhiy.storchaka
2012-04-26 11:54:12	pitrou	link	issue14579 messages
2012-04-26 11:54:12	pitrou	create