Message159363
> I ran tests of utf16_error_handling-3.2_4.patch on Python 3.1. Two tests are failing:
> - b'\x00\xd8'.decode('utf-16le', 'replace')='\ufffd\ufffd' != '\ufffd'
> - b'\xd8\x00'.decode('utf-16be', 'replace')='\ufffd\ufffd' != '\ufffd'
>
> I don't think that the test is correct: UTF-16 should resynchronize as
> early as possible (ignore the first invalid byte and restart at the
> following byte), so '\ufffd\ufffd' is the correct answer.
UTF-16 units are 16-bit words, not bytes, so '\ufffd' sounds correct to
me. You resynchronize on the word boundary: the invalid word is skipped.
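
A minimal sketch of the word-boundary behaviour described above, assuming a recent CPython 3 (the 3.1 build under test in this issue may give different results). The variable names are illustrative only and do not come from the patch.

```python
# Assumption: run on a recent CPython 3; older releases may differ.

# A lone lead surrogate (U+D800) in little-endian order: one 16-bit unit.
lone_lead_le = b'\x00\xd8'

# A stray trail surrogate (U+DC00) followed by a valid unit for 'A'.
stray_trail_then_a = b'\x00\xdcA\x00'

# The decoder works on 16-bit units, so the invalid unit is replaced as a whole.
truncated = lone_lead_le.decode('utf-16le', 'replace')

# Decoding resumes at the next 16-bit boundary, so 'A' survives intact.
resynced = stray_trail_then_a.decode('utf-16le', 'replace')

print(ascii(truncated), len(truncated))
print(ascii(resynced), len(resynced))
```

If the decoder resynchronized on the following byte instead, the two bytes of the invalid unit would be reported separately rather than as one replacement character.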
> - with UTF-8 decoder: (b'\xC3' +
> '\xe9'.encode('utf-8')).decode('utf-8', 'replace') returns
> '\ufffd\xe9'
That's because UTF-8 operates on bytes: the invalid byte is skipped.
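
For comparison, a small sketch of the UTF-8 case quoted above, again assuming a recent CPython 3. Only the single invalid byte is replaced, and decoding resumes at the very next byte.

```python
# b'\xc3' is a lead byte that must be followed by a continuation byte.
# Here it is followed by another lead byte, so it is invalid on its own.
data = b'\xC3' + '\xe9'.encode('utf-8')   # b'\xc3\xc3\xa9'

decoded = data.decode('utf-8', 'replace')

# One replacement character for the bad byte, then the intact U+00E9.
print(ascii(decoded), len(decoded))
```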
Date | User | Action | Args
2012-04-26 11:54:13 | pitrou | set | recipients: + pitrou, loewis, vstinner, benjamin.peterson, ezio.melotti, Arfrever, asvetlov, Henri.Salo, Huzaifa.Sidhpurwala, serhiy.storchaka
2012-04-26 11:54:12 | pitrou | link | issue14579 messages
2012-04-26 11:54:12 | pitrou | create |