Message327357
The following code issues a misleading exception message:
>>> b'\xed\xa0\xbd\xed\xb3\x9e'.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
The cause of the exception is *not* an invalid continuation byte, but UTF-8 encoded surrogates. In fact, using the 'surrogatepass' error handler decodes the same bytes without raising an exception:
>>> b'\xed\xa0\xbd\xed\xb3\x9e'.decode("utf-8", "surrogatepass")
'\ud83d\udcde'
I would have expected an exception message like:
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-2: surrogates not allowed
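To illustrate the kind of check that would allow such a message, here is a minimal sketch in pure Python. The helpers `is_encoded_surrogate` and `explain_decode_error` are hypothetical names made up for this example; they are not part of CPython and do not reflect how the codec is actually implemented.

```python
def is_encoded_surrogate(data: bytes, pos: int) -> bool:
    """Hypothetical helper: True if data[pos:pos+3] is a three-byte
    UTF-8-style encoding of a code point in U+D800..U+DFFF."""
    chunk = data[pos:pos + 3]
    return (
        len(chunk) == 3
        and chunk[0] == 0xED             # lead byte for U+D000..U+DFFF
        and 0xA0 <= chunk[1] <= 0xBF     # narrows the range to U+D800..U+DFFF
        and 0x80 <= chunk[2] <= 0xBF     # ordinary continuation byte
    )


def explain_decode_error(data: bytes) -> str:
    """Hypothetical helper: describe why strict UTF-8 decoding fails."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        if is_encoded_surrogate(data, exc.start):
            end = exc.start + 2
            return f"bytes in position {exc.start}-{end}: surrogates not allowed"
        return f"position {exc.start}: {exc.reason}"
    return "ok"


print(explain_decode_error(b'\xed\xa0\xbd\xed\xb3\x9e'))
```

The sketch relies only on the `start` and `reason` attributes of `UnicodeDecodeError`, so it works as a wrapper around the existing decoder rather than a change to it.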
(Note that the input bytes are an improperly UTF-8 encoded version of U+1F4DE (telephone receiver): each half of its UTF-16 surrogate pair was encoded separately.)
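The relationship between these bytes and U+1F4DE can be checked with a short sketch. It decodes with 'surrogatepass', then round-trips through UTF-16 to combine the surrogate pair into a single code point. The variable names are only illustrative.

```python
data = b'\xed\xa0\xbd\xed\xb3\x9e'

# Decode the two three-byte sequences into lone surrogates.
surrogates = data.decode("utf-8", "surrogatepass")

# Round-trip through UTF-16 so the high/low pair is joined
# into the single code point it represents.
text = surrogates.encode("utf-16", "surrogatepass").decode("utf-16")

# The well-formed UTF-8 encoding of that code point is four bytes long.
proper = text.encode("utf-8")

print(ascii(surrogates), ascii(text), proper)
```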
Date | User | Action | Args
2018-10-08 15:38:20 | doerwalter | set | recipients: + doerwalter, vstinner, ezio.melotti
2018-10-08 15:38:20 | doerwalter | set | messageid: <1539013100.29.0.545547206417.issue34935@psf.upfronthosting.co.za>
2018-10-08 15:38:20 | doerwalter | link | issue34935 messages
2018-10-08 15:38:20 | doerwalter | create |