Message237572
>>> b"\xed\xb4\x80".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
The actual problem here is that this byte sequence would decode to U+DD00, which is a surrogate and therefore not valid in UTF-8. It's correct to raise UnicodeDecodeError, but the text of the message is a bit obscure. I'm not sure whether "invalid continuation byte" refers to the "0xed in position 0" or to one of the bytes that follow it. 0xED is not a continuation byte at all; it's a start byte, and a perfectly valid one:
>>> b"\xed\x9f\xbf".decode("utf-8")
'\ud7ff'
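To make the arithmetic concrete, here is a minimal sketch that unpacks a three-byte UTF-8 sequence by hand and checks whether the result falls in the surrogate range U+D800..U+DFFF. The helper names `decode_three_byte` and `is_surrogate` are made up for illustration and are not part of any Python API.

```python
def decode_three_byte(seq):
    """Unpack one 3-byte UTF-8 sequence (1110xxxx 10yyyyyy 10zzzzzz) by hand."""
    b0, b1, b2 = seq
    # Only the bit patterns are checked here, not whether the result is valid.
    if b0 & 0xF0 != 0xE0 or b1 & 0xC0 != 0x80 or b2 & 0xC0 != 0x80:
        raise ValueError("not a 3-byte UTF-8 sequence")
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_surrogate(code_point):
    """True for code points reserved for UTF-16 surrogates."""
    return 0xD800 <= code_point <= 0xDFFF


for seq in (b"\xed\xb4\x80", b"\xed\x9f\xbf"):
    cp = decode_three_byte(seq)
    print("U+%04X surrogate=%s" % (cp, is_surrogate(cp)))
# prints:
# U+DD00 surrogate=True
# U+D7FF surrogate=False
```

Both sequences start with 0xED; only the second byte decides whether the result lands in the surrogate range.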
Pike is more explicit about what the problem is:
> utf8_to_string("\xed\xb4\x80");
UTF-8 sequence beginning with 0xed 0xb4 at index 0 would decode to a UTF-16 surrogate character.
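As a rough illustration of what a more explicit message could look like on the Python side, the sketch below wraps `bytes.decode` and rewrites the error when the failing bytes start an encoded surrogate. The wrapper `decode_utf8_explicit` and its message wording are hypothetical; this is not how the codec itself is implemented, and a real fix would belong in the decoder.

```python
def decode_utf8_explicit(data):
    """Decode UTF-8, but report encoded surrogates with a clearer reason.

    Hypothetical wrapper for illustration only.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        lead = data[exc.start]
        has_second = exc.start + 1 < len(data)
        second = data[exc.start + 1] if has_second else None
        # 0xED followed by 0xA0..0xBF always yields U+D800..U+DFFF.
        if lead == 0xED and second is not None and 0xA0 <= second <= 0xBF:
            reason = ("sequence beginning with 0x%02x 0x%02x would decode "
                      "to a UTF-16 surrogate" % (lead, second))
            raise UnicodeDecodeError(
                "utf-8", data, exc.start, exc.start + 2, reason) from None
        raise  # any other decoding problem is left untouched


try:
    decode_utf8_explicit(b"\xed\xb4\x80")
except UnicodeDecodeError as exc:
    print(exc.reason)
```

Looking only at the first two bytes mirrors the Pike message, and it also covers input that is truncated before the third byte.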
Is this something worth fixing?
Tested on 3.4.2 and a recent build of 3.5, probably applies to most 3.x versions. (2.7 actually permits this, which is a bigger bug, but one with backward-compatibility issues.)
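For comparison, 3.x can be asked to accept such sequences explicitly through the standard `surrogatepass` error handler. The sketch below assumes that this is an acceptable stand-in for the lenient behaviour; it is not claimed to match 2.7 in every detail.

```python
data = b"\xed\xb4\x80"

# Opt in to lone surrogates when decoding.
text = data.decode("utf-8", "surrogatepass")
print(ascii(text))

# The same handler is needed to encode the string again.
round_tripped = text.encode("utf-8", "surrogatepass")
print(round_tripped == data)

# Without the handler, encoding is refused, just as decoding was.
try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    print(type(exc).__name__, exc.start)
```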
Date | User | Action | Args
2015-03-08 21:36:35 | Rosuav | set | recipients: + Rosuav, vstinner, ezio.melotti
2015-03-08 21:36:35 | Rosuav | set | messageid: <1425850595.48.0.491367645605.issue23614@psf.upfronthosting.co.za>
2015-03-08 21:36:35 | Rosuav | link | issue23614 messages
2015-03-08 21:36:35 | Rosuav | create |