
Author Rosuav
Recipients Rosuav, ezio.melotti, vstinner
Date 2015-03-08.21:36:35
Message-id <1425850595.48.0.491367645605.issue23614@psf.upfronthosting.co.za>
Content
>>> b"\xed\xb4\x80".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

The actual problem here is that this byte sequence would decode to U+DD00, which, being a surrogate, is not valid in UTF-8. It's correct to raise UnicodeDecodeError, but the text of the message is a bit obscure. I'm not sure whether "invalid continuation byte" refers to the "0xed in position 0" or to one of the bytes after it; 0xED is not a continuation byte, it's a start byte, and a perfectly valid one:

>>> b"\xed\x9f\xbf".decode("utf-8")
'\ud7ff'
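
To make the arithmetic concrete, here is a minimal sketch that unpacks a three-byte UTF-8 sequence by hand and reports the code point it would encode. The helper names `three_byte_code_point` and `is_surrogate` are made up for illustration and are not part of any Python API; the function checks only the bit layout, not whether the result is a legal scalar value.

```python
def three_byte_code_point(seq: bytes) -> int:
    """Return the code point encoded by a 3-byte UTF-8 sequence.

    Hypothetical helper: it checks only the bit patterns (1110xxxx
    10xxxxxx 10xxxxxx) and deliberately does not reject surrogates.
    """
    if (len(seq) != 3
            or seq[0] & 0xF0 != 0xE0
            or any(b & 0xC0 != 0x80 for b in seq[1:])):
        raise ValueError("not a structurally well-formed 3-byte sequence")
    # 4 payload bits from the start byte, 6 from each continuation byte.
    return ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)


def is_surrogate(code_point: int) -> bool:
    """True for the UTF-16 surrogate range U+D800..U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF


for raw in (b"\xed\xb4\x80", b"\xed\x9f\xbf"):
    cp = three_byte_code_point(raw)
    print(f"{raw!r}: U+{cp:04X}, surrogate={is_surrogate(cp)}")
```

Both sequences start with 0xED; only the second byte decides whether the result lands in the surrogate range, which is why blaming the start byte is confusing.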

Pike is more explicit about what the problem is:

> utf8_to_string("\xed\xb4\x80");
UTF-8 sequence beginning with 0xed 0xb4 at index 0 would decode to a UTF-16 surrogate character.

Is this something worth fixing?
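
As a rough illustration of what a more explicit message could look like, the following sketch wraps `bytes.decode` and rewrites the reason when the failing bytes are an encoded surrogate. The function `decode_utf8_explained` and the wording of its message are assumptions made for this example; this is not how the codec itself works, and a real fix would belong in the decoder.

```python
def decode_utf8_explained(data: bytes) -> str:
    """Decode UTF-8, giving a clearer reason for encoded surrogates.

    Hypothetical wrapper for illustration only.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        chunk = data[exc.start:exc.start + 3]
        # 0xED followed by 0xA0..0xBF encodes U+D800..U+DFFF.
        if (len(chunk) == 3
                and chunk[0] == 0xED
                and 0xA0 <= chunk[1] <= 0xBF
                and 0x80 <= chunk[2] <= 0xBF):
            cp = ((chunk[0] & 0x0F) << 12) | ((chunk[1] & 0x3F) << 6) | (chunk[2] & 0x3F)
            reason = f"sequence would decode to surrogate U+{cp:04X}"
            raise UnicodeDecodeError(
                exc.encoding, data, exc.start, exc.start + 3, reason
            ) from None
        raise  # any other decoding error is passed through unchanged


try:
    decode_utf8_explained(b"\xed\xb4\x80")
except UnicodeDecodeError as err:
    print(err)
```

The replacement error spans all three bytes rather than only the start byte, so the reported position matches the sequence that is actually at fault.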

Tested on 3.4.2 and a recent build of 3.5; this probably applies to most 3.x versions. (2.7 actually permits this, which is a bigger bug, but one with backward-compatibility issues.)