Author ezio.melotti
Recipients Rosuav, ezio.melotti, serhiy.storchaka, vstinner
Date 2015-03-13.17:57:35
Message-id <1426269455.76.0.231217505884.issue23614@psf.upfronthosting.co.za>
In-reply-to
Content
Table 3-7 of http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf (page 93 of the book, or 40 of the pdf) shows that if the start byte is ED, the byte that follows it must be in the range 80..9F.  This means that, in order to decode a sequence starting with ED, you need two more valid continuation bytes.  Since the following byte (B4) is not in the allowed range 80..9F, it is an invalid continuation byte, and the decoder doesn't know how to decode the byte in position 0 (i.e. ED).
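
As a rough illustration of the rule above, here is a minimal sketch of the second-byte check for three-byte sequences, as I read Table 3-7.  The helper names second_byte_range and second_byte_ok are made up for this example and are not part of CPython; only the final decode call exercises the real codec.

```python
def second_byte_range(lead):
    """Allowed range for the byte that follows a three-byte lead (E0..EF).

    Hypothetical helper transcribing the three-byte rows of Table 3-7.
    """
    if not 0xE0 <= lead <= 0xEF:
        raise ValueError("not a three-byte lead byte")
    if lead == 0xE0:
        return (0xA0, 0xBF)  # lower values would be overlong encodings
    if lead == 0xED:
        return (0x80, 0x9F)  # higher values would encode U+D800..U+DFFF
    return (0x80, 0xBF)


def second_byte_ok(lead, second):
    """True if `second` is a valid byte after `lead` (hypothetical helper)."""
    low, high = second_byte_range(lead)
    return low <= second <= high


data = b"\xed\xb4\x80"
# B4 falls outside 80..9F, so the check fails for this sequence.
print(second_byte_ok(data[0], data[1]))

# The real decoder rejects the sequence at the ED byte.
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc.start, exc.reason)
```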

It is also true that this particular sequence, if allowed, would result in a surrogate.  However, by looking at the first two bytes only, you don't have enough information to be sure about that (e.g. ED B4 00 doesn't decode to a surrogate, so Pike's error message is imprecise).
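
To make the point about the third byte concrete, here is a small sketch that combines the payload bits of a three-byte sequence while deliberately ignoring the Table 3-7 restrictions, then checks whether the result lands in the surrogate range.  The function names are hypothetical and exist only for this example; "surrogatepass" is the standard error handler, used here only to show what the sequence would decode to if it were allowed.

```python
def raw_three_byte_value(b0, b1, b2):
    """Combine the payload bits of a three-byte sequence (hypothetical helper).

    Ignores the Table 3-7 range restrictions on purpose.  Returns None if
    b0 is not a three-byte lead or b1/b2 do not have the 10xxxxxx form.
    """
    if (b0 & 0xF0) != 0xE0 or (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        return None
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def would_be_surrogate(b0, b1, b2):
    """True if the sequence would yield U+D800..U+DFFF (hypothetical helper)."""
    value = raw_three_byte_value(b0, b1, b2)
    return value is not None and 0xD800 <= value <= 0xDFFF


# Same first two bytes, different third byte, different answer.
print(would_be_surrogate(0xED, 0xB4, 0x80))
print(would_be_surrogate(0xED, 0xB4, 0x00))

# With surrogatepass the codec accepts the first sequence.
print(hex(ord(b"\xed\xb4\x80".decode("utf-8", "surrogatepass"))))
```

Both sequences start with ED B4, so a message based on those two bytes alone can only say "possible" surrogate.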

If handling this special case doesn't require too much extra code, it would be ok with me to have something like:
>>> b"\xed\xb4\x80".decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte (possible start of a surrogate)
History
Date User Action Args
2015-03-13 17:57:35 ezio.melotti set recipients: + ezio.melotti, vstinner, Rosuav, serhiy.storchaka
2015-03-13 17:57:35 ezio.melotti set messageid: <1426269455.76.0.231217505884.issue23614@psf.upfronthosting.co.za>
2015-03-13 17:57:35 ezio.melotti link issue23614 messages
2015-03-13 17:57:35 ezio.melotti create