Message 238043 - Python tracker

Author	ezio.melotti
Recipients	Rosuav, ezio.melotti, serhiy.storchaka, vstinner
Date	2015-03-13.17:57:35
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1426269455.76.0.231217505884.issue23614@psf.upfronthosting.co.za>
In-reply-to

Content
Table 3-7 of http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf (page 93 of the book, or 40 of the pdf) shows that if the start byte is ED, the second byte must be in the range 80..9F.  This means that, in order to decode a sequence starting with ED, you need two more valid continuation bytes.  Since the following byte (B4) is not in the allowed range 80..9F and is thus an invalid continuation byte, the decoder doesn't know how to decode the byte in position 0 (i.e. ED).
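
As a rough illustration of the rule above, here is a minimal sketch of the second-byte ranges that Table 3-7 gives for three-byte sequences, plus a small wrapper that reports what the decoder does.  The names second_byte_range and try_decode are made up for this example and are not part of CPython.

```python
def second_byte_range(lead: int) -> range:
    """Allowed second byte for a three-byte lead byte, per Table 3-7 (hypothetical helper)."""
    if lead == 0xE0:
        return range(0xA0, 0xC0)  # A0..BF, rules out overlong forms
    if lead == 0xED:
        return range(0x80, 0xA0)  # 80..9F, rules out surrogates
    if 0xE1 <= lead <= 0xEF:
        return range(0x80, 0xC0)  # 80..BF, the generic continuation range
    raise ValueError("not a three-byte lead byte")


def try_decode(data: bytes) -> str:
    """Return the repr of the decoded text, or a short description of the error."""
    try:
        return repr(data.decode("utf-8"))
    except UnicodeDecodeError as exc:
        return f"error at {exc.start}: {exc.reason}"


print(0xB4 in second_byte_range(0xED))  # B4 falls outside 80..9F
print(try_decode(b"\xed\x9f\xbf"))      # last code point before the surrogate block
print(try_decode(b"\xed\xb4\x80"))      # the sequence discussed in this issue
```

The error is reported at position 0 because ED itself cannot be decoded once the byte after it is rejected.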

It is also true that this particular sequence, if allowed, would result in a surrogate.  However, by looking at the first two bytes only, you don't have enough information to be sure about that (e.g. ED B4 00 doesn't decode to a surrogate, so Pike's error message is imprecise).

If handling this special case doesn't require too much extra code, it would be ok with me to have something like:
>>> b"\xed\xb4\x80".decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte (possible start of a surrogate)
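
To make the suggestion concrete, here is a pure-Python sketch of how such a hint could be attached.  This is not how the codec is implemented (the real change would live in the C decoder), and decode_with_surrogate_hint is a hypothetical name used only for illustration.  It looks only at the byte after ED, which is why the hint says "possible".

```python
def decode_with_surrogate_hint(data: bytes) -> str:
    """Decode UTF-8, adding a hint when ED is followed by A0..BF (hypothetical helper)."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        following = data[exc.start + 1:exc.start + 2]
        if data[exc.start] == 0xED and following and 0xA0 <= following[0] <= 0xBF:
            # Re-raise with the same position but an extended reason.
            raise UnicodeDecodeError(
                exc.encoding, exc.object, exc.start, exc.end,
                exc.reason + " (possible start of a surrogate)",
            ) from None
        raise


try:
    decode_with_surrogate_hint(b"\xed\xb4\x80")
except UnicodeDecodeError as exc:
    print(exc)

# With the "surrogatepass" error handler the same bytes decode to a lone surrogate.
print(ascii(b"\xed\xb4\x80".decode("utf-8", "surrogatepass")))
```

Note that ED B4 00 would get the same hint, even though it could never decode to a surrogate, since the third byte is not inspected.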

History
Date	User	Action	Args
2015-03-13 17:57:35	ezio.melotti	set	recipients: + ezio.melotti, vstinner, Rosuav, serhiy.storchaka
2015-03-13 17:57:35	ezio.melotti	set	messageid: <1426269455.76.0.231217505884.issue23614@psf.upfronthosting.co.za>
2015-03-13 17:57:35	ezio.melotti	link	issue23614 messages
2015-03-13 17:57:35	ezio.melotti	create