Message 237572 - Python tracker

Author	Rosuav
Recipients	Rosuav, ezio.melotti, vstinner
Date	2015-03-08.21:36:35
Message-id	<1425850595.48.0.491367645605.issue23614@psf.upfronthosting.co.za>

Content

>>> b"\xed\xb4\x80".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

The actual problem is that this byte sequence would decode to U+DD00, which is a surrogate and therefore not valid in UTF-8. Raising UnicodeDecodeError is correct, but the text of the message is obscure. It's unclear whether "invalid continuation byte" refers to the "0xed in position 0" or to one of the bytes after it; 0xED is not a continuation byte, it's a start byte, and a perfectly valid one:

>>> b"\xed\x9f\xbf".decode("utf-8")
'\ud7ff'
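
To make the difference between the two sequences concrete, here is a minimal sketch that unpacks a 3-byte UTF-8 sequence by hand, with no validation at all. The helper names `decode_three_byte` and `is_surrogate` are made up for illustration and are not part of any codec API.

```python
def decode_three_byte(seq: bytes) -> int:
    """Return the code point a 3-byte UTF-8 sequence encodes, without validating it."""
    lead, c1, c2 = seq
    # Layout: 1110xxxx 10yyyyyy 10zzzzzz -> xxxxyyyyyyzzzzzz
    return ((lead & 0x0F) << 12) | ((c1 & 0x3F) << 6) | (c2 & 0x3F)


def is_surrogate(cp: int) -> bool:
    """True for code points in the UTF-16 surrogate range U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


bad = decode_three_byte(b"\xed\xb4\x80")
ok = decode_three_byte(b"\xed\x9f\xbf")
print(hex(bad), is_surrogate(bad))  # 0xdd00 True
print(hex(ok), is_surrogate(ok))    # 0xd7ff False
```

Both sequences start with 0xED; only the second byte decides whether the result lands in the surrogate range, which is why blaming 0xED itself is misleading.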

Pike is more explicit about what the problem is:

> utf8_to_string("\xed\xb4\x80");
UTF-8 sequence beginning with 0xed 0xb4 at index 0 would decode to a UTF-16 surrogate character.
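
A message along those lines can be approximated in pure Python by wrapping the decode call. This is only a sketch of the kind of wording I have in mind, not a proposed patch to the codec: `decode_utf8_explaining` is a hypothetical helper, and it assumes the codec reports the 0xED lead byte as the error start, as it does in the traceback above.

```python
def decode_utf8_explaining(data: bytes) -> str:
    """Decode UTF-8, re-raising surrogate failures with a more explicit reason."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        i = exc.start
        # 0xED followed by 0xA0..0xBF is exactly the encoding of U+D800..U+DFFF.
        if data[i] == 0xED and i + 1 < len(data) and 0xA0 <= data[i + 1] <= 0xBF:
            reason = (
                "sequence beginning with 0xed 0x%02x would decode to a "
                "UTF-16 surrogate" % data[i + 1]
            )
            end = min(i + 3, len(data))
            raise UnicodeDecodeError("utf-8", data, i, end, reason) from None
        # Any other decoding failure keeps its original message.
        raise


try:
    decode_utf8_explaining(b"\xed\xb4\x80")
except UnicodeDecodeError as exc:
    print(exc.reason)
```

The exception type stays the same, so existing `except UnicodeDecodeError` handlers would be unaffected; only the reason text changes.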

Is this something worth fixing?

Tested on 3.4.2 and a recent build of 3.5; this probably applies to most 3.x versions. (2.7 actually permits decoding this sequence, which is a bigger bug, but one with backward-compatibility issues.)
