
classification
Title: Opaque error message on UTF-8 decoding to surrogates
Type: enhancement
Stage:
Components: Interpreter Core, Unicode
Versions: Python 3.4, Python 3.5

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Rosuav, ezio.melotti, serhiy.storchaka, vstinner
Priority: normal
Keywords:

Created on 2015-03-08 21:36 by Rosuav, last changed 2022-04-11 14:58 by admin.

Messages (6)
msg237572 - (view) Author: Chris Angelico (Rosuav) * Date: 2015-03-08 21:36
>>> b"\xed\xb4\x80".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

The actual problem here is that this byte sequence would decode to U+DD00, which, being a surrogate, is invalid for the encoding. It's correct to raise UnicodeDecodeError, but the text of the message is a bit obscure. I'm not sure whether the "invalid continuation byte" is talking about the "0xed in position 0" or about one of the others; 0xED is not a continuation byte, it's a start byte - and a perfectly valid one:

>>> b"\xed\x9f\xbf".decode("utf-8")
'\ud7ff'

Pike is more explicit about what the problem is:

> utf8_to_string("\xed\xb4\x80");
UTF-8 sequence beginning with 0xed 0xb4 at index 0 would decode to a
UTF-16 surrogate character.

Is this something worth fixing?

Tested on 3.4.2 and a recent build of 3.5, probably applies to most 3.x versions. (2.7 actually permits this, which is a bigger bug, but one with backward-compatibility issues.)
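
A minimal sketch of what the exception object exposes for this input, plus the standard `surrogatepass` error handler, which is the existing opt-in way to let such a sequence through. The helper name `inspect_decode_error` is made up for illustration.

```python
def inspect_decode_error(data: bytes):
    """Return (start, end, reason) of the UTF-8 decode error, or None if data decodes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # start/end delimit the bytes the codec blames; reason is the text
        # that appears after the colon in the error message.
        return exc.start, exc.end, exc.reason
    return None


print(inspect_decode_error(b"\xed\xb4\x80"))  # the failing sequence from this report
print(inspect_decode_error(b"\xed\x9f\xbf"))  # same start byte, valid sequence

# surrogatepass admits surrogate code points instead of raising.
# ascii() is used because printing a lone surrogate directly may fail.
print(ascii(b"\xed\xb4\x80".decode("utf-8", "surrogatepass")))
```

Only byte 0 is reported (`start=0`, `end=1`), even though it is the byte after it that the decoder rejects.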
msg237573 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-03-08 22:01
The UTF-8 codec can't decode byte 0xed because 0xed on its own is not a complete UTF-8 sequence and the following byte is not a valid continuation byte for it.

UTF-8 codec can produce errors of three types:

* "invalid start byte". When the byte is not a valid start byte of a UTF-8 sequence (valid start bytes are %x00-7F and %xC2-F4).
* "invalid continuation byte". When the byte that follows an unfinished UTF-8 sequence is not a valid continuation byte (the validity depends on the previous byte).
* "unexpected end of data". When there are no bytes after an unfinished UTF-8 sequence.
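
The three error types can be reproduced with one short input each. This is only an illustrative sketch; the helper name `decode_error_reason` is not part of any API.

```python
def decode_error_reason(data: bytes) -> str:
    """Return the 'reason' string of the UTF-8 decode error, or 'no error'."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.reason
    return "no error"


for raw in (
    b"\x80",      # a continuation byte with no start byte before it
    b"\xc3\x28",  # start byte of a 2-byte sequence followed by ASCII "("
    b"\xe2\x82",  # 3-byte sequence cut off after two bytes
):
    print(raw, "->", decode_error_reason(raw))
```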
msg238043 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2015-03-13 17:57
Table 3-7 of http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf (page 93 of the book, or 40 of the pdf) shows that if the start byte is ED, the byte that follows it must be in the range 80..9F.  This means that, in order to decode a sequence starting with ED, you need two more valid continuation bytes.  Since the following byte (B4) is not in the allowed range 80..9F and is thus an invalid continuation byte, the decoder doesn't know how to decode the byte in position 0 (i.e. ED).
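
As a rough sketch, the second-byte ranges of that table can be written down as data and checked directly. The names `SECOND_BYTE_RANGES` and `second_byte_ok` are hypothetical, and this is not how CPython's decoder is implemented; it only restates the table.

```python
# (lowest lead, highest lead, lowest second byte, highest second byte),
# transcribed from the well-formed UTF-8 byte sequence table.
SECOND_BYTE_RANGES = [
    (0xC2, 0xDF, 0x80, 0xBF),
    (0xE0, 0xE0, 0xA0, 0xBF),  # 80..9F would be a non-shortest form
    (0xE1, 0xEC, 0x80, 0xBF),
    (0xED, 0xED, 0x80, 0x9F),  # A0..BF would encode a surrogate
    (0xEE, 0xEF, 0x80, 0xBF),
    (0xF0, 0xF0, 0x90, 0xBF),  # 80..8F would be a non-shortest form
    (0xF1, 0xF3, 0x80, 0xBF),
    (0xF4, 0xF4, 0x80, 0x8F),  # 90..BF would exceed U+10FFFF
]


def second_byte_ok(lead: int, second: int) -> bool:
    """True if `second` may follow the multi-byte start byte `lead`."""
    for lo_lead, hi_lead, lo, hi in SECOND_BYTE_RANGES:
        if lo_lead <= lead <= hi_lead:
            return lo <= second <= hi
    # ASCII bytes and invalid start bytes take no continuation byte at all.
    return False


print(second_byte_ok(0xED, 0xB4))  # the pair from this report
print(second_byte_ok(0xED, 0x9F))
```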

It is also true that this particular sequence, if allowed, would result in a surrogate.  However, by looking at the first two bytes only, you don't have enough information to be sure about that (e.g. ED B4 00 doesn't decode to a surrogate, so Pike's error message is imprecise).

If handling this special case doesn't require too much extra code, it would be ok with me to have something like:
>>> b"\xed\xb4\x80".decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte (possible start of a surrogate)
msg238057 - (view) Author: Chris Angelico (Rosuav) * Date: 2015-03-13 21:53
Nice document. Is that actually how Python's decoder checks things? Does the decoder have different definitions of "valid continuation byte" based on the lead byte? If that's the case... well, ten out of ten for complying with the spec, to be sure, but unfortunately it leads to some opaque error messages!

I haven't looked into the code even a little bit, but would it be possible to have a specific error message attached to certain "invalid continuation bytes"?

* E0 followed by 80..9F: "non-shortest form"
* ED followed by A0..BF: "surrogate"
* F4 followed by 90..BF: "outside defined range"
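
To make the idea concrete, here is a pure-Python sketch of that mapping. It covers only the three cases listed above, the function name `classify_second_byte` is invented for illustration, and it says nothing about how the C decoder would have to be changed.

```python
def classify_second_byte(lead: int, second: int) -> str:
    """Name the special case, if any, for `second` following start byte `lead`."""
    if not 0x80 <= second <= 0xBF:
        # Not of the form 10xxxxxx, so it cannot continue any sequence.
        return "not a continuation byte"
    if lead == 0xE0 and second <= 0x9F:
        return "non-shortest form"
    if lead == 0xED and second >= 0xA0:
        return "surrogate"
    if lead == 0xF4 and second >= 0x90:
        return "outside defined range"
    # Anything else is outside the three cases listed in this message.
    return "no special case"


print(classify_second_byte(0xED, 0xB4))
print(classify_second_byte(0xE0, 0x80))
print(classify_second_byte(0xF4, 0x90))
```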

If that's too hard, it'd at least be helpful to point out that the "invalid continuation byte" is not the same as the "byte 0x?? in position ?" - the rejection here is actually of the B4 that follows it. How does this look?

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte 0xb4 for this start byte
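
A wording like this can be approximated today from outside the decoder, since the exception carries the input and the end of the rejected span. This is a sketch only: `describe_decode_error` is a hypothetical helper, and it assumes the rejected byte is the one at `exc.end`, which matches the example in this report.

```python
def describe_decode_error(data: bytes) -> str:
    """Return the decode error text, naming the rejected continuation byte."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        msg = str(exc)
        # Assumption: exc.end is the index of the byte the decoder refused.
        if exc.reason == "invalid continuation byte" and exc.end < len(data):
            msg += f" 0x{data[exc.end]:02x} for this start byte"
        return msg
    return "ok"


print(describe_decode_error(b"\xed\xb4\x80"))
```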

(BTW, I think Pike's decoder just always emits two bytes, no matter what the actual errant stream (after all, there's no way to know how many bytes "ought to have been" one character, when there's an error in it). So it's incomplete, yes, but when you're dealing with wrong data, completeness isn't all that possible anyway.)
msg238059 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2015-03-13 22:24
> Nice document. Is that actually how Python's decoder checks things?

Yes, Python follows the Unicode standard.

> * E0 followed by 80..9F: "non-shortest form"
> * ED followed by A0..BF: "surrogate"
> * F4 followed by 90..BF: "outside defined range"

If you get a decode error while using UTF-8, it means that you are trying to decode something that is not (valid) UTF-8.  I can see 3 situations where this might happen:
1) the input is using a different encoding;
2) the input is corrupted;
3) the input is using an encoding similar to UTF-8 (e.g. CESU-8);

In the first two cases, additional information about continuation bytes is meaningless and misleading (there's no such thing as shortest form or surrogates in e.g. ASCII).  In the third case (which is actually a special case of 1), mentioning surrogates and perhaps non-shortest form might be useful if the developer is intimately familiar with UTF-8 and Unicode, since they might then suspect that the input is actually CESU-8 or that the text has been encoded by an outdated encoder that follows the RFC 2044 specs from 1996.
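
For the CESU-8-like case, one possible workaround (a sketch, not a real CESU-8 codec) is to let the surrogates through with `surrogatepass` and then round-trip through UTF-16 so that each high/low pair is combined. The function name `decode_cesu8_like` is made up here; note that it also accepts ordinary 4-byte UTF-8, and that a lone, unpaired surrogate will still raise in the final step.

```python
def decode_cesu8_like(data: bytes) -> str:
    """Decode UTF-8 in which astral characters were written as surrogate pairs."""
    # Step 1: accept 3-byte sequences that encode surrogate code points.
    text = data.decode("utf-8", "surrogatepass")
    # Step 2: write the code units out as UTF-16 and read them back, which
    # merges each high/low surrogate pair into a single code point.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


# U+1F600 as the surrogate pair D83D DE00, each half encoded in three bytes.
sample = b"\xed\xa0\xbd\xed\xb8\x80"
print(ascii(decode_cesu8_like(sample)))
```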

> How does this look?
>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0:
> invalid continuation byte 0xb4 for this start byte

Something similar would be ok with me, assuming it is easy to implement in the code.
msg242164 - (view) Author: Chris Angelico (Rosuav) * Date: 2015-04-28 04:22
Got around to tracking down where this is actually being done. It's in Objects/stringlib/codecs.h and it looks to be a hot area for optimization. I don't want to fiddle with it without knowing a lot about the performance implications (UTF-8 encode/decode being a pretty common operation), and since my knowledge of CPU operation costs is about fifteen or twenty years out of date, it's probably best I not do this. It would be nice if the message could be improved per Ezio's suggestion, but that would mean returning more information to the caller.
History
Date                 User              Action  Args
2022-04-11 14:58:13  admin             set     github: 67802
2015-04-28 04:22:07  Rosuav            set     messages: + msg242164
2015-03-13 22:24:52  ezio.melotti      set     messages: + msg238059
2015-03-13 21:53:27  Rosuav            set     messages: + msg238057
2015-03-13 17:57:35  ezio.melotti      set     type: enhancement; messages: + msg238043
2015-03-08 22:01:15  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg237573
2015-03-08 21:36:35  Rosuav            create