
Author ezio.melotti
Recipients Rosuav, ezio.melotti, serhiy.storchaka, vstinner
Date 2015-03-13.22:24:52
Message-id <1426285492.53.0.728871893418.issue23614@psf.upfronthosting.co.za>
Content
> Nice document. Is that actually how Python's decoder checks things?

Yes, Python follows the Unicode standard.

> * E0 followed by 80..9F: "non-shortest form"
> * ED followed by A0..BF: "surrogate"
> * F4 followed by 90..BF: "outside defined range"
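A quick way to see the three quoted ranges being rejected is to feed one sample sequence from each into the strict decoder. This is only a sketch: the sample bytes and the helper name `try_decode` are illustrative choices, not part of the decoder, and the exact reason string is whatever the running interpreter reports.

```python
# One sample byte sequence for each of the three ranges quoted above.
samples = {
    "non-shortest form": b"\xe0\x80\x80",           # E0 followed by 80..9F
    "surrogate": b"\xed\xa0\x80",                   # ED followed by A0..BF
    "outside defined range": b"\xf4\x90\x80\x80",   # F4 followed by 90..BF
}


def try_decode(data):
    """Return (start, reason) if strict UTF-8 decoding fails, else None.

    Hypothetical helper, used only for this illustration.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, exc.reason
    return None


results = {label: try_decode(data) for label, data in samples.items()}
for label, outcome in results.items():
    print(label, outcome)
```

All three samples fail at the start byte, and the decoder does not distinguish between the three causes in its message.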

If you get a decode error while using UTF-8, it means that you are trying to decode something that is not (valid) UTF-8.  I can see 3 situations where this might happen:
1) the input is using a different encoding;
2) the input is corrupted;
3) the input is using an encoding similar to UTF-8 (e.g. CESU-8).

In the first two cases, additional information about continuation bytes is meaningless and misleading (there's no such thing as non-shortest form or surrogates in e.g. ASCII).  In the third case (which is actually a special case of 1), mentioning surrogates and perhaps non-shortest form might be useful if the developer is intimately familiar with UTF-8 and Unicode, since they might suspect that the input is actually CESU-8 or that the text has been encoded by an outdated encoder that follows the RFC 2044 specs from 1996.
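For the third case, here is a rough sketch of how a developer who suspects CESU-8-like input could check that suspicion. It relies on the standard `surrogatepass` error handler; the function name `decode_cesu8_like` and the sample bytes are assumptions made for the illustration, and this is not a complete CESU-8 codec (Python does not ship one).

```python
# U+1F600 encoded the CESU-8 way: each UTF-16 surrogate (D83D, DE00)
# is written as its own 3-byte sequence starting with ED.
cesu8 = b"\xed\xa0\xbd\xed\xb8\x80"


def decode_cesu8_like(data):
    """Decode bytes that may contain surrogate pairs encoded as 3-byte sequences.

    Hypothetical helper: 'surrogatepass' lets the UTF-8 decoder accept the
    ED A0..BF sequences as lone surrogates, and a round trip through UTF-16
    joins each high/low pair into a single code point.
    """
    halves = data.decode("utf-8", "surrogatepass")
    return halves.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


try:
    cesu8.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

text = decode_cesu8_like(cesu8)
print(strict_ok, hex(ord(text)))
```

If the strict decode fails but this round trip yields sensible text, the input is more likely CESU-8 (or the output of an outdated encoder) than corrupted data.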

> How does this look?
>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0:
> invalid continuation byte 0xb4 for this start byte

Something similar would be ok with me, assuming it is easy to implement in the code.
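To illustrate what the proposed wording would look like, the message can be approximated in pure Python from the attributes already available on `UnicodeDecodeError`. This is only a sketch of the wording, not how the C decoder would implement it; `describe_decode_error` is a hypothetical name, and it assumes the failure is reported at the start byte with a reason that mentions a continuation byte.

```python
def describe_decode_error(exc):
    """Build a message in the proposed style from a UnicodeDecodeError.

    Hypothetical helper: it names the byte after the start byte only when
    the decoder's reason mentions a continuation byte; otherwise it falls
    back to the decoder's own reason.
    """
    data = exc.object
    start_byte = data[exc.start]
    message = (
        f"{exc.encoding!r} codec can't decode byte 0x{start_byte:02x} "
        f"in position {exc.start}: "
    )
    has_next = exc.start + 1 < len(data)
    if "continuation" in exc.reason and has_next:
        next_byte = data[exc.start + 1]
        message += f"invalid continuation byte 0x{next_byte:02x} for this start byte"
    else:
        message += exc.reason
    return message


try:
    b"\xed\xb4\x80".decode("utf-8")
    message = None
except UnicodeDecodeError as exc:
    message = describe_decode_error(exc)

print(message)
```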