Message 238059 - Python tracker


Author	ezio.melotti
Recipients	Rosuav, ezio.melotti, serhiy.storchaka, vstinner
Date	2015-03-13.22:24:52
Message-id	<1426285492.53.0.728871893418.issue23614@psf.upfronthosting.co.za>

Content

> Nice document. Is that actually how Python's decoder checks things?

Yes, Python follows the Unicode standard.

> * E0 followed by 80..9F: "non-shortest form"
> * ED followed by A0..BF: "surrogate"
> * F4 followed by 90..BF: "outside defined range"

If you get a decode error while using UTF-8, it means that you are trying to decode something that is not (valid) UTF-8.  I can see 3 situations where this might happen:
1) the input is using a different encoding;
2) the input is corrupted;
3) the input is using an encoding similar to UTF-8 (e.g. CESU-8).

In the first two cases, additional information about continuation bytes is meaningless and misleading (there's no such thing as non-shortest forms or surrogates in e.g. ASCII).  In the third case (which is actually a special case of the first), mentioning surrogates and perhaps non-shortest forms might be useful if the developer is intimately familiar with UTF-8 and Unicode, since they might suspect that the input is actually CESU-8 or that the text was encoded by an outdated encoder that follows the RFC 2044 spec from 1996.
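To make the three byte ranges and the CESU-8 case concrete, here is a minimal sketch, assuming Python 3.4 or later.  The sample byte strings and the try_decode helper are made up for illustration; only bytes.decode, str.encode and the "surrogatepass" error handler come from Python itself.

```python
# One sample per rejected (start byte, second byte) range quoted above.
SAMPLES = {
    "non-shortest form": b"\xe0\x80\x80",
    "surrogate": b"\xed\xa0\x80",
    "outside defined range": b"\xf4\x90\x80\x80",
}


def try_decode(data):
    """Return the decoded text, or the UnicodeDecodeError that was raised."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc


for label, data in SAMPLES.items():
    print(label, "->", try_decode(data))

# CESU-8 stores U+1F600 as two 3-byte sequences, one per UTF-16 surrogate
# (D83D and DE00), so a strict UTF-8 decoder rejects it.
cesu8 = b"\xed\xa0\xbd\xed\xb8\x80"

# "surrogatepass" lets the lone surrogates through; a round trip via UTF-16
# then joins the pair into a single code point.
halves = cesu8.decode("utf-8", "surrogatepass")
text = halves.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
```

Each of the three samples should be rejected by the strict decoder, while the CESU-8 input is recoverable only once you already suspect what it is, which is the point of case 3.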

> How does this look?
>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0:
> invalid continuation byte 0xb4 for this start byte

Something similar would be OK with me, assuming it is easy to implement in the code.
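For illustration only, the proposed wording can be approximated from Python code by inspecting the exception.  This is a rough sketch, not the decoder's implementation (which lives in C): describe_utf8_error is a hypothetical helper, and it assumes that when CPython reports "invalid continuation byte", exc.end is the index of the rejected byte.

```python
def describe_utf8_error(data):
    """Return a detailed message for invalid UTF-8, or None if data is valid.

    Hypothetical helper; CPython does not produce this wording itself.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        message = "'utf-8' codec can't decode byte 0x%02x in position %d: " % (
            data[exc.start], exc.start)
        # Assumption: for this reason string, exc.end points at the byte the
        # decoder refused to accept as a continuation of the start byte.
        if exc.reason == "invalid continuation byte" and exc.end < len(data):
            return message + (
                "invalid continuation byte 0x%02x for this start byte"
                % data[exc.end])
        # Any other failure: fall back to the decoder's own reason.
        return message + exc.reason
    return None


print(describe_utf8_error(b"\xed\xb4\x80"))
```

Under that assumption, the call above should yield a message shaped like the one quoted, naming 0xb4 as the offending byte after the 0xed start byte.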
