Message 238057 - Python tracker

Author	Rosuav
Recipients	Rosuav, ezio.melotti, serhiy.storchaka, vstinner
Date	2015-03-13.21:53:26
Message-id	<1426283607.48.0.685234730526.issue23614@psf.upfronthosting.co.za>
In-reply-to

Content
Nice document. Is that actually how Python's decoder checks things? Does the decoder have different definitions of "valid continuation byte" based on the lead byte? If that's the case... well, ten out of ten for complying with the spec, to be sure, but unfortunately it leads to some opaque error messages!

I haven't looked into the code even a little bit, but would it be possible to have a specific error message attached to certain "invalid continuation bytes"?

* E0 followed by 80..9F: "non-shortest form"
* ED followed by A0..BF: "surrogate"
* F4 followed by 90..BF: "outside defined range"
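
To make the three cases above concrete, here is a minimal sketch of how a lead byte plus the byte after it could be mapped to a more specific reason. The table and the helper name explain_second_byte are hypothetical, made up for illustration; this is not how CPython's decoder is written, just the list above expressed as a lookup.

```python
from typing import Optional

# Hypothetical table (not taken from CPython): lead byte -> (range of second
# bytes that look like continuation bytes but must be rejected, proposed reason).
SPECIAL_CASES = {
    0xE0: (range(0x80, 0xA0), "non-shortest form"),      # E0 followed by 80..9F
    0xED: (range(0xA0, 0xC0), "surrogate"),              # ED followed by A0..BF
    0xF4: (range(0x90, 0xC0), "outside defined range"),  # F4 followed by 90..BF
}


def explain_second_byte(lead: int, second: int) -> Optional[str]:
    """Return the proposed specific reason, or None if no special case applies."""
    case = SPECIAL_CASES.get(lead)
    if case is not None and second in case[0]:
        return case[1]
    return None


if __name__ == "__main__":
    # ED B4 is the pair discussed in this message.
    print(explain_second_byte(0xED, 0xB4))
```

A decoder that already knows the lead byte when it rejects the second byte could, in principle, consult something like this before falling back to the generic message.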

If that's too hard, it'd at least be helpful to point out that the "invalid continuation byte" is not the same byte as the "byte 0x?? in position ?": what's actually being rejected here is the B4 that follows it. How does this look?

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte 0xb4 for this start byte
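
For comparison, here is a rough sketch of building that wording today from the exception's own attributes, without touching the decoder. The helper name proposed_message is hypothetical, and it assumes that for an "invalid continuation byte" error exc.end points at the byte that was rejected, which is worth double-checking against the decoder before relying on it.

```python
from typing import Optional


def proposed_message(data: bytes) -> Optional[str]:
    """Decode as UTF-8; on failure return the error text with the proposed suffix."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        message = str(exc)
        # Assumption: exc.end indexes the byte that failed the continuation check.
        # The bounds test skips errors that run up to the end of the data.
        if exc.reason == "invalid continuation byte" and exc.end < len(exc.object):
            bad = exc.object[exc.end]
            message += f" 0x{bad:02x} for this start byte"
        return message
    return None  # decoded cleanly, nothing to report


if __name__ == "__main__":
    # ED B4 80 would encode a surrogate, so the decoder rejects it.
    print(proposed_message(b"\xed\xb4\x80"))
```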

(BTW, I think Pike's decoder just always emits two bytes, no matter what the actual errant stream is; after all, there's no way to know how many bytes "ought to have been" one character when there's an error in it. So it's incomplete, yes, but when you're dealing with wrong data, completeness isn't all that possible anyway.)

History
Date	User	Action	Args
2015-03-13 21:53:27	Rosuav	set	recipients: + Rosuav, vstinner, ezio.melotti, serhiy.storchaka
2015-03-13 21:53:27	Rosuav	set	messageid: <1426283607.48.0.685234730526.issue23614@psf.upfronthosting.co.za>
2015-03-13 21:53:27	Rosuav	link	issue23614 messages
2015-03-13 21:53:26	Rosuav	create