This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

Author Rosuav
Recipients Rosuav, ezio.melotti, serhiy.storchaka, vstinner
Date 2015-03-13.21:53:26
Message-id <1426283607.48.0.685234730526.issue23614@psf.upfronthosting.co.za>
In-reply-to
Content
Nice document. Is that actually how Python's decoder checks things? Does the decoder have different definitions of "valid continuation byte" based on the lead byte? If that's the case... well, ten out of ten for complying with the spec, to be sure, but unfortunately it leads to some opaque error messages!

I haven't looked into the code even a little bit, but would it be possible to have a specific error message attached to certain "invalid continuation bytes"?

* E0 followed by 80..9F: "non-shortest form"
* ED followed by A0..BF: "surrogate"
* F4 followed by 90..BF: "outside defined range"
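
To make the three cases above concrete, here is a minimal Python sketch of the kind of lookup such an error message would need. It is not how CPython's decoder is written; `second_byte_problem` is a hypothetical name, and it covers only the three lead bytes listed above, using the ranges exactly as given there.

```python
def second_byte_problem(lead, second):
    """Return a specific reason why `second` cannot follow `lead`, or None.

    Hypothetical helper for illustration only; not CPython's decoder.
    Covers just the three lead bytes from the list above.
    """
    # Anything outside 80..BF is never a continuation byte.
    if not 0x80 <= second <= 0xBF:
        return "not a continuation byte"
    # E0 80..9F would encode a code point below U+0800.
    if lead == 0xE0 and second <= 0x9F:
        return "non-shortest form"
    # ED A0..BF would decode to U+D800..U+DFFF.
    if lead == 0xED and second >= 0xA0:
        return "surrogate"
    # F4 90..BF would decode to something above U+10FFFF.
    if lead == 0xF4 and second >= 0x90:
        return "outside defined range"
    return None


print(second_byte_problem(0xED, 0xB4))  # surrogate
```

A real patch would have to do this inside the decoder, where the lead byte is already known at the point the second byte is rejected.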

If that's too hard, it'd at least be helpful to point out that the "invalid continuation byte" is not the same as the "byte 0x?? in position ?" - the rejection here is actually of the B4 that follows it. How does this look?

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte 0xb4 for this start byte
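
As a rough sketch, that wording can be approximated today from the exception object itself, without touching the decoder. This assumes that when the reason is "invalid continuation byte", `exc.end` is the index of the byte that was rejected; that holds for this input on CPython 3, but treat it as an assumption in general. `describe` is a hypothetical helper name, not an existing API.

```python
def describe(exc):
    """Format a UnicodeDecodeError, naming the rejected byte when possible.

    Assumption: for "invalid continuation byte", exc.end is the index of
    the byte the decoder rejected. Hypothetical helper, not a CPython API.
    """
    lead = exc.object[exc.start]
    msg = "%r codec can't decode byte 0x%02x in position %d: %s" % (
        exc.encoding, lead, exc.start, exc.reason)
    # Only add the extra detail if there is a following byte to report.
    if exc.reason == "invalid continuation byte" and exc.end < len(exc.object):
        msg += " 0x%02x for this start byte" % exc.object[exc.end]
    return msg


try:
    b"\xed\xb4\x80".decode("utf-8")
except UnicodeDecodeError as exc:
    print(describe(exc))
```

This only rewrites the message after the fact; it does not say why B4 is unacceptable after ED.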

(BTW, I think Pike's decoder just always emits two bytes, no matter what the actual errant stream is (after all, there's no way to know how many bytes "ought to have been" one character when there's an error in it). So it's incomplete, yes, but when you're dealing with wrong data, completeness isn't all that possible anyway.)
History
Date                 User    Action  Args
2015-03-13 21:53:27  Rosuav  set     recipients: + Rosuav, vstinner, ezio.melotti, serhiy.storchaka
2015-03-13 21:53:27  Rosuav  set     messageid: <1426283607.48.0.685234730526.issue23614@psf.upfronthosting.co.za>
2015-03-13 21:53:27  Rosuav  link    issue23614 messages
2015-03-13 21:53:26  Rosuav  create