Message 102024 - Python tracker

Author	lemburg
Recipients	dangra, ezio.melotti, lemburg, sjmachin
Date	2010-03-31.18:07:43
Message-id	<1270058865.03.0.672346954204.issue8271@psf.upfronthosting.co.za>
In-reply-to

Content
I guess the term "failing" byte is somewhat underdefined.

Page 95 of the standard PDF (http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf) suggests the following: "Replace each maximal subpart of an ill-formed subsequence by a single U+FFFD".

Fortunately, they explain what they are after: if a subsequent byte in the sequence does not have the high bit set, it's not to be considered part of the UTF-8 sequence of the code point.

Implementing that should be fairly straightforward by adjusting the endinpos variable accordingly.
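
To make the idea concrete, here is a rough pure-Python sketch of the "maximal subpart" rule. It is not the C decoder and not a proposed patch: the names decode_utf8_replace and _lead_info are made up for illustration, and the local variable end only plays the role that endinpos would play in the real code. The sketch also assumes the stricter second-byte ranges from the standard's table of well-formed UTF-8 byte sequences, which goes a bit beyond the "high bit" description above.

```python
REPLACEMENT = "\ufffd"


def _lead_info(b0):
    """Hypothetical helper: describe a lead byte.

    Returns (continuation_count, second_lo, second_hi, payload_mask),
    or None if b0 cannot start a multi-byte sequence.
    """
    if 0xC2 <= b0 <= 0xDF:
        return 1, 0x80, 0xBF, 0x1F
    if b0 == 0xE0:
        return 2, 0xA0, 0xBF, 0x0F   # excludes overlong forms
    if b0 == 0xED:
        return 2, 0x80, 0x9F, 0x0F   # excludes surrogates
    if 0xE1 <= b0 <= 0xEF:
        return 2, 0x80, 0xBF, 0x0F
    if b0 == 0xF0:
        return 3, 0x90, 0xBF, 0x07   # excludes overlong forms
    if 0xF1 <= b0 <= 0xF3:
        return 3, 0x80, 0xBF, 0x07
    if b0 == 0xF4:
        return 3, 0x80, 0x8F, 0x07   # excludes code points above U+10FFFF
    return None


def decode_utf8_replace(data):
    """Decode bytes, replacing each maximal ill-formed subpart by U+FFFD."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:                # plain ASCII
            out.append(chr(b0))
            i += 1
            continue
        info = _lead_info(b0)
        if info is None:             # stray continuation or invalid lead byte
            out.append(REPLACEMENT)
            i += 1
            continue
        need, lo, hi, mask = info
        cp = b0 & mask
        end = i + 1                  # one past the last byte accepted so far
        while end < n and end - i <= need:
            # Only the second byte has a lead-dependent range.
            k_lo, k_hi = (lo, hi) if end == i + 1 else (0x80, 0xBF)
            if not k_lo <= data[end] <= k_hi:
                break                # this byte starts something new
            cp = (cp << 6) | (data[end] & 0x3F)
            end += 1
        if end - i == need + 1:
            out.append(chr(cp))      # complete, well-formed sequence
        else:
            out.append(REPLACEMENT)  # one U+FFFD for the whole subpart
        i = end                      # resume at the byte that did not fit
    return "".join(out)
```

With this, b"\xe1\x80A" decodes to U+FFFD followed by "A": the two bytes E1 80 are consumed as one ill-formed subpart, and the "A" that did not fit is decoded on its own rather than being swallowed by the error.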

Any takers?

History
Date	User	Action	Args
2010-03-31 18:07:45	lemburg	set	recipients: + lemburg, sjmachin, ezio.melotti, dangra
2010-03-31 18:07:45	lemburg	set	messageid: <1270058865.03.0.672346954204.issue8271@psf.upfronthosting.co.za>
2010-03-31 18:07:43	lemburg	link	issue8271 messages
2010-03-31 18:07:43	lemburg	create