Message 102068 - Python tracker

Author	lemburg
Recipients	dangra, ezio.melotti, lemburg, sjmachin
Date	2010-04-01.07:44:47
SpamBayes Score	4.1799897e-14
Marked as misclassified	No
Message-id	<4BB44EED.2010300@egenix.com>
In-reply-to	<1270091973.22.0.435495612508.issue8271@psf.upfronthosting.co.za>

Content
John Machin wrote:
> 
> John Machin <sjmachin@users.sourceforge.net> added the comment:
> 
> @lemburg: "failing byte" seems rather obvious: first byte that you meet that is not valid in the current state. I don't understand your explanation, especially "does not have the high bit set". I think you mean "is a valid starter byte". See example 3 below.

I just had a quick look at the code and saw that it's testing for the high
bit on the subsequent bytes.

Looking closer, you're right and the situation is a bit more complex,
but the solution still looks simple: only the endinpos
has to be adjusted more carefully depending on what the various
checks find.
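
To make the role of the end position concrete from the Python side: the span covered by a decoding error is exposed as the start and end attributes of UnicodeDecodeError. The following is only a minimal sketch, assuming a current CPython 3 interpreter (the 2010 code base discussed in this message may report different values); error_span is a hypothetical helper name, not part of the codec.

```python
def error_span(data: bytes):
    """Return (start, end) of the first UTF-8 decoding error, or None if data is valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start and exc.end delimit the bytes the codec considers invalid;
        # an error handler resumes decoding at exc.end.
        return exc.start, exc.end
    return None


# 0xE2 announces a three-byte sequence, but "A" is not a continuation byte.
print(error_span(b"\xe2AB"))
```

How far the end position reaches past the lead byte is exactly what decides whether the following bytes are decoded again or swallowed by the error.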

That said, I find the Unicode Consortium solution a bit awkward.
In UTF-8 the first byte of a multi-byte sequence defines the number
of bytes that make up the sequence. If some of those bytes are invalid,
the whole sequence is invalid, and the fact that some of those
bytes may be interpretable as regular code points does not necessarily
give better results. The reason is that losing bytes in a stream
is far less likely than having a few bits flipped in the data.
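
The two policies can be compared in a small sketch. It is an illustration only, not the codec's implementation: sequence_length and decode_skip_whole_sequence are hypothetical names, and the built-in result is assumed to come from a current CPython 3, which replaces only the invalid prefix and then decodes the following bytes.

```python
def sequence_length(lead: int) -> int:
    """Number of bytes announced by a UTF-8 lead byte (0 if it is not a valid lead byte)."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def decode_skip_whole_sequence(data: bytes) -> str:
    """Hypothetical policy: on error, discard every byte the lead byte announced."""
    out = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunk = data[i:i + n]
        if n and len(chunk) == n:
            try:
                out.append(chunk.decode("utf-8"))
                i += n
                continue
            except UnicodeDecodeError:
                pass  # fall through and replace the whole announced sequence
        out.append("\ufffd")
        i += max(n, 1)  # skip the announced length, or one byte for a bad lead byte
    return "".join(out)


damaged = b"\xe2\x41\x42\x43"  # lead byte of a 3-byte sequence followed by "ABC"
print(repr(decode_skip_whole_sequence(damaged)))   # whole-sequence policy
print(repr(damaged.decode("utf-8", "replace")))    # built-in behaviour
```

Under the whole-sequence policy "A" and "B" are dropped together with the lead byte, while the built-in decoder keeps them. Which outcome is preferable depends on whether the damage came from flipped bits or from lost bytes.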

History
Date	User	Action	Args
2010-04-01 07:44:50	lemburg	set	recipients: + lemburg, sjmachin, ezio.melotti, dangra
2010-04-01 07:44:48	lemburg	link	issue8271 messages
2010-04-01 07:44:47	lemburg	create