Author lemburg
Recipients dangra, ezio.melotti, lemburg, sjmachin
Date 2010-04-01.07:44:47
Message-id <4BB44EED.2010300@egenix.com>
In-reply-to <1270091973.22.0.435495612508.issue8271@psf.upfronthosting.co.za>
Content
John Machin wrote:
> 
> John Machin <sjmachin@users.sourceforge.net> added the comment:
> 
> @lemburg: "failing byte" seems rather obvious: first byte that you meet that is not valid in the current state. I don't understand your explanation, especially "does not have the high bit set". I think you mean "is a valid starter byte". See example 3 below.

I just had a quick look at the code and saw that it's testing for the high
bit on the subsequent bytes.

Looking closer, you're right and the situation is a bit more complex,
but the solution still looks simple: only the endinpos
has to be adjusted more carefully depending on what the various
checks find.

That said, I find the Unicode consortium solution a bit awkward.
In UTF-8 the first byte of a multi-byte sequence defines the number
of bytes that make up the sequence. If some of those bytes are invalid,
the whole sequence is invalid, and the fact that some of those
bytes may be interpretable as regular code points does not necessarily
give better results. The reason is that losing bytes from a
stream is far less likely than having a few bits flipped in the data.
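
To make the difference between the two approaches concrete, here is a
minimal Python sketch. It is not the decoder's actual code: the names
expected_length, error_end and decode_replace and the policy labels
"whole" and "prefix" are made up for illustration. error_end plays the
role of the endinpos calculation. The "prefix" policy is only an
approximation of the Unicode recommendation, because it accepts any
byte in 0x80..0xBF as a continuation byte and ignores the tighter
second-byte ranges that apply after E0, ED, F0 and F4.

```python
def expected_length(lead):
    """Sequence length declared by a UTF-8 lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # stray continuation byte or a byte never valid in UTF-8


def error_end(data, start, policy):
    """End index of the invalid span that begins at data[start].

    Only meant to be called for a position already known to be in error.
    "whole":  skip every byte the lead byte announced (clipped to the input).
    "prefix": stop at the first byte that is not a continuation byte.
    """
    n = expected_length(data[start])
    if n <= 1:
        return start + 1
    limit = min(start + n, len(data))
    if policy == "whole":
        return limit
    end = start + 1
    while end < limit and 0x80 <= data[end] <= 0xBF:
        end += 1
    return end


def decode_replace(data, policy):
    """Decode UTF-8, emitting one U+FFFD per invalid span chosen by policy."""
    out = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        chunk = data[i:i + n]
        try:
            if n == 0 or len(chunk) < n:
                raise ValueError("invalid lead byte or truncated sequence")
            # Strict decoding raises UnicodeDecodeError (a ValueError) on bad input.
            out.append(chunk.decode("utf-8"))
            i += n
        except ValueError:
            out.append("\ufffd")
            i = error_end(data, i, policy)
    return "".join(out)


damaged = b"\xe2ABC"  # lead byte announces 3 bytes, but "A" and "B" follow
print(repr(decode_replace(damaged, "whole")))   # "A" and "B" are swallowed
print(repr(decode_replace(damaged, "prefix")))  # "A" and "B" survive
```

With the "whole" policy the two bytes after the lead byte are treated as
damaged parts of the sequence; with the "prefix" policy they are decoded
as ordinary characters. Which of the two is the better guess depends on
whether bytes were lost or bits were flipped.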
History
Date User Action Args
2010-04-01 07:44:50 lemburg set recipients: + lemburg, sjmachin, ezio.melotti, dangra
2010-04-01 07:44:48 lemburg link issue8271 messages
2010-04-01 07:44:47 lemburg create