classification
Title: str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0
Type: behavior Stage: resolved
Components: Unicode Versions: Python 3.3, Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: ezio.melotti Nosy List: Ringding, belopolsky, dangra, ezio.melotti, jmehnle, lemburg, pitrou, python-dev, serhiy.storchaka, sjmachin, spatz123, vstinner
Priority: normal Keywords: needs review, patch

Created on 2010-03-31 02:28 by sjmachin, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
issue8271.diff ezio.melotti, 2010-04-01 08:33 Incomplete patch against trunk.
issue8271v2.diff ezio.melotti, 2010-04-02 22:27 New patch against trunk
issue8271v3.diff ezio.melotti, 2010-04-04 05:49 Final patch
issue8271v4.diff ezio.melotti, 2010-04-07 04:08 More final patch
issue8271v5.diff ezio.melotti, 2010-06-04 16:22 Even more final patch
issue8271v6.diff ezio.melotti, 2011-04-19 12:30 Patch to fix the number of FFFD
issue8271-3.3-fast-3.patch serhiy.storchaka, 2012-06-23 21:21 Ezio's patch updated to current sources
Messages (66)
msg101972 - (view) Author: John Machin (sjmachin) Date: 2010-03-31 02:28
Unicode 5.2.0 chapter 3 (Conformance) has a new section (headed "Constraints on Conversion Processes") after requirement D93. Recent Pythons (e.g. 3.1.2) don't comply. Using the Unicode example:

 >>> print(ascii(b"\xc2\x41\x42".decode('utf8', 'replace')))
 '\ufffdB'
 # should produce u'\ufffdAB'

Resynchronisation currently starts at a position derived by considering the length implied by the start byte:

 >>> print(ascii(b"\xf1ABCD".decode('utf8', 'replace')))
 '\ufffdD'
 # should produce u'\ufffdABCD'; resync should start from the *failing* byte.

Notes: This applies to the 'ignore' option as well as the 'replace' option. The Unicode discussion mentions "security exploits".
msg102013 - (view) Author: Daniel Graña (dangra) Date: 2010-03-31 14:59
Some background for this report at http://stackoverflow.com/questions/2547262/why-is-python-decode-replacing-more-than-the-invalid-bytes-from-an-encoded-string/2548480
msg102024 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-03-31 18:07
I guess the term "failing byte" is somewhat underdefined.

Page 95 of the standard PDF (http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf) suggests: "Replace each maximal subpart of an ill-formed subsequence by a single U+FFFD".

Fortunately, they explain what they are after: if a subsequent byte in the sequence does not have the high bit set, it's not to be considered part of the UTF-8 sequence of the code point.

Implementing that should be fairly straightforward by adjusting the endinpos variable accordingly.

Any takers ?
msg102061 - (view) Author: John Machin (sjmachin) Date: 2010-04-01 03:19
@lemburg: "failing byte" seems rather obvious: first byte that you meet that is not valid in the current state. I don't understand your explanation, especially "does not have the high bit set". I think you mean "is a valid starter byte". See example 3 below.

Example 1: F1 80 41 42 43. F1 implies a 4-byte character. 80 is OK. 41 is not in 80-BF. It is the "failing byte"; high bit not set. Required action is to emit FFFD then resync on the 41, causing 0041 0042 0043 to be emitted. Total output: FFFD 0041 0042 0043. Current code emits FFFD 0043.

Example 2: F1 80 FF 42 43. F1 implies a 4-byte character. 80 is OK. FF is not in 80-BF. It is the "failing byte". Required action is to emit FFFD then resync on the FF. FF is not a valid starter byte, so emit FFFD, and resync on the 42, causing 0042 0043 to be emitted. Total output: FFFD FFFD 0042 0043. Current code emits FFFD 0043.

Example 3: F1 80 C2 81 43. F1 implies a 4-byte character. 80 is OK. C2 is not in 80-BF. It is the "failing byte". Required action is to emit FFFD then resync on the C2. C2 and 81 have the high bit set, but C2 is a valid starter byte, and remaining bytes are OK, causing 0081 0043 to be emitted. Total output: FFFD 0081 0043. Current code emits FFFD 0043.
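[Editor's note: all three expected outputs match what CPython produces once the fix developed in this thread is applied (3.3 and later); a quick check in the style of the original report, assuming the 'replace' handler:]

 >>> print(ascii(b'\xf1\x80\x41\x42\x43'.decode('utf8', 'replace')))
 '\ufffdABC'
 >>> print(ascii(b'\xf1\x80\xff\x42\x43'.decode('utf8', 'replace')))
 '\ufffd\ufffdBC'
 >>> print(ascii(b'\xf1\x80\xc2\x81\x43'.decode('utf8', 'replace')))
 '\ufffd\x81C'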
msg102062 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-04-01 03:28
Having the 'high bit set' means that the first bit is set to 1.
All the continuation bytes (i.e. the 2nd, 3rd or 4th byte in a sequence) have the first two bits set to 1 and 0 respectively, so if the first bit is not set to 1 then the byte shouldn't be considered part of the sequence.
I'm trying to work on a patch.
msg102063 - (view) Author: John Machin (sjmachin) Date: 2010-04-01 06:08
@ezio.melotti: Your second sentence is true, but it is not the whole truth. Bytes in the range C0-FF (whose high bit *is* set) ALSO shouldn't be considered part of the sequence because they (like 00-7F) are invalid as continuation bytes; they are either starter bytes (C2-F4) or invalid for any purpose (C0-C1 and F5-FF). Further, some bytes in the range 80-BF are NOT always valid as the first continuation byte; it depends on what starter byte they follow.

The simple way of summarising the above is to say that a byte that is not a valid continuation byte in the current state ("failing byte") is not a part of the current (now known to be invalid) sequence, and the decoder must try again ("resync") with the failing byte.

Do you agree with my example 3?
msg102064 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-04-01 06:14
Yes, right now I'm considering valid all the bytes that start with '10...'. C2 starts with '11...' so it's a "failing byte".
msg102065 - (view) Author: John Machin (sjmachin) Date: 2010-04-01 07:29
@ezio.melotti: """I'm considering valid all the bytes that start with '10...'"""

Sorry, WRONG. Read what I wrote: """Further, some bytes in the range 80-BF are NOT always valid as the first continuation byte, it depends on what starter byte they follow."""

Consider these sequences: (1) E0 80 80 (2) E0 9F 80. Both are invalid sequences (over-long). Specifically the first continuation byte may not be in 80-9F. Those bytes start with '10...' but they are invalid after an E0 starter byte.

Please read "Table 3-7. Well-Formed UTF-8 Byte Sequences" and surrounding text in Unicode 5.2.0 chapter 3 (bearing in mind that CPython (for good reasons) doesn't implement the surrogates restriction, so that the special case for starter byte ED is not used in CPython). Note the other 3 special cases for the first continuation byte.
msg102066 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-04-01 07:33
That's why I'm writing tests that cover all the cases, including overlong sequences. If the tests fail I'll change the patch :)
msg102068 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-04-01 07:44
John Machin wrote:
> @lemburg: "failing byte" seems rather obvious: first byte that you meet that is not valid in the current state. I don't understand your explanation, especially "does not have the high bit set". I think you mean "is a valid starter byte". See example 3 below.

I just had a quick look at the code and saw that it's testing for the high
bit on the subsequent bytes.

Looking closer, you're right and the situation is a bit more complex,
but the solution still looks simple: only the endinpos
has to be adjusted more carefully depending on what the various
checks find.

That said, I find the Unicode consortium solution a bit awkward.
In UTF-8 the first byte in a multi-byte sequence defines the number
of bytes that make up a sequence. If some of those bytes are invalid,
the whole sequence is invalid, and the fact that some of those
bytes may be interpretable as regular code points does not necessarily
produce better results - the reason is that loss of bytes in a
stream is far less likely than a few flipped bits in the data.
msg102076 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-04-01 08:33
Here is an incomplete patch. It seems to solve the problem but I still have to add more tests and check it better.
I also wonder if the sequences with the first byte in range F5-FD (start of 4/5/6-byte sequences, restricted by RFC 3629) should behave in the same way. Right now they just "eat" the following 4/5/6 bytes without checking them.
msg102077 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-04-01 08:46
Ezio Melotti wrote:
> Here is an incomplete patch. It seems to solve the problem but I still have to add more tests and check it better.

Thanks. Please also check whether it's worthwhile unrolling those
loops by hand.

> I also wonder if the sequences with the first byte in range F5-FD (start of 4/5/6-byte sequences, restricted by RFC 3629) should behave in the same way. Right now they just "eat" the following 4/5/6 bytes without checking them.

I think we need to do this all the way, even though 5 and 6 byte
sequences are not used at the moment.
msg102085 - (view) Author: John Machin (sjmachin) Date: 2010-04-01 11:57
Unicode has been frozen at 0x10FFFF. That's it. There is no such thing as a valid 5-byte or 6-byte UTF-8 string.
msg102089 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-04-01 13:19
John Machin wrote:
> Unicode has been frozen at 0x10FFFF. That's it. There is no such thing as a valid 5-byte or 6-byte UTF-8 string.

The UTF-8 codec was written at a time when UTF-8 still included
the possibility to have 5 or 6 bytes:

http://www.rfc-editor.org/rfc/rfc2279.txt

Use of those encodings has always raised an error, though. For error
handling purposes it still has to support those possibilities.
msg102090 - (view) Author: John Machin (sjmachin) Date: 2010-04-01 13:47
@lemburg: RFC 2279 was obsoleted by RFC 3629 over 6 years ago. The standard now says 21 bits is it. F5-FF are declared to be invalid. I don't understand what you mean by "supporting those possibilities". The code is correctly issuing an error message. The goal of supporting the new resyncing and FFFD-emitting rules might be better met however by throwing away the code in the default clause and instead merely setting the entries for F5-FF in the utf8_code_length array to zero.
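[Editor's note: utf8_code_length is the 256-entry table in Objects/unicodeobject.c that maps each possible first byte to the length of the sequence it starts, with 0 marking bytes that cannot legally start one. A pure-Python transcription of the RFC 3629 version being proposed here (the transcription itself is the editor's illustration, not code from the patch):]

 # 0x00-0x7F: ASCII, length 1         0x80-0xBF: continuation bytes, 0
 # 0xC0-0xC1: overlong starts, 0      0xC2-0xDF: 2-byte starts
 # 0xE0-0xEF: 3-byte starts           0xF0-0xF4: 4-byte starts
 # 0xF5-0xFF: invalid per RFC 3629, 0
 utf8_code_length = ([1] * 128 + [0] * 64 + [0] * 2 +
                     [2] * 30 + [3] * 16 + [4] * 5 + [0] * 11)
 assert len(utf8_code_length) == 256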
msg102093 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-04-01 14:12
John Machin wrote:
> @lemburg: RFC 2279 was obsoleted by RFC 3629 over 6 years ago. 

I know.

> The standard now says 21 bits is it. 

It says that the current Unicode codespace only uses 21 bits. In the
early days 16 bits were considered enough, so it wouldn't surprise me
if they extend that range again at some point in the future - after
all, leaving 11 bits unused in UCS-4 is a huge waste of space.

If you have a reference that the Unicode consortium has decided
to stay with that limit forever, please quote it.

> F5-FF are declared to be invalid. I don't understand what you mean by "supporting those possibilities". The code is correctly issuing an error message. The goal of supporting the new resyncing and FFFD-emitting rules might be better met however by throwing away the code in the default clause and instead merely setting the entries for F5-FF in the utf8_code_length array to zero.

Fair enough. Let's do that.

The reference in the table should then be updated to RFC 3629.
msg102094 - (view) Author: John Machin (sjmachin) Date: 2010-04-01 14:43
Patch review:

Preamble: pardon my ignorance of how the codebase works, but trunk unicodeobject.c is r79494 (and allows encoding of surrogate codepoints), py3k unicodeobject.c is r79506 (and bans the surrogate caper) and I can't find the r79542 that the patch mentions ... help, please!

length 2 case: 
1. the loop can be hand-unrolled into oblivion. It can be entered only when s[1] & 0xC0 != 0x80 (previous if test).
2. the over-long check (if (ch < 0x80)) hasn't been touched. It could be removed and the entries for C0 and C1 in the utf8_code_length array set to 0.

length 3 case:
1. the tests involving s[0] being 0xE0 or 0xED are misplaced.
2. the test s[0] == 0xE0 && s[1] < 0xA0 if not misplaced would be shadowing the over-long test (ch < 0x800). It seems better to use the over-long test (with endinpos set to 1).
3. The test s[0] == 0xED relates to the surrogates caper which in the py3k version is handled in the same place as the over-long test.
4. unrolling loop: needs no loop, only 1 test ... if s[1] is good, then we know s[2] must be bad without testing it, because we start the for loop only when s[1] is bad || s[2] is bad.

length 4 case: as for the len 3 case generally ... misplaced tests, F1 test shadows over-long test, F4 test shadows max value test, too many loop iterations.
msg102095 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-04-01 14:50
Even if they are not valid they still "eat" all the 4/5/6 bytes, so they should be fixed too. I haven't seen anything about these bytes in chapter 3 so far, but there are at least two possibilities:
1) consider all the bytes in range F5-FD as invalid without looking for the other bytes;
2) try to read the next 4/5/6 bytes and fail if they are not continuation bytes.
We can also look at what others do (e.g. browsers and other languages).
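[Editor's note: option 1 is what the committed patch eventually chose (see msg102209 below): F5-FD were marked as length 0 in the table, so an invalid start byte consumes nothing beyond itself. On a fixed interpreter:]

 >>> print(ascii(b'\xf5\x80\x80\x80A'.decode('utf8', 'replace')))
 '\ufffd\ufffd\ufffd\ufffdA'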
msg102098 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-04-01 15:01
Ezio Melotti wrote:
> Even if they are not valid they still "eat" all the 4/5/6 bytes, so they should be fixed too. I haven't seen anything about these bytes in chapter 3 so far, but there are at least two possibilities:
> 1) consider all the bytes in range F5-FD as invalid without looking for the other bytes;
> 2) try to read the next 4/5/6 bytes and fail if they are not continuation bytes.
> We can also look at what others do (e.g. browsers and other languages).

By marking those entries as 0 in the length table, they would only
consume one byte. However, compared to the current state, that would
produce more replacement code points in the output, so perhaps applying
the same logic as for the other sequences is a better strategy.
msg102099 - (view) Author: John Machin (sjmachin) Date: 2010-04-01 15:07
Chapter 3, page 94: """As a consequence of the well-formedness conditions specified in Table 3-7, the following byte values are disallowed in UTF-8: C0–C1, F5–FF"""

Of course they should be handled by the simple expedient of setting their length entry to zero. Why write code when there is an existing mechanism??
msg102101 - (view) Author: John Machin (sjmachin) Date: 2010-04-01 15:23
@lemburg: """perhaps applying the same logic as for the other sequences is a better strategy"""

What other sequences??? F5-FF are invalid bytes; they don't start valid sequences. What same logic?? At the start of a character, they should get the same short sharp treatment as any other non-starter byte e.g. 80 or C0.
msg102209 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-04-02 22:27
Here's a new patch. It should be complete but I want to test it some more before committing.
I decided to follow RFC 3629, putting 0 instead of 5/6 for bytes in range F5-FD (we can always put them back in the unlikely case that the Unicode Consortium changes its mind) and also for other invalid ranges (e.g. C0-C1). This led to some simplification in the code.

I also found out that, according to RFC 3629, surrogates are considered invalid and they can't be encoded/decoded, but the UTF-8 codec actually does it. I included tests and a fix but left them commented out because this is out of the scope of this patch, and it probably needs a discussion on python-dev.
msg102239 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-04-03 11:41
Ezio Melotti wrote:
> Here's a new patch. It should be complete but I want to test it some more before committing.
> I decided to follow RFC 3629, putting 0 instead of 5/6 for bytes in range F5-FD (we can always put them back in the unlikely case that the Unicode Consortium changes its mind) and also for other invalid ranges (e.g. C0-C1). This led to some simplification in the code.

Ok.

> I also found out that, according to RFC 3629, surrogates are considered invalid and they can't be encoded/decoded, but the UTF-8 codec actually does it. I included tests and a fix but left them commented out because this is out of the scope of this patch, and it probably needs a discussion on python-dev.

Right, but that idea is controversial. In Python we need to be able to
put those surrogate code points into source code (encoded as UTF-8) as
well as into pickle and marshal dumps of Unicode objects, so we can't
consider them invalid UTF-8.
msg102265 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-04-03 14:43
> I also found out that, according to RFC 3629, surrogates 
> are considered invalid and they can't be encoded/decoded, 
> but the UTF-8 codec actually does it.

Python2 does, but Python3 raises an error.

Python 2.7a4+ (trunk:79675, Apr  3 2010, 16:11:36)
>>> u"\uDC80".encode("utf8")
'\xed\xb2\x80'

Python 3.2a0 (py3k:79441, Mar 26 2010, 13:04:55)
>>> "\uDC80".encode("utf8")
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed

Denying the encoding of surrogates (in utf8) causes a lot of crashes in Python3, because most calling functions assume that _PyUnicode_AsString() never fails: see #6687 (and #8195 and a lot of other crashes). It's not a good idea to change it in Python 2.7, because it would require a huge amount of work and we are close to the first beta of 2.7.
msg102320 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-04-04 05:49
This new patch (v3) should be ok. 
I added a few more tests and found another corner case:
'\xe1a'.decode('utf-8', 'replace') was returning u'\ufffd' because \xe1 is the start byte of a 3-byte sequence and there were only two bytes in the string. This is now fixed in the latest patch.

I also unrolled all the loops except the first one because I haven't found an elegant way to unroll it (yet).

Finally, I changed the error messages to make them clearer:
unexpected code byte -> invalid start byte;
invalid data -> invalid continuation byte.
(I can revert this if the old messages are better or if it is better to fix this with a separate commit.)

Performance seems more or less the same; I ran some benchmarks without significant changes in the results. If you have better benchmarks let me know. I used 320kB files: some ASCII, ASCII mixed with some accented characters, Japanese, and a file with a sample of several different Unicode chars.
msg102516 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-04-07 04:08
The patch was causing a failure in test_codeccallbacks, issue8271v4 fixes the test.
(The failing test in test_codeccallbacks was testing that registering error handlers works, using a function that replaced "\xc0\x80" with "\x00". Since now "\xc0" is an invalid start byte regardless of what follows, the function is now receiving only "\xc0" instead of "\xc0\x80" so I had to change the test.)
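[Editor's note: a minimal sketch of the kind of custom error handler that test exercises, using the real codecs.register_error API; the handler name 'null-replace' and its behavior are the editor's illustration, not the actual test code:]

 import codecs

 def null_replace(exc):
     # The decoder reports the invalid span as exc.object[exc.start:exc.end];
     # return a replacement string and the position to resume decoding from.
     if not isinstance(exc, UnicodeDecodeError):
         raise exc
     return ('\x00', exc.end)

 codecs.register_error('null-replace', null_replace)
 # After the fix, b'\xc0' is reported alone as an invalid start byte and
 # b'\x80' alone right after it, so the handler fires twice instead of once:
 assert b'\xc0\x80abc'.decode('utf-8', 'null-replace') == '\x00\x00abc'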
msg102522 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-04-07 08:37
STINNER Victor wrote:
>> I also found out that, according to RFC 3629, surrogates
>> are considered invalid and they can't be encoded/decoded,
>> but the UTF-8 codec actually does it.
>
> Python2 does, but Python3 raises an error. [...]
> Denying the encoding of surrogates (in utf8) causes a lot of crashes in Python3 [...]

I wonder how that change got into the 3.x branch - I would certainly
not have approved it for the reasons given further up on this ticket.

I think we should revert that change for Python 3.2.
msg102523 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-04-07 09:02
> >> I also found out that, according to RFC 3629, surrogates
> >> are considered invalid and they can't be encoded/decoded,
> >> but the UTF-8 codec actually does it.
> >
> > Python2 does, but Python3 raises an error.
> > (...)
> 
> I wonder how that change got into the 3.x branch - I would certainly
> not have approved it for the reasons given further up on this ticket.
> 
> I think we should revert that change for Python 3.2.

See r72208 and issue #3672.

pitrou wrote "We could fix it for 3.1, and perhaps leave 2.7 unchanged if some 
people rely on this (for whatever reason)."
msg107074 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-06-04 16:22
I added a test for the 'ignore' error handler. I will commit the patch before the RC unless someone has something against it.

To summarize, the patch updates PyUnicode_DecodeUTF8 from RFC 2279 to RFC 3629, so:
1) Invalid sequences are now handled as described in http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf (pages 94-95);
2) 5- and 6-byte-long sequences are now invalid (no change in behavior; I just removed the "default:" of the switch/case and marked them with '0' in the first table);
3) According to RFC 3629, codepoints in the surrogate range (U+D800-U+DFFF) should be considered invalid, but this would not be backward compatible, so I added code and tests but left them commented out;
4) I changed the error message "unexpected code byte" to "invalid start byte" and "invalid data" to "invalid continuation byte";
5) I added an extensive set of tests in test_unicode;
6) I fixed test_codeccallbacks because it was failing after this change.
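[Editor's note: the reworded error messages from point 4 are visible in strict mode on a fixed interpreter:]

 >>> b'\xff'.decode('utf-8')
 Traceback (most recent call last):
   ...
 UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
 >>> b'\xc2\x41'.decode('utf-8')
 Traceback (most recent call last):
   ...
 UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte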
msg107163 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-06-05 20:35
Fixed on trunk in r81758 and r81759.
I'm leaving the issue open until I port it on the other versions.
msg109015 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-06-30 20:21
The issue about invalid surrogates in UTF-8 has been raised in #9133.
msg109070 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-07-01 19:19
Ported to py3k in r82413.
Some tests with non-BMP characters should probably be added.
The patch should still be ported to 2.6 and 3.1.
msg109155 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-07-03 00:49
I've found a subtle corner case about 3- and 4-byte-long sequences.
For example, according to http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf (pages 94-95, table 3.7) the sequences in range \xe0\x80\x80-\xe0\x9f\xbf are invalid.
I.e. if the first byte is \xe0 and the second byte is between \x80 (included) and \xA0 (excluded), then the second byte is invalid (this is because sequences < \xe0\xa0\x80 would result in codepoints < U+0800, and those codepoints are already represented by two-byte-long sequences (\xdf\xbf decodes to U+07FF)).

Assume that we want to decode the string b'\xe0\x61\x80\x61' (where \xe0 is the start byte of a 3-byte-long sequence, \x61 is the letter 'a' and \x80 a valid continuation byte).
This actually results in:
>>> b'\xe0\x61\x80\x61'.decode('utf-8', 'replace')
'�a�a'
since \x61 is not a valid continuation byte in the sequence:
 * \xe0 is converted to �;
 * \x61 is displayed correctly as 'a';
 * \x80 is valid only as a continuation byte and invalid alone, so it's replaced by �;
 * \x61 is displayed correctly as 'a';

Now, assume that we want to do the same with b'\xe0\x80\x81\x61':
This actually results in:
>>> b'\xe0\x80\x81\x61'.decode('utf-8', 'replace')
'��a'
in this case \x80 would be a valid continuation byte, but since it's preceded by \xe0 it's not valid.
Since it's not valid, the result might be similar to the previous case, i.e.:
 * \xe0 is converted to �;
 * \x80 is valid as a continuation byte but not in this specific case, so it's replaced by �;
 * \x81 is valid only as a continuation byte and invalid alone, so it's replaced by �;
 * \x61 is displayed correctly as 'a';
However for this case (and the other similar cases), the invalid bytes wouldn't be otherwise valid because they are still in range \x80-\xbf (continuation bytes), so the current behavior might be fine.

This happens because the current algorithm just checks that the second byte (\x80) is in range \x80-\xbf (i.e. it's a continuation byte) and if it is it assumes that the invalid byte is the third (\x81) and replaces the first two bytes (\xe0\x80) with a single �.

That said, the algorithm could be improved to identify the wrong byte with better accuracy (and that could also be used to give a better error message about decoding surrogates). This shouldn't affect the speed of regular decoding, because the extra check will happen only in case of error.
Also note that the Unicode standard doesn't seem to mention this case, and that anyway this doesn't "eat" any of the following characters as it did before the patch -- the only difference would be in the number of �.
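[Editor's note: after the follow-up fix committed at the end of this thread (changeset 5962f192a483), the second example yields three U+FFFD rather than two, per the maximal-subpart reading worked out below:]

 >>> b'\xe0\x61\x80\x61'.decode('utf-8', 'replace')
 '�a�a'
 >>> b'\xe0\x80\x81\x61'.decode('utf-8', 'replace')
 '���a'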
msg109159 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-07-03 05:42
Backported to 2.6 and 3.1 in r82470 and r82469.
I'll leave this open for a while to see if anyone has any comment on my previous message.
msg109170 - (view) Author: John Machin (sjmachin) Date: 2010-07-03 09:36
About the E0 80 81 61 problem: my interpretation is that you are correct, the 80 is not valid in the current state (start byte == E0), so no look-ahead, three FFFDs must be issued followed by 0061. I don't really care about issuing too many FFFDs so long as it doesn't munch valid sequences. However it would be very nice to get an explicit message about surrogates.
msg129495 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-02-26 03:31
After a mail I sent to the Unicode Consortium about the corner case I found, they updated the "Best Practices for Using U+FFFD"[0] and now it says:
"""
 Another example illustrates the application of the concept of maximal subpart for UTF-8 continuation bytes outside the allowable ranges defined in Table 3-7. The UTF-8 sequence <41 E0 9F 80 41> is ill-formed, because <9F> is not an allowed second byte of a UTF-8 sequence commencing with <E0>. In this case, there is an unconvertible offset at <E0> and the maximal subpart at that offset is also <E0>. The subsequence <E0 9F> cannot be a maximal subpart, because it is not an initial subsequence of any well-formed UTF-8 code unit sequence.
"""

The result of decoding that string with Python is:
>>> b'\x41\xE0\x9F\x80\x41'.decode('utf-8', 'replace')
'A��A'
i.e. the bytes <E0 9F> are wrongly considered as a maximal subpart and replaced with a single '�' (the second � is the \x80).

I'll work on a patch and see how it comes out.

[0]: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf - page 96
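[Editor's note: with the final patch applied (3.3+), the example from the updated "Best Practices" text decodes as the standard recommends, with <E0>, <9F> and <80> each replaced separately:]

 >>> b'\x41\xE0\x9F\x80\x41'.decode('utf-8', 'replace')
 'A���A'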
msg129647 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-02-27 18:58
The patch turned out to be less trivial than I initially thought.

The current algorithm checks for invalid continuation bytes in 4 places:
1) before the switch/case statement in Objects/unicodeobject.c when it checks if there are enough bytes in the string (e.g. if the current byte is a start byte of a 4-bytes sequence and there are only 2 more bytes in the string, the sequence is invalid);
2) in the "case 2" of the switch, where it's checked if the second byte is a valid continuation byte;
3) in the "case 3" of the switch, where it's checked if the second and third bytes are valid continuation bytes, including additional invalid cases for the second bytes;
3) in the "case 4" of the switch, where it's checked if the second, third, and fourth bytes are valid continuation bytes, including additional invalid cases for the second bytes;

The correct algorithm should determine the maximal valid subpart of the sequence by finding the position of the first invalid continuation byte. Continuation bytes are all in range 80..BF, except for the second byte of 3-byte sequences that start with E0 or ED and the second byte of 4-byte sequences that start with F0 or F4 (3rd and 4th continuation bytes are always in range 80..BF).
This means that the above 4 cases should be changed in this way:
1) if there aren't enough bytes left to complete the sequence, check for valid continuation bytes considering the special cases for the second bytes (E0, ED, F0, F4) instead of using the naive algorithm that checks only for continuation bytes in range 80..BF;
2) the "case 2" is fine as is, because the second byte is always in range 80..BF;
3) the "case 3" should check (pseudocode):
  if (second_byte_is_not_valid) max_subpart_len = 1
  else if (third_byte not in 80..BF) max_subpart_len = 2
  else  # the sequence is valid
the "second_byte_is_not_valid" part should consider the two special cases for E0 and ED.
4) the "case 4" should check (pseudocode):
  if (second_byte_is_not_valid) max_subpart_len = 1
  else if (third_byte not in 80..BF) max_subpart_len = 2
  else if (fourth_byte not in 80..BF) max_subpart_len = 3
  else  # the sequence is valid
here the "second_byte_is_not_valid" part should consider the two special cases for F0 and E4.

In order to avoid duplication of code I was thinking of adding 2 macros (something like IS_VALID_3_SEQ_2ND_BYTE, IS_VALID_4_SEQ_2ND_BYTE) that will be used in cases 1) and 3), and 1) and 4) respectively.
The change shouldn't affect the decoding speed, but it will increase the lines of code and complexity of the function.
Is this OK?
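[Editor's note: a minimal pure-Python sketch of the classification just described, written for illustration; the actual patch is C code in Objects/unicodeobject.c, and the function names below are made up:]

 # Length of the maximal ill-formed subpart starting at data[i], i.e. how
 # many bytes a single U+FFFD replaces; returns the full sequence length
 # if the sequence is actually well-formed.
 def valid_second_byte(start, b):
     # Special second-byte ranges from Table 3-7 (E0/ED/F0/F4).
     if start == 0xE0: return 0xA0 <= b <= 0xBF
     if start == 0xED: return 0x80 <= b <= 0x9F
     if start == 0xF0: return 0x90 <= b <= 0xBF
     if start == 0xF4: return 0x80 <= b <= 0x8F
     return 0x80 <= b <= 0xBF

 def subpart_len(data, i):
     start = data[i]
     if not 0xC2 <= start <= 0xF4:    # lone continuation or invalid byte
         return 1
     n = 2 if start < 0xE0 else (3 if start < 0xF0 else 4)
     if i + 1 >= len(data) or not valid_second_byte(start, data[i + 1]):
         return 1
     for k in range(2, n):            # 3rd/4th bytes are plain 80..BF
         if i + k >= len(data) or not 0x80 <= data[i + k] <= 0xBF:
             return k
     return n

 assert subpart_len(b'\xe0\x80', 0) == 1   # <E0> alone: 80 not in A0..BF
 assert subpart_len(b'\xf1\x80A', 0) == 2  # <F1 80>: truncated by 'A'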
msg129685 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-02-28 09:18
Ezio Melotti wrote:
> [...]
> In order to avoid duplication of code I was thinking of adding 2 macros (something like IS_VALID_3_SEQ_2ND_BYTE, IS_VALID_4_SEQ_2ND_BYTE) that will be used in cases 1) and 3), and 1) and 4) respectively.
> The change shouldn't affect the decoding speed, but it will increase the lines of code and complexity of the function.
> Is this OK?

Sure.

It would be great if you could time the difference in
performance. In tight loops like the codec ones, small changes
in the way you write things can often make a big difference.

Please also include a Misc/NEWS entry to point to the change
and possibly consequences for existing code relying on the
previous behavior.

Thanks,
-- 
Marc-Andre Lemburg
msg134046 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-04-19 12:30
Attached patch against 3.1 fixes the number of FFFD.
A test for the range in the error message should probably be added.  I haven't done any benchmark yet.  There's some code duplication, but I'm not sure it can be factored out.
msg142132 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-15 17:15
Here are some benchmarks:
Commands:
# half of the bytes are invalid
./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode' 'b_dec("utf-8", "surrogateescape")'
./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode' 'b_dec("utf-8", "replace")'
./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode' 'b_dec("utf-8", "ignore")'

With patch:
1000 loops, best of 3: 854 usec per loop
1000 loops, best of 3: 509 usec per loop
1000 loops, best of 3: 415 usec per loop

Without patch:
1000 loops, best of 3: 670 usec per loop
1000 loops, best of 3: 470 usec per loop
1000 loops, best of 3: 382 usec per loop

Commands (from the interactive interpreter):
# all valid codepoints
import timeit
b = "".join(chr(c) for c in range(0x110000) if c not in range(0xD800, 0xE000)).encode("utf-8")
b_dec = b.decode
timeit.Timer('b_dec("utf-8")', 'from __main__ import b_dec').timeit(100)/100
timeit.Timer('b_dec("utf-8", "surrogateescape")', 'from __main__ import b_dec').timeit(100)/100
timeit.Timer('b_dec("utf-8", "replace")', 'from __main__ import b_dec').timeit(100)/100
timeit.Timer('b_dec("utf-8", "ignore")', 'from __main__ import b_dec').timeit(100)/100

With patch:
0.03830226898193359
0.03849360942840576
0.03835036039352417
0.03821949005126953

Without patch:
0.03750091791152954
0.037977190017700196
0.04067679166793823
0.038579678535461424

Commands:
# near-worst case scenario, 1 byte dropped every 5 from a valid utf-8 string
b2 = bytes(c for k,c in enumerate(b) if k%5)
b2_dec = b2.decode
timeit.Timer('b2_dec("utf-8", "surrogateescape")', 'from __main__ import b2_dec').timeit(10)/10
timeit.Timer('b2_dec("utf-8", "replace")', 'from __main__ import b2_dec').timeit(10)/10
timeit.Timer('b2_dec("utf-8", "ignore")', 'from __main__ import b2_dec').timeit(10)/10

With patch:
9.645482301712036
6.602735090255737
5.338080596923828

Without patch:
8.124328684806823
5.804249691963196
4.851014900207519

All tests done on wide 3.2.

Since the changes are about errors, decoding of valid utf-8 strings is not affected.  Decoding invalid strings with non-strict error handlers is slower, but I don't think the difference is significant.
If the patch is fine I will commit it.
msg160980 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-17 16:35
Looks like issue14738 fixes this bug for Python 3.3.

>>> print(ascii(b"\xc2\x41\x42".decode('utf8', 'replace')))
'\ufffdAB'
>>> print(ascii(b"\xf1ABCD".decode('utf8', 'replace')))
'\ufffdABCD'
msg160981 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-05-17 16:55
The original bug should already be fixed in 3.3 and there should be tests (unless they got removed/skipped after we changed the unicode implementation).
The only issue left was about the number of U+FFFD generated with invalid sequences in some cases.
My last patch has extensive tests for this, so you could try to apply it (or copy the tests) and see if they all pass.  FWIW this should be already fixed on PyPy.
msg160989 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-17 17:31
> The only issue left was about the number of U+FFFD generated with invalid sequences in some cases.
> My last patch has extensive tests for this, so you could try to apply it (or copy the tests) and see if they all pass.

Tests fail, but I'm not sure that the tests are correct.

b'\xe0\x00' raises 'unexpected end of data' and not 'invalid
continuation byte'. This is a terminological issue.

b'\xe0\x80'.decode('utf-8', 'replace') returns one U+FFFD and not two. I
don't think that is right.
msg160990 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-05-17 17:36
> Tests fail, but I'm not sure that the tests are correct.

> b'\xe0\x00' raises 'unexpected end of data' and not 'invalid
> continuation byte'. This is a terminological issue.

This might be just because it first checks if there are two more bytes before checking whether they are valid, but 'invalid continuation byte' works too.

> b'\xe0\x80'.decode('utf-8', 'replace') returns one U+FFFD and not
> two. I don't think that is right.

Why not?
msg160991 - (view) Author: Saul Spatz (spatz123) Date: 2012-05-17 17:36
> b'\xe0\x80'.decode('utf-8', 'replace') returns one U+FFFD and not two. I
> don't think that is right.

I think that one U+FFFD is correct.  The only error is a premature end of
data.
msg160993 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-05-17 18:12
Changing from 'unexpected end of data' to 'invalid continuation byte' for b'\xe0\x00' is fine with me, but this will be a (minor) deviation from 2.7, 3.1, 3.2, and pypy (it could still be changed on all these except 3.1 though).

If you make any changes on the tests please let me know.
msg160998 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-17 18:33
> I think that one U+FFFD is correct.  The only error is a premature end of
> data.

I expressed myself poorly. I also think that there is only one decoding
error, and not two. I think the test is wrong.
msg161000 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-17 18:46
> This might be just because it first checks if there are two more bytes before checking whether they are valid, but 'invalid continuation byte' works too.

Yes, this is an implementation detail. It is much easier and faster. Is it
necessary to change it?

> Why not?

Maybe I'm wrong. I looked in "The Unicode Standard, Version
6.0" (http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf), pp. 95-97;
the standard is not categorical on this, but recommends that only the
maximal subpart should be replaced by U+FFFD. \xe0\x80 is not a maximal
subpart. Therefore, there must be two U+FFFD. In this case, neither the
previous nor the current implementation conforms to the standard.
msg161001 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-17 18:52
> Changing from 'unexpected end of data' to 'invalid continuation byte' for b'\xe0\x00' is fine with me, but this will be a (minor) deviation from 2.7, 3.1, 3.2, and pypy (it could still be changed on all these except 3.1 though).

I probably said it poorly. Past and current implementations raise
'unexpected end of data' and not 'invalid continuation byte'. The test
expects 'invalid continuation byte'.
msg161002 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-05-17 18:55
> \xe0\x80 is not a maximal subpart. Therefore, there must be two U+FFFD.

OK, now I get what you mean.  The valid range for continuation bytes that can follow E0 is A0-BF, not 80-BF as usual, so \x80 is not a valid continuation byte here.  While working on the patch I stumbled across this corner case and contacted the Unicode consortium to ask about it, as explained in msg129495.

I don't remember all the details right now, but if that test was passing with my patch there must be something wrong somewhere (either in the patch, in the test, or in our understanding of the standard).
msg161004 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-05-17 18:59
> I probably said it poorly. Past and current implementations raise
> 'unexpected end of data' and not 'invalid continuation byte'. The test
> expects 'invalid continuation byte'.

I don't think it matters much either way.
msg161005 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-17 19:06
> I don't remember all the details right now, but if that test was passing with my patch there must be something wrong somewhere (either in the patch, in the test, or in our understanding of the standard).

No, the test correctly expects two U+FFFD. The current implementation is
wrong. Now that I understand what the error is, I'll try to correct the
Python 3.3 implementation.
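[Editor's note: on a build with the final patch both counts come out as the tests expect; b'\xe0\xa0' is shown for contrast, since <E0 A0> is a well-formed prefix and therefore a single maximal subpart:]

 >>> b'\xe0\x80'.decode('utf-8', 'replace')
 '��'
 >>> b'\xe0\xa0'.decode('utf-8', 'replace')
 '�'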
msg161622 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-25 20:21
Here is a patch for 3.3. All of the tests pass successfully. Unfortunately, it is a little slower, but I tried to minimize the losses.
msg161627 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-05-25 22:38
Do you have any benchmark results?
msg161650 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-26 08:50
Here are the benchmark results (numbers are speed, MB/s).

On 32-bit Linux, AMD Athlon 64 X2:

                                          vanilla      patched

utf-8     'A'*10000                       2016 (+5%)   2111
utf-8     '\x80'*10000                    383 (+9%)    416
utf-8       '\x80'+'A'*9999               1283 (+1%)   1301
utf-8     '\u0100'*10000                  383 (-8%)    354
utf-8       '\u0100'+'A'*9999             1258 (-6%)   1184
utf-8       '\u0100'+'\x80'*9999          383 (-8%)    354
utf-8     '\u8000'*10000                  434 (-11%)   388
utf-8       '\u8000'+'A'*9999             1262 (-6%)   1180
utf-8       '\u8000'+'\x80'*9999          383 (-8%)    354
utf-8       '\u8000'+'\u0100'*9999        383 (-8%)    354
utf-8     '\U00010000'*10000              358 (+1%)    361
utf-8       '\U00010000'+'A'*9999         1168 (-5%)   1104
utf-8       '\U00010000'+'\x80'*9999      382 (-20%)   307
utf-8       '\U00010000'+'\u0100'*9999    382 (-20%)   307
utf-8       '\U00010000'+'\u8000'*9999    404 (-10%)   365

On 32-bit Linux, Intel Atom N570:

                                          vanilla      patched

ascii     'A'*10000                       789 (+1%)    800

latin1    'A'*10000                       796 (-2%)    781
latin1        'A'*9999+'\x80'             779 (+1%)    789
latin1    '\x80'*10000                    1739 (-3%)   1690
latin1      '\x80'+'A'*9999               1747 (+1%)   1773

utf-8     'A'*10000                       623 (+1%)    631
utf-8     '\x80'*10000                    145 (+14%)   165
utf-8       '\x80'+'A'*9999               354 (+1%)    358
utf-8     '\u0100'*10000                  164 (-5%)    156
utf-8       '\u0100'+'A'*9999             343 (+2%)    350
utf-8       '\u0100'+'\x80'*9999          164 (-4%)    157
utf-8     '\u8000'*10000                  175 (-5%)    166
utf-8       '\u8000'+'A'*9999             349 (+2%)    356
utf-8       '\u8000'+'\x80'*9999          164 (-4%)    157
utf-8       '\u8000'+'\u0100'*9999        164 (-4%)    157
utf-8     '\U00010000'*10000              152 (+7%)    163
utf-8       '\U00010000'+'A'*9999         313 (+6%)    332
utf-8       '\U00010000'+'\x80'*9999      161 (-13%)   140
utf-8       '\U00010000'+'\u0100'*9999    161 (-14%)   139
utf-8       '\U00010000'+'\u8000'*9999    160 (-1%)    159
msg161655 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-26 09:27
Fortunately, issue14923 (if accepted) will compensate for the slowdown.

On 32-bit Linux, AMD Athlon 64 X2:

                                          vanilla      old patch    fast patch

utf-8     'A'*10000                       2016 (+3%)   2111 (-2%)   2072
utf-8     '\x80'*10000                    383 (+19%)   416 (+9%)    454
utf-8       '\x80'+'A'*9999               1283 (-7%)   1301 (-9%)   1190
utf-8     '\u0100'*10000                  383 (+46%)   354 (+58%)   560
utf-8       '\u0100'+'A'*9999             1258 (-1%)   1184 (+5%)   1244
utf-8       '\u0100'+'\x80'*9999          383 (+46%)   354 (+58%)   558
utf-8     '\u8000'*10000                  434 (+6%)    388 (+19%)   461
utf-8       '\u8000'+'A'*9999             1262 (-1%)   1180 (+5%)   1244
utf-8       '\u8000'+'\x80'*9999          383 (+46%)   354 (+58%)   559
utf-8       '\u8000'+'\u0100'*9999        383 (+45%)   354 (+57%)   555
utf-8     '\U00010000'*10000              358 (+5%)    361 (+4%)    375
utf-8       '\U00010000'+'A'*9999         1168 (-1%)   1104 (+5%)   1159
utf-8       '\U00010000'+'\x80'*9999      382 (+43%)   307 (+78%)   546
utf-8       '\U00010000'+'\u0100'*9999    382 (+43%)   307 (+79%)   548
utf-8       '\U00010000'+'\u8000'*9999    404 (+13%)   365 (+25%)   458

On 32-bit Linux, Intel Atom N570:

                                          vanilla      old patch    fast patch

utf-8     'A'*10000                       623 (+1%)    631 (+0%)    631
utf-8     '\x80'*10000                    145 (+26%)   165 (+11%)   183
utf-8       '\x80'+'A'*9999               354 (-0%)    358 (-1%)    353
utf-8     '\u0100'*10000                  164 (+10%)   156 (+16%)   181
utf-8       '\u0100'+'A'*9999             343 (+1%)    350 (-1%)    348
utf-8       '\u0100'+'\x80'*9999          164 (+10%)   157 (+15%)   181
utf-8     '\u8000'*10000                  175 (-1%)    166 (+5%)    174
utf-8       '\u8000'+'A'*9999             349 (+0%)    356 (-2%)    349
utf-8       '\u8000'+'\x80'*9999          164 (+10%)   157 (+15%)   180
utf-8       '\u8000'+'\u0100'*9999        164 (+10%)   157 (+15%)   181
utf-8     '\U00010000'*10000              152 (+7%)    163 (+0%)    163
utf-8       '\U00010000'+'A'*9999         313 (+4%)    332 (-2%)    327
utf-8       '\U00010000'+'\x80'*9999      161 (+11%)   140 (+28%)   179
utf-8       '\U00010000'+'\u0100'*9999    161 (+11%)   139 (+28%)   178
utf-8       '\U00010000'+'\u8000'*9999    160 (+9%)    159 (+9%)    174
msg163587 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-06-23 11:38
Why is this marked "fixed"? Is it fixed or not?
msg163588 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-23 11:42
I deleted the fast patch, since it is unsafe. Issue14923 should compensate for the small slowdown more safely.

I think this change is not a bugfix (this is not a bug, the standard allows such behavior) but a new feature, so I doubt the need to fix 2.7 and 3.2. Any chance to commit the patch today and get this feature into Python 3.3?
msg163591 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-23 11:55
No, it is not fully fixed. Only one bug was fixed, but the current
behavior still does not conform to the Unicode Standard
*recommendations*. Not conforming to recommendations is not a bug;
conforming is a feature.
msg163674 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-23 21:03
Here is an updated, slightly faster, patch. It is merged with decode_utf8_range_check.patch from issue14923.

The patch contains Ezio Melotti's tests, unmodified, and they all pass.
msg163677 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-23 21:21
Here is an updated patch with the merge conflict with 3214c9ebcf5e resolved.
msg174828 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-04 20:18
What about committing?  All of Ezio's tests passed, and the microbenchmark shows less than 10% difference:

vanilla      patched
MB/s         MB/s

2076 (-3%)   2007   decode  utf-8  'A'*10000
414 (-0%)    413    decode  utf-8  '\x80'*10000
1283 (-1%)   1275   decode  utf-8    '\x80'+'A'*9999
556 (-8%)    514    decode  utf-8  '\u0100'*10000
1227 (-4%)   1172   decode  utf-8    '\u0100'+'A'*9999
556 (-8%)    514    decode  utf-8    '\u0100'+'\x80'*9999
406 (+10%)   447    decode  utf-8  '\u8000'*10000
1225 (-5%)   1167   decode  utf-8    '\u8000'+'A'*9999
554 (-7%)    513    decode  utf-8    '\u8000'+'\x80'*9999
552 (-8%)    508    decode  utf-8    '\u8000'+'\u0100'*9999
358 (-4%)    345    decode  utf-8  '\U00010000'*10000
1173 (-5%)   1118   decode  utf-8    '\U00010000'+'A'*9999
492 (+1%)    495    decode  utf-8    '\U00010000'+'\x80'*9999
492 (+1%)    496    decode  utf-8    '\U00010000'+'\u0100'*9999
383 (+5%)    401    decode  utf-8    '\U00010000'+'\u8000'*9999
msg174831 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-11-04 21:23
New changeset 5962f192a483 by Ezio Melotti in branch '3.3':
#8271: the utf-8 decoder now outputs the correct number of U+FFFD  characters when used with the "replace" error handler on invalid utf-8 sequences.  Patch by Serhiy Storchaka, tests by Ezio Melotti.
http://hg.python.org/cpython/rev/5962f192a483

New changeset 5b205fff1972 by Ezio Melotti in branch 'default':
#8271: merge with 3.3.
http://hg.python.org/cpython/rev/5b205fff1972
msg174832 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-11-04 21:37
Fixed, thanks for updating the patch!
I committed it on 3.3 too, and while this could have gone into 2.7/3.2 as well IMHO, it's too much work to port it there and not worth it.
msg174834 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-04 21:50
Agreed.  In 2.7 the UTF-8 codec is still broken in corner cases (it accepts
surrogates) and 3.2 is coming to the end of maintenance.  In any case it is
only a recommendation, not a demand.
msg174839 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-11-04 23:00
New changeset 96f4cee8ea5e by Victor Stinner in branch '3.3':
Issue #8271: Fix compilation on Windows
http://hg.python.org/cpython/rev/96f4cee8ea5e

New changeset 6f44f33460cd by Victor Stinner in branch 'default':
(Merge 3.3) Issue #8271: Fix compilation on Windows
http://hg.python.org/cpython/rev/6f44f33460cd
History
Date User Action Args
2022-04-11 14:56:59  admin  set  github: 52518
2014-03-31 23:38:33  jmehnle  set  nosy: + jmehnle
2012-11-04 23:00:39  python-dev  set  messages: + msg174839
2012-11-04 21:50:05  serhiy.storchaka  set  messages: + msg174834
2012-11-04 21:37:03  ezio.melotti  set  status: open -> closed; messages: + msg174832; versions: + Python 3.3
2012-11-04 21:23:36  python-dev  set  nosy: + python-dev; messages: + msg174831
2012-11-04 20:18:23  serhiy.storchaka  set  messages: + msg174828
2012-11-04 20:07:41  serhiy.storchaka  set  versions: + Python 3.4, - Python 3.1, Python 2.7, Python 3.2, Python 3.3
2012-11-04 20:06:44  serhiy.storchaka  set  files: - issue8271-3.3-fast-2.patch
2012-11-04 20:06:26  serhiy.storchaka  set  files: - issue8271-3.3.patch
2012-06-23 21:21:02  serhiy.storchaka  set  files: + issue8271-3.3-fast-3.patch; messages: + msg163677
2012-06-23 21:03:47  serhiy.storchaka  set  files: + issue8271-3.3-fast-2.patch; messages: + msg163674
2012-06-23 11:55:49  serhiy.storchaka  set  messages: + msg163591
2012-06-23 11:42:25  serhiy.storchaka  set  messages: + msg163588
2012-06-23 11:38:42  pitrou  set  messages: + msg163587
2012-06-23 11:35:52  serhiy.storchaka  set  files: - issue8271-3.3-fast.patch
2012-05-26 09:28:00  serhiy.storchaka  set  files: + issue8271-3.3-fast.patch; messages: + msg161655
2012-05-26 08:50:35  serhiy.storchaka  set  messages: + msg161650
2012-05-25 22:38:41  ezio.melotti  set  messages: + msg161627
2012-05-25 20:21:51  serhiy.storchaka  set  files: + issue8271-3.3.patch; messages: + msg161622
2012-05-17 19:06:38  serhiy.storchaka  set  messages: + msg161005
2012-05-17 18:59:28  ezio.melotti  set  messages: + msg161004
2012-05-17 18:55:05  ezio.melotti  set  messages: + msg161002
2012-05-17 18:52:51  serhiy.storchaka  set  messages: + msg161001
2012-05-17 18:46:04  serhiy.storchaka  set  messages: + msg161000
2012-05-17 18:33:46  serhiy.storchaka  set  messages: + msg160998
2012-05-17 18:12:55  ezio.melotti  set  messages: + msg160993
2012-05-17 17:36:22  spatz123  set  messages: + msg160991
2012-05-17 17:36:08  ezio.melotti  set  messages: + msg160990
2012-05-17 17:31:03  serhiy.storchaka  set  messages: + msg160989
2012-05-17 16:55:39  ezio.melotti  set  messages: + msg160981
2012-05-17 16:35:22  serhiy.storchaka  set  nosy: + serhiy.storchaka; messages: + msg160980
2011-09-21 09:44:44  Ringding  set  nosy: + Ringding
2011-08-15 17:15:30  ezio.melotti  set  messages: + msg142132
2011-07-07 10:08:07  spatz123  set  nosy: + spatz123
2011-04-19 12:30:42  ezio.melotti  set  files: + issue8271v6.diff; messages: + msg134046; versions: + Python 3.3, - Python 2.6
2011-02-28 09:18:07  lemburg  set  nosy: lemburg, sjmachin, belopolsky, pitrou, vstinner, ezio.melotti, dangra; messages: + msg129685
2011-02-27 18:58:56  ezio.melotti  set  nosy: lemburg, sjmachin, belopolsky, pitrou, vstinner, ezio.melotti, dangra; messages: + msg129647
2011-02-26 03:31:22  ezio.melotti  set  nosy: lemburg, sjmachin, belopolsky, pitrou, vstinner, ezio.melotti, dangra; messages: + msg129495
2010-12-29 23:31:38  belopolsky  set  nosy: + belopolsky
2010-07-03 09:36:43  sjmachin  set  messages: + msg109170
2010-07-03 05:42:06  ezio.melotti  set  resolution: fixed; messages: + msg109159; stage: patch review -> resolved
2010-07-03 00:49:12  ezio.melotti  set  messages: + msg109155
2010-07-01 19:19:12  ezio.melotti  set  messages: + msg109070
2010-06-30 20:21:39  ezio.melotti  set  messages: + msg109015
2010-06-05 20:35:39  ezio.melotti  set  messages: + msg107163
2010-06-04 16:22:48  ezio.melotti  set  files: + issue8271v5.diff; messages: + msg107074
2010-04-07 13:09:51  ezio.melotti  set  nosy: + pitrou
2010-04-07 09:02:03  vstinner  set  messages: + msg102523
2010-04-07 08:37:35  lemburg  set  messages: + msg102522
2010-04-07 04:08:13  ezio.melotti  set  keywords: + needs review; files: + issue8271v4.diff; messages: + msg102516
2010-04-04 05:49:17  ezio.melotti  set  files: + issue8271v3.diff; messages: + msg102320
2010-04-03 14:43:21  vstinner  set  nosy: + vstinner; messages: + msg102265
2010-04-03 11:41:34  lemburg  set  messages: + msg102239
2010-04-02 22:27:17  ezio.melotti  set  files: + issue8271v2.diff; stage: test needed -> patch review; messages: + msg102209; versions: + Python 2.6
2010-04-01 15:23:22  sjmachin  set  messages: + msg102101
2010-04-01 15:07:07  sjmachin  set  messages: + msg102099
2010-04-01 15:01:37  lemburg  set  messages: + msg102098
2010-04-01 14:50:11  ezio.melotti  set  messages: + msg102095
2010-04-01 14:43:21  sjmachin  set  messages: + msg102094
2010-04-01 14:12:59  lemburg  set  messages: + msg102093
2010-04-01 13:47:37  sjmachin  set  messages: + msg102090
2010-04-01 13:19:04  lemburg  set  messages: + msg102089
2010-04-01 11:57:00  sjmachin  set  messages: + msg102085
2010-04-01 08:46:32  lemburg  set  messages: + msg102077
2010-04-01 08:33:45  ezio.melotti  set  keywords: + patch; files: + issue8271.diff; messages: + msg102076
2010-04-01 07:44:48  lemburg  set  messages: + msg102068
2010-04-01 07:33:49  ezio.melotti  set  messages: + msg102066
2010-04-01 07:29:45  sjmachin  set  messages: + msg102065
2010-04-01 06:14:54  ezio.melotti  set  messages: + msg102064
2010-04-01 06:08:59  sjmachin  set  messages: + msg102063
2010-04-01 04:43:51  ezio.melotti  set  assignee: ezio.melotti
2010-04-01 03:28:27  ezio.melotti  set  messages: + msg102062
2010-04-01 03:19:32  sjmachin  set  messages: + msg102061
2010-03-31 18:07:43  lemburg  set  messages: + msg102024
2010-03-31 15:22:35  r.david.murray  set  nosy: + lemburg
2010-03-31 14:59:29  dangra  set  messages: + msg102013
2010-03-31 14:56:43  dangra  set  nosy: + dangra
2010-03-31 06:41:59  ezio.melotti  set  versions: + Python 3.2; nosy: + ezio.melotti; priority: normal; components: + Unicode; stage: test needed
2010-03-31 02:28:10  sjmachin  create