str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0 #52518

sjmachin · 2010-03-31T02:28:11Z

BPO	8271
Nosy	@malemburg, @abalkin, @pitrou, @vstinner, @ezio-melotti, @serhiy-storchaka
Files	issue8271.diff: Incomplete patch against trunk. issue8271v2.diff: New patch against trunk issue8271v3.diff: Final patch issue8271v4.diff: More final patch issue8271v5.diff: Even more final patch issue8271v6.diff: Patch to fix the number of FFFD issue8271-3.3-fast-3.patch: Ezio's patch updated to current sources

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/ezio-melotti'
closed_at = <Date 2012-11-04.21:37:03.868>
created_at = <Date 2010-03-31.02:28:10.626>
labels = ['type-bug', 'expert-unicode']
title = "str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0"
updated_at = <Date 2014-03-31.23:38:33.951>
user = 'https://bugs.python.org/sjmachin'

bugs.python.org fields:

activity = <Date 2014-03-31.23:38:33.951>
actor = 'jmehnle'
assignee = 'ezio.melotti'
closed = True
closed_date = <Date 2012-11-04.21:37:03.868>
closer = 'ezio.melotti'
components = ['Unicode']
creation = <Date 2010-03-31.02:28:10.626>
creator = 'sjmachin'
dependencies = []
files = ['16720', '16741', '16754', '16794', '17552', '21719', '26118']
hgrepos = []
issue_num = 8271
keywords = ['patch', 'needs review']
message_count = 66.0
messages = ['101972', '102013', '102024', '102061', '102062', '102063', '102064', '102065', '102066', '102068', '102076', '102077', '102085', '102089', '102090', '102093', '102094', '102095', '102098', '102099', '102101', '102209', '102239', '102265', '102320', '102516', '102522', '102523', '107074', '107163', '109015', '109070', '109155', '109159', '109170', '129495', '129647', '129685', '134046', '142132', '160980', '160981', '160989', '160990', '160991', '160993', '160998', '161000', '161001', '161002', '161004', '161005', '161622', '161627', '161650', '161655', '163587', '163588', '163591', '163674', '163677', '174828', '174831', '174832', '174834', '174839']
nosy_count = 12.0
nosy_names = ['lemburg', 'sjmachin', 'belopolsky', 'pitrou', 'vstinner', 'ezio.melotti', 'Ringding', 'dangra', 'python-dev', 'jmehnle', 'spatz123', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue8271'
versions = ['Python 3.3', 'Python 3.4']

sjmachin · 2010-03-31T02:28:10Z

Unicode 5.2.0 chapter 3 (Conformance) has a new section (headed "Constraints on Conversion Processes) after requirement D93. Recent Pythons e.g. 3.1.2 don't comply. Using the Unicode example:

 >>> print(ascii(b"\xc2\x41\x42".decode('utf8', 'replace')))
 '\ufffdB'
 # should produce u'\ufffdAB'

Resynchronisation currently starts at a position derived by considering the length implied by the start byte:

 >>> print(ascii(b"\xf1ABCD".decode('utf8', 'replace')))
 '\ufffdD'
 # should produce u'\ufffdABCD'; resync should start from the *failing* byte.

Notes: This applies to the 'ignore' option as well as the 'replace' option. The Unicode discussion mentions "security exploits".

dangra · 2010-03-31T14:59:30Z

Some background for this report at http://stackoverflow.com/questions/2547262/why-is-python-decode-replacing-more-than-the-invalid-bytes-from-an-encoded-string/2548480

malemburg · 2010-03-31T18:07:43Z

I guess the term "failing" byte somewhat underdefined.

Page 95 of the standard PDF (http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf) suggests to "Replace each maximal subpart of an ill-formed subsequence by a single U+FFFD".

Fortunately, they explain what they are after: if a subsequent byte in the sequence does not have the high bit set, it's not to be considered part of the UTF-8 sequence of the code point.

Implementing that should be fairly straight-forward by adjusting the endinpos variable accordingly.

Any takers ?

sjmachin · 2010-04-01T03:19:32Z

@lemburg: "failing byte" seems rather obvious: first byte that you meet that is not valid in the current state. I don't understand your explanation, especially "does not have the high bit set". I think you mean "is a valid starter byte". See example 3 below.

Example 1: F1 80 41 42 43. F1 implies a 4-byte character. 80 is OK. 41 is not in 80-BF. It is the "failing byte"; high bit not set. Required action is to emit FFFD then resync on the 41, causing 0041 0042 0043 to be emitted. Total output: FFFD 0041 0042 0043. Current code emits FFFD 0043.

Example 2: F1 80 FF 42 43. F1 implies a 4-byte character. 80 is OK. FF is not in 80-BF. It is the "failing byte". Required action is to emit FFFD then resync on the FF. FF is not a valid starter byte, so emit FFFD, and resync on the 42, causing 0042 0043 to be emitted. Total output: FFFD FFFD 0042 0043. Current code emits FFFD 0043.

Example 3: F1 80 C2 81 43. F1 implies a 4-byte character. 80 is OK. C2 is not in 80-BF. It is the "failing byte". Required action is to emit FFFD then resync on the C2. C2 and 81 have the high bit set, but C2 is a valid starter byte, and remaining bytes are OK, causing 0081 0043 to be emitted. Total output: FFFD 0081 0043. Current code emits FFFD 0043.

ezio-melotti · 2010-04-01T03:28:28Z

Having the 'high bit set' means that the first bit is set to 1.
All the continuation bytes (i.e. the 2nd, 3rd or 4th byte in a sequence) have the first two bits set to 1 and 0 respectively, so if the first bit is not set to 1 then the byte shouldn't be considered part of the sequence.
I'm trying to work on a patch.

sjmachin · 2010-04-01T06:08:59Z

@ezio.melotti: Your second sentence is true, but it is not the whole truth. Bytes in the range C0-FF (whose high bit *is* set) ALSO shouldn't be considered part of the sequence because they (like 00-7F) are invalid as continuation bytes; they are either starter bytes (C2-F4) or invalid for any purpose (C0-C2 and F5-FF). Further, some bytes in the range 80-BF are NOT always valid as the first continuation byte, it depends on what starter byte they follow.

The simple way of summarising the above is to say that a byte that is not a valid continuation byte in the current state ("failing byte") is not a part of the current (now known to be invalid) sequence, and the decoder must try again ("resync") with the failing byte.

Do you agree with my example 3?

ezio-melotti · 2010-04-01T06:14:54Z

Yes, right now I'm considering valid all the bytes that start with '10...'. C2 starts with '11...' so it's a "failing byte".

sjmachin · 2010-04-01T07:29:45Z

#ezio.melotti: """I'm considering valid all the bytes that start with '10...'"""

Sorry, WRONG. Read what I wrote: """Further, some bytes in the range 80-BF are NOT always valid as the first continuation byte, it depends on what starter byte they follow."""

Consider these sequences: (1) E0 80 80 (2) E0 9F 80. Both are invalid sequences (over-long). Specifically the first continuation byte may not be in 80-9F. Those bytes start with '10...' but they are invalid after an E0 starter byte.

Please read "Table 3-7. Well-Formed UTF-8 Byte Sequences" and surrounding text in Unicode 5.2.0 chapter 3 (bearing in mind that CPython (for good reasons) doesn't implement the surrogates restriction, so that the special case for starter byte ED is not used in CPython). Note the other 3 special cases for the first continuation byte.

ezio-melotti · 2010-04-01T07:33:50Z

That's why I'm writing tests that cover all the cases, including overlong sequences. If the test will fail I'll change the patch :)

malemburg · 2010-04-01T07:44:48Z

John Machin wrote:

John Machin <sjmachin@users.sourceforge.net> added the comment:

@lemburg: "failing byte" seems rather obvious: first byte that you meet that is not valid in the current state. I don't understand your explanation, especially "does not have the high bit set". I think you mean "is a valid starter byte". See example 3 below.

I just had a quick look at the code and saw that it's testing for the high
bit on the subsequent bytes.

Looking closer, you're right and the situation is a bit more complex,
but the solution still looks simple: only the endinpos
has to be adjusted more carefully depending on what the various
checks find.

That said, I find the Unicode consortium solution a bit awkward.
In UTF-8 the first byte in a multi-byte sequence defines the number
of bytes that make up a sequence. If some of those bytes are invalid,
the whole sequence is invalid and the fact that some of those
bytes may be interpretable as regular code points does not necessarily
result in better results - the reason is that loss of bytes in a
stream is far more unlikely than flipping a few bits in the data.

ezio-melotti · 2010-04-01T08:33:43Z

Here is an incomplete patch. It seems to solve the problem but I still have to add more tests and check it better.
I also wonder if the sequences with the first byte in range F5-FD (start of 4/5/6-byte sequences, restricted by RFC 3629) should behave in the same way. Right now they just "eat" the following 4/5/6 chars without checking.

malemburg · 2010-04-01T08:46:32Z

Ezio Melotti wrote:

Ezio Melotti <ezio.melotti@gmail.com> added the comment:

Here is an incomplete patch. It seems to solve the problem but I still have to add more tests and check it better.

Thanks. Please also check whether it's worthwhile unrolling those
loops by hand.

I also wonder if the sequences with the first byte in range F5-FD (start of 4/5/6-byte sequences, restricted by RFC 3629) should behave in the same way. Right now they just "eat" the following 4/5/6 chars without checking.

I think we need to do this all the way, even though 5 and 6 byte
sequences are not used at the moment.

sjmachin · 2010-04-01T11:57:01Z

Unicode has been frozen at 0x10FFFF. That's it. There is no such thing as a valid 5-byte or 6-byte UTF-8 string.

malemburg · 2010-04-01T13:19:04Z

John Machin wrote:

John Machin <sjmachin@users.sourceforge.net> added the comment:

Unicode has been frozen at 0x10FFFF. That's it. There is no such thing as a valid 5-byte or 6-byte UTF-8 string.

The UTF-8 codec was written at a time when UTF-8 still included
the possibility to have 5 or 6 bytes:

http://www.rfc-editor.org/rfc/rfc2279.txt

Use of those encodings has always raised an error, though. For error
handling purposes it still has to support those possibilities.

sjmachin · 2010-04-01T13:47:37Z

@lemburg: RFC 2279 was obsoleted by RFC 3629 over 6 years ago. The standard now says 21 bits is it. F5-FF are declared to be invalid. I don't understand what you mean by "supporting those possibilities". The code is correctly issuing an error message. The goal of supporting the new resyncing and FFFD-emitting rules might be better met however by throwing away the code in the default clause and instead merely setting the entries for F5-FF in the utf8_code_length array to zero.

malemburg · 2010-04-01T14:13:00Z

John Machin wrote:

John Machin <sjmachin@users.sourceforge.net> added the comment:

@lemburg: RFC 2279 was obsoleted by RFC 3629 over 6 years ago.

I know.

The standard now says 21 bits is it.

It says that the current Unicode codespace only uses 21 bits. In the
early days 16 bits were considered enough, so it wouldn't surprise me,
if they extend that range again at some point in the future - after
all, leaving 11 bits unused in UCS-4 is a huge waste of space.

If you have a reference that the Unicode consortium has decided
to stay with that limit forever, please quote it.

F5-FF are declared to be invalid. I don't understand what you mean by "supporting those possibilities". The code is correctly issuing an error message. The goal of supporting the new resyncing and FFFD-emitting rules might be better met however by throwing away the code in the default clause and instead merely setting the entries for F5-FF in the utf8_code_length array to zero.

Fair enough. Let's do that.

The reference in the table should then be updated to RFC 3629.

sjmachin · 2010-04-01T14:43:22Z

Patch review:

Preamble: pardon my ignorance of how the codebase works, but trunk unicodeobject.c is r79494 (and allows encoding of surrogate codepoints), py3k unicodeobject.c is r79506 (and bans the surrogate caper) and I can't find the r79542 that the patch mentions ... help, please!

length 2 case:

the loop can be hand-unrolled into oblivion. It can be entered only when s[1] & 0xC0 != 0x80 (previous if test).
the over-long check (if (ch < 0x80)) hasn't been touched. It could be removed and the entries for C0 and C1 in the utf8_code_length array set to 0.

length 3 case:

the tests involving s[0] being 0xE0 or 0xED are misplaced.
the test s[0] == 0xE0 && s[1] < 0xA0 if not misplaced would be shadowing the over-long test (ch < 0x800). It seems better to use the over-long test (with endinpos set to 1).
The test s[0] == 0xED relates to the surrogates caper which in the py3k version is handled in the same place as the over-long test.
unrolling loop: needs no loop, only 1 test ... if s[1] is good, then we know s[2] must be bad without testing it, because we start the for loop only when s[1] is bad || s[2] is bad.

length 4 case: as for the len 3 case generally ... misplaced tests, F1 test shadows over-long test, F4 test shadows max value test, too many loop iterations.

ezio-melotti · 2010-04-01T14:50:11Z

Even if they are not valid they still "eat" all the 4/5/6 bytes, so they should be fixed too. I haven't see anything about these bytes in chapter 3 so far, but there are at least two possibilities:

consider all the bytes in range F5-FD as invalid without looking for the other bytes;
try to read the next 4/5/6 bytes and fail if they are no continuation bytes.
We can also look at what others do (e.g. browsers and other languages).

malemburg · 2010-04-01T15:01:36Z

Ezio Melotti wrote:

Ezio Melotti <ezio.melotti@gmail.com> added the comment:

Even if they are not valid they still "eat" all the 4/5/6 bytes, so they should be fixed too. I haven't see anything about these bytes in chapter 3 so far, but there are at least two possibilities:

consider all the bytes in range F5-FD as invalid without looking for the other bytes;

try to read the next 4/5/6 bytes and fail if they are no continuation bytes.
We can also look at what others do (e.g. browsers and other languages).

By marking those entries as 0 in the length table, they would only
use one byte, however, compared to the current state, that would
produce more replacement code points in the output, so perhaps applying
the same logic as for the other sequences is a better strategy.

sjmachin · 2010-04-01T15:07:07Z

Chapter 3, page 94: """As a consequence of the well-formedness conditions specified in Table 3-7, the following byte values are disallowed in UTF-8: C0–C1, F5–FF"""

Of course they should be handled by the simple expedient of setting their length entry to zero. Why write code when there is an existing mechanism??

sjmachin · 2010-04-01T15:23:22Z

@lemburg: """perhaps applying the same logic as for the other sequences is a better strategy"""

What other sequences??? F5-FF are invalid bytes; they don't start valid sequences. What same logic?? At the start of a character, they should get the same short sharp treatment as any other non-starter byte e.g. 80 or C0.

ezio-melotti · 2010-04-02T22:27:15Z

Here's a new patch. Should be complete but I want to test it some more before committing.
I decided to follow RFC 3629, putting 0 instead of 5/6 for bytes in range F5-FD (we can always put them back in the unlikely case that the Unicode Consortium changed its mind) and also for other invalid ranges (e.g. C0-C1). This lead to some simplification in the code.

I also found out that, according to RFC 3629, surrogates are considered invalid and they can't be encoded/decoded, but the UTF-8 codec actually does it. I included tests and fix but I left them commented out because this is out of the scope of this patch, and it probably need a discussion on python-dev.

malemburg · 2010-04-03T11:41:33Z

Ezio Melotti wrote:

Ezio Melotti <ezio.melotti@gmail.com> added the comment:

Here's a new patch. Should be complete but I want to test it some more before committing.
I decided to follow RFC 3629, putting 0 instead of 5/6 for bytes in range F5-FD (we can always put them back in the unlikely case that the Unicode Consortium changed its mind) and also for other invalid ranges (e.g. C0-C1). This lead to some simplification in the code.

Ok.

I also found out that, according to RFC 3629, surrogates are considered invalid and they can't be encoded/decoded, but the UTF-8 codec actually does it. I included tests and fix but I left them commented out because this is out of the scope of this patch, and it probably need a discussion on python-dev.

Right, but that idea is controversial. In Python we need to be able to
put those surrogate code points into source code (encoded as UTF-8) as
well as pickle and marshal dumps of Unicode object dumps, so we can't
consider them invalid UTF-8.

vstinner · 2010-04-03T14:43:21Z

I also found out that, according to RFC 3629, surrogates
are considered invalid and they can't be encoded/decoded,
but the UTF-8 codec actually does it.

Python2 does, but Python3 raises an error.

Python 2.7a4+ (trunk:79675, Apr  3 2010, 16:11:36)
>>> u"\uDC80".encode("utf8")
'\xed\xb2\x80'

Python 3.2a0 (py3k:79441, Mar 26 2010, 13:04:55)
>>> "\uDC80".encode("utf8")
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed

Deny encoding surrogates (in utf8) causes a lot of crashs in Python3, because most functions calling suppose that _PyUnicode_AsString() does never fail: see bpo-6687 (and bpo-8195 and a lot of other crashs). It's not a good idea to change it in Python 2.7, because it would require a huge work and we are close to the first beta of 2.7.

ezio-melotti · 2010-04-04T05:49:13Z

This new patch (v3) should be ok.
I added a few more tests and found another corner case:
'\xe1a'.decode('utf-8', 'replace') was returning u'\ufffd' because \xe1 is the start byte of a 3-byte sequence and there were only two bytes in the string. This is now fixed in the latest patch.

I also unrolled all the loops except the first one because I haven't found an elegant way to unroll it (yet).

Finally, I changed the error messages to make them clearer:
unexpected code byte -> invalid start byte;
invalid data -> invalid continuation byte.
(I can revert this if the old messages are better or if it is better to fix this with a separate commit.)

The performances seem more or less the same, I did some benchmarks without significant changes in the results. If you have better benchmarks let me know. I used a file of 320kB with some ASCII, ASCII mixed with some accented characters, Japanese and a file with a sample of several different Unicode chars.

ezio-melotti · 2010-04-07T04:08:11Z

The patch was causing a failure in test_codeccallbacks, issue8271v4 fixes the test.
(The failing test in test_codeccallbacks was testing that registering error handlers works, using a function that replaced "\xc0\x80" with "\x00". Since now "\xc0" is an invalid start byte regardless of what follows, the function is now receiving only "\xc0" instead of "\xc0\x80" so I had to change the test.)

ezio-melotti · 2011-08-15T17:15:30Z

Here are some benchmarks:
Commands:
# half of the bytes are invalid
./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode' 'b_dec("utf-8", "surrogateescape")'
./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode' 'b_dec("utf-8", "replace")'
./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode' 'b_dec("utf-8", "ignore")'

With patch:
1000 loops, best of 3: 854 usec per loop
1000 loops, best of 3: 509 usec per loop
1000 loops, best of 3: 415 usec per loop

Without patch:
1000 loops, best of 3: 670 usec per loop
1000 loops, best of 3: 470 usec per loop
1000 loops, best of 3: 382 usec per loop

Commands (from the interactive interpreter):
# all valid codepoints
import timeit
b = "".join(chr(c) for c in range(0x110000) if c not in range(0xD800, 0xE000)).encode("utf-8")
b_dec = b.decode
timeit.Timer('b_dec("utf-8")', 'from __main__ import b_dec').timeit(100)/100
timeit.Timer('b_dec("utf-8", "surrogateescape")', 'from __main__ import b_dec').timeit(100)/100
timeit.Timer('b_dec("utf-8", "replace")', 'from __main__ import b_dec').timeit(100)/100
timeit.Timer('b_dec("utf-8", "ignore")', 'from __main__ import b_dec').timeit(100)/100

With patch:
0.03830226898193359
0.03849360942840576
0.03835036039352417
0.03821949005126953

Without patch:
0.03750091791152954
0.037977190017700196
0.04067679166793823
0.038579678535461424

Commands:
# near-worst case scenario, 1 byte dropped every 5 from a valid utf-8 string
b2 = bytes(c for k,c in enumerate(b) if k%5)
b2_dec = b2.decode
timeit.Timer('b2_dec("utf-8", "surrogateescape")', 'from __main__ import b2_dec').timeit(10)/10
timeit.Timer('b2_dec("utf-8", "replace")', 'from __main__ import b2_dec').timeit(10)/10
timeit.Timer('b2_dec("utf-8", "ignore")', 'from __main__ import b2_dec').timeit(10)/10

With patch:
9.645482301712036
6.602735090255737
5.338080596923828

Without patch:
8.124328684806823
5.804249691963196
4.851014900207519

All tests done on wide 3.2.

Since the changes are about errors, decoding of valid utf-8 strings is not affected. Decoding with non-strict error handlers and invalid strings are slower, but I don't think the difference is significant.
If the patch is fine I will commit it.

serhiy-storchaka · 2012-05-17T16:35:23Z

Looks like bpo-14738 fixes this bug for Python 3.3.

>>> print(ascii(b"\xc2\x41\x42".decode('utf8', 'replace')))
'\ufffdAB'
>>> print(ascii(b"\xf1ABCD".decode('utf8', 'replace')))
'\ufffdABCD'

ezio-melotti · 2012-05-17T16:55:40Z

The original bug should be fixed already in 3.3 and there should be tests (unless they got removed/skipped after we changed unicode implementation).
The only issue left was about the number of U+FFFD generated with invalid sequences in some cases.
My last patch has extensive tests for this, so you could try to apply it (or copy the tests) and see if they all pass. FWIW this should be already fixed on PyPy.

serhiy-storchaka · 2012-05-17T17:31:03Z

The only issue left was about the number of U+FFFD generated with invalid sequences in some cases.
My last patch has extensive tests for this, so you could try to apply it (or copy the tests) and see if they all pass.

Tests fails, but I'm not sure that the tests are correct.

b'\xe0\x00' raises 'unexpected end of data' and not 'invalid
continuation byte'. This is terminological issue.

b'\xe0\x80'.decode('utf-8', 'replace') returns one U+FFFD and not two. I
don't think that is right.

ezio-melotti · 2012-05-17T17:36:08Z

Tests fails, but I'm not sure that the tests are correct.

b'\xe0\x00' raises 'unexpected end of data' and not 'invalid
continuation byte'. This is terminological issue.

This might be just because it first checks if there two more bytes before checking if they are valid, but 'invalid continuation byte' works too.

b'\xe0\x80'.decode('utf-8', 'replace') returns one U+FFFD and not
two. I don't think that is right.

Why not?

spatz123 · 2012-05-17T17:36:23Z

b'\xe0\x80'.decode('utf-8', 'replace') returns >one U+FFFD and not two. I
don't think that is right.

I think that one U+FFFD is correct. The on;y error is a premature end of
data.
On Thu, May 17, 2012 at 12:31 PM, Serhiy Storchaka
<report@bugs.python.org>wrote:

Serhiy Storchaka <storchaka@gmail.com> added the comment:

> The only issue left was about the number of U+FFFD generated with
invalid sequences in some cases.
> My last patch has extensive tests for this, so you could try to apply it
(or copy the tests) and see if they all pass.

Tests fails, but I'm not sure that the tests are correct.

b'\xe0\x00' raises 'unexpected end of data' and not 'invalid
continuation byte'. This is terminological issue.

b'\xe0\x80'.decode('utf-8', 'replace') returns one U+FFFD and not two. I
don't think that is right.

----------
title: str.decode('utf8', 'replace') -- conformance with Unicode
5.2.0 -> str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue8271\>

ezio-melotti · 2012-05-17T18:12:56Z

Changing from 'unexpected end of data' to 'invalid continuation byte' for b'\xe0\x00' is fine with me, but this will be a (minor) deviation from 2.7, 3.1, 3.2, and pypy (it could still be changed on all these except 3.1 though).

If you make any changes on the tests please let me know.

serhiy-storchaka · 2012-05-17T18:33:46Z

I think that one U+FFFD is correct. The on;y error is a premature end of
data.

I poorly expressed. I also think that there is only one decoding error,
and not two. I think the test is wrong.

serhiy-storchaka · 2012-05-17T18:46:04Z

This might be just because it first checks if there two more bytes before checking if they are valid, but 'invalid continuation byte' works too.

Yes, this implementation detail. It is much easier and faster. Whether
it is necessary to change it?

Why not?

May be I'm wrong. I looked in "The Unicode Standard, Version
6.0" (http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf), pp. 95-97,
the standard does not categorical in this, but recommends that only
maximal subpart should be replaced by U+FFFD. \xe0\x80 is not maximal
subpart. Therefore, there must be two U+FFFD. In this case, the previous
and the current implementation does not conform to the standard.

serhiy-storchaka · 2012-05-17T18:52:51Z

Changing from 'unexpected end of data' to 'invalid continuation byte' for b'\xe0\x00' is fine with me, but this will be a (minor) deviation from 2.7, 3.1, 3.2, and pypy (it could still be changed on all these except 3.1 though).

I probably poorly said. Past and current implementations raise
'unexpected end of data' and not 'invalid continuation byte'. Test
expects 'invalid continuation byte'.

ezio-melotti · 2012-05-17T18:55:05Z

\xe0\x80 is not maximal subpart. Therefore, there must be two U+FFFD.

OK, now I get what you mean. The valid range for continuation bytes that can follow E0 is A0-BF, not 80-BF as usual, so \x80 is not a valid continuation byte here. While working on the patch I stumbled across this corner case and contacted the Unicode consortium to ask about it, as explained in msg129495.

I don't remember all the details right now, but it that test was passing with my patch there must be something wrong somewhere (either in the patch, in the test, or in our understanding of the standard).

ezio-melotti · 2012-05-17T18:59:28Z

I probably poorly said. Past and current implementations raise
'unexpected end of data' and not 'invalid continuation byte'. Test
expects 'invalid continuation byte'.

I don't think it matters much either way.

serhiy-storchaka · 2012-05-17T19:06:39Z

I don't remember all the details right now, but it that test was passing with my patch there must be something wrong somewhere (either in the patch, in the test, or in our understanding of the standard).

No, test correctly expects two U+FFFD. Current implementation is wrong.
As I understand now, what's the error, I'll try to correct Python 3.3
implementation.

serhiy-storchaka · 2012-05-25T20:21:51Z

Here is a patch for 3.3. All of the tests pass successfully. Unfortunately, it is a little slow, but I tried to minimize the losses.

ezio-melotti · 2012-05-25T22:38:41Z

Do you have any benchmark results?

serhiy-storchaka · 2012-05-26T08:50:35Z

Here are the benchmark results (numbers are speed, MB/s).

On 32-bit Linux, AMD Athlon 64 X2:

                                      vanilla      patched

utf-8 'A'*10000 2016 (+5%) 2111
utf-8 '\x80'*10000 383 (+9%) 416
utf-8 '\x80'+'A'*9999 1283 (+1%) 1301
utf-8 '\u0100'*10000 383 (-8%) 354
utf-8 '\u0100'+'A'*9999 1258 (-6%) 1184
utf-8 '\u0100'+'\x80'*9999 383 (-8%) 354
utf-8 '\u8000'*10000 434 (-11%) 388
utf-8 '\u8000'+'A'*9999 1262 (-6%) 1180
utf-8 '\u8000'+'\x80'*9999 383 (-8%) 354
utf-8 '\u8000'+'\u0100'*9999 383 (-8%) 354
utf-8 '\U00010000'*10000 358 (+1%) 361
utf-8 '\U00010000'+'A'*9999 1168 (-5%) 1104
utf-8 '\U00010000'+'\x80'*9999 382 (-20%) 307
utf-8 '\U00010000'+'\u0100'*9999 382 (-20%) 307
utf-8 '\U00010000'+'\u8000'*9999 404 (-10%) 365

On 32-bit Linux, Intel Atom N570:

                                      vanilla      patched

ascii 'A'*10000 789 (+1%) 800

latin1 'A'*10000 796 (-2%) 781
latin1 'A'*9999+'\x80' 779 (+1%) 789
latin1 '\x80'*10000 1739 (-3%) 1690
latin1 '\x80'+'A'*9999 1747 (+1%) 1773

utf-8 'A'*10000 623 (+1%) 631
utf-8 '\x80'*10000 145 (+14%) 165
utf-8 '\x80'+'A'*9999 354 (+1%) 358
utf-8 '\u0100'*10000 164 (-5%) 156
utf-8 '\u0100'+'A'*9999 343 (+2%) 350
utf-8 '\u0100'+'\x80'*9999 164 (-4%) 157
utf-8 '\u8000'*10000 175 (-5%) 166
utf-8 '\u8000'+'A'*9999 349 (+2%) 356
utf-8 '\u8000'+'\x80'*9999 164 (-4%) 157
utf-8 '\u8000'+'\u0100'*9999 164 (-4%) 157
utf-8 '\U00010000'*10000 152 (+7%) 163
utf-8 '\U00010000'+'A'*9999 313 (+6%) 332
utf-8 '\U00010000'+'\x80'*9999 161 (-13%) 140
utf-8 '\U00010000'+'\u0100'*9999 161 (-14%) 139
utf-8 '\U00010000'+'\u8000'*9999 160 (-1%) 159

serhiy-storchaka · 2012-05-26T09:27:59Z

Fortunately, bpo-14923 (if accepted) will compensate for the slowdown.

On 32-bit Linux, AMD Athlon 64 X2:

                                      vanilla      old patch    fast patch

utf-8 'A'*10000 2016 (+3%) 2111 (-2%) 2072
utf-8 '\x80'*10000 383 (+19%) 416 (+9%) 454
utf-8 '\x80'+'A'*9999 1283 (-7%) 1301 (-9%) 1190
utf-8 '\u0100'*10000 383 (+46%) 354 (+58%) 560
utf-8 '\u0100'+'A'*9999 1258 (-1%) 1184 (+5%) 1244
utf-8 '\u0100'+'\x80'*9999 383 (+46%) 354 (+58%) 558
utf-8 '\u8000'*10000 434 (+6%) 388 (+19%) 461
utf-8 '\u8000'+'A'*9999 1262 (-1%) 1180 (+5%) 1244
utf-8 '\u8000'+'\x80'*9999 383 (+46%) 354 (+58%) 559
utf-8 '\u8000'+'\u0100'*9999 383 (+45%) 354 (+57%) 555
utf-8 '\U00010000'*10000 358 (+5%) 361 (+4%) 375
utf-8 '\U00010000'+'A'*9999 1168 (-1%) 1104 (+5%) 1159
utf-8 '\U00010000'+'\x80'*9999 382 (+43%) 307 (+78%) 546
utf-8 '\U00010000'+'\u0100'*9999 382 (+43%) 307 (+79%) 548
utf-8 '\U00010000'+'\u8000'*9999 404 (+13%) 365 (+25%) 458

On 32-bit Linux, Intel Atom N570:

                                      vanilla      old patch    fast patch

utf-8 'A'*10000 623 (+1%) 631 (+0%) 631
utf-8 '\x80'*10000 145 (+26%) 165 (+11%) 183
utf-8 '\x80'+'A'*9999 354 (-0%) 358 (-1%) 353
utf-8 '\u0100'*10000 164 (+10%) 156 (+16%) 181
utf-8 '\u0100'+'A'*9999 343 (+1%) 350 (-1%) 348
utf-8 '\u0100'+'\x80'*9999 164 (+10%) 157 (+15%) 181
utf-8 '\u8000'*10000 175 (-1%) 166 (+5%) 174
utf-8 '\u8000'+'A'*9999 349 (+0%) 356 (-2%) 349
utf-8 '\u8000'+'\x80'*9999 164 (+10%) 157 (+15%) 180
utf-8 '\u8000'+'\u0100'*9999 164 (+10%) 157 (+15%) 181
utf-8 '\U00010000'*10000 152 (+7%) 163 (+0%) 163
utf-8 '\U00010000'+'A'*9999 313 (+4%) 332 (-2%) 327
utf-8 '\U00010000'+'\x80'*9999 161 (+11%) 140 (+28%) 179
utf-8 '\U00010000'+'\u0100'*9999 161 (+11%) 139 (+28%) 178
utf-8 '\U00010000'+'\u8000'*9999 160 (+9%) 159 (+9%) 174

pitrou · 2012-06-23T11:38:42Z

Why is this marked "fixed"? Is it fixed or not?

serhiy-storchaka · 2012-06-23T11:42:18Z

I deleted a fast patch, since it unsafe. bpo-14923 should safer compensate a small slowdown.

I think this change is not a bugfix (this is not a bug, the standard allows such behavior), but a new feature, so I doubt the need to fix 2.7 and 3.2. Any chance to commit the patch today and to get this feature in Python 3.3?

serhiy-storchaka · 2012-06-23T11:55:49Z

No, it is not fully fixed. Only one bug was fixed, but the current
behavior is still not conformed with the Unicode Standard
*recommendations*. Non-conforming with recommendations is not a bug,
conforming is a feature.

serhiy-storchaka · 2012-06-23T21:03:46Z

Here is updated, a little faster, patch. It merged with decode_utf8_range_check.patch from bpo-14923.

Patch contains non-modified Ezio Melotti's tests which all successfully passed.

serhiy-storchaka · 2012-06-23T21:21:02Z

Here is updated patch with resolved merge conflict with 3214c9ebcf5e.

serhiy-storchaka · 2012-11-04T20:18:23Z

What about commit? All Ezio's tests passsed, microbenchmark shows less than 10% differences:

vanilla patched
MB/s MB/s

2076 (-3%) 2007 decode utf-8 'A'*10000
414 (-0%) 413 decode utf-8 '\x80'*10000
1283 (-1%) 1275 decode utf-8 '\x80'+'A'*9999
556 (-8%) 514 decode utf-8 '\u0100'*10000
1227 (-4%) 1172 decode utf-8 '\u0100'+'A'*9999
556 (-8%) 514 decode utf-8 '\u0100'+'\x80'*9999
406 (+10%) 447 decode utf-8 '\u8000'*10000
1225 (-5%) 1167 decode utf-8 '\u8000'+'A'*9999
554 (-7%) 513 decode utf-8 '\u8000'+'\x80'*9999
552 (-8%) 508 decode utf-8 '\u8000'+'\u0100'*9999
358 (-4%) 345 decode utf-8 '\U00010000'*10000
1173 (-5%) 1118 decode utf-8 '\U00010000'+'A'*9999
492 (+1%) 495 decode utf-8 '\U00010000'+'\x80'*9999
492 (+1%) 496 decode utf-8 '\U00010000'+'\u0100'*9999
383 (+5%) 401 decode utf-8 '\U00010000'+'\u8000'*9999

python-dev · 2012-11-04T21:23:36Z

New changeset 5962f192a483 by Ezio Melotti in branch '3.3':
bpo-8271: the utf-8 decoder now outputs the correct number of U+FFFD characters when used with the "replace" error handler on invalid utf-8 sequences. Patch by Serhiy Storchaka, tests by Ezio Melotti.
http://hg.python.org/cpython/rev/5962f192a483

New changeset 5b205fff1972 by Ezio Melotti in branch 'default':
bpo-8271: merge with 3.3.
http://hg.python.org/cpython/rev/5b205fff1972

ezio-melotti · 2012-11-04T21:37:04Z

Fixed, thanks for updating the patch!
I committed it on 3.3 too, and while this could have gone on 2.7/3.2 too IMHO, it's to much work to port it there and not worth it.

serhiy-storchaka · 2012-11-04T21:50:05Z

Agree. In 2.7 UTF-8 codec still broken in corner cases (it accepts
surrogates) and 3.2 is coming to an end of maintaining. In any case it is
only recomendation, not demands.

python-dev · 2012-11-04T23:00:39Z

New changeset 96f4cee8ea5e by Victor Stinner in branch '3.3':
Issue bpo-8271: Fix compilation on Windows
http://hg.python.org/cpython/rev/96f4cee8ea5e

New changeset 6f44f33460cd by Victor Stinner in branch 'default':
(Merge 3.3) Issue bpo-8271: Fix compilation on Windows
http://hg.python.org/cpython/rev/6f44f33460cd

sjmachin mannequin added the type-bug An unexpected behavior, bug, or error label Mar 31, 2010

ezio-melotti added the topic-unicode label Mar 31, 2010

ezio-melotti self-assigned this Apr 1, 2010

malemburg changed the title ~~str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0~~ str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0 Apr 1, 2010

serhiy-storchaka changed the title ~~str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0~~ str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0 May 17, 2012

ezio-melotti closed this as completed Nov 4, 2012

ezio-melotti transferred this issue from another repository Apr 10, 2022

str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0 #52518

str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0 #52518

Comments

sjmachin mannequin commented Mar 31, 2010

sjmachin mannequin commented Mar 31, 2010

dangra mannequin commented Mar 31, 2010

malemburg commented Mar 31, 2010

sjmachin mannequin commented Apr 1, 2010

ezio-melotti commented Apr 1, 2010

sjmachin mannequin commented Apr 1, 2010

ezio-melotti commented Apr 1, 2010

sjmachin mannequin commented Apr 1, 2010

ezio-melotti commented Apr 1, 2010

malemburg commented Apr 1, 2010

ezio-melotti commented Apr 1, 2010

malemburg commented Apr 1, 2010

sjmachin mannequin commented Apr 1, 2010

malemburg commented Apr 1, 2010

sjmachin mannequin commented Apr 1, 2010

malemburg commented Apr 1, 2010

sjmachin mannequin commented Apr 1, 2010

ezio-melotti commented Apr 1, 2010

malemburg commented Apr 1, 2010

sjmachin mannequin commented Apr 1, 2010

sjmachin mannequin commented Apr 1, 2010

ezio-melotti commented Apr 2, 2010

malemburg commented Apr 3, 2010

vstinner commented Apr 3, 2010

ezio-melotti commented Apr 4, 2010

ezio-melotti commented Apr 7, 2010

ezio-melotti commented Aug 15, 2011

serhiy-storchaka commented May 17, 2012

ezio-melotti commented May 17, 2012

serhiy-storchaka commented May 17, 2012

ezio-melotti commented May 17, 2012

spatz123 mannequin commented May 17, 2012

ezio-melotti commented May 17, 2012

serhiy-storchaka commented May 17, 2012

serhiy-storchaka commented May 17, 2012

serhiy-storchaka commented May 17, 2012

ezio-melotti commented May 17, 2012

ezio-melotti commented May 17, 2012

serhiy-storchaka commented May 17, 2012

serhiy-storchaka commented May 25, 2012

ezio-melotti commented May 25, 2012

serhiy-storchaka commented May 26, 2012

serhiy-storchaka commented May 26, 2012

pitrou commented Jun 23, 2012

serhiy-storchaka commented Jun 23, 2012

serhiy-storchaka commented Jun 23, 2012

serhiy-storchaka commented Jun 23, 2012

serhiy-storchaka commented Jun 23, 2012

serhiy-storchaka commented Nov 4, 2012

python-dev mannequin commented Nov 4, 2012

ezio-melotti commented Nov 4, 2012

serhiy-storchaka commented Nov 4, 2012

python-dev mannequin commented Nov 4, 2012