str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0 #52518

Closed

sjmachin mannequin opened this issue Mar 31, 2010 · 66 comments
Assignees: ezio-melotti
Labels: topic-unicode, type-bug (An unexpected behavior, bug, or error)

Comments

@sjmachin (mannequin) commented Mar 31, 2010

BPO 8271
Nosy @malemburg, @abalkin, @pitrou, @vstinner, @ezio-melotti, @serhiy-storchaka
Files
  • issue8271.diff: Incomplete patch against trunk.
  • issue8271v2.diff: New patch against trunk
  • issue8271v3.diff: Final patch
  • issue8271v4.diff: More final patch
  • issue8271v5.diff: Even more final patch
  • issue8271v6.diff: Patch to fix the number of FFFD
  • issue8271-3.3-fast-3.patch: Ezio's patch updated to current sources
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/ezio-melotti'
    closed_at = <Date 2012-11-04.21:37:03.868>
    created_at = <Date 2010-03-31.02:28:10.626>
    labels = ['type-bug', 'expert-unicode']
    title = "str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0"
    updated_at = <Date 2014-03-31.23:38:33.951>
    user = 'https://bugs.python.org/sjmachin'

    bugs.python.org fields:

    activity = <Date 2014-03-31.23:38:33.951>
    actor = 'jmehnle'
    assignee = 'ezio.melotti'
    closed = True
    closed_date = <Date 2012-11-04.21:37:03.868>
    closer = 'ezio.melotti'
    components = ['Unicode']
    creation = <Date 2010-03-31.02:28:10.626>
    creator = 'sjmachin'
    dependencies = []
    files = ['16720', '16741', '16754', '16794', '17552', '21719', '26118']
    hgrepos = []
    issue_num = 8271
    keywords = ['patch', 'needs review']
    message_count = 66.0
    messages = ['101972', '102013', '102024', '102061', '102062', '102063', '102064', '102065', '102066', '102068', '102076', '102077', '102085', '102089', '102090', '102093', '102094', '102095', '102098', '102099', '102101', '102209', '102239', '102265', '102320', '102516', '102522', '102523', '107074', '107163', '109015', '109070', '109155', '109159', '109170', '129495', '129647', '129685', '134046', '142132', '160980', '160981', '160989', '160990', '160991', '160993', '160998', '161000', '161001', '161002', '161004', '161005', '161622', '161627', '161650', '161655', '163587', '163588', '163591', '163674', '163677', '174828', '174831', '174832', '174834', '174839']
    nosy_count = 12.0
    nosy_names = ['lemburg', 'sjmachin', 'belopolsky', 'pitrou', 'vstinner', 'ezio.melotti', 'Ringding', 'dangra', 'python-dev', 'jmehnle', 'spatz123', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue8271'
    versions = ['Python 3.3', 'Python 3.4']

    @sjmachin (mannequin, author) commented Mar 31, 2010

    Unicode 5.2.0 chapter 3 (Conformance) has a new section (headed "Constraints on Conversion Processes") after requirement D93. Recent Pythons (e.g. 3.1.2) don't comply. Using the Unicode example:

     >>> print(ascii(b"\xc2\x41\x42".decode('utf8', 'replace')))
     '\ufffdB'
     # should produce u'\ufffdAB'

    Resynchronisation currently starts at a position derived by considering the length implied by the start byte:

     >>> print(ascii(b"\xf1ABCD".decode('utf8', 'replace')))
     '\ufffdD'
     # should produce u'\ufffdABCD'; resync should start from the *failing* byte.

    Notes: This applies to the 'ignore' option as well as the 'replace' option. The Unicode discussion mentions "security exploits".
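
    For reference, here are the outputs that section calls for, written as decode calls; a minimal sketch of the conformant behaviour (it matches what CPython 3.3+ produces once the fix tracked in this issue landed):

        # Resync at the failing byte; one U+FFFD per maximal subpart of the
        # ill-formed sequence.
        assert b"\xc2\x41\x42".decode("utf-8", "replace") == "\ufffdAB"
        assert b"\xf1ABCD".decode("utf-8", "replace") == "\ufffdABCD"
        # 'ignore' simply drops the ill-formed bytes instead of replacing them:
        assert b"\xc2\x41\x42".decode("utf-8", "ignore") == "AB"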

    @sjmachin (mannequin) added the type-bug label Mar 31, 2010
    @dangra (mannequin) commented Mar 31, 2010

    @malemburg (Member) commented:

    I guess the term "failing" byte is somewhat underdefined.

    Page 95 of the standard PDF (http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf) suggests to "Replace each maximal subpart of an ill-formed subsequence by a single U+FFFD".

    Fortunately, they explain what they are after: if a subsequent byte in the sequence does not have the high bit set, it's not to be considered part of the UTF-8 sequence of the code point.

    Implementing that should be fairly straightforward by adjusting the endinpos variable accordingly.

    Any takers?

    @sjmachin (mannequin, author) commented Apr 1, 2010

    @lemburg: "failing byte" seems rather obvious: first byte that you meet that is not valid in the current state. I don't understand your explanation, especially "does not have the high bit set". I think you mean "is a valid starter byte". See example 3 below.

    Example 1: F1 80 41 42 43. F1 implies a 4-byte character. 80 is OK. 41 is not in 80-BF. It is the "failing byte"; high bit not set. Required action is to emit FFFD then resync on the 41, causing 0041 0042 0043 to be emitted. Total output: FFFD 0041 0042 0043. Current code emits FFFD 0043.

    Example 2: F1 80 FF 42 43. F1 implies a 4-byte character. 80 is OK. FF is not in 80-BF. It is the "failing byte". Required action is to emit FFFD then resync on the FF. FF is not a valid starter byte, so emit FFFD, and resync on the 42, causing 0042 0043 to be emitted. Total output: FFFD FFFD 0042 0043. Current code emits FFFD 0043.

    Example 3: F1 80 C2 81 43. F1 implies a 4-byte character. 80 is OK. C2 is not in 80-BF. It is the "failing byte". Required action is to emit FFFD then resync on the C2. C2 and 81 have the high bit set, but C2 is a valid starter byte, and remaining bytes are OK, causing 0081 0043 to be emitted. Total output: FFFD 0081 0043. Current code emits FFFD 0043.
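
    A sketch of the three examples as decode calls, assuming a decoder that resyncs at the failing byte as described above (these match the output of CPython 3.3+ after the fix tracked in this issue):

        assert b"\xf1\x80\x41\x42\x43".decode("utf-8", "replace") == "\ufffdABC"       # example 1
        assert b"\xf1\x80\xff\x42\x43".decode("utf-8", "replace") == "\ufffd\ufffdBC"  # example 2
        assert b"\xf1\x80\xc2\x81\x43".decode("utf-8", "replace") == "\ufffd\u0081C"   # example 3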

    @ezio-melotti (Member) commented:

    Having the 'high bit set' means that the first bit is set to 1.
    All the continuation bytes (i.e. the 2nd, 3rd or 4th byte in a sequence) have the first two bits set to 1 and 0 respectively, so if the first bit is not set to 1 then the byte shouldn't be considered part of the sequence.
    I'm trying to work on a patch.
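
    In code terms, that check is a sketch like the following (illustrative only, not code from the patch); as the next comments point out, it is necessary but not sufficient, because some starter bytes further restrict their first continuation byte:

        def is_continuation(byte):
            # Continuation bytes have the bit pattern 10xxxxxx.
            return byte & 0xC0 == 0x80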

    @ezio-melotti self-assigned this Apr 1, 2010
    @sjmachin (mannequin, author) commented Apr 1, 2010

    @ezio.melotti: Your second sentence is true, but it is not the whole truth. Bytes in the range C0-FF (whose high bit *is* set) ALSO shouldn't be considered part of the sequence because they (like 00-7F) are invalid as continuation bytes; they are either starter bytes (C2-F4) or invalid for any purpose (C0-C1 and F5-FF). Further, some bytes in the range 80-BF are NOT always valid as the first continuation byte, it depends on what starter byte they follow.

    The simple way of summarising the above is to say that a byte that is not a valid continuation byte in the current state ("failing byte") is not a part of the current (now known to be invalid) sequence, and the decoder must try again ("resync") with the failing byte.

    Do you agree with my example 3?

    @ezio-melotti (Member) commented:

    Yes, right now I'm considering valid all the bytes that start with '10...'. C2 starts with '11...' so it's a "failing byte".

    @sjmachin (mannequin, author) commented Apr 1, 2010

    @ezio.melotti: """I'm considering valid all the bytes that start with '10...'"""

    Sorry, WRONG. Read what I wrote: """Further, some bytes in the range 80-BF are NOT always valid as the first continuation byte, it depends on what starter byte they follow."""

    Consider these sequences: (1) E0 80 80 (2) E0 9F 80. Both are invalid sequences (over-long). Specifically the first continuation byte may not be in 80-9F. Those bytes start with '10...' but they are invalid after an E0 starter byte.

    Please read "Table 3-7. Well-Formed UTF-8 Byte Sequences" and surrounding text in Unicode 5.2.0 chapter 3 (bearing in mind that CPython (for good reasons) doesn't implement the surrogates restriction, so that the special case for starter byte ED is not used in CPython). Note the other 3 special cases for the first continuation byte.

    @ezio-melotti (Member) commented:

    That's why I'm writing tests that cover all the cases, including overlong sequences. If the tests fail I'll change the patch :)

    @malemburg (Member) commented:

    John Machin wrote:

    @lemburg: "failing byte" seems rather obvious: first byte that you meet that is not valid in the current state. I don't understand your explanation, especially "does not have the high bit set". I think you mean "is a valid starter byte". See example 3 below.

    I just had a quick look at the code and saw that it's testing for the high
    bit on the subsequent bytes.

    Looking closer, you're right and the situation is a bit more complex,
    but the solution still looks simple: only the endinpos
    has to be adjusted more carefully depending on what the various
    checks find.

    That said, I find the Unicode consortium solution a bit awkward.
    In UTF-8 the first byte in a multi-byte sequence defines the number
    of bytes that make up a sequence. If some of those bytes are invalid,
    the whole sequence is invalid and the fact that some of those
    bytes may be interpretable as regular code points does not necessarily
    produce a better result - the reason is that loss of bytes in a
    stream is far more unlikely than flipping a few bits in the data.

    @malemburg changed the title Apr 1, 2010
    @ezio-melotti (Member) commented:

    Here is an incomplete patch. It seems to solve the problem but I still have to add more tests and check it better.
    I also wonder if the sequences with the first byte in range F5-FD (start of 4/5/6-byte sequences, restricted by RFC 3629) should behave in the same way. Right now they just "eat" the following 4/5/6 chars without checking.
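
    For comparison, the behaviour the fix eventually settled on (CPython 3.3+): a byte in F5-FF never starts a sequence, so it is a single error on its own and decoding resumes at the very next byte; a sketch:

        assert b"\xf5ABC".decode("utf-8", "replace") == "\ufffdABC"
        assert b"\xff\xfeA".decode("utf-8", "replace") == "\ufffd\ufffdA"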

    @malemburg (Member) commented:

    Ezio Melotti wrote:

    Here is an incomplete patch. It seems to solve the problem but I still have to add more tests and check it better.

    Thanks. Please also check whether it's worthwhile unrolling those
    loops by hand.

    I also wonder if the sequences with the first byte in range F5-FD (start of 4/5/6-byte sequences, restricted by RFC 3629) should behave in the same way. Right now they just "eat" the following 4/5/6 chars without checking.

    I think we need to do this all the way, even though 5 and 6 byte
    sequences are not used at the moment.

    @sjmachin (mannequin, author) commented Apr 1, 2010

    Unicode has been frozen at 0x10FFFF. That's it. There is no such thing as a valid 5-byte or 6-byte UTF-8 string.

    @malemburg (Member) commented:

    John Machin wrote:

    Unicode has been frozen at 0x10FFFF. That's it. There is no such thing as a valid 5-byte or 6-byte UTF-8 string.

    The UTF-8 codec was written at a time when UTF-8 still included
    the possibility to have 5 or 6 bytes:

    http://www.rfc-editor.org/rfc/rfc2279.txt

    Use of those encodings has always raised an error, though. For error
    handling purposes it still has to support those possibilities.

    @sjmachin (mannequin, author) commented Apr 1, 2010

    @lemburg: RFC 2279 was obsoleted by RFC 3629 over 6 years ago. The standard now says 21 bits is it. F5-FF are declared to be invalid. I don't understand what you mean by "supporting those possibilities". The code is correctly issuing an error message. The goal of supporting the new resyncing and FFFD-emitting rules might be better met however by throwing away the code in the default clause and instead merely setting the entries for F5-FF in the utf8_code_length array to zero.
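
    An illustrative Python rendering of that idea (names and layout are only a sketch of the C utf8_code_length table in unicodeobject.c, not the actual code): marking a byte with length 0 makes the generic "invalid start byte" path handle it, with no extra code in the decode loop.

        utf8_code_length = (
            [1] * 0x80 +   # 00-7F: ASCII, one byte
            [0] * 0x40 +   # 80-BF: continuation bytes, invalid as a start byte
            [0] * 2 +      # C0-C1: always overlong, invalid
            [2] * 0x1E +   # C2-DF: two-byte sequences
            [3] * 0x10 +   # E0-EF: three-byte sequences
            [4] * 0x05 +   # F0-F4: four-byte sequences
            [0] * 0x0B     # F5-FF: invalid per RFC 3629
        )
        assert len(utf8_code_length) == 256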

    @malemburg (Member) commented:

    John Machin wrote:

    @lemburg: RFC 2279 was obsoleted by RFC 3629 over 6 years ago.

    I know.

    The standard now says 21 bits is it.

    It says that the current Unicode codespace only uses 21 bits. In the
    early days 16 bits were considered enough, so it wouldn't surprise me
    if they extended that range again at some point in the future - after
    all, leaving 11 bits unused in UCS-4 is a huge waste of space.

    If you have a reference that the Unicode consortium has decided
    to stay with that limit forever, please quote it.

    F5-FF are declared to be invalid. I don't understand what you mean by "supporting those possibilities". The code is correctly issuing an error message. The goal of supporting the new resyncing and FFFD-emitting rules might be better met however by throwing away the code in the default clause and instead merely setting the entries for F5-FF in the utf8_code_length array to zero.

    Fair enough. Let's do that.

    The reference in the table should then be updated to RFC 3629.

    @sjmachin (mannequin, author) commented Apr 1, 2010

    Patch review:

    Preamble: pardon my ignorance of how the codebase works, but trunk unicodeobject.c is r79494 (and allows encoding of surrogate codepoints), py3k unicodeobject.c is r79506 (and bans the surrogate caper) and I can't find the r79542 that the patch mentions ... help, please!

    length 2 case:

    1. the loop can be hand-unrolled into oblivion. It can be entered only when s[1] & 0xC0 != 0x80 (previous if test).
    2. the over-long check (if (ch < 0x80)) hasn't been touched. It could be removed and the entries for C0 and C1 in the utf8_code_length array set to 0.

    length 3 case:

    1. the tests involving s[0] being 0xE0 or 0xED are misplaced.
    2. the test s[0] == 0xE0 && s[1] < 0xA0 if not misplaced would be shadowing the over-long test (ch < 0x800). It seems better to use the over-long test (with endinpos set to 1).
    3. The test s[0] == 0xED relates to the surrogates caper which in the py3k version is handled in the same place as the over-long test.
    4. unrolling loop: needs no loop, only 1 test ... if s[1] is good, then we know s[2] must be bad without testing it, because we start the for loop only when s[1] is bad || s[2] is bad.

    length 4 case: as for the len 3 case generally ... misplaced tests, F1 test shadows over-long test, F4 test shadows max value test, too many loop iterations.
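
    For reference, the first-continuation-byte constraints from Table 3-7 that the review refers to, summarized in Python form (a summary, not code from the patch):

        # Starter bytes whose first continuation byte is restricted beyond 80-BF.
        FIRST_CONTINUATION_RANGE = {
            0xE0: (0xA0, 0xBF),  # excludes overlong three-byte forms
            0xED: (0x80, 0x9F),  # excludes encoded surrogates U+D800-U+DFFF
            0xF0: (0x90, 0xBF),  # excludes overlong four-byte forms
            0xF4: (0x80, 0x8F),  # excludes code points above U+10FFFF
        }
        # All other starters (C2-DF, E1-EC, EE-EF, F1-F3) allow 80-BF.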

    @ezio-melotti (Member) commented:

    Even if they are not valid they still "eat" all the 4/5/6 bytes, so they should be fixed too. I haven't seen anything about these bytes in chapter 3 so far, but there are at least two possibilities:

    1. consider all the bytes in range F5-FD as invalid without looking for the other bytes;
    2. try to read the next 4/5/6 bytes and fail if they are not continuation bytes.
      We can also look at what others do (e.g. browsers and other languages).

    @malemburg (Member) commented:

    Ezio Melotti wrote:

    Even if they are not valid they still "eat" all the 4/5/6 bytes, so they should be fixed too. I haven't seen anything about these bytes in chapter 3 so far, but there are at least two possibilities:

    1. consider all the bytes in range F5-FD as invalid without looking for the other bytes;
    2. try to read the next 4/5/6 bytes and fail if they are not continuation bytes.
      We can also look at what others do (e.g. browsers and other languages).

    By marking those entries as 0 in the length table, they would only
    use one byte, however, compared to the current state, that would
    produce more replacement code points in the output, so perhaps applying
    the same logic as for the other sequences is a better strategy.

    @sjmachin (mannequin, author) commented Apr 1, 2010

    Chapter 3, page 94: """As a consequence of the well-formedness conditions specified in Table 3-7, the following byte values are disallowed in UTF-8: C0–C1, F5–FF"""

    Of course they should be handled by the simple expedient of setting their length entry to zero. Why write code when there is an existing mechanism??

    @sjmachin (mannequin, author) commented Apr 1, 2010

    @lemburg: """perhaps applying the same logic as for the other sequences is a better strategy"""

    What other sequences??? F5-FF are invalid bytes; they don't start valid sequences. What same logic?? At the start of a character, they should get the same short sharp treatment as any other non-starter byte e.g. 80 or C0.

    @ezio-melotti (Member) commented:

    Here's a new patch. Should be complete but I want to test it some more before committing.
    I decided to follow RFC 3629, putting 0 instead of 5/6 for bytes in range F5-FD (we can always put them back in the unlikely case that the Unicode Consortium changes its mind) and also for other invalid ranges (e.g. C0-C1). This led to some simplification in the code.

    I also found out that, according to RFC 3629, surrogates are considered invalid and they can't be encoded/decoded, but the UTF-8 codec actually does it. I included tests and a fix but left them commented out because this is out of the scope of this patch, and it probably needs a discussion on python-dev.

    @malemburg (Member) commented:

    Ezio Melotti wrote:

    Here's a new patch. Should be complete but I want to test it some more before committing.
    I decided to follow RFC 3629, putting 0 instead of 5/6 for bytes in range F5-FD (we can always put them back in the unlikely case that the Unicode Consortium changes its mind) and also for other invalid ranges (e.g. C0-C1). This led to some simplification in the code.

    Ok.

    I also found out that, according to RFC 3629, surrogates are considered invalid and they can't be encoded/decoded, but the UTF-8 codec actually does it. I included tests and a fix but left them commented out because this is out of the scope of this patch, and it probably needs a discussion on python-dev.

    Right, but that idea is controversial. In Python we need to be able to
    put those surrogate code points into source code (encoded as UTF-8) as
    well as into pickle and marshal dumps of Unicode objects, so we can't
    consider them invalid UTF-8.

    @vstinner (Member) commented Apr 3, 2010

    I also found out that, according to RFC 3629, surrogates
    are considered invalid and they can't be encoded/decoded,
    but the UTF-8 codec actually does it.

    Python 2 does, but Python 3 raises an error.

    Python 2.7a4+ (trunk:79675, Apr  3 2010, 16:11:36)
    >>> u"\uDC80".encode("utf8")
    '\xed\xb2\x80'
    
    Python 3.2a0 (py3k:79441, Mar 26 2010, 13:04:55)
    >>> "\uDC80".encode("utf8")
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed

    Denying the encoding of surrogates (in utf8) causes a lot of crashes in Python 3, because most calling functions assume that _PyUnicode_AsString() never fails: see bpo-6687 (and bpo-8195 and a lot of other crashes). It's not a good idea to change it in Python 2.7, because it would require a lot of work and we are close to the first beta of 2.7.

    @ezio-melotti (Member) commented:

    This new patch (v3) should be ok.
    I added a few more tests and found another corner case:
    '\xe1a'.decode('utf-8', 'replace') was returning u'\ufffd' because \xe1 is the start byte of a 3-byte sequence and there were only two bytes in the string. This is now fixed in the latest patch.

    I also unrolled all the loops except the first one because I haven't found an elegant way to unroll it (yet).

    Finally, I changed the error messages to make them clearer:
    unexpected code byte -> invalid start byte;
    invalid data -> invalid continuation byte.
    (I can revert this if the old messages are better or if it is better to fix this with a separate commit.)

    Performance seems more or less the same; I ran some benchmarks without significant changes in the results. If you have better benchmarks let me know. I used a file of 320kB with some ASCII, ASCII mixed with some accented characters, Japanese, and a file with a sample of several different Unicode chars.
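
    The corner case described above, written as a decode call (behaviour of the fixed decoder; a sketch for illustration):

        # \xe1 starts a three-byte sequence, but 'a' is not a continuation byte,
        # so the decoder must emit one U+FFFD and resync on 'a' instead of
        # swallowing it.
        assert b"\xe1a".decode("utf-8", "replace") == "\ufffda"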

    @ezio-melotti (Member) commented:

    The patch was causing a failure in test_codeccallbacks, issue8271v4 fixes the test.
    (The failing test in test_codeccallbacks was testing that registering error handlers works, using a function that replaced "\xc0\x80" with "\x00". Since now "\xc0" is an invalid start byte regardless of what follows, the function is now receiving only "\xc0" instead of "\xc0\x80" so I had to change the test.)
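
    A sketch of the kind of error handler that test exercises (the handler name and logic here are illustrative, not the actual test code): with the new behaviour the handler is invoked for b"\xc0" alone, so it has to look ahead in exc.object itself.

        import codecs

        def handle_c080(exc):
            # Decode the "modified UTF-8" encoding of NUL (C0 80) as U+0000.
            if (isinstance(exc, UnicodeDecodeError)
                    and exc.object[exc.start:exc.start + 2] == b"\xc0\x80"):
                return ("\x00", exc.start + 2)
            raise exc

        codecs.register_error("nul-c080", handle_c080)
        assert b"a\xc0\x80b".decode("utf-8", "nul-c080") == "a\x00b"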

    @ezio-melotti (Member) commented:

    Here are some benchmarks:
    Commands:
    # half of the bytes are invalid
    ./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode' 'b_dec("utf-8", "surrogateescape")'
    ./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode' 'b_dec("utf-8", "replace")'
    ./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode' 'b_dec("utf-8", "ignore")'

    With patch:
    1000 loops, best of 3: 854 usec per loop
    1000 loops, best of 3: 509 usec per loop
    1000 loops, best of 3: 415 usec per loop

    Without patch:
    1000 loops, best of 3: 670 usec per loop
    1000 loops, best of 3: 470 usec per loop
    1000 loops, best of 3: 382 usec per loop

    Commands (from the interactive interpreter):
    # all valid codepoints
    import timeit
    b = "".join(chr(c) for c in range(0x110000) if c not in range(0xD800, 0xE000)).encode("utf-8")
    b_dec = b.decode
    timeit.Timer('b_dec("utf-8")', 'from __main__ import b_dec').timeit(100)/100
    timeit.Timer('b_dec("utf-8", "surrogateescape")', 'from __main__ import b_dec').timeit(100)/100
    timeit.Timer('b_dec("utf-8", "replace")', 'from __main__ import b_dec').timeit(100)/100
    timeit.Timer('b_dec("utf-8", "ignore")', 'from __main__ import b_dec').timeit(100)/100

    With patch:
    0.03830226898193359
    0.03849360942840576
    0.03835036039352417
    0.03821949005126953

    Without patch:
    0.03750091791152954
    0.037977190017700196
    0.04067679166793823
    0.038579678535461424

    Commands:
    # near-worst case scenario, 1 byte dropped every 5 from a valid utf-8 string
    b2 = bytes(c for k,c in enumerate(b) if k%5)
    b2_dec = b2.decode
    timeit.Timer('b2_dec("utf-8", "surrogateescape")', 'from __main__ import b2_dec').timeit(10)/10
    timeit.Timer('b2_dec("utf-8", "replace")', 'from __main__ import b2_dec').timeit(10)/10
    timeit.Timer('b2_dec("utf-8", "ignore")', 'from __main__ import b2_dec').timeit(10)/10

    With patch:
    9.645482301712036
    6.602735090255737
    5.338080596923828

    Without patch:
    8.124328684806823
    5.804249691963196
    4.851014900207519

    All tests done on wide 3.2.

    Since the changes are about errors, decoding of valid utf-8 strings is not affected. Decoding invalid strings with non-strict error handlers is slower, but I don't think the difference is significant.
    If the patch is fine I will commit it.

    @serhiy-storchaka (Member) commented:

    Looks like bpo-14738 fixes this bug for Python 3.3.

    >>> print(ascii(b"\xc2\x41\x42".decode('utf8', 'replace')))
    '\ufffdAB'
    >>> print(ascii(b"\xf1ABCD".decode('utf8', 'replace')))
    '\ufffdABCD'

    @ezio-melotti (Member) commented:

    The original bug should be fixed already in 3.3 and there should be tests (unless they got removed/skipped after we changed the unicode implementation).
    The only issue left was about the number of U+FFFD generated with invalid sequences in some cases.
    My last patch has extensive tests for this, so you could try to apply it (or copy the tests) and see if they all pass. FWIW this should be already fixed on PyPy.

    @serhiy-storchaka (Member) commented:

    The only issue left was about the number of U+FFFD generated with invalid sequences in some cases.
    My last patch has extensive tests for this, so you could try to apply it (or copy the tests) and see if they all pass.

    Tests fail, but I'm not sure that the tests are correct.

    b'\xe0\x00' raises 'unexpected end of data' and not 'invalid
    continuation byte'. This is a terminological issue.

    b'\xe0\x80'.decode('utf-8', 'replace') returns one U+FFFD and not two. I
    don't think that is right.

    @serhiy-storchaka changed the title May 17, 2012
    @ezio-melotti (Member) commented:

    Tests fail, but I'm not sure that the tests are correct.

    b'\xe0\x00' raises 'unexpected end of data' and not 'invalid
    continuation byte'. This is a terminological issue.

    This might be just because it first checks if there are two more bytes before checking if they are valid, but 'invalid continuation byte' works too.

    b'\xe0\x80'.decode('utf-8', 'replace') returns one U+FFFD and not
    two. I don't think that is right.

    Why not?

    @spatz123 (mannequin) commented May 17, 2012

    b'\xe0\x80'.decode('utf-8', 'replace') returns one U+FFFD and not two. I
    don't think that is right.

    I think that one U+FFFD is correct. The only error is a premature end of
    data.

    @ezio-melotti (Member) commented:

    Changing from 'unexpected end of data' to 'invalid continuation byte' for b'\xe0\x00' is fine with me, but this will be a (minor) deviation from 2.7, 3.1, 3.2, and pypy (it could still be changed on all these except 3.1 though).

    If you make any changes on the tests please let me know.

    @serhiy-storchaka (Member) commented:

    I think that one U+FFFD is correct. The only error is a premature end of
    data.

    I expressed myself poorly. I also think that there is only one decoding error,
    and not two. I think the test is wrong.

    @serhiy-storchaka (Member) commented:

    This might be just because it first checks if there are two more bytes before checking if they are valid, but 'invalid continuation byte' works too.

    Yes, this is an implementation detail. It is much easier and faster. Is it
    necessary to change it?

    Why not?

    Maybe I'm wrong. I looked in "The Unicode Standard, Version
    6.0" (http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf), pp. 95-97;
    the standard is not categorical about this, but it recommends that only
    a maximal subpart should be replaced by U+FFFD. \xe0\x80 is not a maximal
    subpart. Therefore, there must be two U+FFFD. In this case, neither the
    previous nor the current implementation conforms to the standard.
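
    Concretely, this is what the recommendation implies for that example (and what the committed fix produces); a sketch:

        # E0's first continuation byte must be in A0-BF, so \x80 is not part of
        # any maximal subpart starting with \xe0: the two bytes are two
        # separate errors.
        assert b"\xe0\x80".decode("utf-8", "replace") == "\ufffd\ufffd"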

    @serhiy-storchaka (Member) commented:

    Changing from 'unexpected end of data' to 'invalid continuation byte' for b'\xe0\x00' is fine with me, but this will be a (minor) deviation from 2.7, 3.1, 3.2, and pypy (it could still be changed on all these except 3.1 though).

    I probably said it poorly. Past and current implementations raise
    'unexpected end of data' and not 'invalid continuation byte'. The test
    expects 'invalid continuation byte'.

    @ezio-melotti (Member) commented:

    \xe0\x80 is not a maximal subpart. Therefore, there must be two U+FFFD.

    OK, now I get what you mean. The valid range for continuation bytes that can follow E0 is A0-BF, not 80-BF as usual, so \x80 is not a valid continuation byte here. While working on the patch I stumbled across this corner case and contacted the Unicode consortium to ask about it, as explained in msg129495.

    I don't remember all the details right now, but if that test was passing with my patch there must be something wrong somewhere (either in the patch, in the test, or in our understanding of the standard).

    @ezio-melotti (Member) commented:

    I probably said it poorly. Past and current implementations raise
    'unexpected end of data' and not 'invalid continuation byte'. The test
    expects 'invalid continuation byte'.

    I don't think it matters much either way.

    @serhiy-storchaka (Member) commented:

    I don't remember all the details right now, but if that test was passing with my patch there must be something wrong somewhere (either in the patch, in the test, or in our understanding of the standard).

    No, the test correctly expects two U+FFFD. The current implementation is
    wrong. Now that I understand what the error is, I'll try to correct the
    Python 3.3 implementation.

    @serhiy-storchaka (Member) commented:

    Here is a patch for 3.3. All of the tests pass successfully. Unfortunately, it is a little slower, but I tried to minimize the losses.

    @ezio-melotti (Member) commented:

    Do you have any benchmark results?

    @serhiy-storchaka (Member) commented:

    Here are the benchmark results (numbers are speed, MB/s).

    On 32-bit Linux, AMD Athlon 64 X2:

                                          vanilla      patched
    

    utf-8 'A'*10000 2016 (+5%) 2111
    utf-8 '\x80'*10000 383 (+9%) 416
    utf-8 '\x80'+'A'*9999 1283 (+1%) 1301
    utf-8 '\u0100'*10000 383 (-8%) 354
    utf-8 '\u0100'+'A'*9999 1258 (-6%) 1184
    utf-8 '\u0100'+'\x80'*9999 383 (-8%) 354
    utf-8 '\u8000'*10000 434 (-11%) 388
    utf-8 '\u8000'+'A'*9999 1262 (-6%) 1180
    utf-8 '\u8000'+'\x80'*9999 383 (-8%) 354
    utf-8 '\u8000'+'\u0100'*9999 383 (-8%) 354
    utf-8 '\U00010000'*10000 358 (+1%) 361
    utf-8 '\U00010000'+'A'*9999 1168 (-5%) 1104
    utf-8 '\U00010000'+'\x80'*9999 382 (-20%) 307
    utf-8 '\U00010000'+'\u0100'*9999 382 (-20%) 307
    utf-8 '\U00010000'+'\u8000'*9999 404 (-10%) 365

    On 32-bit Linux, Intel Atom N570:

                                          vanilla      patched
    

    ascii 'A'*10000 789 (+1%) 800

    latin1 'A'*10000 796 (-2%) 781
    latin1 'A'*9999+'\x80' 779 (+1%) 789
    latin1 '\x80'*10000 1739 (-3%) 1690
    latin1 '\x80'+'A'*9999 1747 (+1%) 1773

    utf-8 'A'*10000 623 (+1%) 631
    utf-8 '\x80'*10000 145 (+14%) 165
    utf-8 '\x80'+'A'*9999 354 (+1%) 358
    utf-8 '\u0100'*10000 164 (-5%) 156
    utf-8 '\u0100'+'A'*9999 343 (+2%) 350
    utf-8 '\u0100'+'\x80'*9999 164 (-4%) 157
    utf-8 '\u8000'*10000 175 (-5%) 166
    utf-8 '\u8000'+'A'*9999 349 (+2%) 356
    utf-8 '\u8000'+'\x80'*9999 164 (-4%) 157
    utf-8 '\u8000'+'\u0100'*9999 164 (-4%) 157
    utf-8 '\U00010000'*10000 152 (+7%) 163
    utf-8 '\U00010000'+'A'*9999 313 (+6%) 332
    utf-8 '\U00010000'+'\x80'*9999 161 (-13%) 140
    utf-8 '\U00010000'+'\u0100'*9999 161 (-14%) 139
    utf-8 '\U00010000'+'\u8000'*9999 160 (-1%) 159

    @serhiy-storchaka (Member) commented:

    Fortunately, bpo-14923 (if accepted) will compensate for the slowdown.

    On 32-bit Linux, AMD Athlon 64 X2:

                                          vanilla      old patch    fast patch
    

    utf-8 'A'*10000 2016 (+3%) 2111 (-2%) 2072
    utf-8 '\x80'*10000 383 (+19%) 416 (+9%) 454
    utf-8 '\x80'+'A'*9999 1283 (-7%) 1301 (-9%) 1190
    utf-8 '\u0100'*10000 383 (+46%) 354 (+58%) 560
    utf-8 '\u0100'+'A'*9999 1258 (-1%) 1184 (+5%) 1244
    utf-8 '\u0100'+'\x80'*9999 383 (+46%) 354 (+58%) 558
    utf-8 '\u8000'*10000 434 (+6%) 388 (+19%) 461
    utf-8 '\u8000'+'A'*9999 1262 (-1%) 1180 (+5%) 1244
    utf-8 '\u8000'+'\x80'*9999 383 (+46%) 354 (+58%) 559
    utf-8 '\u8000'+'\u0100'*9999 383 (+45%) 354 (+57%) 555
    utf-8 '\U00010000'*10000 358 (+5%) 361 (+4%) 375
    utf-8 '\U00010000'+'A'*9999 1168 (-1%) 1104 (+5%) 1159
    utf-8 '\U00010000'+'\x80'*9999 382 (+43%) 307 (+78%) 546
    utf-8 '\U00010000'+'\u0100'*9999 382 (+43%) 307 (+79%) 548
    utf-8 '\U00010000'+'\u8000'*9999 404 (+13%) 365 (+25%) 458

    On 32-bit Linux, Intel Atom N570:

                                          vanilla      old patch    fast patch
    

    utf-8 'A'*10000 623 (+1%) 631 (+0%) 631
    utf-8 '\x80'*10000 145 (+26%) 165 (+11%) 183
    utf-8 '\x80'+'A'*9999 354 (-0%) 358 (-1%) 353
    utf-8 '\u0100'*10000 164 (+10%) 156 (+16%) 181
    utf-8 '\u0100'+'A'*9999 343 (+1%) 350 (-1%) 348
    utf-8 '\u0100'+'\x80'*9999 164 (+10%) 157 (+15%) 181
    utf-8 '\u8000'*10000 175 (-1%) 166 (+5%) 174
    utf-8 '\u8000'+'A'*9999 349 (+0%) 356 (-2%) 349
    utf-8 '\u8000'+'\x80'*9999 164 (+10%) 157 (+15%) 180
    utf-8 '\u8000'+'\u0100'*9999 164 (+10%) 157 (+15%) 181
    utf-8 '\U00010000'*10000 152 (+7%) 163 (+0%) 163
    utf-8 '\U00010000'+'A'*9999 313 (+4%) 332 (-2%) 327
    utf-8 '\U00010000'+'\x80'*9999 161 (+11%) 140 (+28%) 179
    utf-8 '\U00010000'+'\u0100'*9999 161 (+11%) 139 (+28%) 178
    utf-8 '\U00010000'+'\u8000'*9999 160 (+9%) 159 (+9%) 174

    @pitrou (Member) commented Jun 23, 2012

    Why is this marked "fixed"? Is it fixed or not?

    @serhiy-storchaka (Member) commented:

    I deleted the fast patch, since it is unsafe. bpo-14923 should compensate for the small slowdown more safely.

    I think this change is not a bugfix (this is not a bug, since the standard allows such behavior) but a new feature, so I doubt the need to fix 2.7 and 3.2. Any chance to commit the patch today and get this feature into Python 3.3?

    @serhiy-storchaka (Member) commented:

    No, it is not fully fixed. Only one bug was fixed, but the current
    behavior still does not conform to the Unicode Standard
    *recommendations*. Not conforming to recommendations is not a bug;
    conforming is a feature.

    @serhiy-storchaka (Member) commented:

    Here is an updated, slightly faster patch. It is merged with decode_utf8_range_check.patch from bpo-14923.

    The patch contains Ezio Melotti's unmodified tests, which all pass.

    @serhiy-storchaka (Member) commented:

    Here is an updated patch with the merge conflict with 3214c9ebcf5e resolved.

    @serhiy-storchaka (Member) commented:

    What about committing? All of Ezio's tests pass, and the microbenchmark shows less than 10% difference:

    vanilla patched
    MB/s MB/s

    2076 (-3%) 2007 decode utf-8 'A'*10000
    414 (-0%) 413 decode utf-8 '\x80'*10000
    1283 (-1%) 1275 decode utf-8 '\x80'+'A'*9999
    556 (-8%) 514 decode utf-8 '\u0100'*10000
    1227 (-4%) 1172 decode utf-8 '\u0100'+'A'*9999
    556 (-8%) 514 decode utf-8 '\u0100'+'\x80'*9999
    406 (+10%) 447 decode utf-8 '\u8000'*10000
    1225 (-5%) 1167 decode utf-8 '\u8000'+'A'*9999
    554 (-7%) 513 decode utf-8 '\u8000'+'\x80'*9999
    552 (-8%) 508 decode utf-8 '\u8000'+'\u0100'*9999
    358 (-4%) 345 decode utf-8 '\U00010000'*10000
    1173 (-5%) 1118 decode utf-8 '\U00010000'+'A'*9999
    492 (+1%) 495 decode utf-8 '\U00010000'+'\x80'*9999
    492 (+1%) 496 decode utf-8 '\U00010000'+'\u0100'*9999
    383 (+5%) 401 decode utf-8 '\U00010000'+'\u8000'*9999

    @python-dev (mannequin) commented Nov 4, 2012

    New changeset 5962f192a483 by Ezio Melotti in branch '3.3':
    bpo-8271: the utf-8 decoder now outputs the correct number of U+FFFD characters when used with the "replace" error handler on invalid utf-8 sequences. Patch by Serhiy Storchaka, tests by Ezio Melotti.
    http://hg.python.org/cpython/rev/5962f192a483

    New changeset 5b205fff1972 by Ezio Melotti in branch 'default':
    bpo-8271: merge with 3.3.
    http://hg.python.org/cpython/rev/5b205fff1972

    @ezio-melotti (Member) commented:

    Fixed, thanks for updating the patch!
    I committed it on 3.3 too, and while this could have gone into 2.7/3.2 as well IMHO, it's too much work to port it there and not worth it.

    @serhiy-storchaka (Member) commented:

    Agreed. In 2.7 the UTF-8 codec is still broken in corner cases (it accepts
    surrogates) and 3.2 is coming to the end of maintenance. In any case it is
    only a recommendation, not a requirement.

    @python-dev (mannequin) commented Nov 4, 2012

    New changeset 96f4cee8ea5e by Victor Stinner in branch '3.3':
    Issue bpo-8271: Fix compilation on Windows
    http://hg.python.org/cpython/rev/96f4cee8ea5e

    New changeset 6f44f33460cd by Victor Stinner in branch 'default':
    (Merge 3.3) Issue bpo-8271: Fix compilation on Windows
    http://hg.python.org/cpython/rev/6f44f33460cd

    @ezio-melotti transferred this issue from another repository Apr 10, 2022