str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0 #52518
Unicode 5.2.0 chapter 3 (Conformance) has a new section (headed "Constraints on Conversion Processes") after requirement D93. Recent Pythons, e.g. 3.1.2, don't comply. Using the Unicode example:

>>> print(ascii(b"\xc2\x41\x42".decode('utf8', 'replace')))
'\ufffdB'
# should produce u'\ufffdAB'

Resynchronisation currently starts at a position derived from the length implied by the start byte:

>>> print(ascii(b"\xf1ABCD".decode('utf8', 'replace')))
'\ufffdD'
# should produce u'\ufffdABCD'; resync should start from the *failing* byte.

Notes: this applies to the 'ignore' option as well as the 'replace' option. The Unicode discussion mentions "security exploits". |
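The same over-consumption shows up with the 'ignore' handler (a quick sketch; the output below is what the affected versions produce):

>>> b"\xc2\x41\x42".decode('utf8', 'ignore')
'B'
# should produce 'AB': the 0x41 is swallowed along with the bad 0xc2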
Some background for this report at http://stackoverflow.com/questions/2547262/why-is-python-decode-replacing-more-than-the-invalid-bytes-from-an-encoded-string/2548480 |
I guess the term "failing byte" is somewhat underdefined. Page 95 of the standard PDF (http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf) suggests to "Replace each maximal subpart of an ill-formed subsequence by a single U+FFFD". Fortunately, they explain what they are after: if a subsequent byte in the sequence does not have the high bit set, it's not to be considered part of the UTF-8 sequence of the code point. Implementing that should be fairly straightforward by adjusting the endinpos variable accordingly. Any takers? |
@lemburg: "failing byte" seems rather obvious: first byte that you meet that is not valid in the current state. I don't understand your explanation, especially "does not have the high bit set". I think you mean "is a valid starter byte". See example 3 below. Example 1: F1 80 41 42 43. F1 implies a 4-byte character. 80 is OK. 41 is not in 80-BF. It is the "failing byte"; high bit not set. Required action is to emit FFFD then resync on the 41, causing 0041 0042 0043 to be emitted. Total output: FFFD 0041 0042 0043. Current code emits FFFD 0043. Example 2: F1 80 FF 42 43. F1 implies a 4-byte character. 80 is OK. FF is not in 80-BF. It is the "failing byte". Required action is to emit FFFD then resync on the FF. FF is not a valid starter byte, so emit FFFD, and resync on the 42, causing 0042 0043 to be emitted. Total output: FFFD FFFD 0042 0043. Current code emits FFFD 0043. Example 3: F1 80 C2 81 43. F1 implies a 4-byte character. 80 is OK. C2 is not in 80-BF. It is the "failing byte". Required action is to emit FFFD then resync on the C2. C2 and 81 have the high bit set, but C2 is a valid starter byte, and remaining bytes are OK, causing 0081 0043 to be emitted. Total output: FFFD 0081 0043. Current code emits FFFD 0043. |
Having the 'high bit set' means that the first bit is set to 1. |
@ezio.melotti: Your second sentence is true, but it is not the whole truth. Bytes in the range C0-FF (whose high bit *is* set) ALSO shouldn't be considered part of the sequence, because they (like 00-7F) are invalid as continuation bytes; they are either starter bytes (C2-F4) or invalid for any purpose (C0-C1 and F5-FF). Further, some bytes in the range 80-BF are NOT always valid as the first continuation byte; it depends on what starter byte they follow. The simple way of summarising the above is to say that a byte that is not a valid continuation byte in the current state (a "failing byte") is not part of the current (now known to be invalid) sequence, and the decoder must try again ("resync") with the failing byte. Do you agree with my example 3? |
Yes, right now I'm considering valid all the bytes that start with '10...'. C2 starts with '11...' so it's a "failing byte". |
@ezio.melotti: """I'm considering valid all the bytes that start with '10...'""" Sorry, WRONG. Read what I wrote: """Further, some bytes in the range 80-BF are NOT always valid as the first continuation byte; it depends on what starter byte they follow."""

Consider these sequences: (1) E0 80 80 (2) E0 9F 80. Both are invalid sequences (over-long). Specifically, the first continuation byte may not be in 80-9F here. Those bytes start with '10...' but they are invalid after an E0 starter byte. Please read "Table 3-7. Well-Formed UTF-8 Byte Sequences" and the surrounding text in Unicode 5.2.0 chapter 3 (bearing in mind that CPython, for good reasons, doesn't implement the surrogates restriction, so the special case for starter byte ED is not used in CPython). Note the other 3 special cases for the first continuation byte. |
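For reference, a minimal Python sketch of the Table 3-7 special cases (the dict and function names are mine; the ED row is shown even though, as noted, CPython deliberately skips the surrogate restriction):

# Valid range of the FIRST continuation byte, keyed by starter byte,
# per Table 3-7 of Unicode 5.2.0; for starters without a special case
# the range is the usual 80-BF.
SPECIAL_FIRST_CONT = {
    0xE0: (0xA0, 0xBF),  # rules out over-long 3-byte forms
    0xED: (0x80, 0x9F),  # rules out surrogates (not enforced by CPython)
    0xF0: (0x90, 0xBF),  # rules out over-long 4-byte forms
    0xF4: (0x80, 0x8F),  # rules out code points above U+10FFFF
}

def first_cont_range(starter):
    return SPECIAL_FIRST_CONT.get(starter, (0x80, 0xBF))

lo, hi = first_cont_range(0xE0)
print(lo <= 0x9F <= hi, lo <= 0xA0 <= hi)  # False True: E0 9F is over-long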
That's why I'm writing tests that cover all the cases, including overlong sequences. If the tests fail, I'll change the patch :) |
John Machin wrote:
I just had a quick look at the code and saw that it's testing for the high bit. Looking closer, you're right and the situation is a bit more complex. That said, I find the Unicode consortium's solution a bit awkward. |
Here is an incomplete patch. It seems to solve the problem, but I still have to add more tests and check it better. |
Ezio Melotti wrote:
Thanks. Please also check whether it's worthwhile unrolling those loops.
I think we need to do this all the way, even though 5- and 6-byte sequences are no longer valid UTF-8. |
Unicode has been frozen at 0x10FFFF. That's it. There is no such thing as a valid 5-byte or 6-byte UTF-8 string. |
John Machin wrote:
The UTF-8 codec was written at a time when UTF-8 still included 5- and 6-byte sequences: http://www.rfc-editor.org/rfc/rfc2279.txt
Use of those encodings has always raised an error, though. For error handling, the codec still supports those possibilities. |
@lemburg: RFC 2279 was obsoleted by RFC 3629 over 6 years ago. The standard now says 21 bits is it. F5-FF are declared to be invalid. I don't understand what you mean by "supporting those possibilities". The code is correctly issuing an error message. The goal of supporting the new resyncing and FFFD-emitting rules might be better met however by throwing away the code in the default clause and instead merely setting the entries for F5-FF in the utf8_code_length array to zero. |
John Machin wrote:
I know.
It says that the current Unicode codespace only uses 21 bits. If you have a reference showing that the Unicode consortium has decided to freeze the codespace at that, please let me know.
Fair enough. Let's do that. The reference in the table should then be updated to RFC 3629. |
Patch review:

Preamble: pardon my ignorance of how the codebase works, but trunk unicodeobject.c is r79494 (and allows encoding of surrogate codepoints), py3k unicodeobject.c is r79506 (and bans the surrogate caper), and I can't find the r79542 that the patch mentions ... help, please!

length 2 case:
length 3 case:
length 4 case: as for the len 3 case generally ... misplaced tests, F1 test shadows the over-long test, F4 test shadows the max-value test, too many loop iterations. |
Even if they are not valid, they still "eat" all of the 4/5/6 bytes, so they should be fixed too. I haven't seen anything about these bytes in chapter 3 so far, but there are at least two possibilities: |
Ezio Melotti wrote:
By marking those entries as 0 in the length table, they would only be treated as single invalid bytes; perhaps applying the same logic as for the other sequences is a better strategy. |
Chapter 3, page 94: """As a consequence of the well-formedness conditions specified in Table 3-7, the following byte values are disallowed in UTF-8: C0–C1, F5–FF""" Of course they should be handled by the simple expedient of setting their length entry to zero. Why write code when there is an existing mechanism?? |
@lemburg: """perhaps applying the same logic as for the other sequences is a better strategy""" What other sequences??? F5-FF are invalid bytes; they don't start valid sequences. What same logic?? At the start of a character, they should get the same short sharp treatment as any other non-starter byte e.g. 80 or C0. |
Here's a new patch. It should be complete, but I want to test it some more before committing. I also found out that, according to RFC 3629, surrogates are considered invalid and can't be encoded/decoded, but the UTF-8 codec actually does both. I included tests and a fix, but I left them commented out because this is out of the scope of this patch, and it probably needs a discussion on python-dev. |
Ezio Melotti wrote:
Ok.
Right, but that idea is controversial. In Python we need to be able to work with lone surrogates. |
Python 2 does, but Python 3 raises an error.

Python 2.7a4+ (trunk:79675, Apr 3 2010, 16:11:36)
>>> u"\uDC80".encode("utf8")
'\xed\xb2\x80'

Python 3.2a0 (py3k:79441, Mar 26 2010, 13:04:55)
>>> "\uDC80".encode("utf8")
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed

Refusing to encode surrogates (in UTF-8) causes a lot of crashes in Python 3, because most calling functions assume that _PyUnicode_AsString() never fails: see bpo-6687 (and bpo-8195 and a lot of other crashes). It's not a good idea to change this in Python 2.7, because it would require huge work and we are close to the first beta of 2.7. |
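For completeness, Python 3 does provide an opt-in escape hatch: the 'surrogatepass' error handler (available since 3.1) round-trips lone surrogates through UTF-8. A minimal sketch:

>>> data = "\uDC80".encode("utf-8", "surrogatepass")
>>> data
b'\xed\xb2\x80'
>>> data.decode("utf-8", "surrogatepass")
'\udc80'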
This new patch (v3) should be OK. I also unrolled all the loops except the first one, because I haven't found an elegant way to unroll it (yet). Finally, I changed the error messages to make them clearer. The performance seems more or less the same; I did some benchmarks without significant changes in the results. If you have better benchmarks, let me know. I used 320kB files with plain ASCII, ASCII mixed with some accented characters, Japanese, and a sample of several different Unicode chars. |
The patch was causing a failure in test_codeccallbacks; issue8271v4 fixes the test. |
Here are some benchmarks, run from the interactive interpreter with and without the patch. All tests were done on a wide 3.2 build. Since the changes are about error handling, decoding of valid UTF-8 strings is not affected. Decoding invalid strings with non-strict error handlers is slower, but I don't think the difference is significant. |
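For reproducibility, a microbenchmark along these lines can be run from the interactive interpreter; the file name and repeat count are mine, standing in for the 320kB samples described above:

>>> import timeit
>>> data = open('sample_utf8.txt', 'rb').read()  # hypothetical sample file
>>> for errors in ('strict', 'replace', 'ignore'):
...     print(errors, timeit.timeit(lambda: data.decode('utf-8', errors), number=100))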
Looks like bpo-14738 fixes this bug for Python 3.3:

>>> print(ascii(b"\xc2\x41\x42".decode('utf8', 'replace')))
'\ufffdAB'
>>> print(ascii(b"\xf1ABCD".decode('utf8', 'replace')))
'\ufffdABCD' |
The original bug should be fixed already in 3.3, and there should be tests (unless they got removed/skipped after we changed the unicode implementation). |
Tests fail, but I'm not sure that the tests are correct. b'\xe0\x00' raises 'unexpected end of data' and not 'invalid continuation byte'. b'\xe0\x80'.decode('utf-8', 'replace') returns one U+FFFD and not two. |
This might be just because it first checks if there are two more bytes before checking if they are valid, but 'invalid continuation byte' works too.
Why not? |
I think that one U+FFFD is correct. The only error is a premature end of data. |
Changing from 'unexpected end of data' to 'invalid continuation byte' for b'\xe0\x00' is fine with me, but this will be a (minor) deviation from 2.7, 3.1, 3.2, and PyPy (it could still be changed on all of these except 3.1, though). If you make any changes to the tests, please let me know. |
I expressed myself poorly. I also think that there is only one decoding error, not two. |
Yes, this is an implementation detail. It is much easier and faster. Whether it conforms to the standard is a separate question. |
Maybe I'm wrong. I looked in "The Unicode Standard". |
I probably said it poorly. Past and current implementations raise 'unexpected end of data' here. |
OK, now I get what you mean. The valid range for continuation bytes that can follow E0 is A0-BF, not 80-BF as usual, so \x80 is not a valid continuation byte here. While working on the patch I stumbled across this corner case and contacted the Unicode consortium to ask about it, as explained in msg129495. I don't remember all the details right now, but if that test was passing with my patch, there must be something wrong somewhere (either in the patch, in the test, or in our understanding of the standard). |
I don't think it matters much either way. |
No, test correctly expects two U+FFFD. Current implementation is wrong. |
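With the maximal-subpart rule this corner case comes out as follows (a sketch; these are the outputs expected from a conformant decoder such as Python 3.3+ after the fix):

>>> print(ascii(b"\xe0\x80".decode("utf-8", "replace")))
'\ufffd\ufffd'
>>> print(ascii(b"\xe0\xa0".decode("utf-8", "replace")))
'\ufffd'

80 can never follow E0, so E0 alone is one maximal subpart and 80 is another (two U+FFFD); A0 is a valid first continuation byte after E0, so the truncated E0 A0 forms a single maximal subpart (one U+FFFD).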
Here is a patch for 3.3. All of the tests pass successfully. Unfortunately, it is a little slow, but I tried to minimize the losses. |
Do you have any benchmark results? |
Here are the benchmark results (numbers are speed in MB/s).

On 32-bit Linux, AMD Athlon 64 X2:
  utf-8   'A'*10000   2016 (+5%)   2111

On 32-bit Linux, Intel Atom N570:
  ascii   'A'*10000   789 (+1%)    800
  latin1  'A'*10000   796 (-2%)    781
  utf-8   'A'*10000   623 (+1%)    631 |
Fortunately, bpo-14923 (if accepted) will compensate for the slowdown.

On 32-bit Linux, AMD Athlon 64 X2:
  utf-8   'A'*10000   2016 (+3%)   2111 (-2%)   2072

On 32-bit Linux, Intel Atom N570:
  utf-8   'A'*10000   623 (+1%)    631 (+0%)    631 |
Why is this marked "fixed"? Is it fixed or not? |
I deleted the fast patch, since it was unsafe. bpo-14923 should compensate for the small slowdown more safely. I think this change is not a bugfix (this is not a bug; the standard allows such behavior) but rather a new feature, so I doubt the need to fix 2.7 and 3.2. Any chance to commit the patch today and get this feature into Python 3.3? |
No, it is not fully fixed. Only one bug was fixed, but the current decoder still does not conform in the other corner cases. |
Here is an updated, slightly faster patch. It is merged with decode_utf8_range_check.patch from bpo-14923. The patch contains Ezio Melotti's tests, unmodified, and they all pass. |
Here is an updated patch with the merge conflict with 3214c9ebcf5e resolved. |
What about a commit? All of Ezio's tests passed, and a microbenchmark shows less than a 10% difference:

  vanilla       patched
  2076  (-3%)   2007     decode utf-8 'A'*10000 |
New changeset 5962f192a483 by Ezio Melotti in branch '3.3':
New changeset 5b205fff1972 by Ezio Melotti in branch 'default': |
Fixed, thanks for updating the patch! |
Agree. In 2.7 the UTF-8 codec is still broken in corner cases (it accepts surrogates, for example). |
New changeset 96f4cee8ea5e by Victor Stinner in branch '3.3':
New changeset 6f44f33460cd by Victor Stinner in branch 'default': |