classification
Title: str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0
Type: behavior Stage: resolved
Components: Unicode Versions: Python 3.3, Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: ezio.melotti Nosy List: Ringding, belopolsky, dangra, ezio.melotti, jmehnle, lemburg, pitrou, python-dev, serhiy.storchaka, sjmachin, spatz123, vstinner
Priority: normal Keywords: needs review, patch

Created on 2010-03-31 02:28 by sjmachin, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
issue8271.diff ezio.melotti, 2010-04-01 08:33 Incomplete patch against trunk.
issue8271v2.diff ezio.melotti, 2010-04-02 22:27 New patch against trunk
issue8271v3.diff ezio.melotti, 2010-04-04 05:49 Final patch
issue8271v4.diff ezio.melotti, 2010-04-07 04:08 More final patch
issue8271v5.diff ezio.melotti, 2010-06-04 16:22 Even more final patch
issue8271v6.diff ezio.melotti, 2011-04-19 12:30 Patch to fix the number of FFFD
issue8271-3.3-fast-3.patch serhiy.storchaka, 2012-06-23 21:21 Ezio's patch updated to current sources
Messages (66)
msg101972 - (view) Author: John Machin (sjmachin) Date: 2010-03-31 02:28
Unicode 5.2.0 chapter 3 (Conformance) has a new section (headed "Constraints on Conversion Processes") after requirement D93. Recent Pythons (e.g. 3.1.2) don't comply. Using the Unicode example:

 >>> print(ascii(b"\xc2\x41\x42".decode('utf8', 'replace')))
 '\ufffdB'
 # should produce u'\ufffdAB'

Resynchronisation currently starts at a position derived by considering the length implied by the start byte:

 >>> print(ascii(b"\xf1ABCD".decode('utf8', 'replace')))
 '\ufffdD'
 # should produce u'\ufffdABCD'; resync should start from the *failing* byte.

Notes: This applies to the 'ignore' option as well as the 'replace' option. The Unicode discussion mentions "security exploits".
msg102013 - (view) Author: Daniel Graña (dangra) Date: 2010-03-31 14:59
Some background for this report at http://stackoverflow.com/questions/2547262/why-is-python-decode-replacing-more-than-the-invalid-bytes-from-an-encoded-string/2548480
msg102024 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-03-31 18:07
I guess the term "failing byte" is somewhat underdefined.

Page 95 of the standard PDF (http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf) suggests: "Replace each maximal subpart of an ill-formed subsequence by a single U+FFFD".

Fortunately, they explain what they are after: if a subsequent byte in the sequence does not have the high bit set, it's not to be considered part of the UTF-8 sequence of the code point.

Implementing that should be fairly straightforward by adjusting the endinpos variable accordingly.

Any takers ?
msg102061 - (view) Author: John Machin (sjmachin) Date: 2010-04-01 03:19
@lemburg: "failing byte" seems rather obvious: first byte that you meet that is not valid in the current state. I don't understand your explanation, especially "does not have the high bit set". I think you mean "is a valid starter byte". See example 3 below.

Example 1: F1 80 41 42 43. F1 implies a 4-byte character. 80 is OK. 41 is not in 80-BF. It is the "failing byte"; high bit not set. Required action is to emit FFFD then resync on the 41, causing 0041 0042 0043 to be emitted. Total output: FFFD 0041 0042 0043. Current code emits FFFD 0043.

Example 2: F1 80 FF 42 43. F1 implies a 4-byte character. 80 is OK. FF is not in 80-BF. It is the "failing byte". Required action is to emit FFFD then resync on the FF. FF is not a valid starter byte, so emit FFFD, and resync on the 42, causing 0042 0043 to be emitted. Total output: FFFD FFFD 0042 0043. Current code emits FFFD 0043.

Example 3: F1 80 C2 81 43. F1 implies a 4-byte character. 80 is OK. C2 is not in 80-BF. It is the "failing byte". Required action is to emit FFFD then resync on the C2. C2 and 81 have the high bit set, but C2 is a valid starter byte, and remaining bytes are OK, causing 0081 0043 to be emitted. Total output: FFFD 0081 0043. Current code emits FFFD 0043.
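[Editor's note: all three expected outputs match what CPython produces once the fix developed in this thread is applied (3.3 and later); a quick check in the style of the original report, assuming the 'replace' handler:]

 >>> print(ascii(b'\xf1\x80\x41\x42\x43'.decode('utf8', 'replace')))
 '\ufffdABC'
 >>> print(ascii(b'\xf1\x80\xff\x42\x43'.decode('utf8', 'replace')))
 '\ufffd\ufffdBC'
 >>> print(ascii(b'\xf1\x80\xc2\x81\x43'.decode('utf8', 'replace')))
 '\ufffd\x81C'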
msg102062 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-04-01 03:28
Having the 'high bit set' means that the first bit is set to 1.
All the continuation bytes (i.e. the 2nd, 3rd or 4th byte in a sequence) have the first two bits set to 1 and 0 respectively, so if the first bit is not set to 1 then the byte shouldn't be considered part of the sequence.
I'm trying to work on a patch.
msg102063 - (view) Author: John Machin (sjmachin) Date: 2010-04-01 06:08
@ezio.melotti: Your second sentence is true, but it is not the whole truth. Bytes in the range C0-FF (whose high bit *is* set) ALSO shouldn't be considered part of the sequence because they (like 00-7F) are invalid as continuation bytes; they are either starter bytes (C2-F4) or invalid for any purpose (C0-C1 and F5-FF). Further, some bytes in the range 80-BF are NOT always valid as the first continuation byte; it depends on what starter byte they follow.

The simple way of summarising the above is to say that a byte that is not a valid continuation byte in the current state ("failing byte") is not a part of the current (now known to be invalid) sequence, and the decoder must try again ("resync") with the failing byte.

Do you agree with my example 3?
msg102064 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-04-01 06:14
Yes, right now I'm considering valid all the bytes that start with '10...'. C2 starts with '11...' so it's a "failing byte".
msg102065 - (view) Author: John Machin (sjmachin) Date: 2010-04-01 07:29
@ezio.melotti: """I'm considering valid all the bytes that start with '10...'"""

Sorry, WRONG. Read what I wrote: """Further, some bytes in the range 80-BF are NOT always valid as the first continuation byte, it depends on what starter byte they follow."""

Consider these sequences: (1) E0 80 80 (2) E0 9F 80. Both are invalid sequences (over-long). Specifically the first continuation byte may not be in 80-9F. Those bytes start with '10...' but they are invalid after an E0 starter byte.

Please read "Table 3-7. Well-Formed UTF-8 Byte Sequences" and surrounding text in Unicode 5.2.0 chapter 3 (bearing in mind that CPython (for good reasons) doesn't implement the surrogates restriction, so that the special case for starter byte ED is not used in CPython). Note the other 3 special cases for the first continuation byte.
msg102066 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-04-01 07:33
That's why I'm writing tests that cover all the cases, including overlong sequences. If the tests fail I'll change the patch :)
msg102068 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-04-01 07:44
John Machin wrote:
> @lemburg: "failing byte" seems rather obvious: first byte that you meet that is not valid in the current state. I don't understand your explanation, especially "does not have the high bit set". I think you mean "is a valid starter byte". See example 3 below.

I just had a quick look at the code and saw that it's testing for the high
bit on the subsequent bytes.

Looking closer, you're right and the situation is a bit more complex,
but the solution still looks simple: only the endinpos
has to be adjusted more carefully depending on what the various
checks find.

That said, I find the Unicode consortium solution a bit awkward.
In UTF-8 the first byte in a multi-byte sequence defines the number
of bytes that make up a sequence. If some of those bytes are invalid,
the whole sequence is invalid, and the fact that some of those
bytes may be interpretable as regular code points does not necessarily
produce better results - the reason is that loss of bytes in a
stream is far less likely than a few flipped bits in the data.
msg102076 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-04-01 08:33
Here is an incomplete patch. It seems to solve the problem but I still have to add more tests and check it better.
I also wonder if the sequences with the first byte in range F5-FD (start of 4/5/6-byte sequences, restricted by RFC 3629) should behave in the same way. Right now they just "eat" the following 4/5/6 bytes without checking them.
msg102077 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-04-01 08:46
Ezio Melotti wrote:
> Here is an incomplete patch. It seems to solve the problem but I still have to add more tests and check it better.

Thanks. Please also check whether it's worthwhile unrolling those
loops by hand.

> I also wonder if the sequences with the first byte in range F5-FD (start of 4/5/6-byte sequences, restricted by RFC 3629) should behave in the same way. Right now they just "eat" the following 4/5/6 bytes without checking them.

I think we need to do this all the way, even though 5 and 6 byte
sequences are not used at the moment.
msg102085 - (view) Author: John Machin (sjmachin) Date: 2010-04-01 11:57
Unicode has been frozen at 0x10FFFF. That's it. There is no such thing as a valid 5-byte or 6-byte UTF-8 string.
msg102089 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-04-01 13:19
John Machin wrote:
> Unicode has been frozen at 0x10FFFF. That's it. There is no such thing as a valid 5-byte or 6-byte UTF-8 string.

The UTF-8 codec was written at a time when UTF-8 still included
the possibility to have 5 or 6 bytes:

http://www.rfc-editor.org/rfc/rfc2279.txt

Use of those encodings has always raised an error, though. For error
handling purposes it still has to support those possibilities.
msg102090 - (view) Author: John Machin (sjmachin) Date: 2010-04-01 13:47
@lemburg: RFC 2279 was obsoleted by RFC 3629 over 6 years ago. The standard now says 21 bits is it. F5-FF are declared to be invalid. I don't understand what you mean by "supporting those possibilities". The code is correctly issuing an error message. The goal of supporting the new resyncing and FFFD-emitting rules might be better met however by throwing away the code in the default clause and instead merely setting the entries for F5-FF in the utf8_code_length array to zero.
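[Editor's note: utf8_code_length is the 256-entry table in Objects/unicodeobject.c that maps each possible first byte to the length of the sequence it starts, with 0 marking bytes that cannot legally start one. A pure-Python transcription of the RFC 3629 version being proposed here (the transcription itself is the editor's illustration, not code from the patch):]

 # 0x00-0x7F: ASCII, length 1         0x80-0xBF: continuation bytes, 0
 # 0xC0-0xC1: overlong starts, 0      0xC2-0xDF: 2-byte starts
 # 0xE0-0xEF: 3-byte starts           0xF0-0xF4: 4-byte starts
 # 0xF5-0xFF: invalid per RFC 3629, 0
 utf8_code_length = ([1] * 128 + [0] * 64 + [0] * 2 +
                     [2] * 30 + [3] * 16 + [4] * 5 + [0] * 11)
 assert len(utf8_code_length) == 256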
msg102093 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-04-01 14:12
John Machin wrote:
> @lemburg: RFC 2279 was obsoleted by RFC 3629 over 6 years ago. 

I know.

> The standard now says 21 bits is it. 

It says that the current Unicode codespace only uses 21 bits. In the
early days 16 bits were considered enough, so it wouldn't surprise me
if they extend that range again at some point in the future - after
all, leaving 11 bits unused in UCS-4 is a huge waste of space.

If you have a reference that the Unicode consortium has decided
to stay with that limit forever, please quote it.

> F5-FF are declared to be invalid. I don't understand what you mean by "supporting those possibilities". The code is correctly issuing an error message. The goal of supporting the new resyncing and FFFD-emitting rules might be better met however by throwing away the code in the default clause and instead merely setting the entries for F5-FF in the utf8_code_length array to zero.

Fair enough. Let's do that.

The reference in the table should then be updated to RFC 3629.
msg102094 - (view) Author: John Machin (sjmachin) Date: 2010-04-01 14:43
Patch review:

Preamble: pardon my ignorance of how the codebase works, but trunk unicodeobject.c is r79494 (and allows encoding of surrogate codepoints), py3k unicodeobject.c is r79506 (and bans the surrogate caper) and I can't find the r79542 that the patch mentions ... help, please!

length 2 case: 
1. the loop can be hand-unrolled into oblivion. It can be entered only when s[1] & 0xC0 != 0x80 (previous if test).
2. the over-long check (if (ch < 0x80)) hasn't been touched. It could be removed and the entries for C0 and C1 in the utf8_code_length array set to 0.

length 3 case:
1. the tests involving s[0] being 0xE0 or 0xED are misplaced.
2. the test s[0] == 0xE0 && s[1] < 0xA0 if not misplaced would be shadowing the over-long test (ch < 0x800). It seems better to use the over-long test (with endinpos set to 1).
3. The test s[0] == 0xED relates to the surrogates caper which in the py3k version is handled in the same place as the over-long test.
4. unrolling loop: needs no loop, only 1 test ... if s[1] is good, then we know s[2] must be bad without testing it, because we start the for loop only when s[1] is bad || s[2] is bad.

length 4 case: as for the len 3 case generally ... misplaced tests, F1 test shadows over-long test, F4 test shadows max value test, too many loop iterations.
msg102095 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-04-01 14:50
Even if they are not valid they still "eat" all the 4/5/6 bytes, so they should be fixed too. I haven't seen anything about these bytes in chapter 3 so far, but there are at least two possibilities:
1) consider all the bytes in range F5-FD as invalid without looking for the other bytes;
2) try to read the next 4/5/6 bytes and fail if they are not continuation bytes.
We can also look at what others do (e.g. browsers and other languages).
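[Editor's note: option 1 is what the committed patch eventually chose (see msg102209 below): F5-FD were marked as length 0 in the table, so an invalid start byte consumes nothing beyond itself. On a fixed interpreter:]

 >>> print(ascii(b'\xf5\x80\x80\x80A'.decode('utf8', 'replace')))
 '\ufffd\ufffd\ufffd\ufffdA'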
msg102098 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-04-01 15:01
Ezio Melotti wrote:
> Even if they are not valid they still "eat" all the 4/5/6 bytes, so they should be fixed too. I haven't seen anything about these bytes in chapter 3 so far, but there are at least two possibilities:
> 1) consider all the bytes in range F5-FD as invalid without looking for the other bytes;
> 2) try to read the next 4/5/6 bytes and fail if they are not continuation bytes.
> We can also look at what others do (e.g. browsers and other languages).

By marking those entries as 0 in the length table, they would only
consume one byte. However, compared to the current state, that would
produce more replacement code points in the output, so perhaps applying
the same logic as for the other sequences is a better strategy.
msg102099 - (view) Author: John Machin (sjmachin) Date: 2010-04-01 15:07
Chapter 3, page 94: """As a consequence of the well-formedness conditions specified in Table 3-7, the following byte values are disallowed in UTF-8: C0–C1, F5–FF"""

Of course they should be handled by the simple expedient of setting their length entry to zero. Why write code when there is an existing mechanism??
msg102101 - (view) Author: John Machin (sjmachin) Date: 2010-04-01 15:23
@lemburg: """perhaps applying the same logic as for the other sequences is a better strategy"""

What other sequences??? F5-FF are invalid bytes; they don't start valid sequences. What same logic?? At the start of a character, they should get the same short sharp treatment as any other non-starter byte e.g. 80 or C0.
msg102209 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-04-02 22:27
Here's a new patch. It should be complete but I want to test it some more before committing.
I decided to follow RFC 3629, putting 0 instead of 5/6 for bytes in range F5-FD (we can always put them back in the unlikely case that the Unicode Consortium changes its mind) and also for other invalid ranges (e.g. C0-C1). This led to some simplification in the code.

I also found out that, according to RFC 3629, surrogates are considered invalid and they can't be encoded/decoded, but the UTF-8 codec actually does it. I included tests and a fix but left them commented out because this is out of the scope of this patch, and it probably needs a discussion on python-dev.
msg102239 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-04-03 11:41
Ezio Melotti wrote:
> Here's a new patch. It should be complete but I want to test it some more before committing.
> I decided to follow RFC 3629, putting 0 instead of 5/6 for bytes in range F5-FD (we can always put them back in the unlikely case that the Unicode Consortium changes its mind) and also for other invalid ranges (e.g. C0-C1). This led to some simplification in the code.

Ok.

> I also found out that, according to RFC 3629, surrogates are considered invalid and they can't be encoded/decoded, but the UTF-8 codec actually does it. I included tests and a fix but left them commented out because this is out of the scope of this patch, and it probably needs a discussion on python-dev.

Right, but that idea is controversial. In Python we need to be able to
put those surrogate code points into source code (encoded as UTF-8) as
well as into pickle and marshal dumps of Unicode objects, so we can't
consider them invalid UTF-8.
msg102265 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-04-03 14:43
> I also found out that, according to RFC 3629, surrogates 
> are considered invalid and they can't be encoded/decoded, 
> but the UTF-8 codec actually does it.

Python2 does, but Python3 raises an error.

Python 2.7a4+ (trunk:79675, Apr  3 2010, 16:11:36)
>>> u"\uDC80".encode("utf8")
'\xed\xb2\x80'

Python 3.2a0 (py3k:79441, Mar 26 2010, 13:04:55)
>>> "\uDC80".encode("utf8")
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed

Denying the encoding of surrogates (in utf8) causes a lot of crashes in Python3, because most calling functions assume that _PyUnicode_AsString() never fails: see #6687 (and #8195 and a lot of other crashes). It's not a good idea to change it in Python 2.7, because it would require a huge amount of work and we are close to the first beta of 2.7.
msg102320 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-04-04 05:49
This new patch (v3) should be ok. 
I added a few more tests and found another corner case:
'\xe1a'.decode('utf-8', 'replace') was returning u'\ufffd' because \xe1 is the start byte of a 3-byte sequence and there were only two bytes in the string. This is now fixed in the latest patch.

I also unrolled all the loops except the first one because I haven't found an elegant way to unroll it (yet).

Finally, I changed the error messages to make them clearer:
unexpected code byte -> invalid start byte;
invalid data -> invalid continuation byte.
(I can revert this if the old messages are better or if it is better to fix this with a separate commit.)

Performance seems more or less the same; I ran some benchmarks without significant changes in the results. If you have better benchmarks let me know. I used 320kB files: some ASCII, ASCII mixed with some accented characters, Japanese, and a file with a sample of several different Unicode chars.
msg102516 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-04-07 04:08
The patch was causing a failure in test_codeccallbacks, issue8271v4 fixes the test.
(The failing test in test_codeccallbacks was testing that registering error handlers works, using a function that replaced "\xc0\x80" with "\x00". Since now "\xc0" is an invalid start byte regardless of what follows, the function is now receiving only "\xc0" instead of "\xc0\x80" so I had to change the test.)
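[Editor's note: a minimal sketch of the kind of custom error handler that test exercises, using the real codecs.register_error API; the handler name 'null-replace' and its behavior are the editor's illustration, not the actual test code:]

 import codecs

 def null_replace(exc):
     # The decoder reports the invalid span as exc.object[exc.start:exc.end];
     # return a replacement string and the position to resume decoding from.
     if not isinstance(exc, UnicodeDecodeError):
         raise exc
     return ('\x00', exc.end)

 codecs.register_error('null-replace', null_replace)
 # After the fix, b'\xc0' is reported alone as an invalid start byte and
 # b'\x80' alone right after it, so the handler fires twice instead of once:
 assert b'\xc0\x80abc'.decode('utf-8', 'null-replace') == '\x00\x00abc'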
msg102522 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-04-07 08:37
STINNER Victor wrote:
>> I also found out that, according to RFC 3629, surrogates
>> are considered invalid and they can't be encoded/decoded,
>> but the UTF-8 codec actually does it.
>
> Python2 does, but Python3 raises an error. [...]
> Denying the encoding of surrogates (in utf8) causes a lot of crashes in Python3 [...]

I wonder how that change got into the 3.x branch - I would certainly
not have approved it for the reasons given further up on this ticket.

I think we should revert that change for Python 3.2.
msg102523 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-04-07 09:02
> >> I also found out that, according to RFC 3629, surrogates
> >> are considered invalid and they can't be encoded/decoded,
> >> but the UTF-8 codec actually does it.
> >
> > Python2 does, but Python3 raises an error.
> > (...)
> 
> I wonder how that change got into the 3.x branch - I would certainly
> not have approved it for the reasons given further up on this ticket.
> 
> I think we should revert that change for Python 3.2.

See r72208 and issue #3672.

pitrou wrote "We could fix it for 3.1, and perhaps leave 2.7 unchanged if some 
people rely on this (for whatever reason)."
msg107074 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-06-04 16:22
I added a test for the 'ignore' error handler. I will commit the patch before the RC unless someone has something against it.

To summarize, the patch updates PyUnicode_DecodeUTF8 from RFC 2279 to RFC 3629, so:
1) Invalid sequences are now handled as described in http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf (pages 94-95);
2) 5- and 6-byte-long sequences are now invalid (no change in behavior; I just removed the "default:" of the switch/case and marked them with '0' in the first table);
3) According to RFC 3629, codepoints in the surrogate range (U+D800-U+DFFF) should be considered invalid, but this would not be backward compatible, so I added code and tests but left them commented out;
4) I changed the error message "unexpected code byte" to "invalid start byte" and "invalid data" to "invalid continuation byte";
5) I added an extensive set of tests in test_unicode;
6) I fixed test_codeccallbacks because it was failing after this change.
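[Editor's note: the reworded error messages from point 4 are visible in strict mode on a fixed interpreter:]

 >>> b'\xff'.decode('utf-8')
 Traceback (most recent call last):
   ...
 UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
 >>> b'\xc2\x41'.decode('utf-8')
 Traceback (most recent call last):
   ...
 UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte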
msg107163 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-06-05 20:35
Fixed on trunk in r81758 and r81759.
I'm leaving the issue open until I port it on the other versions.
msg109015 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-06-30 20:21
The issue about invalid surrogates in UTF-8 has been raised in #9133.
msg109070 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-07-01 19:19
Ported to py3k in r82413.
Some tests with non-BMP characters should probably be added.
The patch should still be ported to 2.6 and 3.1.
msg109155 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-07-03 00:49
I've found a subtle corner case about 3- and 4-byte-long sequences.
For example, according to http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf (pages 94-95, table 3.7) the sequences in range \xe0\x80\x80-\xe0\x9f\xbf are invalid.
I.e. if the first byte is \xe0 and the second byte is between \x80 (included) and \xA0 (excluded), then the second byte is invalid (this is because sequences < \xe0\xa0\x80 would result in codepoints < U+0800, and those codepoints are already represented by two-byte-long sequences (\xdf\xbf decodes to U+07FF)).

Assume that we want to decode the string b'\xe0\x61\x80\x61' (where \xe0 is the start byte of a 3-byte-long sequence, \x61 is the letter 'a' and \x80 a valid continuation byte).
This actually results in:
>>> b'\xe0\x61\x80\x61'.decode('utf-8', 'replace')
'�a�a'
since \x61 is not a valid continuation byte in the sequence:
 * \xe0 is converted to �;
 * \x61 is displayed correctly as 'a';
 * \x80 is valid only as a continuation byte and invalid alone, so it's replaced by �;
 * \x61 is displayed correctly as 'a';

Now, assume that we want to do the same with b'\xe0\x80\x81\x61':
This actually results in:
>>> b'\xe0\x80\x81\x61'.decode('utf-8', 'replace')
'��a'
in this case \x80 would be a valid continuation byte, but since it's preceded by \xe0 it's not valid.
Since it's not valid, the result might be similar to the previous case, i.e.:
 * \xe0 is converted to �;
 * \x80 is valid as a continuation byte but not in this specific case, so it's replaced by �;
 * \x81 is valid only as a continuation byte and invalid alone, so it's replaced by �;
 * \x61 is displayed correctly as 'a';
However for this case (and the other similar cases), the invalid bytes wouldn't be otherwise valid because they are still in range \x80-\xbf (continuation bytes), so the current behavior might be fine.

This happens because the current algorithm just checks that the second byte (\x80) is in range \x80-\xbf (i.e. it's a continuation byte) and if it is it assumes that the invalid byte is the third (\x81) and replaces the first two bytes (\xe0\x80) with a single �.

That said, the algorithm could be improved to identify the wrong byte with better accuracy (and that could also be used to give a better error message about decoding surrogates). This shouldn't affect the speed of regular decoding, because the extra check will happen only in case of error.
Also note that the Unicode standard doesn't seem to mention this case, and that anyway this doesn't "eat" any of the following characters as it did before the patch -- the only difference would be in the number of �.
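[Editor's note: after the follow-up fix committed at the end of this thread (changeset 5962f192a483), the second example yields three U+FFFD rather than two, per the maximal-subpart reading worked out below:]

 >>> b'\xe0\x61\x80\x61'.decode('utf-8', 'replace')
 '�a�a'
 >>> b'\xe0\x80\x81\x61'.decode('utf-8', 'replace')
 '���a'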
msg109159 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-07-03 05:42
Backported to 2.6 and 3.1 in r82470 and r82469.
I'll leave this open for a while to see if anyone has any comment on my previous message.
msg109170 - (view) Author: John Machin (sjmachin) Date: 2010-07-03 09:36
About the E0 80 81 61 problem: my interpretation is that you are correct, the 80 is not valid in the current state (start byte == E0), so no look-ahead, three FFFDs must be issued followed by 0061. I don't really care about issuing too many FFFDs so long as it doesn't munch valid sequences. However it would be very nice to get an explicit message about surrogates.
msg129495 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-02-26 03:31
After a mail I sent to the Unicode Consortium about the corner case I found, they updated the "Best Practices for Using U+FFFD"[0] and now it says:
"""
 Another example illustrates the application of the concept of maximal subpart for UTF-8 continuation bytes outside the allowable ranges defined in Table 3-7. The UTF-8 sequence <41 E0 9F 80 41> is ill-formed, because <9F> is not an allowed second byte of a UTF-8 sequence commencing with <E0>. In this case, there is an unconvertible offset at <E0> and the maximal subpart at that offset is also <E0>. The subsequence <E0 9F> cannot be a maximal subpart, because it is not an initial subsequence of any well-formed UTF-8 code unit sequence.
"""

The result of decoding that string with Python is:
>>> b'\x41\xE0\x9F\x80\x41'.decode('utf-8', 'replace')
'A��A'
i.e. the bytes <E0 9F> are wrongly considered as a maximal subpart and replaced with a single '�' (the second � is the \x80).

I'll work on a patch and see how it comes out.

[0]: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf - page 96
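[Editor's note: with the final patch applied (3.3+), the example from the updated "Best Practices" text decodes as the standard recommends, with <E0>, <9F> and <80> each replaced separately:]

 >>> b'\x41\xE0\x9F\x80\x41'.decode('utf-8', 'replace')
 'A���A'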
msg129647 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-02-27 18:58
The patch turned out to be less trivial than I initially thought.

The current algorithm checks for invalid continuation bytes in 4 places:
1) before the switch/case statement in Objects/unicodeobject.c when it checks if there are enough bytes in the string (e.g. if the current byte is a start byte of a 4-bytes sequence and there are only 2 more bytes in the string, the sequence is invalid);
2) in the "case 2" of the switch, where it's checked if the second byte is a valid continuation byte;
3) in the "case 3" of the switch, where it's checked if the second and third bytes are valid continuation bytes, including additional invalid cases for the second bytes;
3) in the "case 4" of the switch, where it's checked if the second, third, and fourth bytes are valid continuation bytes, including additional invalid cases for the second bytes;

The correct algorithm should determine the maximal valid subpart of the sequence by finding the position of the first invalid continuation byte. Continuation bytes are all in range 80..BF, except for the second byte of 3-byte sequences that start with E0 or ED and the second byte of 4-byte sequences that start with F0 or F4 (3rd and 4th continuation bytes are always in range 80..BF).
This means that the above 4 cases should be changed in this way:
1) if there aren't enough bytes left to complete the sequence, check for valid continuation bytes considering the special cases for the second bytes (E0, ED, F0, F4) instead of using the naive algorithm that checks only for continuation bytes in range 80..BF;
2) the "case 2" is fine as is, because the second byte is always in range 80..BF;
3) the "case 3" should check (pseudocode):
  if (second_byte_is_not_valid) max_subpart_len = 1
  else if (third_byte not in 80..BF) max_subpart_len = 2
  else  # the sequence is valid
the "second_byte_is_not_valid" part should consider the two special cases for E0 and ED.
4) the "case 4" should check (pseudocode):
  if (second_byte_is_not_valid) max_subpart_len = 1
  else if (third_byte not in 80..BF) max_subpart_len = 2
  else if (fourth_byte not in 80..BF) max_subpart_len = 3
  else  # the sequence is valid
here the "second_byte_is_not_valid" part should consider the two special cases for F0 and E4.

In order to avoid duplication of code I was thinking of adding 2 macros (something like IS_VALID_3_SEQ_2ND_BYTE, IS_VALID_4_SEQ_2ND_BYTE) that will be used in cases 1) and 3), and 1) and 4) respectively.
The change shouldn't affect the decoding speed, but it will increase the lines of code and complexity of the function.
Is this OK?
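[Editor's note: a minimal pure-Python sketch of the classification just described, written for illustration; the actual patch is C code in Objects/unicodeobject.c, and the function names below are made up:]

 # Length of the maximal ill-formed subpart starting at data[i], i.e. how
 # many bytes a single U+FFFD replaces; returns the full sequence length
 # if the sequence is actually well-formed.
 def valid_second_byte(start, b):
     # Special second-byte ranges from Table 3-7 (E0/ED/F0/F4).
     if start == 0xE0: return 0xA0 <= b <= 0xBF
     if start == 0xED: return 0x80 <= b <= 0x9F
     if start == 0xF0: return 0x90 <= b <= 0xBF
     if start == 0xF4: return 0x80 <= b <= 0x8F
     return 0x80 <= b <= 0xBF

 def subpart_len(data, i):
     start = data[i]
     if not 0xC2 <= start <= 0xF4:    # lone continuation or invalid byte
         return 1
     n = 2 if start < 0xE0 else (3 if start < 0xF0 else 4)
     if i + 1 >= len(data) or not valid_second_byte(start, data[i + 1]):
         return 1
     for k in range(2, n):            # 3rd/4th bytes are plain 80..BF
         if i + k >= len(data) or not 0x80 <= data[i + k] <= 0xBF:
             return k
     return n

 assert subpart_len(b'\xe0\x80', 0) == 1   # <E0> alone: 80 not in A0..BF
 assert subpart_len(b'\xf1\x80A', 0) == 2  # <F1 80>: truncated by 'A'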
msg129685 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-02-28 09:18
Ezio Melotti wrote:
> [...]
> In order to avoid duplication of code I was thinking of adding 2 macros (something like IS_VALID_3_SEQ_2ND_BYTE, IS_VALID_4_SEQ_2ND_BYTE) that will be used in cases 1) and 3), and 1) and 4) respectively.
> The change shouldn't affect the decoding speed, but it will increase the lines of code and complexity of the function.
> Is this OK?

Sure.

It would be great if you could time the difference in
performance. In tight loops like the codec ones, small changes
in the way you write things can often make a big difference.

Please also include a Misc/NEWS entry to point to the change
and possibly consequences for existing code relying on the
previous behavior.

Thanks,
-- 
Marc-Andre Lemburg
msg134046 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-04-19 12:30
Attached patch against 3.1 fixes the number of FFFD.
A test for the range in the error message should probably be added.  I haven't done any benchmark yet.  There's some code duplication, but I'm not sure it can be factored out.
msg142132 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-15 17:15
Here are some benchmarks:
Commands:
# half of the bytes are invalid
./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode' 'b_dec("utf-8", "surrogateescape")'
./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode' 'b_dec("utf-8", "replace")'
./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode' 'b_dec("utf-8", "ignore")'

With patch:
1000 loops, best of 3: 854 usec per loop
1000 loops, best of 3: 509 usec per loop
1000 loops, best of 3: 415 usec per loop

Without patch:
1000 loops, best of 3: 670 usec per loop
1000 loops, best of 3: 470 usec per loop
1000 loops, best of 3: 382 usec per loop

Commands (from the interactive interpreter):
# all valid codepoints
import timeit
b = "".join(chr(c) for c in range(0x110000) if c not in range(0xD800, 0xE000)).encode("utf-8")
b_dec = b.decode
timeit.Timer('b_dec("utf-8")', 'from __main__ import b_dec').timeit(100)/100
timeit.Timer('b_dec("utf-8", "surrogateescape")', 'from __main__ import b_dec').timeit(100)/100
timeit.Timer('b_dec("utf-8", "replace")', 'from __main__ import b_dec').timeit(100)/100
timeit.Timer('b_dec("utf-8", "ignore")', 'from __main__ import b_dec').timeit(100)/100

With patch:
0.03830226898193359
0.03849360942840576
0.03835036039352417
0.03821949005126953

Without patch:
0.03750091791152954
0.037977190017700196
0.04067679166793823
0.038579678535461424

Commands:
# near-worst case scenario, 1 byte dropped every 5 from a valid utf-8 string
b2 = bytes(c for k,c in enumerate(b) if k%5)
b2_dec = b2.decode
timeit.Timer('b2_dec("utf-8", "surrogateescape")', 'from __main__ import b2_dec').timeit(10)/10
timeit.Timer('b2_dec("utf-8", "replace")', 'from __main__ import b2_dec').timeit(10)/10
timeit.Timer('b2_dec("utf-8", "ignore")', 'from __main__ import b2_dec').timeit(10)/10

With patch:
9.645482301712036
6.602735090255737
5.338080596923828

Without patch:
8.124328684806823
5.804249691963196
4.851014900207519

All tests done on wide 3.2.

Since the changes are about errors, decoding of valid utf-8 strings is not affected.  Decoding invalid strings with non-strict error handlers is slower, but I don't think the difference is significant.
If the patch is fine I will commit it.
msg160980 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-17 16:35
Looks like issue14738 fixes this bug for Python 3.3.

>>> print(ascii(b"\xc2\x41\x42".decode('utf8', 'replace')))
'\ufffdAB'
>>> print(ascii(b"\xf1ABCD".decode('utf8', 'replace')))
'\ufffdABCD'
msg160981 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-05-17 16:55
The original bug should already be fixed in 3.3 and there should be tests (unless they got removed/skipped after we changed the unicode implementation).
The only issue left was about the number of U+FFFD generated with invalid sequences in some cases.
My last patch has extensive tests for this, so you could try to apply it (or copy the tests) and see if they all pass.  FWIW this should be already fixed on PyPy.
msg160989 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-17 17:31
> The only issue left was about the number of U+FFFD generated with invalid sequences in some cases.
> My last patch has extensive tests for this, so you could try to apply it (or copy the tests) and see if they all pass.

Tests fail, but I'm not sure that the tests are correct.

b'\xe0\x00' raises 'unexpected end of data' and not 'invalid
continuation byte'. This is a terminological issue.

b'\xe0\x80'.decode('utf-8', 'replace') returns one U+FFFD and not two. I
don't think that is right.
msg160990 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-05-17 17:36
> Tests fail, but I'm not sure that the tests are correct.

> b'\xe0\x00' raises 'unexpected end of data' and not 'invalid
> continuation byte'. This is a terminological issue.

This might be just because it first checks if there are two more bytes before checking whether they are valid, but 'invalid continuation byte' works too.

> b'\xe0\x80'.decode('utf-8', 'replace') returns one U+FFFD and not
> two. I don't think that is right.

Why not?
msg160991 - (view) Author: Saul Spatz (spatz123) Date: 2012-05-17 17:36
> b'\xe0\x80'.decode('utf-8', 'replace') returns one U+FFFD and not two. I
> don't think that is right.

I think that one U+FFFD is correct.  The only error is a premature end of
data.
msg160993 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-05-17 18:12
Changing from 'unexpected end of data' to 'invalid continuation byte' for b'\xe0\x00' is fine with me, but this will be a (minor) deviation from 2.7, 3.1, 3.2, and pypy (it could still be changed on all these except 3.1 though).

If you make any changes on the tests please let me know.
msg160998 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-17 18:33
> I think that one U+FFFD is correct.  The only error is a premature end of
> data.

I expressed myself poorly. I also think that there is only one decoding
error, and not two. I think the test is wrong.
msg161000 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-17 18:46
> This might be just because it first checks if there are two more bytes before checking whether they are valid, but 'invalid continuation byte' works too.

Yes, this is an implementation detail. It is much easier and faster. Is it
necessary to change it?

> Why not?

Maybe I'm wrong. I looked in "The Unicode Standard, Version
6.0" (http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf), pp. 95-97;
the standard is not categorical on this, but recommends that only the
maximal subpart should be replaced by U+FFFD. \xe0\x80 is not a maximal
subpart. Therefore, there must be two U+FFFD. In this case, neither the
previous nor the current implementation conforms to the standard.
msg161001 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-17 18:52
> Changing from 'unexpected end of data' to 'invalid continuation byte' for b'\xe0\x00' is fine with me, but this will be a (minor) deviation from 2.7, 3.1, 3.2, and pypy (it could still be changed on all these except 3.1 though).

I probably said it poorly. Past and current implementations raise
'unexpected end of data' and not 'invalid continuation byte'. The test
expects 'invalid continuation byte'.
msg161002 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-05-17 18:55
> \xe0\x80 is not a maximal subpart. Therefore, there must be two U+FFFD.

OK, now I get what you mean.  The valid range for continuation bytes that can follow E0 is A0-BF, not 80-BF as usual, so \x80 is not a valid continuation byte here.  While working on the patch I stumbled across this corner case and contacted the Unicode consortium to ask about it, as explained in msg129495.

I don't remember all the details right now, but if that test was passing with my patch there must be something wrong somewhere (either in the patch, in the test, or in our understanding of the standard).
msg161004 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-05-17 18:59
> I probably said it poorly. Past and current implementations raise
> 'unexpected end of data' and not 'invalid continuation byte'. The test
> expects 'invalid continuation byte'.

I don't think it matters much either way.
msg161005 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-17 19:06
> I don't remember all the details right now, but if that test was passing with my patch there must be something wrong somewhere (either in the patch, in the test, or in our understanding of the standard).

No, the test correctly expects two U+FFFD. The current implementation is
wrong. Now that I understand what the error is, I'll try to correct the
Python 3.3 implementation.
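[Editor's note: on a build with the final patch both counts come out as the tests expect; b'\xe0\xa0' is shown for contrast, since <E0 A0> is a well-formed prefix and therefore a single maximal subpart:]

 >>> b'\xe0\x80'.decode('utf-8', 'replace')
 '��'
 >>> b'\xe0\xa0'.decode('utf-8', 'replace')
 '�'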
msg161622 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-25 20:21
Here is a patch for 3.3. All of the tests pass successfully. Unfortunately, it is a little slower, but I tried to minimize the losses.
msg161627 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-05-25 22:38
Do you have any benchmark results?
msg161650 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-26 08:50
Here are the benchmark results (numbers are speed, MB/s).

On 32-bit Linux, AMD Athlon 64 X2:

                                          vanilla      patched

utf-8     'A'*10000                       2016 (+5%)   2111
utf-8     '\x80'*10000                    383 (+9%)    416
utf-8       '\x80'+'A'*9999               1283 (+1%)   1301
utf-8     '\u0100'*10000                  383 (-8%)    354
utf-8       '\u0100'+'A'*9999             1258 (-6%)   1184
utf-8       '\u0100'+'\x80'*9999          383 (-8%)    354
utf-8     '\u8000'*10000                  434 (-11%)   388
utf-8       '\u8000'+'A'*9999             1262 (-6%)   1180
utf-8       '\u8000'+'\x80'*9999          383 (-8%)    354
utf-8       '\u8000'+'\u0100'*9999        383 (-8%)    354
utf-8     '\U00010000'*10000              358 (+1%)    361
utf-8       '\U00010000'+'A'*9999         1168 (-5%)   1104
utf-8       '\U00010000'+'\x80'*9999      382 (-20%)   307
utf-8       '\U00010000'+'\u0100'*9999    382 (-20%)   307
utf-8       '\U00010000'+'\u8000'*9999    404 (-10%)   365

On 32-bit Linux, Intel Atom N570:

                                          vanilla      patched

ascii     'A'*10000                       789 (+1%)    800

latin1    'A'*10000                       796 (-2%)    781
latin1        'A'*9999+'\x80'             779 (+1%)    789
latin1    '\x80'*10000                    1739 (-3%)   1690
latin1      '\x80'+'A'*9999               1747 (+1%)   1773

utf-8     'A'*10000                       623 (+1%)    631
utf-8     '\x80'*10000                    145 (+14%)   165
utf-8       '\x80'+'A'*9999               354 (+1%)    358
utf-8     '\u0100'*10000                  164 (-5%)    156
utf-8       '\u0100'+'A'*9999             343 (+2%)    350
utf-8       '\u0100'+'\x80'*9999          164 (-4%)    157
utf-8     '\u8000'*10000                  175 (-5%)    166
utf-8       '\u8000'+'A'*9999             349 (+2%)    356
utf-8       '\u8000'+'\x80'*9999          164 (-4%)    157
utf-8       '\u8000'+'\u0100'*9999        164 (-4%)    157
utf-8     '\U00010000'*10000              152 (+7%)    163
utf-8       '\U00010000'+'A'*9999         313 (+6%)    332
utf-8       '\U00010000'+'\x80'*9999      161 (-13%)   140
utf-8       '\U00010000'+'\u0100'*9999    161 (-14%)   139
utf-8       '\U00010000'+'\u8000'*9999    160 (-1%)    159
msg161655 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-26 09:27
Fortunately, issue14923 (if accepted) will compensate for the slowdown.

On 32-bit Linux, AMD Athlon 64 X2:

                                          vanilla      old patch    fast patch

utf-8     'A'*10000                       2016 (+3%)   2111 (-2%)   2072
utf-8     '\x80'*10000                    383 (+19%)   416 (+9%)    454
utf-8       '\x80'+'A'*9999               1283 (-7%)   1301 (-9%)   1190
utf-8     '\u0100'*10000                  383 (+46%)   354 (+58%)   560
utf-8       '\u0100'+'A'*9999             1258 (-1%)   1184 (+5%)   1244
utf-8       '\u0100'+'\x80'*9999          383 (+46%)   354 (+58%)   558
utf-8     '\u8000'*10000                  434 (+6%)    388 (+19%)   461
utf-8       '\u8000'+'A'*9999             1262 (-1%)   1180 (+5%)   1244
utf-8       '\u8000'+'\x80'*9999          383 (+46%)   354 (+58%)   559
utf-8       '\u8000'+'\u0100'*9999        383 (+45%)   354 (+57%)   555
utf-8     '\U00010000'*10000              358 (+5%)    361 (+4%)    375
utf-8       '\U00010000'+'A'*9999         1168 (-1%)   1104 (+5%)   1159
utf-8       '\U00010000'+'\x80'*9999      382 (+43%)   307 (+78%)   546
utf-8       '\U00010000'+'\u0100'*9999    382 (+43%)   307 (+79%)   548
utf-8       '\U00010000'+'\u8000'*9999    404 (+13%)   365 (+25%)   458

On 32-bit Linux, Intel Atom N570:

                                          vanilla      old patch    fast patch

utf-8     'A'*10000                       623 (+1%)    631 (+0%)    631
utf-8     '\x80'*10000                    145 (+26%)   165 (+11%)   183
utf-8       '\x80'+'A'*9999               354 (-0%)    358 (-1%)    353
utf-8     '\u0100'*10000                  164 (+10%)   156 (+16%)   181
utf-8       '\u0100'+'A'*9999             343 (+1%)    350 (-1%)    348
utf-8       '\u0100'+'\x80'*9999          164 (+10%)   157 (+15%)   181
utf-8     '\u8000'*10000                  175 (-1%)    166 (+5%)    174
utf-8       '\u8000'+'A'*9999             349 (+0%)    356 (-2%)    349
utf-8       '\u8000'+'\x80'*9999          164 (+10%)   157 (+15%)   180
utf-8       '\u8000'+'\u0100'*9999        164 (+10%)   157 (+15%)   181
utf-8     '\U00010000'*10000              152 (+7%)    163 (+0%)    163
utf-8       '\U00010000'+'A'*9999         313 (+4%)    332 (-2%)    327
utf-8       '\U00010000'+'\x80'*9999      161 (+11%)   140 (+28%)   179
utf-8       '\U00010000'+'\u0100'*9999    161 (+11%)   139 (+28%)   178
utf-8       '\U00010000'+'\u8000'*9999    160 (+9%)    159 (+9%)    174
msg163587 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-06-23 11:38
Why is this marked "fixed"? Is it fixed or not?
msg163588 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-23 11:42
I deleted the fast patch, since it is unsafe. Issue14923 should compensate for the small slowdown more safely.

I think this change is not a bugfix (this is not a bug, the standard allows such behavior) but a new feature, so I doubt the need to fix 2.7 and 3.2. Any chance to commit the patch today and get this feature into Python 3.3?
msg163591 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-23 11:55
No, it is not fully fixed. Only one bug was fixed, but the current
behavior still does not conform to the Unicode Standard
*recommendations*. Not conforming to recommendations is not a bug;
conforming is a feature.
msg163674 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-23 21:03
Here is an updated, slightly faster, patch. It is merged with decode_utf8_range_check.patch from issue14923.

The patch contains Ezio Melotti's tests, unmodified, and they all pass.
msg163677 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-23 21:21
Here is an updated patch with the merge conflict with 3214c9ebcf5e resolved.
msg174828 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-04 20:18
What about committing?  All of Ezio's tests passed, and the microbenchmark shows less than 10% difference:

vanilla      patched
MB/s         MB/s

2076 (-3%)   2007   decode  utf-8  'A'*10000
414 (-0%)    413    decode  utf-8  '\x80'*10000
1283 (-1%)   1275   decode  utf-8    '\x80'+'A'*9999
556 (-8%)    514    decode  utf-8  '\u0100'*10000
1227 (-4%)   1172   decode  utf-8    '\u0100'+'A'*9999
556 (-8%)    514    decode  utf-8    '\u0100'+'\x80'*9999
406 (+10%)   447    decode  utf-8  '\u8000'*10000
1225 (-5%)   1167   decode  utf-8    '\u8000'+'A'*9999
554 (-7%)    513    decode  utf-8    '\u8000'+'\x80'*9999
552 (-8%)    508    decode  utf-8    '\u8000'+'\u0100'*9999
358 (-4%)    345    decode  utf-8  '\U00010000'*10000
1173 (-5%)   1118   decode  utf-8    '\U00010000'+'A'*9999
492 (+1%)    495    decode  utf-8    '\U00010000'+'\x80'*9999
492 (+1%)    496    decode  utf-8    '\U00010000'+'\u0100'*9999
383 (+5%)    401    decode  utf-8    '\U00010000'+'\u8000'*9999
msg174831 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-11-04 21:23
New changeset 5962f192a483 by Ezio Melotti in branch '3.3':
#8271: the utf-8 decoder now outputs the correct number of U+FFFD  characters when used with the "replace" error handler on invalid utf-8 sequences.  Patch by Serhiy Storchaka, tests by Ezio Melotti.
http://hg.python.org/cpython/rev/5962f192a483

New changeset 5b205fff1972 by Ezio Melotti in branch 'default':
#8271: merge with 3.3.
http://hg.python.org/cpython/rev/5b205fff1972
msg174832 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-11-04 21:37
Fixed, thanks for updating the patch!
I committed it on 3.3 too, and while this could have gone into 2.7/3.2 as well IMHO, it's too much work to port it there and not worth it.
msg174834 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-04 21:50
Agreed.  In 2.7 the UTF-8 codec is still broken in corner cases (it accepts
surrogates) and 3.2 is coming to the end of maintenance.  In any case it is
only a recommendation, not a demand.
msg174839 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-11-04 23:00
New changeset 96f4cee8ea5e by Victor Stinner in branch '3.3':
Issue #8271: Fix compilation on Windows
http://hg.python.org/cpython/rev/96f4cee8ea5e

New changeset 6f44f33460cd by Victor Stinner in branch 'default':
(Merge 3.3) Issue #8271: Fix compilation on Windows
http://hg.python.org/cpython/rev/6f44f33460cd
History
Date User Action Args
2022-04-11 14:56:59  admin  set  github: 52518
2014-03-31 23:38:33  jmehnle  set  nosy: + jmehnle
2012-11-04 23:00:39  python-dev  set  messages: + msg174839
2012-11-04 21:50:05  serhiy.storchaka  set  messages: + msg174834
2012-11-04 21:37:03  ezio.melotti  set  status: open -> closed; messages: + msg174832; versions: + Python 3.3
2012-11-04 21:23:36  python-dev  set  nosy: + python-dev; messages: + msg174831
2012-11-04 20:18:23  serhiy.storchaka  set  messages: + msg174828
2012-11-04 20:07:41  serhiy.storchaka  set  versions: + Python 3.4, - Python 3.1, Python 2.7, Python 3.2, Python 3.3
2012-11-04 20:06:44  serhiy.storchaka  set  files: - issue8271-3.3-fast-2.patch
2012-11-04 20:06:26  serhiy.storchaka  set  files: - issue8271-3.3.patch
2012-06-23 21:21:02  serhiy.storchaka  set  files: + issue8271-3.3-fast-3.patch; messages: + msg163677
2012-06-23 21:03:47  serhiy.storchaka  set  files: + issue8271-3.3-fast-2.patch; messages: + msg163674
2012-06-23 11:55:49  serhiy.storchaka  set  messages: + msg163591
2012-06-23 11:42:25  serhiy.storchaka  set  messages: + msg163588
2012-06-23 11:38:42  pitrou  set  messages: + msg163587
2012-06-23 11:35:52  serhiy.storchaka  set  files: - issue8271-3.3-fast.patch
2012-05-26 09:28:00  serhiy.storchaka  set  files: + issue8271-3.3-fast.patch; messages: + msg161655
2012-05-26 08:50:35  serhiy.storchaka  set  messages: + msg161650
2012-05-25 22:38:41  ezio.melotti  set  messages: + msg161627
2012-05-25 20:21:51  serhiy.storchaka  set  files: + issue8271-3.3.patch; messages: + msg161622
2012-05-17 19:06:38  serhiy.storchaka  set  messages: + msg161005
2012-05-17 18:59:28  ezio.melotti  set  messages: + msg161004
2012-05-17 18:55:05  ezio.melotti  set  messages: + msg161002
2012-05-17 18:52:51  serhiy.storchaka  set  messages: + msg161001
2012-05-17 18:46:04  serhiy.storchaka  set  messages: + msg161000
2012-05-17 18:33:46  serhiy.storchaka  set  messages: + msg160998
2012-05-17 18:12:55  ezio.melotti  set  messages: + msg160993
2012-05-17 17:36:22  spatz123  set  messages: + msg160991
2012-05-17 17:36:08  ezio.melotti  set  messages: + msg160990
2012-05-17 17:31:03  serhiy.storchaka  set  messages: + msg160989
2012-05-17 16:55:39  ezio.melotti  set  messages: + msg160981
2012-05-17 16:35:22  serhiy.storchaka  set  nosy: + serhiy.storchaka; messages: + msg160980
2011-09-21 09:44:44  Ringding  set  nosy: + Ringding
2011-08-15 17:15:30  ezio.melotti  set  messages: + msg142132
2011-07-07 10:08:07  spatz123  set  nosy: + spatz123
2011-04-19 12:30:42  ezio.melotti  set  files: + issue8271v6.diff; messages: + msg134046; versions: + Python 3.3, - Python 2.6
2011-02-28 09:18:07  lemburg  set  nosy: lemburg, sjmachin, belopolsky, pitrou, vstinner, ezio.melotti, dangra; messages: + msg129685
2011-02-27 18:58:56  ezio.melotti  set  nosy: lemburg, sjmachin, belopolsky, pitrou, vstinner, ezio.melotti, dangra; messages: + msg129647
2011-02-26 03:31:22  ezio.melotti  set  nosy: lemburg, sjmachin, belopolsky, pitrou, vstinner, ezio.melotti, dangra; messages: + msg129495
2010-12-29 23:31:38  belopolsky  set  nosy: + belopolsky
2010-07-03 09:36:43  sjmachin  set  messages: + msg109170
2010-07-03 05:42:06  ezio.melotti  set  resolution: fixed; messages: + msg109159; stage: patch review -> resolved
2010-07-03 00:49:12  ezio.melotti  set  messages: + msg109155
2010-07-01 19:19:12  ezio.melotti  set  messages: + msg109070
2010-06-30 20:21:39  ezio.melotti  set  messages: + msg109015
2010-06-05 20:35:39  ezio.melotti  set  messages: + msg107163
2010-06-04 16:22:48  ezio.melotti  set  files: + issue8271v5.diff; messages: + msg107074
2010-04-07 13:09:51  ezio.melotti  set  nosy: + pitrou
2010-04-07 09:02:03  vstinner  set  messages: + msg102523
2010-04-07 08:37:35  lemburg  set  messages: + msg102522
2010-04-07 04:08:13  ezio.melotti  set  keywords: + needs review; files: + issue8271v4.diff; messages: + msg102516
2010-04-04 05:49:17  ezio.melotti  set  files: + issue8271v3.diff; messages: + msg102320
2010-04-03 14:43:21  vstinner  set  nosy: + vstinner; messages: + msg102265
2010-04-03 11:41:34  lemburg  set  messages: + msg102239
2010-04-02 22:27:17  ezio.melotti  set  files: + issue8271v2.diff; stage: test needed -> patch review; messages: + msg102209; versions: + Python 2.6
2010-04-01 15:23:22  sjmachin  set  messages: + msg102101
2010-04-01 15:07:07  sjmachin  set  messages: + msg102099
2010-04-01 15:01:37  lemburg  set  messages: + msg102098
2010-04-01 14:50:11  ezio.melotti  set  messages: + msg102095
2010-04-01 14:43:21  sjmachin  set  messages: + msg102094
2010-04-01 14:12:59  lemburg  set  messages: + msg102093
2010-04-01 13:47:37  sjmachin  set  messages: + msg102090
2010-04-01 13:19:04  lemburg  set  messages: + msg102089
2010-04-01 11:57:00  sjmachin  set  messages: + msg102085
2010-04-01 08:46:32  lemburg  set  messages: + msg102077
2010-04-01 08:33:45  ezio.melotti  set  keywords: + patch; files: + issue8271.diff; messages: + msg102076
2010-04-01 07:44:48  lemburg  set  messages: + msg102068
2010-04-01 07:33:49  ezio.melotti  set  messages: + msg102066
2010-04-01 07:29:45  sjmachin  set  messages: + msg102065
2010-04-01 06:14:54  ezio.melotti  set  messages: + msg102064
2010-04-01 06:08:59  sjmachin  set  messages: + msg102063
2010-04-01 04:43:51  ezio.melotti  set  assignee: ezio.melotti
2010-04-01 03:28:27  ezio.melotti  set  messages: + msg102062
2010-04-01 03:19:32  sjmachin  set  messages: + msg102061
2010-03-31 18:07:43  lemburg  set  messages: + msg102024
2010-03-31 15:22:35  r.david.murray  set  nosy: + lemburg
2010-03-31 14:59:29  dangra  set  messages: + msg102013
2010-03-31 14:56:43  dangra  set  nosy: + dangra
2010-03-31 06:41:59  ezio.melotti  set  versions: + Python 3.2; nosy: + ezio.melotti; priority: normal; components: + Unicode; stage: test needed
2010-03-31 02:28:10  sjmachin  create