classification
Title: utf8, backslashreplace and surrogates
Type: behavior Stage: patch review
Components: Unicode Versions: Python 3.1, Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: doerwalter, lemburg, loewis, pitrou, vstinner
Priority: normal Keywords: patch

Created on 2010-03-08 23:21 by vstinner, last changed 2010-04-22 19:41 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
utf8_surrogate_error-3.patch vstinner, 2010-04-20 21:59
Messages (8)
msg100678 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-03-08 23:21
utf8 encoder doesn't work in backslashreplace error handler:

>>> "\uDC80".encode("utf8", "backslashreplace")
TypeError: error handler should have returned bytes
msg100681 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-03-08 23:23
See also issue #6697.
msg100716 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2010-03-09 11:55
After the patch the comment:

/* Implementation limitations: only support error handler that return
    bytes, and only support up to four replacement bytes. */

no longer applies.

Also I would like to see a version of this patch where the length limitation for the replacement returned from the error handler is removed (ideally for both the str and bytes case).
msg103741 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-04-20 19:38
New version without the hardcoded limit: don't use goto encodeUCS4;, chain if to limit indentation depth: it only costs one copy of the UCS4 (5 lines are duplicated).

The buffer is now reallocated each time a surrogate escape is longer than 4 bytes.

I don't know if "nallocated += repsize - 4;" can overflow or not. If yes, how can I detect the overflow? I added: /* FIXME: check integer overflow? */
msg103766 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-04-20 21:09
> I don't know if "nallocated += repsize - 4;" can overflow or not.
> If yes, how can I detect the overflow?

Sure, if they are both Py_ssize_t, just use:

if (nallocated > PY_SSIZE_T_MAX - repsize + 4) {
  /* handle overflow ... */
}
msg103777 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-04-20 21:50
Oh no :-( I realized that I removed the first message of this issue! msg100687. Copy/paste of the message:
---
This issue is a regression introduced by r72208 to fix the issue #3672.

Attached patch fixes PyUnicode_EncodeUTF8() if unicode_encode_call_errorhandler() returns an unicode string (eg. backslackreplace error handler). I don't know unicodeobject.c code (very well), and my patch should be far from being perfect.

I suppose that the maximum length of an escaped characters is 8 bytes (xmlcharrefreplace error error for U+DFFFF). When the first lone surrogate is found, reallocate the buffer to size*8 bytes. The escaped character have to be an ASCII character or an UnicodeEncodeError is raised.

Note: unicode_encode_ucs1() doesn't have hardcoded for the maximum length ot escaped string. Its code might be reused in PyUnicode_EncodeUTF8() to remove the hardcoded limits.
---
msg103779 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-04-20 21:59
Oops, I forgot the remove the reallocation in the unicode case in the patch version 2.

Patch version 3:
 - micro-optimization: group both surrogates cases in the same if to avoid checking 0xD800 <= ch twice
 - check for integer overflow
 - (remove the duplication reallocation introduced by version 2)

I think that PyUnicode_EncodeUTF8() is more readable after my patch: there maximum if depth is 2 instead of 3, and I removed the goto.

It shouldn't change anything about performances for chacters < 0x800 (ASCII and Latin-1), and I expect similar performances for characters >= 0x800.
msg103977 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-04-22 19:41
Fixed: r80382 (py3k), r80383 (3.1).
History
Date User Action Args
2010-04-22 19:41:20vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg103977
2010-04-22 10:26:00vstinnersetfiles: - utf8_surrogate_error-2.patch
2010-04-20 21:59:51vstinnersetfiles: + utf8_surrogate_error-3.patch

messages: + msg103779
2010-04-20 21:50:24vstinnersetmessages: + msg103777
2010-04-20 21:09:53pitrousetnosy: + pitrou
messages: + msg103766
2010-04-20 21:08:37vstinnersetfiles: - utf8_surrogate_error.patch
2010-04-20 19:38:47vstinnersetfiles: + utf8_surrogate_error-2.patch

messages: + msg103741
2010-04-20 19:36:44vstinnersetmessages: - msg100687
2010-04-20 11:16:16vstinnerlinkissue8242 dependencies
2010-03-09 11:55:53doerwaltersetnosy: + doerwalter
messages: + msg100716
2010-03-09 01:37:20pitrousetversions: + Python 3.2
nosy: + lemburg, loewis

priority: normal
type: behavior
stage: patch review
2010-03-09 01:11:58vstinnersetfiles: + utf8_surrogate_error.patch
keywords: + patch
messages: + msg100687
2010-03-08 23:23:25vstinnersetmessages: + msg100681
2010-03-08 23:21:05vstinnercreate