utf8, backslashreplace and surrogates #52339

vstinner · 2010-03-08T23:21:06Z

BPO	8092
Nosy	@malemburg, @loewis, @doerwalter, @pitrou, @vstinner
Files	utf8_surrogate_error-3.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2010-04-22.19:41:20.943>
created_at = <Date 2010-03-08.23:21:05.914>
labels = ['type-bug', 'expert-unicode']
title = 'utf8, backslashreplace and surrogates'
updated_at = <Date 2010-04-22.19:41:20.942>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2010-04-22.19:41:20.942>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2010-04-22.19:41:20.943>
closer = 'vstinner'
components = ['Unicode']
creation = <Date 2010-03-08.23:21:05.914>
creator = 'vstinner'
dependencies = []
files = ['17013']
hgrepos = []
issue_num = 8092
keywords = ['patch']
message_count = 8.0
messages = ['100678', '100681', '100716', '103741', '103766', '103777', '103779', '103977']
nosy_count = 5.0
nosy_names = ['lemburg', 'loewis', 'doerwalter', 'pitrou', 'vstinner']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'patch review'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue8092'
versions = ['Python 3.1', 'Python 3.2']

vstinner · 2010-03-08T23:21:06Z

utf8 encoder doesn't work in backslashreplace error handler:

>>> "\uDC80".encode("utf8", "backslashreplace")
TypeError: error handler should have returned bytes

vstinner · 2010-03-08T23:23:25Z

See also issue bpo-6697.

doerwalter · 2010-03-09T11:55:53Z

After the patch the comment:

/* Implementation limitations: only support error handler that return
bytes, and only support up to four replacement bytes. */

no longer applies.

Also I would like to see a version of this patch where the length limitation for the replacement returned from the error handler is removed (ideally for both the str and bytes case).

vstinner · 2010-04-20T19:38:46Z

New version without the hardcoded limit: don't use goto encodeUCS4;, chain if to limit indentation depth: it only costs one copy of the UCS4 (5 lines are duplicated).

The buffer is now reallocated each time a surrogate escape is longer than 4 bytes.

I don't know if "nallocated += repsize - 4;" can overflow or not. If yes, how can I detect the overflow? I added: /* FIXME: check integer overflow? */

pitrou · 2010-04-20T21:09:53Z

I don't know if "nallocated += repsize - 4;" can overflow or not.
If yes, how can I detect the overflow?

Sure, if they are both Py_ssize_t, just use:

if (nallocated > PY_SSIZE_T_MAX - repsize + 4) {
  /* handle overflow ... */
}

vstinner · 2010-04-20T21:50:24Z

Oh no :-( I realized that I removed the first message of this issue! msg100687. Copy/paste of the message:
---
This issue is a regression introduced by r72208 to fix the issue bpo-3672.

Attached patch fixes PyUnicode_EncodeUTF8() if unicode_encode_call_errorhandler() returns an unicode string (eg. backslackreplace error handler). I don't know unicodeobject.c code (very well), and my patch should be far from being perfect.

I suppose that the maximum length of an escaped characters is 8 bytes (xmlcharrefreplace error error for U+DFFFF). When the first lone surrogate is found, reallocate the buffer to size*8 bytes. The escaped character have to be an ASCII character or an UnicodeEncodeError is raised.

Note: unicode_encode_ucs1() doesn't have hardcoded for the maximum length ot escaped string. Its code might be reused in PyUnicode_EncodeUTF8() to remove the hardcoded limits.
---

vstinner · 2010-04-20T21:59:51Z

Oops, I forgot the remove the reallocation in the unicode case in the patch version 2.

Patch version 3:

micro-optimization: group both surrogates cases in the same if to avoid checking 0xD800 <= ch twice
check for integer overflow
(remove the duplication reallocation introduced by version 2)

I think that PyUnicode_EncodeUTF8() is more readable after my patch: there maximum if depth is 2 instead of 3, and I removed the goto.

It shouldn't change anything about performances for chacters < 0x800 (ASCII and Latin-1), and I expect similar performances for characters >= 0x800.

vstinner · 2010-04-22T19:41:21Z

Fixed: r80382 (py3k), r80383 (3.1).

vstinner added the topic-unicode label Mar 8, 2010

pitrou added the type-bug An unexpected behavior, bug, or error label Mar 9, 2010

vstinner closed this as completed Apr 22, 2010

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

utf8, backslashreplace and surrogates #52339

utf8, backslashreplace and surrogates #52339

vstinner commented Mar 8, 2010

vstinner commented Mar 8, 2010

vstinner commented Mar 8, 2010

doerwalter commented Mar 9, 2010

vstinner commented Apr 20, 2010

pitrou commented Apr 20, 2010

vstinner commented Apr 20, 2010

vstinner commented Apr 20, 2010

vstinner commented Apr 22, 2010

utf8, backslashreplace and surrogates #52339

utf8, backslashreplace and surrogates #52339

Comments

vstinner commented Mar 8, 2010

vstinner commented Mar 8, 2010

vstinner commented Mar 8, 2010

doerwalter commented Mar 9, 2010

vstinner commented Apr 20, 2010

pitrou commented Apr 20, 2010

vstinner commented Apr 20, 2010

vstinner commented Apr 20, 2010

vstinner commented Apr 22, 2010