utf8, backslashreplace and surrogates #52339
Comments
The utf8 encoder doesn't work with the backslashreplace error handler:
>>> "\uDC80".encode("utf8", "backslashreplace")
TypeError: error handler should have returned bytes |
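For comparison, here is a minimal sketch of the behaviour one would expect once the encoder accepts str replacements from the error handler. It assumes a Python 3 interpreter that includes the fix; the variable names are illustrative only.

```python
# U+DC80 is a lone surrogate: it has no UTF-8 encoding, so the
# codec hands it to the error handler named in the second argument.
lone = "\udc80"

# backslashreplace returns a str replacement, which the encoder
# must then turn into bytes itself.
escaped = lone.encode("utf8", "backslashreplace")
print(escaped)  # b'\\udc80'

# xmlcharrefreplace also returns a str replacement (decimal code point).
xml = lone.encode("utf8", "xmlcharrefreplace")
print(xml)  # b'&#56448;'

# Without an error handler the encoder still refuses the surrogate.
try:
    lone.encode("utf8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)  # True
```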
See also issue bpo-6697. |
After the patch, the comment "/* Implementation limitations: only support error handler that return" no longer applies. I would also like to see a version of this patch where the length limitation for the replacement returned from the error handler is removed (ideally for both the str and bytes case). |
New version without the hardcoded limit. It doesn't use "goto encodeUCS4;" any more: the if statements are chained to limit indentation depth, which only costs one copy of the UCS4 code (5 lines are duplicated). The buffer is now reallocated each time a surrogate escape is longer than 4 bytes. I don't know if "nallocated += repsize - 4;" can overflow or not. If yes, how can I detect the overflow? I added: /* FIXME: check integer overflow? */ |
Sure, if they are both Py_ssize_t, just use:
if (nallocated > PY_SSIZE_T_MAX - repsize + 4) {
    /* handle overflow ... */
} |
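The arithmetic behind that check can be modelled in Python, where integers never overflow and the boundary is easy to probe. This is only a sketch: grow_allocation is a hypothetical helper, not a function from unicodeobject.c, and it assumes repsize > 4 on the growing path, as in the patch description. sys.maxsize has the same value as PY_SSIZE_T_MAX on CPython.

```python
import sys

PY_SSIZE_T_MAX = sys.maxsize  # largest value a Py_ssize_t can hold


def grow_allocation(nallocated, repsize):
    """Model `nallocated += repsize - 4` with the overflow test applied first.

    Hypothetical helper for illustration only.
    """
    if repsize <= 4:
        # The replacement fits in the 4 bytes already reserved per character.
        return nallocated
    # Rearranged so that no intermediate value exceeds PY_SSIZE_T_MAX:
    # nallocated + repsize - 4 > MAX  <=>  nallocated > MAX - repsize + 4
    if nallocated > PY_SSIZE_T_MAX - repsize + 4:
        raise OverflowError("encoded buffer size would exceed PY_SSIZE_T_MAX")
    return nallocated + repsize - 4


print(grow_allocation(100, 8))  # 104
```

The largest allocation that still succeeds for an 8-byte replacement is PY_SSIZE_T_MAX - 4, which grows to exactly PY_SSIZE_T_MAX; one byte more triggers the overflow branch.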
Oh no :-( I realized that I removed the first message of this issue (msg100687)! Copy/paste of the message: The attached patch fixes PyUnicode_EncodeUTF8() when unicode_encode_call_errorhandler() returns a unicode string (e.g. with the backslashreplace error handler). I don't know the unicodeobject.c code very well, and my patch is probably far from perfect. I suppose that the maximum length of an escaped character is 8 bytes (xmlcharrefreplace error handler for U+DFFF). When the first lone surrogate is found, the buffer is reallocated to size*8 bytes. The escaped characters have to be ASCII, otherwise a UnicodeEncodeError is raised. Note: unicode_encode_ucs1() doesn't have a hardcoded limit for the maximum length of the escaped string. Its code might be reused in PyUnicode_EncodeUTF8() to remove the hardcoded limits. |
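The 8-byte figure for the built-in handlers can be checked from Python. The sketch below assumes an interpreter that includes the fix; max_escape_length is a hypothetical helper written for this illustration. It only covers the two built-in handlers named in the thread, so it says nothing about custom handlers, which may return longer replacements.

```python
def max_escape_length(handler):
    """Return the longest replacement, in bytes, that `handler` produces
    for any lone surrogate in U+D800..U+DFFF.

    Hypothetical helper for illustration only.
    """
    return max(
        len(chr(code_point).encode("utf8", handler))
        for code_point in range(0xD800, 0xE000)
    )


# xmlcharrefreplace writes "&#NNNNN;" with a five-digit decimal code point.
print(max_escape_length("xmlcharrefreplace"))  # 8

# backslashreplace writes "\uXXXX" with four hexadecimal digits.
print(max_escape_length("backslashreplace"))  # 6
```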
Oops, I forgot to remove the reallocation in the unicode case in patch version 2. Patch version 3:
I think that PyUnicode_EncodeUTF8() is more readable after my patch: the maximum if depth is 2 instead of 3, and I removed the goto. It shouldn't change performance for characters < 0x800 (ASCII and Latin-1), and I expect similar performance for characters >= 0x800. |
Fixed: r80382 (py3k), r80383 (3.1). |