Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

utf8, backslashreplace and surrogates #52339

Closed
vstinner opened this issue Mar 8, 2010 · 8 comments
Closed

utf8, backslashreplace and surrogates #52339

vstinner opened this issue Mar 8, 2010 · 8 comments
Labels
topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@vstinner
Copy link
Member

vstinner commented Mar 8, 2010

BPO 8092
Nosy @malemburg, @loewis, @doerwalter, @pitrou, @vstinner
Files
  • utf8_surrogate_error-3.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2010-04-22.19:41:20.943>
    created_at = <Date 2010-03-08.23:21:05.914>
    labels = ['type-bug', 'expert-unicode']
    title = 'utf8, backslashreplace and surrogates'
    updated_at = <Date 2010-04-22.19:41:20.942>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2010-04-22.19:41:20.942>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-04-22.19:41:20.943>
    closer = 'vstinner'
    components = ['Unicode']
    creation = <Date 2010-03-08.23:21:05.914>
    creator = 'vstinner'
    dependencies = []
    files = ['17013']
    hgrepos = []
    issue_num = 8092
    keywords = ['patch']
    message_count = 8.0
    messages = ['100678', '100681', '100716', '103741', '103766', '103777', '103779', '103977']
    nosy_count = 5.0
    nosy_names = ['lemburg', 'loewis', 'doerwalter', 'pitrou', 'vstinner']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue8092'
    versions = ['Python 3.1', 'Python 3.2']

    @vstinner
    Copy link
    Member Author

    vstinner commented Mar 8, 2010

    utf8 encoder doesn't work in backslashreplace error handler:

    >>> "\uDC80".encode("utf8", "backslashreplace")
    TypeError: error handler should have returned bytes

    @vstinner
    Copy link
    Member Author

    vstinner commented Mar 8, 2010

    See also issue bpo-6697.

    @pitrou pitrou added the type-bug An unexpected behavior, bug, or error label Mar 9, 2010
    @doerwalter
    Copy link
    Contributor

    After the patch the comment:

    /* Implementation limitations: only support error handler that return
    bytes, and only support up to four replacement bytes. */

    no longer applies.

    Also I would like to see a version of this patch where the length limitation for the replacement returned from the error handler is removed (ideally for both the str and bytes case).

    @vstinner
    Copy link
    Member Author

    New version without the hardcoded limit: don't use goto encodeUCS4;, chain if to limit indentation depth: it only costs one copy of the UCS4 (5 lines are duplicated).

    The buffer is now reallocated each time a surrogate escape is longer than 4 bytes.

    I don't know if "nallocated += repsize - 4;" can overflow or not. If yes, how can I detect the overflow? I added: /* FIXME: check integer overflow? */

    @pitrou
    Copy link
    Member

    pitrou commented Apr 20, 2010

    I don't know if "nallocated += repsize - 4;" can overflow or not.
    If yes, how can I detect the overflow?

    Sure, if they are both Py_ssize_t, just use:

    if (nallocated > PY_SSIZE_T_MAX - repsize + 4) {
      /* handle overflow ... */
    }

    @vstinner
    Copy link
    Member Author

    Oh no :-( I realized that I removed the first message of this issue! msg100687. Copy/paste of the message:
    ---
    This issue is a regression introduced by r72208 to fix the issue bpo-3672.

    Attached patch fixes PyUnicode_EncodeUTF8() if unicode_encode_call_errorhandler() returns an unicode string (eg. backslackreplace error handler). I don't know unicodeobject.c code (very well), and my patch should be far from being perfect.

    I suppose that the maximum length of an escaped characters is 8 bytes (xmlcharrefreplace error error for U+DFFFF). When the first lone surrogate is found, reallocate the buffer to size*8 bytes. The escaped character have to be an ASCII character or an UnicodeEncodeError is raised.

    Note: unicode_encode_ucs1() doesn't have hardcoded for the maximum length ot escaped string. Its code might be reused in PyUnicode_EncodeUTF8() to remove the hardcoded limits.
    ---

    @vstinner
    Copy link
    Member Author

    Oops, I forgot the remove the reallocation in the unicode case in the patch version 2.

    Patch version 3:

    • micro-optimization: group both surrogates cases in the same if to avoid checking 0xD800 <= ch twice
    • check for integer overflow
    • (remove the duplication reallocation introduced by version 2)

    I think that PyUnicode_EncodeUTF8() is more readable after my patch: there maximum if depth is 2 instead of 3, and I removed the goto.

    It shouldn't change anything about performances for chacters < 0x800 (ASCII and Latin-1), and I expect similar performances for characters >= 0x800.

    @vstinner
    Copy link
    Member Author

    Fixed: r80382 (py3k), r80383 (3.1).

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    topic-unicode type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants