Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace PyUnicode_CompareWithASCIIString with _PyUnicode_EqualToASCIIString #72887

Closed
serhiy-storchaka opened this issue Nov 15, 2016 · 16 comments
Assignees
Labels
3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@serhiy-storchaka
Copy link
Member

BPO 28701
Nosy @vstinner, @ezio-melotti, @methane, @skrah, @serhiy-storchaka, @MojoVampire, @zhangyangyu
Files
  • _PyUnicode_EqualToASCIIString.patch
  • PyUnicode_CompareWithASCIIString.cocci: Coccinella patch for replacing the function
  • _PyUnicode_EqualToASCII-runtime-check.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2016-11-16.18:04:01.294>
    created_at = <Date 2016-11-15.19:50:05.474>
    labels = ['interpreter-core', 'type-bug', '3.7', 'expert-unicode']
    title = 'Replace PyUnicode_CompareWithASCIIString with _PyUnicode_EqualToASCIIString'
    updated_at = <Date 2017-01-09.12:16:31.715>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2017-01-09.12:16:31.715>
    actor = 'python-dev'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2016-11-16.18:04:01.294>
    closer = 'serhiy.storchaka'
    components = ['Interpreter Core', 'Unicode']
    creation = <Date 2016-11-15.19:50:05.474>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = ['45490', '45491', '45505']
    hgrepos = []
    issue_num = 28701
    keywords = ['patch']
    message_count = 16.0
    messages = ['280882', '280919', '280925', '280926', '280928', '280934', '280935', '280945', '280946', '280947', '280955', '280964', '280975', '280976', '284944', '285037']
    nosy_count = 8.0
    nosy_names = ['vstinner', 'ezio.melotti', 'methane', 'skrah', 'python-dev', 'serhiy.storchaka', 'josh.r', 'xiang.zhang']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue28701'
    versions = ['Python 3.5', 'Python 3.6', 'Python 3.7']

    @serhiy-storchaka
    Copy link
    Member Author

    Proposed patch replaces calls of public function PyUnicode_CompareWithASCIIString() with new private function _PyUnicode_EqualToASCIIString(). The problem with PyUnicode_CompareWithASCIIString() is that it returns -1 for the result "less than" and error, but the error case is never checked. The patch is purposed for following purposes:

    1. Readability. _PyUnicode_EqualToASCIIString(...) looks more readable than PyUnicode_CompareWithASCIIString(...) == 0 or !PyUnicode_CompareWithASCIIString(...), especially in large expression. I always have to make an effort to understand correctly the meaning of the latter expression.

    2. Efficiency. If the strings are not equal, _PyUnicode_EqualToASCIIString() can quickly return false, but PyUnicode_CompareWithASCIIString() needs to check whether what string is larger.

    3. Correctness. Since no caller checks the error of PyUnicode_CompareWithASCIIString(), it is incorrectly interpreted as "less then". Exception set by PyUnicode_CompareWithASCIIString() can be leaked in following code causing mystical error or crash in debug build. There are too many callers to add error checking for them all. These would be non-trivial error-prone changes that add new lines of the code, new variables and new returns or gotos. On other hand replacing PyUnicode_CompareWithASCIIString() with _PyUnicode_EqualToASCIIString() is done by simple script.

    _PyUnicode_EqualToASCIIString() returns true value (1) if strings are equal, false value (0) if they are different, and doesn't raise exceptions. Unlike to PyUnicode_CompareWithASCIIString() it works only with ASCII characters and returns false if any string contains non-ASCII characters.

    The patch also documents the return value of PyUnicode_CompareWithASCIIString() in case of error.

    See bpo-21449 for similar issue with _PyUnicode_CompareWithId().

    @serhiy-storchaka serhiy-storchaka added 3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode type-bug An unexpected behavior, bug, or error labels Nov 15, 2016
    @zhangyangyu
    Copy link
    Member

    LGTM.

    @methane
    Copy link
    Member

    methane commented Nov 16, 2016

    Patch LGTM.
    But I don't know it's OK to commit it on 2.7, 3.5 and 3.6.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Nov 16, 2016

    New changeset 386c682dcd75 by Serhiy Storchaka in branch '3.5':
    Issue bpo-28701: Replace PyUnicode_CompareWithASCIIString with _PyUnicode_EqualToASCIIString.
    https://hg.python.org/cpython/rev/386c682dcd75

    New changeset 72d07d13869a by Serhiy Storchaka in branch '3.6':
    Issue bpo-28701: Replace PyUnicode_CompareWithASCIIString with _PyUnicode_EqualToASCIIString.
    https://hg.python.org/cpython/rev/72d07d13869a

    New changeset 6f0f77333da5 by Serhiy Storchaka in branch 'default':
    Issue bpo-28701: Replace PyUnicode_CompareWithASCIIString with _PyUnicode_EqualToASCIIString.
    https://hg.python.org/cpython/rev/6f0f77333da5

    @serhiy-storchaka
    Copy link
    Member Author

    Thanks Xiang and Inada for your reviews.

    The patch fixes a bug: error is not checked after PyUnicode_CompareWithASCIIString(). The patch is not applicable to 2.7 since PyUnicode_CompareWithASCIIString() is Python 3 only.

    @vstinner
    Copy link
    Member

    (I reopen the issue to ask my question :-))

    +/* Test whether a unicode is equal to ASCII string. Return 1 if true,
    + 0 otherwise. Return 0 if any argument contains non-ASCII characters.
    + Any error occurs inside will be cleared before return. */

    Can you please also document the behaviour if you pass two non-ASCII strings which are equal? I understand that it returns also 0, right?

    Maybe the API should be more strict and require right to be ASCII: "right string must be encoded to ASCII". I expect an assertion error or a fatal error if right is non-ASCII when Python is compiled in debug mode.

    @vstinner vstinner reopened this Nov 16, 2016
    @serhiy-storchaka
    Copy link
    Member Author

    Can you please also document the behaviour if you pass two non-ASCII strings which are equal?

    What mean "equal"? The left argument is a Unicode string, but the right argument is a byte string. For comparing them we should decode right argument or encode left argument. The result depends on using encoding. _PyUnicode_EqualToASCIIString() uses ASCII (as shown from its name). Non-ASCII strings can't be equal. This is documented.

    If the documentation is not clear, could you provide better wording?

    Maybe the API should be more strict and require right to be ASCII: "right string must be encoded to ASCII". I expect an assertion error or a fatal error if right is non-ASCII when Python is compiled in debug mode.

    I hesitated about adding an assertion error or a fatal error in a bug fix. But this can be added in develop version.

    I don't know what is better -- return 0 in all builds or return 0 in release build and crash in debug build?

    @vstinner
    Copy link
    Member

    I suggest "return 0 in release build and crash in debug build".

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Nov 16, 2016

    New changeset faf04a995031 by Serhiy Storchaka in branch '3.5':
    Issue bpo-28701: Replace _PyUnicode_CompareWithId with _PyUnicode_EqualToASCIIId.
    https://hg.python.org/cpython/rev/faf04a995031

    New changeset ff3dacc98b3a by Serhiy Storchaka in branch '3.6':
    Issue bpo-28701: Replace _PyUnicode_CompareWithId with _PyUnicode_EqualToASCIIId.
    https://hg.python.org/cpython/rev/ff3dacc98b3a

    New changeset 765013f71bc4 by Serhiy Storchaka in branch 'default':
    Issue bpo-28701: Replace _PyUnicode_CompareWithId with _PyUnicode_EqualToASCIIId.
    https://hg.python.org/cpython/rev/765013f71bc4

    @serhiy-storchaka
    Copy link
    Member Author

    The correct issue for above commits is bpo-21449.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Nov 16, 2016

    New changeset b607f835f170 by Serhiy Storchaka in branch '3.5':
    Fixed an off-by-one error in _PyUnicode_EqualToASCIIString (issue bpo-28701).
    https://hg.python.org/cpython/rev/b607f835f170

    New changeset 1369e51182b7 by Serhiy Storchaka in branch '3.6':
    Fixed an off-by-one error in _PyUnicode_EqualToASCIIString (issue bpo-28701).
    https://hg.python.org/cpython/rev/1369e51182b7

    New changeset ba14f8b61bd8 by Serhiy Storchaka in branch 'default':
    Fixed an off-by-one error in _PyUnicode_EqualToASCIIString (issue bpo-28701).
    https://hg.python.org/cpython/rev/ba14f8b61bd8

    @serhiy-storchaka
    Copy link
    Member Author

    Following patch adds checks in debug mode that the right argument of _PyUnicode_EqualToASCIIString and _PyUnicode_EqualToASCIIId is ASCII-only string.

    @vstinner
    Copy link
    Member

    _PyUnicode_EqualToASCII-runtime-check.diff LGTM.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Nov 16, 2016

    New changeset 6dd22ed7140e by Serhiy Storchaka in branch '3.6':
    Issue bpo-28701: _PyUnicode_EqualToASCIIId and _PyUnicode_EqualToASCIIString now
    https://hg.python.org/cpython/rev/6dd22ed7140e

    New changeset 44874b20e612 by Serhiy Storchaka in branch 'default':
    Issue bpo-28701: _PyUnicode_EqualToASCIIId and _PyUnicode_EqualToASCIIString now
    https://hg.python.org/cpython/rev/44874b20e612

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Jan 7, 2017

    For the record: This is all that happened in decimal if a) you
    pass a legacy string and b) force _PyUnicode_Ready() to throw
    a MemoryError:

    >>> from decimal import *
    >>> import _testcapi
    >>> context = Context()
    >>> traps = _testcapi.unicode_legacy_string('traps')
    >>> getattr(context, traps)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError

    Both a) and b) are not trivial to accomplish at all and the result
    is completely benign.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Jan 9, 2017

    New changeset 35334a4d41aa by Stefan Krah in branch '3.5':
    Issue bpo-28701: Revert part of 5bdc8e1a50c8 for the following reasons:
    https://hg.python.org/cpython/rev/35334a4d41aa

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants