Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement comparison (x==y and x!=y) for _sre.SRE_Pattern #72913

Closed
vstinner opened this issue Nov 17, 2016 · 23 comments
Closed

Implement comparison (x==y and x!=y) for _sre.SRE_Pattern #72913

vstinner opened this issue Nov 17, 2016 · 23 comments
Assignees
Labels
3.7 (EOL) end of life topic-regex type-feature A feature request or enhancement

Comments

@vstinner
Copy link
Member

BPO 28727
Nosy @vstinner, @ezio-melotti, @serhiy-storchaka
PRs
  • [Do Not Merge] Convert Misc/NEWS so that it is managed by towncrier #552
  • Files
  • pattern_compare.patch
  • pattern_compare-2.patch
  • pattern_compare-3.patch
  • pattern_compare-4.patch
  • pattern_compare-5.patch
  • pattern_compare-6.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/vstinner'
    closed_at = <Date 2016-11-22.14:32:35.264>
    created_at = <Date 2016-11-17.16:02:48.047>
    labels = ['expert-regex', 'type-feature', '3.7']
    title = 'Implement comparison (x==y and x!=y) for _sre.SRE_Pattern'
    updated_at = <Date 2017-03-31.16:36:24.688>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2017-03-31.16:36:24.688>
    actor = 'dstufft'
    assignee = 'vstinner'
    closed = True
    closed_date = <Date 2016-11-22.14:32:35.264>
    closer = 'vstinner'
    components = ['Regular Expressions']
    creation = <Date 2016-11-17.16:02:48.047>
    creator = 'vstinner'
    dependencies = []
    files = ['45520', '45523', '45528', '45529', '45541', '45587']
    hgrepos = []
    issue_num = 28727
    keywords = ['patch']
    message_count = 23.0
    messages = ['281046', '281052', '281057', '281062', '281063', '281083', '281084', '281086', '281087', '281088', '281094', '281101', '281178', '281306', '281362', '281363', '281365', '281369', '281374', '281479', '281480', '281483', '281484']
    nosy_count = 6.0
    nosy_names = ['vstinner', 'ezio.melotti', 'mrabarnett', 'SilentGhost', 'python-dev', 'serhiy.storchaka']
    pr_nums = ['552']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'commit review'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue28727'
    versions = ['Python 3.7']

    @vstinner
    Copy link
    Member Author

    Attached patch implements rich comparison for _sre.SRE_Pattern objects created by re.compile().

    Comparison between patterns is used in the warnings module to not add duplicated filters, see issue bpo-18383:

    New changeset f57f4e33ba5e by Martin Panter in branch '3.5':
    Issue bpo-18383: Avoid adding duplicate filters when warnings is reloaded
    https://hg.python.org/cpython/rev/f57f4e33ba5e

    For the warnings module, it became a problem in test_warnings since the Python test runner started to clear all caches. When re.purge() is called, re.compile() creates a new object, whereas with the cache it returns the same object and so the two patterns are equal since it's the same object. => see issue bpo-28688

    @vstinner vstinner added 3.7 (EOL) end of life type-feature A feature request or enhancement labels Nov 17, 2016
    @SilentGhost
    Copy link
    Mannequin

    SilentGhost mannequin commented Nov 17, 2016

    Ten subtest in test_re fail with

    TypeError: unhashable type: '_sre.SRE_Pattern'

    @vstinner
    Copy link
    Member Author

    Ten subtest in test_re fail with: TypeError: unhashable type: '_sre.SRE_Pattern'

    Oops, right. Updated patch implements also hash() on patterns.

    @serhiy-storchaka
    Copy link
    Member

    Added comments on Rietveld.

    @mrabarnett
    Copy link
    Mannequin

    mrabarnett mannequin commented Nov 17, 2016

    I hope you make it clear what you mean by 'equal', i.e. that it's comparing the pattern and the flags (AFAICT), so re.compile('(?x)a') != re.compile('(?x) a ').

    @vstinner
    Copy link
    Member Author

    Patch version 3:

    • pattern_hash() includes isbytes, I also shifted flags by 1 bit to not erase the isbytes bit (FYI maximum value of flags is 256)
    • pattern_richcompare() avoids calling PyObject_RichCompareBool() if flags or isbytes is different
    • unit test ensures that no BytesWarning warning is raised
    • checks hash() in unit tests
    • fix also the unit test with a different flag (use the same pattern)
    • document also in the unit test that the comparison is case sensitive

    @vstinner
    Copy link
    Member Author

    Ok, I hope that it's the last attempt: patch 5

    • Remove hash(b) != hash(a): only keep tests on hash(b)==hash(a) when b==a
    • Replace re.ASCII flag with no flag to test two patterns with different flags

    @serhiy-storchaka
    Copy link
    Member

    There is a problem with locale-depending flags. The same pattern may be compiled with the LOCALE flag to different pattern objects in different locales. Perhaps we should compare the compiled code instead of pattern strings. Or both.

    @vstinner
    Copy link
    Member Author

    Serhiy: "There is a problem with locale-depending flags. The same pattern may be compiled with the LOCALE flag to different pattern objects in different locales."

    Oh, I didn't know and you are right.

    "Perhaps we should compare the compiled code instead of pattern strings. Or both."

    PatternObject contains many fields. I used the following two fields which come from re.compile():

    • pattern
    • flags

    I considered that they were enough because pattern_repr() only displays these ones. Other fields:

    • groups
    • groupindex
    • indexgroup
    • weakreflist
    • isbytes
    • codesize, code

    weakreflist can be skipped, isbytes is already tested in my patch.

    Would it be possible to only compare code instead of pattern? What are groups, groupindex and indexgroup: should we also compare them?

    Maybe I can start from pattern_compare-4.patch and only add a test comparing code?

    @serhiy-storchaka
    Copy link
    Member

    '[abc]' and '[cba]' are compiled to the same code. Do we want to handle them as equal? This is implementation defined.

    @vstinner
    Copy link
    Member Author

    '[abc]' and '[cba]' are compiled to the same code. Do we want to handle them as equal?

    Comparison must be fast. If the comparison is just memcmp(code1,
    code2, codesize), it's ok.

    I agree that we must put a similar somewhere to say that some parts
    are implementation details. Maybe split the test in two parts, and
    mark the second part with @cpython_only.

    @serhiy-storchaka
    Copy link
    Member

    I see two options:

    • Compare flags, isbytes and code. This makes some different patterns be compiled to equal objects. For example spaces in verbose mode are ignored. But this don't make equal all equivalent patterns. '[aA]' would be equal to '(?:a|A)' but still would be not equal to '(i?a)' with current implementation.

    • Compare flags, isbytes, code and pattern. This makes literally different patterns be compiled to not equal objects even if the difference is not significant. '[abc]' would be different from '[cba]' despites the fact that matching both always returns the same result.

    Since this issue becomes a little ambiguous, I would target the patch to 3.7 only. Maybe we will find other subtle details or will decide to change the meaning of equality of pattern objects before releasing 3.7.

    @vstinner
    Copy link
    Member Author

    New approach: patch 5 now compares indexgroup, groupindex and code instead of comparing pattern, to handle correctly two equal pattern strings compiled with the re.LOCALE flag and two different locales.

    The patch also converts indexgroup list to a tuple to be able to hash it. (It also prevents modification, but this is just a side effect, and groupindex remains a mutable dictionary.)

    _sre.compile() checks types which helps to identify a bug in unit tests.

    @serhiy-storchaka
    Copy link
    Member

    This looks too complicated. groups, indexgroup and groupindex are unambiguously derived from pattern string. If caching works different pattern strings are compiled to different pattern objects. Currently they are not equal, even if their codes are equal. And I don't see large need to consider them equal.

    @vstinner
    Copy link
    Member Author

    Back to basis, patch 6:

    • revert changes on indexgroup and groupindex types: I will fix this later, once this issue is fixed
    • pattern_richcompare() and pattern_hash() also uses pattern, but don't use groups, indexgroup nor groupindex anymore

    I removed the @cpython_only unit test and rewrote test_pattern_compare_bytes() to make it easier to follow.

    re.compile('abc', re.IGNORECASE) is different than re.compile('ABC', re.IGNORECASE), but it's a deliberate choice to not test it. I consider that the behaviour can change depending on the Python implementation and in a future version.

    @serhiy-storchaka
    Copy link
    Member

    pattern_compare-6.patch LGTM.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Nov 21, 2016

    New changeset 5e8ef1493843 by Victor Stinner in branch '3.6':
    Implement rich comparison for _sre.SRE_Pattern
    https://hg.python.org/cpython/rev/5e8ef1493843

    @vstinner
    Copy link
    Member Author

    Serhiy Storchaka: "pattern_compare-6.patch LGTM."

    Thank you very much for your very useful reviews! I pushed the change.

    @vstinner
    Copy link
    Member Author

    For stricter checks on _sre.compile() arguments, I created the issue bpo-28765: "_sre.compile(): be more strict on types of indexgroup and groupindex".

    @serhiy-storchaka
    Copy link
    Member

    + && left->codesize && right->codesize);

    There is a typo. Should be:

    + && left->codesize == right->codesize);

    @serhiy-storchaka
    Copy link
    Member

    While we are here, it perhaps worth to add a fast path for self == other.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Nov 22, 2016

    New changeset 6b43d15fd2d7 by Victor Stinner in branch '3.6':
    Issue bpo-28727: Fix typo in pattern_richcompare()
    https://hg.python.org/cpython/rev/6b43d15fd2d7

    New changeset c2cb70c97163 by Victor Stinner in branch '3.6':
    Issue bpo-28727: Optimize pattern_richcompare() for a==a
    https://hg.python.org/cpython/rev/c2cb70c97163

    @vstinner
    Copy link
    Member Author

    Serhiy Storchaka: "&& left->codesize && right->codesize)"

    Ooops! Fixed!

    "While we are here, it perhaps worth to add a fast path for self == other."

    Done.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life topic-regex type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants