Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize out non-capturing groups #74525

Closed
serhiy-storchaka opened this issue May 11, 2017 · 5 comments
Closed

Optimize out non-capturing groups #74525

serhiy-storchaka opened this issue May 11, 2017 · 5 comments
Assignees
Labels
3.7 (EOL) end of life performance Performance or resource usage stdlib Python modules in the Lib dir topic-regex

Comments

@serhiy-storchaka
Copy link
Member

BPO 30340
Nosy @ezio-melotti, @serhiy-storchaka, @MojoVampire
PRs
  • bpo-30340: Enhanced regular expressions optimization. #1542
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2017-05-14.06:54:30.184>
    created_at = <Date 2017-05-11.09:20:53.903>
    labels = ['expert-regex', '3.7', 'library', 'performance']
    title = 'Optimize out non-capturing groups'
    updated_at = <Date 2017-05-14.06:54:30.183>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2017-05-14.06:54:30.183>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2017-05-14.06:54:30.184>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)', 'Regular Expressions']
    creation = <Date 2017-05-11.09:20:53.903>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 30340
    keywords = []
    message_count = 5.0
    messages = ['293477', '293481', '293501', '293504', '293635']
    nosy_count = 4.0
    nosy_names = ['ezio.melotti', 'mrabarnett', 'serhiy.storchaka', 'josh.r']
    pr_nums = ['1542']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue30340'
    versions = ['Python 3.7']

    @serhiy-storchaka
    Copy link
    Member Author

    Proposed patch makes the regular expression parser produce more optimal tree, mainly due to getting rid of non-capturing groups. This allows to apply an optimization that was forbidden before and makes the regular expression compiler producing more efficient code.

    For example following expressions are transformed in more optimal form:

    '(?:x|y)+' -> '[xy]+'
    '(?:ab)|(?:ac)' -> 'a[bc]'
    r'[a-z]|\d' -> r'[a-z\d]'

    This can speed up matching by 10-25 times.

    $ ./python -m timeit -s "import re; p = re.compile(r'(?:x|y)+'); s = 'x'*10000"  "p.match(s)"
    Unpatched:  500 loops, best of 5: 865 usec per loop
    Patched:    5000 loops, best of 5: 84.5 usec per loop
    
    $ ./python -m timeit -s "import re; p = re.compile(r'(?:[a-z]|\d)+'); s = 'x'*10000"  "p.match(s)"
    Unpatched:  100 loops, best of 5: 2.19 msec per loop
    Patched:    5000 loops, best of 5: 84.5 usec per loop

    @serhiy-storchaka serhiy-storchaka added the 3.7 (EOL) end of life label May 11, 2017
    @serhiy-storchaka serhiy-storchaka self-assigned this May 11, 2017
    @serhiy-storchaka serhiy-storchaka added stdlib Python modules in the Lib dir topic-regex performance Performance or resource usage labels May 11, 2017
    @serhiy-storchaka
    Copy link
    Member Author

    $ ./python -m timeit -s "import re; p = re.compile(r'[a-z]|[0-9]'); s = ' '*10000+'x'"  "p.search(s)"
    Unpatched:  500 loops, best of 5: 732 usec per loop
    Patched:    1000 loops, best of 5: 279 usec per loop

    @MojoVampire
    Copy link
    Mannequin

    MojoVampire mannequin commented May 11, 2017

    The PR includes defining and using a _uniq function that is actually a no-op function (it doesn't uniquify, the first line returns the argument, so the rest is skipped). Was that supposed to be removed, or should it actually uniqify?

    @serhiy-storchaka
    Copy link
    Member Author

    Good catch Josh! This return was temporary added for debugging, the function should actually uniqify.

    @serhiy-storchaka
    Copy link
    Member Author

    New changeset 821a9d1 by Serhiy Storchaka in branch 'master':
    bpo-30340: Enhanced regular expressions optimization. (bpo-1542)
    821a9d1

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life performance Performance or resource usage stdlib Python modules in the Lib dir topic-regex
    Projects
    None yet
    Development

    No branches or pull requests

    1 participant