classification
Title: Optimize out non-capturing groups
Type: performance Stage: resolved
Components: Library (Lib), Regular Expressions Versions: Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: ezio.melotti, josh.r, mrabarnett, serhiy.storchaka
Priority: normal Keywords:

Created on 2017-05-11 09:20 by serhiy.storchaka, last changed 2017-05-14 06:54 by serhiy.storchaka. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 1542 merged serhiy.storchaka, 2017-05-11 09:39
Messages (5)
msg293477 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-11 09:20
Proposed patch makes the regular expression parser produce more optimal tree, mainly due to getting rid of non-capturing groups. This allows to apply an optimization that was forbidden before and makes the regular expression compiler producing more efficient code.

For example following expressions are transformed in more optimal form:

'(?:x|y)+' -> '[xy]+'
'(?:ab)|(?:ac)' -> 'a[bc]'
r'[a-z]|\d' -> r'[a-z\d]'

This can speed up matching by 10-25 times.

$ ./python -m timeit -s "import re; p = re.compile(r'(?:x|y)+'); s = 'x'*10000"  "p.match(s)"
Unpatched:  500 loops, best of 5: 865 usec per loop
Patched:    5000 loops, best of 5: 84.5 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?:[a-z]|\d)+'); s = 'x'*10000"  "p.match(s)"
Unpatched:  100 loops, best of 5: 2.19 msec per loop
Patched:    5000 loops, best of 5: 84.5 usec per loop
msg293481 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-11 09:34
$ ./python -m timeit -s "import re; p = re.compile(r'[a-z]|[0-9]'); s = ' '*10000+'x'"  "p.search(s)"
Unpatched:  500 loops, best of 5: 732 usec per loop
Patched:    1000 loops, best of 5: 279 usec per loop
msg293501 - (view) Author: Josh Rosenberg (josh.r) * Date: 2017-05-11 15:31
The PR includes defining and using a _uniq function that is actually a no-op function (it doesn't uniquify, the first line returns the argument, so the rest is skipped). Was that supposed to be removed, or should it actually uniqify?
msg293504 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-11 16:24
Good catch Josh! This return was temporary added for debugging, the function should actually uniqify.
msg293635 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-14 05:32
New changeset 821a9d146bc04a1bc1a9807962990a1f59d692b8 by Serhiy Storchaka in branch 'master':
bpo-30340: Enhanced regular expressions optimization. (#1542)
https://github.com/python/cpython/commit/821a9d146bc04a1bc1a9807962990a1f59d692b8
History
Date User Action Args
2017-05-14 06:54:30serhiy.storchakasetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2017-05-14 05:32:35serhiy.storchakasetmessages: + msg293635
2017-05-11 16:24:03serhiy.storchakasetmessages: + msg293504
2017-05-11 15:31:30josh.rsetnosy: + josh.r
messages: + msg293501
2017-05-11 09:39:56serhiy.storchakasetpull_requests: + pull_request1641
2017-05-11 09:34:15serhiy.storchakasetmessages: + msg293481
2017-05-11 09:20:53serhiy.storchakacreate