Author serhiy.storchaka
Recipients ezio.melotti, mrabarnett, serhiy.storchaka
Date 2017-05-11.09:20:53
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1494494453.98.0.386630329077.issue30340@psf.upfronthosting.co.za>
In-reply-to
Content
Proposed patch makes the regular expression parser produce more optimal tree, mainly due to getting rid of non-capturing groups. This allows to apply an optimization that was forbidden before and makes the regular expression compiler producing more efficient code.

For example following expressions are transformed in more optimal form:

'(?:x|y)+' -> '[xy]+'
'(?:ab)|(?:ac)' -> 'a[bc]'
r'[a-z]|\d' -> r'[a-z\d]'

This can speed up matching by 10-25 times.

$ ./python -m timeit -s "import re; p = re.compile(r'(?:x|y)+'); s = 'x'*10000"  "p.match(s)"
Unpatched:  500 loops, best of 5: 865 usec per loop
Patched:    5000 loops, best of 5: 84.5 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?:[a-z]|\d)+'); s = 'x'*10000"  "p.match(s)"
Unpatched:  100 loops, best of 5: 2.19 msec per loop
Patched:    5000 loops, best of 5: 84.5 usec per loop
History
Date User Action Args
2017-05-11 09:20:54serhiy.storchakasetrecipients: + serhiy.storchaka, ezio.melotti, mrabarnett
2017-05-11 09:20:53serhiy.storchakasetmessageid: <1494494453.98.0.386630329077.issue30340@psf.upfronthosting.co.za>
2017-05-11 09:20:53serhiy.storchakalinkissue30340 messages
2017-05-11 09:20:53serhiy.storchakacreate