Author serhiy.storchaka
Recipients ezio.melotti, mrabarnett, serhiy.storchaka
Date 2017-05-05.17:02:59
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1494003779.18.0.0568506065862.issue30285@psf.upfronthosting.co.za>
In-reply-to
Content
Matching and searching case-insensitive regular expressions is much slower than matching and searching case-sensitive regular expressions. Case-insensitivity requires converting every character in input string to lower case and disables some optimizations. But there are only 2669 cased characters (52 in ASCII mode). For all other characters in the pattern we can use case-sensitive matching.

Results of microbenchmarks:

$ ./python -m timeit -s "import re; p = re.compile(r'(?ai) +x'); s = ' '*10000+'x'"  "p.match(s)"
Unpatched:  5000 loops, best of 5: 47.1 usec per loop
Patched:    20000 loops, best of 5: 17.6 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?i) +x'); s = ' '*10000+'x'"  "p.match(s)"
Unpatched:  2000 loops, best of 5: 109 usec per loop
Patched:    20000 loops, best of 5: 17.6 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?ai)[ \t]+x'); s = ' '*10000+'x'"  "p.match(s)"
Unpatched:  500 loops, best of 5: 537 usec per loop
Patched:    5000 loops, best of 5: 84.2 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?i)[ \t]+x'); s = ' '*10000+'x'"  "p.match(s)"
Unpatched:  500 loops, best of 5: 608 usec per loop
Patched:    5000 loops, best of 5: 84.1 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?ai)/x'); s = ' '*10000+'/x'"  "p.search(s)"
Unpatched:  1000 loops, best of 5: 226 usec per loop
Patched:    20000 loops, best of 5: 13.4 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?i)/x'); s = ' '*10000+'/x'"  "p.search(s)"
Unpatched:  1000 loops, best of 5: 284 usec per loop
Patched:    20000 loops, best of 5: 13.4 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?ai)[/%]x'); s = ' '*10000+'/x'"  "p.search(s)"
Unpatched:  500 loops, best of 5: 483 usec per loop
Patched:    1000 loops, best of 5: 279 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?i)[/%]x'); s = ' '*10000+'/x'"  "p.search(s)"
Unpatched:  500 loops, best of 5: 549 usec per loop
Patched:    1000 loops, best of 5: 279 usec per loop

The patch also optimizes searching some complex case-sensitive patterns. This is a side effect, I'll provide more comprehensive optimization in other issue.

$ ./python -m timeit -s "import re; p = re.compile(r'([xy])'); s = ' '*10000+'x'"  "p.search(s)"
Unpatched:  500 loops, best of 5: 633 usec per loop
Patched:    1000 loops, best of 5: 250 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'((x|y)z)'); s = ' '*10000+'xz'"  "p.search(s)"
Unpatched:  500 loops, best of 5: 716 usec per loop
Patched:    1000 loops, best of 5: 250 usec per loop
History
Date User Action Args
2017-05-05 17:02:59serhiy.storchakasetrecipients: + serhiy.storchaka, ezio.melotti, mrabarnett
2017-05-05 17:02:59serhiy.storchakasetmessageid: <1494003779.18.0.0568506065862.issue30285@psf.upfronthosting.co.za>
2017-05-05 17:02:59serhiy.storchakalinkissue30285 messages
2017-05-05 17:02:59serhiy.storchakacreate