classification
Title: Optimize case-insensitive regular expressions
Type: performance Stage: resolved
Components: Library (Lib), Regular Expressions Versions: Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: ezio.melotti, mrabarnett, rhettinger, serhiy.storchaka
Priority: normal Keywords:

Created on 2017-05-05 17:02 by serhiy.storchaka, last changed 2017-05-09 20:37 by serhiy.storchaka. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 1482 merged serhiy.storchaka, 2017-05-05 17:22
Messages (3)
msg293127 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-05 17:02
Matching and searching case-insensitive regular expressions is much slower than matching and searching case-sensitive regular expressions. Case-insensitivity requires converting every character in input string to lower case and disables some optimizations. But there are only 2669 cased characters (52 in ASCII mode). For all other characters in the pattern we can use case-sensitive matching.

Results of microbenchmarks:

$ ./python -m timeit -s "import re; p = re.compile(r'(?ai) +x'); s = ' '*10000+'x'"  "p.match(s)"
Unpatched:  5000 loops, best of 5: 47.1 usec per loop
Patched:    20000 loops, best of 5: 17.6 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?i) +x'); s = ' '*10000+'x'"  "p.match(s)"
Unpatched:  2000 loops, best of 5: 109 usec per loop
Patched:    20000 loops, best of 5: 17.6 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?ai)[ \t]+x'); s = ' '*10000+'x'"  "p.match(s)"
Unpatched:  500 loops, best of 5: 537 usec per loop
Patched:    5000 loops, best of 5: 84.2 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?i)[ \t]+x'); s = ' '*10000+'x'"  "p.match(s)"
Unpatched:  500 loops, best of 5: 608 usec per loop
Patched:    5000 loops, best of 5: 84.1 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?ai)/x'); s = ' '*10000+'/x'"  "p.search(s)"
Unpatched:  1000 loops, best of 5: 226 usec per loop
Patched:    20000 loops, best of 5: 13.4 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?i)/x'); s = ' '*10000+'/x'"  "p.search(s)"
Unpatched:  1000 loops, best of 5: 284 usec per loop
Patched:    20000 loops, best of 5: 13.4 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?ai)[/%]x'); s = ' '*10000+'/x'"  "p.search(s)"
Unpatched:  500 loops, best of 5: 483 usec per loop
Patched:    1000 loops, best of 5: 279 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?i)[/%]x'); s = ' '*10000+'/x'"  "p.search(s)"
Unpatched:  500 loops, best of 5: 549 usec per loop
Patched:    1000 loops, best of 5: 279 usec per loop

The patch also optimizes searching some complex case-sensitive patterns. This is a side effect, I'll provide more comprehensive optimization in other issue.

$ ./python -m timeit -s "import re; p = re.compile(r'([xy])'); s = ' '*10000+'x'"  "p.search(s)"
Unpatched:  500 loops, best of 5: 633 usec per loop
Patched:    1000 loops, best of 5: 250 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'((x|y)z)'); s = ' '*10000+'xz'"  "p.search(s)"
Unpatched:  500 loops, best of 5: 716 usec per loop
Patched:    1000 loops, best of 5: 250 usec per loop
msg293174 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-05-06 21:09
This seems like a great idea.
msg293348 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-09 20:37
New changeset 6d336a027913327fc042b0d758a16724fea27b9c by Serhiy Storchaka in branch 'master':
bpo-30285: Optimize case-insensitive matching and searching (#1482)
https://github.com/python/cpython/commit/6d336a027913327fc042b0d758a16724fea27b9c
History
Date User Action Args
2017-05-09 20:37:56serhiy.storchakasetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2017-05-09 20:37:16serhiy.storchakasetmessages: + msg293348
2017-05-06 21:09:31rhettingersetnosy: + rhettinger
messages: + msg293174
2017-05-05 17:22:03serhiy.storchakasetpull_requests: + pull_request1582
2017-05-05 17:02:59serhiy.storchakacreate