classification
Title: Speed up compiling case-insensitive regular expressions
Type: performance Stage: resolved
Components: Library (Lib), Regular Expressions Versions: Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: ezio.melotti, mrabarnett, serhiy.storchaka
Priority: normal Keywords:

Created on 2017-05-05 07:03 by serhiy.storchaka, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 1468 merged serhiy.storchaka, 2017-05-05 07:09
Messages (2)
msg293049 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2017-05-05 07:03
Currently _sre.getlower() takes two arguments. Depending on the bits set in the second argument, it uses one of three algorithms to determine the lowercase form of a character: Unicode, ASCII-only, or locale-dependent. Since issue30215 was resolved, _sre.getlower() is no longer used for the locale-dependent case. The proposed patch replaces _sre.getlower() with two one-argument functions: _sre.ascii_tolower() and _sre.unicode_tolower(). This slightly speeds up compiling case-insensitive regular expressions, especially those containing ranges.
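To illustrate the idea, here is a minimal pure-Python sketch of the two shapes of API. It is not the C implementation in _sre: the flag value FLAG_UNICODE, the function bodies, and the helper lower_range() are assumptions made up for this example, and str.lower() only approximates the simple case mapping used by the real code. The point it tries to show is that with two one-argument functions the caller can pick the algorithm once, instead of re-checking flag bits for every character in a range.

```python
# Hypothetical flag bit for illustration only; not the real SRE flag constant.
FLAG_UNICODE = 1


def ascii_tolower(ch: int) -> int:
    """Lowercase a code point using ASCII rules only."""
    if ord("A") <= ch <= ord("Z"):
        return ch + 32
    return ch


def unicode_tolower(ch: int) -> int:
    """Lowercase a code point using Unicode rules (approximation via str.lower())."""
    lowered = chr(ch).lower()
    # str.lower() can expand to several characters; keep the original in that case.
    return ord(lowered) if len(lowered) == 1 else ch


def getlower(ch: int, flags: int) -> int:
    """Old shape: one function that inspects the flag bits on every call."""
    if flags & FLAG_UNICODE:
        return unicode_tolower(ch)
    return ascii_tolower(ch)


def lower_range(lo: int, hi: int, unicode_mode: bool) -> set:
    """New shape: choose the function once, then apply it to the whole range."""
    tolower = unicode_tolower if unicode_mode else ascii_tolower
    return {tolower(ch) for ch in range(lo, hi + 1)}
```

In this sketch, lower_range(0x0410, 0x042F, True) resolves the lowercasing function a single time before walking the range, which is the kind of per-character overhead the patch is described as removing.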

$ ./python -m timeit -s 'import sre_compile'  'sre_compile.compile("(?i)ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0)'
Unpatched:  2000 loops, best of 5: 180 usec per loop
Patched:    2000 loops, best of 5: 173 usec per loop

$ ./python -m timeit -s 'import sre_compile'  'sre_compile.compile("(?ia)ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0)'
Unpatched:  2000 loops, best of 5: 175 usec per loop
Patched:    2000 loops, best of 5: 168 usec per loop

$ ./python -m timeit -s 'import sre_compile'  'sre_compile.compile("(?i)[A-Z]", 0)'
Unpatched:  500 loops, best of 5: 788 usec per loop
Patched:    500 loops, best of 5: 766 usec per loop

$ ./python -m timeit -s 'import sre_compile'  'sre_compile.compile("(?ia)[A-Z]", 0)'
Unpatched:  5000 loops, best of 5: 92 usec per loop
Patched:    5000 loops, best of 5: 83.2 usec per loop

$ ./python -m timeit -s 'import sre_compile'  'sre_compile.compile("(?i)[\u0410-\u042f]", 0)'
Unpatched:  2000 loops, best of 5: 141 usec per loop
Patched:    2000 loops, best of 5: 122 usec per loop

$ ./python -m timeit -s 'import sre_compile'  'sre_compile.compile("(?i)[\u0000-\uffff]", 0)'
Unpatched:  5 loops, best of 5: 59 msec per loop
Patched:    10 loops, best of 5: 28.9 msec per loop
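
A comparable measurement can be sketched from inside Python with the timeit module. This is only an approximation of the command lines above: it goes through the public re.compile() rather than sre_compile.compile(), so it calls re.purge() first to bypass the pattern cache, and the helper name time_compile is hypothetical. Absolute numbers will differ by machine and Python version.

```python
import re
import timeit


def time_compile(pattern: str, number: int = 100) -> float:
    """Return the best per-call time, in seconds, for compiling `pattern`."""

    def run() -> None:
        re.purge()           # clear the cache so the pattern is really recompiled
        re.compile(pattern)

    # Best of 5 repeats, mirroring the "best of 5" reported by python -m timeit.
    return min(timeit.repeat(run, number=number, repeat=5)) / number


if __name__ == "__main__":
    for pattern in ("(?i)[A-Z]", "(?ia)[A-Z]", "(?i)[\u0410-\u042f]"):
        print(f"{pattern!r}: {time_compile(pattern) * 1e6:.1f} usec per loop")
```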
msg293062 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2017-05-05 07:42
New changeset 7186cc29be352bed6f1110873283d073fd0643e4 by Serhiy Storchaka in branch 'master':
bpo-30277: Replace _sre.getlower() with _sre.ascii_tolower() and _sre.unicode_tolower(). (#1468)
https://github.com/python/cpython/commit/7186cc29be352bed6f1110873283d073fd0643e4
History
Date User Action Args
2022-04-11 14:58:46 admin set github: 74463
2017-05-05 07:43:31 serhiy.storchaka set status: open -> closed; resolution: fixed; stage: patch review -> resolved
2017-05-05 07:42:49 serhiy.storchaka set messages: + msg293062
2017-05-05 07:09:37 serhiy.storchaka set pull_requests: + pull_request1567
2017-05-05 07:03:05 serhiy.storchaka create