classification
Title: Faster compiling of charset regexpes
Type: performance
Stage: resolved
Components: Library (Lib), Regular Expressions
Versions: Python 3.4
process
Status: closed
Resolution: fixed
Dependencies: 19327
Superseder:
Assigned To: serhiy.storchaka
Nosy List: ezio.melotti, mrabarnett, python-dev, serhiy.storchaka, vstinner
Priority: normal
Keywords: patch

Created on 2013-10-21 12:01 by serhiy.storchaka, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded
re_mk_bitmap.patch serhiy.storchaka, 2013-10-21 12:01
re_optimize_charset.patch serhiy.storchaka, 2013-10-24 19:24
re_optimize_charset_2.patch serhiy.storchaka, 2013-10-25 21:02
Messages (6)
msg200755 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2013-10-21 12:01
Here is a patch which speeds up compiling of regular expressions with big charsets.

Microbenchmark:
$ ./python -m timeit "from sre_compile import compile; r = '[%s]' % ''.join(map(chr, range(256, 2**16, 255)))"  "compile(r, 0)"

Unpatched (but with fixed issue19327): 119 msec per loop
Patched: 59.6 msec per loop

Compiling regular expressions with big charsets was the main cause of the slow import of the email.message module (issue11454).
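
For reference, the microbenchmark above can also be run from a script. This is only a sketch: it goes through the public `re.compile` instead of `sre_compile.compile`, and the helper names `compile_uncached` and `time_compile` are made up for this example. Because `re.compile` caches compiled patterns, the cache is purged before every call so that each run really compiles the charset.

```python
import re
import timeit

# Same pattern as in the microbenchmark: a character set built from
# every 255th code point between U+0100 and U+FFFF.
pattern = '[%s]' % ''.join(map(chr, range(256, 2**16, 255)))


def compile_uncached(pat):
    """Compile `pat` with an empty re cache (hypothetical helper)."""
    re.purge()  # drop cached patterns so the compiler actually runs
    return re.compile(pat)


def time_compile(pat, number=20, repeat=3):
    """Return the best observed time in seconds for one compile."""
    timings = timeit.repeat(lambda: compile_uncached(pat),
                            number=number, repeat=repeat)
    return min(timings) / number


if __name__ == "__main__":
    print("%.3f msec per compile" % (time_compile(pattern) * 1000))
```

Absolute numbers will differ from the ones quoted in this issue, since they depend on the interpreter version and the machine.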
msg201166 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2013-10-24 19:24
Here is a more complex patch which optimizes charset compiling. It affects small charsets too. Big charsets now support the same optimizations as small charsets. The optimized bitmap can now be used even if the charset contains category items or non-BMP characters.
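
To illustrate the bitmap idea mentioned above, here is a minimal sketch. It is not the code from the patch: the function names `make_bitmap`, `bitmap_contains` and `split_words` are hypothetical, and the 32-bit word size is an assumption made for the example. The point is only that a set of code points below 256 can be packed into 256 bits, after which a membership test is a single shift and mask.

```python
def make_bitmap(codepoints):
    """Pack code points from range(256) into one 256-bit integer."""
    bits = 0
    for cp in codepoints:
        if not 0 <= cp < 256:
            raise ValueError("code point %r does not fit the bitmap" % cp)
        bits |= 1 << cp  # set the bit that stands for this code point
    return bits


def bitmap_contains(bits, cp):
    """Return True if the bit for code point `cp` is set."""
    return 0 <= cp < 256 and bool((bits >> cp) & 1)


def split_words(bits, word_bits=32):
    """Split the 256-bit bitmap into fixed-size words, lowest bits first."""
    mask = (1 << word_bits) - 1
    return [(bits >> shift) & mask for shift in range(0, 256, word_bits)]


# Bitmap for the charset [0-9]
digits = make_bitmap(range(ord('0'), ord('9') + 1))
```

With this sketch, `bitmap_contains(digits, ord('5'))` is true and `bitmap_contains(digits, ord('a'))` is false, regardless of how many ranges or literals the original charset was written with.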

$ ./python -m timeit "from sre_compile import compile; r = '[0-9]+'"  "compile(r, 0)"
Unpatched: 1000 loops, best of 3: 457 usec per loop
Patched: 1000 loops, best of 3: 368 usec per loop
$ ./python -m timeit "from sre_compile import compile; r = '[ \t\n\r\v\f]+'"  "compile(r, 0)"
Unpatched: 1000 loops, best of 3: 490 usec per loop
Patched: 1000 loops, best of 3: 413 usec per loop
$ ./python -m timeit "from sre_compile import compile; r = '[0-9A-Za-z_]+'"  "compile(r, 0)"
Unpatched: 1000 loops, best of 3: 760 usec per loop
Patched: 1000 loops, best of 3: 527 usec per loop
$ ./python -m timeit "from sre_compile import compile; r = r'[^\ud800-\udfff]*'"  "compile(r, 0)"
Unpatched: 100 loops, best of 3: 2.07 msec per loop
Patched: 1000 loops, best of 3: 1.44 msec per loop
$ ./python -m timeit "from sre_compile import compile; r = '[\u0410-\u042f\u0430-\u043f\u0404\u0406\u0407\u0454\u0456\u0457\u0490\u0491]+'"  "compile(r, 0)"
Unpatched: 100 loops, best of 3: 8.24 msec per loop
Patched: 100 loops, best of 3: 2.13 msec per loop
$ ./python -m timeit "from sre_compile import compile; r = '[%s]' % ''.join(map(chr, range(256, 2**16, 255)))"  "compile(r, 0)"
Unpatched: 10 loops, best of 3: 119 msec per loop
Patched: 10 loops, best of 3: 24.1 msec per loop
msg201292 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2013-10-25 21:02
The updated patch addresses Antoine's comments. It also fixes one bug of my own.
msg201419 - Author: Roundup Robot (python-dev) (Python triager) Date: 2013-10-27 06:22
New changeset d5498d9d9bb0 by Serhiy Storchaka in branch 'default':
Issue #19329: Optimized compiling charsets in regular expressions.
http://hg.python.org/cpython/rev/d5498d9d9bb0
msg201420 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2013-10-27 06:24
Thank you, Antoine, for your review.
msg230335 - Author: Roundup Robot (python-dev) (Python triager) Date: 2014-10-31 11:55
New changeset ebd48b4f650d by Serhiy Storchaka in branch '2.7':
Backported the optimization of compiling charsets in regular expressions
https://hg.python.org/cpython/rev/ebd48b4f650d
History
Date User Action Args
2022-04-11 14:57:52 admin set github: 63528
2014-10-31 11:55:20 python-dev set messages: + msg230335
2013-10-27 06:24:34 serhiy.storchaka set status: open -> closed; resolution: fixed; messages: + msg201420; stage: patch review -> resolved
2013-10-27 06:22:02 python-dev set nosy: + python-dev; messages: + msg201419
2013-10-25 21:02:01 serhiy.storchaka set files: + re_optimize_charset_2.patch; messages: + msg201292
2013-10-24 19:24:58 serhiy.storchaka set files: + re_optimize_charset.patch; messages: + msg201166; title: Faster compiling of big charset regexpes -> Faster compiling of charset regexpes
2013-10-21 12:01:44 serhiy.storchaka set dependencies: + re doesn't work with big charsets
2013-10-21 12:01:18 serhiy.storchaka create