Faster compiling of charset regexpes #63528

serhiy-storchaka · 2013-10-21T12:01:19Z

BPO	19329
Nosy	@vstinner, @ezio-melotti, @serhiy-storchaka
Dependencies	bpo-19327: re doesn't work with big charsets
Files	re_mk_bitmap.patch re_optimize_charset.patch re_optimize_charset_2.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/serhiy-storchaka'
closed_at = <Date 2013-10-27.06:24:34.571>
created_at = <Date 2013-10-21.12:01:18.787>
labels = ['expert-regex', 'library', 'performance']
title = 'Faster compiling of charset regexpes'
updated_at = <Date 2014-10-31.11:55:20.032>
user = 'https://github.com/serhiy-storchaka'

bugs.python.org fields:

activity = <Date 2014-10-31.11:55:20.032>
actor = 'python-dev'
assignee = 'serhiy.storchaka'
closed = True
closed_date = <Date 2013-10-27.06:24:34.571>
closer = 'serhiy.storchaka'
components = ['Library (Lib)', 'Regular Expressions']
creation = <Date 2013-10-21.12:01:18.787>
creator = 'serhiy.storchaka'
dependencies = ['19327']
files = ['32278', '32337', '32364']
hgrepos = []
issue_num = 19329
keywords = ['patch']
message_count = 6.0
messages = ['200755', '201166', '201292', '201419', '201420', '230335']
nosy_count = 5.0
nosy_names = ['vstinner', 'ezio.melotti', 'mrabarnett', 'python-dev', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue19329'
versions = ['Python 3.4']

serhiy-storchaka · 2013-10-21T12:01:19Z

Here is a patch which speeds up compiling of regular expressions with big charsets.

Microbenchmark:
$ ./python -m timeit "from sre_compile import compile; r = '[%s]' % ''.join(map(chr, range(256, 2**16, 255)))" "compile(r, 0)"

Unpatched (but with bpo-19327 fixed): 119 msec per loop
Patched: 59.6 msec per loop

Compiling regular expressions with big charsets was the main cause of the slow import of the email.message module (bpo-11454).
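
For readers who want to repeat this measurement from a script, here is a minimal sketch. It is not part of the patch. It assumes that going through the public `re.compile` and calling `re.purge()` before each run is an acceptable stand-in for calling `sre_compile.compile` directly, since both avoid a hit in the pattern cache. The helper names `big_charset_pattern` and `time_compile` are made up for this example, and absolute timings will differ between machines and Python versions.

```python
import re
import timeit


def big_charset_pattern():
    # Same pattern as in the microbenchmark: one character class holding
    # every 255th code point from U+0100 up to (but excluding) U+10000.
    return '[%s]' % ''.join(map(chr, range(256, 2**16, 255)))


def time_compile(pattern, number=5):
    # Return the average time in seconds for one uncached compilation.
    def run():
        re.purge()           # drop the pattern cache so compile does real work
        re.compile(pattern)
    return timeit.timeit(run, number=number) / number


if __name__ == "__main__":
    pattern = big_charset_pattern()
    print("big charset: %.3f msec per compile" % (time_compile(pattern) * 1e3))
```

Note that `re.purge()` itself is included in the measured time, so the numbers are only comparable with other runs of this same script.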

serhiy-storchaka · 2013-10-24T19:24:58Z

Here is a more complex patch which optimizes charset compiling. It affects small charsets too. Big charsets now support the same optimizations as small charsets. The optimized bitmap can now be used even if the charset contains category items or non-BMP characters.

$ ./python -m timeit "from sre_compile import compile; r = '[0-9]+'"  "compile(r, 0)"
Unpatched: 1000 loops, best of 3: 457 usec per loop
Patched: 1000 loops, best of 3: 368 usec per loop
$ ./python -m timeit "from sre_compile import compile; r = '[ \t\n\r\v\f]+'"  "compile(r, 0)"
Unpatched: 1000 loops, best of 3: 490 usec per loop
Patched: 1000 loops, best of 3: 413 usec per loop
$ ./python -m timeit "from sre_compile import compile; r = '[0-9A-Za-z_]+'"  "compile(r, 0)"
Unpatched: 1000 loops, best of 3: 760 usec per loop
Patched: 1000 loops, best of 3: 527 usec per loop
$ ./python -m timeit "from sre_compile import compile; r = r'[^\ud800-\udfff]*'"  "compile(r, 0)"
Unpatched: 100 loops, best of 3: 2.07 msec per loop
Patched: 1000 loops, best of 3: 1.44 msec per loop
$ ./python -m timeit "from sre_compile import compile; r = '[\u0410-\u042f\u0430-\u043f\u0404\u0406\u0407\u0454\u0456\u0457\u0490\u0491]+'"  "compile(r, 0)"
Unpatched: 100 loops, best of 3: 8.24 msec per loop
Patched: 100 loops, best of 3: 2.13 msec per loop
$ ./python -m timeit "from sre_compile import compile; r = '[%s]' % ''.join(map(chr, range(256, 2**16, 255)))"  "compile(r, 0)"
Unpatched: 10 loops, best of 3: 119 msec per loop
Patched: 10 loops, best of 3: 24.1 msec per loop
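
To illustrate what a charset bitmap is, here is a simplified sketch. It is not the code from the patch, and it makes no claim about how `sre_compile` lays out its data. It assumes a 256-bit map covering code points 0 to 255, packed into 32-bit words, and the function names `make_bitmap`, `to_words` and `contains` are hypothetical.

```python
def make_bitmap(items, size=256):
    # Build an integer bitmask from literal code points and (lo, hi) ranges.
    bits = 0
    for item in items:
        lo, hi = item if isinstance(item, tuple) else (item, item)
        if lo < 0 or hi >= size:
            raise ValueError("code point outside the bitmap")
        for code in range(lo, hi + 1):
            bits |= 1 << code
    return bits


def to_words(bits, size=256, word_bits=32):
    # Split the bitmask into fixed-width words, lowest code points first.
    mask = (1 << word_bits) - 1
    return [(bits >> shift) & mask for shift in range(0, size, word_bits)]


def contains(bits, ch, size=256):
    # Membership test: a single shift and mask instead of scanning ranges.
    code = ord(ch)
    return code < size and bool((bits >> code) & 1)


# The class [0-9A-Za-z_] expressed as ranges plus one literal.
word_chars = make_bitmap([(48, 57), (65, 90), (97, 122), ord('_')])
```

With such a map, testing a character costs the same no matter how many ranges or literals the charset contains, which is why a bitmap is attractive for charsets with many items.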

serhiy-storchaka · 2013-10-25T21:02:01Z

The updated patch addresses Antoine's comments. One bug of my own is also fixed.

python-dev · 2013-10-27T06:22:03Z

New changeset d5498d9d9bb0 by Serhiy Storchaka in branch 'default':
Issue bpo-19329: Optimized compiling charsets in regular expressions.
http://hg.python.org/cpython/rev/d5498d9d9bb0

serhiy-storchaka · 2013-10-27T06:24:34Z

Thank you, Antoine, for your review.

python-dev · 2014-10-31T11:55:20Z

New changeset ebd48b4f650d by Serhiy Storchaka in branch '2.7':
Backported the optimization of compiling charsets in regular expressions
https://hg.python.org/cpython/rev/ebd48b4f650d

serhiy-storchaka self-assigned this Oct 21, 2013

serhiy-storchaka added the stdlib, topic-regex, and performance labels Oct 21, 2013

serhiy-storchaka changed the title ~~Faster compiling of big charset regexpes~~ Faster compiling of charset regexpes Oct 24, 2013

serhiy-storchaka closed this as completed Oct 27, 2013

ezio-melotti transferred this issue from another repository Apr 10, 2022
