classification
Title: Quadratic complexity in the parsing of re replacement string
Type: performance Stage: resolved
Components: Library (Lib), Regular Expressions Versions: Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: christian.heimes, ezio.melotti, mrabarnett, pitrou, python-dev, serhiy.storchaka, tim.peters
Priority: normal Keywords: patch

Created on 2013-10-23 14:36 by serhiy.storchaka, last changed 2013-10-23 19:29 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
re_parse_template.patch serhiy.storchaka, 2013-10-23 14:36 review
Messages (5)
msg201032 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-23 14:36
sre_parse.parse_template uses string concatenation to accumulate characters.

    def literal(literal, p=p, pappend=a):
        if p and p[-1][0] is LITERAL:
            p[-1] = LITERAL, p[-1][1] + literal
        else:
            pappend((LITERAL, literal))

This operation have quadratic calculation complexity for long replacement strings.

$ ./python -m timeit -n1 -r1 -s "from sre_parse import parse_template; repl = 'x'*100000"  "parse_template(repl, '')"
1 loops, best of 1: 3.38 sec per loop
$ ./python -m timeit -n1 -r1 -s "from sre_parse import parse_template; repl = 'x'*200000"  "parse_template(repl, '')"
1 loops, best of 1: 18.2 sec per loop

The proposed patch change amortized complexity to be linear. It also speeds up parsing shorter strings.

$ ./python -m timeit -n1 -r1 -s "from sre_parse import parse_template; repl = 'x'*100000"  "parse_template(repl, '')"
1 loops, best of 1: 915 msec per loop
$ ./python -m timeit -n1 -r1 -s "from sre_parse import parse_template; repl = 'x'*200000"  "parse_template(repl, '')"
1 loops, best of 1: 1.79 sec per loop
msg201035 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-10-23 14:47
LTGM except for Python 2.7. I think we should slowly stop applying optimizations and other non-critical fixes to 2.7.
msg201036 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-23 14:58
Normally optimizations should only land in the default branch, unless it's catastrophic regression.
msg201057 - (view) Author: Roundup Robot (python-dev) Date: 2013-10-23 19:28
New changeset b322047fec55 by Serhiy Storchaka in branch 'default':
Issue #19365: Optimized the parsing of long replacement string in re.sub*()
http://hg.python.org/cpython/rev/b322047fec55
msg201058 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-23 19:29
Thank you Antoine for your review.
History
Date User Action Args
2013-10-23 19:29:47serhiy.storchakasetstatus: open -> closed
messages: + msg201058

assignee: serhiy.storchaka
resolution: fixed
stage: patch review -> resolved
2013-10-23 19:28:36python-devsetnosy: + python-dev
messages: + msg201057
2013-10-23 14:58:02pitrousetnosy: + tim.peters

messages: + msg201036
versions: - Python 2.7, Python 3.3
2013-10-23 14:47:38christian.heimessetnosy: + christian.heimes
messages: + msg201035
2013-10-23 14:36:02serhiy.storchakacreate