Optimize re.search() for \A (and maybe ^) #87051

2d4d · 2021-01-10T21:58:21Z

BPO	42885
Nosy	@ezio-melotti, @stevendaprano, @serhiy-storchaka, @2d4d
PRs	bpo-42885: Optimize search for regular expressions starting with "\A" or "^" #32021
Files	regex_timeit.py

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2022-03-22.15:28:31.421>
created_at = <Date 2021-01-10.21:58:21.401>
labels = ['expert-regex', '3.10', 'performance']
title = 'Optimize re.search() for \\A (and maybe ^)'
updated_at = <Date 2022-03-22.15:28:31.421>
user = 'https://github.com/2d4d'

bugs.python.org fields:

activity = <Date 2022-03-22.15:28:31.421>
actor = 'serhiy.storchaka'
assignee = 'none'
closed = True
closed_date = <Date 2022-03-22.15:28:31.421>
closer = 'serhiy.storchaka'
components = ['Regular Expressions']
creation = <Date 2021-01-10.21:58:21.401>
creator = '2d4d'
dependencies = []
files = ['49733']
hgrepos = []
issue_num = 42885
keywords = ['patch']
message_count = 6.0
messages = ['384782', '384805', '385137', '385138', '385142', '415789']
nosy_count = 5.0
nosy_names = ['ezio.melotti', 'mrabarnett', 'steven.daprano', 'serhiy.storchaka', '2d4d']
pr_nums = ['32021']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue42885'
versions = ['Python 3.10']

2d4d · 2021-01-10T21:58:21Z

The re lib needs 7 seconds to check if a billion As start with an x. So e.g. this statement takes this long:

re.search(r'^x', 'A' * 1000000000)

It takes longer, the longer the string is. The string handling is not the problem, checking if it starts which an A takes just 0.00014 seconds. See output and code below:

3.10.0a4+ (heads/master:d16f617, Jan 9 2021, 13:24:45)
[GCC 7.5.0]
testing string len: 100000
re_test_false: 0.0008246829966083169
testing string len: 1000000000
re_test_false: 7.317708015005337
testing string len: 1000000000
re_test_true: 0.00014710200048284605

import re, timeit, functools, sys

def re_test_true(string):
    print("testing string len: ", len(string))
    re.search(r'^A', string)

def re_test_false(string):
    print("testing string len: ", len(string))
    re.search(r'^x', string)

print(sys.version)

huge_string = 'A' * 100000
print('re_test_false: ', timeit.timeit(functools.partial(re_test_false, huge_string), number=1))

huge_string = 'A' * 1000000000
print('re_test_false: ', timeit.timeit(functools.partial(re_test_false, huge_string), number=1))

print('re_test_true: ', timeit.timeit(functools.partial(re_test_true, huge_string), number=1))

stevendaprano · 2021-01-11T09:27:25Z

I'm getting similar results in Python 3.9.

[steve ~]$ python3.9 -m timeit -s "a = 'A'*10000" -s "import re" "re.search('^x', a)"
5000 loops, best of 5: 67.3 usec per loop
[steve ~]$ python3.9 -m timeit -s "a = 'A'*100000" -s "import re" "re.search('^x', a)"
500 loops, best of 5: 639 usec per loop
[steve ~]$ python3.9 -m timeit -s "a = 'A'*1000000" -s "import re" "re.search('^x', a)"
50 loops, best of 5: 6.27 msec per loop
[steve ~]$ python3.9 -m timeit -s "a = 'A'*10000000" -s "import re" "re.search('^x', a)"
5 loops, best of 5: 62.8 msec per loop
[steve ~]$ python3.9 -m timeit -s "a = 'A'*100000000" -s "import re" "re.search('^x', a)"
1 loop, best of 5: 654 msec per loop

It looks like the time is roughly linear in the length of the string.

I get the same result as far back as Python 2.7 (I haven't tried older versions).

[steve ~]$ python2.7 -m timeit -s "a = 'A'*10000" -s "import re" "re.search('^x', a)"
10000 loops, best of 3: 75.7 usec per loop
[steve ~]$ python2.7 -m timeit -s "a = 'A'*10000000" -s "import re" "re.search('^x', a)"
10 loops, best of 3: 73.4 msec per loop

I would have expected essentially constant time, as in re.match:

[steve ~]$ python3.9 -m timeit -s "a = 'A'*10000" -s "import re" "re.match('x', a)"
500000 loops, best of 5: 560 nsec per loop
[steve ~]$ python3.9 -m timeit -s "a = 'A'*100000000" -s "import re" "re.match('x', a)"
500000 loops, best of 5: 561 nsec per loop

serhiy-storchaka · 2021-01-16T08:59:14Z

^ matches not just the beginning of the string. It matches the beginning of a line, i.e. an anchor just after '\n'. If the input string contains '\n', the result cannot be found less than by linear time. If you want to check if the beginning of the string matches a regular expression, it is better to use match(). If you want the check if the whole string matches it, it is better to use fullmatch().

But in some cases you cannot choose what method to use. If you have a set of patterns, and only some of them should be anchored to the start of the string, you have to use search(). And while linear complexity for ^ is expected, search() is not optimized for \A.

So the original report is rejected, the behavior is expected and cannot be changed. It is not a bug. But some optimization can be added for \A, and perhaps the constant multiplier for ^ can be reduced too.

stevendaprano · 2021-01-16T10:53:06Z

On Sat, Jan 16, 2021 at 08:59:13AM +0000, Serhiy Storchaka wrote:

Serhiy Storchaka <storchaka+cpython@gmail.com> added the comment:

^ matches not just the beginning of the string. It matches the
beginning of a line, i.e. an anchor just after '\n'.

Only in MULTILINE mode.

I completely agree that in multiline mode '^a' has to search the entire
string. But in single line mode, if the first character isn't an 'a',
there is no need to continue searching.

As far as I can see from the docs, ^ in single line mode and \A always
should be constant time.

So the original report is rejected, the behavior is expected and
cannot be changed. It is not a bug.

I disagree that it is expected behaviour. "Match the start of the
string" cannot possibly match anything except the start of the string,
so who would expect it to keep scanning past the start of the string?

(Multiline mode is, of course, different.)

2d4d · 2021-01-16T15:44:43Z

some more observations, which might be helpful in tracking it down:

x$ is much faster than ^x
A$ is as slow as ^x

$ python3 -m timeit -s "a = 'A'*100000000" "import re" "re.search('x$', a)"
10 loops, best of 5: 32.9 msec per loop

$ python3 -m timeit -s "a = 'A'*100000000" "import re" "re.search('A$', a)"
1 loop, best of 5: 802 msec per loop

serhiy-storchaka · 2022-03-22T15:28:06Z

New changeset 492d410 by Serhiy Storchaka in branch 'main':
bpo-42885: Optimize search for regular expressions starting with "\A" or "^" (GH-32021)
492d410

2d4d mannequin added stdlib Python modules in the Lib dir 3.10 only security fixes performance Performance or resource usage labels Jan 10, 2021

stevendaprano added 3.9 only security fixes labels Jan 11, 2021

serhiy-storchaka removed the 3.9 only security fixes label Jan 16, 2021

serhiy-storchaka changed the title ~~Regex performance problem with ^ aka AT_BEGINNING~~ Optimize re.search() for \A (and maybe ^) Jan 16, 2021

serhiy-storchaka removed the 3.9 only security fixes label Jan 16, 2021

serhiy-storchaka changed the title ~~Regex performance problem with ^ aka AT_BEGINNING~~ Optimize re.search() for \A (and maybe ^) Jan 16, 2021

2d4d mannequin added topic-regex and removed stdlib Python modules in the Lib dir labels Jan 16, 2021

serhiy-storchaka closed this as completed Mar 22, 2022

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize re.search() for \A (and maybe ^) #87051

Optimize re.search() for \A (and maybe ^) #87051

2d4d mannequin commented Jan 10, 2021

2d4d mannequin commented Jan 10, 2021

stevendaprano commented Jan 11, 2021

serhiy-storchaka commented Jan 16, 2021

stevendaprano commented Jan 16, 2021

2d4d mannequin commented Jan 16, 2021

serhiy-storchaka commented Mar 22, 2022

Optimize re.search() for \A (and maybe ^) #87051

Optimize re.search() for \A (and maybe ^) #87051

Comments

2d4d mannequin commented Jan 10, 2021

2d4d mannequin commented Jan 10, 2021

stevendaprano commented Jan 11, 2021

serhiy-storchaka commented Jan 16, 2021

stevendaprano commented Jan 16, 2021

2d4d mannequin commented Jan 16, 2021

serhiy-storchaka commented Mar 22, 2022