classification
Title: Optimize re.search() for \A (and maybe ^)
Type: performance Stage:
Components: Regular Expressions Versions: Python 3.10
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: 2d4d, ezio.melotti, mrabarnett, serhiy.storchaka, steven.daprano
Priority: normal Keywords:

Created on 2021-01-10 21:58 by 2d4d, last changed 2021-01-16 21:07 by 2d4d.

Files
File name Uploaded Description Edit
regex_timeit.py 2d4d, 2021-01-10 21:58
Messages (5)
msg384782 - (view) Author: Arnim Rupp (2d4d) Date: 2021-01-10 21:58
The re lib needs 7 seconds to check if a billion As start with an x. So e.g. this statement takes this long:

re.search(r'^x', 'A' * 1000000000)

It takes longer, the longer the string is. The string handling is not the problem, checking if it starts which an A takes just 0.00014 seconds. See output and code below:

3.10.0a4+ (heads/master:d16f617, Jan  9 2021, 13:24:45) 
[GCC 7.5.0]
testing string len:  100000
re_test_false:  0.0008246829966083169
testing string len:  1000000000
re_test_false:  7.317708015005337
testing string len:  1000000000
re_test_true:   0.00014710200048284605


import re, timeit, functools, sys

def re_test_true(string):
    print("testing string len: ", len(string))
    re.search(r'^A', string)

def re_test_false(string):
    print("testing string len: ", len(string))
    re.search(r'^x', string)

print(sys.version)

huge_string = 'A' * 100000
print('re_test_false: ', timeit.timeit(functools.partial(re_test_false, huge_string), number=1))

huge_string = 'A' * 1000000000
print('re_test_false: ', timeit.timeit(functools.partial(re_test_false, huge_string), number=1))

print('re_test_true:  ', timeit.timeit(functools.partial(re_test_true, huge_string), number=1))
msg384805 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2021-01-11 09:27
I'm getting similar results in Python 3.9.

[steve ~]$ python3.9 -m timeit -s "a = 'A'*10000" -s "import re" "re.search('^x', a)"
5000 loops, best of 5: 67.3 usec per loop
[steve ~]$ python3.9 -m timeit -s "a = 'A'*100000" -s "import re" "re.search('^x', a)"
500 loops, best of 5: 639 usec per loop
[steve ~]$ python3.9 -m timeit -s "a = 'A'*1000000" -s "import re" "re.search('^x', a)"
50 loops, best of 5: 6.27 msec per loop
[steve ~]$ python3.9 -m timeit -s "a = 'A'*10000000" -s "import re" "re.search('^x', a)"
5 loops, best of 5: 62.8 msec per loop
[steve ~]$ python3.9 -m timeit -s "a = 'A'*100000000" -s "import re" "re.search('^x', a)"
1 loop, best of 5: 654 msec per loop


It looks like the time is roughly linear in the length of the string.

I get the same result as far back as Python 2.7 (I haven't tried older versions).

[steve ~]$ python2.7 -m timeit -s "a = 'A'*10000" -s "import re" "re.search('^x', a)"
10000 loops, best of 3: 75.7 usec per loop
[steve ~]$ python2.7 -m timeit -s "a = 'A'*10000000" -s "import re" "re.search('^x', a)"
10 loops, best of 3: 73.4 msec per loop


I would have expected essentially constant time, as in re.match:

[steve ~]$ python3.9 -m timeit -s "a = 'A'*10000" -s "import re" "re.match('x', a)"
500000 loops, best of 5: 560 nsec per loop
[steve ~]$ python3.9 -m timeit -s "a = 'A'*100000000" -s "import re" "re.match('x', a)"
500000 loops, best of 5: 561 nsec per loop
msg385137 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-01-16 08:59
^ matches not just the beginning of the string. It matches the beginning of a line, i.e. an anchor just after '\n'. If the input string contains '\n', the result cannot be found less than by linear time. If you want to check if the beginning of the string matches a regular expression, it is better to use match(). If you want the check if the whole string matches it, it is better to use fullmatch().

But in some cases you cannot choose what method to use. If you have a set of patterns, and only some of them should be anchored to the start of the string, you have to use search(). And while linear complexity for ^ is expected, search() is not optimized for \A.

So the original report is rejected, the behavior is expected and cannot be changed. It is not a bug. But some optimization can be added for \A, and perhaps the constant multiplier for ^ can be reduced too.
msg385138 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2021-01-16 10:53
On Sat, Jan 16, 2021 at 08:59:13AM +0000, Serhiy Storchaka wrote:
> 
> Serhiy Storchaka <storchaka+cpython@gmail.com> added the comment:
> 
> ^ matches not just the beginning of the string. It matches the 
> beginning of a line, i.e. an anchor just after '\n'.

Only in MULTILINE mode.

I completely agree that in multiline mode '^a' has to search the entire 
string. But in single line mode, if the first character isn't an 'a', 
there is no need to continue searching.

As far as I can see from the docs, ^ in single line mode and \A always 
should be constant time.

> So the original report is rejected, the behavior is expected and 
> cannot be changed. It is not a bug.

I disagree that it is expected behaviour. "Match the start of the 
string" cannot possibly match anything except the start of the string, 
so who would expect it to keep scanning past the start of the string?

(Multiline mode is, of course, different.)
msg385142 - (view) Author: Arnim Rupp (2d4d) Date: 2021-01-16 15:44
some more observations, which might be helpful in tracking it down:

x$ is much faster than ^x
A$ is as slow as ^x

$ python3 -m timeit -s "a = 'A'*100000000" "import re" "re.search('x$', a)"
10 loops, best of 5: 32.9 msec per loop

$ python3 -m timeit -s "a = 'A'*100000000" "import re" "re.search('A$', a)"
1 loop, best of 5: 802 msec per loop
History
Date User Action Args
2021-01-16 21:07:452d4dsetnosy: + ezio.melotti, mrabarnett
components: + Regular Expressions, - Library (Lib)
2021-01-16 15:44:432d4dsetmessages: + msg385142
2021-01-16 10:53:05steven.dapranosetmessages: + msg385138
2021-01-16 08:59:13serhiy.storchakasettitle: Regex performance problem with ^ aka AT_BEGINNING -> Optimize re.search() for \A (and maybe ^)
messages: + msg385137
versions: - Python 3.9
2021-01-15 22:55:42terry.reedysetnosy: + serhiy.storchaka
2021-01-11 09:27:24steven.dapranosetnosy: + steven.daprano

messages: + msg384805
versions: + Python 3.9
2021-01-10 21:58:212d4dcreate