New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize re.search() for \A (and maybe ^) #87051
Comments
The re lib needs 7 seconds to check if a billion As start with an x. So e.g. this statement takes this long: re.search(r'^x', 'A' * 1000000000) It takes longer, the longer the string is. The string handling is not the problem, checking if it starts which an A takes just 0.00014 seconds. See output and code below: 3.10.0a4+ (heads/master:d16f617, Jan 9 2021, 13:24:45) import re, timeit, functools, sys
def re_test_true(string):
print("testing string len: ", len(string))
re.search(r'^A', string)
def re_test_false(string):
print("testing string len: ", len(string))
re.search(r'^x', string)
print(sys.version)
huge_string = 'A' * 100000
print('re_test_false: ', timeit.timeit(functools.partial(re_test_false, huge_string), number=1))
huge_string = 'A' * 1000000000
print('re_test_false: ', timeit.timeit(functools.partial(re_test_false, huge_string), number=1)) print('re_test_true: ', timeit.timeit(functools.partial(re_test_true, huge_string), number=1)) |
I'm getting similar results in Python 3.9. [steve ~]$ python3.9 -m timeit -s "a = 'A'*10000" -s "import re" "re.search('^x', a)" It looks like the time is roughly linear in the length of the string. I get the same result as far back as Python 2.7 (I haven't tried older versions). [steve ~]$ python2.7 -m timeit -s "a = 'A'*10000" -s "import re" "re.search('^x', a)" I would have expected essentially constant time, as in re.match: [steve ~]$ python3.9 -m timeit -s "a = 'A'*10000" -s "import re" "re.match('x', a)" |
^ matches not just the beginning of the string. It matches the beginning of a line, i.e. an anchor just after '\n'. If the input string contains '\n', the result cannot be found less than by linear time. If you want to check if the beginning of the string matches a regular expression, it is better to use match(). If you want the check if the whole string matches it, it is better to use fullmatch(). But in some cases you cannot choose what method to use. If you have a set of patterns, and only some of them should be anchored to the start of the string, you have to use search(). And while linear complexity for ^ is expected, search() is not optimized for \A. So the original report is rejected, the behavior is expected and cannot be changed. It is not a bug. But some optimization can be added for \A, and perhaps the constant multiplier for ^ can be reduced too. |
On Sat, Jan 16, 2021 at 08:59:13AM +0000, Serhiy Storchaka wrote:
Only in MULTILINE mode. I completely agree that in multiline mode '^a' has to search the entire As far as I can see from the docs, ^ in single line mode and \A always
I disagree that it is expected behaviour. "Match the start of the (Multiline mode is, of course, different.) |
some more observations, which might be helpful in tracking it down: x$ is much faster than ^x $ python3 -m timeit -s "a = 'A'*100000000" "import re" "re.search('x$', a)"
10 loops, best of 5: 32.9 msec per loop
$ python3 -m timeit -s "a = 'A'*100000000" "import re" "re.search('A$', a)"
1 loop, best of 5: 802 msec per loop |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: