Issue 42885: Optimize re.search() for \A (and maybe ^)

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/87051

classification

Title:	Optimize re.search() for \A (and maybe ^)
Type:	performance	Stage:	resolved
Components:	Regular Expressions	Versions:	Python 3.10

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	2d4d, ezio.melotti, mrabarnett, serhiy.storchaka, steven.daprano
Priority:	normal	Keywords:	patch

Created on 2021-01-10 21:58 by 2d4d, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
regex_timeit.py	2d4d, 2021-01-10 21:58

Pull Requests
URL	Status	Linked	Edit
PR 32021	merged	serhiy.storchaka, 2022-03-21 06:35

Messages (6)
msg384782 - (view)	Author: Arnim Rupp (2d4d)	Date: 2021-01-10 21:58
The re lib needs 7 seconds to check if a billion As start with an x. So e.g. this statement takes this long: re.search(r'^x', 'A' * 1000000000) It takes longer, the longer the string is. The string handling is not the problem, checking if it starts which an A takes just 0.00014 seconds. See output and code below: 3.10.0a4+ (heads/master:d16f617, Jan 9 2021, 13:24:45) [GCC 7.5.0] testing string len: 100000 re_test_false: 0.0008246829966083169 testing string len: 1000000000 re_test_false: 7.317708015005337 testing string len: 1000000000 re_test_true: 0.00014710200048284605 import re, timeit, functools, sys def re_test_true(string): print("testing string len: ", len(string)) re.search(r'^A', string) def re_test_false(string): print("testing string len: ", len(string)) re.search(r'^x', string) print(sys.version) huge_string = 'A' * 100000 print('re_test_false: ', timeit.timeit(functools.partial(re_test_false, huge_string), number=1)) huge_string = 'A' * 1000000000 print('re_test_false: ', timeit.timeit(functools.partial(re_test_false, huge_string), number=1)) print('re_test_true: ', timeit.timeit(functools.partial(re_test_true, huge_string), number=1))
msg384805 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2021-01-11 09:27
I'm getting similar results in Python 3.9. [steve ~]$ python3.9 -m timeit -s "a = 'A'10000" -s "import re" "re.search('^x', a)" 5000 loops, best of 5: 67.3 usec per loop [steve ~]$ python3.9 -m timeit -s "a = 'A'100000" -s "import re" "re.search('^x', a)" 500 loops, best of 5: 639 usec per loop [steve ~]$ python3.9 -m timeit -s "a = 'A'1000000" -s "import re" "re.search('^x', a)" 50 loops, best of 5: 6.27 msec per loop [steve ~]$ python3.9 -m timeit -s "a = 'A'10000000" -s "import re" "re.search('^x', a)" 5 loops, best of 5: 62.8 msec per loop [steve ~]$ python3.9 -m timeit -s "a = 'A'100000000" -s "import re" "re.search('^x', a)" 1 loop, best of 5: 654 msec per loop It looks like the time is roughly linear in the length of the string. I get the same result as far back as Python 2.7 (I haven't tried older versions). [steve ~]$ python2.7 -m timeit -s "a = 'A'10000" -s "import re" "re.search('^x', a)" 10000 loops, best of 3: 75.7 usec per loop [steve ~]$ python2.7 -m timeit -s "a = 'A'10000000" -s "import re" "re.search('^x', a)" 10 loops, best of 3: 73.4 msec per loop I would have expected essentially constant time, as in re.match: [steve ~]$ python3.9 -m timeit -s "a = 'A'10000" -s "import re" "re.match('x', a)" 500000 loops, best of 5: 560 nsec per loop [steve ~]$ python3.9 -m timeit -s "a = 'A'*100000000" -s "import re" "re.match('x', a)" 500000 loops, best of 5: 561 nsec per loop
msg385137 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2021-01-16 08:59
^ matches not just the beginning of the string. It matches the beginning of a line, i.e. an anchor just after '\n'. If the input string contains '\n', the result cannot be found less than by linear time. If you want to check if the beginning of the string matches a regular expression, it is better to use match(). If you want the check if the whole string matches it, it is better to use fullmatch(). But in some cases you cannot choose what method to use. If you have a set of patterns, and only some of them should be anchored to the start of the string, you have to use search(). And while linear complexity for ^ is expected, search() is not optimized for \A. So the original report is rejected, the behavior is expected and cannot be changed. It is not a bug. But some optimization can be added for \A, and perhaps the constant multiplier for ^ can be reduced too.
msg385138 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2021-01-16 10:53
On Sat, Jan 16, 2021 at 08:59:13AM +0000, Serhiy Storchaka wrote: > > Serhiy Storchaka <storchaka+cpython@gmail.com> added the comment: > > ^ matches not just the beginning of the string. It matches the > beginning of a line, i.e. an anchor just after '\n'. Only in MULTILINE mode. I completely agree that in multiline mode '^a' has to search the entire string. But in single line mode, if the first character isn't an 'a', there is no need to continue searching. As far as I can see from the docs, ^ in single line mode and \A always should be constant time. > So the original report is rejected, the behavior is expected and > cannot be changed. It is not a bug. I disagree that it is expected behaviour. "Match the start of the string" cannot possibly match anything except the start of the string, so who would expect it to keep scanning past the start of the string? (Multiline mode is, of course, different.)
msg385142 - (view)	Author: Arnim Rupp (2d4d)	Date: 2021-01-16 15:44
some more observations, which might be helpful in tracking it down: x$ is much faster than ^x A$ is as slow as ^x $ python3 -m timeit -s "a = 'A'100000000" "import re" "re.search('x$', a)" 10 loops, best of 5: 32.9 msec per loop $ python3 -m timeit -s "a = 'A'100000000" "import re" "re.search('A$', a)" 1 loop, best of 5: 802 msec per loop
msg415789 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2022-03-22 15:28
New changeset 492d4109f4d953c478cb48f17aa32adbb912623b by Serhiy Storchaka in branch 'main': bpo-42885: Optimize search for regular expressions starting with "\A" or "^" (GH-32021) https://github.com/python/cpython/commit/492d4109f4d953c478cb48f17aa32adbb912623b

History
Date	User	Action	Args
2022-04-11 14:59:40	admin	set	github: 87051
2022-03-22 15:28:31	serhiy.storchaka	set	status: open -> closed resolution: fixed stage: patch review -> resolved
2022-03-22 15:28:05	serhiy.storchaka	set	messages: + msg415789
2022-03-21 06:35:29	serhiy.storchaka	set	keywords: + patch stage: patch review pull_requests: + pull_request30109
2021-01-16 21:07:45	2d4d	set	nosy: + ezio.melotti, mrabarnett components: + Regular Expressions, - Library (Lib)
2021-01-16 15:44:43	2d4d	set	messages: + msg385142
2021-01-16 10:53:05	steven.daprano	set	messages: + msg385138
2021-01-16 08:59:13	serhiy.storchaka	set	title: Regex performance problem with ^ aka AT_BEGINNING -> Optimize re.search() for \A (and maybe ^) messages: + msg385137 versions: - Python 3.9
2021-01-15 22:55:42	terry.reedy	set	nosy: + serhiy.storchaka
2021-01-11 09:27:24	steven.daprano	set	nosy: + steven.daprano messages: + msg384805 versions: + Python 3.9
2021-01-10 21:58:21	2d4d	create