Issue42885
Created on 2021-01-10 21:58 by 2d4d, last changed 2021-01-16 21:07 by 2d4d.
Files | ||||
---|---|---|---|---|
File name | Uploaded | Description | Edit | |
regex_timeit.py | 2d4d, 2021-01-10 21:58 |
Messages (5) | |||
---|---|---|---|
msg384782 - (view) | Author: Arnim Rupp (2d4d) | Date: 2021-01-10 21:58 | |
The re lib needs 7 seconds to check if a billion As start with an x. So e.g. this statement takes this long: re.search(r'^x', 'A' * 1000000000) It takes longer, the longer the string is. The string handling is not the problem, checking if it starts which an A takes just 0.00014 seconds. See output and code below: 3.10.0a4+ (heads/master:d16f617, Jan 9 2021, 13:24:45) [GCC 7.5.0] testing string len: 100000 re_test_false: 0.0008246829966083169 testing string len: 1000000000 re_test_false: 7.317708015005337 testing string len: 1000000000 re_test_true: 0.00014710200048284605 import re, timeit, functools, sys def re_test_true(string): print("testing string len: ", len(string)) re.search(r'^A', string) def re_test_false(string): print("testing string len: ", len(string)) re.search(r'^x', string) print(sys.version) huge_string = 'A' * 100000 print('re_test_false: ', timeit.timeit(functools.partial(re_test_false, huge_string), number=1)) huge_string = 'A' * 1000000000 print('re_test_false: ', timeit.timeit(functools.partial(re_test_false, huge_string), number=1)) print('re_test_true: ', timeit.timeit(functools.partial(re_test_true, huge_string), number=1)) |
|||
msg384805 - (view) | Author: Steven D'Aprano (steven.daprano) * ![]() |
Date: 2021-01-11 09:27 | |
I'm getting similar results in Python 3.9. [steve ~]$ python3.9 -m timeit -s "a = 'A'*10000" -s "import re" "re.search('^x', a)" 5000 loops, best of 5: 67.3 usec per loop [steve ~]$ python3.9 -m timeit -s "a = 'A'*100000" -s "import re" "re.search('^x', a)" 500 loops, best of 5: 639 usec per loop [steve ~]$ python3.9 -m timeit -s "a = 'A'*1000000" -s "import re" "re.search('^x', a)" 50 loops, best of 5: 6.27 msec per loop [steve ~]$ python3.9 -m timeit -s "a = 'A'*10000000" -s "import re" "re.search('^x', a)" 5 loops, best of 5: 62.8 msec per loop [steve ~]$ python3.9 -m timeit -s "a = 'A'*100000000" -s "import re" "re.search('^x', a)" 1 loop, best of 5: 654 msec per loop It looks like the time is roughly linear in the length of the string. I get the same result as far back as Python 2.7 (I haven't tried older versions). [steve ~]$ python2.7 -m timeit -s "a = 'A'*10000" -s "import re" "re.search('^x', a)" 10000 loops, best of 3: 75.7 usec per loop [steve ~]$ python2.7 -m timeit -s "a = 'A'*10000000" -s "import re" "re.search('^x', a)" 10 loops, best of 3: 73.4 msec per loop I would have expected essentially constant time, as in re.match: [steve ~]$ python3.9 -m timeit -s "a = 'A'*10000" -s "import re" "re.match('x', a)" 500000 loops, best of 5: 560 nsec per loop [steve ~]$ python3.9 -m timeit -s "a = 'A'*100000000" -s "import re" "re.match('x', a)" 500000 loops, best of 5: 561 nsec per loop |
|||
msg385137 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * ![]() |
Date: 2021-01-16 08:59 | |
^ matches not just the beginning of the string. It matches the beginning of a line, i.e. an anchor just after '\n'. If the input string contains '\n', the result cannot be found less than by linear time. If you want to check if the beginning of the string matches a regular expression, it is better to use match(). If you want the check if the whole string matches it, it is better to use fullmatch(). But in some cases you cannot choose what method to use. If you have a set of patterns, and only some of them should be anchored to the start of the string, you have to use search(). And while linear complexity for ^ is expected, search() is not optimized for \A. So the original report is rejected, the behavior is expected and cannot be changed. It is not a bug. But some optimization can be added for \A, and perhaps the constant multiplier for ^ can be reduced too. |
|||
msg385138 - (view) | Author: Steven D'Aprano (steven.daprano) * ![]() |
Date: 2021-01-16 10:53 | |
On Sat, Jan 16, 2021 at 08:59:13AM +0000, Serhiy Storchaka wrote: > > Serhiy Storchaka <storchaka+cpython@gmail.com> added the comment: > > ^ matches not just the beginning of the string. It matches the > beginning of a line, i.e. an anchor just after '\n'. Only in MULTILINE mode. I completely agree that in multiline mode '^a' has to search the entire string. But in single line mode, if the first character isn't an 'a', there is no need to continue searching. As far as I can see from the docs, ^ in single line mode and \A always should be constant time. > So the original report is rejected, the behavior is expected and > cannot be changed. It is not a bug. I disagree that it is expected behaviour. "Match the start of the string" cannot possibly match anything except the start of the string, so who would expect it to keep scanning past the start of the string? (Multiline mode is, of course, different.) |
|||
msg385142 - (view) | Author: Arnim Rupp (2d4d) | Date: 2021-01-16 15:44 | |
some more observations, which might be helpful in tracking it down: x$ is much faster than ^x A$ is as slow as ^x $ python3 -m timeit -s "a = 'A'*100000000" "import re" "re.search('x$', a)" 10 loops, best of 5: 32.9 msec per loop $ python3 -m timeit -s "a = 'A'*100000000" "import re" "re.search('A$', a)" 1 loop, best of 5: 802 msec per loop |
History | |||
---|---|---|---|
Date | User | Action | Args |
2021-01-16 21:07:45 | 2d4d | set | nosy:
+ ezio.melotti, mrabarnett components: + Regular Expressions, - Library (Lib) |
2021-01-16 15:44:43 | 2d4d | set | messages: + msg385142 |
2021-01-16 10:53:05 | steven.daprano | set | messages: + msg385138 |
2021-01-16 08:59:13 | serhiy.storchaka | set | title: Regex performance problem with ^ aka AT_BEGINNING -> Optimize re.search() for \A (and maybe ^) messages: + msg385137 versions: - Python 3.9 |
2021-01-15 22:55:42 | terry.reedy | set | nosy:
+ serhiy.storchaka |
2021-01-11 09:27:24 | steven.daprano | set | nosy:
+ steven.daprano messages: + msg384805 versions: + Python 3.9 |
2021-01-10 21:58:21 | 2d4d | create |