Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize re.search() for \A (and maybe ^) #87051

Closed
2d4d mannequin opened this issue Jan 10, 2021 · 6 comments
Closed

Optimize re.search() for \A (and maybe ^) #87051

2d4d mannequin opened this issue Jan 10, 2021 · 6 comments
Labels
3.10 only security fixes performance Performance or resource usage topic-regex

Comments

@2d4d
Copy link
Mannequin

2d4d mannequin commented Jan 10, 2021

BPO 42885
Nosy @ezio-melotti, @stevendaprano, @serhiy-storchaka, @2d4d
PRs
  • bpo-42885: Optimize search for regular expressions starting with "\A" or "^" #32021
  • Files
  • regex_timeit.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2022-03-22.15:28:31.421>
    created_at = <Date 2021-01-10.21:58:21.401>
    labels = ['expert-regex', '3.10', 'performance']
    title = 'Optimize re.search() for \\A (and maybe ^)'
    updated_at = <Date 2022-03-22.15:28:31.421>
    user = 'https://github.com/2d4d'

    bugs.python.org fields:

    activity = <Date 2022-03-22.15:28:31.421>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = True
    closed_date = <Date 2022-03-22.15:28:31.421>
    closer = 'serhiy.storchaka'
    components = ['Regular Expressions']
    creation = <Date 2021-01-10.21:58:21.401>
    creator = '2d4d'
    dependencies = []
    files = ['49733']
    hgrepos = []
    issue_num = 42885
    keywords = ['patch']
    message_count = 6.0
    messages = ['384782', '384805', '385137', '385138', '385142', '415789']
    nosy_count = 5.0
    nosy_names = ['ezio.melotti', 'mrabarnett', 'steven.daprano', 'serhiy.storchaka', '2d4d']
    pr_nums = ['32021']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue42885'
    versions = ['Python 3.10']

    @2d4d
    Copy link
    Mannequin Author

    2d4d mannequin commented Jan 10, 2021

    The re lib needs 7 seconds to check if a billion As start with an x. So e.g. this statement takes this long:

    re.search(r'^x', 'A' * 1000000000)

    It takes longer, the longer the string is. The string handling is not the problem, checking if it starts which an A takes just 0.00014 seconds. See output and code below:

    3.10.0a4+ (heads/master:d16f617, Jan 9 2021, 13:24:45)
    [GCC 7.5.0]
    testing string len: 100000
    re_test_false: 0.0008246829966083169
    testing string len: 1000000000
    re_test_false: 7.317708015005337
    testing string len: 1000000000
    re_test_true: 0.00014710200048284605

    import re, timeit, functools, sys
    
    def re_test_true(string):
        print("testing string len: ", len(string))
        re.search(r'^A', string)
    
    def re_test_false(string):
        print("testing string len: ", len(string))
        re.search(r'^x', string)
    
    print(sys.version)
    
    huge_string = 'A' * 100000
    print('re_test_false: ', timeit.timeit(functools.partial(re_test_false, huge_string), number=1))
    
    huge_string = 'A' * 1000000000
    print('re_test_false: ', timeit.timeit(functools.partial(re_test_false, huge_string), number=1))

    print('re_test_true: ', timeit.timeit(functools.partial(re_test_true, huge_string), number=1))

    @2d4d 2d4d mannequin added stdlib Python modules in the Lib dir 3.10 only security fixes performance Performance or resource usage labels Jan 10, 2021
    @stevendaprano
    Copy link
    Member

    I'm getting similar results in Python 3.9.

    [steve ~]$ python3.9 -m timeit -s "a = 'A'*10000" -s "import re" "re.search('^x', a)"
    5000 loops, best of 5: 67.3 usec per loop
    [steve ~]$ python3.9 -m timeit -s "a = 'A'*100000" -s "import re" "re.search('^x', a)"
    500 loops, best of 5: 639 usec per loop
    [steve ~]$ python3.9 -m timeit -s "a = 'A'*1000000" -s "import re" "re.search('^x', a)"
    50 loops, best of 5: 6.27 msec per loop
    [steve ~]$ python3.9 -m timeit -s "a = 'A'*10000000" -s "import re" "re.search('^x', a)"
    5 loops, best of 5: 62.8 msec per loop
    [steve ~]$ python3.9 -m timeit -s "a = 'A'*100000000" -s "import re" "re.search('^x', a)"
    1 loop, best of 5: 654 msec per loop

    It looks like the time is roughly linear in the length of the string.

    I get the same result as far back as Python 2.7 (I haven't tried older versions).

    [steve ~]$ python2.7 -m timeit -s "a = 'A'*10000" -s "import re" "re.search('^x', a)"
    10000 loops, best of 3: 75.7 usec per loop
    [steve ~]$ python2.7 -m timeit -s "a = 'A'*10000000" -s "import re" "re.search('^x', a)"
    10 loops, best of 3: 73.4 msec per loop

    I would have expected essentially constant time, as in re.match:

    [steve ~]$ python3.9 -m timeit -s "a = 'A'*10000" -s "import re" "re.match('x', a)"
    500000 loops, best of 5: 560 nsec per loop
    [steve ~]$ python3.9 -m timeit -s "a = 'A'*100000000" -s "import re" "re.match('x', a)"
    500000 loops, best of 5: 561 nsec per loop

    @stevendaprano stevendaprano added 3.9 only security fixes labels Jan 11, 2021
    @serhiy-storchaka
    Copy link
    Member

    ^ matches not just the beginning of the string. It matches the beginning of a line, i.e. an anchor just after '\n'. If the input string contains '\n', the result cannot be found less than by linear time. If you want to check if the beginning of the string matches a regular expression, it is better to use match(). If you want the check if the whole string matches it, it is better to use fullmatch().

    But in some cases you cannot choose what method to use. If you have a set of patterns, and only some of them should be anchored to the start of the string, you have to use search(). And while linear complexity for ^ is expected, search() is not optimized for \A.

    So the original report is rejected, the behavior is expected and cannot be changed. It is not a bug. But some optimization can be added for \A, and perhaps the constant multiplier for ^ can be reduced too.

    @serhiy-storchaka serhiy-storchaka removed the 3.9 only security fixes label Jan 16, 2021
    @serhiy-storchaka serhiy-storchaka changed the title Regex performance problem with ^ aka AT_BEGINNING Optimize re.search() for \A (and maybe ^) Jan 16, 2021
    @serhiy-storchaka serhiy-storchaka removed the 3.9 only security fixes label Jan 16, 2021
    @serhiy-storchaka serhiy-storchaka changed the title Regex performance problem with ^ aka AT_BEGINNING Optimize re.search() for \A (and maybe ^) Jan 16, 2021
    @stevendaprano
    Copy link
    Member

    On Sat, Jan 16, 2021 at 08:59:13AM +0000, Serhiy Storchaka wrote:

    Serhiy Storchaka <storchaka+cpython@gmail.com> added the comment:

    ^ matches not just the beginning of the string. It matches the
    beginning of a line, i.e. an anchor just after '\n'.

    Only in MULTILINE mode.

    I completely agree that in multiline mode '^a' has to search the entire
    string. But in single line mode, if the first character isn't an 'a',
    there is no need to continue searching.

    As far as I can see from the docs, ^ in single line mode and \A always
    should be constant time.

    So the original report is rejected, the behavior is expected and
    cannot be changed. It is not a bug.

    I disagree that it is expected behaviour. "Match the start of the
    string" cannot possibly match anything except the start of the string,
    so who would expect it to keep scanning past the start of the string?

    (Multiline mode is, of course, different.)

    @2d4d
    Copy link
    Mannequin Author

    2d4d mannequin commented Jan 16, 2021

    some more observations, which might be helpful in tracking it down:

    x$ is much faster than ^x
    A$ is as slow as ^x

    $ python3 -m timeit -s "a = 'A'*100000000" "import re" "re.search('x$', a)"
    10 loops, best of 5: 32.9 msec per loop
    
    $ python3 -m timeit -s "a = 'A'*100000000" "import re" "re.search('A$', a)"
    1 loop, best of 5: 802 msec per loop

    @2d4d 2d4d mannequin added topic-regex and removed stdlib Python modules in the Lib dir labels Jan 16, 2021
    @serhiy-storchaka
    Copy link
    Member

    New changeset 492d410 by Serhiy Storchaka in branch 'main':
    bpo-42885: Optimize search for regular expressions starting with "\A" or "^" (GH-32021)
    492d410

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.10 only security fixes performance Performance or resource usage topic-regex
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants