Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak in Modules/sre_lib.h #67877

Open
abacabadabacaba mannequin opened this issue Mar 17, 2015 · 14 comments
Open

Memory leak in Modules/sre_lib.h #67877

abacabadabacaba mannequin opened this issue Mar 17, 2015 · 14 comments
Assignees
Labels
3.11 only security fixes topic-regex type-bug An unexpected behavior, bug, or error

Comments

@abacabadabacaba
Copy link
Mannequin

abacabadabacaba mannequin commented Mar 17, 2015

BPO 23689
Nosy @vstinner, @ezio-melotti, @serhiy-storchaka, @animalize
PRs
  • [WIP] bpo-23689: Memory leak in Modules/sre_lib.h #11926
  • bpo-23689: re module, allocate SRE_REPEAT in a memory pool #12160
  • [WIP] bpo-23689: re module, fix memory leak when a match is terminated by a signal #32188
  • bpo-23689: re module, fix memory leak when a match is terminated by a signal or memory allocation failure #32223
  • bpo-23689: re module, fix memory leak when a match is terminated by a signal or memory allocation failure #32283
  • Files
  • sre_clean_repeat_data.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2022-04-03.16:21:31.788>
    created_at = <Date 2015-03-17.16:56:20.153>
    labels = ['expert-regex', '3.11', 'performance']
    title = 'Memory leak in Modules/sre_lib.h'
    updated_at = <Date 2022-04-03.16:21:31.787>
    user = 'https://bugs.python.org/abacabadabacaba'

    bugs.python.org fields:

    activity = <Date 2022-04-03.16:21:31.787>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2022-04-03.16:21:31.788>
    closer = 'serhiy.storchaka'
    components = ['Regular Expressions']
    creation = <Date 2015-03-17.16:56:20.153>
    creator = 'abacabadabacaba'
    dependencies = []
    files = ['38529']
    hgrepos = []
    issue_num = 23689
    keywords = ['patch']
    message_count = 13.0
    messages = ['238313', '238320', '238324', '238326', '238328', '238343', '238388', '335889', '337113', '416266', '416269', '416630', '416631']
    nosy_count = 7.0
    nosy_names = ['vstinner', 'ezio.melotti', 'mrabarnett', 'abacabadabacaba', 'serhiy.storchaka', 'malin', 'alexei.romanov']
    pr_nums = ['11926', '12160', '32188', '32223', '32283']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'resource usage'
    url = 'https://bugs.python.org/issue23689'
    versions = ['Python 3.11']

    @abacabadabacaba
    Copy link
    Mannequin Author

    abacabadabacaba mannequin commented Mar 17, 2015

    In Modules/sre_lib.h on line 882 [1], a block of memory is allocated. If SRE(match) function later terminates abruptly, either because of a signal or because subsequent memory allocation fails, this block is never released.

    [1] https://hg.python.org/cpython/file/c89f7c34e356/Modules/sre_lib.h#l882

    @abacabadabacaba abacabadabacaba mannequin added topic-regex performance Performance or resource usage labels Mar 17, 2015
    @vstinner
    Copy link
    Member

    There is maybe a bug. Can you show an example of regex and a text where the memory leak occurs? You can use the tracemalloc module to check if there is a memory leak. Or use sys.getcounts() if you compiled Python in debug mode.

    sre_lib.h is very complex, it uses the C instruction "goto" with regex bytecodes.

    "DO_JUMP(JUMP_REPEAT, jump_repeat, ctx->pattern+ctx->pattern[0]);" calls "goto entrace" to execute following bytecodes, but later it comes back after DO_JUMP() with the "jump_repeat:" label:

    https://hg.python.org/cpython/file/c89f7c34e356/Modules/sre_lib.h#l1180

    @abacabadabacaba
    Copy link
    Mannequin Author

    abacabadabacaba mannequin commented Mar 17, 2015

    Memory leak only happens if match operation terminates abruptly, e.g. because of SIGINT. In this case, DO_JUMP doesn't come back.

    @abacabadabacaba
    Copy link
    Mannequin Author

    abacabadabacaba mannequin commented Mar 17, 2015

    Tracemalloc code:

        import re
        import signal
        import tracemalloc
    
        class AlarmError(Exception):
            pass
        def handle_alarm(signal, frame):
            raise AlarmError
        signal.signal(signal.SIGALRM, handle_alarm)
    
        s1 = tracemalloc.take_snapshot()
        for _ in range(20):
            try:
                signal.alarm(1)
                re.match('(?:a|a|(?=b)){1000}', 'a'*999)
                raise RuntimeError
            except AlarmError:
                pass
        s2 = tracemalloc.take_snapshot()
        res = s2.compare_to(s1, 'lineno')
        for e in res[:10]:
            print(e)

    For me, it shows almost 3 MiB allocated in re.py.

    @serhiy-storchaka
    Copy link
    Member

    May be this patch helps.

    @vstinner
    Copy link
    Member

    Oh cool, you wrote a script to reproduce the issue! And Serhiy wrote a patch, great! Great job guys.

    sre_clean_repeat_data.patch looks good to me.

    @serhiy: Can you try the example to ensure that it fixes the issue? If yes, go ahead!

    @abacabadabacaba
    Copy link
    Mannequin Author

    abacabadabacaba mannequin commented Mar 18, 2015

    This patch doesn't fix the issue. The problem is that the list starting with state->repeat doesn't necessarily contains all repeat contexts that are allocated. Indeed, here [1] and here [2] repeat contexts are temporarily removed from the list. If the match procedure terminates abruptly, they are not added back.

    [1] https://hg.python.org/cpython/file/c89f7c34e356/Modules/sre_lib.h#l963
    [2] https://hg.python.org/cpython/file/c89f7c34e356/Modules/sre_lib.h#l1002

    @serhiy-storchaka serhiy-storchaka self-assigned this May 16, 2015
    @serhiy-storchaka serhiy-storchaka added the 3.7 (EOL) end of life label Nov 16, 2017
    @animalize
    Copy link
    Mannequin

    animalize mannequin commented Feb 19, 2019

    Try to allocate SRE_REPEAT on state's stack, the performance has not changed significantly.

    It passes the other tests, except this one (test_stack_overflow):
    https://github.com/python/cpython/blob/v3.8.0a1/Lib/test/test_re.py#L1225-L1230

    I'll try to fix bpo-35859, bpo-9134 first.

    @animalize animalize mannequin added 3.8 only security fixes and removed 3.7 (EOL) end of life labels Feb 19, 2019
    @animalize
    Copy link
    Mannequin

    animalize mannequin commented Mar 4, 2019

    PR11926 (closed) tried to allocate SRE_REPEAT on state's stack.
    It's feasible, but messes up the code in sre_lib.h, and reduces performance a little (roughly 6% slower), so I gave up this solution.

    PR12160 uses a memory pool, this solution doesn't mess up the code.

    🔸For infrequent alloc/free scenes, it adds a small overhead:

    s = 'a'
    p = re.compile(r'(a)?')
    p.match(s)  # <- measure this statement

    before patch: 316 ns +- 19 ns
    after patch: 324 ns +- 11 ns, 2.5% slower.
    (by perf module)

    🔸For very frequent alloc/free scenes, it brings a speedup:

    s = 200_000_000 * 'a'
    p = re.compile(r'.*?(?:bb)+')
    p.match(s)  # <- measure this statement

    before patch: 7.16 sec
    after patch: 5.82 sec, 18.7% faster.
    (best of 10 tests)

    🔸I tested in a real case that use 17 patterns to process 100MB data:

    before patch: 27.09 sec
    after patch: 26.78 sec, 1.1% faster.
    (best of 4 tests)

    @animalize
    Copy link
    Mannequin

    animalize mannequin commented Mar 29, 2022

    My PR methods are suboptimal, so I closed them.

    The number of REPEAT can be counted when compiling a pattern, and allocate a SRE_REPEAT array in SRE_STATE (with that number items).

    It seem at any time, a REPEAT will only have one in active, so a SRE_REPEAT array is fine.
    regex module does like this:
    https://github.com/mrabarnett/mrab-regex/blob/hg/regex_3/_regex.c#L18287-L18288

    Can the number of REPEAT be placed in SRE_OP_INFO?
    And add a field to SRE_OP_REPEAT to indicate the index of this REPEAT.

    @serhiy-storchaka
    Copy link
    Member

    This looks promising. Please, go ahead! You are free to add any fields to any opcodes. It may break some third-party code which generates compiled patterns from a sequence of opcodes, it the stability of this interface was not promised. And they will be broken in any case due to reorganizing of internal code (bpo-47152).

    @serhiy-storchaka
    Copy link
    Member

    New changeset 6e3eee5 by Ma Lin in branch 'main':
    bpo-23689: re module, fix memory leak when a match is terminated by a signal or memory allocation failure (GH-32283)
    6e3eee5

    @serhiy-storchaka
    Copy link
    Member

    Thank you Ma Lin for all your work.

    The fix changes interfaces of some internal functions which can be used in third-party code, and the bug occurs only in special circumstances, so it is not practical to backport it.

    @serhiy-storchaka serhiy-storchaka added 3.11 only security fixes and removed 3.8 only security fixes labels Apr 3, 2022
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @gpshead
    Copy link
    Member

    gpshead commented Jun 15, 2022

    PR #93882 reverts the commit that resolved this as it causes a major performance and usability regression: #91404.

    @gpshead gpshead reopened this Jun 15, 2022
    @iritkatriel iritkatriel added type-bug An unexpected behavior, bug, or error and removed performance Performance or resource usage labels Aug 17, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.11 only security fixes topic-regex type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants