This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Memory leak in Modules/sre_lib.h
Type: resource usage Stage: resolved
Components: Regular Expressions Versions: Python 3.11
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: abacabadabacaba, alexei.romanov, ezio.melotti, malin, mrabarnett, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2015-03-17 16:56 by abacabadabacaba, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
sre_clean_repeat_data.patch serhiy.storchaka, 2015-03-17 18:31 review
Pull Requests
URL Status Linked Edit
PR 11926 closed malin, 2019-02-19 06:13
PR 12160 closed malin, 2019-03-04 13:13
PR 32188 closed malin, 2022-03-30 08:11
PR 32223 closed malin, 2022-04-01 05:45
PR 32283 merged malin, 2022-04-03 08:23
Messages (13)
msg238313 - (view) Author: Evgeny Kapun (abacabadabacaba) Date: 2015-03-17 16:56
In Modules/sre_lib.h on line 882 [1], a block of memory is allocated. If SRE(match) function later terminates abruptly, either because of a signal or because subsequent memory allocation fails, this block is never released.

[1] https://hg.python.org/cpython/file/c89f7c34e356/Modules/sre_lib.h#l882
msg238320 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-03-17 17:09
There is maybe a bug. Can you show an example of regex and a text where the memory leak occurs? You can use the tracemalloc module to check if there is a memory leak. Or use sys.getcounts() if you compiled Python in debug mode.

sre_lib.h is very complex, it uses the C instruction "goto" with regex bytecodes.

"DO_JUMP(JUMP_REPEAT, jump_repeat, ctx->pattern+ctx->pattern[0]);" calls "goto entrace" to execute following bytecodes, but later it comes back after DO_JUMP() with the "jump_repeat:" label:

https://hg.python.org/cpython/file/c89f7c34e356/Modules/sre_lib.h#l1180
msg238324 - (view) Author: Evgeny Kapun (abacabadabacaba) Date: 2015-03-17 17:22
Memory leak only happens if match operation terminates abruptly, e.g. because of SIGINT. In this case, DO_JUMP doesn't come back.
msg238326 - (view) Author: Evgeny Kapun (abacabadabacaba) Date: 2015-03-17 17:44
Tracemalloc code:

    import re
    import signal
    import tracemalloc

    class AlarmError(Exception):
        pass
    def handle_alarm(signal, frame):
        raise AlarmError
    signal.signal(signal.SIGALRM, handle_alarm)

    s1 = tracemalloc.take_snapshot()
    for _ in range(20):
        try:
            signal.alarm(1)
            re.match('(?:a|a|(?=b)){1000}', 'a'*999)
            raise RuntimeError
        except AlarmError:
            pass
    s2 = tracemalloc.take_snapshot()
    res = s2.compare_to(s1, 'lineno')
    for e in res[:10]:
        print(e)

For me, it shows almost 3 MiB allocated in re.py.
msg238328 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-03-17 18:31
May be this patch helps.
msg238343 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-03-17 21:35
Oh cool, you wrote a script to reproduce the issue! And Serhiy wrote a patch, great! Great job guys.

sre_clean_repeat_data.patch looks good to me.

@Serhiy: Can you try the example to ensure that it fixes the issue? If yes, go ahead!
msg238388 - (view) Author: Evgeny Kapun (abacabadabacaba) Date: 2015-03-18 09:02
This patch doesn't fix the issue. The problem is that the list starting with state->repeat doesn't necessarily contains all repeat contexts that are allocated. Indeed, here [1] and here [2] repeat contexts are temporarily removed from the list. If the match procedure terminates abruptly, they are not added back.

[1] https://hg.python.org/cpython/file/c89f7c34e356/Modules/sre_lib.h#l963
[2] https://hg.python.org/cpython/file/c89f7c34e356/Modules/sre_lib.h#l1002
msg335889 - (view) Author: Ma Lin (malin) * Date: 2019-02-19 06:13
Try to allocate SRE_REPEAT on state's stack, the performance has not changed significantly.

It passes the other tests, except this one (test_stack_overflow):
https://github.com/python/cpython/blob/v3.8.0a1/Lib/test/test_re.py#L1225-L1230

I'll try to fix issue35859, issue9134 first.
msg337113 - (view) Author: Ma Lin (malin) * Date: 2019-03-04 13:15
PR11926 (closed) tried to allocate SRE_REPEAT on state's stack.
It's feasible, but messes up the code in sre_lib.h, and reduces performance a little (roughly 6% slower), so I gave up this solution.

PR12160 uses a memory pool, this solution doesn't mess up the code.

🔸For infrequent alloc/free scenes, it adds a small overhead:

s = 'a'
p = re.compile(r'(a)?')
p.match(s)  # <- measure this statement

before patch: 316 ns  +- 19 ns
after patch:  324 ns  +- 11 ns, 2.5% slower.
(by perf module)

🔸For very frequent alloc/free scenes, it brings a speedup:

s = 200_000_000 * 'a'
p = re.compile(r'.*?(?:bb)+')
p.match(s)  # <- measure this statement

before patch: 7.16 sec
after patch:  5.82 sec, 18.7% faster.
(best of 10 tests)

🔸I tested in a real case that use 17 patterns to process 100MB data:

before patch: 27.09 sec
after patch:  26.78 sec, 1.1% faster.
(best of 4 tests)
msg416266 - (view) Author: Ma Lin (malin) * Date: 2022-03-29 15:13
My PR methods are suboptimal, so I closed them.

The number of REPEAT can be counted when compiling a pattern, and allocate a `SRE_REPEAT` array in `SRE_STATE` (with that number items).

It seem at any time, a REPEAT will only have one in active, so a `SRE_REPEAT` array is fine.
regex module does like this:
https://github.com/mrabarnett/mrab-regex/blob/hg/regex_3/_regex.c#L18287-L18288

Can the number of REPEAT be placed in `SRE_OP_INFO`?
And add a field to `SRE_OP_REPEAT` to indicate the index of this REPEAT.
msg416269 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2022-03-29 15:55
This looks promising. Please, go ahead! You are free to add any fields to any opcodes. It may break some third-party code which generates compiled patterns from a sequence of opcodes, it the stability of this interface was not promised. And they will be broken in any case due to reorganizing of internal code (issue47152).
msg416630 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2022-04-03 16:16
New changeset 6e3eee5c11b539e9aab39cff783acf57838c355a by Ma Lin in branch 'main':
bpo-23689: re module, fix memory leak when a match is terminated by a signal or memory allocation failure (GH-32283)
https://github.com/python/cpython/commit/6e3eee5c11b539e9aab39cff783acf57838c355a
msg416631 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2022-04-03 16:21
Thank you Ma Lin for all your work.

The fix changes interfaces of some internal functions which can be used in third-party code, and the bug occurs only in special circumstances, so it is not practical to backport it.
History
Date User Action Args
2022-04-11 14:58:14adminsetgithub: 67877
2022-04-03 16:21:31serhiy.storchakasetstatus: open -> closed
versions: + Python 3.11, - Python 3.8
messages: + msg416631

resolution: fixed
stage: patch review -> resolved
2022-04-03 16:16:41serhiy.storchakasetmessages: + msg416630
2022-04-03 08:23:30malinsetpull_requests: + pull_request30344
2022-04-01 05:45:58malinsetpull_requests: + pull_request30298
2022-03-30 08:11:04malinsetpull_requests: + pull_request30265
2022-03-29 15:55:22serhiy.storchakasetmessages: + msg416269
2022-03-29 15:13:16malinsetmessages: + msg416266
2019-03-04 13:15:34malinsetmessages: + msg337113
2019-03-04 13:13:50malinsetpull_requests: + pull_request12158
2019-02-19 06:13:46malinsetpull_requests: + pull_request11951
2019-02-19 06:13:32malinsetmessages: + msg335889
versions: + Python 3.8, - Python 2.7, Python 3.6, Python 3.7
2019-02-18 15:01:06serhiy.storchakasetnosy: + malin
2017-11-16 13:26:58serhiy.storchakasetversions: + Python 3.6, Python 3.7, - Python 3.4, Python 3.5
2015-05-16 17:16:06serhiy.storchakasetassignee: serhiy.storchaka
2015-03-18 12:06:57alexei.romanovsetnosy: + alexei.romanov
2015-03-18 09:02:36abacabadabacabasetmessages: + msg238388
2015-03-17 21:35:19vstinnersetmessages: + msg238343
2015-03-17 18:31:58serhiy.storchakasetfiles: + sre_clean_repeat_data.patch
versions: + Python 2.7, Python 3.5
messages: + msg238328

keywords: + patch
stage: patch review
2015-03-17 18:00:24serhiy.storchakasetnosy: + serhiy.storchaka
2015-03-17 17:44:24abacabadabacabasetmessages: + msg238326
2015-03-17 17:22:18abacabadabacabasetmessages: + msg238324
2015-03-17 17:09:24vstinnersetnosy: + vstinner
messages: + msg238320
2015-03-17 16:56:20abacabadabacabacreate