
re: limit the maximum capturing group to 1,073,741,823, reduce sizeof(match_context). #91412

Closed
animalize mannequin opened this issue Apr 8, 2022 · 1 comment · Fixed by #32411
Labels
3.11 (only security fixes) · performance (Performance or resource usage) · stdlib (Python modules in the Lib dir) · topic-regex

Comments

animalize mannequin commented Apr 8, 2022

BPO 47256
Nosy @ezio-melotti, @serhiy-storchaka, @animalize
PRs
  • bpo-47256: re module, limit the maximum capturing group to 1,073,741,823, increasing the depth of backtracking. #32411

    Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2022-04-08.06:54:08.832>
    labels = ['3.11', 'library', 'performance']
    title = 're: limit the maximum capturing group to 1,073,741,823, reduce sizeof(match_context).'
    updated_at = <Date 2022-04-08.07:02:49.829>
    user = 'https://github.com/animalize'

    bugs.python.org fields:

    activity = <Date 2022-04-08.07:02:49.829>
    actor = 'malin'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2022-04-08.06:54:08.832>
    creator = 'malin'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 47256
    keywords = ['patch']
    message_count = 1.0
    messages = ['416960']
    nosy_count = 4.0
    nosy_names = ['ezio.melotti', 'mrabarnett', 'serhiy.storchaka', 'malin']
    pr_nums = ['32411']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'resource usage'
    url = 'https://bugs.python.org/issue47256'
    versions = ['Python 3.11']

    animalize mannequin commented Apr 8, 2022

    These changes reduce sizeof(match_context):

    • 32-bit build: 36 bytes, no change.
    • 64-bit build: 72 bytes -> 56 bytes.

    sre uses a heap-allocated stack of match_context structs to simulate recursive calls, so a smaller struct brings (as illustrated below):
    - deeper simulated recursion
    - less memory consumption
    - fewer memory reallocations
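
    A minimal demonstration of this design (assuming a typical CPython build; the pattern and sizes are illustrative): because the simulated recursion lives on the engine's own heap-allocated stack rather than the C or Python call stack, match depth is bounded by memory, not by the interpreter's recursion limit:

    import re
    import sys

    # ~100,000 simulated recursive calls succeed even though the
    # interpreter's own recursion limit is far smaller, because the
    # engine keeps its match_context stack on the heap.
    print(sys.getrecursionlimit())                     # typically 1000
    print(re.match(r'(?:ab)*', 'ab' * 100_000).end())  # 200000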

    Here is a test: if the stack size is limited to 1 GiB, the maximum workable value of n is:

    re.match(r'(ab)*', n * 'ab')   # need to save MARKs
    72 bytes: n = 11,184,808
    64 bytes: n = 12,201,609
    56 bytes: n = 13,421,770
    
    re.match(r'(?:ab)*', n * 'ab') # no need to save MARKs
    72 bytes: n = 13,421,770
    64 bytes: n = 14,913,078
    56 bytes: n = 16,777,213
    
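    A back-of-the-envelope check of these figures (the per-iteration overheads of ~8 bytes, and ~24 bytes when MARKs are saved, are inferred by fitting the numbers above rather than taken from the engine source):

    # Approximate n = (1 GiB stack) / (bytes pushed per iteration).
    # Assumed overhead per iteration: ~24 bytes with MARKs, ~8 without;
    # these constants are fitted to the measured values above.
    STACK = 1 << 30  # 1 GiB

    for ctx in (72, 64, 56):
        print(f"{ctx} bytes: "
              f"n ≈ {STACK // (ctx + 24):,} (MARKs), "
              f"n ≈ {STACK // (ctx + 8):,} (no MARKs)")
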

    1,073,741,823 capturing groups should be enough for almost all users.
    If the limit were instead 16,383 (a 2-byte integer), the context size could shrink further, but programmatically generated patterns might contain more capturing groups than that; see the sketch below.
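
    For example, a generated pattern (hypothetical, not from the report) can exceed 16,383 groups while staying far below the new limit:

    import re

    # 20,000 capturing groups: over a hypothetical 16,383 (2-byte) cap,
    # far under 1,073,741,823. All alternatives have the same length,
    # so only the intended branch can match.
    pattern = re.compile("|".join(f"(w{i:05d})" for i in range(20_000)))
    m = pattern.match("w12345")
    print(m.lastindex)  # 12346 (groups are numbered from 1)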

    1️⃣ Performance:

    Before
    regex_dna: Mean +- std dev: 149 ms +- 1 ms
    regex_effbot: Mean +- std dev: 2.22 ms +- 0.02 ms
    regex_v8: Mean +- std dev: 22.3 ms +- 0.1 ms
    my benchmark[1]: 13.9 sec +- 0.0 sec

    Commit 1. limit the maximum capture group to 1,073,741,823
    regex_dna: Mean +- std dev: 150 ms +- 1 ms
    regex_effbot: Mean +- std dev: 2.16 ms +- 0.02 ms
    regex_v8: Mean +- std dev: 22.3 ms +- 0.1 ms
    my benchmark: 13.8 sec +- 0.0 sec

    Commit 2. further reduce sizeof(SRE(match_context))
    regex_dna: Mean +- std dev: 150 ms +- 1 ms
    regex_effbot: Mean +- std dev: 2.16 ms +- 0.02 ms
    regex_v8: Mean +- std dev: 22.2 ms +- 0.1 ms
    my benchmark: 13.8 sec +- 0.1 sec

    If the types of toplevel/jump were further changed from int to char, sizeof(match_context) in a 32-bit build would drop from 36 to 32 bytes (still 56 in a 64-bit build). But that was slower in the 64-bit build, so I did not adopt it:
    regex_dna: Mean +- std dev: 150 ms +- 1 ms
    regex_effbot: Mean +- std dev: 2.18 ms +- 0.01 ms
    regex_v8: Mean +- std dev: 22.4 ms +- 0.1 ms
    my benchmark: 14.1 sec +- 0.0 sec

    2️⃣ The type of match_context.count is Py_ssize_t:
    - Changing it to a 4-byte integer would require modifying some engine code.
    - Keeping it as Py_ssize_t leaves room for SRE_MAXREPEAT to be >= 4 GiB in future versions; currently SRE_MAXREPEAT cannot be >= 4 GiB.
    So the type of match_context.count is left unchanged.
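
    Both limits are visible from Python (a quick check, assuming CPython; the exact values can vary by version and build):

    import _sre

    # Current compile-time limits exposed by the C engine (CPython).
    # MAXREPEAT also serves as the sentinel for unbounded repetition.
    print(_sre.MAXREPEAT)  # stays below 4 GiB on current builds
    print(_sre.MAXGROUPS)  # the capturing-group cap this issue changes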

    [1] My re benchmark; it uses 16 patterns to process 100 MiB of text data:
    https://github.com/animalize/re_benchmarks

    @animalize animalize mannequin added 3.11 only security fixes stdlib Python modules in the Lib dir performance Performance or resource usage labels Apr 8, 2022
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022