Issue 47256: re: limit the maximum capturing group to 1,073,741,823, reduce sizeof(match_context).

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/91412

classification

Title:	re: limit the maximum capturing group to 1,073,741,823, reduce sizeof(match_context).
Type:	resource usage	Stage:	patch review
Components:	Library (Lib)	Versions:	Python 3.11

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	ezio.melotti, malin, mrabarnett, serhiy.storchaka
Priority:	normal	Keywords:	patch

Created on 2022-04-08 06:54 by malin, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL	Status	Linked	Edit
PR 32411	open	malin, 2022-04-08 07:02

Messages (1)
msg416960 - (view)	Author: Ma Lin (malin) *	Date: 2022-04-08 06:54
These changes reduce sizeof(match_context): - 32-bit build: 36 bytes, no change. - 64-bit build: 72 bytes -> 56 bytes. sre uses stack and `match_context` struct to simulate recursive call, smaller struct brings: - deeper recursive call - less memory consume - less memory realloc Here is a test, if limit the stack size to 1 GiB, the max available value of n is: re.match(r'(ab)', n 'ab') # need to save MARKs 72 bytes: n = 11,184,808 64 bytes: n = 12,201,609 56 bytes: n = 13,421,770 re.match(r'(?:ab)', n 'ab') # no need to save MARKs 72 bytes: n = 13,421,770 64 bytes: n = 14,913,078 56 bytes: n = 16,777,213 1,073,741,823 capturing groups should enough for almost all users. If limit it to 16,383 (2-byte integer), the context size may reduce more. But maybe some patterns generated by program will have more than this number of capturing groups. 1️⃣Performance: Before regex_dna: Mean +- std dev: 149 ms +- 1 ms regex_effbot: Mean +- std dev: 2.22 ms +- 0.02 ms regex_v8: Mean +- std dev: 22.3 ms +- 0.1 ms my benchmark[1]: 13.9 sec +- 0.0 sec Commit 1. limit the maximum capture group to 1,073,741,823 regex_dna: Mean +- std dev: 150 ms +- 1 ms regex_effbot: Mean +- std dev: 2.16 ms +- 0.02 ms regex_v8: Mean +- std dev: 22.3 ms +- 0.1 ms my benchmark: 13.8 sec +- 0.0 sec Commit 2. further reduce sizeof(SRE(match_context)) regex_dna: Mean +- std dev: 150 ms +- 1 ms regex_effbot: Mean +- std dev: 2.16 ms +- 0.02 ms regex_v8: Mean +- std dev: 22.2 ms +- 0.1 ms my benchmark: 13.8 sec +- 0.1 sec If further change the types of toplevel/jump from int to char, in 32-bit build sizeof(match_context) will be reduced from 36 to 32 (In 64-bit build still 56). But it's slower on 64-bit build, so I didn't adopt it: regex_dna: Mean +- std dev: 150 ms +- 1 ms regex_effbot: Mean +- std dev: 2.18 ms +- 0.01 ms regex_v8: Mean +- std dev: 22.4 ms +- 0.1 ms my benchmark: 14.1 sec +- 0.0 sec 2️⃣ The type of match_context.count is Py_ssize_t - If change it to 4-byte integer, need to modify some engine code. - If keep it as Py_ssize_t, SRE_MAXREPEAT may >= 4 GiB in future versions. Currently SRE_MAXREPEAT can't >= 4 GiB. So the type of match_context.count is unchanged. [1] My re benchmark, it uses 16 patterns to process 100 MiB text data: https://github.com/animalize/re_benchmarks

msg416960 - (view)

Author: Ma Lin (malin) *

Date: 2022-04-08 06:54

These changes reduce sizeof(match_context):
- 32-bit build: 36 bytes, no change.
- 64-bit build: 72 bytes -> 56 bytes.

sre uses stack and `match_context` struct to simulate recursive call, smaller struct brings:
    - deeper recursive call
    - less memory consume
    - less memory realloc

Here is a test, if limit the stack size to 1 GiB, the max available value of n is:

    re.match(r'(ab)*', n * 'ab')   # need to save MARKs
    72 bytes: n = 11,184,808
    64 bytes: n = 12,201,609
    56 bytes: n = 13,421,770

    re.match(r'(?:ab)*', n * 'ab') # no need to save MARKs
    72 bytes: n = 13,421,770
    64 bytes: n = 14,913,078
    56 bytes: n = 16,777,213
    
1,073,741,823 capturing groups should enough for almost all users.
If limit it to 16,383 (2-byte integer), the context size may reduce more. But maybe some patterns generated by program will have more than this number of capturing groups.

1️⃣Performance:

Before
    regex_dna: Mean +- std dev: 149 ms +- 1 ms
    regex_effbot: Mean +- std dev: 2.22 ms +- 0.02 ms
    regex_v8: Mean +- std dev: 22.3 ms +- 0.1 ms
    my benchmark[1]: 13.9 sec +- 0.0 sec

Commit 1. limit the maximum capture group to 1,073,741,823
    regex_dna: Mean +- std dev: 150 ms +- 1 ms
    regex_effbot: Mean +- std dev: 2.16 ms +- 0.02 ms
    regex_v8: Mean +- std dev: 22.3 ms +- 0.1 ms
    my benchmark: 13.8 sec +- 0.0 sec

Commit 2. further reduce sizeof(SRE(match_context))
    regex_dna: Mean +- std dev: 150 ms +- 1 ms
    regex_effbot: Mean +- std dev: 2.16 ms +- 0.02 ms
    regex_v8: Mean +- std dev: 22.2 ms +- 0.1 ms
    my benchmark: 13.8 sec +- 0.1 sec

If further change the types of toplevel/jump from int to char, in 32-bit build sizeof(match_context) will be reduced from 36 to 32 (In 64-bit build still 56). But it's slower on 64-bit build, so I didn't adopt it:
    regex_dna: Mean +- std dev: 150 ms +- 1 ms
    regex_effbot: Mean +- std dev: 2.18 ms +- 0.01 ms
    regex_v8: Mean +- std dev: 22.4 ms +- 0.1 ms
    my benchmark: 14.1 sec +- 0.0 sec

2️⃣ The type of match_context.count is Py_ssize_t
    - If change it to 4-byte integer, need to modify some engine code.
    - If keep it as Py_ssize_t, SRE_MAXREPEAT may >= 4 GiB in future versions.  
      Currently SRE_MAXREPEAT can't >= 4 GiB.
So the type of match_context.count is unchanged.

[1] My re benchmark, it uses 16 patterns to process 100 MiB text data:
https://github.com/animalize/re_benchmarks

History
Date	User	Action	Args
2022-04-11 14:59:58	admin	set	github: 91412
2022-04-08 07:02:49	malin	set	keywords: + patch stage: patch review pull_requests: + pull_request30437
2022-04-08 06:54:08	malin	create