This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Capture behavior depends on the order of an alternation
Type: behavior Stage: resolved
Components: Library (Lib), Regular Expressions Versions: Python 3.11
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: davisjam, effbot, ezio.melotti, malin, mrabarnett, rhettinger, serhiy.storchaka, tim.peters
Priority: normal Keywords: patch, patch, patch

Created on 2019-01-30 14:16 by davisjam, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
confusing-regex-behavior.py davisjam, 2019-01-30 14:16 Demonstrate confusing regex behavior
test_scripts_for_7_engines.zip malin, 2019-03-01 13:31 The comment in these scripts described how to use them
t.py malin, 2019-03-12 13:56 some micro benchmarks
Pull Requests
URL Status Linked Edit
PR 11756 closed malin, 2019-02-04 15:55
PR 12134 closed malin, 2019-03-02 00:20
PR 12288 closed malin, 2019-03-12 13:49
PR 12289 closed malin, 2019-03-12 13:50
PR 12290 closed malin, 2019-03-12 13:51
PR 12427 merged malin, 2019-03-19 06:40
Messages (19)
msg334560 - (view) Author: James Davis (davisjam) * Date: 2019-01-30 14:16
I have two regexes: /(a|ab)*?b/ and /(ab|a)*?b/.
If I re.search the string "ab" for these regexes, I get inconsistent behavior.
Specifically, /(a|ab)*?b/ matches with capture "a", while /(ab|a)*?b/ matches with an empty capture group.

I am not actually sure which behavior is correct.


Interpretation 1: The (ab|a) clause matches the a, satisfying the (ab|a)*? once, and the engine proceeds to the b and completes. The capture group ends up containing "a".

Interpretation 2: The (ab|a) clause matches the a. Since the clause is marked with *, the engine repeats the attempt and finds nothing the second time. It proceeds to the b and completes. Because the second match attempt on (ab|a) found nothing, the capture group ends up empty.

The behavior depends on both the order of (ab|a) vs. (a|ab), and the use of the non-greedy quantifier.

I cannot see why changing the order of the alternation should have this effect.

The change in behavior occurs in the built-in "re" module but not in the competing "regex" module.
The behavior is consistent in both Python 2.7 and Python 3.5. I have not tested other versions.

I have included the confusing-regex-behavior.py file for troubleshooting.

Below is the behavior for matches on these and many variants.
I find the following lines the most striking:

Regex pattern                    matched?       matched string     captured content
-------------------- -------------------- -------------------- --------------------
(ab|a)*?b                            True                   ab                ('',)
(ab|a)+?b                            True                   ab                ('',)
(ab|a){0,}?b                         True                   ab                ('',)
(ab|a){0,2}?b                        True                   ab                ('',)
(ab|a){0,1}?b                        True                   ab               ('a',)
(ab|a)*b                             True                   ab               ('a',)
(ab|a)+b                             True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)+?b                            True                   ab               ('a',)


(08:58:48) jamie@woody ~ $ python3 /tmp/confusing-regex-behavior.py 


Behavior from re


Regex pattern                    matched?       matched string     captured content
-------------------- -------------------- -------------------- --------------------
(ab|a)*?b                            True                   ab                ('',)
(ab|a)+?b                            True                   ab                ('',)
(ab|a){0,}?b                         True                   ab                ('',)
(ab|a){0,2}?b                        True                   ab                ('',)
(ab|a)?b                             True                   ab               ('a',)
(ab|a)??b                            True                   ab               ('a',)
(ab|a)b                              True                   ab               ('a',)
(ab|a){0,1}?b                        True                   ab               ('a',)
(ab|a)*b                             True                   ab               ('a',)
(ab|a)+b                             True                   ab               ('a',)
(a|ab)*b                             True                   ab               ('a',)
(a|ab)+b                             True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)+?b                            True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(bb|a)*?b                            True                   ab               ('a',)
((?:ab|a)*?)b                        True                   ab               ('a',)
((?:a|ab)*?)b                        True                   ab               ('a',)


Behavior from regex


Regex pattern                    matched?       matched string     captured content
-------------------- -------------------- -------------------- --------------------
(ab|a)*?b                            True                   ab               ('a',)
(ab|a)+?b                            True                   ab               ('a',)
(ab|a){0,}?b                         True                   ab               ('a',)
(ab|a){0,2}?b                        True                   ab               ('a',)
(ab|a)?b                             True                   ab               ('a',)
(ab|a)??b                            True                   ab               ('a',)
(ab|a)b                              True                   ab               ('a',)
(ab|a){0,1}?b                        True                   ab               ('a',)
(ab|a)*b                             True                   ab               ('a',)
(ab|a)+b                             True                   ab               ('a',)
(a|ab)*b                             True                   ab               ('a',)
(a|ab)+b                             True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)+?b                            True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(bb|a)*?b                            True                   ab               ('a',)
((?:ab|a)*?)b                        True                   ab               ('a',)
((?:a|ab)*?)b                        True                   ab               ('a',)
msg334573 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-01-30 16:59
> I cannot see why changing the order of the alternation should have this effect.

The first regex, r'(a|ab)*?b', looks for the first alternative group by matching left-to-right [1] stopping at the first matching alternation "a".  Roughly, the regex simplifies to r'(a)*?b' giving 'a' in the captured group.

The second regex, r'(ab|a)*?b', looks for the first  alternative group by matching left-to-right [1] stopping at the first matching alternation "ab".  Roughly, the regex simplifies to r'(ab)*?b' giving '' in the captured group.

From there, I'm not clear on how a non-greedy kleene-star works with capturing groups and with the overall span().  A starting point would be to look at the re.DEBUG output for each pattern [2][3].

[1] From the re docs for the alternation operator:
As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy.

[2] re.DEBUG output for r'(a|ab)*?b'
 0. INFO 4 0b0 1 MAXREPEAT (to 5)
 5: REPEAT 19 0 MAXREPEAT (to 25)
 9.   MARK 0
11.   LITERAL 0x61 ('a')
13.   BRANCH 3 (to 17)
15.     JUMP 7 (to 23)
17:   branch 5 (to 22)
18.     LITERAL 0x62 ('b')
20.     JUMP 2 (to 23)
22:   FAILURE
23:   MARK 1
25: MIN_UNTIL
26. LITERAL 0x62 ('b')
28. SUCCESS

[3] re.DEBUG output for r'(ab|a)*?b'
MIN_REPEAT 0 MAXREPEAT
  SUBPATTERN 1 0 0
    LITERAL 97
    BRANCH
      LITERAL 98
    OR
LITERAL 98

 0. INFO 4 0b0 1 MAXREPEAT (to 5)
 5: REPEAT 19 0 MAXREPEAT (to 25)
 9.   MARK 0
11.   LITERAL 0x61 ('a')
13.   BRANCH 5 (to 19)
15.     LITERAL 0x62 ('b')
17.     JUMP 5 (to 23)
19:   branch 3 (to 22)
20.     JUMP 2 (to 23)
22:   FAILURE
23:   MARK 1
25: MIN_UNTIL
26. LITERAL 0x62 ('b')
28. SUCCESS
msg334586 - (view) Author: James Davis (davisjam) * Date: 2019-01-30 18:18
Thanks for your thoughts, Raymond. I understand that the alternation has "short-circuit" behavior, but I still find it confusing in this case.

Consider these two:


Regex pattern                    matched?       matched string     captured content
-------------------- -------------------- -------------------- --------------------
(ab|a)*?b                            True                   ab                ('',)
(ab|a)+?b                            True                   ab                ('',)

In order to satisfy the first "(ab|a)+?" clause the regex engine has to find at least one match for (ab|a), and still match the final "b" clause of the pattern.

In this case, although "(ab|a)" will match "ab", doing so would cause the overall pattern to mismatch. So it seems like in order to obtain the match (which it does, see the second column), the regex engine must proceed past the first "ab" into the "a" part of the OR. But then I would expect the capture group to contain "a" and it does not.

For what it's worth, I also tried the match /(ab|a)*?b/ in PHP, Perl, Java, Ruby, Go, Rust and Node.js. The other 7 regex engines all matched "ab" and captured "a". Only Python's re module matches with an empty capture -- and even here it disagrees with the Python "regex" module as I noted in my initial post.
msg334592 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2019-01-30 19:49
It looks like a bug in re to me.
msg334600 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-01-30 22:44
I'm not at all clear on how these features should interact (alternation, non-greedy matching, and group capture).  Was just pointing that the first-match behavior of alternation is the documented behavior and that re.DEBUG can be used to explore what the regex is actually doing.   Whether this is correct, intended, desirable or consistent with other tools is a completely different question.  Even if not, there is a question about whether the behavior should be changed or just documented rather than risk breaking an untold number of regexes in already tested and deployed code.
msg334601 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2019-01-30 23:11
It matches, and the span is (0, 2).

The only way that it can match like that is for the capture group to match the 'a', and the final 'b' to match the 'b'.

Therefore, re.search(r'(ab|a)*b', 'ab').groups() should be ('a', ), as it is for the pattern with a greedy repeat.
msg334603 - (view) Author: Ma Lin (malin) * Date: 2019-01-31 03:28
You can `#define VERBOSE` in file `_sre.c`, it will print the engine's actual actions:

|02FAC684|02FC7402|MARK 0
...
|02FAC6BC|02FC7401|MARK 1

In my computer, 02FC7400 points to "ab", 02FC7401 points 'b' in "ab", 02FC7402 points to the end of "ab".

This capture group, begin at 02FC7402, end at 02FC7401. `begin > end` makes it return an empty string.

This seems a bug, the begin should at 02FC7400.
msg335131 - (view) Author: Ma Lin (malin) * Date: 2019-02-09 11:39
For a capture group, state->mark[] array stores it's begin and end:
begin: state->mark[(group_number-1)*2]
end:   state->mark[(group_number-1)*2+1]

So state->mark[0] is the begin of the first capture group.
state->mark[1] is the end of the first capture group.

re.search(r'(ab|a)*?b', 'ab')
In this case, here is a simplified actions record:

01  MARK 0
02  "a":  first "a" in the pattern [SUCCESS]
03  BRANCH
04    "b": first "b" in the pattern [SUCCESS]
05    MARK 1
06    "b": second "b" in the pattern [FAIL]
07    try next (ab|a)*? [FAIL]
08      MARK 0
09      "a":  first "a" in the pattern [FAIL]
10  BRANCH: try next branch
11    "": the second branch [SUCCESS]
12    MARK 1
13    "b" [SUCCESS]: second "b" in the pattern

MARK_PUSH(lastmark) macro didn't protect MARK-0 if it was the only available mark, while the BRANCH op uses this macro to protect capture groups before trying a branch.

So capture group 1 is [MARK-0 at Line-08, MARK-1 at line-12), this is wrong. 
The correct capture group 1 should be [MARK-0 at Line-01, MARK-1 at line-12).
msg336293 - (view) Author: Ma Lin (malin) * Date: 2019-02-22 11:04
A bug harvest, see PR11756, maybe sre has more bugs.
Those bug exist since Python 2.

Any ideas from regular expression experts?
msg336917 - (view) Author: Ma Lin (malin) * Date: 2019-03-01 13:31
The PR11756 is prepared.
I force-pushed the patch in four steps, hope you can review it easier:
https://github.com/python/cpython/pull/11756/commits

🔴 Step 1, test-cases

Show the wrong behaviors before this fix, the corresponding test-case will be updated in next steps.

🔴 Step 2, MARK_PUSH(lastmark) macro bug

This bug was reported in this issue.

MARK_PUSH(lastmark) macro didn't protect MARK-0 if it was the only available mark.

Note that if skip this step and apply Step 3, it happens to pass the test. But this is indeed a bug that should be fixed.

🔴 Step 3, jump JUMP_MIN_UNTIL_3 needs LASTMARK_SAVE() and MARK_PUSH()

This bug was triggered by a pattern which provided by Serhiy Storchaka:
>>> re.match(r'(ab?)*?b', 'ab').group(1)
''
Expected result: 'a'

This is due to the lack of protection before JUMP.
sre has these JUMPs, the current protection is written in [], the purpose of this JUMP is written in ().

in op SRE_OP_MAX_UNTIL 🔹 (...)*
JUMP_MAX_UNTIL_1 [no protect] (try a repeat, to min limit of repeat)
JUMP_MAX_UNTIL_2 [LASTMARK_SAVE, MARK_PUSH] (try a repeat, until max limit of repeat)
JUMP_MAX_UNTIL_3 [no protect] (try the tail of this repeat)

in op SRE_OP_MIN_UNTIL 🔹 (...)*?
JUMP_MIN_UNTIL_1 [no protect] (try a repeat, to min limit of repeat)
JUMP_MIN_UNTIL_2 [LASTMARK_SAVE] (try the tail of this repeat)
JUMP_MIN_UNTIL_3 [no protect] (try a repeat, until max limit of repeat)

in op SRE_OP_REPEAT_ONE 🔹 .*
JUMP_REPEAT_ONE_1 [LASTMARK_SAVE] (try the tail of this repeat)
JUMP_REPEAT_ONE_2 [LASTMARK_SAVE] (try the tail of this repeat)

in op SRE_OP_MIN_REPEAT_ONE 🔹 .*?
JUMP_MIN_REPEAT_ONE [LASTMARK_SAVE] (try the tail of this repeat) 

in op SRE_OP_BRANCH 🔹 ...|...
JUMP_BRANCH [LASTMARK_SAVE in any case, MARK_PUSH if in a repeat] (try a branch)

in op SRE_OP_ASSERT 🔹 (?=...)
JUMP_ASSERT [no protect] (try to match a sub-pattern)

in op SRE_OP_ASSERT_NOT 🔹 (?!=...)
JUMP_ASSERT_NOT [no protect] (try to match a sub-pattern)

These protections have not been changed since commit caf1c9dfe779 (2003-4-27).
Why it came out like this? There is a note in the message of commit be733ee7fb7e (2003-4-20):

> Gustavo Niemeyer wrote:
> As a note. It seems that there are other places that require the
> "protection" of LASTMARK_SAVE()/LASTMARK_RESTORE(), and are just
> waiting for someone to find how to break them. Particularly, I
> believe that every recursion of SRE_MATCH() should be protected
> by these macros. I won't do that right now since I'm not completely
> sure about this, and we don't have much time for testing until
> the next release.

Now we found some test-cases to break them, it seems JUMP_MIN_UNTIL_3 should be protected.

I tried to summarize a rule of protection:

if is_backtrack_point: # may backtrack if the tail fail
    LASTMARK_SAVE()
elif is_repeat_body:   # is a sub-pattern inside (...)* or (...)*?
    LASTMARK_SAVE()
    MARK_PUSH()

I did some experiments, it seems this rule works.
Since JUMP_MIN_UNTIL_3 is a repeat body, it needs protection, same as JUMP_MAX_UNTIL_2.

With this rule, JUMP_MAX_UNTIL_3 should LASTMARK_SAVE() as well, but this is not needed. sre uses stack to simulate recursive call, in fact JUMP_MAX_UNTIL_3 is protected by JUMP_MAX_UNTIL_2 from the previous SRE_OP_MAX_UNTIL execution.

🔴 Step 4, JUMP_ASSERT_NOT needs LASTMARK_SAVE()

This bug is about negative assertion, initially reported in issue725149, but they didn't fix it for some reasons:

> Gustavo Niemeyer wrote:
> it changes the current behavior, which is also compatible to how
> perl works.
> ...
> This way, we can think further about this, and look for an elegant
> solution to fix that support, certainly including some algorithm
> to check for half-marked groups.

IMO we can fix it, here is two test-cases:

Case-A, negative assertion:
re.match(r'(?!(..)c)', 'ab').group(1)

Case-B, negative assertion in a repeat:
re.match(r'(?:(?!(ab)c).)*', 'ab').group(1)

                    Case-A         Case-B
PHP 7.3.2           NULL           NULL
Java 11.0.2         null           null
Perl 5.26.1         "ab"           "ab"
Ruby 2.6.1          nil            nil
Go 1.12             doesn't support lookaround
Rust 1.32.0         doesn't support lookaround
Node.js 10.15.1     undefined      undefined
regex 2019.2.21     None           None
re before patch     "ab"           "b"
re after patch      None           None

In Case-A, sre looks compatible with Perl.
But in Case-B, sre is definitely wrong, so we can follow the mainstream behavior.

🔴 Interesting sidelights 1

Found a Perl bug:

perl -le 'print defined($1)?"\"$1\"":"undef",",",defined($2)?"\"$2\"":"undef" if "ab" =~ /((ab?)*?b)/'
"ab",undef

Expected result: "ab","a"

All other engines (except unpatched sre) return the correct result.

🔴 Interesting sidelights 2

>>> re.match(r'(((a)|b)*)', 'ab').groups()
('ab', 'b', 'a')

Maybe the correct result is ('ab', 'b', None), only Node.js 10.15.1 returns this.
All other engines return the same result as sre.
msg337088 - (view) Author: Ma Lin (malin) * Date: 2019-03-04 09:59
Found another bug in re:

>>> re.match(r'(?:.*?\b(?=(\t)|(x))x)*', 'a\txa\tx').groups()
('\t', 'x')

Expected result: (None, 'x')

PHP 7.3.2           NULL, "x"
Java 11.0.2         "\t", "x"
Perl 5.28.1         "\t", "x"
Ruby 2.6.1          nil, "x"
Go 1.12             doesn't support lookaround
Rust 1.32.0         doesn't support lookaround
Node.js 10.15.1     undefined, "x"
regex 2019.2.21     None, "x"
re                  "\t", "x"

This is a very rare bug, can be fixed by adding MARH_PUSH() before JUMP_MIN_REPEAT_ONE. And maybe other JUMPs should MARK_PUSH() as well.

I'm impressed with regex module, it never went wrong.
IMHO, I would like to see a pruned version be adopted into stdlib.

~~~~~~~~~~~~~~~~~~~~~~
> Interesting sidelights 1
> Found a Perl bug

I reported to Perl, it's a bug in perl-5.26, and already fixed in perl-5.28.0.
msg337118 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-03-04 13:57
Thank you for your great work Ma Lin! But it will take a time to make a review of it.

Could you please create and run some microbenchmarks to measure possible performance penalty of additional MARH_PUSHes? I am especially interesting in worst cases. If the penalty is significant, it will be a goal of future optimizations. If it is unsignificant, we will not be bothered about this.

I am not sure about backporting these changes. This behavior is such old, that there is a chance to break someone's code that depend on it.
msg337739 - (view) Author: Ma Lin (malin) * Date: 2019-03-12 13:52
> Could you please create and run some microbenchmarks to measure
> possible performance penalty of additional MARH_PUSHes? I am 
> especially interesting in worst cases.

Besides the worst case, I prepared two solutions.

Solution_A (PR12288):
Fix the bugs, I can find test-case for every changes.
I feel JUMP_MIN_UNTIL_3 should MARK_PUSH() as well, but I really can't find a test-case to prove this feel.
Commit_12 is a safety-check, if JUMP_MIN_UNTIL_3 or other JUMPs should be protected, maybe there will be a bug report come from millions of users.

Solution_B (PR12289):
Based on Solution_A, protects JUMP_MIN_UNTIL_3.

Worst (PR12290):
Based on Solution_B, protects in everywhere, this should the worst performance.

Legend of this table:
* No: no protection
* L : LASTMARK_SAVE()
* Mu: MARK_PUSH() unconditionally
* Mr: MARK_PUSH() if in a repeat

                  unpatched  Solution_A  Solution_B   Worst
JUMP_MAX_UNTIL_1     No         No          No    ->  L/Mu
JUMP_MAX_UNTIL_2     L/Mu       L/Mu        L/Mu      L/Mu
JUMP_MAX_UNTIL_3     No         No          No    ->  L/Mu

JUMP_MIN_UNTIL_1     No         No          No    ->  L/Mu
JUMP_MIN_UNTIL_2     L    ->    L/Mr        L/Mr  ->  L/Mu
JUMP_MIN_UNTIL_3     No         No     ->   L/Mu  ->  L/Mu

JUMP_REPEAT_ONE_1    L    ->    L/Mr        L/Mr  ->  L/Mu
JUMP_REPEAT_ONE_2    L    ->    L/Mr        L/Mr  ->  L/Mu

JUMP_MIN_REPEAT_ONE  L    ->    L/Mr        L/Mr  ->  L/Mu

JUMP_BRANCH          L/Mr       L/Mr        L/Mr  ->  L/Mu

JUMP_ASSERT          No         No          No    ->  L/Mu

JUMP_ASSERT_NOT      No   ->    L/Mr        L/Mr  ->  L/Mu

🔶 I made a benchmark tool for Python RE engines.
https://github.com/animalize/re_benchmarks

re100mb.py was extracted from a real case, process 100MB real data in 16 rounds (with 16 patterns).
This table picks best of 4 results from each round:

             unpatched  A        B        Worst    regex 2019.3.9
test 01 best:  0.647    0.629    0.625    0.635    0.102
test 02 best:  3.980    4.046    4.052    4.005    4.530
test 03 best:  0.730    0.708    0.709    0.706    0.433
test 04 best:  0.763    0.743    0.740    0.736    0.799
test 05 best:  2.566    2.693    2.692    2.654    0.981
test 06 best:  0.883    0.865    0.859    0.874    0.872
test 07 best:  2.847    2.773    2.797    2.890    1.202
test 08 best:  3.644    3.664    3.740    3.699    1.139
test 09 best:  0.344    0.348    0.343    0.345    0.378
test 10 best:  0.341    0.347    0.343    0.344    0.407
test 11 best:  4.490    4.655    4.520    4.440    1.340
test 12 best:  0.264    0.263    0.262    0.264    0.271
test 13 best:  0.230    0.230    0.231    0.233    0.281
test 14 best:  0.932    0.925    0.919    0.943    0.334
test 15 best:  1.837    1.815    1.804    1.866    0.683
test 16 best:  1.691    1.708    1.676    2.042    3.805
--------------------------------------------------------
sum of above: 26.189   26.412   26.312   26.676   17.557

There seems no significant changes.

🔶 Below are some micro benchmarks, run t.py with the benchmark tool testit.py

SRE_OP_MAX_UNTIL a few repeats
unpatched: 744.85 nsec
SolutionA: 755.86 nsec
SolutionB: 745.14 nsec
Worst:     843.56 nsec
regex:     2.45 usec

SRE_OP_MAX_UNTIL many repeats
unpatched: 393.24 msec
SolutionA: 321.16 msec
SolutionB: 323.24 msec
Worst:     498.48 msec
regex:     240.73 msec

------------------
SRE_OP_MIN_UNTIL a few repeats
unpatched: 702.75 nsec
SolutionA: 701.90 nsec
SolutionB: 730.81 nsec
Worst:     873.67 nsec
regex:     1.84 usec

SRE_OP_MIN_UNTIL many repeats
unpatched: 210.60 msec
SolutionA: 174.20 msec
SolutionB: 323.93 msec
Worst:     491.73 msec
regex:     195.11 msec

------------------
SRE_OP_REPEAT_ONE short string
unpatched: 466.56 nsec
SolutionA: 468.85 nsec
SolutionB: 466.04 nsec
Worst:     463.86 nsec
regex:     1.13 usec

SRE_OP_REPEAT_ONE long string
unpatched: 2.19 msec
SolutionA: 2.19 msec
SolutionB: 2.19 msec
Worst:     2.19 msec
regex:     3.23 msec

------------------
SRE_OP_MIN_REPEAT_ONE short string
unpatched: 563.76 nsec
SolutionA: 566.68 nsec
SolutionB: 561.92 nsec
Worst:     601.69 nsec
regex:     1.12 usec

SRE_OP_MIN_REPEAT_ONE long string
unpatched: 11.16 msec
SolutionA: 11.27 msec
SolutionB: 10.55 msec
Worst:     15.97 msec
regex:     7.24 msec

------------------
SRE_OP_BRANCH
unpatched: 419.34 nsec
SolutionA: 422.07 nsec
SolutionB: 418.25 nsec
Worst:     423.56 nsec
regex:     1.34 usec

------------------
SRE_OP_ASSERT
unpatched: 320.58 nsec
SolutionA: 326.29 nsec
SolutionB: 320.81 nsec
Worst:     333.22 nsec
regex:     1.14 usec

SRE_OP_ASSERT_NOT
unpatched: 440.18 nsec
SolutionA: 440.93 nsec
SolutionB: 434.44 nsec
Worst:     446.33 nsec
regex:     1.00 usec

🔶 reduce the size of match_context struct

In stack-consuming scenes, Solution_A and Solution_B are better than unpatched.
This is because commit 10 and commit 11, they reduced the size of match_context struct, the stack uses this struct to simulate recursive call.

On 32 bit platform, sizeof(match_context): 36 bytes -> 32 bytes.
On 64 bit platform, sizeof(match_context): 72 bytes -> 56 bytes.

It brings:
- deeper recursive call
- less memory consume
- less memory realloc

If limit the stack size to 1GB, the max value of n is:

re.match(r'(ab)*', n * 'ab')   # need to save MARKs
72 bytes: n = 11,184,809
64 bytes: n = 12,201,610
56 bytes: n = 13,421,771

re.match(r'(?:ab)*', n * 'ab') # no need to save MARKs
72 bytes: n = 13,421,770
64 bytes: n = 14,913,079
56 bytes: n = 16,777,214

🔶 Future optimizations
> If the penalty is significant, it will be a goal of future optimizations.

Maybe these two questions can be predicted when compiling the pattern:
- whether protect or not
- which MARKs should be protected
Then sre don't need to MARK_PUSH() all available MARKs to stack.
msg338320 - (view) Author: Ma Lin (malin) * Date: 2019-03-19 06:45
I guess PR12427 is mature enough for review, I have been working on it these days.

You may review these commits one by one, commit message is review guide.
https://github.com/python/cpython/pull/12427/commits

Maybe you will need two or three days to understand it, and ponder some days.

> I am not sure about backporting these changes.
> This behavior is such old, that there is a chance
> to break someone's code that depend on it.

How about this plan, if no one complains these changes in Python 3.8 before the end of this year, then we backport to 3.7 and 2.7 branches at that time, and document the changes in notable changes section.
msg370524 - (view) Author: Ma Lin (malin) * Date: 2020-06-01 00:57
Is there hope to merge to 3.9 branch?
msg372547 - (view) Author: Ma Lin (malin) * Date: 2020-06-29 06:00
Do I need to write a detailed review guide? I suppose that after reading it from beginning to end, it will be easy to understand PR 12427, no need to read anything else.

Or plan to replace the sre module with the regex module in a future version?
msg416260 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2022-03-29 14:31
New changeset 356997cccc21a3391175d20e9ef03d434675b496 by Ma Lin in branch 'main':
bpo-35859: Fix a few long-standing bugs in re engine (GH-12427)
https://github.com/python/cpython/commit/356997cccc21a3391175d20e9ef03d434675b496
msg416261 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2022-03-29 14:35
Since the old behavior in many cases matches the behavior of Perl and Java (which are considered bugs, but still), it was decided to not backport the fix to avoid possible breakage in bugfix releases.

Thank you Ma Lin for your contribution.
msg416262 - (view) Author: Ma Lin (malin) * Date: 2022-03-29 14:41
Thanks for your review.

3.11 has a more powerful re module, also thank you for rebasing the atomic grouping code.
History
Date User Action Args
2022-04-11 14:59:10adminsetgithub: 80040
2022-03-29 14:41:19malinsetmessages: + msg416262
2022-03-29 14:35:33serhiy.storchakasetstatus: open -> closed

components: + Library (Lib)
versions: + Python 3.11, - Python 2.7, Python 3.7, Python 3.8
keywords: patch, patch, patch
resolution: fixed
messages: + msg416261
stage: patch review -> resolved
2022-03-29 14:31:24serhiy.storchakasetmessages: + msg416260
2020-06-29 06:00:58malinsetmessages: + msg372547
2020-06-01 00:57:51malinsetmessages: + msg370524
2019-03-19 06:45:25malinsetmessages: + msg338320
2019-03-19 06:40:59malinsetpull_requests: + pull_request12381
2019-03-12 13:56:32malinsetfiles: + t.py
2019-03-12 13:52:50malinsetmessages: + msg337739
2019-03-12 13:51:22malinsetpull_requests: + pull_request12271
2019-03-12 13:50:33malinsetpull_requests: + pull_request12270
2019-03-12 13:49:42malinsetpull_requests: + pull_request12269
2019-03-04 13:57:44serhiy.storchakasetkeywords: patch, patch, patch

messages: + msg337118
2019-03-04 09:59:51malinsetmessages: + msg337088
2019-03-02 00:20:09malinsetpull_requests: + pull_request12137
2019-03-01 13:31:17malinsetfiles: + test_scripts_for_7_engines.zip

messages: + msg336917
2019-02-22 11:04:15malinsetmessages: + msg336293
2019-02-19 09:03:45serhiy.storchakasetpull_requests: - pull_request11701
2019-02-19 09:03:39serhiy.storchakasetpull_requests: - pull_request11700
2019-02-09 11:39:34malinsetmessages: + msg335131
versions: + Python 3.7, Python 3.8, - Python 3.5
2019-02-04 15:55:50malinsetkeywords: + patch
stage: patch review
pull_requests: + pull_request11701
2019-02-04 15:55:40malinsetkeywords: + patch
stage: (no value)
pull_requests: + pull_request11700
2019-02-04 15:55:30malinsetkeywords: + patch
stage: (no value)
pull_requests: + pull_request11699
2019-01-31 03:28:20malinsetmessages: + msg334603
2019-01-30 23:11:43mrabarnettsetmessages: + msg334601
2019-01-30 22:45:37rhettingersetnosy: + tim.peters, effbot, serhiy.storchaka
2019-01-30 22:44:24rhettingersetmessages: + msg334600
2019-01-30 19:49:33mrabarnettsetmessages: + msg334592
2019-01-30 18:18:35davisjamsetmessages: + msg334586
2019-01-30 16:59:24rhettingersetnosy: + rhettinger
messages: + msg334573
2019-01-30 15:00:15malinsetnosy: + malin
2019-01-30 14:16:14davisjamcreate