classification
Title: Capture behavior depends on the order of an alternation
Type: behavior Stage: patch review
Components: Regular Expressions Versions: Python 3.8, Python 3.7, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Ma Lin, davisjam, effbot, ezio.melotti, mrabarnett, rhettinger, serhiy.storchaka, tim.peters
Priority: normal Keywords: patch, patch, patch

Created on 2019-01-30 14:16 by davisjam, last changed 2019-02-09 11:39 by Ma Lin.

Files
File name Uploaded Description Edit
confusing-regex-behavior.py davisjam, 2019-01-30 14:16 Demonstrate confusing regex behavior
Pull Requests
URL Status Linked Edit
PR 11756 open Ma Lin, 2019-02-04 15:55
PR 11756 open Ma Lin, 2019-02-04 15:55
PR 11756 open Ma Lin, 2019-02-04 15:55
Messages (8)
msg334560 - (view) Author: James Davis (davisjam) * Date: 2019-01-30 14:16
I have two regexes: /(a|ab)*?b/ and /(ab|a)*?b/.
If I re.search the string "ab" for these regexes, I get inconsistent behavior.
Specifically, /(a|ab)*?b/ matches with capture "a", while /(ab|a)*?b/ matches with an empty capture group.

I am not actually sure which behavior is correct.


Interpretation 1: The (ab|a) clause matches the a, satisfying the (ab|a)*? once, and the engine proceeds to the b and completes. The capture group ends up containing "a".

Interpretation 2: The (ab|a) clause matches the a. Since the clause is marked with *, the engine repeats the attempt and finds nothing the second time. It proceeds to the b and completes. Because the second match attempt on (ab|a) found nothing, the capture group ends up empty.

The behavior depends on both the order of (ab|a) vs. (a|ab), and the use of the non-greedy quantifier.

I cannot see why changing the order of the alternation should have this effect.

The change in behavior occurs in the built-in "re" module but not in the competing "regex" module.
The behavior is consistent in both Python 2.7 and Python 3.5. I have not tested other versions.

I have included the confusing-regex-behavior.py file for troubleshooting.

Below is the behavior for matches on these and many variants.
I find the following lines the most striking:

Regex pattern                    matched?       matched string     captured content
-------------------- -------------------- -------------------- --------------------
(ab|a)*?b                            True                   ab                ('',)
(ab|a)+?b                            True                   ab                ('',)
(ab|a){0,}?b                         True                   ab                ('',)
(ab|a){0,2}?b                        True                   ab                ('',)
(ab|a){0,1}?b                        True                   ab               ('a',)
(ab|a)*b                             True                   ab               ('a',)
(ab|a)+b                             True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)+?b                            True                   ab               ('a',)


(08:58:48) jamie@woody ~ $ python3 /tmp/confusing-regex-behavior.py 


Behavior from re


Regex pattern                    matched?       matched string     captured content
-------------------- -------------------- -------------------- --------------------
(ab|a)*?b                            True                   ab                ('',)
(ab|a)+?b                            True                   ab                ('',)
(ab|a){0,}?b                         True                   ab                ('',)
(ab|a){0,2}?b                        True                   ab                ('',)
(ab|a)?b                             True                   ab               ('a',)
(ab|a)??b                            True                   ab               ('a',)
(ab|a)b                              True                   ab               ('a',)
(ab|a){0,1}?b                        True                   ab               ('a',)
(ab|a)*b                             True                   ab               ('a',)
(ab|a)+b                             True                   ab               ('a',)
(a|ab)*b                             True                   ab               ('a',)
(a|ab)+b                             True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)+?b                            True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(bb|a)*?b                            True                   ab               ('a',)
((?:ab|a)*?)b                        True                   ab               ('a',)
((?:a|ab)*?)b                        True                   ab               ('a',)


Behavior from regex


Regex pattern                    matched?       matched string     captured content
-------------------- -------------------- -------------------- --------------------
(ab|a)*?b                            True                   ab               ('a',)
(ab|a)+?b                            True                   ab               ('a',)
(ab|a){0,}?b                         True                   ab               ('a',)
(ab|a){0,2}?b                        True                   ab               ('a',)
(ab|a)?b                             True                   ab               ('a',)
(ab|a)??b                            True                   ab               ('a',)
(ab|a)b                              True                   ab               ('a',)
(ab|a){0,1}?b                        True                   ab               ('a',)
(ab|a)*b                             True                   ab               ('a',)
(ab|a)+b                             True                   ab               ('a',)
(a|ab)*b                             True                   ab               ('a',)
(a|ab)+b                             True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)+?b                            True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(bb|a)*?b                            True                   ab               ('a',)
((?:ab|a)*?)b                        True                   ab               ('a',)
((?:a|ab)*?)b                        True                   ab               ('a',)
msg334573 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-01-30 16:59
> I cannot see why changing the order of the alternation should have this effect.

The first regex, r'(a|ab)*?b', looks for the first alternative group by matching left-to-right [1] stopping at the first matching alternation "a".  Roughly, the regex simplifies to r'(a)*?b' giving 'a' in the captured group.

The second regex, r'(ab|a)*?b', looks for the first  alternative group by matching left-to-right [1] stopping at the first matching alternation "ab".  Roughly, the regex simplifies to r'(ab)*?b' giving '' in the captured group.

From there, I'm not clear on how a non-greedy kleene-star works with capturing groups and with the overall span().  A starting point would be to look at the re.DEBUG output for each pattern [2][3].

[1] From the re docs for the alternation operator:
As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy.

[2] re.DEBUG output for r'(a|ab)*?b'
 0. INFO 4 0b0 1 MAXREPEAT (to 5)
 5: REPEAT 19 0 MAXREPEAT (to 25)
 9.   MARK 0
11.   LITERAL 0x61 ('a')
13.   BRANCH 3 (to 17)
15.     JUMP 7 (to 23)
17:   branch 5 (to 22)
18.     LITERAL 0x62 ('b')
20.     JUMP 2 (to 23)
22:   FAILURE
23:   MARK 1
25: MIN_UNTIL
26. LITERAL 0x62 ('b')
28. SUCCESS

[3] re.DEBUG output for r'(ab|a)*?b'
MIN_REPEAT 0 MAXREPEAT
  SUBPATTERN 1 0 0
    LITERAL 97
    BRANCH
      LITERAL 98
    OR
LITERAL 98

 0. INFO 4 0b0 1 MAXREPEAT (to 5)
 5: REPEAT 19 0 MAXREPEAT (to 25)
 9.   MARK 0
11.   LITERAL 0x61 ('a')
13.   BRANCH 5 (to 19)
15.     LITERAL 0x62 ('b')
17.     JUMP 5 (to 23)
19:   branch 3 (to 22)
20.     JUMP 2 (to 23)
22:   FAILURE
23:   MARK 1
25: MIN_UNTIL
26. LITERAL 0x62 ('b')
28. SUCCESS
msg334586 - (view) Author: James Davis (davisjam) * Date: 2019-01-30 18:18
Thanks for your thoughts, Raymond. I understand that the alternation has "short-circuit" behavior, but I still find it confusing in this case.

Consider these two:


Regex pattern                    matched?       matched string     captured content
-------------------- -------------------- -------------------- --------------------
(ab|a)*?b                            True                   ab                ('',)
(ab|a)+?b                            True                   ab                ('',)

In order to satisfy the first "(ab|a)+?" clause the regex engine has to find at least one match for (ab|a), and still match the final "b" clause of the pattern.

In this case, although "(ab|a)" will match "ab", doing so would cause the overall pattern to mismatch. So it seems like in order to obtain the match (which it does, see the second column), the regex engine must proceed past the first "ab" into the "a" part of the OR. But then I would expect the capture group to contain "a" and it does not.

For what it's worth, I also tried the match /(ab|a)*?b/ in PHP, Perl, Java, Ruby, Go, Rust and Node.js. The other 7 regex engines all matched "ab" and captured "a". Only Python's re module matches with an empty capture -- and even here it disagrees with the Python "regex" module as I noted in my initial post.
msg334592 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2019-01-30 19:49
It looks like a bug in re to me.
msg334600 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-01-30 22:44
I'm not at all clear on how these features should interact (alternation, non-greedy matching, and group capture).  Was just pointing that the first-match behavior of alternation is the documented behavior and that re.DEBUG can be used to explore what the regex is actually doing.   Whether this is correct, intended, desirable or consistent with other tools is a completely different question.  Even if not, there is a question about whether the behavior should be changed or just documented rather than risk breaking an untold number of regexes in already tested and deployed code.
msg334601 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2019-01-30 23:11
It matches, and the span is (0, 2).

The only way that it can match like that is for the capture group to match the 'a', and the final 'b' to match the 'b'.

Therefore, re.search(r'(ab|a)*b', 'ab').groups() should be ('a', ), as it is for the pattern with a greedy repeat.
msg334603 - (view) Author: Ma Lin (Ma Lin) * Date: 2019-01-31 03:28
You can `#define VERBOSE` in file `_sre.c`, it will print the engine's actual actions:

|02FAC684|02FC7402|MARK 0
...
|02FAC6BC|02FC7401|MARK 1

In my computer, 02FC7400 points to "ab", 02FC7401 points 'b' in "ab", 02FC7402 points to the end of "ab".

This capture group, begin at 02FC7402, end at 02FC7401. `begin > end` makes it return an empty string.

This seems a bug, the begin should at 02FC7400.
msg335131 - (view) Author: Ma Lin (Ma Lin) * Date: 2019-02-09 11:39
For a capture group, state->mark[] array stores it's begin and end:
begin: state->mark[(group_number-1)*2]
end:   state->mark[(group_number-1)*2+1]

So state->mark[0] is the begin of the first capture group.
state->mark[1] is the end of the first capture group.

re.search(r'(ab|a)*?b', 'ab')
In this case, here is a simplified actions record:

01  MARK 0
02  "a":  first "a" in the pattern [SUCCESS]
03  BRANCH
04    "b": first "b" in the pattern [SUCCESS]
05    MARK 1
06    "b": second "b" in the pattern [FAIL]
07    try next (ab|a)*? [FAIL]
08      MARK 0
09      "a":  first "a" in the pattern [FAIL]
10  BRANCH: try next branch
11    "": the second branch [SUCCESS]
12    MARK 1
13    "b" [SUCCESS]: second "b" in the pattern

MARK_PUSH(lastmark) macro didn't protect MARK-0 if it was the only available mark, while the BRANCH op uses this macro to protect capture groups before trying a branch.

So capture group 1 is [MARK-0 at Line-08, MARK-1 at line-12), this is wrong. 
The correct capture group 1 should be [MARK-0 at Line-01, MARK-1 at line-12).
History
Date User Action Args
2019-02-09 11:39:34Ma Linsetmessages: + msg335131
versions: + Python 3.7, Python 3.8, - Python 3.5
2019-02-04 15:55:50Ma Linsetkeywords: + patch
stage: patch review
pull_requests: + pull_request11701
2019-02-04 15:55:40Ma Linsetkeywords: + patch
stage: (no value)
pull_requests: + pull_request11700
2019-02-04 15:55:30Ma Linsetkeywords: + patch
stage: (no value)
pull_requests: + pull_request11699
2019-01-31 03:28:20Ma Linsetmessages: + msg334603
2019-01-30 23:11:43mrabarnettsetmessages: + msg334601
2019-01-30 22:45:37rhettingersetnosy: + tim.peters, effbot, serhiy.storchaka
2019-01-30 22:44:24rhettingersetmessages: + msg334600
2019-01-30 19:49:33mrabarnettsetmessages: + msg334592
2019-01-30 18:18:35davisjamsetmessages: + msg334586
2019-01-30 16:59:24rhettingersetnosy: + rhettinger
messages: + msg334573
2019-01-30 15:00:15Ma Linsetnosy: + Ma Lin
2019-01-30 14:16:14davisjamcreate