Author rhettinger
Recipients Ma Lin, davisjam, ezio.melotti, mrabarnett, rhettinger
Date 2019-01-30.16:59:24
Message-id <1548867564.27.0.765343075837.issue35859@roundup.psfhosted.org>
In-reply-to
Content
> I cannot see why changing the order of the alternation should have this effect.

The first regex, r'(a|ab)*?b', tries the alternatives in the group from left to right [1] and stops at the first alternative that matches, "a".  Roughly, the regex simplifies to r'(a)*?b', giving 'a' in the captured group.

The second regex, r'(ab|a)*?b', tries the alternatives in the group from left to right [1] and stops at the first alternative that matches, "ab".  Roughly, the regex simplifies to r'(ab)*?b', giving '' in the captured group.
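A minimal sketch for reproducing the comparison, assuming the target string is "ab" (this message does not state the input, so that choice is an assumption). The captured group for the second pattern may differ between Python versions, so no expected output is given for it.

```python
import re

# Assumed test input; this message does not name the string.
text = "ab"

patterns = (r'(a|ab)*?b', r'(ab|a)*?b')
results = {p: re.search(p, text) for p in patterns}

for pattern, m in results.items():
    # span() is the overall match; groups() shows what group 1 captured.
    print(pattern, m.span(), m.groups())
```

Comparing the two printed lines shows whether only the captured group differs or the overall span differs as well.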

Beyond that, I'm not clear on how a non-greedy Kleene star interacts with capturing groups and with the overall span().  A starting point would be to look at the re.DEBUG output for each pattern [2][3].
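One way to regenerate such dumps is to compile each pattern with the re.DEBUG flag, which prints the parse tree and the compiled opcodes to stdout. The helper name debug_dump below is hypothetical, and the exact layout and offsets may vary between Python versions.

```python
import contextlib
import io
import re


def debug_dump(pattern):
    """Return the text that re.DEBUG prints while compiling `pattern`."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        re.compile(pattern, re.DEBUG)
    return buf.getvalue()


for pattern in (r'(a|ab)*?b', r'(ab|a)*?b'):
    print(pattern)
    print(debug_dump(pattern))
```

Capturing the output as a string makes it easy to diff the two dumps side by side.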

[1] From the re docs for the alternation operator:
As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy.
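A short illustration of the quoted rule, using a bare alternation with no repetition or following literal: the leftmost alternative that matches wins, even when a later one would match more text.

```python
import re

# 'a' is listed first, so it is accepted and 'ab' is never tried.
left_short = re.match(r'a|ab', 'ab').group()

# 'ab' is listed first and matches, so the longer text is returned.
left_long = re.match(r'ab|a', 'ab').group()

print(left_short)  # a
print(left_long)   # ab
```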

[2] re.DEBUG output for r'(a|ab)*?b'
 0. INFO 4 0b0 1 MAXREPEAT (to 5)
 5: REPEAT 19 0 MAXREPEAT (to 25)
 9.   MARK 0
11.   LITERAL 0x61 ('a')
13.   BRANCH 3 (to 17)
15.     JUMP 7 (to 23)
17:   branch 5 (to 22)
18.     LITERAL 0x62 ('b')
20.     JUMP 2 (to 23)
22:   FAILURE
23:   MARK 1
25: MIN_UNTIL
26. LITERAL 0x62 ('b')
28. SUCCESS

[3] re.DEBUG output for r'(ab|a)*?b'
MIN_REPEAT 0 MAXREPEAT
  SUBPATTERN 1 0 0
    LITERAL 97
    BRANCH
      LITERAL 98
    OR
LITERAL 98

 0. INFO 4 0b0 1 MAXREPEAT (to 5)
 5: REPEAT 19 0 MAXREPEAT (to 25)
 9.   MARK 0
11.   LITERAL 0x61 ('a')
13.   BRANCH 5 (to 19)
15.     LITERAL 0x62 ('b')
17.     JUMP 5 (to 23)
19:   branch 3 (to 22)
20.     JUMP 2 (to 23)
22:   FAILURE
23:   MARK 1
25: MIN_UNTIL
26. LITERAL 0x62 ('b')
28. SUCCESS
History
Date User Action Args
2019-01-30 16:59:26 rhettinger set recipients: + rhettinger, ezio.melotti, mrabarnett, Ma Lin, davisjam
2019-01-30 16:59:24 rhettinger set messageid: <1548867564.27.0.765343075837.issue35859@roundup.psfhosted.org>
2019-01-30 16:59:24 rhettinger link issue35859 messages
2019-01-30 16:59:24 rhettinger create