Author davisjam
Recipients Ma Lin, davisjam, ezio.melotti, mrabarnett, rhettinger
Date 2019-01-30.18:18:35
Message-id <1548872315.96.0.745498600802.issue35859@roundup.psfhosted.org>
In-reply-to
Content
Thanks for your thoughts, Raymond. I understand that the alternation has "short-circuit" behavior, but I still find it confusing in this case.

Consider these two:


Regex pattern                    matched?       matched string     captured content
-------------------- -------------------- -------------------- --------------------
(ab|a)*?b                            True                   ab                ('',)
(ab|a)+?b                            True                   ab                ('',)

To satisfy the leading "(ab|a)+?" clause, the regex engine has to match (ab|a) at least once and still match the final "b" of the pattern.

In this case, although "(ab|a)" will match "ab", doing so would leave nothing for the final "b" and the overall pattern would fail to match. So it seems that in order to obtain the match (which it does, see the second column), the regex engine must backtrack past the "ab" branch into the "a" branch of the alternation. But then I would expect the capture group to contain "a", and it does not.
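A minimal sketch for reproducing the table rows, assuming the input string is "ab" (the table lists only the matched string, so the input is an assumption) and that re.search is the call in use. The helper name probe is hypothetical and exists only for illustration.

```python
import re


def probe(pattern, text="ab"):
    """Return (matched?, matched string, captured groups) for one pattern.

    The default text "ab" is an assumption inferred from the
    "matched string" column of the table.
    """
    m = re.search(pattern, text)
    if m is None:
        return (False, None, None)
    return (True, m.group(0), m.groups())


if __name__ == "__main__":
    # The two patterns from the table: lazy star and lazy plus.
    for pattern in (r"(ab|a)*?b", r"(ab|a)+?b"):
        print(pattern, probe(pattern))
```

The sketch prints the captured groups without asserting a value for them, since that value is exactly what is in question here and may depend on the interpreter version.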

For what it's worth, I also tried the pattern /(ab|a)*?b/ in PHP, Perl, Java, Ruby, Go, Rust, and Node.js. All seven of those regex engines matched "ab" and captured "a". Only Python's re module matches with an empty capture, and even there it disagrees with the Python "regex" module, as I noted in my initial post.