classification
Title: left-to-right violation in match order
Type: behavior Stage: resolved
Components: Versions: Python 3.6
process
Status: closed Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: steve.newcomb, steven.daprano
Priority: normal Keywords:

Created on 2018-12-14 16:06 by steve.newcomb, last changed 2018-12-16 20:30 by steve.newcomb. This issue is now closed.

Files
File name Uploaded Description
left-to-right_violation_in_python3_re_match.py steve.newcomb, 2018-12-14 16:06
logcheck3.py steven.daprano, 2018-12-16 00:58
Messages (3)
msg331838 - Author: Steve Newcomb (steve.newcomb) Date: 2018-12-14 16:06
Documentation for the re module insists that matches are made left-to-right within the alternatives delimited by an "or" (|) group.  I seem to have found a case where the rightmost alternative is matched unless it (and only it) is commented out.  See attached script, which is self-explanatory.
msg331910 - Author: Steven D'Aprano (steven.daprano) (Python committer) Date: 2018-12-16 00:58
> See attached script, which is self-explanatory.

I'm glad one of us thinks so, because I find it clear as mud.

I spent *way* longer on this than I should have, but I simplified your sample code to the best of my ability. (See attached.) As far as I can tell, your code and mine do roughly the same thing, but please check that you agree.

I agree that with the IPV6 portion of the regex removed, it matches on "208.123.4.22", but with the IPV6 portion included, it matches on "::ffff:208.123.4.22". But I'm not sure that's a bug. I think it is working as designed. For example:


py> import re
py> text = 'green pepper'
py> re.search('pepper|green pepper', text).group(0)
'green pepper'


seems to be analogous to your example, but simpler. Do you agree? If not, it would also help a lot if you could find a simpler regex that demonstrates the issue. See http://www.sscce.org/

In your case, I believe that the rightmost alternative matches from position 1 of the text, while the leftmost alternative doesn't match until position 8. So starting from position 0, the IPV6 check matches first, and so wins.

It is possible you were expecting that the IPV4 check would be tested against position 0, then position 1, then position 2, then ... and so on until the end of the string, and only then the IPV6 check tested against position 0, then 1 etc.
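The difference between those two scanning strategies can be sketched in a few lines. This is only an illustration: the sample text, the simplified `IPV4` and `IPV6_MAPPED` patterns, and the helper `first_alternative_wins` are all assumptions made up for this example, not the contents of the attached scripts.

```python
import re

# Hypothetical sample text and simplified patterns (assumptions for
# illustration only; not taken from the attached scripts).
text = "log entry from ::ffff:208.123.4.22 port 22"
IPV4 = r"\d{1,3}(?:\.\d{1,3}){3}"
IPV6_MAPPED = r"::ffff:" + IPV4

combined = re.compile(IPV4 + "|" + IPV6_MAPPED)

# What re.search does: at each start position it tries the alternatives
# left to right, so the earliest start position that matches wins.
actual = combined.search(text)


def first_alternative_wins(patterns, s):
    """Hypothetical helper: scan the whole string with each pattern in turn."""
    for pattern in patterns:
        m = re.search(pattern, s)
        if m:
            return m
    return None


# What the report seemed to expect: exhaust the first alternative over the
# whole string before trying the second one at all.
expected = first_alternative_wins([IPV4, IPV6_MAPPED], text)

print(actual.group(0))    # ::ffff:208.123.4.22
print(expected.group(0))  # 208.123.4.22
```

Left-to-right ordering only decides between alternatives that can both match at the same start position; it does not override an earlier start position.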
msg331936 - Author: Steve Newcomb (steve.newcomb) Date: 2018-12-16 20:30
I'm very grateful for your time and attention, and sorry to have distracted you.  You're correct when you say:  

Steven D'Aprano: ...the rightmost alternative matches from position 1 of the text, while the leftmost alternative doesn't match until position 8. So starting from position 0, the IPV6 check matches first, and so wins.

I see now that what I was trying to do is simply not possible. I was looking for a way to do a kind of hat trick: to keep a matched substring ("::ffff:") out of matchObject.group(0).  I guess I just don't get to do that.  

It would be a nice feature to add: a "consume-and-forget" or "suppress" extension group type. Non-capturing groups forget about themselves, but they don't suppress their matched contents.  It's a nice thing to be able to do because some software accepts regular expressions as configuration items but doesn't allow configuration of selection among the groups that may appear within it.  (I admit there aren't many occasions when suppression of substrings from group(0) is really necessary, but I think they do occur.)
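For readers comparing the options that exist today, here is a minimal sketch of how a non-capturing group differs from a lookbehind assertion. The lookbehind is not the proposed "suppress" group and was not suggested in this thread: it only asserts that the prefix is present without consuming it, and Python's re module requires the lookbehind pattern to have a fixed width. The sample text and the simplified `IPV4` pattern are assumptions for illustration.

```python
import re

# Hypothetical sample text and simplified pattern (assumptions for
# illustration only).
text = "::ffff:208.123.4.22"
IPV4 = r"\d{1,3}(?:\.\d{1,3}){3}"

# Non-capturing group: the prefix gets no group number, but it is still
# consumed and therefore still part of group(0).
noncapturing = re.search(r"(?:::ffff:)" + IPV4, text)

# Lookbehind assertion: the prefix must precede the match but is not
# consumed, so it does not appear in group(0).
lookbehind = re.search(r"(?<=::ffff:)" + IPV4, text)

print(noncapturing.group(0))  # ::ffff:208.123.4.22
print(lookbehind.group(0))    # 208.123.4.22
```

Whether this helps in a given configuration-driven tool depends on the full pattern in use; it covers only the case of a fixed-width prefix that should be required but excluded from the match.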
History
Date User Action Args
2018-12-16 20:30:01 steve.newcomb set status: open -> closed; messages: + msg331936; stage: resolved
2018-12-16 00:58:46 steven.daprano set files: + logcheck3.py; nosy: + steven.daprano; messages: + msg331910
2018-12-14 16:06:16 steve.newcomb create