classification
Title: regex A|B : both A and B match, but B is wrongly preferred
Type: behavior Stage:
Components: Regular Expressions Versions: Python 3.1
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: georg.brandl, tzot
Priority: normal Keywords:

Created on 2010-10-18 22:26 by tzot, last changed 2010-10-19 08:11 by tzot. This issue is now closed.

Messages (6)
msg119088 - Author: Χρήστος Γεωργίου (Christos Georgiou) (tzot) * Date: 2010-10-18 22:25
This is based on this StackOverflow answer: http://stackoverflow.com/questions/3957164/3963443#3963443. It also applies to Python 2.6.

Searching for a regular expression that satisfies the mentioned SO question (a regular expression that matches strings with an initial A and/or final Z and returns everything except said initial A and final Z), I discovered something that I consider a bug. I've tried to thoroughly verify that this is not a PEBCAK before reporting the issue here.

Given:

>>> import re
>>> text = 'A***Z'

then:

>>> re.compile('(?<=^A).*(?=Z$)').search(text).group(0) # regex_1
'***'
>>> re.compile('(?<=^A).*').search(text).group(0) # regex_2
'***Z'
>>> re.compile('.*(?=Z$)').search(text).group(0) # regex_3
'A***'
>>> re.compile('(?<=^A).*(?=Z$)|(?<=^A).*').search(text).group(0) # regex_1|regex_2
'***'
>>> re.compile('(?<=^A).*(?=Z$)|.*(?=Z$)').search(text).group(0) # regex_1|regex_3
'A***'
>>> re.compile('(?<=^A).*|.*(?=Z$)').search(text).group(0) # regex_2|regex_3
'A***'
>>> re.compile('(?<=^A).*(?=Z$)|(?<=^A).*|.*(?=Z$)').search(text).group(0) # regex_1|regex_2|regex_3
'A***'

regex_1 returns '***'. Based on the documentation (http://docs.python.org/py3k/library/re.html#regular-expression-syntax), I assert that, likewise, '***' should be returned by:

regex_1|regex_2
regex_1|regex_3
regex_1|regex_2|regex_3

And yet, regex_3 ( ".*(?=Z$)" ) seems to take precedence over both regex_1 and regex_2, even though it's the last alternative.

This still happens when I substitute "(?:regex_n)" for every "regex_n", so it's not a matter of operator precedence.

I really hope that this is a PEBCAK; if so, I apologize for any time anyone loses on the issue; but I really don't think it is.
msg119090 - Author: Χρήστος Γεωργίου (Christos Georgiou) (tzot) * Date: 2010-10-18 22:38
For completeness' sake, I also provide the "(?:regex_n)" results:

>>> text = 'A***Z'
>>> re.compile('(?:(?<=^A).*(?=Z$))').search(text).group(0) # regex_1
'***'
>>> re.compile('(?:(?<=^A).*)').search(text).group(0) # regex_2
'***Z'
>>> re.compile('(?:.*(?=Z$))').search(text).group(0) # regex_3
'A***'
>>> re.compile('(?:(?<=^A).*(?=Z$))|(?:(?<=^A).*)').search(text).group(0) # regex_1|regex_2
'***'
>>> re.compile('(?:(?<=^A).*(?=Z$))|(?:.*(?=Z$))').search(text).group(0) # regex_1|regex_3
'A***'
>>> re.compile('(?:(?<=^A).*)|(?:.*(?=Z$))').search(text).group(0) # regex_2|regex_3
'A***'
>>> re.compile('(?:(?<=^A).*(?=Z$))|(?:(?<=^A).*)|(?:.*(?=Z$))').search(text).group(0) # regex_1|regex_2|regex_3
'A***'
msg119125 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-10-19 07:33
I'm not sure this is valid.  First, I think I have a much easier example:

>>> import re
>>> re.search('bc|abc', 'abc').group()
'abc'

I assume you'd expect this to give 'bc' as well.  However, for a string s, "search" looks for matches looking at s, then looking at s[1:], then s[2:], and so on.  For s, it looks at both branches, and the second branch matches.

This can be inferred from the docs of "search": """Scan through string looking for a location where the regular expression pattern produces a match""". At the first location (index 0), a match is produced by the second branch.
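
The scanning order described above can be sketched roughly as follows. This is only an illustration, not how the re module is implemented internally: first_match_by_position is a hypothetical helper that tries every alternative at index 0, then at index 1, and so on, using the pos argument of a compiled pattern's match() method.

```python
import re

def first_match_by_position(pattern, s):
    """Hypothetical helper: emulate search() by trying each start index in turn."""
    compiled = re.compile(pattern)
    for pos in range(len(s) + 1):
        # match(s, pos) anchors the attempt at index pos; every alternative
        # is tried at this index before the scan moves on to pos + 1.
        m = compiled.match(s, pos)
        if m is not None:
            return pos, m.group()
    return None

# The simplified example: 'bc' cannot match at index 0, but 'abc' can.
print(first_match_by_position('bc|abc', 'abc'))

# regex_1|regex_3 from the report: only regex_3 can match at index 0.
print(first_match_by_position('(?<=^A).*(?=Z$)|.*(?=Z$)', 'A***Z'))
```

The order of the alternatives only decides which branch wins among those that match at the same start index; an earlier start index always wins first.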
msg119126 - Author: Χρήστος Γεωργίου (Christos Georgiou) (tzot) * Date: 2010-10-19 07:50
As I see it, it's more like:

>>> re.search('a.*c|a.*|.*c', 'abc').group()

producing 'bc' instead of 'abc'. Substitute "(?<=^A)" for "a" and "(?=Z$)" for "c" in the pattern above.

In your example, the first part ('bc') does not match the whole string ('abc'). In my example, the first part ('(?<=^A).*(?=Z$)') matches the whole string ('A***Z').
msg119127 - Author: Χρήστος Γεωργίου (Christos Georgiou) (tzot) * Date: 2010-10-19 07:54
Georg, please re-open it. Focus on the difference between example regex_1|regex_2 (both matching; regex_1 is used as it should be), and regex_1|regex_3 (both matching; regex_3 is used incorrectly).
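msg119128 - Author: Χρήστος Γεωργίου (Christos Georgiou) (tzot) * Date: 2010-10-19 08:11
No, my mistake, you were right to close it.

The more explicit version of the explanation: both regex_1 and regex_2 actually start matching at index 1 (the lookbehind consumes nothing, so the match begins after the 'A'), while regex_3 starts matching at index 0. Since search returns the match at the earliest position, regex_3 wins whenever it is one of the alternatives.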
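
A quick way to check the start indices is to look at the start() of each match object. This is a minimal sketch using the three patterns from the original report; the names in the patterns dict are just labels chosen for illustration.

```python
import re

text = 'A***Z'
patterns = {
    'regex_1': r'(?<=^A).*(?=Z$)',
    'regex_2': r'(?<=^A).*',
    'regex_3': r'.*(?=Z$)',
}

# Record where each pattern, searched on its own, begins to match.
starts = {name: re.search(p, text).start() for name, p in patterns.items()}
print(starts)  # {'regex_1': 1, 'regex_2': 1, 'regex_3': 0}

# In any alternation that contains regex_3, the match starts at index 0.
combined = '|'.join(patterns.values())
print(re.search(combined, text).start())  # 0
```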
History
Date User Action Args
2010-10-19 08:11:32 tzot set messages: + msg119128
2010-10-19 07:54:25 tzot set messages: + msg119127
2010-10-19 07:50:13 tzot set messages: + msg119126
2010-10-19 07:33:25 georg.brandl set status: open -> closed; nosy: + georg.brandl; messages: + msg119125; resolution: wont fix
2010-10-18 22:38:15 tzot set messages: + msg119090
2010-10-18 22:26:00 tzot create