classification
Title: re module shows unexpected non-greedy behavior when using groups
Type: behavior
Stage:
Components: Regular Expressions
Versions: Python 3.4, Python 3.3, Python 3.2, Python 2.7

process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: Hendrik.Lemelson, ezio.melotti, mrabarnett, pitrou, tim.peters
Priority: normal
Keywords:

Created on 2013-02-20 17:55 by Hendrik.Lemelson, last changed 2013-02-21 08:51 by Hendrik.Lemelson. This issue is now closed.

Messages (3)
msg182535 - Author: Hendrik Lemelson (Hendrik.Lemelson) Date: 2013-02-20 17:55
The Python 2.7.3 re module behaves strangely when quantifiers are used together with groups:

>>> re.search('(a*)', 'caaaat').groups()
('',)
>>> re.search('(a+)', 'caaaat').groups()
('aaaa',)
>>> re.search('(a{0,5})', 'caaaat').groups()
('',)
>>> re.search('(a{1,5})', 'caaaat').groups()
('aaaa',)

Whenever a quantifier is used that also allows zero occurrences, it loses its greedy behavior. In my eyes this is a defect in the re module. Here is another example with nested groups, where the quantifier on the outer group even prevents the inner groups from matching:

>>> re.search('(a(b*)a)', 'caabbaat').groups()
('aa', '')
>>> re.search('(a(b+)a)', 'caabbaat').groups()
('abba', 'bb')
>>> re.search('(a(b*)a){0,1}', 'caabbaat').groups()
(None, None)
>>> re.search('(a(b+)a){0,1}', 'caabbaat').groups()
(None, None)

It would be great if you could manage to fix this.
Thank you in advance.

Regards
Hendrik Lemelson
msg182539 - Author: Tim Peters (tim.peters) (Python committer) Date: 2013-02-20 18:29
This is how it's supposed to work:  Python's re matches at the leftmost position possible, and _then_ matches the longest possible substring at that position.  When a regexp _can_ match 0 characters, it will match starting at index 0.  So, e.g.,

>>> re.search('(a*)', 'caaaat').span()
(0, 0)

shows that the regexp matches the empty slice 'caaaat'[0:0] (the leftmost position at which it _can_ match), and

>>> re.search('(a(b+)a){0,1}', 'caabbaat').span()
(0, 0)

shows the same.  The groups didn't match anything in this case, because the outer {0,1} said "it's OK if you can't match anything".  Put another group around it:

>>> re.search('((a(b+)a){0,1})', 'caabbaat').groups()
('', None, None)

to see that the regexp as a whole did match the empty string.
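
The leftmost-first rule described above can be checked directly. The following is only a minimal sketch using the standard `re` module. The variable names are illustrative, and the second input string `'aaaat'` is an assumption added here, not part of the original report.

```python
import re

text = 'caaaat'

# search() stops at the leftmost position where the pattern can match at all.
# 'a*' may match zero characters, so it succeeds immediately at index 0.
first_span = re.search('(a*)', text).span()

# At a position where 'a' is actually present, the same quantifier is greedy.
greedy = re.search('(a*)', 'aaaat').group(1)

# Scanning all matches shows that the long run is found later, at index 1.
longest = max(re.finditer('a*', text), key=lambda m: m.end() - m.start())

print(first_span)      # (0, 0)
print(greedy)          # aaaa
print(longest.span())  # (1, 5)
```

So the quantifier stays greedy. It is simply never asked to consume more at index 0, because an empty match there already satisfies the pattern.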
msg182581 - Author: Hendrik Lemelson (Hendrik.Lemelson) Date: 2013-02-21 08:51
Thank you for clarifying this. While it still does not seem really intuitive to me, I can handle the behavior.

To summarize: It is not possible with re to have an optional ({0,1}) group that contains further subgroups, because re considers the empty match at span (0, 0) to already fulfill the constraints for the outer group?
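
The thread closes without a reply to this question. As a hedged sketch, the snippet below suggests that an optional group with subgroups does capture whenever the engine tries it at a position where it can succeed. The literal prefix `'ca'` and the inputs `'cat'` and `'abbaat'` are assumptions introduced here for illustration and do not appear in the original report.

```python
import re

# Assumption: a literal prefix 'ca' pins where the optional group is tried,
# so the group is attempted at index 2, where 'abba' begins.
pinned_groups = re.search('ca(a(b+)a)?', 'caabbaat').groups()

# Same pattern, but the optional part is absent from the input:
# the overall match still succeeds and the groups report None.
skipped_groups = re.search('ca(a(b+)a)?', 'cat').groups()

# The reporter's pattern, unchanged, on an input that starts with 'abba':
# the leftmost position is now one where the group itself can match.
at_start_groups = re.search('(a(b+)a){0,1}', 'abbaat').groups()

print(pinned_groups)    # ('abba', 'bb')
print(skipped_groups)   # (None, None)
print(at_start_groups)  # ('abba', 'bb')
```

In other words, the `(None, None)` result in the report seems to come from where the search starts, not from a limitation on nesting groups inside an optional group.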
History
Date User Action Args
2013-02-21 08:51:10 Hendrik.Lemelson set status: open -> closed; resolution: not a bug; messages: + msg182581
2013-02-20 18:29:08 tim.peters set nosy: + tim.peters; messages: + msg182539
2013-02-20 18:08:45 serhiy.storchaka set versions: + Python 3.2, Python 3.3, Python 3.4
2013-02-20 17:55:44 Hendrik.Lemelson create