Issue 17257: re module shows unexpected non-greedy behavior when using groups

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/61459

classification

Title:	re module shows unexpected non-greedy behavior when using groups
Type:	behavior	Stage:
Components:	Regular Expressions	Versions:	Python 3.2, Python 3.3, Python 3.4, Python 2.7

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	Hendrik.Lemelson, ezio.melotti, mrabarnett, pitrou, tim.peters
Priority:	normal	Keywords:

Created on 2013-02-20 17:55 by Hendrik.Lemelson, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (3)
msg182535 - (view)	Author: Hendrik Lemelson (Hendrik.Lemelson)	Date: 2013-02-20 17:55
When using the Python 2.7.3 re module, it shows strange behavior when quantifiers are used together with groups:

>>> re.search('(a*)', 'caaaat').groups()
('',)
>>> re.search('(a+)', 'caaaat').groups()
('aaaa',)
>>> re.search('(a{0,5})', 'caaaat').groups()
('',)
>>> re.search('(a{1,5})', 'caaaat').groups()
('aaaa',)

Whenever a quantifier is used that also allows zero occurrences, the quantifier loses its greedy behavior. In my eyes this is a defect in the re module.

Below is another example with nested groups, where the quantifier on the outer group even prevents the inner groups from matching:

>>> re.search('(a(b*)a)', 'caabbaat').groups()
('aa', '')
>>> re.search('(a(b+)a)', 'caabbaat').groups()
('abba', 'bb')
>>> re.search('(a(b*)a){0,1}', 'caabbaat').groups()
(None, None)
>>> re.search('(a(b+)a){0,1}', 'caabbaat').groups()
(None, None)

It would be great if you could manage to fix this. Thank you in advance.

Regards
Hendrik Lemelson
msg182539 - (view)	Author: Tim Peters (tim.peters) *	Date: 2013-02-20 18:29
This is how it's supposed to work: Python's re matches at the leftmost position possible, and _then_ matches the longest possible substring at that position. When a regexp _can_ match 0 characters, it will match starting at index 0. So, e.g.,

>>> re.search('(a*)', 'caaaat').span()
(0, 0)

shows that the regexp matches the empty slice 'caaaat'[0:0] (the leftmost position at which it _can_ match), and

>>> re.search('(a(b+)a){0,1}', 'caabbaat').span()
(0, 0)

shows the same. The groups didn't match anything in this case, because the outer {0,1} said "it's OK if you can't match anything". Put another group around it:

>>> re.search('((a(b+)a){0,1})', 'caabbaat').groups()
('', None, None)

to see that the regexp as a whole did match the empty string.
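
The leftmost-first rule described in this message can be checked with a short script. This is only an illustrative sketch; the variable names are made up for the example and are not part of the original report. It contrasts a pattern that may match zero characters with one that must consume at least one, and then shows that `a*` is still greedy once matching starts at a position where an 'a' is available.

```python
import re

text = 'caaaat'

# a* may match zero characters, so the search succeeds immediately at index 0
# with an empty match and never looks further to the right.
m_star = re.search('(a*)', text)
star_span = m_star.span()
star_groups = m_star.groups()

# a+ needs at least one 'a', so index 0 ('c') fails and the search moves on
# to index 1, where it greedily consumes all four 'a' characters.
m_plus = re.search('(a+)', text)
plus_span = m_plus.span()
plus_groups = m_plus.groups()

# The quantifier itself is still greedy: when matching is forced to start at
# index 1 (via the pos argument of Pattern.match), a* takes the whole run.
m_at_1 = re.compile('(a*)').match(text, 1)
at_1_span = m_at_1.span()

print(star_span, star_groups)  # (0, 0) ('',)
print(plus_span, plus_groups)  # (1, 5) ('aaaa',)
print(at_1_span)               # (1, 5)
```

In other words, the empty results come from where the match starts, not from the quantifier turning non-greedy.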
msg182581 - (view)	Author: Hendrik Lemelson (Hendrik.Lemelson)	Date: 2013-02-21 08:51
Thank you for clarifying this. While it still does not seem really intuitive to me, I can handle the behavior. To summarize: it is not possible with re to have an optional ({0,1}) group that contains further subgroups, because re considers (0,0) to already fulfill the constraints for the outer group?
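
As a hedged illustration of the question above (not an answer given in the thread): an optional group with subgroups can still capture, provided something earlier in the pattern fixes where the match begins. The sketch below shows two possible approaches. The helper name `find_abba` and the anchoring prefix `ca` are assumptions chosen for this example only.

```python
import re

def find_abba(text):
    """Treat the whole match as optional instead of the group.

    The pattern has no {0,1}; "no occurrence" is signalled by
    re.search returning None, which is mapped to (None, None) here.
    """
    m = re.search('(a(b+)a)', text)
    return m.groups() if m else (None, None)

found = find_abba('caabbaat')
missing = find_abba('cat')

# Alternative: keep the optional group, but put a mandatory prefix in front
# of it so the match cannot succeed as an empty string at index 0.
anchored = re.search('ca(a(b+)a)?', 'caabbaat').groups()
anchored_missing = re.search('ca(a(b+)a)?', 'cat').groups()

print(found)             # ('abba', 'bb')
print(missing)           # (None, None)
print(anchored)          # ('abba', 'bb')
print(anchored_missing)  # (None, None)
```

Which variant fits depends on whether the surrounding text is known in advance; the first one makes no assumption about what precedes the group.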

History
Date	User	Action	Args
2022-04-11 14:57:42	admin	set	github: 61459
2013-02-21 08:51:10	Hendrik.Lemelson	set	status: open -> closed resolution: not a bug messages: + msg182581
2013-02-20 18:29:08	tim.peters	set	nosy: + tim.peters messages: + msg182539
2013-02-20 18:08:45	serhiy.storchaka	set	versions: + Python 3.2, Python 3.3, Python 3.4
2013-02-20 17:55:44	Hendrik.Lemelson	create