The text about non-greedy matching in the documentation for the re library is misleading.
The docs for py2.7 (https://docs.python.org/2.7/library/re.html) and 3.6 (https://docs.python.org/3.6/library/re.html) say: "The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against <a> b <c>, it will match the entire string, and not just <a>. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only <a>."
The docs for py3.4 (https://docs.python.org/3.4/library/re.html) offer a slightly different example:
"The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'."
However, in practice, if the shortest candidate does not let the rest of the pattern succeed, the non-greedy qualifier consumes more text and can end up matching as much as the greedy version would, see:
>>> import re
>>> a = re.compile(r"<.*?><span>")
>>> a.match("<a> b <c><span>")
<_sre.SRE_Match object; span=(0, 15), match='<a> b <c><span>'>
>>> a.search("<a> b <c><span>")
<_sre.SRE_Match object; span=(0, 15), match='<a> b <c><span>'>
So the '<.*?>' part of the regex matches '<a> b <c>' in this example. I propose adding the following text to the documentation:
"However, note that even the non-greedy version can match additional text; for example, consider the RE '(<.*?>)<d>' matched against '<a> b <c><d>'. The match is successful and the unnamed group contains '<a> b <c>'."
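The proposed wording can be checked with a short script. This is only a sketch: the variable names are mine, and the third pattern (a negated character class) is a common workaround that I am adding for comparison, not something the current docs or the proposal mention.

```python
import re

text = "<a> b <c><d>"

# Non-greedy qualifier with nothing after it: stops at the first '>'.
alone = re.match(r"<.*?>", text).group()

# Same qualifier followed by a literal '<d>': the engine keeps extending
# '.*?' one character at a time until the rest of the pattern can match.
grouped = re.match(r"(<.*?>)<d>", text).group(1)

# Assumed workaround: excluding '>' from the repeated class keeps the
# group from running across tag boundaries.
strict = re.search(r"(<[^>]*>)<d>", text).group(1)

print(alone)    # <a>
print(grouped)  # <a> b <c>
print(strict)   # <c>
```

The second result is the case the proposed sentence describes: "as few characters as possible" only holds relative to what the remainder of the pattern needs in order to match.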