Author ipolcak
Recipients docs@python, ipolcak
Date 2017-01-17.08:04:30
Message-id <1484640271.49.0.239743289753.issue29291@psf.upfronthosting.co.za>
In-reply-to
Content
The text about non-greedy matching in the documentation for the re library is misleading.

The docs for py2.7 (https://docs.python.org/2.7/library/re.html) and 3.6 (https://docs.python.org/3.6/library/re.html) say: "The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against <a> b <c>, it will match the entire string, and not just <a>. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only <a>."

The docs for py3.4 (https://docs.python.org/3.4/library/re.html) offer a slightly different example:
"The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'."
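
For reference, the documented behaviour can be reproduced with a short check. This is only a sketch of the two quoted examples; the variable names are illustrative and do not come from the docs.

```python
import re

# Example strings taken from the documentation passages quoted above.
short_text = "<a> b <c>"
html_text = "<H1>title</H1>"

greedy = re.compile(r"<.*>")   # '*' is greedy: takes as much as possible
lazy = re.compile(r"<.*?>")    # '*?' is non-greedy: takes as little as possible

greedy_short = greedy.match(short_text).group()
lazy_short = lazy.match(short_text).group()
greedy_html = greedy.match(html_text).group()
lazy_html = lazy.match(html_text).group()

print(greedy_short)  # <a> b <c>
print(lazy_short)    # <a>
print(greedy_html)   # <H1>title</H1>
print(lazy_html)     # <H1>
```

In both cases nothing follows the closing '>' in the pattern, so the shortest candidate is always accepted.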

However, in reality, if the minimal non-greedy match does not let the rest of the pattern succeed, it might fall back to matching as much as the greedy version, see:

>>> import re
>>> a = re.compile(r"<.*?><span>")
>>> a.match("<a> b <c><span>")
<_sre.SRE_Match object; span=(0, 15), match='<a> b <c><span>'>
>>> a.search("<a> b <c><span>")
<_sre.SRE_Match object; span=(0, 15), match='<a> b <c><span>'>

So the '<.*?>' part of the regex matches '<a> b <c>' in this example. I propose to add to the documentation the following text:

"However, note that even the non-greedy version can match additional text, for example consider the RE '(<.*?>)<d>' to be matched against '<a> b <c><d>'. The match is successful and the unnamed group contains '<a> b <c>'."
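
A quick check of the behaviour the proposed note describes, using the non-greedy pattern '(<.*?>)<d>'. The second pattern, with a negated character class, is not part of this report; it is shown only as one possible way to keep the group from growing past the first '>', under the assumption that the tag itself never contains '>'.

```python
import re

text = "<a> b <c><d>"

# Non-greedy group followed by more pattern: '.*?' keeps expanding
# until the trailing '<d>' can match, so the group spans two tags.
lazy_match = re.search(r"(<.*?>)<d>", text)
lazy_group = lazy_match.group(1)
print(lazy_group)  # <a> b <c>

# Assumed alternative (not from the report): '[^>]*' cannot cross a '>',
# so the group can only ever hold a single tag.
strict_match = re.search(r"(<[^>]*>)<d>", text)
strict_group = strict_match.group(1)
print(strict_group)  # <c>
```

The non-greedy qualifier only prefers the shortest candidate; it still extends whenever that is the only way for the whole pattern to match.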
History
Date                 User     Action  Args
2017-01-17 08:04:31  ipolcak  set     recipients: + ipolcak, docs@python
2017-01-17 08:04:31  ipolcak  set     messageid: <1484640271.49.0.239743289753.issue29291@psf.upfronthosting.co.za>
2017-01-17 08:04:31  ipolcak  link    issue29291 messages
2017-01-17 08:04:30  ipolcak  create