Message 285619 - Python tracker


Author	ipolcak
Recipients	docs@python, ipolcak
Date	2017-01-17.08:04:30
Message-id	<1484640271.49.0.239743289753.issue29291@psf.upfronthosting.co.za>
In-reply-to

Content

The text about non-greedy matching in the documentation for the re library is misleading.

The docs for py2.7 (https://docs.python.org/2.7/library/re.html) and 3.6 (https://docs.python.org/3.6/library/re.html) say: "The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against <a> b <c>, it will match the entire string, and not just <a>. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only <a>."

The docs for py3.4 (https://docs.python.org/3.4/library/re.html) offer a slightly different example:
"The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'."
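
The behaviour described in these excerpts can be checked directly. This is a minimal sketch using the two sample strings from the quoted docs; the variable names are my own.

```python
import re

# Sample strings taken from the documentation excerpts quoted above.
samples = ["<a> b <c>", "<H1>title</H1>"]

greedy = re.compile(r"<.*>")   # '*' takes as much text as possible
lazy = re.compile(r"<.*?>")    # '*?' takes as little text as possible

results = {}
for s in samples:
    # Record what each pattern matches when nothing follows it in the RE.
    results[s] = (greedy.match(s).group(), lazy.match(s).group())
    print(repr(s), "->", results[s])
```

When nothing follows the qualifier in the pattern, the non-greedy form stops at the first '>' as the docs say.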

However, in reality, if the non-greedy match does not succeed, it may fall back to the greedy match:

>>> import re
>>> a = re.compile(r"<.*?><span>")
>>> a.match("<a> b <c><span>")
<_sre.SRE_Match object; span=(0, 15), match='<a> b <c><span>'>
>>> a.search("<a> b <c><span>")
<_sre.SRE_Match object; span=(0, 15), match='<a> b <c><span>'>

So the '<.*?>' part of the regex matches '<a> b <c>' in this example. I propose adding the following text to the documentation:

"However, note that even the non-greedy version can match additional text. For example, consider the RE '(<.*?>)<d>' matched against '<a> b <c><d>'. The match succeeds and the unnamed group contains '<a> b <c>'."
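
A quick check of the proposed wording, written with the non-greedy qualifier. The last pattern, using a negated character class, is my own addition for contrast and does not come from the docs.

```python
import re

text = "<a> b <c><d>"

# Non-greedy group followed by a literal '<d>': '.*?' is extended one
# character at a time until the rest of the pattern can match.
m = re.match(r"(<.*?>)<d>", text)
group_lazy = m.group(1)
print(group_lazy)

# Assumption for contrast: '[^>]*' can never cross a '>', so the group is
# limited to a single tag regardless of what follows it in the pattern.
strict = re.match(r"(<[^>]*>)<d>", text)          # no match at position 0
strict_search = re.search(r"(<[^>]*>)<d>", text)  # finds the tag before '<d>'
print(strict, strict_search.group(1))
```

If the intent is "exactly one tag", restricting the character class seems more reliable than relying on the non-greedy qualifier alone.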

History
Date	User	Action	Args
2017-01-17 08:04:31	ipolcak	set	recipients: + ipolcak, docs@python
2017-01-17 08:04:31	ipolcak	set	messageid: <1484640271.49.0.239743289753.issue29291@psf.upfronthosting.co.za>
2017-01-17 08:04:31	ipolcak	link	issue29291 messages
2017-01-17 08:04:30	ipolcak	create