classification
Title: Misleading text in the documentation of re library for non-greedy match
Type: behavior Stage:
Components: Documentation Versions: Python 3.6, Python 3.4, Python 2.7
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: docs@python Nosy List: docs@python, ipolcak, rhettinger, serhiy.storchaka, xiang.zhang
Priority: normal Keywords:

Created on 2017-01-17 08:04 by ipolcak, last changed 2017-01-17 09:11 by rhettinger. This issue is now closed.

Messages (4)
msg285619 - Author: (ipolcak) Date: 2017-01-17 08:04
The text about non-greedy match in the documentation for re library is misleading.

The docs for py2.7 (https://docs.python.org/2.7/library/re.html) and 3.6 (https://docs.python.org/3.6/library/re.html) say: "The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against <a> b <c>, it will match the entire string, and not just <a>. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only <a>."

The docs for py3.4 (https://docs.python.org/3.4/library/re.html) offer a slightly different example:
"The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'."

However, in reality, if the non-greedy match is not successful, it might fall back to the greedy match:

>>> import re
>>> a = re.compile(r"<.*?><span>")
>>> a.match("<a> b <c><span>")
<_sre.SRE_Match object; span=(0, 15), match='<a> b <c><span>'>
>>> a.search("<a> b <c><span>")
<_sre.SRE_Match object; span=(0, 15), match='<a> b <c><span>'>

So the '<.*?>' part of the regex matches '<a> b <c>' in this example. I propose to add to the documentation the following text:

"However, note that even the non-greedy version can match additional text, for example consider the RE '(<.*?>)<d>' to be matched against '<a> b <c><d>'. The match is successful and the unnamed group contains '<a> b <c>'."
msg285624 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-01-17 08:44
The documentation doesn't look incorrect to me. The non-greedy match doesn't fall back to the greedy match; it always matches as few characters as *possible* for the whole pattern to succeed. For example, a.match("<a> b <c><span><d> e <f><span>") matches "<a> b <c><span>", not "<a> b <c><span><d> e <f><span>".
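
To illustrate the point above, here is a minimal sketch that runs the greedy and non-greedy variants against the same input. The capturing group and the variable names (text, lazy, greedy) are added purely for illustration and are not part of the original report.

```python
import re

text = "<a> b <c><span><d> e <f><span>"

# Capture what the quantified part consumes in each variant.
# The parentheses are an illustrative addition to the reporter's pattern.
lazy = re.match(r"<(.*?)><span>", text)
greedy = re.match(r"<(.*)><span>", text)

# Non-greedy: expands one character at a time and stops at the first
# position where the rest of the pattern ("><span>") can match.
print(lazy.group(0))    # <a> b <c><span>
print(lazy.group(1))    # a> b <c

# Greedy: consumes as much as possible, then backtracks to the last
# position where "><span>" can match.
print(greedy.group(0))  # <a> b <c><span><d> e <f><span>
```

The non-greedy quantifier still has to grow past the first '>' because the overall pattern cannot succeed there; it simply stops at the earliest point where it can.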
msg285626 - Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2017-01-17 08:52
This doesn't look like a bug in the doc to me.
msg285627 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-01-17 09:11
I concur with Serhiy and Xiang that the docs are correct as-is.  Also IMO, the proposed rewording will cause more confusion than it cures.  Since this issue hasn't previously come up in the rather long life span of the re module, it suggests that most folks are getting the understanding they need from the existing docs.

I think it is the job for regex tutorials (rather than the main docs) to show when to use "<.*?><span>" versus "<[^>]*><span>".
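
As a rough sketch of the difference between those two patterns, the following compares them on the input from the original report. The variable names are hypothetical and chosen only for this example.

```python
import re

text = "<a> b <c><span>"

# '.' can match '>', so the non-greedy part may stretch across several tags.
lazy = re.search(r"<.*?><span>", text)

# '[^>]' can never match '>', so the first tag cannot extend past its own '>'.
strict_search = re.search(r"<[^>]*><span>", text)
strict_match = re.match(r"<[^>]*><span>", text)

print(lazy.group())           # <a> b <c><span>
print(strict_search.group())  # <c><span>
print(strict_match)           # None
```

The negated character class is the usual choice when the intent is "a single tag immediately followed by <span>", while the non-greedy form accepts any text between the first '<' and the nearest '><span>'.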
History
Date                 User              Action  Args
2017-01-17 09:11:37  rhettinger        set     status: open -> closed; nosy: + rhettinger; messages: + msg285627; resolution: not a bug
2017-01-17 08:52:10  xiang.zhang       set     nosy: + xiang.zhang; messages: + msg285626
2017-01-17 08:44:38  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg285624
2017-01-17 08:04:31  ipolcak           create