Issue 29291: Misleading text in the documentation of re library for non-greedy match

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/73477

classification

Title:	Misleading text in the documentation of re library for non-greedy match
Type:	behavior	Stage:
Components:	Documentation	Versions:	Python 3.6, Python 3.4, Python 2.7

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:	docs@python	Nosy List:	docs@python, ipolcak, rhettinger, serhiy.storchaka, xiang.zhang
Priority:	normal	Keywords:

Created on 2017-01-17 08:04 by ipolcak, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (4)
msg285619 - Author: (ipolcak)	Date: 2017-01-17 08:04
The text about non-greedy matching in the documentation for the re library is misleading. The docs for py2.7 (https://docs.python.org/2.7/library/re.html) and 3.6 (https://docs.python.org/3.6/library/re.html) say:

"The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against <a> b <c>, it will match the entire string, and not just <a>. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only <a>."

The docs for py3.4 (https://docs.python.org/3.4/library/re.html) offer a slightly different example:

"The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'."

However, in reality, if the non-greedy match is not successful, it might fall back to the greedy match, see:

>>> import re
>>> a = re.compile(r"<.*?><span>")
>>> a.match("<a> b <c><span>")
<_sre.SRE_Match object; span=(0, 15), match='<a> b <c><span>'>
>>> a.search("<a> b <c><span>")
<_sre.SRE_Match object; span=(0, 15), match='<a> b <c><span>'>

So the '<.*?>' part of the regex matches '<a> b <c>' in this example.

I propose to add to the documentation the following text:

"However, note that even the non-greedy version can match additional text, for example consider the RE '(<.*>)<d>' to be matched against '<a> b <c><d>'. The match is successful and the unnamed group contains '<a> b <c>'."
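As a rough illustration of the behaviour under discussion, the sketch below compares the greedy pattern `<.*>` with the non-greedy `<.*?>` on the string from the quoted docs, then adds a required suffix. The grouped pattern uses the non-greedy form `(<.*?>)<d>`, which is an assumption about what the proposed wording intends; the variable names are illustrative only.

```python
import re

text = "<a> b <c>"

# Greedy: '.*' takes as much as it can while still letting the final '>' match.
greedy = re.match(r"<.*>", text).group()

# Non-greedy: '.*?' stops at the first '>' that lets the whole pattern succeed.
lazy = re.match(r"<.*?>", text).group()

# With a required suffix, the first '>' is not followed by '<d>', so the
# non-greedy quantifier keeps extending until the whole pattern can match.
# (Assumption: the non-greedy form of the proposed example is intended.)
grouped = re.match(r"(<.*?>)<d>", "<a> b <c><d>").group(1)

print(greedy)   # <a> b <c>
print(lazy)     # <a>
print(grouped)  # <a> b <c>
```

The last result is still the shortest text that allows the overall pattern to match from that starting position, which is the sense of "as few characters as possible" in the docs.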
msg285624 - Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-01-17 08:44
The documentation doesn't look incorrect to me. The non-greedy match doesn't fall back to the greedy match; it always matches as few characters as possible. For example, a.match("<a> b <c><span><d> e <f><span>") matches "<a> b <c><span>", not "<a> b <c><span><d> e <f><span>".
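The example in this message can be checked with a short sketch. It assumes the reporter's pattern is `<.*?><span>`; the greedy counterpart `<.*><span>` is not part of the thread and is added here only for contrast.

```python
import re

text = "<a> b <c><span><d> e <f><span>"

# The reporter's non-greedy pattern (assumed to be '<.*?><span>').
lazy_pattern = re.compile(r"<.*?><span>")

# Greedy counterpart, introduced here for comparison only.
greedy_pattern = re.compile(r"<.*><span>")

# Non-greedy stops at the first place where '><span>' can follow.
lazy_match = lazy_pattern.match(text).group()

# Greedy runs to the last place where '><span>' can follow.
greedy_match = greedy_pattern.match(text).group()

print(lazy_match)    # <a> b <c><span>
print(greedy_match)  # <a> b <c><span><d> e <f><span>
```

If the non-greedy quantifier really fell back to greedy matching, both results would be the full string.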
msg285626 - Author: Xiang Zhang (xiang.zhang) *	Date: 2017-01-17 08:52
This doesn't look like a bug in the doc to me.
msg285627 - Author: Raymond Hettinger (rhettinger) *	Date: 2017-01-17 09:11
I concur with Serhiy and Xiang that the docs are correct as-is. Also IMO, the proposed rewording will cause more confusion than it cures. Since this issue hasn't previously come up in the rather long life span of the re module, it suggests that most folks are getting the understanding they need from the existing docs. I think it is the job of regex tutorials (rather than the main docs) to show when to use "<.*?><span>" versus "<[^>]*><span>".
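A minimal sketch of the distinction between the two patterns, assuming they are `<.*?><span>` and `<[^>]*><span>` and reusing the reporter's test string. The negated character class cannot cross a `>`, so it only matches a single tag immediately followed by `<span>`.

```python
import re

text = "<a> b <c><span>"

lazy = re.compile(r"<.*?><span>")      # '.' may consume '>' characters
strict = re.compile(r"<[^>]*><span>")  # '[^>]' can never consume '>'

# The non-greedy pattern extends across '<a> b <c>' to reach '><span>'.
lazy_found = lazy.search(text).group()

# Anchored at position 0, the strict pattern fails: '<a>' is not followed by '<span>'.
strict_at_start = strict.match(text)

# Searching finds the one tag that is directly followed by '<span>'.
strict_hit = strict.search(text)
strict_found = strict_hit.group()

print(lazy_found)         # <a> b <c><span>
print(strict_at_start)    # None
print(strict_found)       # <c><span>
print(strict_hit.span())  # (6, 15)
```

The character-class form is the usual choice when the intent is "one tag, then `<span>`", while the non-greedy form expresses "the shortest run of any characters that still lets the rest match".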

History
Date	User	Action	Args
2022-04-11 14:58:42	admin	set	github: 73477
2017-01-17 09:11:37	rhettinger	set	status: open -> closed nosy: + rhettinger messages: + msg285627 resolution: not a bug
2017-01-17 08:52:10	xiang.zhang	set	nosy: + xiang.zhang messages: + msg285626
2017-01-17 08:44:38	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg285624
2017-01-17 08:04:31	ipolcak	create