Issue 19055: Clarify docs for re module: why * does not match as many repetitions as possible.

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/63255

classification

Title:	Clarify docs for re module: why * does not match as many repetitions as possible.
Type:	enhancement	Stage:	resolved
Components:	Documentation	Versions:	Python 3.4, Python 3.5

process

Status:	closed	Resolution:	works for me
Dependencies:		Superseder:
Assigned To:	docs@python	Nosy List:	Jason.Stumpf, akira, dbenbenn, docs@python, ezio.melotti, janzert, mrabarnett, r.david.murray, serhiy.storchaka
Priority:	normal	Keywords:	easy, patch

Created on 2013-09-19 22:28 by Jason.Stumpf, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
re-docs-repetitions.patch	akira, 2014-08-11 09:29	make Python documentation for '\|' more similar to PCRE and clarify '*' definition	review

Messages (11)
msg198116 - (view)	Author: Jason Stumpf (Jason.Stumpf)	Date: 2013-09-19 22:28
>>> re.match('(a\|ab)',('aba')).group(0) 'a' According to the documentation, the should match as many repetitions as possible. 2 are possible, it matches 1. Reversing the order of the operands of \| changes the behaviour. >>> re.match('(ab\|a)*',('aba')).group(0) 'aba'
msg198119 - (view)	Author: janzert (janzert) *	Date: 2013-09-19 22:52
The documentation on the \| operator in the re module pretty explicitly covers this. http://docs.python.org/2/library/re.html "A\|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '\|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '\|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '\|' operator is never greedy."
msg198120 - (view)	Author: Jason Stumpf (Jason.Stumpf)	Date: 2013-09-19 23:18
Even with the documentation to \|, the documentation to * is wrong. >>> re.match('(a\|ab)c',('abac')).group(0) 'abac' From the doc: In general, if a string p matches A and another string q matches B, the string pq will match AB. Since '(a\|ab)c' matches 'abac', and 'c' matches 'c', that means '(a\|ab)' matches 'aba'. It does so with 2 repetitions. Thus, in the example from my initial post, it was not matching with as many repetitions as possible. I think what you mean is that attempts to match again after each match of the preceding regular expression.
msg198121 - (view)	Author: Jason Stumpf (Jason.Stumpf)	Date: 2013-09-19 23:29
Sorry, that implication was backwards. I don't think I can prove from just the documentation that '(a\|ab)' can match 'aba' in certain contexts. If the docs said: " attempts to match again after each match of the preceding regular expression." I think it would describe the observed behaviour.
msg198122 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2013-09-20 00:00
The behaviour is correct. Here's a summary of what's happening:- First iteration of the repeated group: Try the first branch. Can match "a". Second iteration of the repeated group: Try the first branch. Can't match "a". Try the second branch. Can't match "ab". Continue with the remainder of the pattern. Can't match "c", therefore backtrack to the first iteration of the repeated group: Try the second branch. Can match "ab". Second iteration of the repeated group: Try the first branch. Can match "a". Third iteration of the repeated group: Try the first branch. Can't match "a". Try the second branch. Can't match "ab". Continue with the remainder of the pattern. Can match "c". Reached the end of the pattern. It has matched "abac".
msg198123 - (view)	Author: Jason Stumpf (Jason.Stumpf)	Date: 2013-09-20 00:02
I understand what's happening, but that is not what the documentation describes. If the behaviour is correct, the documentation is incorrect.
msg198126 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-09-20 01:36
The documentation is correct and unambiguous. Regular expressions just aren't very intuitive. The documentation says "Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible." "as many repetitions of the preceding RE" means that: (a\|ab)* is equivalent to (a\|ab)(a\|ab)(a\|ab)... where "..." represents "add copies of the RE until it doesn't match". Then you have to look at the documentation of '\|' to see what it matches, and that documentation (already quoted) explains why the expanded pattern only matches 'a' in 'aba'. Perhaps it would be clearer if it read "Causes the RE to be evaluated as if there were zero or more repetitions of the preceding RE, as many repetitions as produce matches"?
msg198152 - (view)	Author: Jason Stumpf (Jason.Stumpf)	Date: 2013-09-20 17:06
I like that clearer description. "as produce matches" is more correct than "as possible".
msg224479 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-08-01 09:38
I think this issue can be closed. Behavior and documentation are correct and match one other. Nothing to do with it.
msg224781 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2014-08-04 23:10
I agree.
msg225180 - (view)	Author: Akira Li (akira) *	Date: 2014-08-11 09:27
tl;dr: added patch that clarifies Python re behavior. Please, review. --- The documented behavior is not clear: why (a\|ab)* is not equivalent to (a\|ab)(a\|ab) for aba if the docs say "as many repetitions as are possible"? And it is not obvious (it is not the only possible behavior) e.g., POSIX [1] expects the longest match, PCRE [2] group may be atomic (/possesive quantifiers), or there is ungreedy repetition: $ echo abac \| grep -o '$a\\|ab$* # Basic Regular Expression aba $ echo abac \| grep -Eo '(a\|ab)' # Extended Regular Expression aba $ echo abac \| grep -Po '(a\|ab)' # PCRE (like Python regexes) a a $ echo abac \| grep -Po '(a\|ab)(a\|ab)' # PCRE (two repeats) aba $ echo abac \| grep -Po '(a\|ab)c' # PCRE (re-eval to match the rest) abac $ echo abAc \| grep -iPo '(a\|ab)+c' # PCRE (possessive quantifiers) Ac $ echo abac \| grep -Po '(a\|ab)?' # PCRE (ungreedy, zero repeats) # matches nothing where BREs, EREs are defined in [1] that says: If the pattern permits a variable number of matching characters and thus there is more than one such sequence starting at that point, the longest such sequence is matched. and: Consistent with the whole match being the longest of the leftmost matches, each subpattern, from left to right, shall match the longest possible string.* Python regexes work like PCRE [2] that says: The matching process tries each alternative in turn, from left to right, and the first one that succeeds is used. If the alternatives are within a subpattern (defined below), "succeeds" means matching the rest of the main pattern as well as the alternative in the subpattern. It explains why (a\|ab)* matches only a in aba ("the first one that succeeds") and why at the same time (a\|ab)c matches abac: (a\|ab) may match ab in the latter case because the main pattern would fail otherwise i.e., the advanced definition of "succeeds" is used: ""succeeds" means matching the rest of the main pattern ...". Python docs [3] are similar but do not contain the "advanced" "succeeds" definition: REs separated by ``'\|'`` are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once ``A`` matches, ``B`` will not be tested further, even if it would produce a longer overall match. In other words, the ``'\|'`` operator is never greedy. It explains why (a\|ab) matches a in aba. '' behavior [2]: By default, the quantifiers are "greedy", that is, they match as much as possible (up to the maximum number of permitted times), without causing the rest of the pattern to fail. and again Python docs [3] are similar: ``''`` Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. It implies that (a\|ab) should be equivalent to (a\|ab)(a\|ab) to match aba -- TWO REPETITIONS ARE MORE THAN ONE: "as many repetitions as are possible". But it matches only 'a' i.e., it works as (a\|ab) -- only one repetition instead of two. In reality, (a\|ab)* along works like a. I've uploaded a documentation patch that makes Python documentation for '\|' more similar to PCRE and clarifies '' definition as R. David Murray suggested in msg198126. Please, review. [1] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html [2] http://man7.org/linux/man-pages/man3/pcrepattern.3.html [3] http://hg.python.org/cpython/file/18a311479e8b/Doc/library/re.rst

History
Date	User	Action	Args
2022-04-11 14:57:51	admin	set	github: 63255
2014-08-11 09:29:02	akira	set	files: + re-docs-repetitions.patch
2014-08-11 09:28:54	akira	set	files: - re-docs-repetitions.patch
2014-08-11 09:27:31	akira	set	files: + re-docs-repetitions.patch components: - Regular Expressions versions: + Python 3.5, - Python 2.7, Python 3.3 keywords: + patch nosy: + akira title: Regular expressions: * does not match as many repetitions as possible. -> Clarify docs for re module: why * does not match as many repetitions as possible. messages: + msg225180
2014-08-04 23:10:07	ezio.melotti	set	status: open -> closed resolution: works for me messages: + msg224781 stage: needs patch -> resolved
2014-08-01 09:38:36	serhiy.storchaka	set	messages: + msg224479
2013-10-05 23:42:58	ezio.melotti	set	nosy: + docs@python assignee: docs@python components: + Documentation keywords: + easy type: behavior -> enhancement stage: needs patch
2013-09-20 17:06:27	Jason.Stumpf	set	messages: + msg198152
2013-09-20 01:36:38	r.david.murray	set	nosy: + r.david.murray messages: + msg198126
2013-09-20 00:02:27	Jason.Stumpf	set	messages: + msg198123
2013-09-20 00:00:51	mrabarnett	set	messages: + msg198122
2013-09-19 23:29:02	Jason.Stumpf	set	messages: + msg198121
2013-09-19 23:18:07	Jason.Stumpf	set	messages: + msg198120
2013-09-19 22:52:06	janzert	set	nosy: + janzert messages: + msg198119
2013-09-19 22:40:19	serhiy.storchaka	set	versions: + Python 3.3, Python 3.4
2013-09-19 22:39:56	serhiy.storchaka	set	nosy: + serhiy.storchaka
2013-09-19 22:34:34	dbenbenn	set	nosy: + dbenbenn
2013-09-19 22:29:18	Jason.Stumpf	set	nosy: + ezio.melotti, mrabarnett components: + Regular Expressions
2013-09-19 22:28:43	Jason.Stumpf	create