Issue 23532: add example of 'first match wins' to regex "|" documentation?

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/67720

classification

Title:	add example of 'first match wins' to regex "|" documentation?
Type:	enhancement	Stage:	resolved
Components:	Documentation, Regular Expressions	Versions:	Python 3.7, Python 3.6, Python 3.5, Python 2.7

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:	docs@python	Nosy List:	Mark.Shannon, Rick Otten, docs@python, ezio.melotti, mrabarnett, r.david.murray, rhettinger, serhiy.storchaka
Priority:	normal	Keywords:

Created on 2015-02-26 22:55 by Rick Otten, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (8)
msg236715 - (view)	Author: Rick Otten (Rick Otten)	Date: 2015-02-26 23:00
The documentation states that "|" parsing goes from left to right. This doesn't seem to be true when spaces (or \s) are involved.

Example:

In [40]: mystring
Out[40]: 'rwo incorporated'

In [41]: re.sub('incorporated| inc|llc|corporation|corp| co', '', mystring)
Out[41]: 'rwoorporated'

In this case " inc" was processed before "incorporated".

If I take the space out:

In [42]: re.sub('incorporated|inc|llc|corporation|corp| co', '', mystring)
Out[42]: 'rwo '

"incorporated" is processed first.

If I put a space with each, then " incorporated" is processed first:

In [43]: re.sub(' incorporated| inc|llc|corporation|corp| co', '', mystring)
Out[43]: 'rwo'

And if I use \s instead of a space, it is processed first:

In [44]: re.sub('incorporated|\sinc|llc|corporation|corp| co', '', mystring)
Out[44]: 'rwoorporated'
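The four calls in the report can be rerun as a standalone script. This is a minimal sketch: the names `patterns` and `results` are introduced here for illustration, and the pattern containing `\s` is written as a raw string to avoid an invalid-escape warning on newer Python versions.

```python
import re

mystring = 'rwo incorporated'

# The four patterns from the report, in the order they were tried.
patterns = [
    'incorporated| inc|llc|corporation|corp| co',
    'incorporated|inc|llc|corporation|corp| co',
    ' incorporated| inc|llc|corporation|corp| co',
    r'incorporated|\sinc|llc|corporation|corp| co',
]

# Remove every match of each pattern from the same input string.
results = [re.sub(p, '', mystring) for p in patterns]

for p, r in zip(patterns, results):
    print(repr(p), '->', repr(r))
```

The printed results should agree with Out[41] through Out[44] in the report.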
msg236716 - (view)	Author: Mark Shannon (Mark.Shannon) *	Date: 2015-02-26 23:13
This looks like the expected behaviour to me. re.sub matches the leftmost occurrence, and the regular expression is greedy, so (x|xy) will always match xy if it can.
msg236718 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2015-02-27 00:07
@Mark is correct, it's not a bug.

In the first example:

It tries to match each alternative at position 0. Failure.
It tries to match each alternative at position 1. Failure.
It tries to match each alternative at position 2. Failure.
It tries to match each alternative at position 3. Success. ' inc' matches.

In the second example:

It tries to match each alternative at position 0. Failure.
It tries to match each alternative at position 1. Failure.
It tries to match each alternative at position 2. Failure.
It tries to match each alternative at position 3. Failure.
It tries to match each alternative at position 4. Success. 'incorporated' matches.

('inc' is a later alternative; it's considered only if the earlier alternatives have failed to match at that position.)
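The position-by-position walk described above can be imitated in a few lines. This is only an illustrative model, not how the re engine is implemented: the helper `first_match_trace` is hypothetical, and it treats each alternative as a separate pattern anchored at a given position via the `pos` argument of a compiled pattern's `match` method.

```python
import re

def first_match_trace(alternatives, text):
    """Try every alternative, left to right, at each position in turn.

    Returns (position, alternative, matched_text) for the first success,
    or None if nothing matches anywhere.
    """
    compiled = [re.compile(a) for a in alternatives]
    for pos in range(len(text) + 1):
        for alt, rx in zip(alternatives, compiled):
            m = rx.match(text, pos)  # match must start exactly at pos
            if m:
                return pos, alt, m.group()
    return None

text = 'rwo incorporated'
with_space = ['incorporated', ' inc', 'llc', 'corporation', 'corp', ' co']
without_space = ['incorporated', 'inc', 'llc', 'corporation', 'corp', ' co']

first = first_match_trace(with_space, text)
second = first_match_trace(without_space, text)
print(first)
print(second)

# Cross-check against the real engine using the joined pattern.
real = re.search('|'.join(with_space), text)
print(real.start(), repr(real.group()))
```

For these two alternative lists the model agrees with the walk-through: the first succeeds at position 3 with ' inc', the second at position 4 with 'incorporated'.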
msg236720 - (view)	Author: Rick Otten (Rick Otten)	Date: 2015-02-27 00:36
Can the documentation be updated to make this more clear? I see now where the clause "As the target string is scanned, ..." is describing what you have listed here. A coworker and I both read the description several times and missed that.

I thought it first tried "incorporated" against the whole string, then tried " inc" against the whole string, etc. In fact it was trying each of "incorporated", " inc", and the others against the first position of the string, and then again against the second position.

Since, for my particular use case, I want to force each alternative to be tried against the whole string before the next one is tried, I'll do a series of re.subs instead of trying to do them all in one.

It makes sense now and is easy to fix. Thanks for looking at it and explaining what is happening more clearly. It was really not obvious. I tried at least 100 variations and wasn't seeing the pattern.
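The "series of re.subs" workaround mentioned above might look like the following sketch. The function name `strip_in_priority_order` and the list name `suffixes` are assumptions made for illustration; the thread does not show the reporter's actual code.

```python
import re

def strip_in_priority_order(text, patterns):
    """Apply each pattern to the whole string before moving to the next one."""
    for p in patterns:
        text = re.sub(p, '', text)
    return text

# Same alternatives as in the report, now applied one at a time.
suffixes = ['incorporated', ' inc', 'llc', 'corporation', 'corp', ' co']

cleaned = strip_in_priority_order('rwo incorporated', suffixes)
print(repr(cleaned))
```

Here "incorporated" is removed from the whole string first, so " inc" no longer has anything to match; the trailing space is left in place, as in Out[42] of the original report.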
msg236725 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-02-27 02:18
The thing is, what you describe is fundamental to how regular expressions work. I'm not sure it makes sense to add a specific mention of it to the '|' docs, since it applies to all regexes.
msg236821 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2015-02-27 19:18
Not quite all. POSIX regexes will always look for the longest match, so the order of the alternatives doesn't matter, i.e. x|xy would give the same result as xy|x.
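For contrast with the POSIX behaviour described above, here is a small check of how Python's re module treats the same two patterns. The variable names are illustrative only.

```python
import re

text = 'xy'

# In Python's re, the first alternative that matches at a position wins.
short_first = re.match('x|xy', text).group()
long_first = re.match('xy|x', text).group()
print(repr(short_first), repr(long_first))

# A later alternative is still tried if the rest of the pattern fails:
# 'x' cannot be followed by end-of-string here, so the engine moves on to 'xy'.
anchored = re.match('(?:x|xy)$', text).group()
print(repr(anchored))
```

With re, the order of the alternatives changes the result, which is exactly the point where it differs from a leftmost-longest (POSIX-style) matcher.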
msg295128 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-06-04 15:57
From the documentation:

"""
As the target string is scanned, REs separated by ``'|'`` are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once ``A`` matches, ``B`` will not be tested further, even if it would produce a longer overall match. In other words, the ``'|'`` operator is never greedy.
"""

I think this completely describes the behavior.
msg295129 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2017-06-04 16:19
I concur with Serhiy that the docs correctly and completely describe the behavior.

History
Date	User	Action	Args
2022-04-11 14:58:13	admin	set	github: 67720
2017-10-11 14:46:48	berker.peksag	set	status: open -> closed stage: resolved
2017-06-04 16:19:21	rhettinger	set	status: pending -> open nosy: + rhettinger messages: + msg295129
2017-06-04 15:57:51	serhiy.storchaka	set	status: open -> pending nosy: + serhiy.storchaka messages: + msg295128 resolution: not a bug
2016-10-16 22:32:17	serhiy.storchaka	set	type: behavior -> enhancement components: + Regular Expressions versions: + Python 3.5, Python 3.6, Python 3.7
2015-02-27 19:18:42	mrabarnett	set	messages: + msg236821
2015-02-27 02:18:20	r.david.murray	set	title: regex "|" behavior differs from documentation -> add example of 'first match wins' to regex "|" documentation? nosy: + r.david.murray, docs@python messages: + msg236725 assignee: docs@python components: + Documentation, - Regular Expressions
2015-02-27 00:36:49	Rick Otten	set	messages: + msg236720
2015-02-27 00:07:32	mrabarnett	set	messages: + msg236718
2015-02-26 23:13:00	Mark.Shannon	set	nosy: + Mark.Shannon messages: + msg236716
2015-02-26 23:00:23	Rick Otten	set	messages: + msg236715
2015-02-26 22:55:54	Rick Otten	create