classification
Title: add example of 'first match wins' to regex "|" documentation?
Type: enhancement Stage: resolved
Components: Documentation, Regular Expressions Versions: Python 3.7, Python 3.6, Python 3.5, Python 2.7
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: docs@python Nosy List: Mark.Shannon, Rick Otten, docs@python, ezio.melotti, mrabarnett, r.david.murray, rhettinger, serhiy.storchaka
Priority: normal Keywords:

Created on 2015-02-26 22:55 by Rick Otten, last changed 2017-10-11 14:46 by berker.peksag. This issue is now closed.

Messages (8)
msg236715 - Author: Rick Otten (Rick Otten) Date: 2015-02-26 23:00
The documentation states that "|" parsing goes from left to right.  This doesn't seem to be true when spaces (or \s) are involved.

Example:

In [40]: mystring
Out[40]: 'rwo incorporated'

In [41]: re.sub('incorporated| inc|llc|corporation|corp| co', '', mystring)
Out[41]: 'rwoorporated'

In this case " inc" was processed before incorporated.
If I take the space out:

In [42]: re.sub('incorporated|inc|llc|corporation|corp| co', '', mystring)
Out[42]: 'rwo '

incorporated is processed first.

If I put a space with each, then " incorporated" is processed first:

In [43]: re.sub(' incorporated| inc|llc|corporation|corp| co', '', mystring)
Out[43]: 'rwo'

And if I use \s instead of a space, the \sinc alternative is processed first:

In [44]: re.sub(r'incorporated|\sinc|llc|corporation|corp| co', '', mystring)
Out[44]: 'rwoorporated'
msg236716 - Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2015-02-26 23:13
This looks like the expected behaviour to me.
re.sub matches the leftmost occurrence and the regular expression is greedy so (x|xy) will always match xy if it can.
msg236718 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2015-02-27 00:07
@Mark is correct, it's not a bug.

In the first example:

It tries to match each alternative at position 0. Failure.
It tries to match each alternative at position 1. Failure.
It tries to match each alternative at position 2. Failure.
It tries to match each alternative at position 3. Success. ' inc' matches.

In the second example:

It tries to match each alternative at position 0. Failure.
It tries to match each alternative at position 1. Failure.
It tries to match each alternative at position 2. Failure.
It tries to match each alternative at position 3. Failure.
It tries to match each alternative at position 4. Success. 'incorporated' matches. ('inc' is a later alternative; it's considered only if the earlier alternatives have failed to match at that position.)
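
The position-by-position scan described above can be reproduced with a short sketch. The helper name first_alternative_hit is hypothetical and only illustrates the order in which positions and alternatives are tried; it is not a claim about how the re module is implemented internally.

```python
import re

def first_alternative_hit(alternatives, text):
    """Hypothetical helper: return (position, alternative) for the first hit,
    scanning positions left to right and, at each position, alternatives
    left to right."""
    compiled = [re.compile(alt) for alt in alternatives]
    for pos in range(len(text) + 1):
        for alt, pattern in zip(alternatives, compiled):
            # Pattern.match(text, pos) succeeds only if a match starts at pos.
            if pattern.match(text, pos):
                return pos, alt
    return None

mystring = 'rwo incorporated'
with_space = ['incorporated', ' inc', 'llc', 'corporation', 'corp', ' co']
without_space = ['incorporated', 'inc', 'llc', 'corporation', 'corp', ' co']

hit_with_space = first_alternative_hit(with_space, mystring)
hit_without_space = first_alternative_hit(without_space, mystring)
print(hit_with_space)       # (3, ' inc')
print(hit_without_space)    # (4, 'incorporated')

# re.search on the combined pattern reports the same position and text.
m = re.search('|'.join(with_space), mystring)
print(m.start(), repr(m.group()))   # 3 ' inc'
```

The position in the string is the outer loop and the list of alternatives is the inner loop, which is why ' inc' at position 3 wins over 'incorporated' at position 4.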
msg236720 - Author: Rick Otten (Rick Otten) Date: 2015-02-27 00:36
Can the documentation be updated to make this clearer?

I see now where the clause "As the target string is scanned, ..." is describing what you have listed here.

A coworker and I both read the description several times and missed that.  I thought it first tried "incorporated" against the whole string, then tried " inc" against the whole string, etc.  In fact it tries each alternative ("incorporated", " inc", and the others) at the first position of the string, then again at the second position, and so on.

For my particular use case I want each alternative tried against the whole string before the next one is tried, so I'll do a series of re.sub calls instead of trying to do them all in one.  It makes sense now and is easy to fix.
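
A minimal sketch of that series-of-substitutions approach, assuming the patterns are listed in priority order. The function name strip_in_priority_order is hypothetical and not part of the re module.

```python
import re

def strip_in_priority_order(text, patterns):
    """Hypothetical helper: apply each pattern to the whole string before
    moving on to the next pattern."""
    for pattern in patterns:
        text = re.sub(pattern, '', text)
    return text

suffixes = ['incorporated', ' inc', 'llc', 'corporation', 'corp', ' co']

long_form = strip_in_priority_order('rwo incorporated', suffixes)
short_form = strip_in_priority_order('rwo inc', suffixes)
print(repr(long_form))    # 'rwo '
print(repr(short_form))   # 'rwo'
```

Here 'incorporated' is removed everywhere before ' inc' is ever tried, so the result no longer depends on which alternative happens to match at the earliest position.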

Thanks for looking at it and explaining what is happening more clearly.  It was really not obvious.  I tried at least 100 variations and wasn't seeing the pattern.
msg236725 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-02-27 02:18
The thing is, what you describe is fundamental to how regular expressions work.  I'm not sure it makes sense to add a specific mention of it to the '|' docs, since it applies to all regexes.
msg236821 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2015-02-27 19:18
Not quite all. POSIX regexes will always look for the longest match, so the order of the alternatives doesn't matter, i.e. x|xy would give the same result as xy|x.
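
The order dependence in Python's re module can be seen in a short sketch. The helper longest_first is hypothetical; it only approximates longest-match behaviour for plain literal alternatives by sorting them, and is not a general POSIX emulation.

```python
import re

text = 'xy'

# In Python's re module the first alternative that matches wins.
short_first = re.match('x|xy', text).group()
long_first = re.match('xy|x', text).group()
print(short_first)   # x
print(long_first)    # xy

def longest_first(literals):
    """Hypothetical helper: build an alternation of literal strings with the
    longest literal first, so the longest candidate is tried first."""
    ordered = sorted(literals, key=len, reverse=True)
    return '|'.join(re.escape(lit) for lit in ordered)

pattern = longest_first(['x', 'xy'])
longest = re.match(pattern, text).group()
print(pattern)   # xy|x
print(longest)   # xy
```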
msg295128 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-06-04 15:57
From the documentation:

"""
As the target string is scanned, REs separated by ``'|'`` are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once ``A`` matches, ``B`` will not be tested further, even if it would produce a longer overall match.  In other words, the ``'|'`` operator is never greedy.
"""

I think this completely describes the behavior.
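
A small sketch of the quoted rule, using the placeholder patterns a and ab. The second case shows a nuance not spelled out in the quoted text: when the rest of the pattern fails after the first branch, the engine backtracks and tries the next branch.

```python
import re

# 'a' matches first, so 'ab' is never tried even though it would be longer.
alone = re.match('a|ab', 'abc').group()
print(alone)      # a

# With 'c' required afterwards, branch 'a' leads to a failure ('b' != 'c'),
# so the engine backtracks into the alternation and branch 'ab' is used.
followed = re.match('(?:a|ab)c', 'abc').group()
print(followed)   # abc
```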
msg295129 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-06-04 16:19
I concur with Serhiy that the docs correctly and completely describe the behavior.
History
Date User Action Args
2017-10-11 14:46:48 berker.peksag set status: open -> closed; stage: resolved
2017-06-04 16:19:21 rhettinger set status: pending -> open; nosy: + rhettinger; messages: + msg295129
2017-06-04 15:57:51 serhiy.storchaka set status: open -> pending; nosy: + serhiy.storchaka; messages: + msg295128; resolution: not a bug
2016-10-16 22:32:17 serhiy.storchaka set type: behavior -> enhancement; components: + Regular Expressions; versions: + Python 3.5, Python 3.6, Python 3.7
2015-02-27 19:18:42 mrabarnett set messages: + msg236821
2015-02-27 02:18:20 r.david.murray set title: regex "|" behavior differs from documentation -> add example of 'first match wins' to regex "|" documentation?; nosy: + r.david.murray, docs@python; messages: + msg236725; assignee: docs@python; components: + Documentation, - Regular Expressions
2015-02-27 00:36:49 Rick Otten set messages: + msg236720
2015-02-27 00:07:32 mrabarnett set messages: + msg236718
2015-02-26 23:13:00 Mark.Shannon set nosy: + Mark.Shannon; messages: + msg236716
2015-02-26 23:00:23 Rick Otten set messages: + msg236715
2015-02-26 22:55:54 Rick Otten create