Capturing start of line '^' #69241

AlcoloAlcolo · 2015-09-10T12:19:57Z

BPO	25054
Nosy	@ezio-melotti, @bitdancer, @vadmium, @serhiy-storchaka
PRs	#4471: bpo-25054, bpo-1647489: Added support of splitting on zerowidth patterns; #4678: bpo-25054, bpo-1647489: Added support of splitting on zerowidth patterns (alternate version)

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/serhiy-storchaka'
closed_at = <Date 2018-03-14.17:16:23.514>
created_at = <Date 2015-09-10.12:19:56.653>
labels = ['expert-regex', 'type-bug', '3.7']
title = "Capturing start of line '^'"
updated_at = <Date 2018-03-14.17:16:23.513>
user = 'https://bugs.python.org/AlcoloAlcolo'

bugs.python.org fields:

activity = <Date 2018-03-14.17:16:23.513>
actor = 'serhiy.storchaka'
assignee = 'serhiy.storchaka'
closed = True
closed_date = <Date 2018-03-14.17:16:23.514>
closer = 'serhiy.storchaka'
components = ['Regular Expressions']
creation = <Date 2015-09-10.12:19:56.653>
creator = 'Alcolo Alcolo'
dependencies = []
files = []
hgrepos = []
issue_num = 25054
keywords = ['patch']
message_count = 15.0
messages = ['250364', '250366', '250370', '250377', '250392', '257296', '306517', '307400', '307424', '307441', '307454', '307461', '307467', '307476', '307557']
nosy_count = 6.0
nosy_names = ['ezio.melotti', 'mrabarnett', 'r.david.murray', 'martin.panter', 'serhiy.storchaka', 'Alcolo Alcolo']
pr_nums = ['4471', '4678']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue25054'
versions = ['Python 2.7', 'Python 3.6', 'Python 3.7']

AlcoloAlcolo · 2015-09-10T12:19:57Z

Why
re.findall('^|a', 'a') != ['', 'a'] ?

We have:
re.findall('^|a', ' a') == ['', 'a']
and
re.findall('$|a', ' a') == ['a', '']

Capturing '^' consumes the first character. It looks like a bug...

bitdancer · 2015-09-10T12:27:44Z

^ finds an empty match at the beginning of the string, $ finds an empty match at the end. I don't see the bug (but I'm not a regex expert).

mrabarnett · 2015-09-10T13:34:55Z

After matching '^', it advances so that it won't find the same match again (and again and again...).

Unfortunately, that means that it sometimes misses some matches.

It's a known issue.

AlcoloAlcolo · 2015-09-10T14:02:01Z

Naively, I thought that ^ would be treated as a zero-length token (like $, \b, \B), so that after capturing it we could read the next token: 'a' (for the input string "a").

I use a simple workaround: prepending ' ' to my string (because ' ' does not affect my regex results).
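The workaround above can be checked directly. This is a minimal sketch: the pattern and the input "a" are the ones from the report, and it assumes, as the reporter does, that a leading space is harmless for the pattern in question.

```python
import re

pattern = re.compile('^|a')
text = 'a'

# Prepend a space so that the empty '^' match and the 'a' match
# no longer start at the same position.
padded = ' ' + text
matches = pattern.findall(padded)
print(matches)  # ['', 'a']
```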

mrabarnett · 2015-09-10T17:09:36Z

Just to confirm, it _is_ a bug.

It tries to avoid getting stuck, but the way it does that causes it to skip a character, sometimes missing a match it should have found.
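The skipping described above can be illustrated with a small pure-Python loop. This is only a sketch of the idea, not the actual C implementation: `naive_findall` is a hypothetical helper, it returns whole matches only (it ignores the group handling of the real findall()), and it assumes that the scan simply steps one character forward after an empty match.

```python
import re

def naive_findall(pattern, string):
    """Hypothetical model of the old scan loop: repeated search() calls."""
    rx = re.compile(pattern)
    results = []
    pos = 0
    while pos <= len(string):
        m = rx.search(string, pos)
        if m is None:
            break
        results.append(m.group())
        if m.end() == m.start():
            # Empty match: step over one character to avoid looping forever.
            # This is the step that can skip a real match at the same position.
            pos = m.end() + 1
        else:
            pos = m.end()
    return results

print(naive_findall('^|a', 'a'))   # [''] : the 'a' at position 0 is skipped
print(naive_findall('^|a', ' a'))  # ['', 'a']
print(naive_findall('$|a', ' a'))  # ['a', '']
```

With the input 'a', the empty '^' match at position 0 pushes the scan to position 1, so 'a' is never tried at position 0.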

ezio-melotti · 2016-01-01T20:41:00Z

AFAIU the problem is at Modules/_sre.c:852: after matching, if the ptr is still at the start position, the start position gets incremented to avoid an endless loop.
Ideally the problem could be avoided by marking and skipping the part(s) of the pattern that have already been tested and produced a zero-length match, however I don't see any easy way to do it.
Unless someone can come up with a reasonable solution, I would suggest closing this as wontfix, and possibly adding a note to the docs about this corner case.

serhiy-storchaka · 2017-11-20T00:04:13Z

PR 4471 fixes this issue, bpo-1647489, and a couple of similar issues.

The most visible change is in re.split(). This is a compatibility-breaking change, and it affects third-party code. But a ValueError or FutureWarning has been raised for the patterns whose behavior changes in this PR for two Python releases, since Python 3.5. Developers have had enough time to fix them. In most cases the fix is as trivial as changing * to + in \s*.
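The `*` to `+` change mentioned above can be sketched as follows. The sample string is an assumption chosen for illustration; the point is only that `\s+` can never match an empty string, so its split() result does not depend on how empty matches are handled.

```python
import re

text = "a b  c"

# r"\s+" requires at least one whitespace character, so it never
# produces an empty match and splits the same way before and after the change.
parts = re.split(r"\s+", text)
print(parts)  # ['a', 'b', 'c']

# r"\s*" can match the empty string, which is what triggers the warning.
can_be_empty = re.fullmatch(r"\s*", "") is not None
print(can_be_empty)  # True
```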

Changes in sub(), findall(), and finditer() are less visible. No existing test needs modification for them. Was:

>>> re.split(r"\b|:+", "a::bc")
/usr/lib/python3.6/re.py:212: FutureWarning: split() requires a non-empty pattern match.
  return _compile(pattern, flags).split(string, maxsplit)
['a:', 'bc']
>>> re.sub(r"\b|:+", "-", "a::bc")
'-a-:-bc-'
>>> re.findall(r"\b|:+", "a::bc")
['', '', ':', '', '']
>>> list(re.finditer(r"\b|:+", "a::bc"))
[<_sre.SRE_Match object; span=(0, 0), match=''>, <_sre.SRE_Match object; span=(1, 1), match=''>, <_sre.SRE_Match object; span=(2, 3), match=':'>, <_sre.SRE_Match object; span=(3, 3), match=''>, <_sre.SRE_Match object; span=(5, 5), match=''>]

Fixed:

>>> re.split(r"\b|:+", "a::bc")
['', 'a', '', 'bc', '']
>>> re.sub(r"\b|:+", "-", "a::bc")
'-a--bc-'
>>> re.findall(r"\b|:+", "a::bc")
['', '', '::', '', '']
>>> list(re.finditer(r"\b|:+", "a::bc"))
[<re.Match object; span=(0, 0), match=''>, <re.Match object; span=(1, 1), match=''>, <re.Match object; span=(1, 3), match='::'>, <re.Match object; span=(3, 3), match=''>, <re.Match object; span=(5, 5), match=''>]

The behavior of re.split(), re.findall() and re.finditer() is now the same as in the regex module with the V1 flag. But the behavior of re.sub() is left closer to the previous behavior, since otherwise this would break existing tests. It is consistent with re.split() rather than with re.findall() and re.finditer(). In regex with the V1 flag, sub() is consistent with findall() and finditer(), but not with split().

serhiy-storchaka · 2017-12-01T18:54:37Z

Could anybody please review at least the documentation part? I want to merge this before 3.7.0a3 is released.

Initially I was going to backport the part that relates to findall(), finditer() and sub(). It changes the behavior only in corner cases and I didn't expect it could break real code. But since it broke a pattern in the doctest module, I am afraid it can break third-party code.

vadmium · 2017-12-02T10:50:48Z

The new “finditer” behaviour seems to contradict the documentation about excluding empty matches if they touch the start of another match.

>>> list(re.finditer(r"\b|:+", "a::bc"))
[<re.Match object; span=(0, 0), match=''>, <re.Match object; span=(1, 1), match=''>, <re.Match object; span=(1, 3), match='::'>, <re.Match object; span=(3, 3), match=''>, <re.Match object; span=(5, 5), match=''>]

An empty match at (1, 1) is included, despite it touching the beginning of the match at (1, 3). My best guess is that when an empty match is found, searching continues at the same position for the first non-empty match.
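The observation above can be reproduced by looking only at the match spans. This is a small sketch that assumes Python 3.7 or later (earlier versions return different spans); `touching` is a hypothetical name for the list of empty matches that start where the following match starts.

```python
import re

spans = [m.span() for m in re.finditer(r"\b|:+", "a::bc")]
print(spans)

# Pairs (empty match, next match) that begin at the same position.
touching = [
    (a, b)
    for a, b in zip(spans, spans[1:])
    if a[0] == a[1] == b[0]
]
print(touching)
```

On Python 3.7+ this reports the pair (1, 1) and (1, 3), the case the documentation sentence claims to exclude.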

serhiy-storchaka · 2017-12-02T17:37:25Z

Good point. Neither the old nor the new behavior (which matches regex) conforms to the documentation: "Empty matches are included in the result unless they touch the beginning of another match." It is easy to exclude empty matches that touch the *ending* of another match. This would be consistent with the new behavior of split() and sub().

But this would break one existing test for bpo-817234. Though that issue shouldn't rely on this detail. The test should just check that iterating doesn't hang.

And this would break a regular expression in pprint.

PR 4678 implements this version. I don't know which version is better.

>>> list(re.finditer(r"\b|:+", "a::bc"))
[<re.Match object; span=(0, 0), match=''>, <re.Match object; span=(1, 1), match=''>, <re.Match object; span=(1, 3), match='::'>, <re.Match object; span=(5, 5), match=''>]
>>> re.sub(r"(\b|:+)", r"[\1]", "a::bc")
'[]a[][::]bc[]'

With PR 4471 the result of re.sub() is the same, but the result of re.finditer() is as in msg307424.

serhiy-storchaka · 2017-12-02T20:11:23Z

The clause "Empty matches are included in the result unless they touch the beginning of another match" was added in 2f3e548 (bpo-732120) and I suppose it was never correct. So we can ignore it in the context of this issue.

mrabarnett · 2017-12-02T21:29:19Z

The pattern:

\b|:+

will match a word boundary (zero-width) before colons, so if there's a word followed by colons, finditer will find the boundary and then the colons. You _can_ get a zero-width match (ZWM) joined to the start of a nonzero-width match (NWM). That's not really surprising.

If you wanted to avoid a ZWM joined to either end of a NWM, you'd need to keep looking for another match at a position even after you'd already found a match if what you'd found was zero-width. That would also affect re.search and re.match.

For regex on Python 3.7, I'm going with avoiding a ZWM joined to the end of a NWM, unless re's going a different way, in which case I have more work to do to remain compatible! The change I did for Python 3.7+ was trivial.

serhiy-storchaka · 2017-12-02T22:01:15Z

Avoiding a ZWM after a NWM in re.sub() is explicitly documented (and the documentation is correct in this case). This follows the behavior of the ancient RE implementation. It was once broken in sre, but then fixed (see 21009b9, bpo-462270). Changing this behavior doesn't break anything in the stdlib except the test written specifically for it. I think it is better to keep this behavior, but maybe discuss changing it (to match the behavior of other RE engines) in a separate issue.

I don't know how the behavior of findall() and finditer() is related to this.

mrabarnett · 2017-12-02T22:48:58Z

findall() and finditer() consist of multiple uses of search(), basically, as do sub() and split(), so we want the same rule to apply to them all.

serhiy-storchaka · 2017-12-04T12:29:10Z

New changeset 70d56fb by Serhiy Storchaka in branch 'master':
bpo-25054, bpo-1647489: Added support of splitting on zerowidth patterns. (bpo-4471)

AlcoloAlcolo mannequin added topic-regex type-bug An unexpected behavior, bug, or error labels Sep 10, 2015

serhiy-storchaka added the 3.7 (EOL) end of life label Nov 16, 2017

serhiy-storchaka self-assigned this Nov 16, 2017

serhiy-storchaka closed this as completed Mar 14, 2018

ezio-melotti transferred this issue from another repository Apr 10, 2022
