PR 4471 fixes this issue, issue1647489, and a couple of similar issues.
The most visible change is the change in re.split(). This is compatibility breaking change, and it affects third-party code. But ValueError or FutureWarning were raised for patterns that will change the behavior in this PR for two Python releases, since Python 3.5. Developers had enough time for fixing them. In most cases this is so trivial as changing `*` to `+` in `\s*`.
Changes in sub(), findall(), and finditer() are less visible. No one existing test needs modification for them. Was:
>>> re.split(r"\b|:+", "a::bc")
/usr/lib/python3.6/re.py:212: FutureWarning: split() requires a non-empty pattern match.
return _compile(pattern, flags).split(string, maxsplit)
['a:', 'bc']
>>> re.sub(r"\b|:+", "-", "a::bc")
'-a-:-bc-'
>>> re.findall(r"\b|:+", "a::bc")
['', '', ':', '', '']
>>> list(re.finditer(r"\b|:+", "a::bc"))
[<_sre.SRE_Match object; span=(0, 0), match=''>, <_sre.SRE_Match object; span=(1, 1), match=''>, <_sre.SRE_Match object; span=(2, 3), match=':'>, <_sre.SRE_Match object; span=(3, 3), match=''>, <_sre.SRE_Match object; span=(5, 5), match=''>]
Fixed:
>>> re.split(r"\b|:+", "a::bc")
['', 'a', '', 'bc', '']
>>> re.sub(r"\b|:+", "-", "a::bc")
'-a--bc-'
>>> re.findall(r"\b|:+", "a::bc")
['', '', '::', '', '']
>>> list(re.finditer(r"\b|:+", "a::bc"))
[<re.Match object; span=(0, 0), match=''>, <re.Match object; span=(1, 1), match=''>, <re.Match object; span=(1, 3), match='::'>, <re.Match object; span=(3, 3), match=''>, <re.Match object; span=(5, 5), match=''>]
The behavior of re.split(), re.findall() and re.finditer() now is the same as in the regex module with the V1 flag. But the behavior of re.sub() left closer to the previous behavior, otherwise this would break existing tests. It is consistent with re.split() rather of re.findall() and re.finditer(). In regex with the V1 flag sub() is consistent with findall() and finditer(), but not with split().
|
The new “finditer” behaviour seems to contradict the documentation about excluding empty matches if they touch the start of another match.
>>> list(re.finditer(r"\b|:+", "a::bc"))
[<re.Match object; span=(0, 0), match=''>, <re.Match object; span=(1, 1), match=''>, <re.Match object; span=(1, 3), match='::'>, <re.Match object; span=(3, 3), match=''>, <re.Match object; span=(5, 5), match=''>]
An empty match at (1, 1) is included, despite it touching the beginning of the match at (1, 3). My best guess is that when an empty match is found, searching continues at the same position for the first non-empty match.
|
Good point. Neither old nor new (which matches regex) behaviors conform the documentation: "Empty matches are included in the result unless they touch the beginning of another match." It is easy to exclude empty matches that touch the *ending* of another match. This would be consistent with the new behavior of split() and sub().
But this would break a one existing test for issue817234. Though that issue shouldn't rely on this detail. The test should just test that iterating doesn't hang.
And this would break a regular expression in pprint.
PR 4678 implements this version. I don't know what version is better.
>>> list(re.finditer(r"\b|:+", "a::bc"))
[<re.Match object; span=(0, 0), match=''>, <re.Match object; span=(1, 1), match=''>, <re.Match object; span=(1, 3), match='::'>, <re.Match object; span=(5, 5), match=''>]
>>> re.sub(r"(\b|:+)", r"[\1]", "a::bc")
'[]a[][::]bc[]'
With PR 4471 the result of re.sub() is the same, but the result of re.finditer() is as in msg307424.
|