classification
Title: Replace empty matches adjacent to a previous non-empty match in re.sub()
Type: enhancement
Stage: resolved
Components: Library (Lib), Regular Expressions
Versions: Python 3.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: serhiy.storchaka
Nosy List: Anders.Hovmöller, ezio.melotti, mrabarnett, serhiy.storchaka
Priority: normal
Keywords: patch

Created on 2017-12-13 18:28 by serhiy.storchaka, last changed 2019-04-12 19:53 by mrabarnett. This issue is now closed.

Pull Requests
URL Status Linked
PR 4846 merged serhiy.storchaka, 2017-12-13 18:34
Messages (8)
msg308229 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-12-13 18:28
Currently re.sub() replaces empty matches only when they are not adjacent to a previous match. This makes it inconsistent with re.findall() and re.finditer(), which find empty matches adjacent to a previous non-empty match, and with other RE engines.

The proposed PR makes all functions that perform repeated searching (re.split(), re.sub(), re.findall(), re.finditer()) mutually consistent.

The PR changes the behavior of re.split() too, but this doesn't matter, since it already differs from the 3.6 behavior.

The BDFL has approved this change.

This change doesn't break any stdlib code. It is not expected to break much third-party code, and any code that does break can be easily rewritten. For example, replacing re.sub('(.*)', ...) (which now also matches an empty string at the end of the string) with re.sub('(.+)', ...) is an obvious fix.
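A minimal sketch of the rewrite suggested above, assuming Python 3.7 or later. The input string 'asd' and the variable names are illustrative only:

```python
import re

# On Python 3.7+, '(.*)' first matches 'asd' and then matches the empty
# string at the end of the input, so the replacement is applied twice.
with_star = re.sub('(.*)', 'foo', 'asd')

# '(.+)' needs at least one character, so there is no trailing empty match
# and the replacement is applied once on every version.
with_plus = re.sub('(.+)', 'foo', 'asd')

print(with_star)  # foofoo on 3.7+ (foo on 3.6 and earlier)
print(with_plus)  # foo
```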
msg309055 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-12-26 10:14
Could anybody please review at least the documentation part?
msg309458 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-01-04 09:06
New changeset fbb490fd2f38bd817d99c20c05121ad0168a38ee by Serhiy Storchaka in branch 'master':
bpo-32308: Replace empty matches adjacent to a previous non-empty match in re.sub(). (#4846)
https://github.com/python/cpython/commit/fbb490fd2f38bd817d99c20c05121ad0168a38ee
msg339949 - Author: Anders Hovmöller (Anders.Hovmöller) * Date: 2019-04-11 09:50
This was a really bad idea in my opinion. We just found this and we have no way to know how it will impact production. It's really absurd that

re.sub('(.*)', r'foo', 'asd')

is 'foo' in Python 1 to 3.6 but 'foofoo' in Python 3.7.
msg339950 - Author: Anders Hovmöller (Anders.Hovmöller) * Date: 2019-04-11 09:57
Just as a comparison, sed does the 3.6 thing:

> echo foo | sed 's/\(.*\)/x\1y/g'
xfooy
msg339989 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2019-04-11 17:46
It's now consistent with Perl, PCRE and .Net (C#), as well as re.split(), re.sub(), re.findall() and re.finditer().
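The consistency between re.sub() and re.finditer() can be checked with a short sketch, assuming Python 3.7 or later. Here sub_via_finditer is a hypothetical helper written for this illustration, not part of the re module, and it only handles a literal replacement string without backreferences:

```python
import re

def sub_via_finditer(pattern, repl, string):
    """Rebuild the result of a substitution from the matches that
    re.finditer() reports (hypothetical helper, literal repl only)."""
    pieces = []
    pos = 0
    for m in re.finditer(pattern, string):
        pieces.append(string[pos:m.start()])  # text between matches
        pieces.append(repl)                   # one replacement per match
        pos = m.end()
    pieces.append(string[pos:])               # text after the last match
    return ''.join(pieces)

# On 3.7+ both approaches agree, including for empty matches that directly
# follow a non-empty match.
for pattern, repl, string in [('(.*)', 'foo', 'asd'), ('x*', '-', 'abxd')]:
    print(re.sub(pattern, repl, string), sub_via_finditer(pattern, repl, string))
```

On 3.6 and earlier the two results would differ for these inputs, because re.sub() skipped the empty match that re.finditer() reported.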
msg340040 - Author: Anders Hovmöller (Anders.Hovmöller) * Date: 2019-04-12 13:33
That might be true, but that seems like a weak argument. If anything, it means those others are broken. What is the logic behind "(.*)" returning the entire string (which is what you asked for) and exactly one empty string? Why not two empty strings? 3? 4? 5? Why not an empty string at the beginning? It makes no practical sense.

We will have to spend considerable effort to work around this change and adapt our code to 3.7. The lack of discussion about backwards compatibility in this thread and the other one before making this change is also a problem, I think.
msg340102 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2019-04-12 19:53
Consider re.findall(r'.{0,2}', 'abcde').

It finds 'ab', then continues where it left off to find 'cd', then 'e'.

It can also find ''; re.match(r'.*', '') does match, after all.

It could, in fact, find an infinite number of ''.

And what about re.match(r'()*', '')?

What should it do? Run forever? Raise an exception?

At some point you have to make a decision as to what should happen, and the general consensus has been to match once.
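The walk-through above can be reproduced with a short sketch. The variable names are illustrative, and the spans show where each match starts and ends:

```python
import re

pattern = r'.{0,2}'
text = 'abcde'

# Each search continues where the previous match ended; the final match is
# a single empty string at the end of the input.
matches = re.findall(pattern, text)
spans = [m.span() for m in re.finditer(pattern, text)]

print(matches)  # ['ab', 'cd', 'e', '']
print(spans)    # [(0, 2), (2, 4), (4, 5), (5, 5)]

# '.*' does match the empty string, which is why the trailing '' is found.
empty_match = re.match(r'.*', '')
print(empty_match is not None)  # True
```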
History
Date User Action Args
2019-04-12 19:53:42 mrabarnett set messages: + msg340102
2019-04-12 13:33:22 Anders.Hovmöller set messages: + msg340040
2019-04-11 17:46:30 mrabarnett set messages: + msg339989
2019-04-11 09:57:04 Anders.Hovmöller set messages: + msg339950
2019-04-11 09:50:19 Anders.Hovmöller set nosy: + Anders.Hovmöller; messages: + msg339949
2018-01-04 09:06:40 serhiy.storchaka set status: open -> closed; resolution: fixed; stage: patch review -> resolved
2018-01-04 09:06:15 serhiy.storchaka set messages: + msg309458
2017-12-26 10:14:59 serhiy.storchaka set messages: + msg309055
2017-12-13 18:34:24 serhiy.storchaka set keywords: + patch; stage: patch review; pull_requests: + pull_request4734
2017-12-13 18:28:38 serhiy.storchaka create