classification
Title: Replace empty matches adjacent to a previous non-empty match in re.sub()
Type: enhancement Stage: resolved
Components: Library (Lib), Regular Expressions Versions: Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: Anders.Hovmöller, Mark Borgerding, ezio.melotti, mrabarnett, mu_mind, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2017-12-13 18:28 by serhiy.storchaka, last changed 2020-04-16 16:28 by serhiy.storchaka. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 4846 merged serhiy.storchaka, 2017-12-13 18:34
Messages (14)
msg308229 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-12-13 18:28
Currently re.sub() replaces empty matches only when they are not adjacent to a previous match. This makes it inconsistent with re.findall() and re.finditer(), which find empty matches adjacent to a previous non-empty match, and with other RE engines.

The proposed PR makes all functions that perform repeated searching (re.split(), re.sub(), re.findall(), re.finditer()) mutually consistent.

The PR changes the behavior of re.split() too, but this doesn't matter, since it is already different from the 3.6 behavior.

The BDFL has approved this change.

This change doesn't break any stdlib code. It is expected that it will not break much third-party code, and even if it breaks some code, that code can be easily rewritten. For example, replacing re.sub('(.*)', ...) (which now also matches an empty string at the end of the string) with re.sub('(.+)', ...) is an obvious fix.
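A minimal sketch of the difference described above, assuming Python 3.7 or later. The sample text "asd" and the replacement "foo" are illustrative choices, not part of the PR.

```python
import re

text = "asd"

# Every match of '(.*)' in the text, as (start, end) spans:
# the whole string, then an empty match at the very end.
spans = [m.span() for m in re.finditer(r"(.*)", text)]

# On Python 3.7+, re.sub() replaces both of those matches.
greedy = re.sub(r"(.*)", "foo", text)

# Requiring at least one character rules out the trailing empty match.
non_empty = re.sub(r"(.+)", "foo", text)

print(spans)      # [(0, 3), (3, 3)]
print(greedy)     # foofoo on 3.7+ (foo on 3.6 and earlier)
print(non_empty)  # foo
```

The '(.+)' form gives the same result before and after the change, which is why it works as a portable rewrite.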
msg309055 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-12-26 10:14
Could anybody please make a review of at least the documentation part?
msg309458 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-01-04 09:06
New changeset fbb490fd2f38bd817d99c20c05121ad0168a38ee by Serhiy Storchaka in branch 'master':
bpo-32308: Replace empty matches adjacent to a previous non-empty match in re.sub(). (#4846)
https://github.com/python/cpython/commit/fbb490fd2f38bd817d99c20c05121ad0168a38ee
msg339949 - (view) Author: Anders Hovmöller (Anders.Hovmöller) * Date: 2019-04-11 09:50
This was a really bad idea in my opinion. We just found this and we have no way to know how this will impact production. It's really absurd that 

re.sub('(.*)', r'foo', 'asd')

is "foo" in python 1 to 3.6 but 'foofoo' in python 3.7.
msg339950 - (view) Author: Anders Hovmöller (Anders.Hovmöller) * Date: 2019-04-11 09:57
Just as a comparison, sed does the 3.6 thing:

> echo foo | sed 's/\(.*\)/x\1y/g'
xfooy
msg339989 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2019-04-11 17:46
It's now consistent with Perl, PCRE and .Net (C#), as well as re.split(), re.sub(), re.findall() and re.finditer().
msg340040 - (view) Author: Anders Hovmöller (Anders.Hovmöller) * Date: 2019-04-12 13:33
That might be true, but that seems like a weak argument. If anything, it means those others are broken. What is the logic behind "(.*)" returning the entire string (which is what you asked for) and exactly one empty string? Why not two empty strings? 3? 4? 5? Why not an empty string at the beginning? It makes no practical sense.

We will have to spend considerable effort to work around this change and adapt our code to 3.7. The lack of a discussion about backwards compatibility in this, and the other, thread before making this change is also a problem I think.
msg340102 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2019-04-12 19:53
Consider re.findall(r'.{0,2}', 'abcde').

It finds 'ab', then continues where it left off to find 'cd', then 'e'.

It can also find ''; re.match(r'.*', '') does match, after all.

It could, in fact, find an infinite number of ''.

And what about re.match(r'()*', '')?

What should it do? Run forever? Raise an exception?

At some point you have to make a decision as to what should happen, and the general consensus has been to match once.
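The examples in this message can be checked directly. This is a small sketch, assuming Python 3.7 or later; the variable names are only for illustration.

```python
import re

# Repeated searching: each match starts where the previous one ended.
matches = re.findall(r".{0,2}", "abcde")
spans = [m.span() for m in re.finditer(r".{0,2}", "abcde")]

# An empty subject still matches '.*', with an empty span.
empty = re.match(r".*", "")

# A pattern that could loop on empty matches still returns a single match.
nested = re.match(r"()*", "")

print(matches)       # ['ab', 'cd', 'e', '']
print(spans)         # [(0, 2), (2, 4), (4, 5), (5, 5)]
print(empty.span())  # (0, 0)
print(nested.span())
```

The single '' at the end of the findall() result is the "match once" decision: one empty match at position 5, then the search stops.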
msg360352 - (view) Author: David Barnett (mu_mind) Date: 2020-01-21 04:56
We were also bitten by this behavior change in https://github.com/google/vroom/issues/110. I'm kinda baffled by the new behavior and assumed it had to be an accidental regression, but I guess not. If you have any other context on the BDFL conversation and reasoning for calling this behavior correct, I'd love to see additional info.
msg360355 - (view) Author: Anders Hovmöller (Anders.Hovmöller) * Date: 2020-01-21 06:07
We were also bitten by this. In fact we still run a compatibility shim in production where we log if the new and old behavior are different. We also didn't think this "bug fix" made sense or was treated with the appropriate gravity in the release notes. 

I understand the logic in the bug tracker, and that it matches other languages is good. But the behavior also makes no sense for the .* case, unfortunately.
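The shim itself is not shown in this thread. The following is only a hypothetical sketch of what such a check might look like: the names sub_pre37, checked_sub and the "re_compat" logger are invented here, and the old behavior is approximated by skipping empty matches that touch the previous match. It does not reproduce every pre-3.7 quirk (for example, the issue25054 case comes out differently).

```python
import logging
import re

log = logging.getLogger("re_compat")  # hypothetical logger name


def sub_pre37(pattern, repl, string, flags=0):
    """Approximate pre-3.7 re.sub(): skip empty matches that touch the previous match."""
    pieces = []
    pos = 0          # end of the text copied so far
    prev_end = None  # where the previous replaced match ended
    for m in re.finditer(pattern, string, flags):
        if m.start() == m.end() == prev_end:
            continue  # empty match adjacent to the previous match
        pieces.append(string[pos:m.start()])
        pieces.append(repl(m) if callable(repl) else m.expand(repl))
        pos = prev_end = m.end()
    pieces.append(string[pos:])
    return "".join(pieces)


def checked_sub(pattern, repl, string, flags=0):
    """Return the current re.sub() result, logging when the approximation disagrees."""
    new = re.sub(pattern, repl, string, flags=flags)
    old = sub_pre37(pattern, repl, string, flags)
    if new != old:
        log.warning("re.sub(%r) result changed: %r -> %r", pattern, old, new)
    return new


print(sub_pre37(r"(.*)", "foo", "asd"))    # foo
print(checked_sub(r"(.*)", "foo", "asd"))  # foofoo on 3.7+, plus a logged warning
```

Returning the new result while logging the disagreement keeps production behavior on the current semantics and still surfaces the call sites that depend on the old one.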
msg366595 - (view) Author: Mark Borgerding (Mark Borgerding) Date: 2020-04-16 12:57
So third-party code was knowingly broken to satisfy an aesthetic notion that substitution should be more like iteration.

Would not a FutureWarning have been a kinder way to stage this implementation?

A foolish consistency, indeed.
msg366602 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-04-16 14:46
The former implementation was wrong. See issue25054 which contains more obvious examples of that bug:

>>> re.sub(r"\b|:+", "-", "a::bc")
'-a-:-bc-'

Not all colons were replaced despite the fact that the pattern matches all colons.
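For comparison, a short sketch of the same call on a fixed interpreter, assuming Python 3.7 or later. The output shown above is the pre-3.7 result.

```python
import re

# Spans found by repeated searching: empty word boundaries and the colon run.
spans = [m.span() for m in re.finditer(r"\b|:+", "a::bc")]

# With the fix, the colon run at (1, 3) is replaced as well.
result = re.sub(r"\b|:+", "-", "a::bc")

print(spans)   # [(0, 0), (1, 1), (1, 3), (3, 3), (5, 5)]
print(result)  # -a---bc-
```

After the empty match at position 1, the next search must produce a non-empty match at that position, so ':+' gets its turn and consumes both colons.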
msg366604 - (view) Author: Mark Borgerding (Mark Borgerding) Date: 2020-04-16 14:59
@serhiy.storchaka  Thanks for the link to issue25054 to clarify this change was not done solely for aesthetics.
Hopefully that will mollify others like me who find their way to this discussion as they try to figure out why their code broke with a new version of python.


I wish it had been done in a more staged and overt way, but that is just spitting in the wind at this point.


Thanks for all your work, my gripe du jour notwithstanding.
msg366606 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-04-16 16:28
If the behavior is obviously wrong (as in issue25054), we can fix it without warnings, and even backport the fix to older versions, because we do not expect anybody to depend on such weird behavior. If we are going to change the behavior but expect that users may depend on the current behavior, we emit a FutureWarning first (and we did so for other changes in re). But this issue is a hard one. Before 3.7 we did not know that it was related to issue25054, and we were not going to change this behavior (at least not in the near future). But when a fix for issue25054 was written, we saw that it is the same issue. We did not want to keep the bug from issue25054 for a few more versions, so we changed the behavior in this issue without warnings. It was an exceptional case.

This change was documented both in the module documentation and in "What's New in Python 3.7" (section "Porting to Python 3.7"). If this is not enough, we will be happy to get help to make it better.
History
Date User Action Args
2020-04-16 16:28:26 serhiy.storchaka set messages: + msg366606
2020-04-16 14:59:31 Mark Borgerding set messages: + msg366604
2020-04-16 14:46:18 serhiy.storchaka set messages: + msg366602
2020-04-16 12:57:37 Mark Borgerding set nosy: + Mark Borgerding
messages: + msg366595
2020-01-21 06:07:35 Anders.Hovmöller set messages: + msg360355
2020-01-21 04:56:04 mu_mind set nosy: + mu_mind
messages: + msg360352
2019-04-12 19:53:42 mrabarnett set messages: + msg340102
2019-04-12 13:33:22 Anders.Hovmöller set messages: + msg340040
2019-04-11 17:46:30 mrabarnett set messages: + msg339989
2019-04-11 09:57:04 Anders.Hovmöller set messages: + msg339950
2019-04-11 09:50:19 Anders.Hovmöller set nosy: + Anders.Hovmöller
messages: + msg339949
2018-01-04 09:06:40 serhiy.storchaka set status: open -> closed
resolution: fixed
stage: patch review -> resolved
2018-01-04 09:06:15 serhiy.storchaka set messages: + msg309458
2017-12-26 10:14:59 serhiy.storchaka set messages: + msg309055
2017-12-13 18:34:24 serhiy.storchaka set keywords: + patch
stage: patch review
pull_requests: + pull_request4734
2017-12-13 18:28:38 serhiy.storchaka create