Replace empty matches adjacent to a previous non-empty match in re.sub() #76489

serhiy-storchaka · 2017-12-13T18:28:38Z

BPO	32308
Nosy	@ezio-melotti, @serhiy-storchaka, @boxed
PRs	bpo-32308: Replace empty matches adjacent to a previous non-empty match in re.sub(). #4846

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/serhiy-storchaka'
closed_at = <Date 2018-01-04.09:06:40.130>
created_at = <Date 2017-12-13.18:28:37.987>
labels = ['expert-regex', 'type-feature', 'library', '3.7']
title = 'Replace empty matches adjacent to a previous non-empty match in re.sub()'
updated_at = <Date 2020-04-16.16:28:26.175>
user = 'https://github.com/serhiy-storchaka'

bugs.python.org fields:

activity = <Date 2020-04-16.16:28:26.175>
actor = 'serhiy.storchaka'
assignee = 'serhiy.storchaka'
closed = True
closed_date = <Date 2018-01-04.09:06:40.130>
closer = 'serhiy.storchaka'
components = ['Library (Lib)', 'Regular Expressions']
creation = <Date 2017-12-13.18:28:37.987>
creator = 'serhiy.storchaka'
dependencies = []
files = []
hgrepos = []
issue_num = 32308
keywords = ['patch']
message_count = 14.0
messages = ['308229', '309055', '309458', '339949', '339950', '339989', '340040', '340102', '360352', '360355', '366595', '366602', '366604', '366606']
nosy_count = 6.0
nosy_names = ['ezio.melotti', 'mrabarnett', 'mu_mind', 'serhiy.storchaka', 'Anders.Hovm\xc3\xb6ller', 'Mark Borgerding']
pr_nums = ['4846']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue32308'
versions = ['Python 3.7']

serhiy-storchaka · 2017-12-13T18:28:38Z

Currently re.sub() replaces empty matches only when not adjacent to a previous match. This makes it inconsistent with re.findall() and re.finditer() which finds empty matches adjacent to a previous non-empty match and with other RE engines.

Proposed PR makes all functions that makes repeated searching (re.split(), re.sub(), re.findall(), re.finditer()) mutually consistent.

The PR change the behavior of re.split() too, but this doesn't matter, since it already is different from the 3.6 behavior.

BDFL have approved this change.

This change doesn't break any stdlib code. It is expected that it will not break much third-party code, and even if it will break some code, it can be easily rewritten. For example replacing re.sub('(.*)', ...) (which now matches an empty string at the end of the string) with re.sub('(.+)', ...) is an obvious fix.

serhiy-storchaka · 2017-12-26T10:14:59Z

Could anybody please make a review of at least the documentation part?

serhiy-storchaka · 2018-01-04T09:06:15Z

New changeset fbb490f by Serhiy Storchaka in branch 'master':
bpo-32308: Replace empty matches adjacent to a previous non-empty match in re.sub(). (bpo-4846)
fbb490f

boxed · 2019-04-11T09:50:20Z

This was a really bad idea in my opinion. We just found this and we have no way to know how this will impact production. It's really absurd that

re.sub('(.*)', r'foo', 'asd')

is "foo" in python 1 to 3.6 but 'foofoo' in python 3.7.

boxed · 2019-04-11T09:57:05Z

Just as a comparison, sed does the 3.6 thing:

echo foo | sed 's/$.*$/x\1y/g'
xfooy

mrabarnett · 2019-04-11T17:46:31Z

It's now consistent with Perl, PCRE and .Net (C#), as well as re.split(), re.sub(), re.findall() and re.finditer().

boxed · 2019-04-12T13:33:22Z

That might be true, but that seems like a weak argument. If anything, it means those others are broken. What is the logic behind "(.*)" returning the entire string (which is what you asked for) and exactly one empty string? Why not two empty strings? 3? 4? 5? Why not an empty string at the beginning? It makes no practical sense.

We will have to spend considerable effort to work around this change and adapt our code to 3.7. The lack of a discussion about backwards compatibility in this, and the other, thread before making this change is also a problem I think.

mrabarnett · 2019-04-12T19:53:42Z

Consider re.findall(r'.{0,2}', 'abcde').

It finds 'ab', then continues where it left off to find 'cd', then 'e'.

It can also find ''; re.match(r'.*', '') does match, after all.

It could, in fact, an infinite number of ''.

And what about re.match(r'()*', '')?

What should it do? Run forever? Raise an exception?

At some point you have to make a decision as to what should happen, and the general consensus has been to match once.

mumind · 2020-01-21T04:56:04Z

We were also bitten by this behavior change in google/vroom#110. I'm kinda baffled by the new behavior and assumed it had to be an accidental regression, but I guess not. If you have any other context on the BDFL conversation and reasoning for calling this behavior correct, I'd love to see additional info.

boxed · 2020-01-21T06:07:35Z

We were also bitten by this. In fact we still run a compatibility shim in production where we log if the new and old behavior are different. We also didn't think this "bug fix" made sense or was treated with the appropriate gravity in the release notes.

I understand the logic in the bug tracker and they it matches other languages is good. But the bahvior also makes no sense for the .* case unfortunately.

On 21 Jan 2020, at 05:56, David Barnett <report@bugs.python.org> wrote:

David Barnett <davidbarnett2@gmail.com> added the comment:

We were also bitten by this behavior change in google/vroom#110. I'm kinda baffled by the new behavior and assumed it had to be an accidental regression, but I guess not. If you have any other context on the BDFL conversation and reasoning for calling this behavior correct, I'd love to see additional info.

----------
nosy: +mu_mind

Python tracker <report@bugs.python.org>
<https://bugs.python.org/issue32308\>

MarkBorgerding · 2020-04-16T12:57:37Z

So third-party code was knowingly broken to satisfy an aesthetic notion that substitution should be more like iteration.

Would not a FutureWarning have been a kinder way to stage this implementation?

A foolish consistency, indeed.

serhiy-storchaka · 2020-04-16T14:46:18Z

The former implementation was wrong. See bpo-25054 which contains more obvious examples of that bug:

>>> re.sub(r"\b|:+", "-", "a::bc")
'-a-:-bc-'

Not all colons were replaced despite the fact that the pattern matches all colons.

MarkBorgerding · 2020-04-16T14:59:31Z

@serhiy.storchaka Thanks for the link to bpo-25054 to clarify this change was not done solely for aesthetics.
Hopefully that will mollify others like me who find their way to this discussion as they try to figure out why their code broke with a new version of python.

I wish it had been done in a more staged and overt way, but that is just spitting in the wind at this point.

Thanks for all your work, my gripe du jour notwithstanding.

serhiy-storchaka · 2020-04-16T16:28:26Z

If the behavior is obviously wrong (like in bpo-25054), we can fix it without warnings, and even backport the fix to older versions, because we do not expect that anybody depends on such weird behavior. If we are going to change the behavior, but expect that users can depend on the current behavior, we emit a FutureWarning first (and we did it for other changes in re). But this issue is the hard one. Before 3.7 we did not know that it is related to bpo-25054. We were not going to change this behavior (at least not in near future). But when a fix for bpo-25054 was written we did see that it is the same issue. We did not want to keep a bug in bpo-25054 few versions more, so we changed the behavior in this issue without warnings. It was an exceptional case.

This change was documented, in the module documentation, and in "What's New in Python 3.7" (section "Porting to Python 3.7"). If this is not enough we will be happy to get help to make it better.

Fernando-Cabral · 2022-05-24T13:56:49Z

I still think re.sub() behaves wrongly in certain cases like the following:
re.sub('.*', "needle", "One line haystack")
should return 'needle', but returns 'needleneedle'. I find no rational reason for this result.
So much so that the following operations return what is to be expected:
re.sub('^.*$', "needle", "One line haystack")
re.sub('^.*', "needle", "One line haystack")
In this context "." == "^." == "^.$" so they sould all give the same result. Nevertheless, they don't. Besides, ".$" should, but does not return the same result as "^.". In summary:
"." == ".$" != "^.$" == "^.".
This is highly inconsistent.
Also,
re.seach(".", "One line haystack")
re.match(".", "One line haystack")
re.findall(".", "One line haystack")
they find "One line laystack". So, why re.sub doesn't? I don't think it makes sense.

serhiy-storchaka · 2022-05-24T15:01:41Z

Please read all previous discussions and explanations scattered across several issues.

In your example, look how many occurrences of the pattern are found in the string:

>>> re.findall('.*', "One line haystack")
['One line haystack', '']

And where they are found:

>>> [m.span() for m in re.finditer('.*', "One line haystack")]
[(0, 17), (17, 17)]

Now, if you replace characters in the ranges 0 to 17 and 17 to 17 with the replacement string "needle", what do you get?

There was a behavioral change in Python 3.7 regarding empty matches. This adjusts the regex to no longer match an empty string. See python/cpython#76489

serhiy-storchaka added the 3.7 (EOL) end of life label Dec 13, 2017

serhiy-storchaka self-assigned this Dec 13, 2017

serhiy-storchaka added stdlib Python modules in the Lib dir topic-regex type-feature A feature request or enhancement labels Dec 13, 2017

serhiy-storchaka closed this as completed Jan 4, 2018

ezio-melotti transferred this issue from another repository Apr 10, 2022

gaudenz mentioned this issue Sep 5, 2022

Fix regex_replace expression for Python 3.7 cloudscale-ch/ansible-haproxy#5

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace empty matches adjacent to a previous non-empty match in re.sub() #76489

Replace empty matches adjacent to a previous non-empty match in re.sub() #76489

serhiy-storchaka commented Dec 13, 2017

serhiy-storchaka commented Dec 13, 2017

serhiy-storchaka commented Dec 26, 2017

serhiy-storchaka commented Jan 4, 2018

boxed mannequin commented Apr 11, 2019

boxed mannequin commented Apr 11, 2019

mrabarnett mannequin commented Apr 11, 2019

boxed mannequin commented Apr 12, 2019

mrabarnett mannequin commented Apr 12, 2019

mumind mannequin commented Jan 21, 2020

boxed mannequin commented Jan 21, 2020

MarkBorgerding mannequin commented Apr 16, 2020

serhiy-storchaka commented Apr 16, 2020

MarkBorgerding mannequin commented Apr 16, 2020

serhiy-storchaka commented Apr 16, 2020

Fernando-Cabral commented May 24, 2022 •

edited

serhiy-storchaka commented May 24, 2022

Replace empty matches adjacent to a previous non-empty match in re.sub() #76489

Replace empty matches adjacent to a previous non-empty match in re.sub() #76489

Comments

serhiy-storchaka commented Dec 13, 2017

serhiy-storchaka commented Dec 13, 2017

serhiy-storchaka commented Dec 26, 2017

serhiy-storchaka commented Jan 4, 2018

boxed mannequin commented Apr 11, 2019

boxed mannequin commented Apr 11, 2019

mrabarnett mannequin commented Apr 11, 2019

boxed mannequin commented Apr 12, 2019

mrabarnett mannequin commented Apr 12, 2019

mumind mannequin commented Jan 21, 2020

boxed mannequin commented Jan 21, 2020

MarkBorgerding mannequin commented Apr 16, 2020

serhiy-storchaka commented Apr 16, 2020

MarkBorgerding mannequin commented Apr 16, 2020

serhiy-storchaka commented Apr 16, 2020

Fernando-Cabral commented May 24, 2022 • edited

serhiy-storchaka commented May 24, 2022

Fernando-Cabral commented May 24, 2022 •

edited