
classification
Title: re.sub inconsistency beginning with 3.7
Type: behavior Stage: resolved
Components: Regular Expressions Versions: Python 3.9, Python 3.8, Python 3.7
process
Status: closed Resolution: duplicate
Dependencies: Superseder:
Assigned To: Nosy List: 4wayned, WayneD, bsammon, ezio.melotti, mrabarnett
Priority: normal Keywords:

Created on 2020-03-20 17:21 by WayneD, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name Uploaded Description
sub-bug.py WayneD, 2020-03-20 17:21 Test program to demonstrate re.sub bug.
Messages (7)
msg364688 - Author: Wayne Davison (WayneD) Date: 2020-03-20 17:21
There is an inconsistency in re.sub() when substituting at the end of a string with a pattern that uses a '*' quantifier before the end anchor: the substitution now occurs twice.  For example:

txt = re.sub(r'\s*\Z', "\n", txt)

This should work like txt.rstrip() + "\n", but beginning in 3.7, the re.sub version now matches twice and changes any non-empty whitespace into "\n\n" instead of "\n". (If there is no trailing whitespace it only matches once.)

The bug is the same if '$' is used instead of '\Z', but it does not happen if an actual character is specified (e.g. a substitution of r'\s*x' does not substitute twice if x has preceding whitespace).

I tested 2.7.17, 3.6.9, 3.7.7, 3.8.2, and 3.9.0a4, and it starts to fail in 3.7.7 and beyond.

Attached is a test program.
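
A minimal sketch of the reported behavior follows. It is not the attached sub-bug.py, whose contents are not reproduced in this thread, and the helper name normalize_tail is hypothetical. The commented results assume Python 3.7 or later.

```python
import re

def normalize_tail(txt):
    # Hypothetical helper: intended to replace any trailing whitespace
    # with exactly one newline, like txt.rstrip() + "\n".
    return re.sub(r'\s*\Z', "\n", txt)

with_ws = normalize_tail("line  \n")     # 'line\n\n' on 3.7+ (two substitutions)
without_ws = normalize_tail("line")      # 'line\n' (one substitution)
expected = "line  \n".rstrip() + "\n"    # 'line\n', the result the report expects

print(repr(with_ws), repr(without_ws), repr(expected))
```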
msg364694 - Author: Matthew Barnett (mrabarnett) (Python triager) Date: 2020-03-20 17:46
Duplicate of Issue39687.

See https://docs.python.org/3/library/re.html#re.sub and https://docs.python.org/3/whatsnew/3.7.html#changes-in-the-python-api.
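
For context, the linked re.sub documentation notes that since 3.7 an empty match is replaced even when it is adjacent to a previous non-empty match. The sketch below uses the example from that documentation and then lists the match spans for the reported pattern; the sample string 'abc  ' is an assumption chosen for illustration.

```python
import re

# Example from the re.sub documentation: the empty match directly after
# the non-empty match 'x' is also replaced on 3.7+.
doc_example = re.sub('x*', '-', 'abxd')

# The reported pattern on an illustrative string: one non-empty match
# covering the trailing spaces, then an empty match at the very end.
spans = [m.span() for m in re.finditer(r'\s*\Z', 'abc  ')]

print(doc_example)   # -a-b--d-
print(spans)         # [(3, 5), (5, 5)]
```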
msg364697 - Author: Wayne Davison (WayneD) Date: 2020-03-20 17:56
This is not the same thing because the match is anchored, so it is not adjacent to the prior match -- it is the same match. I think that r'\s*\Z' should behave the same way as r'\s*x' due to the anchor point. The current behavior is matching the same \Z twice.
msg364699 - Author: Wayne Davison (WayneD) Date: 2020-03-20 18:06
Another argument in favor of this being a bug: the following does not exhibit the same doubling.

txt = ' test'
txt = re.sub(r'^\s*', '^', txt)

That always substitutes once.
msg375050 - Author: Wayne Davison (4wayned) Date: 2020-08-08 16:23
Can this bug please be reopened and fixed? This is an anchored substitution, and so should never match more than once.
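
Since the issue stayed closed as a duplicate, code that needs a single substitution has to ask for it explicitly. Neither approach below was proposed in this thread; both are sketches under the assumption that the goal is the txt.rstrip() + "\n" result from the original report.

```python
import re

txt = "abc  \n"

# Option 1: limit re.sub to the first match. Scanning left to right, the
# first position where \s*\Z can match is the start of the trailing run.
once = re.sub(r'\s*\Z', "\n", txt, count=1)

# Option 2: a negative lookbehind rejects the empty match at the end of
# the string whenever the preceding character is whitespace.
lookbehind = re.sub(r'(?<!\s)\s*\Z', "\n", txt)

print(repr(once), repr(lookbehind))   # 'abc\n' 'abc\n'
```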
msg402459 - Author: Brian (bsammon) Date: 2021-09-22 18:08
I just ran into this change in behavior myself.

It's worth noting that the new behavior appears to match Perl's behavior:

# perl -e 'print(("he" =~ s/e*\Z/ah/rg), "\n")'
hahah
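
For comparison, a Python rendering of the same substitution. Note that \Z differs slightly between the two languages (Perl's also matches before a final newline, Python's only at the very end), which makes no difference for the input 'he'. The result shown assumes Python 3.7 or later.

```python
import re

# 'e' at the end is replaced, then the empty match at the end of the
# string is replaced as well.
result = re.sub(r'e*\Z', 'ah', 'he')
print(result)   # hahah
```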
msg402461 - Author: Brian (bsammon) Date: 2021-09-22 18:18
txt = ' test'
txt = re.sub(r'^\s*', '^', txt)

substitutes once because the * is greedy.

txt = ' test'
txt = re.sub(r'^\s*?', '^', txt)

substitutes twice, consistent with the \Z behavior.
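
The difference between the two patterns can be seen by listing the match spans. This is a sketch assuming Python 3.7 or later, where a non-empty match may start at the position of a previous empty match.

```python
import re

txt = ' test'

# Greedy: the first match already consumes the space, so nothing is left
# for a second match at the start of the string.
greedy_spans = [m.span() for m in re.finditer(r'^\s*', txt)]
greedy = re.sub(r'^\s*', '^', txt)

# Lazy: the first match is empty; the next attempt at the same position
# must be non-empty, so it consumes the space.
lazy_spans = [m.span() for m in re.finditer(r'^\s*?', txt)]
lazy = re.sub(r'^\s*?', '^', txt)

print(greedy_spans, greedy)   # [(0, 1)] ^test
print(lazy_spans, lazy)       # [(0, 0), (0, 1)] ^^test
```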
History
Date User Action Args
2022-04-11 14:59:28 admin set github: 84208
2021-09-22 18:18:55 bsammon set messages: + msg402461
2021-09-22 18:08:34 bsammon set nosy: + bsammon; messages: + msg402459
2020-08-08 16:23:59 4wayned set nosy: + 4wayned; messages: + msg375050
2020-03-20 18:06:18 WayneD set messages: + msg364699
2020-03-20 17:56:29 WayneD set messages: + msg364697
2020-03-20 17:46:26 mrabarnett set status: open -> closed; resolution: duplicate; messages: + msg364694; stage: resolved
2020-03-20 17:21:13 WayneD create