Issue 40027: re.sub inconsistency beginning with 3.7

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/84208

classification

Title:	re.sub inconsistency beginning with 3.7
Type:	behavior	Stage:	resolved
Components:	Regular Expressions	Versions:	Python 3.9, Python 3.8, Python 3.7

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:
Assigned To:		Nosy List:	4wayned, WayneD, bsammon, ezio.melotti, mrabarnett
Priority:	normal	Keywords:

Created on 2020-03-20 17:21 by WayneD, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name	Uploaded	Description
sub-bug.py	WayneD, 2020-03-20 17:21	Test program to demonstrate re.sub bug.

Messages (7)
msg364688	Author: Wayne Davison (WayneD)	Date: 2020-03-20 17:21
There is an inconsistency in re.sub() when substituting at the end of a string using a prior match with a '*' qualifier: the substitution now occurs twice. For example:

    txt = re.sub(r'\s*\Z', "\n", txt)

This should work like txt.rstrip() + "\n", but beginning in 3.7, the re.sub version now matches twice and changes any non-empty trailing whitespace into "\n\n" instead of "\n". (If there is no trailing whitespace it only matches once.) The bug is the same if '$' is used instead of '\Z', but it does not happen if an actual character is specified (e.g. a substitution of r'\s*x' does not substitute twice if x has preceding whitespace).

I tested 2.7.17, 3.6.9, 3.7.7, 3.8.2, and 3.9.0a4, and it starts to fail in 3.7.7 and beyond. Attached is a test program.
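The report can be reproduced with a short script. This is a minimal sketch, not the attached sub-bug.py (which is not shown on this page); the sample strings are hypothetical. It uses re.subn() so that the number of substitutions is visible alongside the result.

```python
import re

# Hypothetical sample inputs; not taken from the attached test program.
with_ws = "test  "     # has trailing whitespace
without_ws = "test"    # no trailing whitespace

# re.subn() returns (new_string, number_of_substitutions).
doubled = re.subn(r'\s*\Z', "\n", with_ws)
dollar = re.subn(r'\s*$', "\n", with_ws)
single = re.subn(r'\s*\Z', "\n", without_ws)
literal = re.subn(r'\s*x', "\n", "test  x")

print(doubled)  # on 3.7+: ('test\n\n', 2)
print(dollar)   # on 3.7+: ('test\n\n', 2)
print(single)   # ('test\n', 1)
print(literal)  # ('test\n', 1)
```

According to the report, the first two calls perform a single substitution on 3.6 and earlier.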
msg364694	Author: Matthew Barnett (mrabarnett) *	Date: 2020-03-20 17:46
Duplicate of Issue39687. See https://docs.python.org/3/library/re.html#re.sub and https://docs.python.org/3/whatsnew/3.7.html#changes-in-the-python-api.
msg364697	Author: Wayne Davison (WayneD)	Date: 2020-03-20 17:56
This is not the same thing because the match is anchored, so it is not adjacent to the prior match -- it is the same match. I think that r'\s*\Z' should behave the same way as r'\s*x' due to the anchor point. The current behavior is matching the same \Z twice.
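One way to see what the engine is doing is to list the match spans with re.finditer(). The sketch below assumes Python 3.7 or later and a hypothetical sample string. It shows two distinct spans: a non-empty match covering the trailing whitespace, followed by an empty match at the end of the string. Both satisfy \Z, and the second is the kind of empty match adjacent to a previous non-empty match that the re.sub documentation describes as being replaced since 3.7.

```python
import re

txt = "test  "  # hypothetical sample input

# Collect (start, end) for every match of the anchored pattern.
spans = [m.span() for m in re.finditer(r'\s*\Z', txt)]

print(spans)  # on 3.7+: [(4, 6), (6, 6)]
```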
msg364699	Author: Wayne Davison (WayneD)	Date: 2020-03-20 18:06
Another argument in favor of this being a bug: this does not exhibit the same doubling:

    txt = ' test'
    txt = re.sub(r'^\s*', '^', txt)

That always substitutes once.
msg375050	Author: Wayne Davison (4wayned)	Date: 2020-08-08 16:23
Can this bug please be reopened and fixed? This is an anchored substitution, and so should never match more than once.
msg402459	Author: Brian (bsammon)	Date: 2021-09-22 18:08
I just ran into this change in behavior myself. It's worth noting that the new behavior appears to match Perl's behavior:

    # perl -e 'print(("he" =~ s/e*\Z/ah/rg), "\n")'
    hahah
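For comparison, a Python counterpart of the Perl one-liner might look like the sketch below (assuming Python 3.7 or later). The 'e' before the end of the string is replaced once, and the empty match at the end is replaced a second time.

```python
import re

# Same pattern, replacement, and subject as the Perl example.
result = re.sub(r'e*\Z', 'ah', 'he')

print(result)  # on 3.7+: hahah
```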
msg402461	Author: Brian (bsammon)	Date: 2021-09-22 18:18
    txt = ' test'
    txt = re.sub(r'^\s*', '^', txt)

substitutes once because the * is greedy.

    txt = ' test'
    txt = re.sub(r'^\s*?', '^', txt)

substitutes twice, consistent with the \Z behavior.
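The greedy and lazy cases can be compared side by side. This is a sketch assuming Python 3.7 or later. With the lazy quantifier, the first match is empty at position 0; the next attempt starts at the same position but may not be empty again, so it consumes the space.

```python
import re

txt = ' test'

# Greedy: the leading space is consumed in one match, and '^' cannot match again.
greedy = re.subn(r'^\s*', '^', txt)

# Lazy: an empty match at position 0, then a non-empty match from the same position.
lazy = re.subn(r'^\s*?', '^', txt)
lazy_spans = [m.span() for m in re.finditer(r'^\s*?', txt)]

print(greedy)      # ('^test', 1)
print(lazy)        # on 3.7+: ('^^test', 2)
print(lazy_spans)  # on 3.7+: [(0, 0), (0, 1)]
```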

History
Date	User	Action	Args
2022-04-11 14:59:28	admin	set	github: 84208
2021-09-22 18:18:55	bsammon	set	messages: + msg402461
2021-09-22 18:08:34	bsammon	set	nosy: + bsammon messages: + msg402459
2020-08-08 16:23:59	4wayned	set	nosy: + 4wayned messages: + msg375050
2020-03-20 18:06:18	WayneD	set	messages: + msg364699
2020-03-20 17:56:29	WayneD	set	messages: + msg364697
2020-03-20 17:46:26	mrabarnett	set	status: open -> closed resolution: duplicate messages: + msg364694 stage: resolved
2020-03-20 17:21:13	WayneD	create