Issue 30720: re.sub substitution match group contains wrong value after unmatched pattern was processed

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/74905

classification

Title:	re.sub substitution match group contains wrong value after unmatched pattern was processed
Type:		Stage:	resolved
Components:	Regular Expressions	Versions:	Python 3.6

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	William Budd, ezio.melotti, mrabarnett, serhiy.storchaka
Priority:	normal	Keywords:

Created on 2017-06-21 02:38 by William Budd, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (7)
msg296506 - (view)	Author: William Budd (William Budd)	Date: 2017-06-21 02:38
pattern = re.compile('<div>(<p>.*?</p>)</div>', flags=re.DOTALL)

----------------------------------------------------------------

# This works as expected in the following case:

print(re.sub(pattern, '\\1',
    '<div><p>foo</p></div>\n'
    '<div><p>bar</p>123456789</div>\n'))

# which outputs:

<p>foo</p>
<div><p>bar</p>123456789</div>

----------------------------------------------------------------

# However, it does NOT work as I expect in this case:

print(re.sub(pattern, '\\1',
    '<div><p>foo</p>123456789</div>\n'
    '<div><p>bar</p></div>\n'))

# actual output:

<p>foo</p>123456789</div>
<div><p>bar</p>

# expected output:

<div><p>foo</p>123456789</div>
<p>bar</p>

----------------------------------------------------------------

It seems that pattern matching/substitution iterations only go haywire once the matching iteration immediately prior to it turned out not to be a match. Maybe some internal variable is not cleaned up properly in an edge(?) case triggered by the example above?
msg296508 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-06-21 03:26
It works correctly. Scanning from left to right, it finds a substring that starts with '<div><p>' and ends with '</p></div>'. The leftmost such substring starts at index 0 and ends just before the final '\n'. Overlapping matches are not found.
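
The span described above can be checked directly. This is a minimal sketch using the pattern and the second input from the original report; the variable names text and match are only illustrative.

```python
import re

pattern = re.compile('<div>(<p>.*?</p>)</div>', flags=re.DOTALL)
text = ('<div><p>foo</p>123456789</div>\n'
        '<div><p>bar</p></div>\n')

match = pattern.search(text)

# The single match starts at index 0 and stops just before the final '\n'.
print(match.span())            # (0, 52)
print(len(text))               # 53

# Group 1 therefore swallows the first closing </div> and the second <div>.
print(repr(match.group(1)))

# Only one non-overlapping match exists in the whole string.
print(len(pattern.findall(text)))   # 1
```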
msg296509 - (view)	Author: William Budd (William Budd)	Date: 2017-06-21 03:32
I don't understand... Isn't the "?" in ".*?" supposed to make the "." matching non-greedy, hence matching the first "</p>" rather than the last "</p>"?
msg296510 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-06-21 03:47
Yes, it is non-greedy. But it needs to match not just '</p>', but '</p></div>'. After finding the first '</p>' it doesn't see a '</div>' right after it, so it continues searching until it finds '</p></div>'.
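
A small sketch of this point, assuming the same input as in the report: the lazy quantifier stops at the first position where the rest of the pattern can match, so what follows ".*?" decides where it stops. The names short and longer are illustrative only.

```python
import re

text = ('<div><p>foo</p>123456789</div>\n'
        '<div><p>bar</p></div>\n')

# With only '</p>' after the lazy quantifier, the first '</p>' ends the match.
short = re.search('<p>.*?</p>', text, flags=re.DOTALL)
print(short.group())           # <p>foo</p>

# With '</p></div>' after it, the first '</p>' is rejected because
# '123456789' follows, so '.*?' keeps growing until '</p></div>' fits.
longer = re.search('<p>.*?</p></div>', text, flags=re.DOTALL)
print(repr(longer.group()))
```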
msg296511 - (view)	Author: William Budd (William Budd)	Date: 2017-06-21 04:04
I now see you're right of course. Not a bug after all. Thank you. I mistakenly assumed that the group boundary ")" would delimit the end of the non-greedy match group. I.e., ".*?</p>" versus ".*?</p></div>". I don't see a way to accomplish the "even less greedy" variant I'm looking for though...
msg296515 - (view)	Author: William Budd (William Budd)	Date: 2017-06-21 04:50
Doh! This has a really easy solution, doesn't it; just replace "." with "[^<]": re.compile('<div>(<p>[^<]*?</p>)</div>', flags=re.DOTALL). Sorry about the noise.
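
For reference, a minimal sketch of this fix applied to the failing input from the first message. The second input (nested) is a made-up example, not from the report, added to show one limitation: because "[^<]" rejects every "<", a paragraph containing inline tags is never matched.

```python
import re

fixed = re.compile('<div>(<p>[^<]*?</p>)</div>', flags=re.DOTALL)

text = ('<div><p>foo</p>123456789</div>\n'
        '<div><p>bar</p></div>\n')

# '[^<]' cannot cross the first '</p>', so the first <div> no longer matches
# and only the second one is unwrapped.
result = fixed.sub(r'\1', text)
print(result, end='')
# <div><p>foo</p>123456789</div>
# <p>bar</p>

# Hypothetical input with an inline tag: the pattern finds nothing to replace.
nested = '<div><p>a <b>b</b></p></div>\n'
print(fixed.sub(r'\1', nested) == nested)   # True
```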
msg296527 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-06-21 06:54
Atomic groups can help you: '<div>((?><p>.*?</p>))</div>'. But this feature is not supported in the re module yet (see issue433030). You can use the third-party regex module which is compatible with the re module and supports atomic grouping.

>>> import regex as re
>>> pattern = re.compile('<div>((?><p>.*?</p>))</div>', flags=re.DOTALL)
>>> print(re.sub(pattern, '\\1',
...     '<div><p>foo</p>123456789</div>\n'
...     '<div><p>bar</p></div>\n'))
<div><p>foo</p>123456789</div>
<p>bar</p>
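
Where the regex module is not available, the same effect can be approximated with the standard re module. The sketch below is not from this thread: it uses the common trick of capturing inside a lookahead and then consuming the capture with a backreference, which works because a lookahead is not re-entered once it has succeeded. The final check relies on the assumption that the interpreter is Python 3.11 or later, where re accepts the "(?>...)" syntax natively.

```python
import re
import sys

text = ('<div><p>foo</p>123456789</div>\n'
        '<div><p>bar</p></div>\n')

# (?=(<p>.*?</p>)) captures the shortest <p>...</p> without consuming it;
# \1 then consumes exactly that text. If '</div>' does not follow, the
# engine cannot go back and lengthen the capture, so the attempt fails.
emulated = re.compile(r'<div>(?=(<p>.*?</p>))\1</div>', flags=re.DOTALL)

result = emulated.sub(r'\1', text)
print(result, end='')
# <div><p>foo</p>123456789</div>
# <p>bar</p>

# On Python 3.11+ the atomic group from the message above compiles in re
# directly and should give the same result.
if sys.version_info >= (3, 11):
    atomic = re.compile('<div>((?><p>.*?</p>))</div>', flags=re.DOTALL)
    assert atomic.sub(r'\1', text) == result
```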

History
Date	User	Action	Args
2022-04-11 14:58:47	admin	set	github: 74905
2017-06-21 06:54:30	serhiy.storchaka	set	messages: + msg296527
2017-06-21 04:50:57	William Budd	set	messages: + msg296515
2017-06-21 04:04:51	William Budd	set	messages: + msg296511
2017-06-21 03:47:08	serhiy.storchaka	set	messages: + msg296510
2017-06-21 03:32:13	William Budd	set	messages: + msg296509
2017-06-21 03:26:30	serhiy.storchaka	set	status: open -> closed nosy: + serhiy.storchaka messages: + msg296508 resolution: not a bug stage: resolved
2017-06-21 02:38:32	William Budd	create