Issue 1647489: zero-length match confuses re.finditer()

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/44519

classification

Title:	zero-length match confuses re.finditer()
Type:	behavior	Stage:	resolved
Components:	Library (Lib), Regular Expressions	Versions:	Python 3.7, Python 3.6, Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	serhiy.storchaka	Nosy List:	THRlWiTi, denversc, ezio.melotti, isoschiz, jfrechet, mrabarnett, niemeyer, rsc, serhiy.storchaka, timehorse
Priority:	normal	Keywords:	patch

Created on 2007-01-29 22:35 by jfrechet, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked	Edit
PR 4471	merged	serhiy.storchaka, 2017-11-19 23:36
PR 4678	closed	serhiy.storchaka, 2017-12-02 17:32

Messages (15)
msg31129 - (view)	Author: Jacques Frechet (jfrechet)	Date: 2007-01-29 22:35
Hi! re.finditer() seems to incorrectly increment the current position immediately after matching a zero-length substring. For example: >>> [m.groups() for m in re.finditer(r'(^z)\|(\w+)', 'abc')] [('', None), (None, 'bc')] What happened to the 'a'? I expected this result: [('', None), (None, 'abc')] Perl agrees with me: % perl -le 'print defined($1)?"\"$1\"":"undef",",",defined($2)?"\"$2\"":"undef" while "abc" =~ /(z)\|(\w+)/g' "",undef undef,"abc" "",undef Similarly, if I remove the ^: >>> [m.groups() for m in re.finditer(r'(z)\|(\w+)', 'abc')] [('', None), ('', None), ('', None), ('', None)] Now all of the letters have fallen through the cracks! I expected this result: [('', None), (None, 'abc'), ('', None)] Again, perl agrees: % perl -le 'print defined($1)?"\"$1\"":"undef",",",defined($2)?"\"$2\"":"undef" while "abc" =~ /(z)\|(\w+)/g' "",undef undef,"abc" "",undef If this bug has already been reported, I apologize -- I wasn't able to find it here. I haven't looked at the code for the re module, but this seems like the sort of bug that might have been accidentally introduced in order to try to prevent the same zero-length match from being returned forever. Thanks, Jacques
msg73719 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2008-09-24 15:37
This also affects re.findall().
msg73737 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2008-09-24 17:59
What should: [m.groups() for m in re.finditer(r'(^z)\|(^q)\|(\w+)', 'abc')] return? Should the second group also yield a zero-width match before the third group is tried? I think it probably should. Does Perl?
msg73741 - (view)	Author: Jeffrey C. Jacobs (timehorse)	Date: 2008-09-24 18:14
Hmmm. This strikes me as a bug, beyond the realm of Issue 3262. The two items may be related, but the dropping of the 'a' seems like unexpected behaviour that I doubt any current code is expecting to occur. Clearly, what is going on is that the Engine starts scanning at the 'a', finds the Zero-Width match and, having found a match, increments its pointer within the input string, thus skipping the 'a' when it matches 'bc'. If it is indeed a bug, I think this should be considered for inclusion in Python 2.6 rather than being part of the new Engine Design in Issue 3626. I think the solution would simply be to not increment the ptr (which points to the input string) when findall / finditer encounters a Zero-Width match.
msg73742 - (view)	Author: Jeffrey C. Jacobs (timehorse)	Date: 2008-09-24 18:21
Never mind inclusion in 2.6 as no-one has repeated this bug in re-world examples yet so it's going to have to wait for the Regexp 2.7 engine in issue 2636.
msg73746 - (view)	Author: Jeffrey C. Jacobs (timehorse)	Date: 2008-09-24 18:33
Ah, I see the problem, if ptr is not incremented, then it will keep matching the first expression, (^z), so it would have to both 'skip' the 'a' and NOT skip the 'a'. Hmm. You're right, Matthew, this is pretty complicated. Now, for your expression, Matthew, r'(z)\|(^q)\|(\w+)', Perl gives: "",undef,undef undef,undef,"abc" "",undef,undef Meaning it doesn't even bother matching the ^q since the ^z* matches first. This seems the logical behaviour and fits with the idea that a Zero-Width match would both only match once and NOT consume any characters. An internal flag would just have to be created to tell the 2 find functions whether the current value of ptr would allow for a "No Zero-Width Match" option on second go-around.
msg73755 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2008-09-24 21:14
What about r'(^z)\|(q)\|(\w+)'? I could imagine that the first group could match only at the start of the string, but if the second group doesn't have that restriction then it could match the second time, and only after that could the third match, if you see what I mean. (The previous example had (^q*) so it couldn't match because the first group has already matched at the start of the string and we've already advanced beyond that, even though by no characters!)
msg73765 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2008-09-24 23:34
FYI, I posted msg73737 after finding that the fix for the original case was really very simple, but then thought about whether it would behave as expected when there were more zero-width matches, hence the later posts.
msg73789 - (view)	Author: Jeffrey C. Jacobs (timehorse)	Date: 2008-09-25 13:08
Perl gives this result for your new expression: "",undef,undef undef,undef,"abc" undef,"",undef I think it has to do with not thinking of a string as a sequence of characters, but as a sequence of characters separated by null-space. Null-space is can be captured, but ONLY if it is part of a zero-width match, and once captured, it can no longer be captured by another zero-width expression. This is in keeping which what I see as Perl's behaviour, namely that the (q) group never participates in the first match because, initially the (^z) captures it. OTOH, when it gets to the null-space AFTER the 'abc' capture, the (^z) cannot participate because it has a "at-beginning" restriction. The evaluator then moves on to the (q), which has no such restriction and this time it matches, consuming the final null-space.
msg73792 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2008-09-25 13:55
I have to report that the fix appears to be successful: >>> print [m.groups() for m in re.finditer(r'(^z)\|(\w+)', 'abc')] [('', None), (None, 'abc')] >>> print re.findall(r"(^z)\|(\w+)", "abc") [('', ''), ('', 'abc')] >>> print [m.groups() for m in re.finditer(r"(^z)\|(q)\|(\w+)", "abc")] [('', None, None), (None, None, 'abc'), (None, '', None)] >>> print re.findall(r"(^z)\|(q)\|(\w+)", "abc") [('', '', ''), ('', '', 'abc'), ('', '', '')] The patch is regex_2.6rc2+7.diff.
msg73809 - (view)	Author: Jeffrey C. Jacobs (timehorse)	Date: 2008-09-25 19:22
Matthew, I'll try to merge all your diffs with the current repository over the weekend. Having done the first, I know where code differs between your implementation, mine and the base, so I can apply your patch, and then a patch that restores my changes so the rest of the merges should be easy! :)
msg132827 - (view)	Author: Denver Coneybeare (denversc) *	Date: 2011-04-03 01:59
I just re-tested this issue in trunk at changeset 053bc5ca199b and the issue is still exactly reproducible as originally reported. That is, the match to the empty string skips a character of the match: >>> import re >>> [m.groups() for m in re.finditer(r'(^z*)\|(\w+)', 'abc')] [('', None), (None, 'bc')]
msg187318 - (view)	Author: Martin Morrison (isoschiz) *	Date: 2013-04-19 00:21
This is still an issue today: >>> import re >>> [m.groups() for m in re.finditer(r'(^z*)\|(\w+)', 'abc')] [('', None), (None, 'bc')]
msg221979 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2014-06-30 20:15
How does "the Regexp 2.7 engine in issue 2636" from msg73742 deal with this situation?
msg307556 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-12-04 12:29
New changeset 70d56fb52582d9d3f7c00860d6e90570c6259371 by Serhiy Storchaka in branch 'master': bpo-25054, bpo-1647489: Added support of splitting on zerowidth patterns. (#4471) https://github.com/python/cpython/commit/70d56fb52582d9d3f7c00860d6e90570c6259371

History
Date	User	Action	Args
2022-04-11 14:56:22	admin	set	github: 44519
2018-03-14 17:17:12	serhiy.storchaka	set	status: open -> closed resolution: fixed stage: patch review -> resolved
2017-12-04 12:29:09	serhiy.storchaka	set	messages: + msg307556
2017-12-02 17:32:36	serhiy.storchaka	set	pull_requests: + pull_request4587
2017-11-19 23:36:58	serhiy.storchaka	set	keywords: + patch stage: patch review pull_requests: + pull_request4404
2017-11-18 15:55:13	serhiy.storchaka	set	nosy: + ezio.melotti type: behavior components: + Library (Lib)
2017-11-18 15:55:00	serhiy.storchaka	set	assignee: niemeyer -> serhiy.storchaka nosy: + serhiy.storchaka versions: + Python 3.6, Python 3.7
2016-11-05 17:51:35	BreamoreBoy	set	nosy: - BreamoreBoy
2016-11-05 14:23:33	THRlWiTi	set	nosy: + THRlWiTi
2014-06-30 20:15:38	BreamoreBoy	set	nosy: + BreamoreBoy messages: + msg221979
2013-04-19 00:21:29	isoschiz	set	nosy: + isoschiz messages: + msg187318
2011-04-03 01:59:19	denversc	set	nosy: + denversc messages: + msg132827
2008-09-25 19:22:06	timehorse	set	messages: + msg73809
2008-09-25 13:55:38	mrabarnett	set	messages: + msg73792
2008-09-25 13:08:16	timehorse	set	messages: + msg73789
2008-09-24 23:34:19	mrabarnett	set	messages: + msg73765
2008-09-24 21:14:13	mrabarnett	set	messages: + msg73755
2008-09-24 18:33:08	timehorse	set	messages: + msg73746
2008-09-24 18:21:05	timehorse	set	messages: + msg73742 versions: + Python 2.7, - Python 2.5
2008-09-24 18:14:27	timehorse	set	messages: + msg73741
2008-09-24 17:59:56	mrabarnett	set	messages: + msg73737
2008-09-24 17:35:57	timehorse	set	nosy: + timehorse
2008-09-24 15:37:46	mrabarnett	set	nosy: + mrabarnett messages: + msg73719
2008-04-24 21:08:30	rsc	set	nosy: + rsc
2007-01-29 22:35:21	jfrechet	create