zero-length match confuses re.finditer() #44519
Comments
Hi! re.finditer() seems to incorrectly increment the current position immediately after matching a zero-length substring. For example:
>>> [m.groups() for m in re.finditer(r'(^z*)|(\w+)', 'abc')]
[('', None), (None, 'bc')]
What happened to the 'a'? I expected this result:
[('', None), (None, 'abc')]
Perl agrees with me:
% perl -le 'print defined($1)?"\"$1\"":"undef",",",defined($2)?"\"$2\"":"undef" while "abc" =~ /(z*)|(\w+)/g'
Similarly, if I remove the ^:
>>> [m.groups() for m in re.finditer(r'(z*)|(\w+)', 'abc')]
[('', None), ('', None), ('', None), ('', None)]
Now all of the letters have fallen through the cracks! I expected this result:
[('', None), (None, 'abc'), ('', None)]
Again, perl agrees:
% perl -le 'print defined($1)?"\"$1\"":"undef",",",defined($2)?"\"$2\"":"undef" while "abc" =~ /(z*)|(\w+)/g'
If this bug has already been reported, I apologize -- I wasn't able to find it here. I haven't looked at the code for the re module, but this seems like the sort of bug that might have been accidentally introduced in order to try to prevent the same zero-length match from being returned forever.
Thanks, |
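The guess at the end of the report can be made concrete with a rough pure-Python sketch. It assumes the scanner simply steps one character forward after any zero-length match, which is one way to avoid returning the same empty match forever. `legacy_scan` is a hypothetical helper written for illustration, not the re module's actual code; it only relies on `Pattern.search(string, pos)`.

```python
import re

def legacy_scan(pattern, string):
    """Collect groups() per match, stepping one character past empty matches.

    Illustrative assumption: after a zero-length match the next search
    starts one character later, so a non-empty match starting at the same
    position can never be found.
    """
    compiled = re.compile(pattern)
    pos = 0
    rows = []
    while pos <= len(string):
        m = compiled.search(string, pos)
        if m is None:
            break
        rows.append(m.groups())
        if m.end() == m.start():
            # Empty match: skip a character to guarantee progress.
            pos = m.end() + 1
        else:
            pos = m.end()
    return rows

print(legacy_scan(r'(^z*)|(\w+)', 'abc'))
print(legacy_scan(r'(z*)|(\w+)', 'abc'))
```

Under that assumption the sketch loses the 'a' in the first pattern and every letter in the second, which is the behavior described in the report.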
This also affects re.findall(). |
What should:
return? Should the second group also yield a zero-width match before the |
Hmmm. This strikes me as a bug, beyond the realm of bpo-3262. The If it is indeed a bug, I think this should be considered for inclusion |
Never mind inclusion in 2.6 as no-one has repeated this bug in re-world |
Ah, I see the problem, if ptr is not incremented, then it will keep "",undef,undef Meaning it doesn't even bother matching the ^q* since the ^z* matches |
What about r'(^z*)|(q*)|(\w+)'? I could imagine that the first group |
FYI, I posted msg73737 after finding that the fix for the original case |
Perl gives this result for your new expression: "",undef,undef I think it has to do with not thinking of a string as a sequence of |
I have to report that the fix appears to be successful:
>>> print [m.groups() for m in re.finditer(r'(^z*)|(\w+)', 'abc')]
[('', None), (None, 'abc')]
>>> print re.findall(r"(^z*)|(\w+)", "abc")
[('', ''), ('', 'abc')]
>>> print [m.groups() for m in re.finditer(r"(^z*)|(q*)|(\w+)", "abc")]
[('', None, None), (None, None, 'abc'), (None, '', None)]
>>> print re.findall(r"(^z*)|(q*)|(\w+)", "abc")
[('', '', ''), ('', '', 'abc'), ('', '', '')]
The patch is regex_2.6rc2+7.diff. |
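The finditer() and findall() rows in this session differ only in how a group that did not take part in the match is shown: `m.groups()` reports `None`, while findall() reports an empty string. A small sketch with a deliberately simpler pattern (chosen for illustration, so it does not depend on how empty matches are handled) shows the correspondence:

```python
import re

pattern = re.compile(r"(a)|(b)")

# groups() marks a non-participating group with None.
finditer_rows = [m.groups() for m in pattern.finditer("ab")]

# findall() marks a non-participating group with an empty string.
findall_rows = pattern.findall("ab")

# Passing "" as the default to groups() reproduces the findall() rows.
filled_rows = [m.groups("") for m in pattern.finditer("ab")]

print(finditer_rows)
print(findall_rows)
print(filled_rows == findall_rows)
```

So an empty string in a findall() row does not by itself tell you whether the group matched the empty string or did not participate at all; the finditer() rows are the ones to read for that.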
Matthew, I'll try to merge all your diffs with the current repository |
I just re-tested this issue in trunk at changeset 053bc5ca199b and the issue is still exactly reproducible as originally reported. That is, after the match on the empty string, the next match skips a character:
>>> import re
>>> [m.groups() for m in re.finditer(r'(^z*)|(\w+)', 'abc')]
[('', None), (None, 'bc')] |
This is still an issue today:
>>> import re
>>> [m.groups() for m in re.finditer(r'(^z*)|(\w+)', 'abc')]
[('', None), (None, 'bc')] |
New changeset 70d56fb by Serhiy Storchaka in branch 'master': |
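To check whether a given interpreter includes the fix, the patterns from this thread can be compared against the results the reporter expected. This is a minimal sketch, assuming Python 3.7 or later (where, to my knowledge, the fix first shipped); `EXPECTED` and `check` are names made up for this example.

```python
import re

# Results the original report expected, plus the three-group case
# posted later in the thread.
EXPECTED = {
    r'(^z*)|(\w+)': [('', None), (None, 'abc')],
    r'(z*)|(\w+)': [('', None), (None, 'abc'), ('', None)],
    r'(^z*)|(q*)|(\w+)': [('', None, None), (None, None, 'abc'), (None, '', None)],
}

def check(pattern, text='abc'):
    """Return True if finditer() yields the expected groups for pattern."""
    actual = [m.groups() for m in re.finditer(pattern, text)]
    return actual == EXPECTED[pattern]

for pat in EXPECTED:
    print(pat, 'matches expectation:', check(pat))
```

On an interpreter without the fix, the first pattern should report False, because the second match comes back as 'bc' rather than 'abc'.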