zero-length match confuses re.finditer() #44519

Closed
jfrechet mannequin opened this issue Jan 29, 2007 · 15 comments
Labels
3.7 (EOL) end of life stdlib Python modules in the Lib dir topic-regex type-bug An unexpected behavior, bug, or error

Comments

@jfrechet
Mannequin

jfrechet mannequin commented Jan 29, 2007

BPO 1647489
Nosy @ezio-melotti, @serhiy-storchaka
PRs
  • bpo-25054, bpo-1647489: Added support of splitting on zerowidth patterns. #4471
  • bpo-25054, bpo-1647489: Added support of splitting on zerowidth patterns (alternate version). #4678
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2018-03-14.17:17:12.339>
    created_at = <Date 2007-01-29.22:35:21.000>
    labels = ['expert-regex', 'type-bug', 'library', '3.7']
    title = 'zero-length match confuses re.finditer()'
    updated_at = <Date 2018-03-14.17:17:12.338>
    user = 'https://bugs.python.org/jfrechet'

    bugs.python.org fields:

    activity = <Date 2018-03-14.17:17:12.338>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2018-03-14.17:17:12.339>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)', 'Regular Expressions']
    creation = <Date 2007-01-29.22:35:21.000>
    creator = 'jfrechet'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 1647489
    keywords = ['patch']
    message_count = 15.0
    messages = ['31129', '73719', '73737', '73741', '73742', '73746', '73755', '73765', '73789', '73792', '73809', '132827', '187318', '221979', '307556']
    nosy_count = 10.0
    nosy_names = ['niemeyer', 'jfrechet', 'rsc', 'timehorse', 'ezio.melotti', 'mrabarnett', 'THRlWiTi', 'denversc', 'serhiy.storchaka', 'isoschiz']
    pr_nums = ['4471', '4678']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue1647489'
    versions = ['Python 2.7', 'Python 3.6', 'Python 3.7']


    Hi!

    re.finditer() seems to incorrectly increment the current position immediately after matching a zero-length substring. For example:

    >>> [m.groups() for m in re.finditer(r'(^z*)|(\w+)', 'abc')]
    [('', None), (None, 'bc')]

    What happened to the 'a'? I expected this result:

    [('', None), (None, 'abc')]

    Perl agrees with me:

    % perl -le 'print defined($1)?"\"$1\"":"undef",",",defined($2)?"\"$2\"":"undef" while "abc" =~ /(^z*)|(\w+)/g'
    "",undef
    undef,"abc"

    Similarly, if I remove the ^:

    >>> [m.groups() for m in re.finditer(r'(z*)|(\w+)', 'abc')]
    [('', None), ('', None), ('', None), ('', None)]

    Now all of the letters have fallen through the cracks! I expected this result:

    [('', None), (None, 'abc'), ('', None)]

    Again, perl agrees:

    % perl -le 'print defined($1)?"\"$1\"":"undef",",",defined($2)?"\"$2\"":"undef" while "abc" =~ /(z*)|(\w+)/g'
    "",undef
    undef,"abc"
    "",undef

    If this bug has already been reported, I apologize -- I wasn't able to find it here. I haven't looked at the code for the re module, but this seems like the sort of bug that might have been accidentally introduced in order to try to prevent the same zero-length match from being returned forever.

    Thanks,
    Jacques
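For readers who want to re-run the report, here is a minimal sketch of the two queries above. It assumes Python 3.7 or later, the version this issue's metadata lists as carrying the fix. On such an interpreter both lists should come out as the reporter expected; the variable names are only illustrative.

```python
import re

# The two patterns from the report, run against the same input.
anchored = [m.groups() for m in re.finditer(r'(^z*)|(\w+)', 'abc')]
unanchored = [m.groups() for m in re.finditer(r'(z*)|(\w+)', 'abc')]

print(anchored)    # on Python 3.7+: [('', None), (None, 'abc')]
print(unanchored)  # on Python 3.7+: [('', None), (None, 'abc'), ('', None)]
```

On interpreters older than 3.7 the same script should reproduce the results quoted in the report instead.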

    @jfrechet jfrechet mannequin assigned niemeyer Jan 29, 2007
    @jfrechet jfrechet mannequin added the topic-regex label Jan 29, 2007
    @mrabarnett
    Mannequin

    mrabarnett mannequin commented Sep 24, 2008

    This also affects re.findall().

    @mrabarnett
    Mannequin

    mrabarnett mannequin commented Sep 24, 2008

    What should:

    [m.groups() for m in re.finditer(r'(^z*)|(^q*)|(\w+)', 'abc')]
    

    return? Should the second group also yield a zero-width match before the
    third group is tried? I think it probably should. Does Perl?

    @timehorse
    Mannequin

    timehorse mannequin commented Sep 24, 2008

    Hmmm. This strikes me as a bug, beyond the realm of bpo-3262. The
    two items may be related, but the dropping of the 'a' seems like
    unexpected behaviour that I doubt any current code is expecting to
    occur. Clearly, what is going on is that the Engine starts scanning at
    the 'a', finds the Zero-Width match and, having found a match,
    increments its pointer within the input string, thus skipping the 'a'
    when it matches 'bc'.

    If it is indeed a bug, I think this should be considered for inclusion
    in Python 2.6 rather than being part of the new Engine Design in Issue
    2636. I think the solution would simply be to not increment the ptr
    (which points to the input string) when findall / finditer encounters a
    Zero-Width match.
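The pointer behaviour described in this comment can be imitated in pure Python. The sketch below is not the C engine's code: `finditer_skipping` is a hypothetical helper that calls `search()` repeatedly and, after an empty match, bumps the position by one character, which is the step the comment blames for the lost 'a'.

```python
import re

def finditer_skipping(pattern, string):
    """Hypothetical emulation of the old scan loop (not the real C code)."""
    regex = re.compile(pattern)
    pos = 0
    while pos <= len(string):
        m = regex.search(string, pos)
        if m is None:
            break
        yield m
        if m.end() == m.start():
            # Empty match: step over one character so the scan cannot loop
            # forever. This is the step that drops the 'a'.
            pos = m.end() + 1
        else:
            pos = m.end()

print([m.groups() for m in finditer_skipping(r'(^z*)|(\w+)', 'abc')])
# [('', None), (None, 'bc')]
print([m.groups() for m in finditer_skipping(r'(z*)|(\w+)', 'abc')])
# [('', None), ('', None), ('', None), ('', None)]
```

Both outputs match the results quoted in the original report.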

    @timehorse
    Mannequin

    timehorse mannequin commented Sep 24, 2008

    Never mind inclusion in 2.6 as no-one has repeated this bug in real-world
    examples yet so it's going to have to wait for the Regexp 2.7 engine in
    bpo-2636.

    @timehorse
    Mannequin

    timehorse mannequin commented Sep 24, 2008

    Ah, I see the problem, if ptr is not incremented, then it will keep
    matching the first expression, (^z*), so it would have to both 'skip'
    the 'a' and NOT skip the 'a'. Hmm. You're right, Matthew, this is
    pretty complicated. Now, for your expression, Matthew,
    r'(^z*)|(^q*)|(\w+)', Perl gives:

    "",undef,undef
    undef,undef,"abc"

    Meaning it doesn't even bother matching the ^q* since the ^z* matches
    first. This seems the logical behaviour and fits with the idea that a
    Zero-Width match would both only match once and NOT consume any
    characters. An internal flag would just have to be created to tell the
    2 find functions whether the current value of ptr would allow for a "No
    Zero-Width Match" option on second go-around.
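The "internal flag" idea can be sketched in pure Python under some loud assumptions. `finditer_retry_nonempty` is a hypothetical helper, not the patch discussed in this thread. After an empty match it searches again from the same position, but appends a lookbehind that rejects any match ending at that position. It also assumes the pattern carries no global inline flags such as `(?i)`, since the pattern is wrapped in a non-capturing group.

```python
import re

def finditer_retry_nonempty(pattern, string):
    """Hypothetical sketch: an empty match may not be repeated at one position."""
    plain = re.compile(pattern)
    pos = 0
    must_advance = False
    while pos <= len(string):
        if must_advance:
            # The lookbehind needs pos + 1 characters before the match end,
            # so a match ending exactly at `pos` is rejected and the engine
            # backtracks into the other alternatives.
            regex = re.compile(r"(?:%s)(?s:(?<=.{%d}))" % (pattern, pos + 1))
        else:
            regex = plain
        m = regex.search(string, pos)
        if m is None:
            break
        yield m
        must_advance = m.end() == m.start()
        pos = m.end()

print([m.groups() for m in finditer_retry_nonempty(r'(^z*)|(\w+)', 'abc')])
# [('', None), (None, 'abc')]
print([m.groups() for m in finditer_retry_nonempty(r'(z*)|(\w+)', 'abc')])
# [('', None), (None, 'abc'), ('', None)]
```

Recompiling the pattern for each retry is wasteful and is done here only to keep the sketch short; a real engine would carry the flag in its scan state instead.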

    @mrabarnett
    Mannequin

    mrabarnett mannequin commented Sep 24, 2008

    What about r'(^z*)|(q*)|(\w+)'? I could imagine that the first group
    could match only at the start of the string, but if the second group
    doesn't have that restriction then it could match the second time, and
    only after that could the third match, if you see what I mean. (The
    previous example had (^q*) so it couldn't match because the first group
    has already matched at the start of the string and we've already
    advanced beyond that, even though by no characters!)

    @mrabarnett
    Mannequin

    mrabarnett mannequin commented Sep 24, 2008

    FYI, I posted msg73737 after finding that the fix for the original case
    was really very simple, but then thought about whether it would behave
    as expected when there were more zero-width matches, hence the later posts.

    @timehorse
    Mannequin

    timehorse mannequin commented Sep 25, 2008

    Perl gives this result for your new expression:

    "",undef,undef
    undef,undef,"abc"
    undef,"",undef

    I think it has to do with not thinking of a string as a sequence of
    characters, but as a sequence of characters separated by null-space.
    Null-space can be captured, but ONLY if it is part of a zero-width
    match, and once captured, it can no longer be captured by another
    zero-width expression. This is in keeping with what I see as Perl's
    behaviour, namely that the (q*) group never participates in the first
    match because, initially, the (^z*) captures it. OTOH, when it gets to
    the null-space AFTER the 'abc' capture, the (^z*) cannot participate
    because it has an "at-beginning" restriction. The evaluator then moves
    on to the (q*), which has no such restriction and this time it matches,
    consuming the final null-space.
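The "null-space" picture is easier to see when the spans are printed along with the participating group. This is a small sketch, assuming Python 3.7 or later; `spans` is just an illustrative name.

```python
import re

pattern = re.compile(r'(^z*)|(q*)|(\w+)')

# Pair each match's (start, end) span with the index of the group that matched.
spans = [(m.span(), m.lastindex) for m in pattern.finditer('abc')]
print(spans)
# on Python 3.7+: [((0, 0), 1), ((0, 3), 3), ((3, 3), 2)]
```

Group 1 takes the empty slot before 'a', group 3 takes 'abc', and group 2 takes the empty slot after 'c', which lines up with the Perl output quoted above.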

    @mrabarnett
    Mannequin

    mrabarnett mannequin commented Sep 25, 2008

    I have to report that the fix appears to be successful:

    >>> print [m.groups() for m in re.finditer(r'(^z*)|(\w+)', 'abc')]
    [('', None), (None, 'abc')]
    >>> print re.findall(r"(^z*)|(\w+)", "abc")
    [('', ''), ('', 'abc')]
    >>> print [m.groups() for m in re.finditer(r"(^z*)|(q*)|(\w+)", "abc")]
    [('', None, None), (None, None, 'abc'), (None, '', None)]
    >>> print re.findall(r"(^z*)|(q*)|(\w+)", "abc")
    [('', '', ''), ('', '', 'abc'), ('', '', '')]

    The patch is regex_2.6rc2+7.diff.

    @timehorse
    Mannequin

    timehorse mannequin commented Sep 25, 2008

    Matthew, I'll try to merge all your diffs with the current repository
    over the weekend. Having done the first, I know where code differs
    between your implementation, mine and the base, so I can apply your
    patch, and then a patch that restores my changes so the rest of the
    merges should be easy! :)

    @denversc
    Mannequin

    denversc mannequin commented Apr 3, 2011

    I just re-tested this issue in trunk at changeset 053bc5ca199b and the issue is still exactly reproducible as originally reported. That is, the match to the empty string skips a character of the match:

    >>> import re
    >>> [m.groups() for m in re.finditer(r'(^z*)|(\w+)', 'abc')]
    [('', None), (None, 'bc')]

    @isoschiz
    Mannequin

    isoschiz mannequin commented Apr 19, 2013

    This is still an issue today:

    >>> import re
    >>> [m.groups() for m in re.finditer(r'(^z*)|(\w+)', 'abc')]
    [('', None), (None, 'bc')]

    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Jun 30, 2014

    How does "the Regexp 2.7 engine in bpo-2636" from msg73742 deal with this situation?

    @serhiy-storchaka serhiy-storchaka added the 3.7 (EOL) end of life label Nov 18, 2017
    @serhiy-storchaka serhiy-storchaka added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Nov 18, 2017
    @serhiy-storchaka
    Member

    New changeset 70d56fb by Serhiy Storchaka in branch 'master':
    bpo-25054, bpo-1647489: Added support of splitting on zerowidth patterns. (#4471)
    70d56fb
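As a rough illustration of what the changeset title refers to, the two calls below are the examples given in the `re` documentation for Python 3.7 and later. The results in the comments assume such an interpreter; the variable names are illustrative.

```python
import re

# Splitting on a pattern that can only match the empty string.
parts = re.split(r'\b', 'Words, words, words.')
print(parts)   # on Python 3.7+: ['', 'Words', ', ', 'words', ', ', 'words', '.']

# An empty match is allowed directly after a non-empty one.
dashed = re.sub('x*', '-', 'abxd')
print(dashed)  # on Python 3.7+: '-a-b--d-'
```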

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022