Title: re.finditer hangs on final empty match
Components: Regular Expressions Versions: Python 2.3
Status: closed Resolution: fixed
Assigned To: niemeyer Nosy List: effbot, kevinbutler, niemeyer
Created on 2003-10-03 15:01 by kevinbutler, last changed 2022-04-10 16:11 by admin. This issue is now closed.

sre.patch niemeyer, 2004-09-03 18:13 Applied patch.
Messages (4)
msg18533 - (view) Author: Kevin J. Butler (kevinbutler) Date: 2003-10-03 15:01
The iterator returned by re.finditer appears to not
terminate if the 
final match is empty, but rather keeps returning the
final (empty) match.

Is this a bug in _sre?  If so, I'll be happy to file
it, though fixing 
it is a bit beyond my _sre experience level at this
point.  The solution 
would appear to be either to add a check for a duplicate
match, or to increment the position by one after
returning an 
empty match (which should be OK, because if a non-empty
match started at 
that location, we would have returned it instead of the
empty match).
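
The second option, stepping forward one position after an empty match, can be sketched in pure Python. This is only an illustration written against the current re API, not the _sre C code; the helper name iter_matches is hypothetical.

```python
import re


def iter_matches(pattern, string):
    """Yield successive matches, stepping past each empty match.

    Hypothetical helper for illustration; not part of the re module.
    """
    compiled = re.compile(pattern)
    pos = 0
    while pos <= len(string):
        m = compiled.search(string, pos)
        if m is None:
            break
        yield m
        if m.end() == m.start():
            # Empty match: move one position forward so the next search
            # cannot return the same empty match again.
            pos = m.end() + 1
        else:
            pos = m.end()


spans = [m.span() for m in iter_matches(".*", "asdf")]
```

For the pattern and string in this report, the loop stops after the spans (0, 4) and (4, 4), which lines up with what findall returns. Edge cases of the real scanner (for example alternations that can match empty) are not modelled here.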

Code to illustrate the failure:

from re import finditer

last = None
for m in finditer( ".*", "asdf" ):
    if last == m.span():
        print "duplicate match:", last
    print m.group(), m.span()
    last = m.span()
asdf (0, 4)
 (4, 4)
duplicate match: (4, 4)

findall works:

import re
print re.findall( ".*", "asdf" )
['asdf', '']

A workaround is to explicitly check for a duplicate span,
as I did above,
or to check for a duplicate end(), which also suppresses the
final empty match.
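
Both workarounds can be wrapped in a small generator. The sketch below uses current Python syntax; finditer_checked and its by_end flag are hypothetical names, and on an interpreter that already has the fix the span check never triggers.

```python
import re


def finditer_checked(pattern, string, by_end=False):
    """Wrap re.finditer and stop at the first repeated span (or end).

    Hypothetical helper for illustration; not part of the re module.
    """
    last = None
    for m in re.finditer(pattern, string):
        key = m.end() if by_end else m.span()
        if key == last:
            break  # same position as the previous match: stop iterating
        last = key
        yield m


by_span = [m.group() for m in finditer_checked(".*", "asdf")]
by_end = [m.group() for m in finditer_checked(".*", "asdf", by_end=True)]
```

Checking the span keeps the final empty match once, while checking end() drops it. Because the wrapper stops at the first repeat, the end() variant would also cut iteration short for patterns that produce an empty match in the middle of the string, so it only suits cases like the one in this report.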

Seo Sanghyeon sent the following fix to python-dev list:

The attached one-line patch fixes the re.finditer bug reported by
Kevin J. Butler. I read the CVS log to find out why this
code was
introduced, and it seems to be related to SF bug #581080.

But that bug didn't appear after my patch, so I wonder
why it was introduced in the first place. It seems beyond
my understanding. Please enlighten me.

To test:

import re
list(re.finditer('\s', 'a b'))
# expected: one item list
# bug: hang

#Kevin J. Butler
import re
list(re.finditer('.*', 'asdf'))
# expected: two item list (?)
# bug: hang

Seo Sanghyeon
Index: Modules/_sre.c
RCS file: /cvsroot/python/python/dist/src/Modules/_sre.c,v
retrieving revision 2.99
diff -c -r2.99 _sre.c
*** Modules/_sre.c	26 Jun 2003 14:41:08 -0000	2.99
--- Modules/_sre.c	2 Oct 2003 03:48:55 -0000
*** 3062,3069 ****
      match = pattern_new_match((PatternObject*)
                                 state, status);
!     if ((status == 0 || state->ptr == state->start) &&
!         state->ptr < state->end)
          state->start = (void*) ((char*) state->ptr +
          state->start = state->ptr;
--- 3062,3068 ----
      match = pattern_new_match((PatternObject*)
                                 state, status);
!     if (status == 0 || state->ptr == state->start)
          state->start = (void*) ((char*) state->ptr +
          state->start = state->ptr;
msg18534 - (view) Author: Kevin J. Butler (kevinbutler) Date: 2003-10-03 18:16
The above patch does resolve the problem.

The code was introduced in rev 2.85
to resolve bug 581080
but removing this line does not re-introduce that bug.

Thanks, and kudos to Seo...
msg18535 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2004-09-03 12:04
Still there in 2.4a3, as the following revised example shows:

import re

m = re.finditer(".*", "asdf")

print # this should raise an exception

Gustavo, can you look at this patch too?
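
Read as a sequence of explicit next calls on the iterator (an assumption, since the calls themselves are not shown above), the expected behaviour can be sketched in current Python syntax:

```python
import re

it = re.finditer(".*", "asdf")

first = next(it)   # non-empty match covering "asdf"
second = next(it)  # empty match at the end of the string

try:
    next(it)       # a third call should find nothing left
    exhausted = False
except StopIteration:
    exhausted = True
```

On an interpreter with the fix, exhausted ends up True; with the bug described here, the third call would instead return the same empty match again.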
msg18536 - (view) Author: Gustavo Niemeyer (niemeyer) * (Python committer) Date: 2004-09-03 18:13
Patch applied and test cases added to check this bug and also for 
Kevin and Seo, thanks for the bug report and the fix. 
Fredrik, thanks for pointing me to the issue. 
Applied as: 
Lib/test/ 1.52 
Modules/_sre.c: 2.108 
Patch attached for reference. 
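
For readers who want to check the behaviour themselves, here is a minimal sketch of regression checks built from the two cases quoted in this report. These are not the tests committed with the fix; the class and method names are hypothetical, and the code uses current Python syntax.

```python
import re
import unittest


class FinditerTerminationTest(unittest.TestCase):
    """Hypothetical regression checks; not the tests committed with the fix."""

    def test_single_whitespace(self):
        # list() would hang here if the iterator never terminated.
        matches = list(re.finditer(r"\s", "a b"))
        self.assertEqual([m.span() for m in matches], [(1, 2)])

    def test_trailing_empty_match(self):
        matches = list(re.finditer(".*", "asdf"))
        self.assertEqual([m.group() for m in matches], ["asdf", ""])

    def test_agrees_with_findall(self):
        found = [m.group() for m in re.finditer(".*", "asdf")]
        self.assertEqual(found, re.findall(".*", "asdf"))


# Run the checks without calling sys.exit(), so this works in any context.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(FinditerTerminationTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```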
