classification
Title: RE (regular expression) matching stuck in loop
Type: Stage:
Components: Regular Expressions Versions: Python 2.4
process
Status: closed Resolution: duplicate
Dependencies: Superseder: the re module can perform poorly: O(2**n) versus O(n**2)
View: 1662581
Assigned To: niemeyer Nosy List: akuchling, fabinator, georg.brandl, mrabarnett, niemeyer, schmir
Priority: normal Keywords:

Created on 2006-09-27 01:23 by fabinator, last changed 2009-02-09 19:40 by mrabarnett. This issue is now closed.

Files
File name Uploaded Description Edit
re-loop.py akuchling, 2006-10-26 19:38 Test script
re-loop.py akuchling, 2006-10-26 19:58 Correct version -- remove some debugging hacks
Messages (6)
msg30013 - (view) Author: Fabien Devaux (fabinator) Date: 2006-09-27 01:23
See the code:
http://pastebin.ca/183613
the "finditer()" call don't seems to return.
Playing with the re can bypass the problem but it looks
like a bug.
(I'm really sorry if I did something wrong and didn't
notice)

note: I can reproduce it with python2.5
msg30014 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2006-10-26 19:38
Logged In: YES 
user_id=11375

Attaching the test script.  I've modified it to save a copy
of the HTML page's data so that running the example doesn't
require accessing a slow web site repeatedly.
msg30015 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2006-10-26 19:53
Logged In: YES 
user_id=11375

I haven't dug very far into the code, but suspect this isn't
a bug in the regex code.

The pattern uses lots of .*? subpatterns, and this often
means the pattern takes a long time to fail if it isn't
going to match.  The regex engine matches the <link> group,
and then there's a .*?, followed by <b>.  The engine looks
at every character and if it sees a <b>, tries another .*?.
 This is O(n**2) where n is the number of character in the
string being searched, and that string is 93,000 characters
long.  If you limit the string to 5K or so, the match fails
pretty quickly.

I strongly suggest working with the HTML.  You could run the
HTML through tidy to convert to XHTML and use ElementTree on
the resulting XML.
msg59628 - (view) Author: Ralf Schmitt (schmir) Date: 2008-01-09 21:37
With python 2.6 the program can be interrupted with ctrl-c (see
http://bugs.python.org/issue846388). I think this one should be closed
as a duplicate of: http://bugs.python.org/issue1662581
msg59632 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-01-09 22:37
Done.
msg81478 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2009-02-09 19:40
This problem has been addressed in issue #2636.

Extra checks have been added to reduce the amount of backtracking.
History
Date User Action Args
2009-02-09 19:40:28mrabarnettsetnosy: + mrabarnett
messages: + msg81478
2008-01-09 22:37:50georg.brandlsetstatus: open -> closed
resolution: duplicate
superseder: the re module can perform poorly: O(2**n) versus O(n**2)
messages: + msg59632
nosy: + georg.brandl
2008-01-09 21:37:12schmirsetnosy: + schmir
messages: + msg59628
2006-09-27 01:23:22fabinatorcreate