This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Title: RE (regular expression) matching stuck in loop
Type: Stage:
Components: Regular Expressions Versions: Python 2.4
Status: closed Resolution: duplicate
Dependencies: Superseder: the re module can perform poorly: O(2**n) versus O(n**2)
View: 1662581
Assigned To: niemeyer Nosy List: akuchling, fabinator, georg.brandl, mrabarnett, niemeyer, schmir
Priority: normal Keywords:

Created on 2006-09-27 01:23 by fabinator, last changed 2022-04-11 14:56 by admin. This issue is now closed.

File name Uploaded Description Edit akuchling, 2006-10-26 19:38 Test script akuchling, 2006-10-26 19:58 Correct version -- remove some debugging hacks
Messages (6)
msg30013 - (view) Author: Fabien Devaux (fabinator) Date: 2006-09-27 01:23
See the code:
the "finditer()" call don't seems to return.
Playing with the re can bypass the problem but it looks
like a bug.
(I'm really sorry if I did something wrong and didn't

note: I can reproduce it with python2.5
msg30014 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2006-10-26 19:38
Logged In: YES 

Attaching the test script.  I've modified it to save a copy
of the HTML page's data so that running the example doesn't
require accessing a slow web site repeatedly.
msg30015 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2006-10-26 19:53
Logged In: YES 

I haven't dug very far into the code, but suspect this isn't
a bug in the regex code.

The pattern uses lots of .*? subpatterns, and this often
means the pattern takes a long time to fail if it isn't
going to match.  The regex engine matches the <link> group,
and then there's a .*?, followed by <b>.  The engine looks
at every character and if it sees a <b>, tries another .*?.
 This is O(n**2) where n is the number of character in the
string being searched, and that string is 93,000 characters
long.  If you limit the string to 5K or so, the match fails
pretty quickly.

I strongly suggest working with the HTML.  You could run the
HTML through tidy to convert to XHTML and use ElementTree on
the resulting XML.
msg59628 - (view) Author: Ralf Schmitt (schmir) Date: 2008-01-09 21:37
With python 2.6 the program can be interrupted with ctrl-c (see I think this one should be closed
as a duplicate of:
msg59632 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-01-09 22:37
msg81478 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2009-02-09 19:40
This problem has been addressed in issue #2636.

Extra checks have been added to reduce the amount of backtracking.
Date User Action Args
2022-04-11 14:56:20adminsetgithub: 44036
2009-02-09 19:40:28mrabarnettsetnosy: + mrabarnett
messages: + msg81478
2008-01-09 22:37:50georg.brandlsetstatus: open -> closed
resolution: duplicate
superseder: the re module can perform poorly: O(2**n) versus O(n**2)
messages: + msg59632
nosy: + georg.brandl
2008-01-09 21:37:12schmirsetnosy: + schmir
messages: + msg59628
2006-09-27 01:23:22fabinatorcreate