Author akuchling
Date 2006-10-26.19:53:10
Content

I haven't dug very far into the code, but suspect this isn't
a bug in the regex code.

The pattern uses lots of .*? subpatterns, and this often
means the pattern takes a long time to fail if it isn't
going to match.  The regex engine matches the <link> group,
and then there's a .*?, followed by <b>.  The engine looks
at every character and, whenever it sees a <b>, tries another
.*? from that point.  This is O(n**2), where n is the number of
characters in the string being searched, and that string is
93,000 characters
long.  If you limit the string to 5K or so, the match fails
pretty quickly.
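
A rough way to see this for yourself is sketched below.  The pattern and
the synthetic page are made up for illustration (the pattern from the
report isn't reproduced here); the only assumption is that a named link
group is followed by .*?, a <b>, and another .*? that can never succeed.

```python
import re
import time

# Hypothetical pattern modelled on the description above, not the pattern
# from the original report: a named "link" group, then .*?, then <b>,
# then another .*? that needs a closing </b>.
PATTERN = re.compile(r'(?P<link><a href="[^"]*">).*?<b>.*?</b>', re.DOTALL)

PREFIX = '<a href="http://example.com/">'


def make_page(n_tags, closed=False):
    """Build a synthetic page: one link followed by n_tags '<b>x ' chunks.

    With closed=False no '</b>' ever appears, so the pattern cannot match
    and the engine has to exhaust every alternative before giving up.
    """
    tail = "</b>" if closed else ""
    return PREFIX + "<b>x " * n_tags + tail


def time_search(n_tags):
    """Return (text length, seconds taken, match result) for a failing search."""
    text = make_page(n_tags)
    start = time.perf_counter()
    result = PATTERN.search(text)
    return len(text), time.perf_counter() - start, result


if __name__ == "__main__":
    # Doubling the input should roughly quadruple the time to fail.
    for n_tags in (250, 500, 1000, 2000):
        length, elapsed, result = time_search(n_tags)
        print(f"{length:6d} chars  {elapsed:.4f}s  match={result}")
```

The exact timings depend on the machine and the Python version, so none
are quoted here; the thing to watch is how the time grows as the input
doubles.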

I strongly suggest working with the parsed HTML rather than
a regex.  You could run the HTML through tidy to convert it to
XHTML and use ElementTree on the resulting XML.
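
As a minimal sketch of the ElementTree side, assuming tidy has already
produced well-formed XHTML: the sample document, the choice of <a> and
<b> inside each <p>, and the function name are all hypothetical, since
the real page layout isn't shown here.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace, so lookups need a prefix mapping.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical stand-in for tidy's output; the real page will differ.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example</title></head>
  <body>
    <p><a href="http://example.com/one">First</a> <b>bold one</b></p>
    <p><a href="http://example.com/two">Second</a> <b>bold two</b></p>
  </body>
</html>"""


def extract_links_and_bold(xhtml_text):
    """Return (href, bold text) pairs for each <p> that has both an <a> and a <b>."""
    root = ET.fromstring(xhtml_text)
    results = []
    for para in root.iterfind(".//x:p", XHTML_NS):
        link = para.find("x:a", XHTML_NS)
        bold = para.find("x:b", XHTML_NS)
        if link is not None and bold is not None:
            results.append((link.get("href"), bold.text))
    return results


if __name__ == "__main__":
    for href, text in extract_links_and_bold(SAMPLE):
        print(href, text)
```

Walking the tree visits each element once, so a page with no matching
elements is rejected in time proportional to its size.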
History
Date User Action Args
2007-08-23 14:43:06 admin link issue1566086 messages
2007-08-23 14:43:06 admin create