classification
Title: re.findall hangs python completely
Type: Stage:
Components: Regular Expressions Versions:
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: niemeyer Nosy List: abakker, georg.brandl, gregsmith, niemeyer, schmir
Priority: normal Keywords:

Created on 2007-06-14 12:05 by abakker, last changed 2008-03-21 20:48 by georg.brandl. This issue is now closed.

Files
File name Uploaded Description Edit
retest.py abakker, 2007-06-14 12:05 script to illustrate problem
page.html abakker, 2007-06-14 12:06 The HTML to be parsed
Messages (8)
msg32323 - (view) Author: Arno Bakker (abakker) Date: 2007-06-14 12:05
Running a re.findall() on 40 KB of HTML appears to hang python completely. It hogs the CPU (perhaps not unexpected) but other python threads do not continue and pressing Ctrl-C does not trigger a KeyboardInterrupt. Only a SIGQUIT (Ctrl-\) can kill it.

Attached is a small script to illustrate the problem, and the data file that causes it to hang. Using 40 KB of random data does let it get past the first findall. It creates a Thread that should printout hashes continuously, however, as soon as the MainThread hits the findall the printing stops.

Occurs on Python-2.4.4 (direct from www.python.org) and 2.5.1 (2.5.1-0ubuntu1 from Feisty)
msg32324 - (view) Author: Arno Bakker (abakker) Date: 2007-06-14 12:06
File Added: page.html
msg32325 - (view) Author: Gregory Smith (gregsmith) Date: 2007-06-18 17:23
First off, don't expect other threads to run during re execution. Multi-threading in python is mainly to allow one thread to run while the others are waiting for I/O or doing a time.sleep() or something specific like that. Switching between runnable threads only occurs in interpreter loop.
There may exceptions to allow switching during some really long core operations (a mutex needs to be released and taken again) but it has to be done under certain conditions so that threads won't mess each other's data up.

So, on to the r.e.: first, try changing all the .*? to just .*  -- the ? is redundant and may be increasing the runtime by expanding the number of permutations that are being tried.

But I think your real trouble is all of these :  img src=\"(.*?)\"
This allows the second " to match with anything at all between, including any number of quoted strings.
 Your combination of several of these may be causing the RE engine to spend a huge amount of time looking at many different combinations for the first few .*?, all of which fail by the time you get to the last one.

Try   img src=\"([^"]*)\"  instead; this will only match the pair of " with no " in between.

Likewise, in .*?> the .* will match any number of '>' chars if this is needed to make the whole thing match, which is probably not what you want.

You might get it to work just by turning off 'greedy' matching for '*'.








msg32326 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2007-06-19 12:44
This is quite normal for regular expressions with a lot of backtracking permutations to try, and a big string to search in.

You should try to optimize your REs -- wrt. the threads, re doesn't release the GIL while searching, that's another
bug report.
msg32327 - (view) Author: Arno Bakker (abakker) Date: 2007-06-19 14:03
Is that GIL & searching problem reported separately somewhere, otherwise I'm hereby submitting that bug ;o)
msg32328 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2007-06-19 15:29
It's a feature request, not a bug, but it is submitted, yes :)
msg63534 - (view) Author: Ralf Schmitt (schmir) Date: 2008-03-14 19:10
ctrl-c should now work with python 2.5.
msg64279 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-03-21 20:48
Closing as fixed, then.
History
Date User Action Args
2008-03-21 20:48:39georg.brandlsetstatus: open -> closed
resolution: fixed
messages: + msg64279
2008-03-14 19:10:56schmirsetmessages: + msg63534
2008-03-14 19:10:23schmirsetnosy: + schmir
2007-06-14 12:05:52abakkercreate