classification
Title: re.search extreme slowness (looks like hang/livelock), searching for patterns containing .* in a large string
Type: performance Stage: resolved
Components: Regular Expressions Versions: Python 3.6, Python 2.7
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: benspiller, ezio.melotti, mrabarnett, serhiy.storchaka, tim.peters
Priority: normal Keywords:

Created on 2019-02-06 16:17 by benspiller, last changed 2019-02-18 14:47 by serhiy.storchaka. This issue is now closed.

Messages (4)
msg334949 - (view) Author: Ben Spiller (benspiller) * Date: 2019-02-06 16:17
These work fine and return instantly:
python -c "import re;  re.compile('.*x').match('y'*(1000*100))"
python -c "import re;  re.compile('x').search('y'*(1000*100))"
python -c "import re;  re.compile('.*x').search('y'*(1000*10))"

This hangs / freezes / livelocks indefinitely, with lots of CPU usage:
python -c "import re;  re.compile('.*x').search('y'*(1000*100))"

Admittedly performing a search() with a pattern starting .* isn't useful, however it's worth fixing as:
- it's easily done by inexperienced developers, or users interacting with code that's far removed from the actual regex call
- the failure mode of hanging forever (with the GIL held, of course) is quite severe (took us a lot of debugging with gdb before we figured out where our complex multi-threaded python program was hanging!), and 
- the fact that the behaviour is different based on the length of the string being matched suggests there is some kind of underlying bug in how the buffer is handled whcih might also affect other, more reasonable regex use cases
msg334953 - (view) Author: Ben Spiller (benspiller) * Date: 2019-02-06 16:48
Correction to original report - it doesn't hang indefinitely, it just takes a really long time. Specifically, looks like it's quadratic in the length of the input string. Increase the size of the input string to 1000*1000 and it's really really slow. 

I don't know for sure if it's possible to implement regexes in a way that avoids this pathological behaviour, but it's certainly quite risky that an otherwise working bit of code using a pattern containing .* can hang/livelock an application for an arbitrary  amount of time if passed a larger-than-expected (but actually not that big) input string.
msg334956 - (view) Author: Ben Spiller (benspiller) * Date: 2019-02-06 16:57
Running this command:
time python -c "import re;  re.compile('y.*x').search('y'*(N))"

It's clearly quadratic:
N=100,000 time=7s
N=200,000 time=18s
N=400,000 time=110s
N=1,000,000 time=690s

This illustrates how a simple program that's working correctly can quickly degrade to a very long period of unresponsiveness after some fairly modest increases in size of input string.
msg334958 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-02-06 17:15
Yes, it's quadratic time.  If the string being searched has N characters, first it fails to find "x" in all N of 'em, then `.*` advances by one and it fails to find "x" in the trailing N-1 characters, then again in the trailing N-2, and so on.  N + N-1 + N-2 + ... + 1 is quadratic in N.

That's how this kind of regexp engine works.  And it's mild, as such things go:  you can also create poor regexps that take time _exponential_ in N that fail to match certain strings.

It's unlikely this will change without replacing Python's regexp implementation entirely.  For why, see Jeffrey Friedl's book "Mastering Regular Expressions" (published by O'Reilly).  That book also spells out techniques for crafting regexps that don't suck ;-)  It's not a small topic, alas.
History
Date User Action Args
2019-02-18 14:47:57serhiy.storchakasetstatus: open -> closed
nosy: + serhiy.storchaka

resolution: wont fix
stage: resolved
2019-02-06 17:15:25tim.peterssetnosy: + tim.peters
messages: + msg334958
2019-02-06 16:57:17benspillersetmessages: + msg334956
2019-02-06 16:48:45benspillersettype: crash -> performance
messages: + msg334953
title: re.search livelock/hang, searching for patterns starting .* in a large string -> re.search extreme slowness (looks like hang/livelock), searching for patterns containing .* in a large string
2019-02-06 16:17:37benspillercreate