classification
Title: the re module can perform poorly: O(2**n) versus O(n**2)
Type: enhancement Stage:
Components: Interpreter Core Versions: Python 3.7
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: aaronsw, gregory.p.smith, josiahcarlson, mark.dickinson, mrabarnett, pitrou, rsc, schmir, terry.reedy, timehorse, witten, yarkot
Priority: normal Keywords:

Created on 2007-02-17 23:39 by gregory.p.smith, last changed 2017-04-24 23:40 by gregory.p.smith. This issue is now closed.

Messages (16)
msg55016 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2007-02-17 23:39
in short, the re module can degenerate to really really horrid performance.  See this for how and why:

 http://swtch.com/~rsc/regexp/regexp1.html

exponential decline instead of squared.

I don't have a patch so i'm filing this bug as a starting point for future work.  The Modules/_sre.c files implementation could be updated to use the parallel stepping Thompson approach instead of recursive backtracking.

filing this as a bug until me or someone else comes up with a patch.
msg55017 - (view) Author: Josiah Carlson (josiahcarlson) * Date: 2007-02-22 08:51
I would file this under "feature request"; the current situation isn't so much buggy, as slow.  While you can produce a segfault with the current regular expression engine (due to stack overflow), you can do the same thing with regular Python on Linux (with sys.setrecursionlimit), ctypes, etc., and none of those are considered as buggy.

My only concern with such a change is that it may or may not change the semantics of the repeat operators '*' and '+', which are currently defined as "greedy".  If I skimmed the article correctly late at night, switching to a Thompson family regular expression engine may result in those operators no longer being greedy.  Please correct me if I am wrong.
msg55018 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2007-02-22 22:30
yeah this is better as a feature request.  certianly low priority either way.

-nothing- I propose doing would change the syntax or behaviour of existing regular expressions at all.  Doing so would be a disaster.  thompson nfa does not imply changing the behaviour.

anyways its a lot more than a simple "patch" to change the re module to not use backtracking so i expect this to languish unless someone has a of free time and motivation all at once. :)
msg56256 - (view) Author: Aaron Swartz (aaronsw) Date: 2007-10-07 13:55
Just a note for those who think this is a purely theoretical issue:

We've been using the python-markdown module on our web app for a while,
only to notice the app has been repeatedly going down. After tracking
down the culprit, we found that a speech from Hamlet passed to one of
the Markdown regular expressions caused this exponential behavior,
freezing up the app.
msg59627 - (view) Author: Ralf Schmitt (schmir) Date: 2008-01-09 21:26
here are two other bug reports about the same issue:

http://bugs.python.org/issue1448325

http://bugs.python.org/issue1566086
msg69491 - (view) Author: Yarko Tymciurak (yarkot) Date: 2008-07-09 23:05
Not sure if this is a real-world case of this in particular, but possibly:
http://groups.google.com/group/web2py/browse_thread/thread/59ff2e31698bced6/9bbae2d482d11b88
msg81479 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2009-02-09 19:45
This has been addressed in issue #2636.
msg81484 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-02-09 20:23
> This has been addressed in issue #2636.

Are you sure about this?  Does the proposed new regex engine use Thompson 
NFAs, or some variant thereof?
msg81516 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2009-02-09 23:40
The new code includes some extra checks which, although not foolproof,
certainly reduce the amount of backtracking in a lot of cases.
msg94319 - (view) Author: Dan Helfman (witten) Date: 2009-10-21 19:28
Here's an easy way to trigger the poor performance. Tested with 2.5,
2.6, and 2.7:

re.compile( '(\s+.*)*x' ).search( 'a ' * 30 )
msg94322 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2009-10-21 20:24
I'm still tinkering with my regex engine (issue #2636).

Some timings:

re.compile(r'(\s+.*)*x').search('a ' * 25)
20.23secs

regex.compile(r'(\s+.*)*x').search('a ' * 25)
0.10secs
msg129423 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-02-25 20:42
Another example from #11307
import re
r = re.compile(r'(\w+)*=.*')
r.match("abcdefghijklmnopqrstuvwxyz")
msg176467 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2012-11-27 08:47
Given the number of times this comes up, I think it's a least worth an upgrade from 'low' priority to 'normal' priority.
msg189352 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-05-16 10:49
Note this can be used for denials of service: see http://bugs.python.org/issue17980
msg189405 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2013-05-16 20:27
The recommendation for anyone using regular expressions on hostile input is to (a) don't do that. (b) use a better regexp without this possible behavior and (c) use something like re2 (there's a Python binding at https://github.com/axiak/pyre2) which is a regular expression engine that this cannot happen to.

fixing this within python requires a complete rewrite and replacement of the re module with one that uses a different approach.  see the other work on the MRAB regex module and discussion surrounding that.  that is a non trivial task and it is fixing other more important things (unicode correctness!) than this...

Given that, I don't actually expect this issue to ever be fixed.

IMNSHO: People shouldn't abuse regexes and get themselves into this situation in the first place. ;)

discussion should really happen on python-ideas.
msg292245 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2017-04-24 23:40
Note that https://pypi.python.org/pypi/re2 exists today as well and offers a re module compatible interface.  I haven't tried it.
History
Date User Action Args
2017-04-24 23:40:02gregory.p.smithsetmessages: + msg292245
2017-04-24 23:39:26gregory.p.smithsetversions: + Python 3.7, - Python 3.4
2013-05-16 20:27:53gregory.p.smithsetstatus: open -> closed
resolution: wont fix
messages: + msg189405
2013-05-16 10:49:32pitrousetnosy: + pitrou
messages: + msg189352
2012-11-27 08:47:23mark.dickinsonsetpriority: low -> normal

messages: + msg176467
versions: + Python 3.4, - Python 3.3
2012-11-27 08:45:44mark.dickinsonlinkissue16563 superseder
2012-11-07 17:43:13mark.dickinsonlinkissue13723 superseder
2012-11-07 17:35:54mark.dickinsonlinkissue16430 superseder
2011-02-25 20:42:44terry.reedylinkissue11307 superseder
2011-02-25 20:42:32terry.reedysetnosy: + terry.reedy

messages: + msg129423
versions: + Python 3.3, - Python 2.7
2010-07-29 13:28:00georg.brandllinkissue9414 superseder
2009-10-21 20:24:25mrabarnettsetmessages: + msg94322
2009-10-21 19:28:42wittensetnosy: + witten
messages: + msg94319
2009-02-09 23:40:58mrabarnettsetmessages: + msg81516
2009-02-09 20:23:05mark.dickinsonsetnosy: + mark.dickinson
messages: + msg81484
2009-02-09 19:45:07mrabarnettsetnosy: + mrabarnett
messages: + msg81479
2008-09-27 14:42:27timehorsesetnosy: + timehorse
versions: + Python 2.7
2008-07-09 23:05:29yarkotsetnosy: + yarkot
messages: + msg69491
2008-04-24 21:05:17rscsetnosy: + rsc
2008-01-09 22:37:50georg.brandllinkissue1566086 superseder
2008-01-09 21:26:35schmirsetnosy: + schmir
messages: + msg59627
2007-10-07 13:55:51aaronswsetnosy: + aaronsw
messages: + msg56256
2007-02-17 23:39:24gregory.p.smithcreate