
the re module can perform poorly: O(2**n) versus O(n**2) #44592

Closed
gpshead opened this issue Feb 17, 2007 · 17 comments
Labels
3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement

Comments

@gpshead
Member

gpshead commented Feb 17, 2007

BPO 1662581
Nosy @terryjreedy, @gpshead, @josiahcarlson, @mdickinson, @pitrou, @yarko

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2013-05-16.20:27:53.879>
created_at = <Date 2007-02-17.23:39:24.000>
labels = ['interpreter-core', 'type-feature', '3.7']
title = 'the re module can perform poorly: O(2**n) versus O(n**2)'
updated_at = <Date 2021-04-12.01:41:22.501>
user = 'https://github.com/gpshead'

bugs.python.org fields:

activity = <Date 2021-04-12.01:41:22.501>
actor = 'gregory.p.smith'
assignee = 'none'
closed = True
closed_date = <Date 2013-05-16.20:27:53.879>
closer = 'gregory.p.smith'
components = ['Interpreter Core']
creation = <Date 2007-02-17.23:39:24.000>
creator = 'gregory.p.smith'
dependencies = []
files = []
hgrepos = []
issue_num = 1662581
keywords = []
message_count = 17.0
messages = ['55016', '55017', '55018', '56256', '59627', '69491', '81479', '81484', '81516', '94319', '94322', '129423', '176467', '189352', '189405', '292245', '390808']
nosy_count = 12.0
nosy_names = ['aaronsw', 'terry.reedy', 'gregory.p.smith', 'josiahcarlson', 'mark.dickinson', 'pitrou', 'rsc', 'timehorse', 'schmir', 'mrabarnett', 'yarkot', 'witten']
pr_nums = []
priority = 'normal'
resolution = 'wont fix'
stage = None
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue1662581'
versions = ['Python 3.7']

@gpshead
Member Author

gpshead commented Feb 17, 2007

In short, the re module can degenerate to really, really horrid performance. See this for how and why:

http://swtch.com/~rsc/regexp/regexp1.html

Exponential runtime instead of quadratic.

I don't have a patch, so I'm filing this bug as a starting point for future work. The Modules/_sre.c implementation could be updated to use the parallel-stepping Thompson approach instead of recursive backtracking.

Filing this as a bug until I or someone else comes up with a patch.
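
For anyone who does not want to read the whole linked article, here is a minimal sketch of the parallel-stepping idea in plain Python. It is illustrative only: the NFA is hand-built for the classic pathological pattern a?{n}a{n}, and it has nothing to do with the actual Modules/_sre.c internals.

# Parallel-stepping ("Thompson") simulation sketch, illustrative only.
# The matcher keeps a *set* of current states and advances all of them one
# character at a time, so matching costs O(len(text) * number_of_states)
# and never backtracks.

EPSILON = ""

def build_optional_a_nfa(n):
    # Hand-built NFA for the pathological pattern a?{n}a{n}; accepting state is 2*n.
    nfa = {}
    for i in range(n):                        # n optional 'a?' pieces
        nfa[i] = {"a": {i + 1}, EPSILON: {i + 1}}
    for i in range(n, 2 * n):                 # n mandatory 'a' pieces
        nfa[i] = {"a": {i + 1}, EPSILON: set()}
    nfa[2 * n] = {EPSILON: set()}
    return nfa, 2 * n

def epsilon_closure(nfa, states):
    # Expand a state set with everything reachable through epsilon moves.
    stack, closure = list(states), set(states)
    while stack:
        for t in nfa[stack.pop()][EPSILON]:
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def matches(nfa, accept, text):
    # Step every live state forward in lock-step for each input character.
    current = epsilon_closure(nfa, {0})
    for ch in text:
        nxt = set()
        for s in current:
            nxt |= nfa[s].get(ch, set())
        if not nxt:
            return False
        current = epsilon_closure(nfa, nxt)
    return accept in current

n = 25
nfa, accept = build_optional_a_nfa(n)
print(matches(nfa, accept, "a" * n))          # True, in time linear in n

A backtracking engine explores the exponentially many ways to split the input between the optional and mandatory pieces; the state-set simulation never revisits a position.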

@gpshead gpshead added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement labels Feb 17, 2007
@josiahcarlson
Mannequin

josiahcarlson mannequin commented Feb 22, 2007

I would file this under "feature request"; the current situation isn't so much buggy as slow. While you can produce a segfault with the current regular expression engine (due to stack overflow), you can do the same thing with regular Python on Linux (with sys.setrecursionlimit), ctypes, etc., and none of those are considered buggy.

My only concern with such a change is that it may or may not change the semantics of the repeat operators '*' and '+', which are currently defined as "greedy". If I skimmed the article correctly late at night, switching to a Thompson-family regular expression engine may result in those operators no longer being greedy. Please correct me if I am wrong.
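
For reference, here is a quick standard-library-only illustration of the greedy vs. non-greedy semantics in question. It only shows what the current re module does today; it makes no claim about how a Thompson-style engine would behave.

import re
# Greedy '*': consume as much as possible, then give back just enough to match.
print(re.match(r'<(.*)>', '<a><b>').group(1))   # prints 'a><b'
# Non-greedy '*?': consume as little as possible.
print(re.match(r'<(.*?)>', '<a><b>').group(1))  # prints 'a'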

@gpshead
Member Author

gpshead commented Feb 22, 2007

Yeah, this is better as a feature request. Certainly low priority either way.

Nothing I propose doing would change the syntax or behaviour of existing regular expressions at all. Doing so would be a disaster. A Thompson NFA does not imply changing the behaviour.

Anyway, it's a lot more than a simple "patch" to change the re module to not use backtracking, so I expect this to languish unless someone has a lot of free time and motivation all at once. :)

@aaronsw
Mannequin

aaronsw mannequin commented Oct 7, 2007

Just a note for those who think this is a purely theoretical issue:

We've been using the python-markdown module on our web app for a while, only to notice the app has been repeatedly going down. After tracking down the culprit, we found that a speech from Hamlet passed to one of the Markdown regular expressions caused this exponential behavior, freezing up the app.

@schmir
Mannequin

schmir mannequin commented Jan 9, 2008

here are two other bug reports about the same issue:

http://bugs.python.org/issue1448325

http://bugs.python.org/issue1566086

@yarko
Mannequin

yarko mannequin commented Jul 9, 2008

Not sure if this is a real-world case of this in particular, but possibly:
http://groups.google.com/group/web2py/browse_thread/thread/59ff2e31698bced6/9bbae2d482d11b88

@mrabarnett
Mannequin

mrabarnett mannequin commented Feb 9, 2009

This has been addressed in issue bpo-2636.

@mdickinson
Member

This has been addressed in issue bpo-2636.

Are you sure about this? Does the proposed new regex engine use Thompson
NFAs, or some variant thereof?

@mrabarnett
Mannequin

mrabarnett mannequin commented Feb 9, 2009

The new code includes some extra checks which, although not foolproof,
certainly reduce the amount of backtracking in a lot of cases.

@witten
Mannequin

witten mannequin commented Oct 21, 2009

Here's an easy way to trigger the poor performance. Tested with 2.5, 2.6, and 2.7:

re.compile(r'(\s+.*)*x').search('a ' * 30)
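
A small timing sketch (standard library only, using the pattern above) makes the blow-up visible; the absolute numbers depend on the machine, but the time climbs steeply with every extra repetition:

import re
import time

pattern = re.compile(r'(\s+.*)*x')     # pathological pattern from the comment above

for n in range(18, 24):
    text = 'a ' * n                    # can never match the trailing 'x'
    start = time.perf_counter()
    pattern.search(text)               # forces the engine to try every split
    print(n, round(time.perf_counter() - start, 2), "seconds")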

@mrabarnett
Mannequin

mrabarnett mannequin commented Oct 21, 2009

I'm still tinkering with my regex engine (issue bpo-2636).

Some timings:

re.compile(r'(\s+.*)*x').search('a ' * 25)
20.23secs

regex.compile(r'(\s+.*)*x').search('a ' * 25)
0.10secs

@terryjreedy
Member

Another example from bpo-11307
import re
r = re.compile(r'(\w+)*=.*')
r.match("abcdefghijklmnopqrstuvwxyz")
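
In the spirit of the "use a better regexp" advice later in this thread, the usual workaround is to avoid nesting quantifiers. A hedged sketch for this particular case (an equivalence for the strings matched here, not a general transformation):

import re
# (\w+)* and \w* match the same overall strings, but \w* has only one way to
# consume the input, so a failed match is linear instead of exponential.
safe = re.compile(r'(\w*)=.*')
print(safe.match("abcdefghijklmnopqrstuvwxyz"))   # None, returned immediately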

@mdickinson
Member

Given the number of times this comes up, I think it's at least worth an upgrade from 'low' priority to 'normal' priority.

@pitrou
Member

pitrou commented May 16, 2013

Note this can be used for denial of service: see http://bugs.python.org/issue17980

@gpshead
Member Author

gpshead commented May 16, 2013

The recommendation for anyone using regular expressions on hostile input is to (a) not do that, (b) use a better regexp without this possible behavior, or (c) use something like re2 (there's a Python binding at https://github.com/axiak/pyre2), a regular expression engine in which this cannot happen.

Fixing this within Python requires a complete rewrite and replacement of the re module with one that uses a different approach. See the other work on the MRAB regex module and the discussion surrounding it. That is a non-trivial task, and it is fixing other more important things (Unicode correctness!) than this...

Given that, I don't actually expect this issue to ever be fixed.

IMNSHO: people shouldn't abuse regexes and get themselves into this situation in the first place. ;)

Discussion should really happen on python-ideas.

@gpshead gpshead closed this as completed May 16, 2013
@gpshead gpshead added 3.7 (EOL) end of life labels Apr 24, 2017
@gpshead
Member Author

gpshead commented Apr 24, 2017

Note that https://pypi.python.org/pypi/re2 exists today as well and offers a re module compatible interface. I haven't tried it.

@gpshead
Member Author

gpshead commented Apr 12, 2021

https://pypi.org/project/pyre2/ seems to be a maintained version of that for use on modern Python interpreters.
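
An untested sketch of what using it could look like, assuming the package installs an re2 module that exposes the re-compatible interface described above (the module name and fallback behaviour here are assumptions, not verified):

try:
    import re2 as re     # assumed RE2-backed drop-in, per the comments above
except ImportError:
    import re            # fall back to the standard-library engine

# The pathological pattern from earlier in the thread returns promptly under
# RE2 because its automaton-based engine never backtracks.
print(re.compile(r'(\s+.*)*x').search('a ' * 30))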

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022