classification
Title: Pathological regex behaviour
Type: resource usage Stage: resolved
Components: Regular Expressions Versions:
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, jpakkane, mrabarnett, serhiy.storchaka, tim.peters
Priority: normal Keywords:

Created on 2017-04-23 19:26 by jpakkane, last changed 2017-11-16 14:56 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
retest.py jpakkane, 2017-04-23 19:26
Messages (4)
msg292181 - (view) Author: Jussi Pakkanen (jpakkane) Date: 2017-04-23 19:26
Attached is a script that runs a single regex against one line of text taking over 12 seconds.

If you run the exact same regex in Perl it finishes immediately.

The slowness has something to do with spaces. If you replace consecutive spaces in the input with one, the evaluation is immediate.

This bug was originally discovered here: https://bugzilla.gnome.org/show_bug.cgi?id=781569
msg292183 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2017-04-23 20:04
If 'ignores' is '', you get this:

    (?:\b(?:extern|G_INLINE_FUNC|%s)\s*)

which can match an empty string, and it's tried repeatedly.

That's inadvisable.

There's also:

    (?:\s+|\*)+

which can match whitespace in multiple ways.

That's inadvisable too.

If the pattern really doesn't match the string (and it doesn't!), then it won't find out until it has tried _all_ of the possibilities.

Some implementations, such as Perl's, have extra checks to try to reduce the problem.
msg292237 - (view) Author: Jussi Pakkanen (jpakkane) Date: 2017-04-24 19:55
This is slow even when ignores is set to a non-empty value. It's not as slow but the real slowdown is in the whitespace regex. Here is a minimal sample:

input = '                              abc'
re.search(r'(\s+)+d', input)
msg292238 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2017-04-24 20:33
Yes, that example takes time exponential in the number of blanks to (fail to) match - each time you add a blank to `input`, it essentially doubles the time required.

It's _possible_ for an implementation to deduce that `(\s+)+` is an insanely inefficient way to spell `\s+`, like it's _possible_ for an implementation to deduce that 10**10**10 - 10**10**10 is an insanely inefficient way to spell 0.

Python's does not.  To understand what's going on, Friedl's book "Mastering Regular Expressions" is an excellent source.
History
Date User Action Args
2017-11-16 14:56:22serhiy.storchakasetstatus: open -> closed
nosy: + serhiy.storchaka

resolution: wont fix
stage: resolved
2017-04-24 20:33:56tim.peterssetnosy: + tim.peters
messages: + msg292238
2017-04-24 19:55:25jpakkanesetmessages: + msg292237
2017-04-23 20:04:05mrabarnettsetmessages: + msg292183
2017-04-23 19:26:02jpakkanecreate