classification
Title: Regular expression match does not return
Type:
Stage:
Components: Regular Expressions
Versions: Python 2.7
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: crouleau, ezio.melotti, mrabarnett, serhiy.storchaka, tim.peters
Priority: normal
Keywords:

Created on 2012-07-31 18:08 by crouleau, last changed 2012-07-31 21:16 by tim.peters. This issue is now closed.

Files
File name Uploaded Description
RegexBug.py crouleau, 2012-07-31 18:08
Messages (7)
msg167024 - Author: Caleb Rouleau (crouleau) Date: 2012-07-31 18:08
Version info: 2.7.1 (r271:86832, Feb  7 2011, 11:33:02) [MSC v.1500 64 bit (AMD64)]

The program included never prints "done" because it never returns from re.match(). 

 -- Caleb Rouleau
msg167028 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2012-07-31 18:59
That's because it uses a pathological regular expression (catastrophic backtracking).

The problem lies here: (\\?[\w\.\-]+)+
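The growth can be observed on short inputs. The following is a minimal sketch, not the attached RegexBug.py: the test strings are invented, the trailing ":" is added here only to force a failure, and the code assumes Python 3 (for time.perf_counter). It times a failing match of the fragment above against runs of "a" followed by a space.

```python
import re
import time

# The fragment singled out above, followed by a literal ":" (an addition
# made for this sketch) so that a trailing space forces the match to fail.
PATHOLOGICAL = re.compile(r"(\\?[\w\.\-]+)+:")


def time_failed_match(n):
    """Return (match result, seconds) for n word characters plus a space."""
    subject = "a" * n + " "  # hypothetical input; no ":" ever appears
    start = time.perf_counter()
    result = PATHOLOGICAL.match(subject)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    for n in (10, 14, 18, 20):
        result, seconds = time_failed_match(n)
        print("n=%2d  matched=%s  %.4f s" % (n, result is not None, seconds))
```

Every call returns None; only the elapsed time changes, and it should grow sharply with each few extra characters. Keep n small when experimenting.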
msg167031 - Author: Tim Peters (tim.peters) * (Python committer) Date: 2012-07-31 19:14
Matthew is right:  the nested quantifiers can cause this to take a very long time when the regexp doesn't match.  Note that the example cannot match, because nothing in the regexp can match the space before "warning" in the example string.  But the nested quantifiers cause it to _try_ an enormous number of futile attempts.

Under Python 2.7.1, it eventually does return, but it took over 15 minutes when I tried it on my laptop.

Friedl's book "Mastering Regular Expressions" is a book-length treatment of how to write regexps that don't "take forever" when they fail to match, and that's highly recommended.  Or start a discussion on comp.lang.python, and I'm sure someone will help you flesh out exactly what it is you do and don't want to match, and how to write a regexp that performs well on both matching and non-matching text (the bug tracker isn't an appropriate place for this).
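To put a rough number on the "enormous number of futile attempts", here is a simplified model, not a trace of what the re engine actually does. It ignores the optional backslash and counts only the ways a run of n characters can be cut into one or more non-empty pieces, which is how the inner `+` and the outer `+` can share the same text between them. The function name count_splits is made up for this sketch.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n):
    """Count the ways to cut n characters into one or more non-empty runs."""
    if n == 0:
        return 1  # nothing left to split: exactly one way (stop here)
    # Choose the length of the first run, then split whatever remains.
    return sum(count_splits(n - first) for first in range(1, n + 1))


if __name__ == "__main__":
    for n in (5, 10, 20, 40):
        # For n >= 1 the count equals 2 ** (n - 1).
        print(n, count_splits(n), 2 ** (n - 1))
```

Under this model each extra character doubles the number of candidate splits, so a backtracking matcher that has to rule them all out before reporting failure can take minutes on a string of only a few dozen characters.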
msg167035 - Author: Caleb Rouleau (crouleau) Date: 2012-07-31 19:44
Thanks for the help. Apologies for the poor understanding of regular expressions. Closing this issue.
msg167038 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-07-31 19:48
Distinguish between a very large number and infinity. You have a bad regexp: the matching time depends exponentially on the length of the string. Try it with short strings. Use the regexp r"(\w:)(\\?[\w\.\-]+)((\\[\w\.\-]+)*)(\.[\w ]+): " instead.

It's not a bug.
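A quick way to try the suggested pattern is sketched below. Both input strings are invented for illustration, since the string from RegexBug.py is not reproduced in this thread. The likely reason the rewrite behaves better is that every repetition of the inner group must begin with a literal backslash, so there is only one way to divide a path into segments.

```python
import re

# The pattern proposed above, copied verbatim.
REWRITTEN = re.compile(r"(\w:)(\\?[\w\.\-]+)((\\[\w\.\-]+)*)(\.[\w ]+): ")

# Hypothetical inputs, invented for this sketch.
matching = r"C:\src\project\main.c: warning"
non_matching = "C:\\" + "a" * 5000 + " warning"  # no "." and no ": "

m = REWRITTEN.match(matching)
print(m.groups() if m else None)

# Fails, but promptly, even though the subject is thousands of characters long.
print(REWRITTEN.match(non_matching))
```

Whether the rewritten pattern accepts exactly the strings the original was meant to accept is a separate question that depends on the intended input format.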
msg167042 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2012-07-31 19:58
It's probably inappropriate for me to mention that the alternative 'regex' module on PyPI completes promptly, so I won't. :-)
msg167054 - Author: Tim Peters (tim.peters) * (Python committer) Date: 2012-07-31 21:16
Matthew, yes, PyPI's regex module implements regular expressions in the "computer science" (as opposed to POSIX) sense.  See Friedl's book for a full explanation.  The short version is that regex's flavor of regexp matching is linear-time, but cannot support "advanced" features like backreferences.
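For readers unfamiliar with the term, a backreference requires a later part of the pattern to repeat whatever an earlier group captured. The sketch below uses the standard-library re module with invented strings; it only illustrates the feature and says nothing about how the third-party regex module is implemented.

```python
import re

# \1 must match exactly the text that group 1 captured.
DOUBLED = re.compile(r"(\w+) \1$")

print(bool(DOUBLED.match("warning warning")))  # same word twice
print(bool(DOUBLED.match("warning warming")))  # second word differs
```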
History
Date User Action Args
2012-07-31 21:16:17 tim.peters set messages: + msg167054
2012-07-31 19:58:12 mrabarnett set messages: + msg167042
2012-07-31 19:48:36 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg167038
2012-07-31 19:44:56 crouleau set status: open -> closed; messages: + msg167035
2012-07-31 19:14:55 tim.peters set resolution: not a bug; messages: + msg167031; nosy: + tim.peters
2012-07-31 18:59:38 mrabarnett set messages: + msg167028
2012-07-31 18:08:57 crouleau create