classification
Title: re.sub stalls forever on an unmatched non-greedy case
Type: performance Stage: resolved
Components: Regular Expressions Versions: Python 2.7
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: Robert Lujo, ezio.melotti, gdr@garethrees.org, mrabarnett, serhiy.storchaka
Priority: normal Keywords:

Created on 2017-04-04 07:00 by Robert Lujo, last changed 2017-04-04 20:49 by mrabarnett. This issue is now closed.

Messages (5)
msg291107 - (view) Author: Robert Lujo (Robert Lujo) Date: 2017-04-04 07:00
Hello, 

I assume I have hit some bug/misbehaviour in re module. I will provide you "working" example:

import re
RE_C_COMMENTS    = re.compile(r"/\*(.|\s)*?\*/", re.MULTILINE|re.DOTALL|re.UNICODE)
text = "Special section /* valves:\n\n\nsilicone\n\n\n\n\n\n\nHarness:\n\n\nmetal and plastic fibre\n\n\n\n\n\n\nInner frame:\n\n\nmultibutylene\n\n\n\n\n\n\nWeight:\n\n\n147 g\n\n\n\n\n\n\n\n\n\n\n\n\n\nSelection guide\n"

and then this command takes forever:
RE_C_COMMENTS.sub(" ", text, re.MULTILINE|re.DOTALL|re.UNICODE)

and the same problem you can notice on first 90 chars, it takes 10s on my machine:
RE_C_COMMENTS.sub(" ", text[:90], re.MULTILINE|re.DOTALL|re.UNICODE)

Some clarification: I try to remove the C style comments from text with non-greedy regular expression, and in this case start of comment (/*) is found, and end of comment (*/) can not be found. Notice the multiline and other re options.

Python versions used: 

'2.7.11 (default, Jan 22 2016, 16:30:50) \n[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]' / macOs 10.12.13

and:
'2.7.12 (default, Nov 19 2016, 06:48:10) \n[GCC 5.4.0 20160609]' -> 
Linux 84-Ubuntu SMP Wed Feb 1 17:20:32 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
msg291109 - (view) Author: Gareth Rees (gdr@garethrees.org) * (Python triager) Date: 2017-04-04 08:03
The problem here is that both "." and "\s" match a whitespace character, and because you have the re.DOTALL flag turned on this includes "\n", and so the number of different ways in which (.|\s)* can be matched against a string is exponential in the number of whitespace characters in the string.

It is best to design your regular expression so as to limit the number of different ways it can match. Here I recommend the expression:

    /\*(?:[^*]|\*[^/])*\*/

which can match in only one way.
msg291110 - (view) Author: Gareth Rees (gdr@garethrees.org) * (Python triager) Date: 2017-04-04 08:06
See also issue28690, issue212521, issue753711, issue1515829, etc.
msg291111 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-04-04 08:09
This is a well known issue called catastrophic backtracking. It can't be solved with the current implementation of the regular expression engine. The best you can rewrite your regular expression. Even replacing "(.|\s)" with just "." can help.
msg291140 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2017-04-04 20:49
A slightly shorter form:

    /\*(?:(?!\*/).)*\*/

Basically it's:

    match start

    while not match end:
        consume character

    match end

If the "match end" is a single character, you can use a negated character set, for example:

    [^\n]*

otherwise you need a negative lookahead, for example:

    (?:(?!\r\n).)*
History
Date User Action Args
2017-04-04 20:49:12mrabarnettsetmessages: + msg291140
2017-04-04 08:09:37serhiy.storchakasetstatus: open -> closed

nosy: + serhiy.storchaka
messages: + msg291111

resolution: wont fix
stage: resolved
2017-04-04 08:06:43gdr@garethrees.orgsetmessages: + msg291110
2017-04-04 08:03:16gdr@garethrees.orgsetnosy: + gdr@garethrees.org
messages: + msg291109
2017-04-04 07:00:42Robert Lujocreate