classification
Title: Regex causes python to hang up? / loop infinite?
Type: behavior Stage:
Components: Regular Expressions Versions: Python 2.5
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: amaury.forgeotdarc, computercrustie
Priority: normal Keywords:

Created on 2008-06-17 07:16 by computercrustie, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
re_problem.py computercrustie, 2008-06-17 07:16 Example regex
Messages (7)
msg68304 - (view) Author: André Fritzsche (computercrustie) Date: 2008-06-17 07:15
After struggling with my code for nearly an hour, I found that one of
my regular expressions, applied to a particular string, causes Python
to hang. It is not really hung, because processor usage is at nearly
100%, so I think the regex engine is looping infinitely.

Here is the regex-string:

import re

re_exc_line = re.compile(
        # ignore everything before the first match
        r'^.*' +
        # first group (includes second | third)
        r'(?:' +
         # second group "(line) (file)"
         r'(?:' +
          # (text to ignore, line [number])
          r'\([^,]+\s*,\s*line\s+(?P<line1>\d+)\)' +
          # any text ([filename]) any text
          r'.*\((?:(?P<file1>[^)]+))*\).*' +
         # end of second group
         r')' +
        # or
        r'|' +
         # third group "(file) (line)"
         r'(?:' +
          # ([filename])
          r'\((?:(?P<file2>[^)]+))*\)' +
          # any text (text to ignore, line [number]) any text
          r'.*\([^,]+\s*,\s*line\s+(?P<line2>\d+)\).*' +
          # end of third group
         r')' +
        # end of first group
        r')' +
        # any text after it
        r'.*$'
        , re.I
    )

It should match either the construct:

1. """some optional text (text to ignore, line [12]) ([any_filename])
followed by optional text"""

or:

2. """some optional text ([any_filename]) (text to ignore, line [12])
followed by optional text"""

If the first form matches, the values go into 'line1' and 'file1' of the
groupdict; if the second form matches, they go into 'line2' and 'file2'.

Both examples above work fine, but consider the following string (I had
to change some path names because they contained customer names):
msg = (
r"Error: Error during parsing: invalid syntax " +
r"(D:\Projects\retest\ver_700\lib\_test\test_sapekl.py, line 14) " +
r"-- Error during parsing: invalid syntax " + 
r"(D:\projects\retest\ver_700\modules\sapekl\__init__.py, line 21) " +
r"-- Attempted relative import in non-package, or beyond toplevel " +
r"package")

used with the regex above:

re_exc_line.match(msg)

has been running for two hours now (on a 3 GHz machine)!

I've attached everything as an example file and hope it helps.
msg68306 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-06-17 07:52
To optimize your query, you could remove '^.*' and '.*$', and replace
match() with search().
Now it returns instantly...
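
A minimal sketch of this suggestion, assuming the pattern and message from the report above. The name `re_exc_line_search` is hypothetical, and the pattern is written on fewer lines for brevity; the only intended change is dropping the leading '^.*' and the trailing '.*$' and calling search() instead of match().

```python
import re

# Same two alternatives as in the report, without '^.*' in front
# and '.*$' at the end (hypothetical name, layout condensed).
re_exc_line_search = re.compile(
    # "(text to ignore, line N) ... (file)"
    r'(?:\([^,]+\s*,\s*line\s+(?P<line1>\d+)\)'
    r'.*\((?:(?P<file1>[^)]+))*\).*)'
    r'|'
    # "(file) ... (text to ignore, line N)"
    r'(?:\((?:(?P<file2>[^)]+))*\)'
    r'.*\([^,]+\s*,\s*line\s+(?P<line2>\d+)\).*)',
    re.I,
)

msg = (
    r"Error: Error during parsing: invalid syntax "
    r"(D:\Projects\retest\ver_700\lib\_test\test_sapekl.py, line 14) "
    r"-- Error during parsing: invalid syntax "
    r"(D:\projects\retest\ver_700\modules\sapekl\__init__.py, line 21) "
    r"-- Attempted relative import in non-package, or beyond toplevel "
    r"package")

# search() scans forward for the first position where the pattern matches.
m = re_exc_line_search.search(msg)
print(m.group('line1'))
print(m.group('file1'))
```

Note that the extra '*' after the filename groups is still present here, so this sketch only sidesteps the slow path for this particular input; other non-matching inputs may still backtrack heavily.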
msg68308 - (view) Author: André Fritzsche (computercrustie) Date: 2008-06-17 08:14
Thank you for this answer.
It solves my problem, but I think that the issue still exists -
or not? (The regex has now been running for 3 hours.)
msg68311 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-06-17 10:06
Are you sure your regexp will return what you want?

The best match for the first part of the alternative is
("14",
 "D:\projects\retest\ver_700\modules\sapekl\__init__.py, line 21"
)
The best match for the second part is
("D:\Projects\retest\ver_700\lib\_test\test_sapekl.py, line 14",
 "21"
)

IOW, don't forget that the * operator will first try the longest
possible match.
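
To illustrate the greedy behaviour on a smaller input, here is a minimal sketch. The sample string `text` is made up for this example and does not come from the report.

```python
import re

# Hypothetical sample with two "line N" occurrences.
text = "see (a.py, line 1) then (b.py, line 2)"

# Greedy '.*' takes as much as it can, so the group ends up on the LAST "line N".
greedy = re.match(r'.*line\s+(\d+)', text)
# Non-greedy '.*?' takes as little as it can, so the group ends up on the FIRST.
lazy = re.match(r'.*?line\s+(\d+)', text)

print(greedy.group(1))  # 2
print(lazy.group(1))    # 1
```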

Also, there seems to be an extra * here:
          # any text ([filename]) any text
          r'.*\((?:(?P<file1>[^)]+))*\).*' +
                                    ^
This alone can make the number of combinations explode.
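
A rough way to see the effect is to time a reduced version of the group on inputs that cannot match. This is only a sketch: the patterns `nested` and `single`, the trailing 'X', and the helper `time_failed_match` are all hypothetical, and the timings will vary by machine and Python version. The input sizes are kept small on purpose.

```python
import re
import time

# Reduced, hypothetical patterns: the same "(...)" group with and without the extra '*'.
nested = re.compile(r'\((?:(?P<file>[^)]+))*\)X')  # a '+' inside a '*'
single = re.compile(r'\((?P<file>[^)]*)\)X')       # one quantifier only

def time_failed_match(pattern, n):
    """Return (match result, elapsed seconds) for a subject that cannot match."""
    subject = '(' + 'a' * n + ')'  # no trailing 'X', so the overall match fails
    start = time.perf_counter()
    result = pattern.match(subject)
    return result, time.perf_counter() - start

for n in (10, 12, 14, 16):
    _, t_nested = time_failed_match(nested, n)
    _, t_single = time_failed_match(single, n)
    print(f"n={n:2d}  nested: {t_nested:.6f}s  single: {t_single:.6f}s")
```

With the nested form, the engine can split the n characters between repetitions in many different ways before giving up, so the time is expected to grow quickly as n increases, while the single-quantifier form should stay close to flat.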
msg68314 - (view) Author: André Fritzsche (computercrustie) Date: 2008-06-17 11:13
Until now I was, because the string listed above wasn't expected by this
code (until it occurred for the first time ;-) )

Normally there has been only one occurrence of "(file) (.., line)" or
"(.., line) (file)" per string, so the regex did pretty much what I
expected. Nevertheless you are right: instead of the '*' operator I
should have used a '?' operator for the repetition.

Many thanks for your recommendations so far - but the question is
whether it is OK that a regex may block a process for such a long time?
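
A minimal sketch of that change, assuming the pattern from the original report with the '*' after each filename group replaced by '?'. The name `re_exc_line_fixed` and the two short sample strings are hypothetical; the long message is the one from the report.

```python
import re

# Pattern from the report; the only intended change is '?' instead of '*'
# after each filename group, so the group is optional rather than repeated.
re_exc_line_fixed = re.compile(
    r'^.*'
    r'(?:'
    # "(text to ignore, line N) ... (file)"
    r'(?:'
    r'\([^,]+\s*,\s*line\s+(?P<line1>\d+)\)'
    r'.*\((?:(?P<file1>[^)]+))?\).*'
    r')'
    r'|'
    # "(file) ... (text to ignore, line N)"
    r'(?:'
    r'\((?:(?P<file2>[^)]+))?\)'
    r'.*\([^,]+\s*,\s*line\s+(?P<line2>\d+)\).*'
    r')'
    r')'
    r'.*$',
    re.I,
)

msg = (
    r"Error: Error during parsing: invalid syntax "
    r"(D:\Projects\retest\ver_700\lib\_test\test_sapekl.py, line 14) "
    r"-- Error during parsing: invalid syntax "
    r"(D:\projects\retest\ver_700\modules\sapekl\__init__.py, line 21) "
    r"-- Attempted relative import in non-package, or beyond toplevel "
    r"package")

m = re_exc_line_fixed.match(msg)
print(m.group('line1'), m.group('file1'))

# The two simple forms described in the report (made-up sample strings).
ok1 = re_exc_line_fixed.match("some text (text to ignore, line 12) (any_filename) more text")
ok2 = re_exc_line_fixed.match("some text (any_filename) (text to ignore, line 12) more text")
print(ok1.group('line1'), ok1.group('file1'))
print(ok2.group('line2'), ok2.group('file2'))
```

With a single optional group, the engine no longer has many ways to split one filename across repetitions, so failed attempts should stay cheap. The match on the long message still pairs "line 14" with the second parenthesised text, as pointed out in the earlier reply.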
msg68315 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-06-17 11:44
Yes, this can happen.

See http://www.regular-expressions.info/catastrophic.html

I am sure your regexp belongs to the same category.
msg68316 - (view) Author: André Fritzsche (computercrustie) Date: 2008-06-17 12:11
Thanks for the link, it was very interesting to read what can happen in
some circumstances.

I think the first two chapters apply to this problem.

So the type of this issue should be feature request ;-)

Nevertheless I learned something new, so the time invested wasn't wasted.

Greez
History
Date User Action Args
2022-04-11 14:56:35 admin set github: 47378
2008-06-17 12:11:09 computercrustie set messages: + msg68316
2008-06-17 11:45:20 amaury.forgeotdarc set status: open -> closed; resolution: not a bug; messages: + msg68315
2008-06-17 11:13:17 computercrustie set messages: + msg68314
2008-06-17 10:07:02 amaury.forgeotdarc set messages: + msg68311
2008-06-17 08:14:47 computercrustie set messages: + msg68308
2008-06-17 07:52:10 amaury.forgeotdarc set nosy: + amaury.forgeotdarc; messages: + msg68306
2008-06-17 07:16:21 computercrustie create