classification
Title: Interpreter gets stuck while applying a regex pattern
Type: behavior Stage: resolved
Components: Regular Expressions Versions: Python 2.7
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: Sateesh Kumar, SilentGhost, ezio.melotti, mrabarnett, remi.lapeyre, tim.peters
Priority: normal Keywords:

Created on 2019-02-07 17:33 by Sateesh Kumar, last changed 2019-02-08 02:17 by tim.peters. This issue is now closed.

Files
File name Uploaded Description Edit
gdb_trace Sateesh Kumar, 2019-02-07 17:33
Messages (4)
msg335029 - (view) Author: Sateesh Kumar (Sateesh Kumar) Date: 2019-02-07 17:33
The python interpreter gets stuck while applying a compiled regex pattern against a given string.

The regex matching doesn't get stuck if uncompiled regex pattern is used, or if the flag "re.IGNORECASE" is not used for regex match. 

Below code snippets gives more details.

* Variant-1
# Python process gets stuck

~/workbench$ cat match.py 
import re
pattern = "^[_a-z0-9-]+([\.'_a-z0-9-]+)*@[a-z0-9]+([\.a-z0-9-]+)*(\.[a-z]{2,4})$"
compiled_pattern = re.compile(pattern, re.IGNORECASE)
val = "Z230-B900_X-Suite_Migration_Shared_Volume"
re.match(compiled_pattern, val)

~/workbench$ python match.py 
^^^ The interpreter gets stuck.

* Variant-2 (Using uncompiled pattern) 
# python interpreter doesn't get stuck

~/workbench$ cat match.py
import re
pattern = "^[_a-z0-9-]+([\.'_a-z0-9-]+)*@[a-z0-9]+([\.a-z0-9-]+)*(\.[a-z]{2,4})$"
compiled_pattern = re.compile(pattern, re.IGNORECASE)
val = "Z230-B900_X-Suite_Migration_Shared_Volume"
re.match(pattern, val)

* Variant-3 (Using compiled pattern, but without flag re.IGNORECASE)
# Python interpreter doesn't get stuck

~/workbench$ cat match.py
import re
pattern = "^[_a-z0-9-]+([\.'_a-z0-9-]+)*@[a-z0-9]+([\.a-z0-9-]+)*(\.[a-z]{2,4})$"
compiled_pattern = re.compile(pattern)
val = "Z230-B900_X-Suite_Migration_Shared_Volume"
re.match(compiled_pattern, val)

# Platform details
~/workbench$ python -V
Python 2.7.12

~/workbench$ uname -a
Linux ubuntu16-template 4.4.0-59-generic #80-Ubuntu SMP Fri Jan 6 17:47:47 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Though above details are from Python/2.7 I see similar beahviour with Python/3.5.2 too.
msg335035 - (view) Author: SilentGhost (SilentGhost) * (Python triager) Date: 2019-02-07 19:22
In your variant 2 you're not using re.IGNORECASE flag, if you do you're likely to encounter the same behaviour as for the compiled pattern. At least I do on python3.6
msg335046 - (view) Author: Sateesh Kumar (Sateesh Kumar) Date: 2019-02-07 22:44
@SilentGhost, You are right, when I pass the flag "re.IGNORECASE" for example in Variant-2 the python process do gets stuck.  

Here is the revised code for Variant-2:

* Variant-2
%cat match.py
import re
pattern = "^[_a-z0-9-]+([\.'_a-z0-9-]+)*@[a-z0-9]+([\.a-z0-9-]+)*(\.[a-z]{2,4})$"
val = "Z230-B900_X-Suite_Migration_Shared_Volume"
re.match(pattern, val, re.IGNORECASE)

~/workbench$ python match.py 
^^^ The interpreter gets stuck.

Thanks for the correction. I have also modified the title of this issue accordingly.
msg335053 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-02-08 02:17
Without re.IGNORECASE, the leading

    ^[_a-z0-9-]+

can't even match the first character (`val` starts with uppercase Z), so it fails instantly.

With re.IGNORECASE, it's not "stuck", but is taking a verrrrrry long time to try an enormous number of (ultimately doomed) possibilities due to the way the regexp is written.

This is due to using nested quantifiers for no apparent reason.  For any character class C,

    (C+)*

matches the same set of strings as

    C*

but the former way can _try_ to match in an exponential (in the length of the string) number of ways.  So replace

    ([\.'_a-z0-9-]+)*
and
    ([\.a-z0-9-]+)*

with

    [\.'_a-z0-9-]*
and
    [\.a-z0-9-]*

and it fails to match `val` quickly (even with re.IGNORECASE).

For more on this (which applies to many regexp implementations, not just Python's), here's a start:

https://www.mathworks.com/matlabcentral/answers/95953-why-can-nested-quantifiers-in-regexp-can-cause-inefficient-failures-in-matlab-6-5-r13

The "Mastering Regular Expressions" book referenced in that answer is an excellent book-length treatment of this (and related) topic(s).
History
Date User Action Args
2019-02-08 02:17:30tim.peterssetstatus: open -> closed

nosy: + tim.peters
messages: + msg335053

resolution: not a bug
stage: resolved
2019-02-07 22:44:14Sateesh Kumarsetmessages: + msg335046
title: Interpreter gets stuck while applying a compiled regex pattern -> Interpreter gets stuck while applying a regex pattern
2019-02-07 19:22:13SilentGhostsetnosy: + SilentGhost, ezio.melotti, mrabarnett
messages: + msg335035

components: + Regular Expressions
type: crash -> behavior
2019-02-07 17:56:32remi.lapeyresetnosy: + remi.lapeyre
2019-02-07 17:33:27Sateesh Kumarcreate