This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Strange regex cycle
Type: Stage: resolved
Components: Library (Lib) Versions: Python 3.7
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: Dennis Sweeney, Matt Miller, steven.daprano, tim.peters
Priority: normal Keywords:

Created on 2020-06-05 20:49 by Matt Miller, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (7)
msg370784 - (view) Author: Matt Miller (Matt Miller) Date: 2020-06-05 20:49
I was evaluating a few regular expressions for parsing URL.  One such expression (https://daringfireball.net/2010/07/improved_regex_for_matching_urls) causes the `re.Pattern` to exhibit some strange behavior (notice the stripped characters in the `repr`):

```
>>> STR_RE_URL = r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
>>> print(re.compile(STR_RE_URL))
re.compile('(?i)\\b((?:[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\, re.IGNORECASE)
```

The reason I started looking at this was because the following string causes the same `re.Pattern` object's `.search()` method to loop forever for some reason:

```
>>> weird_str = """AY:OhQOhQNhQLdLAX78N'7M&6K%4K#4K#7N&9P(JcHOiQE^=8P'F_DJdLC\@9P&D\;IdKHbJ@Z8AY7@Y7AY7B[9E_<Ha?G`>Jc@Jc:F_1PjRRlSOiLKeAKeAGa=D^:F`=Ga=Fa<MhHRmSRlSJc7Ga1Ic3Kd4Jc3Ga0<V&?Y*D]-Hb1Mg7D^/;S@+@)"""
>>> url_pat.search(weird_str)
```

The `.search(weird_str)` will never exit.


I assume the `.search()` taking forever is is an error in the expression but the fact that it causes the `repr` to strip some characters was something I thought should be looked into.

I have not tested this on any other versions of Python.
msg370803 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2020-06-06 00:53
> notice the stripped characters in the `repr`

Er, no. Your regex looks like line noise, and it hurts my brain to look at it :-)

If you have spotted a difference, can you tell us what characters are stripped? When I try running it, I don't get any characters stripped at all:

    py> import re
    py> STR_RE_URL = r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
    py> r = re.compile(STR_RE_URL)
    py> r.pattern == STR_RE_URL
    True
msg370805 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2020-06-06 00:58
Wait, I'm sorry, do you mean this?

    py> repr(r)[13:-16]
    '?i)\\\\b((?:[a-z][\\\\w-]+:(?:/{1,3}|[a-z0-9%])|www\\\\d{0,3}[.]|[a-z0-9.\\\\-]+[.][a-z]{2,4}/)(?:[^\\\\s()<>]+|\\\\(([^\\\\s()<>]+|(\\\\([^\\\\s()<>]+\\\\)))*\\\\))+(?:\\\\(([^\\\\s()<>]+|(\\\\([^\\\\s()<>]+\\\\)))*\\\\)|[^\\\\s`!()\\'

Referring to the pattern being truncated in the repr? I would assume that's intentional.
msg370806 - (view) Author: Dennis Sweeney (Dennis Sweeney) * (Python committer) Date: 2020-06-06 01:10
It looks like only the first 200 characters of the input string's repr are used as the compiled pattern's repr for some reason:

    https://github.com/python/cpython/blob/master/Modules/_sre.c#L1294

I don't know if there is a good reason, especially since this violates eval(repr(pattern)) == pattern in a bad way:

    >>> eval(repr(re.compile(STR_RE_URL)))
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "<string>", line 1
        re.compile('(?i)\\b((?:[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\, re.IGNORECASE)
                                                                                                                                                                                                                                        ^
    SyntaxError: EOL while scanning string literal
msg370808 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2020-06-06 01:21
The repr truncates the pattern string, for display, if it's "too long". The only visual clue about that, though, is that the display is missing the pattern string's closing quote, as in the output you showed here. If you look at url_pat.pattern, though, you'll see that nothing has been lost.

I'm not sure why it does that.  As I vaguely recall, some years ago there was a crusade to limit maximum repr sizes because long output was considered to be "a security issue" (e.g., DoS attacks vis tricking logging/auditing facilities into writing giant strings when recording reprs).

In any case, that's all there is to that part.

For the rest, it's exceedingly unlikely that there's actually an infinite loop. Instead there's a messy regexp with multiple nested quantifiers, which are notorious for exhibiting exponential-time behavior and especially in non-matching cases. They can be rewritten to have linear-time behavior instead, but it's an effort I personally have no interest in pursuing here. See Jeffrey Friedl's "Mastering Regular Expressions" book for detailed explanations.

The reason I have no interest: it's almost always a losing idea to try to parse any aspect of HTML with regexps. Use an HTML parser instead (or for URLs specifically, see urllib.parse).
msg370810 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2020-06-06 02:04
Note that the relatively tiny pattern here extracts just a small piece of the regexp in question. As the output shows, increase the length of a string it fails to match by one character, and the time taken to fail approximately doubles: exponential-time behavior.

>>> import re
>>> c = re.compile(r'(?:[^\s()<>]+)+x')
>>> from time import perf_counter as now
>>> size = 1
>>> while True:
...     s = 'a' * size
...     start = now()
...     junk = c.search(s)
...     finish = now()
...     print(size, finish - start)
...     size += 1
	
1 9.900000009110954e-06
2 1.1800000009998257e-05
3 1.1200000017197453e-05
4 1.2099999992187804e-05
5 1.5200000007098424e-05
6 1.5699999977414336e-05
7 2.119999999194988e-05
8 3.39000000053602e-05
9 4.9599999982774534e-05
10 7.779999998547282e-05
11 0.00014810000001830304
12 0.00034099999999170905
13 0.0006348999999943317
14 0.0012191000000143504
15 0.002482499999985066
16 0.004694100000023127
17 0.009342199999991863
18 0.01954169999999067
19 0.03880150000000526
20 0.0762141000000156
21 0.1472148999999945
22 0.27771670000001336
23 0.6491722000000095
24 1.3553119999999979
25 2.229829699999982
26 4.986566299999993
27 9.567925599999995
28 19.09181079999999
29 42.363349
30 83.57493059999999
31 158.88249489999998
...

The machine was far from quiet while running this, but it doesn't matter: the _point_ is dead obvious regardless.
msg370869 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2020-06-07 03:36
Closing this as "Won't Fix", since the possibility of exponential-time behavior with naively written nested quantifiers is well known, and there are no plans to "do something" about that.
History
Date User Action Args
2022-04-11 14:59:32adminsetgithub: 85056
2020-06-07 03:36:24tim.peterssetstatus: open -> closed
resolution: wont fix
messages: + msg370869

stage: resolved
2020-06-06 02:04:04tim.peterssetmessages: + msg370810
2020-06-06 01:21:10tim.peterssetnosy: + tim.peters
messages: + msg370808
2020-06-06 01:10:41Dennis Sweeneysetnosy: + Dennis Sweeney
messages: + msg370806
2020-06-06 00:58:53steven.dapranosetmessages: + msg370805
2020-06-06 00:53:47steven.dapranosetnosy: + steven.daprano
messages: + msg370803
2020-06-05 20:49:23Matt Millercreate