Issue 40879
This issue tracker has been migrated to GitHub, and is currently read-only. For more information, see the GitHub FAQs in Python's Developer Guide.
Created on 2020-06-05 20:49 by Matt Miller, last changed 2022-04-11 14:59 by admin. This issue is now closed.
Messages (7)
msg370784 - Author: Matt Miller (Matt Miller) - Date: 2020-06-05 20:49

I was evaluating a few regular expressions for parsing URLs. One such expression (https://daringfireball.net/2010/07/improved_regex_for_matching_urls) causes the `re.Pattern` to exhibit some strange behavior (notice the stripped characters in the `repr`):

```
>>> STR_RE_URL = r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
>>> print(re.compile(STR_RE_URL))
re.compile('(?i)\\b((?:[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\, re.IGNORECASE)
```

The reason I started looking at this was that the following string causes the same `re.Pattern` object's `.search()` method to loop forever:

```
>>> weird_str = """AY:OhQOhQNhQLdLAX78N'7M&6K%4K#4K#7N&9P(JcHOiQE^=8P'F_DJdLC\@9P&D\;IdKHbJ@Z8AY7@Y7AY7B[9E_<Ha?G`>Jc@Jc:F_1PjRRlSOiLKeAKeAGa=D^:F`=Ga=Fa<MhHRmSRlSJc7Ga1Ic3Kd4Jc3Ga0<V&?Y*D]-Hb1Mg7D^/;S@+@)"""
>>> url_pat.search(weird_str)
```

The `.search(weird_str)` call never exits. I assume the `.search()` taking forever is an error in the expression, but the fact that it causes the `repr` to strip some characters was something I thought should be looked into. I have not tested this on any other versions of Python.
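One defensive way to investigate a report like this without wedging the interpreter is to run the search in a child process and kill it if it overruns a deadline. This is not part of the original report; it is a sketch, assuming POSIX-style process semantics, and the helper name and timeout values are illustrative. Match objects are not picklable, so the child sends back the span instead:

```python
import multiprocessing
import re

def _worker(pattern, text, conn):
    # Runs in the child: do the search and report the span (or None).
    m = re.search(pattern, text)
    conn.send(m.span() if m else None)
    conn.close()

def search_with_timeout(pattern, text, timeout=1.0):
    """Run re.search in a child process; raise TimeoutError if it takes too long."""
    parent_end, child_end = multiprocessing.Pipe(duplex=False)
    proc = multiprocessing.Process(target=_worker, args=(pattern, text, child_end))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        # The search is still grinding (e.g. catastrophic backtracking): kill it.
        proc.terminate()
        proc.join()
        raise TimeoutError("regex search exceeded %.1fs" % timeout)
    return parent_end.recv()
```

With a small exponential fragment of the URL regex, `search_with_timeout(r'(?:[^\s()<>]+)+x', 'a' * 64, 1.0)` raises `TimeoutError` instead of hanging, while matching inputs return normally.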
msg370803 - Author: Steven D'Aprano (steven.daprano) - Date: 2020-06-06 00:53

> notice the stripped characters in the `repr`

Er, no. Your regex looks like line noise, and it hurts my brain to look at it :-) If you have spotted a difference, can you tell us what characters are stripped? When I try running it, I don't get any characters stripped at all:

```
py> import re
py> STR_RE_URL = r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
py> r = re.compile(STR_RE_URL)
py> r.pattern == STR_RE_URL
True
```
msg370805 - Author: Steven D'Aprano (steven.daprano) - Date: 2020-06-06 00:58

Wait, I'm sorry, do you mean this?

```
py> repr(r)[13:-16]
'?i)\\\\b((?:[a-z][\\\\w-]+:(?:/{1,3}|[a-z0-9%])|www\\\\d{0,3}[.]|[a-z0-9.\\\\-]+[.][a-z]{2,4}/)(?:[^\\\\s()<>]+|\\\\(([^\\\\s()<>]+|(\\\\([^\\\\s()<>]+\\\\)))*\\\\))+(?:\\\\(([^\\\\s()<>]+|(\\\\([^\\\\s()<>]+\\\\)))*\\\\)|[^\\\\s`!()\\'
```

Referring to the pattern being truncated in the repr? I would assume that's intentional.
msg370806 - Author: Dennis Sweeney (Dennis Sweeney) - Date: 2020-06-06 01:10

It looks like only the first 200 characters of the input string's repr are used as the compiled pattern's repr, for some reason: https://github.com/python/cpython/blob/master/Modules/_sre.c#L1294

I don't know if there is a good reason, especially since this violates eval(repr(pattern)) == pattern in a bad way:

```
>>> eval(repr(re.compile(STR_RE_URL)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
    re.compile('(?i)\\b((?:[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\, re.IGNORECASE)
                                                                                                                                                                                                                                    ^
SyntaxError: EOL while scanning string literal
```
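The truncation is easy to demonstrate without the URL regex: any pattern whose repr exceeds the ~200-character budget gets clipped, while the full pattern survives on the object's `.pattern` attribute. A minimal sketch (the synthetic 300-character pattern is illustrative, not from the report):

```python
import re

# A pattern whose repr is well past the ~200-character display budget.
p = re.compile('a' * 300)

# The compiled object keeps the whole pattern string ...
assert p.pattern == 'a' * 300

# ... but the repr is clipped, so it is much shorter than a full
# "re.compile(<repr of pattern>)" would be, and loses the closing quote.
clipped = repr(p)
assert clipped.startswith("re.compile('aaa")
assert len(clipped) < len("re.compile(%r)" % ('a' * 300))
```

This is why `eval(repr(pattern))` dies with a `SyntaxError`: the clipped repr is no longer a valid string literal.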
msg370808 - Author: Tim Peters (tim.peters) - Date: 2020-06-06 01:21

The repr truncates the pattern string, for display, if it's "too long". The only visual clue about that, though, is that the display is missing the pattern string's closing quote, as in the output you showed here. If you look at url_pat.pattern, though, you'll see that nothing has been lost.

I'm not sure why it does that. As I vaguely recall, some years ago there was a crusade to limit maximum repr sizes because long output was considered to be "a security issue" (e.g., DoS attacks via tricking logging/auditing facilities into writing giant strings when recording reprs). In any case, that's all there is to that part.

For the rest, it's exceedingly unlikely that there's actually an infinite loop. Instead there's a messy regexp with multiple nested quantifiers, which are notorious for exhibiting exponential-time behavior, especially in non-matching cases. They can be rewritten to have linear-time behavior instead, but it's an effort I personally have no interest in pursuing here. See Jeffrey Friedl's "Mastering Regular Expressions" book for detailed explanations.

The reason I have no interest: it's almost always a losing idea to try to parse any aspect of HTML with regexps. Use an HTML parser instead (or for URLs specifically, see urllib.parse).
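For the specific fragment discussed in this thread, the rewrite is simple, because the nested quantifier is redundant: `(?:X+)+` matches exactly the same strings as `X+`, but gives the backtracker exponentially many ways to re-partition a failing input. A minimal sketch (pattern variable names are illustrative):

```python
import re

# Nested quantifier, as in the URL regex: failure time doubles per character.
slow = re.compile(r'(?:[^\s()<>]+)+x')

# Same language, but the single + leaves nothing to re-partition,
# so failure is linear in the input length.
fast = re.compile(r'[^\s()<>]+x')

s = 'a' * 100
assert fast.search(s) is None                    # fails immediately
assert fast.search(s + 'x').group() == s + 'x'   # still matches the same strings

# On small inputs the two agree, confirming the rewrite is equivalent.
assert slow.search('aaax').span() == fast.search('aaax').span()
```

On Python 3.11+, a possessive quantifier (e.g. `(?:[^\s()<>]+)++x`) is another way to get fast failure, since it forbids backtracking into the group; that feature postdates this thread.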
msg370810 - Author: Tim Peters (tim.peters) - Date: 2020-06-06 02:04

Note that the relatively tiny pattern here extracts just a small piece of the regexp in question. As the output shows, increase the length of a string it fails to match by one character, and the time taken to fail approximately doubles: exponential-time behavior.

```
>>> import re
>>> c = re.compile(r'(?:[^\s()<>]+)+x')
>>> from time import perf_counter as now
>>> size = 1
>>> while True:
...     s = 'a' * size
...     start = now()
...     junk = c.search(s)
...     finish = now()
...     print(size, finish - start)
...     size += 1
...
1 9.900000009110954e-06
2 1.1800000009998257e-05
3 1.1200000017197453e-05
4 1.2099999992187804e-05
5 1.5200000007098424e-05
6 1.5699999977414336e-05
7 2.119999999194988e-05
8 3.39000000053602e-05
9 4.9599999982774534e-05
10 7.779999998547282e-05
11 0.00014810000001830304
12 0.00034099999999170905
13 0.0006348999999943317
14 0.0012191000000143504
15 0.002482499999985066
16 0.004694100000023127
17 0.009342199999991863
18 0.01954169999999067
19 0.03880150000000526
20 0.0762141000000156
21 0.1472148999999945
22 0.27771670000001336
23 0.6491722000000095
24 1.3553119999999979
25 2.229829699999982
26 4.986566299999993
27 9.567925599999995
28 19.09181079999999
29 42.363349
30 83.57493059999999
31 158.88249489999998
...
```

The machine was far from quiet while running this, but it doesn't matter: the _point_ is dead obvious regardless.
msg370869 - Author: Tim Peters (tim.peters) - Date: 2020-06-07 03:36

Closing this as "Won't Fix", since the possibility of exponential-time behavior with naively written nested quantifiers is well known, and there are no plans to "do something" about that.
History

Date | User | Action | Args
---|---|---|---
2022-04-11 14:59:32 | admin | set | github: 85056 |
2020-06-07 03:36:24 | tim.peters | set | status: open -> closed; resolution: wont fix; messages: + msg370869; stage: resolved |
2020-06-06 02:04:04 | tim.peters | set | messages: + msg370810 |
2020-06-06 01:21:10 | tim.peters | set | nosy: + tim.peters; messages: + msg370808 |
2020-06-06 01:10:41 | Dennis Sweeney | set | nosy: + Dennis Sweeney; messages: + msg370806 |
2020-06-06 00:58:53 | steven.daprano | set | messages: + msg370805 |
2020-06-06 00:53:47 | steven.daprano | set | nosy: + steven.daprano; messages: + msg370803 |
2020-06-05 20:49:23 | Matt Miller | create |