classification
Title: Simple regex appears to take exponential time in length of input
Type: performance Stage:
Components: Regular Expressions Versions: Python 3.8
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: brezniczky, ezio.melotti, mrabarnett
Priority: normal Keywords:

Created on 2021-07-21 15:21 by brezniczky, last changed 2021-07-21 16:17 by brezniczky.

Files
File name Uploaded Description Edit
regex_test.py brezniczky, 2021-07-21 15:21 Regex time consumption demonstration
Messages (5)
msg397948 - (view) Author: János Brezniczky (brezniczky) Date: 2021-07-21 15:21
I don't know if it's normal, but it's impractical.
I seem to have come by an expression consuming o(c^n) CPU cycles with c around 2.

Regex:
\*([^*]+)*\*

Resulted in times (in seconds) of 

0.17   (* is there anybody out?)
0.34 1
0.69 12
1.36 123
2.73  1234
5.44  12345
11.1  123456  (* is there anybody123456 out?)


Please see the source for more details/repro.
msg397949 - (view) Author: János Brezniczky (brezniczky) Date: 2021-07-21 15:25
Might be related to https://bugs.python.org/issue43075, as in would trigger ReDOSability had my syntax highlighting-related changes (which eventually broke golden master tests, fortunately :) ) hit the public.

(Meaning my reduced - possibly minimal - corner case stems from a practical scenario as well.)
msg397951 - (view) Author: János Brezniczky (brezniczky) Date: 2021-07-21 15:58
I'd also raise for consideration the introduction a (default?) timeout on regexes, similarly to how such a feature seems available in .NET. 

Given the DOS vector vs. occasionally non-trivially complex expressions, this could draw developer attention to this security aspect and stimulate the evolution of a more secure ecosystem.

https://docs.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.regex.matchtimeout?view=net-5.0
msg397952 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2021-07-21 16:03
It's called "catastrophic backtracking". Think of the number of ways it could match, say, 4 characters: 4, 3+1, 2+2, 2+1+1, 1+3, 1+2+1, 1+1+2, 1+1+1+1. Now try 5 characters...
msg397953 - (view) Author: János Brezniczky (brezniczky) Date: 2021-07-21 16:17
Hello! Thanks for the reply - you are probably right, and I accept that it's normal. However, there may be numerous frameworks getting built on the concept of collaboratively contributed regexes.

(Python likes prototyping, prototyping likes portable programming constructs, regular expressions are portable, for this and other factors and as past practice suggests, these will be a temptingly plausible choice as a basis for frameworks, etc.)

So I do accept it's normal (I haven't been deeply into regexes for cca 10 years, but I vaguely recall what's complex depends a lot on the type of engine, maybe A or B? pass on that one).

However people make mistakes that attackers seek...

(Not to lose that point, I am otherwise okay to let it go as "not a bug" if you're certain! I guess something should be done though.)
History
Date User Action Args
2021-07-21 16:17:46brezniczkysetmessages: + msg397953
2021-07-21 16:03:33mrabarnettsetmessages: + msg397952
2021-07-21 15:58:05brezniczkysetmessages: + msg397951
2021-07-21 15:25:52brezniczkysetmessages: + msg397949
2021-07-21 15:21:57brezniczkycreate