This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Regex hangs indefinitely
Type: behavior Stage: resolved
Components: Regular Expressions Versions: Python 3.8
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, jblangston, mrabarnett, tim.peters
Priority: normal Keywords:

Created on 2022-02-03 18:40 by jblangston, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (5)
msg412450 - (view) Author: J.B. Langston (jblangston) Date: 2022-02-03 18:40
The following code will cause Python's regex engine to hang apparently indefinitely: 

import re
message = "Flushed to [BigTableReader(path='/data/cassandra/data/log/logEntry_202202-e68971800b2711ecaf770d5fa3f5ae87/md-112-big-Data.db')] (1 sstables, 8,650MiB), biggest 8,650MiB, smallest 8,650MiB"
regex = re.compile(r"Flushed to \[(?P<sstables>[^]]+)+\] \((?P<sstable_count>[^ ]+) sstables, (?P<total_size>[^)]+)\), biggest (?P<biggest_size>[^,]+), smallest (?P<smallest_size>[^ ]+)( \((?P<duration>\d+)ms\))?")
regex.match(message)

This may be a case of exponential backtracking similar to #35915 or #30973. Both of these issues have been closed as Wont Fix, and I suspect my issue is similar. The use of commas for decimal points in the input string was not anticipated but happened due to localization of the logs that the message came from.  The regex works properly when the decimal point is a period.

I will try to rewrite my regex to address this specific issue, but it's hard to anticipate every possible input and craft a bulletproof regex, so something like this kind of thing can be used for a denial of service attack (intentional or not). In this case the regex was used in an automated import process and caused the process to back up for many hours before someone noticed.  Maybe a solution could be to add a timeout option to the regex engine so it will give up and throw an exception if the regex executes for longer than the configured timeout.
msg412452 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2022-02-03 19:11
That pattern has:

    (?P<sstables>[^]]+)+

Is that intentional? It looks wrong to me.
msg412456 - (view) Author: J.B. Langston (jblangston) Date: 2022-02-03 19:31
Yes, it is supposed to match everything up to the closing ] in this substring: 

[BigTableReader(path='/data/cassandra/data/log/logEntry_202202-e68971800b2711ecaf770d5fa3f5ae87/md-112-big-Data.db')]

Quoting from the re docs:

To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set. For example, both [()[\]{}] and []()[{}] will both match a parenthesis.

The docs don't specifically state the case of a negated set using ^, but I have used this construction many times and never had a problem with it.

Furthermore, it is not what caused the regex to hang.  That was caused by "(?P<biggest_size>[^,]+)," and changing it to "(?P<biggest_size>.+?)," fixed the problem.
msg412458 - (view) Author: J.B. Langston (jblangston) Date: 2022-02-03 19:38
Sorry, on rereading your message I guess you were referring to the extra +, not the [^]]. The extra + after the ) was not intentional, and after removing it, the regex no longer hangs.

I still think it would be nice to have a timeout setting on the regex so it can't hang up an entire process.
msg412493 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2022-02-04 05:21
Introducing some kind of optional timeout is too involved to just drop in without significant discussion and design effort first. If you want to pursue this, please bring it up on the python-ideas mailing list.

My first take: it wouldn't really help, because nobody would use it until after it was too late ;-) So, in the absence of someone jumping on it right away, I'm closing this one too as "won't fix". If the idea gains traction, we can, of course, reopen this.

BTW, Jeffrey Friedl's book "Mastering Regular Expressions" teaches how to write bulletproof regexps. It's not always obvious at first glance, but it's an art that can be mastered.
History
Date User Action Args
2022-04-11 14:59:55adminsetgithub: 90785
2022-02-04 05:21:56tim.peterssetstatus: open -> closed

nosy: + tim.peters
messages: + msg412493

resolution: wont fix
stage: resolved
2022-02-03 19:38:22jblangstonsetmessages: + msg412458
2022-02-03 19:31:48jblangstonsetmessages: + msg412456
2022-02-03 19:11:21mrabarnettsetmessages: + msg412452
2022-02-03 18:40:17jblangstoncreate