Issue 30148: Pathological regex behaviour


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/74334

classification

Title:	Pathological regex behaviour
Type:	resource usage	Stage:	resolved
Components:	Regular Expressions	Versions:

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:		Nosy List:	ezio.melotti, jpakkane, mrabarnett, serhiy.storchaka, tim.peters
Priority:	normal	Keywords:

Created on 2017-04-23 19:26 by jpakkane, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description
retest.py	jpakkane, 2017-04-23 19:26

Messages (4)
msg292181	Author: Jussi Pakkanen (jpakkane)	Date: 2017-04-23 19:26
Attached is a script that runs a single regex against one line of text and takes over 12 seconds to do so. If you run the exact same regex in Perl, it finishes immediately. The slowness has something to do with spaces: if you replace consecutive spaces in the input with one, the evaluation is immediate. This bug was originally discovered here: https://bugzilla.gnome.org/show_bug.cgi?id=781569
msg292183	Author: Matthew Barnett (mrabarnett) *	Date: 2017-04-23 20:04
If 'ignores' is '', you get this:

(?:\b(?:extern|G_INLINE_FUNC|%s)\s*)

which can match an empty string, and it's tried repeatedly. That's inadvisable.

There's also:

(?:\s+|\*)+

which can match whitespace in multiple ways. That's inadvisable too.

If the pattern really doesn't match the string (and it doesn't!), then it won't find out until it has tried _all_ of the possibilities. Some implementations, such as Perl's, have extra checks to try to reduce the problem.
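As a rough illustration of the second point, and assuming that fragment is meant to read `(?:\s+|\*)+` (whitespace runs or a literal asterisk, repeated), a character class accepts the same text while leaving only one way to divide it up. The names `AMBIGUOUS`, `UNAMBIGUOUS`, and `same_match` are illustrative and not taken from the attached script.

```python
import re

# Assumed reading of the fragment: runs of whitespace or a literal '*', repeated.
# A run of blanks can be split between iterations in many different ways.
AMBIGUOUS = re.compile(r'(?:\s+|\*)+')

# A character class consumes one character per step, so there is only one
# way to match any given run of whitespace and asterisks.
UNAMBIGUOUS = re.compile(r'[\s*]+')


def same_match(text):
    """Return True if both patterns agree on the match at the start of text."""
    a = AMBIGUOUS.match(text)
    b = UNAMBIGUOUS.match(text)
    if a is None or b is None:
        return a is None and b is None
    return a.span() == b.span()


samples = ['   ', ' * *', '**', 'x  ', '', ' \t*\n char']
print(all(same_match(s) for s in samples))
```

Whether such a substitution is safe in the original script depends on the rest of the pattern, which is not shown here.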
msg292237	Author: Jussi Pakkanen (jpakkane)	Date: 2017-04-24 19:55
This is slow even when ignores is set to a non-empty value. It's not as slow, but the real slowdown is in the whitespace regex. Here is a minimal sample:

input = ' abc'
re.search(r'(\s+)+d', input)
msg292238	Author: Tim Peters (tim.peters) *	Date: 2017-04-24 20:33
Yes, that example takes time exponential in the number of blanks to (fail to) match - each time you add a blank to `input`, it essentially doubles the time required. It's _possible_ for an implementation to deduce that `(\s+)+` is an insanely inefficient way to spell `\s+`, like it's _possible_ for an implementation to deduce that `10**10**10 - 10**10**10` is an insanely inefficient way to spell `0`. Python's does not. To understand what's going on, Friedl's book "Mastering Regular Expressions" is an excellent source.
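The doubling described above can be observed with a small timing sketch. It compares the nested pattern from the minimal sample with the flat `\s+d`; the helper `time_search` and the chosen blank counts are assumptions made for illustration, and absolute timings will vary by machine and Python version.

```python
import re
import time

NESTED = re.compile(r'(\s+)+d')  # pattern from the minimal sample
FLAT = re.compile(r'\s+d')       # same overall matches, no nested repetition


def time_search(pattern, n_blanks):
    """Search n_blanks spaces followed by 'abc'; return (match, seconds)."""
    text = ' ' * n_blanks + 'abc'
    start = time.perf_counter()
    match = pattern.search(text)
    return match, time.perf_counter() - start


if __name__ == '__main__':
    # Keep the counts small: each extra blank roughly doubles the time
    # the nested pattern needs before it can report failure.
    for n in range(12, 21, 2):
        _, nested = time_search(NESTED, n)
        _, flat = time_search(FLAT, n)
        print(f'{n:2d} blanks: nested {nested:.4f}s  flat {flat:.6f}s')
```

Neither pattern matches, since the text contains no `d`; the difference is only in how long each takes to find that out. Raising the upper bound much past a couple of dozen blanks may make the nested search run for a very long time.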

History
Date	User	Action	Args
2022-04-11 14:58:45	admin	set	github: 74334
2017-11-16 14:56:22	serhiy.storchaka	set	status: open -> closed nosy: + serhiy.storchaka resolution: wont fix stage: resolved
2017-04-24 20:33:56	tim.peters	set	nosy: + tim.peters messages: + msg292238
2017-04-24 19:55:25	jpakkane	set	messages: + msg292237
2017-04-23 20:04:05	mrabarnett	set	messages: + msg292183
2017-04-23 19:26:02	jpakkane	create