Issue 46065: re.findall takes forever and never ends

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/90223

classification

Title:	re.findall takes forever and never ends
Type:	behavior	Stage:	resolved
Components:	Regular Expressions	Versions:	Python 3.9

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:		Nosy List:	ezio.melotti, gdr@garethrees.org, mrabarnett, ramzitra, serhiy.storchaka
Priority:	normal	Keywords:

Created on 2021-12-13 13:26 by ramzitra, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
res.zip	ramzitra, 2021-12-13 13:26

Messages (7)
msg408453 - (view)	Author: Ramzi Trabelsi (ramzitra)	Date: 2021-12-13 13:26
parsing emails from this text took forever and never ends. Here the code and the file res.html is attached. The Behavior is same on Windows 10, 11 and Ubuntu 18.04 CODE: import re pattern_email = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]{2,3}" with open("res.html","r",encoding="utf-8") as FF: TEXT = FF.read() matched_email = re.findall(pattern_email,TEXT)
msg408455 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2021-12-13 15:06
It ends, but it tooks several minutes to complete. It is a limitation of the regular expression implementation in Python. Your input contains a sequence of 588431 characters which match the pattern [a-zA-Z0-9_.+-] not following by '@'. The engine finds the first character in this sequence, then scans 588431 characters matching this pattern, but does not find '@' after them. So it backtracks, steps back by one character and tries to match '@', fails, and continue stepping back until returns to the initial character. 588431 steps forward and 588431 steps back are needed to find that no matches starting at this position. So it steps forward and try the proce3dure from a new position. No it does 588430 steps in both direction. Totally it needs to do 588431+588430+588429+...+1 ~ 588431**2/2 ~ 173e9 steps. It takes a long time.
msg408456 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2021-12-13 15:13
The simplest example is: re.search('a@', 'a'*100000)
msg408457 - (view)	Author: Ramzi Trabelsi (ramzitra)	Date: 2021-12-13 15:27
thanks for the answer. Is there any workaround for this ?
msg408459 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2021-12-13 15:43
Limit the number of repetitions. For example use "{1,100}" (or what is the expected maximal length of email) instead of "+".
msg408909 - (view)	Author: Gareth Rees (gdr@garethrees.org) *	Date: 2021-12-19 16:10
The way to avoid this behaviour is to disallow the attempts at matching that you know are going to fail. As Serhiy described above, if the search fails starting at the first character of the string, it will move forward and try again starting at the second character. But you know that this new attempt must fail, so you can force the regular expression engine to discard the attempt immediately. Here's an illustration in a simpler setting, where we are looking for all strings of 'a' followed by 'b': >>> import re >>> from timeit import timeit >>> text = 'a' * 100000 >>> timeit(lambda:re.findall(r'a+b', text), number=1) 6.643531181000014 We know that any successful match must be preceded by a character other than 'a' (or the beginning of the string), so we can reject many unsuccessful matches like this: >>> timeit(lambda:re.findall(r'(?:^\|[^a])(a+b)', text), number=1) 0.003743481000014981 In your case, a successful match must be preceded by [^a-zA-Z0-9_.+-] (or the beginning of the string).
msg408911 - (view)	Author: Gareth Rees (gdr@garethrees.org) *	Date: 2021-12-19 16:42
This kind of question is frequently asked (#3128, #29977, #28690, #30973, #1737127, etc.), and so maybe it deserves an answer somewhere in the Python documentation.

History
Date	User	Action	Args
2022-04-11 14:59:53	admin	set	github: 90223
2021-12-19 16:42:38	gdr@garethrees.org	set	status: open -> closed resolution: wont fix messages: + msg408911 stage: resolved
2021-12-19 16:10:44	gdr@garethrees.org	set	nosy: + gdr@garethrees.org messages: + msg408909
2021-12-13 15:43:18	serhiy.storchaka	set	messages: + msg408459
2021-12-13 15:27:16	ramzitra	set	messages: + msg408457
2021-12-13 15:13:22	serhiy.storchaka	set	messages: + msg408456
2021-12-13 15:06:09	serhiy.storchaka	set	messages: + msg408455
2021-12-13 13:32:45	ned.deily	set	nosy: + ezio.melotti, serhiy.storchaka, mrabarnett, - ronaldoussoren, ned.deily type: crash -> behavior components: + Regular Expressions
2021-12-13 13:31:51	ramzitra	set	components: - macOS
2021-12-13 13:26:28	ramzitra	create