This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: re.findall takes forever and never ends
Type: behavior Stage: resolved
Components: Regular Expressions Versions: Python 3.9
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, gdr@garethrees.org, mrabarnett, ramzitra, serhiy.storchaka
Priority: normal Keywords:

Created on 2021-12-13 13:26 by ramzitra, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
res.zip ramzitra, 2021-12-13 13:26
Messages (7)
msg408453 - (view) Author: Ramzi Trabelsi (ramzitra) Date: 2021-12-13 13:26
parsing emails from this text took forever and never ends. Here the code 
 and the file res.html is attached.
The Behavior is same on Windows 10, 11 and Ubuntu 18.04

CODE:

import re
pattern_email      = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]{2,3}"
with open("res.html","r",encoding="utf-8") as FF:
	TEXT = FF.read()
matched_email          = re.findall(pattern_email,TEXT)
msg408455 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-12-13 15:06
It ends, but it tooks several minutes to complete.

It is a limitation of the regular expression implementation in Python. Your input contains a sequence of 588431 characters which match the pattern [a-zA-Z0-9_.+-] not following by '@'. The engine finds the first character in this sequence, then scans 588431 characters matching this pattern, but does not find '@' after them. So it backtracks, steps back by one character and tries to match '@', fails, and continue stepping back until returns to the initial character. 588431 steps forward and 588431 steps back are needed to find that no matches starting at this position. So it steps forward and try the proce3dure from a new position. No it does 588430 steps in both direction. Totally it needs to do 588431+588430+588429+...+1 ~ 588431**2/2 ~ 173e9 steps. It takes a long time.
msg408456 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-12-13 15:13
The simplest example is:

re.search('a@', 'a'*100000)
msg408457 - (view) Author: Ramzi Trabelsi (ramzitra) Date: 2021-12-13 15:27
thanks for the answer. Is there any workaround for this ?
msg408459 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-12-13 15:43
Limit the number of repetitions. For example use "{1,100}" (or what is the expected maximal length of email) instead of "+".
msg408909 - (view) Author: Gareth Rees (gdr@garethrees.org) * (Python triager) Date: 2021-12-19 16:10
The way to avoid this behaviour is to disallow the attempts at matching that you know are going to fail. As Serhiy described above, if the search fails starting at the first character of the string, it will move forward and try again starting at the second character. But you know that this new attempt must fail, so you can force the regular expression engine to discard the attempt immediately.

Here's an illustration in a simpler setting, where we are looking for all strings of 'a' followed by 'b':

    >>> import re
    >>> from timeit import timeit
    >>> text = 'a' * 100000
    >>> timeit(lambda:re.findall(r'a+b', text), number=1)
    6.643531181000014

We know that any successful match must be preceded by a character other than 'a' (or the beginning of the string), so we can reject many unsuccessful matches like this:

    >>> timeit(lambda:re.findall(r'(?:^|[^a])(a+b)', text), number=1)
    0.003743481000014981

In your case, a successful match must be preceded by [^a-zA-Z0-9_.+-] (or the beginning of the string).
msg408911 - (view) Author: Gareth Rees (gdr@garethrees.org) * (Python triager) Date: 2021-12-19 16:42
This kind of question is frequently asked (#3128, #29977, #28690, #30973, #1737127, etc.), and so maybe it deserves an answer somewhere in the Python documentation.
History
Date User Action Args
2022-04-11 14:59:53adminsetgithub: 90223
2021-12-19 16:42:38gdr@garethrees.orgsetstatus: open -> closed
resolution: wont fix
messages: + msg408911

stage: resolved
2021-12-19 16:10:44gdr@garethrees.orgsetnosy: + gdr@garethrees.org
messages: + msg408909
2021-12-13 15:43:18serhiy.storchakasetmessages: + msg408459
2021-12-13 15:27:16ramzitrasetmessages: + msg408457
2021-12-13 15:13:22serhiy.storchakasetmessages: + msg408456
2021-12-13 15:06:09serhiy.storchakasetmessages: + msg408455
2021-12-13 13:32:45ned.deilysetnosy: + ezio.melotti, serhiy.storchaka, mrabarnett, - ronaldoussoren, ned.deily
type: crash -> behavior
components: + Regular Expressions
2021-12-13 13:31:51ramzitrasetcomponents: - macOS
2021-12-13 13:26:28ramzitracreate