
classification
Title: re.findall() takes a long time (100% CPU usage) on Python 3.6.10
Type: behavior
Stage: resolved
Components: Regular Expressions
Versions: Python 3.6

process
Status: closed
Resolution: out of date
Dependencies:
Superseder:
Assigned To:
Nosy List: ezio.melotti, mrabarnett, remi.lapeyre, serhiy.storchaka, srael
Priority: normal
Keywords:

Created on 2020-05-04 09:54 by srael, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (5)
msg368026 - Author: Sergio Rael (srael) Date: 2020-05-04 09:54
I have found a deadlock in Python 3.6.10 that seems to have been fixed in 3.7.x. It is probably related to capture groups. To reproduce the deadlock, run something like this:

import re

# The pattern is a raw string so that the backslash escapes reach the regex engine unchanged.
re.findall(
    r'\[et_pb_image(?:\w|=|"|\d|\.| |_|\/)*src="(https?:\/\/(?:www\.)?\w*\.\w*(?:\/|\w|\d|\.|-)*\.(?:png|jpg|jpeg|gif))"(?:\w|=|"|\d|\.| |_|\/|%|\|)*(?:\/?\])(?:\[\/et_pb_image\])?',
    '[et_pb_image _builder_version="3.27.2" src="https://www.somewhere.com/wp-content/uploads/2019/08/stabilizers.jpg" box_shadow_horizontal_tablet="0px" box_shadow_vertical_tablet="0px" box_shadow_blur_tablet="40px" box_shadow_spread_tablet="0px" z_index_tablet="500" url="https://youtu.be/fTrC5gkyYBM" url_new_window="on" /]',
)

I noticed that the problem is related to having two URLs in the content. The regex is meant to look only for the one starting with "src=", so the one starting with "url=" should be ignored. If url="XXX" is removed from the tag, it works fine.
msg368028 - Author: Sergio Rael (srael) Date: 2020-05-04 10:02
Sorry, this is not a deadlock. Python drives the CPU to 100% usage, but it takes so long that I don't know whether it can ever finish the task.
msg368030 - Author: Rémi Lapeyre (remi.lapeyre) Date: 2020-05-04 10:10
I don't think this is a deadlock. Rather, it is certainly related to the number of '*' quantifiers in your pattern: the regex engine has to try an exponentially growing number of ways to match.

You could try a simpler pattern to match your attribute; it should be faster.
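
A minimal sketch of what a simpler pattern could look like follows. It is not taken from the thread: the names TAG, URL, STRICT and LOOSE are made up for illustration. STRICT keeps the structure of the reported pattern but folds each alternation of single characters into one character class, so that no character can be matched by two different branches (in the original, a digit matches both \w and \d, and "_" matches both \w and _). LOOSE assumes it is enough to find src="..." anywhere inside the tag and that attribute values never contain "]" or a double quote. Note that the trailing part of the reported pattern does not allow ":", so with the url="https://..." attribute present the strict form cannot match at all; every attempt is bound to fail, which is the case where backtracking has to exhaust all alternatives.

```python
import re

# Tag from the report, split across lines for readability.
TAG = (
    '[et_pb_image _builder_version="3.27.2" '
    'src="https://www.somewhere.com/wp-content/uploads/2019/08/stabilizers.jpg" '
    'box_shadow_horizontal_tablet="0px" box_shadow_vertical_tablet="0px" '
    'box_shadow_blur_tablet="40px" box_shadow_spread_tablet="0px" '
    'z_index_tablet="500" url="https://youtu.be/fTrC5gkyYBM" '
    'url_new_window="on" /]'
)
URL = "https://www.somewhere.com/wp-content/uploads/2019/08/stabilizers.jpg"

# Same structure as the reported pattern, but every alternation of single
# characters is folded into one character class. \w already covers digits
# and "_", so no character can be matched by two different branches.
STRICT = re.compile(
    r'\[et_pb_image[\w=". /]*'
    r'src="(https?://(?:www\.)?\w*\.\w*[\w/.\-]*\.(?:png|jpg|jpeg|gif))"'
    r'[\w=". /%|]*/?\](?:\[/et_pb_image\])?'
)

# Looser, hypothetical alternative: skip anything up to src="..." inside the
# tag and capture the image URL; other attributes such as url= are ignored.
LOOSE = re.compile(
    r'\[et_pb_image\b[^\]]*?\bsrc="(https?://[^"\s]+\.(?:png|jpg|jpeg|gif))"'
)

print(STRICT.findall(TAG))  # [] because ":" in the url= value is not allowed
print(LOOSE.findall(TAG))   # list containing only the src URL
```

Whether the strict or the loose form is appropriate depends on how much of the tag really needs validating; if only the src value matters, the loose form is the easier one to reason about.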
msg368109 - Author: Sergio Rael (srael) Date: 2020-05-05 07:39
Thank you for your reply, Rémi.

I agree with you that the reason may be that the pattern is too complex. I just noticed that in Python 3.7 the same pattern finishes the findall almost instantaneously, but in 3.6 the CPU goes to 100% and it takes ages to finish. In fact, I don't know if it can finish at all, because it took so long that I had to stop it.
I thought it would be a good idea to let you know about this behaviour. Of course, after this, I don't use 3.6 anymore.

Thanks again!
msg368119 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2020-05-05 09:18
It is hard to say what the problem is, but it seems it was solved in 3.7. Either it was an optimization, or a bug fix that had this side effect. If it was a bug fix, it was one of the backward-incompatible bug fixes that are not backported to older versions.
History
Date User Action Args
2022-04-11 14:59:30  admin  set  github: 84676
2020-05-05 09:18:05  serhiy.storchaka  set  status: open -> closed; nosy: + serhiy.storchaka; messages: + msg368119; resolution: out of date; stage: resolved
2020-05-05 07:39:49  srael  set  messages: + msg368109
2020-05-04 10:10:21  remi.lapeyre  set  nosy: + remi.lapeyre; messages: + msg368030
2020-05-04 10:02:29  srael  set  messages: + msg368028; title: re.findall() deadlock on Python 3.6.10 -> re.findall() takes a long time (100% CPU usage) on Python 3.6.10
2020-05-04 09:54:10  srael  create