Message 408909 - Python tracker

Author	gdr@garethrees.org
Recipients	ezio.melotti, gdr@garethrees.org, mrabarnett, ramzitra, serhiy.storchaka
Date	2021-12-19.16:10:44
Message-id	<1639930244.48.0.146443461221.issue46065@roundup.psfhosted.org>
In-reply-to

Content
The way to avoid this behaviour is to disallow the attempts at matching that you know are going to fail. As Serhiy described above, if the search fails starting at the first character of the string, it will move forward and try again starting at the second character. But you know that this new attempt must fail, so you can force the regular expression engine to discard the attempt immediately.

Here's an illustration in a simpler setting, where we are looking for every run of one or more 'a' characters followed by a 'b':

    >>> import re
    >>> from timeit import timeit
    >>> text = 'a' * 100000
    >>> timeit(lambda:re.findall(r'a+b', text), number=1)
    6.643531181000014
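
To see where the time goes, here is a rough model of the retry behaviour on a string with no 'b' in it. It is only a sketch: the function name failed_search_steps is made up for this illustration, and it counts characters examined by a naive left-to-right search, not what the re engine does internally.

```python
def failed_search_steps(text):
    """Count the 'a' characters consumed by a naive search for a+b.

    Simplified model (an assumption, not the real engine): from every
    start position, consume the run of 'a', find no 'b', give up, and
    retry from the next position.
    """
    steps = 0
    for start in range(len(text)):
        pos = start
        while pos < len(text) and text[pos] == 'a':
            pos += 1
            steps += 1
    return steps


small = failed_search_steps('a' * 100)   # 100 * 101 / 2 = 5050
large = failed_search_steps('a' * 200)   # 200 * 201 / 2 = 20100
```

Doubling the length roughly quadruples the work in this model, which is the quadratic growth that makes the 100000-character case slow.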

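We know that any successful match must be preceded by a character other than 'a' (or by the beginning of the string), so we can make most of the doomed attempts fail on their first character by requiring that context explicitly. The capturing group keeps findall returning only the a+b part: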

    >>> timeit(lambda:re.findall(r'(?:^|[^a])(a+b)', text), number=1)
    0.003743481000014981
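
One thing to watch with the pattern above is that the leading alternative consumes the preceding character, so a match that starts immediately after the previous match can be skipped. A possible variant, not part of the original suggestion, is a negative lookbehind, which checks the preceding character without consuming it. The names consuming, lookbehind and sample below are just for this sketch.

```python
import re

# Pattern from the message: the guard consumes one character of context.
consuming = re.compile(r'(?:^|[^a])(a+b)')

# Alternative sketch: a zero-width guard that consumes nothing.
lookbehind = re.compile(r'(?<!a)a+b')

sample = 'abab aab'

# The second 'ab' starts right after the first match, so the consuming
# guard has no preceding character left to match against.
consuming_hits = consuming.findall(sample)    # ['ab', 'aab']
lookbehind_hits = lookbehind.findall(sample)  # ['ab', 'ab', 'aab']

# Both guards reject every retry that starts in the middle of a run of 'a'.
no_hits = lookbehind.findall('a' * 100000)    # []
```

Whether the difference matters depends on the input: if matches can never be adjacent, the two patterns find the same things.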

In your case, a successful match must be preceded by a character matching [^a-zA-Z0-9_.+-] (or by the beginning of the string).
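
As a sketch of how that might look, here is the same guard applied to a made-up pattern. The pattern below is a hypothetical stand-in (a run of [a-zA-Z0-9_.+-] followed by '@' and a simple domain), not the regular expression from the original report, and the names LOCAL, TAIL, NAIVE, GUARDED and sample are assumptions for illustration.

```python
import re

# Hypothetical stand-in pattern, NOT the one from the original report.
LOCAL = r'[a-zA-Z0-9_.+-]'
TAIL = r'@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+'

NAIVE = re.compile(LOCAL + r'+' + TAIL)

# Same pattern, but a match may only start at the beginning of the string
# or right after a character that cannot belong to the leading run.
GUARDED = re.compile(r'(?:^|[^a-zA-Z0-9_.+-])(' + LOCAL + r'+' + TAIL + r')')

sample = 'contact: alice@example.com, bob+tag@example.org'

naive_hits = NAIVE.findall(sample)
guarded_hits = GUARDED.findall(sample)
# Both give ['alice@example.com', 'bob+tag@example.org'] on this sample.

# A long run with no '@': every retry inside the run is rejected at once.
long_run_hits = GUARDED.findall('a' * 100000)  # []
```

As in the simpler example, the group is there so that findall returns the match itself rather than the guard character in front of it.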
