This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

Author yannvgn
Recipients ezio.melotti, mrabarnett, yannvgn
Date 2019-07-30.22:02:33
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1564524154.39.0.245394903935.issue37723@roundup.psfhosted.org>
In-reply-to
Content
In complex cases, compiling regular expressions takes much, much longer on Python >= 3.7.

Example (ipython):

In [1]: import re
In [2]: char_list = ''.join([chr(i) for i in range(0xffff)])
In [3]: long_char_list = char_list * 10
In [4]: pattern = f'[{re.escape(long_char_list)}]'
In [5]: %time compiled = re.compile(pattern)

The test was run on Amazon Linux AMI 2017.03.

On Python 3.6.1, the regexp compiled in 2.6 seconds:
CPU times: user 2.59 s, sys: 30 ms, total: 2.62 s
Wall time: 2.64 s

On Python 3.7.3, the regexp compiled in 15 minutes (~350x increase in this case):
CPU times: user 15min 6s, sys: 240 ms, total: 15min 7s
Wall time: 15min 9s

Profiling with cProfile shows that the slowdown is caused by the sre_parse._uniq function, which does not exist in Python <= 3.6.
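For reference, a minimal way to reproduce this kind of profiling (using a smaller pattern than the original reproduction so it finishes quickly; in recent Python versions the parser lives in re._parser rather than sre_parse, so the function name shown in the stats may differ):

```python
import cProfile
import io
import pstats
import re

# A moderately complex character-class pattern, scaled down from the
# original reproduction so profiling completes in a reasonable time.
char_list = ''.join(chr(i) for i in range(0x1000))
pattern = f'[{re.escape(char_list * 2)}]'

profiler = cProfile.Profile()
profiler.enable()
compiled = re.compile(pattern)
profiler.disable()

# Print the ten functions with the highest cumulative time; on an
# affected Python 3.7, sre_parse._uniq dominates the output.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(10)
print(stream.getvalue())
```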

The average-case complexity of this function is O(N^2), but it can easily be reduced to O(N).
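To illustrate the point (a sketch, not the exact stdlib code): deduplicating with a list membership test is quadratic, because `item not in newitems` scans the list on every iteration; an order-preserving dict does the same job in linear time on average.

```python
def uniq_quadratic(items):
    """Deduplicate while preserving order -- O(N^2) on average.

    The membership test on a list is O(N), and it runs once per item.
    """
    newitems = []
    for item in items:
        if item not in newitems:
            newitems.append(item)
    return newitems


def uniq_linear(items):
    """Deduplicate while preserving order -- O(N) on average.

    dict preserves insertion order (guaranteed since Python 3.7), and
    dict key lookup/insertion is O(1) on average.
    """
    return list(dict.fromkeys(items))
```

Both functions return the same result for the same input; only the asymptotic cost differs.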

The issue may not be noticeable with simple regexps, but programs like text tokenizers, which use complex regexps, can be severely impacted by this regression.
History
Date User Action Args
2019-07-30 22:02:34  yannvgn  set  recipients: + yannvgn, ezio.melotti, mrabarnett
2019-07-30 22:02:34  yannvgn  set  messageid: <1564524154.39.0.245394903935.issue37723@roundup.psfhosted.org>
2019-07-30 22:02:34  yannvgn  link  issue37723 messages
2019-07-30 22:02:33  yannvgn  create