Issue 37723: important performance regression on regular expression parsing

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/81904

classification

Title:	important performance regression on regular expression parsing
Type:	performance	Stage:	resolved
Components:	Regular Expressions	Versions:	Python 3.9, Python 3.8, Python 3.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	ezio.melotti, miss-islington, miss-islington, mrabarnett, serhiy.storchaka, yannvgn
Priority:	normal	Keywords:	patch

Created on 2019-07-30 22:02 by yannvgn, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked	Edit
PR 15030	merged	yannvgn, 2019-07-30 22:03
PR 15059	merged	miss-islington, 2019-07-31 18:50
PR 15060	merged	miss-islington, 2019-07-31 18:51

Messages (10)
msg348771 - (view)	Author: yannvgn (yannvgn) *	Date: 2019-07-30 22:02
On complex cases, parsing regular expressions takes much, much longer on Python >= 3.7 Example (ipython): In [1]: import re In [2]: char_list = ''.join([chr(i) for i in range(0xffff)]) In [3]: long_char_list = char_list * 10 In [4]: pattern = f'[{re.escape(long_char_list)}]' In [5]: %time compiled = re.compile(pattern) The test was run on Amazon Linux AMI 2017.03. On Python 3.6.1, the regexp compiled in 2.6 seconds: CPU times: user 2.59 s, sys: 30 ms, total: 2.62 s Wall time: 2.64 s On Python 3.7.3, the regexp compiled in 15 minutes (~350x increase in this case): CPU times: user 15min 6s, sys: 240 ms, total: 15min 7s Wall time: 15min 9s Doing some profiling with cProfile shows that the issue is caused by sre_parse._uniq function, which does not exist in Python <= 3.6 The complexity of this function is on average O(N^2) but can be easily reduced to O(N). The issue might not be noticeable with simple regexps, but programs like text tokenizers - which use complex regexps - might really be impacted by this regression.
msg348789 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2019-07-31 07:58
Indeed, it was not expected that the character set contains hundreds of thousands items. What is its size in your real code? Could you please show benchmarking results for different implementations and different sizes?
msg348813 - (view)	Author: yannvgn (yannvgn) *	Date: 2019-07-31 16:28
> Indeed, it was not expected that the character set contains hundreds of thousands items. What is its size in your real code? > Could you please show benchmarking results for different implementations and different sizes? I can't precisely answer that, but sacremoses (a tokenization package) for example is strongly impacted. See https://github.com/alvations/sacremoses/issues/61#issuecomment-516401853
msg348814 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2019-07-31 16:32
Oh, this is convincing.
msg348817 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2019-07-31 17:23
I've just had a look at _uniq, and the code surprises me. The obvious way to detect duplicates is with a set, but that requires the items to be hashable. Are they? Well, the first line of the function uses 'set', so they are. Why, then, isn't it using a set to detect the duplicates? How about this: def _uniq(items): newitems = [] seen = set() for item in items: if item not in seen: newitems.append(item) seen.add(item) return newitems
msg348821 - (view)	Author: yannvgn (yannvgn) *	Date: 2019-07-31 18:45
Hey Matthew, we decided to go for this, which is simpler and straightforward: def _uniq(items): return list(dict.fromkeys(items)) (see https://github.com/python/cpython/pull/15030)
msg348822 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2019-07-31 18:51
New changeset 9f55551f3df238e58315e724e50cb0d574d75b94 by Serhiy Storchaka (yannvgn) in branch 'master': bpo-37723: Fix performance regression on regular expression parsing. (GH-15030) https://github.com/python/cpython/commit/9f55551f3df238e58315e724e50cb0d574d75b94
msg348823 - (view)	Author: miss-islington (miss-islington)	Date: 2019-07-31 20:22
New changeset 33b700ba8cbb128519442eeed8c8747ff73f4524 by Miss Islington (bot) in branch '3.7': bpo-37723: Fix performance regression on regular expression parsing. (GH-15030) https://github.com/python/cpython/commit/33b700ba8cbb128519442eeed8c8747ff73f4524
msg348824 - (view)	Author: miss-islington (miss-islington)	Date: 2019-07-31 20:22
New changeset 77fcccb5321137456549b7f55b819f2c8a4c78a4 by Miss Islington (bot) in branch '3.8': bpo-37723: Fix performance regression on regular expression parsing. (GH-15030) https://github.com/python/cpython/commit/77fcccb5321137456549b7f55b819f2c8a4c78a4
msg348975 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2019-08-04 07:41
Thank you for your contribution yannvgn.

History
Date	User	Action	Args
2022-04-11 14:59:18	admin	set	github: 81904
2019-08-04 07:41:20	serhiy.storchaka	set	status: open -> closed resolution: fixed messages: + msg348975 stage: patch review -> resolved
2019-07-31 20:22:38	miss-islington	set	nosy: + miss-islington messages: + msg348824
2019-07-31 20:22:38	miss-islington	set	nosy: + miss-islington messages: + msg348823
2019-07-31 18:51:06	serhiy.storchaka	set	messages: + msg348822
2019-07-31 18:51:01	miss-islington	set	pull_requests: + pull_request14809
2019-07-31 18:50:54	miss-islington	set	pull_requests: + pull_request14808
2019-07-31 18:45:08	yannvgn	set	messages: + msg348821
2019-07-31 17:23:32	mrabarnett	set	messages: + msg348817
2019-07-31 16:32:51	serhiy.storchaka	set	messages: + msg348814
2019-07-31 16:28:12	yannvgn	set	messages: + msg348813
2019-07-31 07:58:00	serhiy.storchaka	set	messages: + msg348789
2019-07-31 06:24:39	serhiy.storchaka	set	nosy: + serhiy.storchaka
2019-07-30 22:03:50	yannvgn	set	keywords: + patch stage: patch review pull_requests: + pull_request14788
2019-07-30 22:02:34	yannvgn	create