classification
Title: important performance regression on regular expression parsing
Type: performance Stage: resolved
Components: Regular Expressions Versions: Python 3.9, Python 3.8, Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, miss-islington, miss-islington, mrabarnett, serhiy.storchaka, yannvgn
Priority: normal Keywords: patch

Created on 2019-07-30 22:02 by yannvgn, last changed 2019-08-04 07:41 by serhiy.storchaka. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 15030 merged yannvgn, 2019-07-30 22:03
PR 15059 merged miss-islington, 2019-07-31 18:50
PR 15060 merged miss-islington, 2019-07-31 18:51
Messages (10)
msg348771 - (view) Author: yannvgn (yannvgn) * Date: 2019-07-30 22:02
On complex cases, parsing regular expressions takes much, much longer on Python >= 3.7

Example (ipython):

In [1]: import re
In [2]: char_list = ''.join([chr(i) for i in range(0xffff)])
In [3]: long_char_list = char_list * 10
In [4]: pattern = f'[{re.escape(long_char_list)}]'
In [5]: %time compiled = re.compile(pattern)

The test was run on Amazon Linux AMI 2017.03.

On Python 3.6.1, the regexp compiled in 2.6 seconds:
CPU times: user 2.59 s, sys: 30 ms, total: 2.62 s
Wall time: 2.64 s

On Python 3.7.3, the regexp compiled in 15 minutes (~350x increase in this case):
CPU times: user 15min 6s, sys: 240 ms, total: 15min 7s
Wall time: 15min 9s

Doing some profiling with cProfile shows that the issue is caused by sre_parse._uniq function, which does not exist in Python <= 3.6

The complexity of this function is on average O(N^2) but can be easily reduced to O(N).

The issue might not be noticeable with simple regexps, but programs like text tokenizers - which use complex regexps - might really be impacted by this regression.
msg348789 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-07-31 07:58
Indeed, it was not expected that the character set contains hundreds of thousands items. What is its size in your real code?

Could you please show benchmarking results for different implementations and different sizes?
msg348813 - (view) Author: yannvgn (yannvgn) * Date: 2019-07-31 16:28
> Indeed, it was not expected that the character set contains hundreds of thousands items. What is its size in your real code?

> Could you please show benchmarking results for different implementations and different sizes?

I can't precisely answer that, but sacremoses (a tokenization package) for example is strongly impacted. See https://github.com/alvations/sacremoses/issues/61#issuecomment-516401853
msg348814 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-07-31 16:32
Oh, this is convincing.
msg348817 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2019-07-31 17:23
I've just had a look at _uniq, and the code surprises me.

The obvious way to detect duplicates is with a set, but that requires the items to be hashable. Are they?

Well, the first line of the function uses 'set', so they are.

Why, then, isn't it using a set to detect the duplicates?

How about this:

def _uniq(items):
    newitems = []
    seen = set()
    for item in items:
        if item not in seen:
            newitems.append(item)
            seen.add(item)
    return newitems
msg348821 - (view) Author: yannvgn (yannvgn) * Date: 2019-07-31 18:45
Hey Matthew,

we decided to go for this, which is simpler and straightforward:

def _uniq(items):
    return list(dict.fromkeys(items))

(see https://github.com/python/cpython/pull/15030)
msg348822 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-07-31 18:51
New changeset 9f55551f3df238e58315e724e50cb0d574d75b94 by Serhiy Storchaka (yannvgn) in branch 'master':
bpo-37723: Fix performance regression on regular expression parsing. (GH-15030)
https://github.com/python/cpython/commit/9f55551f3df238e58315e724e50cb0d574d75b94
msg348823 - (view) Author: miss-islington (miss-islington) Date: 2019-07-31 20:22
New changeset 33b700ba8cbb128519442eeed8c8747ff73f4524 by Miss Islington (bot) in branch '3.7':
bpo-37723: Fix performance regression on regular expression parsing. (GH-15030)
https://github.com/python/cpython/commit/33b700ba8cbb128519442eeed8c8747ff73f4524
msg348824 - (view) Author: miss-islington (miss-islington) Date: 2019-07-31 20:22
New changeset 77fcccb5321137456549b7f55b819f2c8a4c78a4 by Miss Islington (bot) in branch '3.8':
bpo-37723: Fix performance regression on regular expression parsing. (GH-15030)
https://github.com/python/cpython/commit/77fcccb5321137456549b7f55b819f2c8a4c78a4
msg348975 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-08-04 07:41
Thank you for your contribution yannvgn.
History
Date User Action Args
2019-08-04 07:41:20serhiy.storchakasetstatus: open -> closed
resolution: fixed
messages: + msg348975

stage: patch review -> resolved
2019-07-31 20:22:38miss-islingtonsetnosy: + miss-islington
messages: + msg348824
2019-07-31 20:22:38miss-islingtonsetnosy: + miss-islington
messages: + msg348823
2019-07-31 18:51:06serhiy.storchakasetmessages: + msg348822
2019-07-31 18:51:01miss-islingtonsetpull_requests: + pull_request14809
2019-07-31 18:50:54miss-islingtonsetpull_requests: + pull_request14808
2019-07-31 18:45:08yannvgnsetmessages: + msg348821
2019-07-31 17:23:32mrabarnettsetmessages: + msg348817
2019-07-31 16:32:51serhiy.storchakasetmessages: + msg348814
2019-07-31 16:28:12yannvgnsetmessages: + msg348813
2019-07-31 07:58:00serhiy.storchakasetmessages: + msg348789
2019-07-31 06:24:39serhiy.storchakasetnosy: + serhiy.storchaka
2019-07-30 22:03:50yannvgnsetkeywords: + patch
stage: patch review
pull_requests: + pull_request14788
2019-07-30 22:02:34yannvgncreate