
tokenize spends a lot of time in re.compile(...) #87180

Closed
asottile mannequin opened this issue Jan 24, 2021 · 6 comments
Labels
3.9 (only security fixes) · 3.10 (only security fixes) · performance (Performance or resource usage) · stdlib (Python modules in the Lib dir)

Comments

@asottile
Mannequin

asottile mannequin commented Jan 24, 2021

BPO 43014
Nosy @stevendaprano, @serhiy-storchaka, @asottile, @pablogsal, @isidentical
PRs
  • bpo-43014: Improve performance of tokenize by 20-30% #24311
  • bpo-43014: Limit the max size of the cache in the tokenize module to 512 #24313
Files
  • out.pstats
  • out.svg
  • out2.pstats
  • out2.svg
  • out3.pstats
  • out3.svg

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2021-01-24.09:23:55.930>
    created_at = <Date 2021-01-24.08:34:14.386>
    labels = ['library', '3.9', '3.10', 'performance']
    title = 'tokenize spends a lot of time in `re.compile(...)`'
    updated_at = <Date 2021-01-24.17:32:38.720>
    user = 'https://github.com/asottile'

    bugs.python.org fields:

    activity = <Date 2021-01-24.17:32:38.720>
    actor = 'pablogsal'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-01-24.09:23:55.930>
    closer = 'BTaskaya'
    components = ['Library (Lib)']
    creation = <Date 2021-01-24.08:34:14.386>
    creator = 'Anthony Sottile'
    dependencies = []
    files = ['49759', '49760', '49761', '49762', '49763', '49764']
    hgrepos = []
    issue_num = 43014
    keywords = ['patch']
    message_count = 6.0
    messages = ['385572', '385573', '385574', '385575', '385576', '385577']
    nosy_count = 5.0
    nosy_names = ['steven.daprano', 'serhiy.storchaka', 'Anthony Sottile', 'pablogsal', 'BTaskaya']
    pr_nums = ['24311', '24313']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue43014'
    versions = ['Python 3.9', 'Python 3.10']

    @asottile
    Mannequin Author

    asottile mannequin commented Jan 24, 2021

    I did some profiling of running this script (pstats and svg files attached):

    import io
    import tokenize
    
    # picked as the second longest file in cpython
    with open('Lib/test/test_socket.py', 'rb') as f:
        bio = io.BytesIO(f.read())
    
    
    def main():
        for _ in range(10):
            bio.seek(0)
            for _ in tokenize.tokenize(bio.readline):
                pass
    
    if __name__ == '__main__':
        exit(main())

    The first profile is from before the optimization; the second is from after it.

    The optimization takes the execution from ~6300ms to ~4500ms on my machine (a 28% - 39% improvement, depending on how you calculate it).

    (I'll attach the pstats and svgs after creation; it seems I can only attach one file at a time.)
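
    For reference, a profile like the attached pstats files can be produced by swapping the exit(main()) call for something along these lines (a sketch using the standard cProfile/pstats modules; the exact invocation isn't shown in the issue, and the output filename here is only an example):

    import cProfile
    import pstats

    if __name__ == '__main__':
        # Collect cumulative timings into a pstats file, then print the hottest
        # entries; before the fix, re.compile / re._compile dominate the output.
        cProfile.run('main()', 'out.pstats')
        pstats.Stats('out.pstats').sort_stats('cumulative').print_stats(10)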

    asottile (mannequin) added the 3.9 (only security fixes), 3.10 (only security fixes), stdlib (Python modules in the Lib dir), and performance (Performance or resource usage) labels on Jan 24, 2021
    @asottile
    Mannequin Author

    asottile mannequin commented Jan 24, 2021

    Admittedly anecdotal, but here's another data point in addition to the attached profiles.

    test.test_tokenize suite before:

    $ ./python -m test.test_tokenize
    ..............................................................................

    Ran 78 tests in 77.148s

    OK

    test.test_tokenize suite after:

    $ ./python -m test.test_tokenize
    ..............................................................................

    Ran 78 tests in 61.269s

    OK

    @asottile
    Mannequin Author

    asottile mannequin commented Jan 24, 2021

    Attached out3.pstats / out3.svg, which represent the optimization using lru_cache instead.
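
    A sketch of what the lru_cache approach can look like (not necessarily the exact code merged via the PRs above): wrap re.compile in a memoized helper so repeated tokenize calls reuse already-compiled patterns, with the cache bounded as in the follow-up PR:

    import functools
    import re

    # Memoized wrapper: repeated calls with the same pattern string return the
    # cached compiled object instead of going through re.compile() every time.
    # The 512-entry bound mirrors the limit proposed in the follow-up PR.
    @functools.lru_cache(maxsize=512)
    def _compile(pattern):
        return re.compile(pattern, re.UNICODE)

    Call sites inside tokenize that previously compiled patterns on the fly can then route through such a helper instead.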

    @isidentical
    Sponsor Member

    New changeset 15bd9ef by Anthony Sottile in branch 'master':
    bpo-43014: Improve performance of tokenize.tokenize by 20-30%

    @stevendaprano
    Member

    Just for the record:

    The optimization takes the execution from ~6300ms to ~4500ms on my machine (representing a 28% - 39% improvement depending on how you calculate it)

    The correct answer is 28%, which uses the initial value as the base: (6300-4500)/6300 ≈ 28%. You are starting at 6300ms and speeding it up by 28%:

    >>> 6300 - 28/100*6300
    4536.0

    Using 4500 as the base would only make sense if you were calculating a slowdown from 4500ms to 6300ms: we start at 4500 and *increase* the time by 39%:

    >>> 4500 + 39/100*4500
    6255.0

    Hope this helps.

    @serhiy-storchaka
    Member

    re.compile() already uses caching, but it is less efficient for a few reasons.

    To Steven: the time is *reduced* by 28%, but the speed is *increased* by 39%.
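
    One rough way to see the difference (a machine-dependent micro-benchmark, not taken from the issue): re.compile() goes through re._compile() and its own cache lookup on every call, whereas an lru_cache wrapper returns the memoized object directly:

    import functools
    import re
    import timeit

    PATTERN = r'\w+'

    @functools.lru_cache(maxsize=512)
    def _compile(pattern):
        return re.compile(pattern)

    # Both paths hit a cache, but the per-call overhead differs.
    print(timeit.timeit(lambda: re.compile(PATTERN), number=100_000))
    print(timeit.timeit(lambda: _compile(PATTERN), number=100_000))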

    ezio-melotti transferred this issue from another repository on Apr 10, 2022