This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: tokenize spends a lot of time in `re.compile(...)`
Type: performance Stage: resolved
Components: Library (Lib) Versions: Python 3.10, Python 3.9
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Anthony Sottile, BTaskaya, pablogsal, serhiy.storchaka, steven.daprano
Priority: normal Keywords: patch

Created on 2021-01-24 08:34 by Anthony Sottile, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
out.pstats Anthony Sottile, 2021-01-24 08:34
out.svg Anthony Sottile, 2021-01-24 08:34
out2.pstats Anthony Sottile, 2021-01-24 08:34
out2.svg Anthony Sottile, 2021-01-24 08:34
out3.pstats Anthony Sottile, 2021-01-24 08:59
out3.svg Anthony Sottile, 2021-01-24 09:00
Pull Requests
URL Status Linked Edit
PR 24311 closed Anthony Sottile, 2021-01-24 08:38
PR 24313 closed pablogsal, 2021-01-24 17:32
Messages (6)
msg385572 - (view) Author: Anthony Sottile (Anthony Sottile) * Date: 2021-01-24 08:34
I did some profiling of running this script (the .pstats files and svgs are attached):

```python
import io
import tokenize

# picked as the second longest file in cpython
with open('Lib/test/test_socket.py', 'rb') as f:
    bio = io.BytesIO(f.read())


def main():
    for _ in range(10):
        bio.seek(0)
        for _ in tokenize.tokenize(bio.readline):
            pass

if __name__ == '__main__':
    exit(main())
```


The first profile (out.pstats / out.svg) is from before the optimization; the second (out2.pstats / out2.svg) is from after it.

The optimization takes the execution from ~6300ms to ~4500ms on my machine (representing a 28% - 39% improvement depending on how you calculate it)

(I'll attach the pstats and svgs after creation, seems I can only attach one file at once)
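
For anyone wanting to reproduce this, here is a minimal sketch of how profiles like the attached .pstats files can be generated with cProfile and pstats; the exact invocation and output names used for the attachments aren't stated above, so treat them as placeholders (the SVGs are typically rendered from a .pstats file with an external tool such as gprof2dot).

```python
import cProfile
import io
import pstats
import tokenize

# Hypothetical reproduction of the setup above; the input path and output
# filename are placeholders, not necessarily what produced the attachments.
with open('Lib/test/test_socket.py', 'rb') as f:
    bio = io.BytesIO(f.read())

def run():
    for _ in range(10):
        bio.seek(0)
        for _ in tokenize.tokenize(bio.readline):
            pass

cProfile.run('run()', 'out.pstats')  # write raw profiling stats to disk
pstats.Stats('out.pstats').sort_stats('cumulative').print_stats(20)
```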
msg385573 - (view) Author: Anthony Sottile (Anthony Sottile) * Date: 2021-01-24 08:47
admittedly anecdotal but here's another data point in addition to the profiles attached

test.test_tokenize suite before:

$ ./python -m test.test_tokenize
..............................................................................
----------------------------------------------------------------------
Ran 78 tests in 77.148s

OK


test.test_tokenize suite after:

$ ./python -m test.test_tokenize
..............................................................................
----------------------------------------------------------------------
Ran 78 tests in 61.269s

OK
msg385574 - (view) Author: Anthony Sottile (Anthony Sottile) * Date: 2021-01-24 09:00
Attached out3.pstats / out3.svg, which represent the optimization implemented with lru_cache instead.
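For context, a minimal sketch of what an lru_cache-based variant could look like; this is an illustration of the idea, not necessarily the exact code in the PR. The point is to memoize pattern compilation at module level so the tokenizer's repeated calls get an already-compiled regex back immediately.

```python
import functools
import re

# Illustrative helper (hypothetical name): repeated requests for the same
# pattern string return the cached compiled regex instead of going through
# re.compile() again.
@functools.lru_cache(maxsize=None)
def _compile(pattern):
    return re.compile(pattern, re.UNICODE)
```

Call sites inside the tokenizer would then use _compile(...) where they previously called re.compile(...).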
msg385575 - (view) Author: Batuhan Taskaya (BTaskaya) * (Python committer) Date: 2021-01-24 09:23
New changeset 15bd9efd01e44087664e78bf766865a6d2e06626 by Anthony Sottile in branch 'master':
bpo-43014: Improve performance of tokenize.tokenize by 20-30%
https://github.com/python/cpython/commit/15bd9efd01e44087664e78bf766865a6d2e06626
msg385576 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2021-01-24 09:38
Just for the record:

> The optimization takes the execution from ~6300ms to ~4500ms on my machine (representing a 28% - 39% improvement depending on how you calculate it)

The correct answer is 28%, which uses the initial value as the base: (6300-4500)/6300 ≈ 28%. You are starting at 6300ms and speeding it up by 28%:

>>> 6300 - 28/100*6300
4536.0

Using 4500 as the base would only make sense if you were calculating a slowdown from 4500ms to 6300ms: we started at 4500 and *increase* the time by 39%:

>>> 4500 + 39/100*4500
6255.0


Hope this helps.
msg385577 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-01-24 12:09
re.compile() already caches compiled patterns internally, but that cache is less efficient here for several reasons.

To Steven: the time is *reduced* by 28%, but the speed is *increased* by 39%.
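
To illustrate where the difference can come from (a rough sketch, not a rigorous benchmark): re.compile() does hit the re module's internal cache on repeated calls, but each call still goes through Python-level argument handling and cache lookup, which an lru_cache-wrapped helper mostly short-circuits. The pattern and iteration count below are arbitrary, and results will vary by machine and Python version.

```python
import functools
import re
import timeit

PATTERN = r'[a-zA-Z_]\w*'

@functools.lru_cache(maxsize=None)
def cached_compile(pattern):
    return re.compile(pattern)

# Both calls return the same compiled object every time; the measured
# difference is purely the per-call overhead of each caching layer.
print('re.compile    :', timeit.timeit(lambda: re.compile(PATTERN), number=100_000))
print('cached_compile:', timeit.timeit(lambda: cached_compile(PATTERN), number=100_000))
```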
History
Date User Action Args
2022-04-11 14:59:40  admin  set  github: 87180
2021-01-24 17:32:38  pablogsal  set  nosy: + pablogsal; pull_requests: + pull_request23132
2021-01-24 12:09:15  serhiy.storchaka  set  nosy: + serhiy.storchaka; messages: + msg385577
2021-01-24 09:38:57  steven.daprano  set  nosy: + steven.daprano; messages: + msg385576
2021-01-24 09:23:55  BTaskaya  set  status: open -> closed; resolution: fixed; stage: patch review -> resolved
2021-01-24 09:23:21  BTaskaya  set  nosy: + BTaskaya; messages: + msg385575
2021-01-24 09:00:06  Anthony Sottile  set  files: + out3.svg; messages: + msg385574
2021-01-24 08:59:45  Anthony Sottile  set  files: + out3.pstats
2021-01-24 08:47:45  Anthony Sottile  set  messages: + msg385573
2021-01-24 08:38:32  Anthony Sottile  set  keywords: + patch; stage: patch review; pull_requests: + pull_request23130
2021-01-24 08:34:35  Anthony Sottile  set  files: + out2.svg
2021-01-24 08:34:29  Anthony Sottile  set  files: + out2.pstats
2021-01-24 08:34:23  Anthony Sottile  set  files: + out.svg
2021-01-24 08:34:14  Anthony Sottile  create