
tokenize spends a lot of time in re.compile(...) #87180

Closed
asottile mannequin opened this issue Jan 24, 2021 · 6 comments
Labels
3.9 (only security fixes) · 3.10 (only security fixes) · performance (Performance or resource usage) · stdlib (Python modules in the Lib dir)

Comments

@asottile
Mannequin

asottile mannequin commented Jan 24, 2021

BPO 43014
Nosy @stevendaprano, @serhiy-storchaka, @asottile, @pablogsal, @isidentical
PRs
  • bpo-43014: Improve performance of tokenize by 20-30% #24311
  • bpo-43014: Limit the max size of the cache in the tokenize module to 512 #24313
Files
  • out.pstats
  • out.svg
  • out2.pstats
  • out2.svg
  • out3.pstats
  • out3.svg

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2021-01-24.09:23:55.930>
    created_at = <Date 2021-01-24.08:34:14.386>
    labels = ['library', '3.9', '3.10', 'performance']
    title = 'tokenize spends a lot of time in `re.compile(...)`'
    updated_at = <Date 2021-01-24.17:32:38.720>
    user = 'https://github.com/asottile'

    bugs.python.org fields:

    activity = <Date 2021-01-24.17:32:38.720>
    actor = 'pablogsal'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-01-24.09:23:55.930>
    closer = 'BTaskaya'
    components = ['Library (Lib)']
    creation = <Date 2021-01-24.08:34:14.386>
    creator = 'Anthony Sottile'
    dependencies = []
    files = ['49759', '49760', '49761', '49762', '49763', '49764']
    hgrepos = []
    issue_num = 43014
    keywords = ['patch']
    message_count = 6.0
    messages = ['385572', '385573', '385574', '385575', '385576', '385577']
    nosy_count = 5.0
    nosy_names = ['steven.daprano', 'serhiy.storchaka', 'Anthony Sottile', 'pablogsal', 'BTaskaya']
    pr_nums = ['24311', '24313']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue43014'
    versions = ['Python 3.9', 'Python 3.10']

    @asottile
    Mannequin Author

    asottile mannequin commented Jan 24, 2021

    I did some profiling of running this script (pstats and svg files attached):

    import io
    import tokenize
    
    # picked as the second longest file in cpython
    with open('Lib/test/test_socket.py', 'rb') as f:
        bio = io.BytesIO(f.read())
    
    
    def main():
        for _ in range(10):
            bio.seek(0)
            for _ in tokenize.tokenize(bio.readline):
                pass
    
    if __name__ == '__main__':
        exit(main())

    The first profile is from before the optimization; the second is from after it.

    The optimization takes the execution from ~6300ms to ~4500ms on my machine (a 28% - 39% improvement, depending on how you calculate it).

    (I'll attach the pstats and svgs after creation; it seems I can only attach one file at a time.)
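
    For reference, a profile like the attached pstats files can be produced by swapping the exit(main()) call for something along these lines (a sketch using the standard cProfile/pstats modules; the exact invocation isn't shown in the issue, and the output filename here is only an example):

    import cProfile
    import pstats

    if __name__ == '__main__':
        # Collect cumulative timings into a pstats file, then print the hottest
        # entries; before the fix, re.compile / re._compile dominate the output.
        cProfile.run('main()', 'out.pstats')
        pstats.Stats('out.pstats').sort_stats('cumulative').print_stats(10)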

    asottile (mannequin) added the 3.9 (only security fixes), 3.10 (only security fixes), stdlib (Python modules in the Lib dir), and performance (Performance or resource usage) labels on Jan 24, 2021
    @asottile
    Mannequin Author

    asottile mannequin commented Jan 24, 2021

    Admittedly anecdotal, but here's another data point in addition to the attached profiles.

    test.test_tokenize suite before:

    $ ./python -m test.test_tokenize
    ..............................................................................

    Ran 78 tests in 77.148s

    OK

    test.test_tokenize suite after:

    $ ./python -m test.test_tokenize
    ..............................................................................

    Ran 78 tests in 61.269s

    OK

    @asottile
    Mannequin Author

    asottile mannequin commented Jan 24, 2021

    Attached out3.pstats / out3.svg, which represent the optimization using lru_cache instead.
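
    A sketch of what the lru_cache approach can look like (not necessarily the exact code merged via the PRs above): wrap re.compile in a memoized helper so repeated tokenize calls reuse already-compiled patterns, with the cache bounded as in the follow-up PR:

    import functools
    import re

    # Memoized wrapper: repeated calls with the same pattern string return the
    # cached compiled object instead of going through re.compile() every time.
    # The 512-entry bound mirrors the limit proposed in the follow-up PR.
    @functools.lru_cache(maxsize=512)
    def _compile(pattern):
        return re.compile(pattern, re.UNICODE)

    Call sites inside tokenize that previously compiled patterns on the fly can then route through such a helper instead.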

    @isidentical
    Sponsor Member

    New changeset 15bd9ef by Anthony Sottile in branch 'master':
    bpo-43014: Improve performance of tokenize.tokenize by 20-30%

    @stevendaprano
    Member

    Just for the record:

    The optimization takes the execution from ~6300ms to ~4500ms on my machine (representing a 28% - 39% improvement depending on how you calculate it)

    The correct answer is 28%, which uses the initial value as the base: (6300-4500)/6300 ≈ 28%. You are starting at 6300ms and speeding it up by 28%:

    >>> 6300 - 28/100*6300
    4536.0

    Using 4500 as the base would only make sense if you were calculating a slowdown from 4500ms to 6300ms: we start at 4500 and *increase* the time by 39%:

    >>> 4500 + 39/100*4500
    6255.0

    Hope this helps.

    @serhiy-storchaka
    Member

    re.compile() already uses caching, but it is less efficient for a few reasons.

    To Steven: the time is *reduced* by 28%, but the speed is *increased* by 39%.
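
    One rough way to see the difference (a machine-dependent micro-benchmark, not taken from the issue): re.compile() goes through re._compile() and its own cache lookup on every call, whereas an lru_cache wrapper returns the memoized object directly:

    import functools
    import re
    import timeit

    PATTERN = r'\w+'

    @functools.lru_cache(maxsize=512)
    def _compile(pattern):
        return re.compile(pattern)

    # Both paths hit a cache, but the per-call overhead differs.
    print(timeit.timeit(lambda: re.compile(PATTERN), number=100_000))
    print(timeit.timeit(lambda: _compile(PATTERN), number=100_000))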

    ezio-melotti transferred this issue from another repository on Apr 10, 2022