
Improve performance of list and tuple repeat methods #91247

Closed

eendebakpt opened this issue Mar 22, 2022 · 14 comments
Labels
3.11 (bug and security fixes), interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage)

Comments

eendebakpt commented Mar 22, 2022

BPO 47091
Nosy @eendebakpt
PRs
  • gh-91247: improve performance of list and tuple repeat #32045

    eendebakpt commented Mar 22, 2022

    The approach is similar to #31856 and #31999.

    eendebakpt commented Mar 22, 2022

    The special case of a repeat with n=1 does not use memcpy; an implementation that did use memcpy showed no performance improvement.
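
    As a rough illustration of why (a minimal sketch with a hypothetical helper name, not the CPython source): for n=1 a repeat degenerates into a single element-wise copy, and a memcpy of the pointer block would still need a second pass to bump the reference counts, so both variants do essentially the same work.

    #include <Python.h>

    /* Sketch: copy one run of list items; the refcount pass is unavoidable. */
    static void
    copy_items_once(PyObject **dest, PyObject **src, Py_ssize_t len)
    {
        for (Py_ssize_t i = 0; i < len; i++) {
            PyObject *v = src[i];
            Py_INCREF(v);    /* each element gains one reference */
            dest[i] = v;
        }
        /* memcpy(dest, src, len * sizeof(PyObject *)) would save the stores
           above, but the Py_INCREF loop over len items remains. */
    }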

    eendebakpt commented Apr 12, 2022

    After rebasing onto main and compiling with optimizations enabled, there are new performance results. Two PRs improve performance, in particular for small lists (e.g. 2 to 10 elements) repeated a large number of times.

    The pyperformance results show no improvement, but the microbenchmarks do; details are in the next post.

    Pyperformance results: main against #91482

    Note: some benchmarks are missing as they fail on the test system

    float: Mean +- std dev: [base] 96.6 ms +- 1.6 ms -> [patch] 97.7 ms +- 3.3 ms: 1.01x slower
    hexiom: Mean +- std dev: [base] 7.69 ms +- 0.04 ms -> [patch] 7.64 ms +- 0.04 ms: 1.01x faster
    json_dumps: Mean +- std dev: [base] 15.1 ms +- 0.2 ms -> [patch] 15.0 ms +- 0.2 ms: 1.01x faster
    json_loads: Mean +- std dev: [base] 27.7 us +- 0.3 us -> [patch] 27.5 us +- 0.2 us: 1.01x faster
    logging_silent: Mean +- std dev: [base] 131 ns +- 4 ns -> [patch] 132 ns +- 1 ns: 1.01x slower
    logging_simple: Mean +- std dev: [base] 7.46 us +- 0.12 us -> [patch] 7.34 us +- 0.09 us: 1.02x faster
    meteor_contest: Mean +- std dev: [base] 124 ms +- 1 ms -> [patch] 126 ms +- 1 ms: 1.02x slower
    nbody: Mean +- std dev: [base] 111 ms +- 1 ms -> [patch] 113 ms +- 4 ms: 1.01x slower
    nqueens: Mean +- std dev: [base] 107 ms +- 1 ms -> [patch] 105 ms +- 1 ms: 1.02x faster
    pathlib: Mean +- std dev: [base] 22.8 ms +- 1.2 ms -> [patch] 23.8 ms +- 2.4 ms: 1.04x slower
    pickle: Mean +- std dev: [base] 10.6 us +- 0.4 us -> [patch] 11.0 us +- 0.8 us: 1.03x slower
    pickle_dict: Mean +- std dev: [base] 30.7 us +- 0.2 us -> [patch] 30.5 us +- 0.8 us: 1.01x faster
    pickle_list: Mean +- std dev: [base] 4.47 us +- 0.05 us -> [patch] 4.40 us +- 0.08 us: 1.02x faster
    pickle_pure_python: Mean +- std dev: [base] 370 us +- 2 us -> [patch] 372 us +- 3 us: 1.01x slower
    raytrace: Mean +- std dev: [base] 380 ms +- 9 ms -> [patch] 392 ms +- 16 ms: 1.03x slower
    regex_compile: Mean +- std dev: [base] 164 ms +- 2 ms -> [patch] 163 ms +- 2 ms: 1.01x faster
    regex_effbot: Mean +- std dev: [base] 3.52 ms +- 0.10 ms -> [patch] 3.48 ms +- 0.03 ms: 1.01x faster
    regex_v8: Mean +- std dev: [base] 29.2 ms +- 0.7 ms -> [patch] 30.0 ms +- 1.8 ms: 1.03x slower
    scimark_lu: Mean +- std dev: [base] 143 ms +- 1 ms -> [patch] 148 ms +- 6 ms: 1.03x slower
    scimark_sor: Mean +- std dev: [base] 154 ms +- 1 ms -> [patch] 153 ms +- 2 ms: 1.01x faster
    scimark_sparse_mat_mult: Mean +- std dev: [base] 6.32 ms +- 0.45 ms -> [patch] 5.69 ms +- 0.09 ms: 1.11x faster
    telco: Mean +- std dev: [base] 8.10 ms +- 0.28 ms -> [patch] 7.85 ms +- 0.21 ms: 1.03x faster
    unpickle: Mean +- std dev: [base] 15.0 us +- 0.2 us -> [patch] 15.2 us +- 0.7 us: 1.02x slower
    unpickle_list: Mean +- std dev: [base] 5.85 us +- 0.06 us -> [patch] 5.96 us +- 0.06 us: 1.02x slower
    unpickle_pure_python: Mean +- std dev: [base] 299 us +- 2 us -> [patch] 304 us +- 11 us: 1.02x slower
    
    Benchmark hidden because not significant (21): 2to3, chaos, deltablue, fannkuch, go, logging_format, pidigits, pyflate, python_startup, python_startup_no_site, regex_dna, richards, scimark_fft, scimark_monte_carlo, spectral_norm, sqlite_synth, unpack_sequence, xml_etree_parse, xml_etree_iterparse, xml_etree_generate, xml_etree_process
    
    Geometric mean: 1.00x slower
    
    Pyperformance results: main against #32045

    Note: some benchmarks are missing as they fail on the test system

    2to3: Mean +- std dev: [base] 321 ms +- 3 ms -> [patch_no_sp] 326 ms +- 4 ms: 1.02x slower
    fannkuch: Mean +- std dev: [base] 517 ms +- 13 ms -> [patch_no_sp] 542 ms +- 19 ms: 1.05x slower
    float: Mean +- std dev: [base] 96.6 ms +- 1.6 ms -> [patch_no_sp] 100 ms +- 3 ms: 1.04x slower
    go: Mean +- std dev: [base] 165 ms +- 1 ms -> [patch_no_sp] 168 ms +- 5 ms: 1.02x slower
    hexiom: Mean +- std dev: [base] 7.69 ms +- 0.04 ms -> [patch_no_sp] 8.17 ms +- 0.35 ms: 1.06x slower
    json_dumps: Mean +- std dev: [base] 15.1 ms +- 0.2 ms -> [patch_no_sp] 14.9 ms +- 0.1 ms: 1.02x faster
    json_loads: Mean +- std dev: [base] 27.7 us +- 0.3 us -> [patch_no_sp] 27.3 us +- 0.3 us: 1.02x faster
    logging_format: Mean +- std dev: [base] 8.00 us +- 0.09 us -> [patch_no_sp] 8.07 us +- 0.13 us: 1.01x slower
    logging_silent: Mean +- std dev: [base] 131 ns +- 4 ns -> [patch_no_sp] 138 ns +- 5 ns: 1.05x slower
    logging_simple: Mean +- std dev: [base] 7.46 us +- 0.12 us -> [patch_no_sp] 7.39 us +- 0.15 us: 1.01x faster
    meteor_contest: Mean +- std dev: [base] 124 ms +- 1 ms -> [patch_no_sp] 128 ms +- 1 ms: 1.03x slower
    nbody: Mean +- std dev: [base] 111 ms +- 1 ms -> [patch_no_sp] 113 ms +- 4 ms: 1.01x slower
    nqueens: Mean +- std dev: [base] 107 ms +- 1 ms -> [patch_no_sp] 111 ms +- 4 ms: 1.04x slower
    pickle: Mean +- std dev: [base] 10.6 us +- 0.4 us -> [patch_no_sp] 10.8 us +- 0.5 us: 1.02x slower
    pickle_dict: Mean +- std dev: [base] 30.7 us +- 0.2 us -> [patch_no_sp] 31.4 us +- 1.0 us: 1.02x slower
    pidigits: Mean +- std dev: [base] 211 ms +- 2 ms -> [patch_no_sp] 211 ms +- 1 ms: 1.00x faster
    pyflate: Mean +- std dev: [base] 533 ms +- 7 ms -> [patch_no_sp] 555 ms +- 21 ms: 1.04x slower
    raytrace: Mean +- std dev: [base] 380 ms +- 9 ms -> [patch_no_sp] 396 ms +- 19 ms: 1.04x slower
    regex_compile: Mean +- std dev: [base] 164 ms +- 2 ms -> [patch_no_sp] 168 ms +- 5 ms: 1.03x slower
    regex_dna: Mean +- std dev: [base] 239 ms +- 2 ms -> [patch_no_sp] 242 ms +- 7 ms: 1.01x slower
    richards: Mean +- std dev: [base] 57.1 ms +- 1.3 ms -> [patch_no_sp] 58.9 ms +- 2.0 ms: 1.03x slower
    scimark_lu: Mean +- std dev: [base] 143 ms +- 1 ms -> [patch_no_sp] 144 ms +- 4 ms: 1.01x slower
    scimark_monte_carlo: Mean +- std dev: [base] 87.3 ms +- 2.4 ms -> [patch_no_sp] 88.7 ms +- 4.2 ms: 1.02x slower
    scimark_sparse_mat_mult: Mean +- std dev: [base] 6.32 ms +- 0.45 ms -> [patch_no_sp] 6.01 ms +- 0.25 ms: 1.05x faster
    unpack_sequence: Mean +- std dev: [base] 52.7 ns +- 2.1 ns -> [patch_no_sp] 52.1 ns +- 0.6 ns: 1.01x faster
    unpickle_pure_python: Mean +- std dev: [base] 299 us +- 2 us -> [patch_no_sp] 296 us +- 5 us: 1.01x faster
    
    Benchmark hidden because not significant (20): chaos, deltablue, pathlib, pickle_list, pickle_pure_python, python_startup, python_startup_no_site, regex_effbot, regex_v8, scimark_fft, scimark_sor, spectral_norm, sqlite_synth, telco, unpickle, unpickle_list, xml_etree_parse, xml_etree_iterparse, xml_etree_generate, xml_etree_process
    
    Geometric mean: 1.01x slower
    

    eendebakpt commented Apr 12, 2022

    A microbenchmark with a wide range of tests shows a performance improvement of about 12%, with some benchmarks improving much more.
    The benchmark:

    import pyperf

    runner = pyperf.Runner()

    # shared setup: a small list, a 1-element list and a 4-element tuple
    setup = 'a=[1,2,3]; a1=[1]; t=(1,2,3,4)'

    for n in [1, 2, 10, 1000]:
        runner.timeit(name=f"list(100) repeat {n}", stmt=f"x=a*{n}; y=a*{n}",
                      setup='a=[1.]*100')
        runner.timeit(name=f"list(3) repeat {n}", stmt=f"x=a*{n}; y=a*{n}",
                      setup=setup)
        runner.timeit(name=f"list(1) repeat {n}", stmt=f"x=a1*{n}",
                      setup=setup)
        runner.timeit(name=f"list(3) repeat inplace {n}",
                      stmt=f"a=[1,2,None]; a*={n}", setup=setup)
        runner.timeit(name=f"tuple(4) repeat {n}", stmt=f"x=t*{n}; y=t*{n}",
                      setup=setup)

    runner.timeit(name="[control] list pop+append",
                  stmt="x=a.pop(); a.append(1)", setup=setup)
    
    Results for #91482 (has specializations)
    list(100) repeat 1: Mean +- std dev: [base_test] 793 ns +- 6 ns -> [patch_test_no_sp] 796 ns +- 9 ns: 1.00x slower
    list(3) repeat 1: Mean +- std dev: [base_test] 142 ns +- 2 ns -> [patch_test_no_sp] 136 ns +- 6 ns: 1.04x faster
    list(1) repeat 1: Mean +- std dev: [base_test] 60.8 ns +- 0.4 ns -> [patch_test_no_sp] 60.4 ns +- 0.8 ns: 1.01x faster
    list(3) repeat inplace 1: Mean +- std dev: [base_test] 75.3 ns +- 2.2 ns -> [patch_test_no_sp] 71.9 ns +- 0.5 ns: 1.05x faster
    tuple(4) repeat 1: Mean +- std dev: [base_test] 47.5 ns +- 1.2 ns -> [patch_test_no_sp] 48.5 ns +- 0.4 ns: 1.02x slower
    list(100) repeat 2: Mean +- std dev: [base_test] 1.28 us +- 0.01 us -> [patch_test_no_sp] 1.27 us +- 0.02 us: 1.01x faster
    list(3) repeat 2: Mean +- std dev: [base_test] 158 ns +- 3 ns -> [patch_test_no_sp] 151 ns +- 1 ns: 1.05x faster
    list(1) repeat 2: Mean +- std dev: [base_test] 62.9 ns +- 0.8 ns -> [patch_test_no_sp] 63.3 ns +- 0.5 ns: 1.01x slower
    list(3) repeat inplace 2: Mean +- std dev: [base_test] 106 ns +- 3 ns -> [patch_test_no_sp] 110 ns +- 1 ns: 1.04x slower
    tuple(4) repeat 2: Mean +- std dev: [base_test] 140 ns +- 1 ns -> [patch_test_no_sp] 135 ns +- 3 ns: 1.04x faster
    list(100) repeat 10: Mean +- std dev: [base_test] 4.32 us +- 0.01 us -> [patch_test_no_sp] 4.22 us +- 0.05 us: 1.02x faster
    list(3) repeat 10: Mean +- std dev: [base_test] 270 ns +- 3 ns -> [patch_test_no_sp] 202 ns +- 5 ns: 1.34x faster
    list(1) repeat 10: Mean +- std dev: [base_test] 70.4 ns +- 1.4 ns -> [patch_test_no_sp] 81.3 ns +- 1.5 ns: 1.15x slower
    list(3) repeat inplace 10: Mean +- std dev: [base_test] 148 ns +- 2 ns -> [patch_test_no_sp] 137 ns +- 1 ns: 1.08x faster
    tuple(4) repeat 10: Mean +- std dev: [base_test] 241 ns +- 15 ns -> [patch_test_no_sp] 249 ns +- 6 ns: 1.03x slower
    list(100) repeat 1000: Mean +- std dev: [base_test] 408 us +- 2 us -> [patch_test_no_sp] 419 us +- 1 us: 1.03x slower
    list(3) repeat 1000: Mean +- std dev: [base_test] 18.7 us +- 0.1 us -> [patch_test_no_sp] 5.46 us +- 0.17 us: 3.43x faster
    list(1) repeat 1000: Mean +- std dev: [base_test] 2.00 us +- 0.01 us -> [patch_test_no_sp] 1.96 us +- 0.01 us: 1.02x faster
    list(3) repeat inplace 1000: Mean +- std dev: [base_test] 4.82 us +- 0.04 us -> [patch_test_no_sp] 2.63 us +- 0.04 us: 1.83x faster
    tuple(4) repeat 1000: Mean +- std dev: [base_test] 9.16 us +- 0.34 us -> [patch_test_no_sp] 6.87 us +- 0.41 us: 1.33x faster
    
    Benchmark hidden because not significant (1): [control] list pop+append
    
    Geometric mean: 1.12x faster
    
    Results for #32045 (no specializations)
    list(100) repeat 1: Mean +- std dev: [base_test] 793 ns +- 6 ns -> [patch_test] 806 ns +- 8 ns: 1.02x slower
    list(3) repeat 1: Mean +- std dev: [base_test] 142 ns +- 2 ns -> [patch_test] 139 ns +- 4 ns: 1.02x faster
    list(1) repeat 1: Mean +- std dev: [base_test] 60.8 ns +- 0.4 ns -> [patch_test] 59.2 ns +- 0.5 ns: 1.03x faster
    tuple(4) repeat 1: Mean +- std dev: [base_test] 47.5 ns +- 1.2 ns -> [patch_test] 48.6 ns +- 1.7 ns: 1.02x slower
    list(3) repeat 2: Mean +- std dev: [base_test] 158 ns +- 3 ns -> [patch_test] 151 ns +- 2 ns: 1.04x faster
    list(3) repeat inplace 2: Mean +- std dev: [base_test] 106 ns +- 3 ns -> [patch_test] 110 ns +- 1 ns: 1.04x slower
    tuple(4) repeat 2: Mean +- std dev: [base_test] 140 ns +- 1 ns -> [patch_test] 138 ns +- 4 ns: 1.01x faster
    list(100) repeat 10: Mean +- std dev: [base_test] 4.32 us +- 0.01 us -> [patch_test] 4.23 us +- 0.01 us: 1.02x faster
    list(3) repeat 10: Mean +- std dev: [base_test] 270 ns +- 3 ns -> [patch_test] 207 ns +- 13 ns: 1.30x faster
    list(1) repeat 10: Mean +- std dev: [base_test] 70.4 ns +- 1.4 ns -> [patch_test] 72.1 ns +- 3.5 ns: 1.02x slower
    list(3) repeat inplace 10: Mean +- std dev: [base_test] 148 ns +- 2 ns -> [patch_test] 136 ns +- 1 ns: 1.08x faster
    tuple(4) repeat 10: Mean +- std dev: [base_test] 241 ns +- 15 ns -> [patch_test] 250 ns +- 2 ns: 1.04x slower
    list(100) repeat 1000: Mean +- std dev: [base_test] 408 us +- 2 us -> [patch_test] 422 us +- 5 us: 1.03x slower
    list(3) repeat 1000: Mean +- std dev: [base_test] 18.7 us +- 0.1 us -> [patch_test] 5.42 us +- 0.27 us: 3.45x faster
    list(1) repeat 1000: Mean +- std dev: [base_test] 2.00 us +- 0.01 us -> [patch_test] 2.00 us +- 0.01 us: 1.00x faster
    list(3) repeat inplace 1000: Mean +- std dev: [base_test] 4.82 us +- 0.04 us -> [patch_test] 2.81 us +- 0.03 us: 1.72x faster
    tuple(4) repeat 1000: Mean +- std dev: [base_test] 9.16 us +- 0.34 us -> [patch_test] 6.83 us +- 0.31 us: 1.34x faster
    [control] list pop+append: Mean +- std dev: [base_test] 27.2 ns +- 0.1 ns -> [patch_test] 27.7 ns +- 1.0 ns: 1.02x slower
    
    Benchmark hidden because not significant (3): list(3) repeat inplace 1, list(100) repeat 2, list(1) repeat 2
    
    Geometric mean: 1.12x faster
    

    eendebakpt commented

    More microbenchmarks showing a strange speedup:

    import pyperf

    runner = pyperf.Runner()

    for n in [1, 2, 4, 8, 10, 20, 100]:
        runner.timeit(name=f"list({n}) repeat 1000", stmt="x=a*1000",
                      setup=f'a=list(range({n}))')

    Results (main against #91247)

    list(1) repeat 1000: Mean +- std dev: [b3] 2.00 us +- 0.01 us -> [p3] 2.00 us +- 0.02 us: 1.00x slower
    list(2) repeat 1000: Mean +- std dev: [b3] 3.61 us +- 0.01 us -> [p3] 2.11 us +- 0.01 us: 1.71x faster
    list(4) repeat 1000: Mean +- std dev: [b3] 4.48 us +- 0.04 us -> [p3] 3.35 us +- 0.21 us: 1.34x faster
    list(8) repeat 1000: Mean +- std dev: [b3] 7.21 us +- 0.18 us -> [p3] 7.30 us +- 0.23 us: 1.01x slower
    list(10) repeat 1000: Mean +- std dev: [b3] 8.83 us +- 0.26 us -> [p3] 9.34 us +- 0.37 us: 1.06x slower
    list(20) repeat 1000: Mean +- std dev: [b3] 18.2 us +- 0.9 us -> [p3] 19.3 us +- 0.9 us: 1.06x slower
    list(100) repeat 1000: Mean +- std dev: [b3] 95.8 us +- 1.5 us -> [p3] 104 us +- 6 us: 1.09x slower
    

    For repeat=1000 most of the time is spent copying the data (not in checking input values, etc.). The difference between main and the two PRs is:

    Main

            // Now src chases after dest in the same buffer
            src = np->ob_item;
            while (dest < dest_end) {
                *dest++ = *src++;
            }
    

    PR

        while (copied < len_dest) {
            Py_ssize_t items_to_copy = Py_MIN(copied, len_dest - copied);
            memcpy(items + copied, items, sizeof(PyObject*) * items_to_copy);
            copied += items_to_copy;
        }
    

    Maybe the performance of main is not so good for n=2 or n=3 because the data is read from and written to the same memory block; that would explain the speedup for low n. For n=100 the PR seems slower, and I have no explanation for that.
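
    For illustration, here is a minimal self-contained sketch of the doubling strategy (plain C with a hypothetical name memory_repeat, not the actual patch): after seeding the buffer with one copy of the input, every memcpy doubles the valid prefix, so the number of calls grows logarithmically in the repeat count instead of linearly.

    #include <string.h>

    /* Fill dest (capacity len_src * n bytes) with n copies of src. */
    static void
    memory_repeat(char *dest, const char *src, size_t len_src, size_t n)
    {
        size_t len_dest = len_src * n;
        memcpy(dest, src, len_src);      /* seed: one copy of the input */
        size_t copied = len_src;
        while (copied < len_dest) {
            size_t chunk = len_dest - copied;
            if (chunk > copied) {
                chunk = copied;          /* at most double what is there */
            }
            memcpy(dest + copied, dest, chunk);
            copied += chunk;
        }
    }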

    eendebakpt commented

    @thatbirdguythatuknownot Could you run the benchmark from #91247 (comment) on your system?

    thatbirdguythatuknownot commented

    @thatbirdguythatuknownot Could you run the benchmark from #91247 (comment) on your system?

    Gonna test it out.


    thatbirdguythatuknownot commented May 4, 2022

    @eendebakpt Do I do it like this?

    list(3) repeat 1: Mean +- std dev: [not_optimized] 145 ns +- 8 ns -> [optimized] 142 ns +- 3 ns: 1.02x faster
    list(1) repeat 1: Mean +- std dev: [not_optimized] 78.1 ns +- 2.9 ns -> [optimized] 66.0 ns +- 1.8 ns: 1.18x faster
    list(3) repeat inplace 1: Mean +- std dev: [not_optimized] 86.7 ns +- 1.6 ns -> [optimized] 86.0 ns +- 0.7 ns: 1.01x faster
    tuple(4) repeat 1: Mean +- std dev: [not_optimized] 85.6 ns +- 2.5 ns -> [optimized] 84.5 ns +- 1.0 ns: 1.01x faster
    list(100) repeat 2: Mean +- std dev: [not_optimized] 1.17 us +- 0.02 us -> [optimized] 1.07 us +- 0.01 us: 1.09x faster
    list(1) repeat 2: Mean +- std dev: [not_optimized] 80.4 ns +- 4.4 ns -> [optimized] 73.0 ns +- 3.0 ns: 1.10x faster
    list(3) repeat inplace 2: Mean +- std dev: [not_optimized] 123 ns +- 10 ns -> [optimized] 118 ns +- 3 ns: 1.04x faster
    list(100) repeat 10: Mean +- std dev: [not_optimized] 4.48 us +- 0.21 us -> [optimized] 3.54 us +- 0.28 us: 1.27x faster
    list(3) repeat 10: Mean +- std dev: [not_optimized] 248 ns +- 11 ns -> [optimized] 239 ns +- 13 ns: 1.04x faster
    list(1) repeat 10: Mean +- std dev: [not_optimized] 95.5 ns +- 7.1 ns -> [optimized] 88.7 ns +- 2.7 ns: 1.08x faster
    tuple(4) repeat 10: Mean +- std dev: [not_optimized] 229 ns +- 3 ns -> [optimized] 283 ns +- 20 ns: 1.23x slower
    list(3) repeat 1000: Mean +- std dev: [not_optimized] 10.3 us +- 0.1 us -> [optimized] 7.80 us +- 0.39 us: 1.32x faster
    list(1) repeat 1000: Mean +- std dev: [not_optimized] 1.56 us +- 0.01 us -> [optimized] 1.77 us +- 0.14 us: 1.14x slower
    list(3) repeat inplace 1000: Mean +- std dev: [not_optimized] 5.77 us +- 0.49 us -> [optimized] 3.88 us +- 0.10 us: 1.49x faster
    tuple(4) repeat 1000: Mean +- std dev: [not_optimized] 8.53 us +- 0.36 us -> [optimized] 11.3 us +- 0.8 us: 1.33x slower
    [control] list pop+append: Mean +- std dev: [not_optimized] 57.4 ns +- 1.7 ns -> [optimized] 61.3 ns +- 7.1 ns: 1.07x slower
    
    Benchmark hidden because not significant (5): list(100) repeat 1, list(3) repeat 2, tuple(4) repeat 2, list(3) repeat inplace 10, list(100) repeat 1000
    
    Geometric mean: 1.04x faster
    

    eendebakpt commented

    @thatbirdguythatuknownot That was indeed what I meant. Was that with your branch #92286? The results vary quite a lot; some make sense, others do not (yet):

    • tuple(4) repeat 1000 and tuple(4) repeat 10 should have been unchanged (since your branch does not modify the tuple code).
    • list(3) repeat 1000 is faster, as it should be (the pointer chasing in current main performs badly for small lists, since the reads and writes go to the same memory block).
    • list(1) repeat 1, list(1) repeat 2 and list(1) repeat 10 should have been unchanged, but are faster.

    In any case both #91482 and your #92286 seem like an improvement over current main.

    @MojoVampire Could you review #91482? And perhaps comment on the performance benchmarks: they consistently show an overall improvement, but also some fluctuations that might be system/compiler dependent.


    thatbirdguythatuknownot commented May 4, 2022

    @thatbirdguythatuknownot That was indeed what I meant. That was with your branch #92286?

    @eendebakpt I actually used a copy of #91247 to test it out. Was I supposed to do that with #92286?


    thatbirdguythatuknownot commented May 4, 2022

    Keeping the Py_SIZE(a) == 1 fast path does seem to give a 4% boost in performance last time I checked.

    eendebakpt commented

    @thatbirdguythatuknownot #91247 is the issue; I guess you tested either #91482 (specializations) or #32045 (no specializations). The fact that both PRs have gh-91247 in the name makes it a bit confusing.

    Note that main also has the fast path for Py_SIZE(a) == 1, so it is not clear what made it faster in your run. I suspect this is really system dependent.
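
    For context, a sketch of what such a fast path looks like (illustrative only, with a hypothetical helper name): a one-element list fans out a single pointer, so every destination slot gets the same value and the item's refcount is bumped once per slot; there is little for memcpy to batch beyond pointer-sized stores.

    #include <Python.h>

    /* Sketch: repeat a 1-element sequence by fanning out one pointer. */
    static void
    fill_with_single_item(PyObject **dest, PyObject *item, Py_ssize_t n)
    {
        for (Py_ssize_t i = 0; i < n; i++) {
            Py_INCREF(item);   /* one new reference per destination slot */
            dest[i] = item;
        }
    }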


    thatbirdguythatuknownot commented May 5, 2022

    @thatbirdguythatuknownot #91247 is the issue; I guess you tested either #91482 (specializations) or #32045 (no specializations). The fact that both PRs have gh-91247 in the name makes it a bit confusing.

    I tested #32045.

    sweeneyde pushed a commit that referenced this issue Jul 26, 2022
    * Add _Py_memory_repeat function to pycore_list
    
    * Add _Py_RefcntAdd function to pycore_object
    
    * Use the new functions in tuplerepeat, list_repeat, and list_inplace_repeat
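
    As a rough sketch of how the two helpers fit together (assumed signatures and a hypothetical wrapper name, not the actual CPython internals): each unique item's refcount is increased once by the repeat count, in the spirit of _Py_RefcntAdd, and the pointer block is then replicated with the doubling memcpy of _Py_memory_repeat, instead of incref-ing every copied pointer individually.

    #include <Python.h>
    #include <string.h>

    /* Sketch: repeat len items n times into dest (capacity len * n). */
    static void
    repeat_items(PyObject **dest, PyObject **src, Py_ssize_t len, Py_ssize_t n)
    {
        /* One refcount adjustment per unique item (cf. _Py_RefcntAdd). */
        for (Py_ssize_t i = 0; i < len; i++) {
            Py_SET_REFCNT(src[i], Py_REFCNT(src[i]) + n);
        }
        /* Replicate the pointer block with doubling memcpy
           (cf. _Py_memory_repeat). */
        memcpy(dest, src, len * sizeof(PyObject *));
        Py_ssize_t copied = len;
        Py_ssize_t total = len * n;
        while (copied < total) {
            Py_ssize_t chunk = Py_MIN(copied, total - copied);
            memcpy(dest + copied, dest, chunk * sizeof(PyObject *));
            copied += chunk;
        }
    }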
    sweeneyde commented

    Thanks, @eendebakpt!
