ceval.c: implement fast path for integers with a single digit #66154

Closed
vstinner opened this issue Jul 11, 2014 · 111 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@vstinner
Member

BPO 21955
Nosy @malemburg, @rhettinger, @mdickinson, @pitrou, @vstinner, @skrah, @serhiy-storchaka, @1st1, @MojoVampire
PRs
  • bpo-21955: Change my nickname in BINARY_ADD comment #22481
  • Files
  • 21955.patch
  • bench_long.py
  • inline.patch
  • 21955_2.patch
  • bench_results.txt
  • fastint1.patch
  • fastint2.patch
  • fastint_alt.patch
  • fastintfloat_alt.patch
  • fastint4.patch
  • fastint5.patch
  • bench_long2.py
  • compare.txt
  • compare_to.txt
  • fastint5_2.patch
  • fastint5_3.patch
  • fastint5_4.patch
  • inline-2.patch
  • fastint6.patch
  • mpmath_bench.py
  • fastint6_inline2_json.tar.gz
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/1st1'
    closed_at = <Date 2016-10-20.10:11:39.156>
    created_at = <Date 2014-07-11.09:10:27.900>
    labels = ['interpreter-core', 'performance']
    title = 'ceval.c: implement fast path for integers with a single digit'
    updated_at = <Date 2020-10-01.16:57:44.688>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2020-10-01.16:57:44.688>
    actor = 'vstinner'
    assignee = 'yselivanov'
    closed = True
    closed_date = <Date 2016-10-20.10:11:39.156>
    closer = 'vstinner'
    components = ['Interpreter Core']
    creation = <Date 2014-07-11.09:10:27.900>
    creator = 'vstinner'
    dependencies = []
    files = ['35961', '35964', '35965', '36021', '41795', '41796', '41799', '41801', '41807', '41811', '41815', '41821', '41822', '41823', '41829', '41830', '41831', '41832', '41843', '41882', '45150']
    hgrepos = []
    issue_num = 21955
    keywords = ['patch']
    message_count = 111.0
    messages = ['222731', '222804', '222824', '222829', '222830', '222985', '223162', '223177', '223180', '223186', '223214', '223623', '223711', '223726', '238437', '238455', '258057', '258060', '258062', '259417', '259428', '259429', '259431', '259490', '259491', '259493', '259494', '259495', '259496', '259497', '259499', '259500', '259502', '259503', '259505', '259506', '259508', '259509', '259530', '259540', '259541', '259542', '259545', '259549', '259552', '259554', '259560', '259562', '259563', '259564', '259565', '259567', '259568', '259571', '259573', '259574', '259577', '259578', '259601', '259605', '259607', '259612', '259614', '259626', '259663', '259664', '259666', '259667', '259668', '259669', '259670', '259671', '259672', '259673', '259675', '259678', '259695', '259702', '259706', '259707', '259712', '259713', '259714', '259722', '259729', '259730', '259733', '259734', '259743', '259790', '259791', '259792', '259793', '259800', '259801', '259804', '259832', '259859', '259860', '259918', '259919', '259948', '259999', '264018', '264019', '279021', '279022', '279023', '279026', '279027', '377777']
    nosy_count = 13.0
    nosy_names = ['lemburg', 'rhettinger', 'mark.dickinson', 'pitrou', 'vstinner', 'casevh', 'skrah', 'Yury.Selivanov', 'python-dev', 'serhiy.storchaka', 'yselivanov', 'josh.r', 'zbyrne']
    pr_nums = ['22481']
    priority = 'normal'
    resolution = 'rejected'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue21955'
    versions = ['Python 3.6']

    @vstinner
    Member Author

Python 2 has a fast path in ceval.c for operations (a+b, a-b, etc.) on small integers ("int" type) if the operation does not overflow.

We lost these fast paths in Python 3 when we dropped the int type in favor of the long type.

Antoine Pitrou proposed a fast path, but only for int singletons (integers in the range [-5; 255]): issue bpo-10044. His patch was rejected because it introduced undefined behaviour.

I propose to reimplement the Python 2 optimization for longs with a single digit, which are the most common numbers.

    Pseudo-code for BINARY_ADD:
    ---

    if (PyLong_CheckExact(x) && Py_ABS(Py_SIZE(x)) == 1
        && PyLong_CheckExact(y) && Py_ABS(Py_SIZE(y)) == 1)
    {
       stwodigits a = ..., b = ...;
       stwodigits c;
       if (... a+b will not overflow ...) { 
          c = a + b;
          return PyLong_FromLongLong(c);
       }
    }
    /* fall back to PyNumber_Add() */

The code can be copied from longobject.c, where there are already fast paths for single-digit numbers. See for example long_mul():
    ---

        /* fast path for single-digit multiplication */
        if (Py_ABS(Py_SIZE(a)) <= 1 && Py_ABS(Py_SIZE(b)) <= 1) {
            ....
        }

As with any other optimization, it should be proven faster with benchmarks.
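
For concreteness, here is a rough, hedged sketch of such a fast path, written as a standalone helper rather than as the real ceval.c change. It assumes CPython's internal long representation from Include/longintrepr.h (magnitude in ob_digit[], sign carried by ob_size); the helper names are made up for illustration:

#include <Python.h>
#include "longintrepr.h"    /* PyLongObject internals: ob_digit, stwodigits */

/* Value of an int known to have at most one digit. */
static stwodigits
single_digit_value(PyObject *op)
{
    Py_ssize_t size = Py_SIZE(op);
    if (size == 0)
        return 0;
    stwodigits v = (stwodigits)((PyLongObject *)op)->ob_digit[0];
    return (size < 0) ? -v : v;
}

/* BINARY_ADD-style fast path: only taken for exact ints of 0 or 1 digit.
   A digit holds at most 30 bits, so a + b always fits in stwodigits and
   in a C long long: no overflow check is needed. */
static PyObject *
binary_add_fast(PyObject *left, PyObject *right)
{
    if (PyLong_CheckExact(left) && Py_ABS(Py_SIZE(left)) <= 1 &&
        PyLong_CheckExact(right) && Py_ABS(Py_SIZE(right)) <= 1)
    {
        stwodigits a = single_digit_value(left);
        stwodigits b = single_digit_value(right);
        return PyLong_FromLongLong((long long)(a + b));
    }
    /* Slow path: generic addition with full overflow handling. */
    return PyNumber_Add(left, right);
}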

    @vstinner vstinner added the performance Performance or resource usage label Jul 11, 2014
    @MojoVampire
    Mannequin

    MojoVampire mannequin commented Jul 11, 2014

    On: if (... a+b will not overflow ...) {

    Since you limited the optimization for addition to single digit numbers, at least for addition and subtraction, overflow is impossible. The signed twodigit you use for the result is guaranteed to be able to store far larger numbers than addition of single digits can produce. In fact, due to the extra wasted bit on large (30 bit) digits, if you used a fixed width 32 bit type for addition/subtraction, and a fixed width 64 bit type for multiplication, overflow would be impossible regardless of whether you used 15 or 30 bit digits.

    On a related note: Presumably you should check if the abs(size) <= 1 like in longobject.c, not == 1, or you omit the fast path for 0. Doesn't come up much, not worth paying extra to optimize, but it costs nothing to handle it.
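
For reference, a quick worked check of those bounds, assuming 30-bit digits (with 15-bit digits the margins are even larger); MAX_DIGIT is a name made up for this note:

#include <stdint.h>
#include <assert.h>

#define MAX_DIGIT (((int64_t)1 << 30) - 1)   /* largest 30-bit digit */

/* |a|, |b| <= MAX_DIGIT, so |a + b| <= 2**31 - 2 and |a * b| < 2**60. */
static_assert(2 * MAX_DIGIT <= INT32_MAX, "single-digit a+b fits in 32 bits");
static_assert(MAX_DIGIT * MAX_DIGIT <= INT64_MAX, "single-digit a*b fits in 64 bits");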

    @serhiy-storchaka
    Member

Let's try. As I understand it, bpo-10044 was rejected because it complicated the code too much. Maybe a new attempt will be more successful.

    @vstinner
    Member Author

    Serhiy Storchaka added the comment:

Let's try. As I understand it, bpo-10044 was rejected because it complicated the code too much. Maybe a new attempt will be more successful.

I read that Mark rejected issue bpo-10044 because it introduces undefined behaviour.

    @vstinner
    Member Author

I'm not interested in working on this issue right now. If anyone is interested, please go ahead!

    @rhettinger
    Contributor

There also used to be a fast path for binary subscripting (BINARY_SUBSCR) with integer indexes. I would like to see that performance regression fixed if it can be done cleanly.

    @zbyrne
    Mannequin

    zbyrne mannequin commented Jul 16, 2014

So I'm trying something pretty similar to Victor's pseudo-code and just using timeit to look for speedups:

timeit('x+x', 'x=10', number=10000000)
    before:
    1.1934231410000393
    1.1988609210002323
    1.1998214110003573
    1.206968028999654
    1.2065417159997196

    after:
    1.1698650090002047
    1.1705158909999227
    1.1752884750003432
    1.1744818619999933
    1.1741297110002051
    1.1760422649999782

    Small improvement. Haven't looked at optimizing BINARY_SUBSCR yet.

    @serhiy-storchaka
    Member

Thank you Zach. I even found a small regression.

    Before:

    $ ./python -m timeit -s "x = 10"  "x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x"
    1000000 loops, best of 3: 1.51 usec per loop

    After:

    $ ./python -m timeit -s "x = 10"  "x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x"
    1000000 loops, best of 3: 1.6 usec per loop

    @vstinner
    Member Author

bench_long.py: micro-benchmark for x+y. I confirm a slowdown with 21955.patch. IMO you should at least inline PyLong_AsLong(), which can be simplified if the number has 0 or 1 digit. Here is my patch "inline.patch", which is 21955.patch with PyLong_AsLong() inlined.

    Benchmark result (patch=21955.patch, inline=inline.patch):

    Common platform:
    Platform: Linux-3.14.8-200.fc20.x86_64-x86_64-with-fedora-20-Heisenbug
    CPU model: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
    Bits: int=32, long=64, long long=64, size_t=64, void*=64
    CFLAGS: -Wno-unused-result -Werror=declaration-after-statement -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
    Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
    Python unicode implementation: PEP-393
    Timer: time.perf_counter

    Platform of campaign orig:
    Date: 2014-07-16 10:04:27
    Python version: 3.5.0a0 (default:08b3ee523577, Jul 16 2014, 10:04:23) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]
    SCM: hg revision=08b3ee523577 tag=tip branch=default date="2014-07-15 13:23 +0300"
    Timer precision: 40 ns

    Platform of campaign patch:
    Timer precision: 40 ns
    Date: 2014-07-16 10:04:01
    Python version: 3.5.0a0 (default:08b3ee523577+, Jul 16 2014, 10:02:12) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]
    SCM: hg revision=08b3ee523577+ tag=tip branch=default date="2014-07-15 13:23 +0300"

    Platform of campaign inline:
    Timer precision: 31 ns
    Date: 2014-07-16 10:11:21
    Python version: 3.5.0a0 (default:08b3ee523577+, Jul 16 2014, 10:10:48) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]
    SCM: hg revision=08b3ee523577+ tag=tip branch=default date="2014-07-15 13:23 +0300"

    --------------------+-------------+---------------+---------------
    Tests               |        orig |         patch |         inline
    --------------------+-------------+---------------+---------------
1+2                 |   23 ns (*) |         24 ns |   21 ns (-12%)
"1+2" ran 100 times | 1.61 us (*) | 1.74 us (+7%) | 1.39 us (-14%)
    --------------------+-------------+---------------+---------------
    Total               | 1.64 us (*) | 1.76 us (+7%) | 1.41 us (-14%)
    --------------------+-------------+---------------+---------------

    (I removed my message because I posted the wrong benchmark output, inline column was missing.)

    @serhiy-storchaka
    Member

Confirmed a speedup of about 20%. Surprisingly it affects even integers outside the range of preallocated small integers (-5...255).

    Before:

    $ ./python -m timeit -s "x=10"  "x+x"
    10000000 loops, best of 3: 0.143 usec per loop
    $ ./python -m timeit -s "x=1000"  "x+x"
    1000000 loops, best of 3: 0.247 usec per loop

    After:

    $ ./python -m timeit -s "x=10"  "x+x"
    10000000 loops, best of 3: 0.117 usec per loop
    $ ./python -m timeit -s "x=1000"  "x+x"
    1000000 loops, best of 3: 0.209 usec per loop

    All measures are made with modified timeit (bpo-21988).

    @zbyrne
    Mannequin

    zbyrne mannequin commented Jul 16, 2014

Well, don't I feel silly. I confirmed both my regression and the inline speedup using the benchmark Victor added. I wonder if I got my binaries backwards in my first test...

    @zbyrne
    Mannequin

    zbyrne mannequin commented Jul 22, 2014

    I did something similar to BINARY_SUBSCR after looking at the 2.7 source as Raymond suggested. Hopefully I got my binaries straight this time :) The new patch includes Victor's inlining and my new subscript changes.

    Platform of campaign orig:
    Python version: 3.5.0a0 (default:c8ce5bca0fcd+, Jul 15 2014, 18:11:28) [GCC 4.6.3]
    Timer precision: 6 ns
    Date: 2014-07-21 20:28:30

    Platform of campaign patch:
    Python version: 3.5.0a0 (default:c8ce5bca0fcd+, Jul 21 2014, 20:21:20) [GCC 4.6.3]
    Timer precision: 20 ns
    Date: 2014-07-21 20:28:39

    ---------------------+-------------+---------------
    Tests                |        orig |          patch
    ---------------------+-------------+---------------
1+2                  |  118 ns (*) |  103 ns (-13%)
"1+2" ran 100 times  | 7.28 us (*) | 5.93 us (-19%)
x[1]                 |  120 ns (*) |   98 ns (-19%)
"x[1]" ran 100 times | 7.35 us (*) | 5.31 us (-28%)
    ---------------------+-------------+---------------
    Total                | 14.9 us (*) | 11.4 us (-23%)
    ---------------------+-------------+---------------

    @pitrou
    Member

    pitrou commented Jul 23, 2014

    Please run the actual benchmark suite to get interesting numbers: http://hg.python.org/benchmarks

    @zbyrne
    Mannequin

    zbyrne mannequin commented Jul 23, 2014

    I ran the whole benchmark suite. There are a few that are slower: call_method_slots, float, pickle_dict, and unpack_sequence.

    Report on Linux zach-vbox 3.2.0-24-generic-pae #39-Ubuntu SMP Mon May 21 18:54:21 UTC 2012 i686 i686
    Total CPU cores: 1

    ### 2to3 ###
    24.789549 -> 24.809551: 1.00x slower

    ### call_method_slots ###
    Min: 1.743554 -> 1.780807: 1.02x slower
    Avg: 1.751735 -> 1.792814: 1.02x slower
    Significant (t=-26.32)
    Stddev: 0.00576 -> 0.01823: 3.1660x larger

    ### call_method_unknown ###
    Min: 1.828094 -> 1.739625: 1.05x faster
    Avg: 1.852225 -> 1.806721: 1.03x faster
    Significant (t=2.28)
    Stddev: 0.01874 -> 0.24320: 12.9783x larger

    ### call_simple ###
    Min: 1.353581 -> 1.263386: 1.07x faster
    Avg: 1.397946 -> 1.302046: 1.07x faster
    Significant (t=24.28)
    Stddev: 0.03667 -> 0.03154: 1.1629x smaller

    ### chaos ###
    Min: 1.199377 -> 1.115550: 1.08x faster
    Avg: 1.230859 -> 1.146573: 1.07x faster
    Significant (t=16.24)
    Stddev: 0.02663 -> 0.02525: 1.0544x smaller

    ### django_v2 ###
    Min: 2.682884 -> 2.633110: 1.02x faster
    Avg: 2.747521 -> 2.690486: 1.02x faster
    Significant (t=9.90)
    Stddev: 0.02744 -> 0.03010: 1.0970x larger

    ### fastpickle ###
    Min: 1.751475 -> 1.597340: 1.10x faster
    Avg: 1.771805 -> 1.613533: 1.10x faster
    Significant (t=64.81)
    Stddev: 0.01177 -> 0.01263: 1.0727x larger

    ### float ###
    Min: 1.254858 -> 1.293067: 1.03x slower
    Avg: 1.336045 -> 1.365787: 1.02x slower
    Significant (t=-3.30)
    Stddev: 0.04851 -> 0.04135: 1.1730x smaller

    ### json_dump_v2 ###
    Min: 17.871819 -> 16.968647: 1.05x faster
    Avg: 18.428747 -> 17.483397: 1.05x faster
    Significant (t=4.10)
    Stddev: 1.60617 -> 0.27655: 5.8078x smaller

    ### mako ###
    Min: 0.241614 -> 0.231678: 1.04x faster
    Avg: 0.253730 -> 0.240585: 1.05x faster
    Significant (t=8.93)
    Stddev: 0.01912 -> 0.01327: 1.4417x smaller

    ### mako_v2 ###
    Min: 0.225664 -> 0.213179: 1.06x faster
    Avg: 0.234850 -> 0.225984: 1.04x faster
    Significant (t=10.12)
    Stddev: 0.01379 -> 0.01391: 1.0090x larger

    ### meteor_contest ###
    Min: 0.777612 -> 0.758924: 1.02x faster
    Avg: 0.799580 -> 0.780897: 1.02x faster
    Significant (t=3.97)
    Stddev: 0.02482 -> 0.02212: 1.1221x smaller

    ### nbody ###
    Min: 0.969724 -> 0.883935: 1.10x faster
    Avg: 0.996416 -> 0.918375: 1.08x faster
    Significant (t=12.65)
    Stddev: 0.02426 -> 0.03627: 1.4951x larger

    ### nqueens ###
    Min: 1.142745 -> 1.128195: 1.01x faster
    Avg: 1.296659 -> 1.162443: 1.12x faster
    Significant (t=2.75)
    Stddev: 0.34462 -> 0.02680: 12.8578x smaller

    ### pickle_dict ###
    Min: 1.433264 -> 1.467394: 1.02x slower
    Avg: 1.468122 -> 1.506908: 1.03x slower
    Significant (t=-7.20)
    Stddev: 0.02695 -> 0.02691: 1.0013x smaller

    ### raytrace ###
    Min: 5.454853 -> 5.538799: 1.02x slower
    Avg: 5.530943 -> 5.676983: 1.03x slower
    Significant (t=-8.64)
    Stddev: 0.05152 -> 0.10791: 2.0947x larger

    ### regex_effbot ###
    Min: 0.205875 -> 0.194776: 1.06x faster
    Avg: 0.211118 -> 0.198759: 1.06x faster
    Significant (t=5.10)
    Stddev: 0.01305 -> 0.01112: 1.1736x smaller

    ### regex_v8 ###
    Min: 0.141628 -> 0.133819: 1.06x faster
    Avg: 0.147024 -> 0.140053: 1.05x faster
    Significant (t=2.72)
    Stddev: 0.01163 -> 0.01388: 1.1933x larger

    ### richards ###
    Min: 0.734472 -> 0.727501: 1.01x faster
    Avg: 0.760795 -> 0.743484: 1.02x faster
    Significant (t=3.50)
    Stddev: 0.02778 -> 0.02127: 1.3061x smaller

    ### silent_logging ###
    Min: 0.344678 -> 0.336087: 1.03x faster
    Avg: 0.357982 -> 0.347361: 1.03x faster
    Significant (t=2.76)
    Stddev: 0.01992 -> 0.01852: 1.0755x smaller

    ### simple_logging ###
    Min: 1.104831 -> 1.072921: 1.03x faster
    Avg: 1.146844 -> 1.117068: 1.03x faster
    Significant (t=4.02)
    Stddev: 0.03552 -> 0.03848: 1.0833x larger

    ### spectral_norm ###
    Min: 1.710336 -> 1.688910: 1.01x faster
    Avg: 1.872578 -> 1.738698: 1.08x faster
    Significant (t=2.35)
    Stddev: 0.40095 -> 0.03331: 12.0356x smaller

    ### tornado_http ###
    Min: 0.849374 -> 0.852209: 1.00x slower
    Avg: 0.955472 -> 0.916075: 1.04x faster
    Significant (t=4.82)
    Stddev: 0.07059 -> 0.04119: 1.7139x smaller

    ### unpack_sequence ###
    Min: 0.000030 -> 0.000020: 1.52x faster
    Avg: 0.000164 -> 0.000174: 1.06x slower
    Significant (t=-13.11)
    Stddev: 0.00011 -> 0.00013: 1.2256x larger

    ### unpickle_list ###
    Min: 1.333952 -> 1.212805: 1.10x faster
    Avg: 1.373228 -> 1.266677: 1.08x faster
    Significant (t=16.32)
    Stddev: 0.02894 -> 0.03597: 1.2428x larger

    @vstinner
    Member Author

    What's the status of this issue?

    @zbyrne
    Mannequin

    zbyrne mannequin commented Mar 18, 2015

    I haven't looked at it since I posted the benchmark results for 21955_2.patch.

    @zbyrne
    Mannequin

    zbyrne mannequin commented Jan 12, 2016

    Anybody still looking at this? I can take another stab at it if it's still in scope.

    @1st1
    Member

    1st1 commented Jan 12, 2016

    Anybody still looking at this? I can take another stab at it if it's still in scope.

There were some visible speedups from your patch -- I think we should merge this optimization. Can you figure out why unpack_sequence and other benchmarks were slower?

    @zbyrne
    Mannequin

    zbyrne mannequin commented Jan 12, 2016

Can you figure out why unpack_sequence and other benchmarks were slower?

I didn't look really closely. A few of the slower ones were floating-point heavy, which would incur the slow-path penalty, but I can dig into unpack_sequence.

    @1st1
    Member

    1st1 commented Feb 2, 2016

    I'm assigning this patch to myself to commit it in 3.6 later.

    @1st1 1st1 added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Feb 2, 2016
    @1st1 1st1 self-assigned this Feb 2, 2016
    @zbyrne
    Mannequin

    zbyrne mannequin commented Feb 2, 2016

I took another look at this, tried applying it to 3.6, and ran the latest benchmarks. It applied cleanly, and the benchmark results were similar; this time unpack_sequence and spectral_norm were slower. spectral_norm makes sense: it's doing lots of FP addition. The unpack_sequence instruction looks like it already has optimizations for unpacking lists and tuples onto the stack, and running dis on the test showed that it's completely dominated by calls to unpack_sequence, load_fast, and store_fast, so I still don't know what's going on there.

    @pitrou
    Member

    pitrou commented Feb 2, 2016

    Any change that increases the cache or branch predictor footprint of the evaluation loop may make the interpreter slower, even if the change doesn't seem related to a particular benchmark. That may be the reason here.

    @1st1
    Member

    1st1 commented Feb 2, 2016

    unpack_sequence contains 400 lines of this: "a, b, c, d, e, f, g, h, i, j = to_unpack". This code doesn't even touch BINARY_SUBSCR or BINARY_ADD.

Zach, could you please run your benchmarks in rigorous mode (perf.py -r)? I'd also suggest experimenting with putting the baseline CPython as the first argument and then as the second -- maybe your machine runs the second interpreter slightly faster.

    @zbyrne
    Mannequin

    zbyrne mannequin commented Feb 3, 2016

I ran 6 benchmark runs on my work machine (not the same one as the last set) overnight.
    Two with just the BINARY_ADD change, two with the BINARY_SUBSCR change, and two with both.
    I'm attaching the output from all my benchmark runs, but here are the highlights
    In this table I've flipped the results for running the modified build as the reference, but in the new attachment, slower in the right column means faster, I think :)
    |------------------|---------------------------------------|-----------------------------------|
|Build             | Baseline Reference                    | Modified Reference                |
    |------------------|--------------------|------------------|--------------------|--------------|
|                  | Faster             | Slower           | Faster             | Slower       |
    |------------------|--------------------|------------------|--------------------|--------------|
    |BINARY_ADD | chameleon_v2 | etree_parse | chameleon_v2 | call_simple |
    | | chaos | nbody | fannkuch | nbody |
    | | django | normal_startup | normal_startup | pickle_dict |
    | | etree_generate | pickle_dict | nqueens | regex_v8 |
    | | fannkuch | pickle_list | regex_compile | |
    | | formatted_logging | regex_effbot | spectral_norm | |
    | | go | | unpickle_list | |
    | | json_load | | | |
    | | regex_compile | | | |
    | | simple_logging | | | |
    | | spectral_norm | | | |
    |------------------|--------------------|------------------|--------------------|--------------|
    |BINARY_SUBSCR | chameleon_v2 | call_simple | 2to3 | etree_parse |
    | | chaos | go | call_method_slots | json_dump_v2 |
    | | etree_generate | pickle_list | chaos | pickle_dict |
    | | fannkuch | telco | fannkuch | |
    | | fastpickle | | formatted_logging | |
    | | hexiom2 | | go | |
    | | json_load | | hexiom2 | |
    | | mako_v2 | | mako_v2 | |
    | | meteor_contest | | meteor_contest | |
    | | nbody | | nbody | |
    | | regex_v8 | | normal_startup | |
    | | spectral_norm | | nqueens | |
    | | | | pickle_list | |
    | | | | simple_logging | |
    | | | | spectral_norm | |
    | | | | telco | |
    |------------------|--------------------|------------------|--------------------|--------------|
    |BOTH | chameleon_v2 | call_simple | chameleon_v2 | fastpickle |
| | chaos | etree_parse | chaos | pickle_dict |
    | | etree_generate | pathlib | etree_generate | pickle_list |
    | | etree_process | pickle_list | etree_process | telco |
    | | fannkuch | | fannkuch | |
    | | fastunpickle | | float | |
    | | float | | formatted_logging | |
    | | formatted_logging | | go | |
    | | hexiom2 | | hexiom2 | |
    | | nbody | | nbody | |
    | | nqueens | | normal_startup | |
    | | regex_v8 | | nqueens | |
    | | spectral_norm | | simple_logging | |
    | | unpickle_list | | spectral_norm | |
    |------------------|--------------------|------------------|--------------------|--------------|

    unpack_sequence is nowhere to be seen and spectral_norm is faster now...

    @1st1
    Member

    1st1 commented Feb 3, 2016

    Attaching a new patch -- rewritten to optimize -, *, +, -=, *= and +=. I also removed the optimization of [] operator -- that should be done in a separate patch and in a separate issue.

    Some nano-benchmarks (best of 3):

    python -m timeit "sum([x + x + 1 for x in range(100)])"
    2.7: 7.71 3.5: 8.54 3.6: 7.33

    python -m timeit "sum([x - x - 1 for x in range(100)])"
    2.7: 7.81 3.5: 8.59 3.6: 7.57

    python -m timeit "sum([x * x * 1 for x in range(100)])"
    2.7: 9.28 3.5: 10.6 3.6: 9.44

    Python 3.6 vs 3.5 (spectral_norm, rigorous run):
    Min: 0.315917 -> 0.276785: 1.14x faster
    Avg: 0.321006 -> 0.284909: 1.13x faster

    Zach, thanks a lot for the research! I'm glad that unpack_sequence finally proved to be irrelevant. Could you please take a look at the updated patch?

    @vstinner
    Member Author

    vstinner commented Feb 3, 2016

    python -m timeit "sum([x * x * 1 for x in range(100)])"

    If you only want to benchmark x*y, x+y and list-comprehension, you
    should use a tuple for the iterator.

    @pitrou
    Member

    pitrou commented Feb 3, 2016

In this table I've flipped the results for running the modified build as the reference, but in the new attachment, slower in the right column means faster, I think :)

    I don't understand what this table means (why 4 columns?). Can you explain what you did?

    @serhiy-storchaka
    Member

    I see two main trends: optimize most cases (optimize most operators for int and float, ex: fastint5_4.patch) versus optimize very few cases to limit changes and to limit effects on ceval.c (ex: inline-2.patch).

I agree that maybe optimizing very few cases is better. We need to collect statistics on the use of different operations with different types in long runs of tests or benchmarks. If, say, division is used 100 times less than addition, we shouldn't complicate the ceval loop to optimize it.
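
As a starting point for gathering those statistics, a minimal sketch (the binop_stat counters and BINOP_STAT macro are made-up names, not existing CPython machinery) that counts which operand-type pairs actually reach a binary operation in the eval loop:

/* Crude counters; dump them at interpreter shutdown to see which cases
   dominate before deciding which operators deserve a fast path. */
static unsigned long binop_stat[3];   /* 0: int-int, 1: float involved, 2: other */

#define BINOP_STAT(left, right)                                         \
    do {                                                                \
        if (PyLong_CheckExact(left) && PyLong_CheckExact(right))        \
            binop_stat[0]++;                                            \
        else if (PyFloat_CheckExact(left) || PyFloat_CheckExact(right)) \
            binop_stat[1]++;                                            \
        else                                                            \
            binop_stat[2]++;                                            \
    } while (0)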

    @vstinner
    Member Author

    vstinner commented Feb 6, 2016

    Benchmark on inline-2.patch. No speedup, only slowdown.

    I'm now running benchmark on fastint5_4.patch.

    $ python3 -u perf.py --affinity=2-3,6-7 --rigorous ../default/python.orig ../default/python.inline-2

    Report on Linux smithers 4.3.4-300.fc23.x86_64 #1 SMP Mon Jan 25 13:39:23 UTC 2016 x86_64 x86_64
    Total CPU cores: 8

    ### json_load ###
    Min: 0.707290 -> 0.723411: 1.02x slower
    Avg: 0.707845 -> 0.724238: 1.02x slower
    Significant (t=-297.25)
    Stddev: 0.00026 -> 0.00049: 1.8696x larger

    ### regex_v8 ###
    Min: 0.066663 -> 0.070435: 1.06x slower
    Avg: 0.066947 -> 0.071378: 1.07x slower
    Significant (t=-17.98)
    Stddev: 0.00172 -> 0.00177: 1.0313x larger

    The following not significant results are hidden, use -v to show them:
    2to3, chameleon_v2, django_v3, fastpickle, fastunpickle, json_dump_v2, nbody, tornado_http.

    real 58m32.662s
    user 57m43.058s
    sys 0m47.428s

    @vstinner
    Member Author

    vstinner commented Feb 6, 2016

    Benchmark on fastint5_4.patch.

    python3 -u perf.py --affinity=2-3,6-7 --rigorous ../default/python.orig ../default/python_fastint5_4

    Report on Linux smithers 4.3.4-300.fc23.x86_64 #1 SMP Mon Jan 25 13:39:23 UTC 2016 x86_64 x86_64
    Total CPU cores: 8

    ### django_v3 ###
    Min: 0.563959 -> 0.578181: 1.03x slower
    Avg: 0.565383 -> 0.579137: 1.02x slower
    Significant (t=-152.48)
    Stddev: 0.00075 -> 0.00050: 1.4900x smaller

    ### fastunpickle ###
    Min: 0.551076 -> 0.563469: 1.02x slower
    Avg: 0.555481 -> 0.567028: 1.02x slower
    Significant (t=-27.05)
    Stddev: 0.00278 -> 0.00324: 1.1687x larger

    ### json_dump_v2 ###
    Min: 2.737429 -> 2.662615: 1.03x faster
    Avg: 2.754239 -> 2.685404: 1.03x faster
    Significant (t=55.63)
    Stddev: 0.00610 -> 0.01077: 1.7662x larger

    ### nbody ###
    Min: 0.228548 -> 0.212292: 1.08x faster
    Avg: 0.230082 -> 0.213574: 1.08x faster
    Significant (t=73.74)
    Stddev: 0.00175 -> 0.00139: 1.2567x smaller

    ### regex_v8 ###
    Min: 0.041323 -> 0.048099: 1.16x slower
    Avg: 0.041624 -> 0.049318: 1.18x slower
    Significant (t=-45.38)
    Stddev: 0.00123 -> 0.00116: 1.0613x smaller

    The following not significant results are hidden, use -v to show them:
    2to3, chameleon_v2, fastpickle, json_load, tornado_http.

    @1st1
    Member

    1st1 commented Feb 6, 2016

    regex_v8

    Min: 0.041323 -> 0.048099: 1.16x slower
    Avg: 0.041624 -> 0.049318: 1.18x slower

I think this is a random fluctuation; that benchmark (and the re lib) doesn't use the operators much. It can't be THAT much slower just because of optimizing a few binops.

    @1st1
    Member

    1st1 commented Feb 6, 2016

    You're also running a very small subset of all benchmarks available. Please try the '-b all' option. I'll also run benchmarks on my machines.

    @1st1
    Member

    1st1 commented Feb 6, 2016

Alright, I ran a few benchmarks myself. In rigorous mode regex_v8 has the same performance on my 2013 MacBook Pro and on an 8-year-old i7 CPU (Linux).

    Here're results of "perf.py -b raytrace,spectral_norm,meteor_contest,nbody ../cpython/python.exe ../cpython-git/python.exe -r"

    fastint5:

    ### nbody ###
    Min: 0.227683 -> 0.197046: 1.16x faster
    Avg: 0.229366 -> 0.198889: 1.15x faster
    Significant (t=137.31)
    Stddev: 0.00170 -> 0.00142: 1.1977x smaller

    ### spectral_norm ###
    Min: 0.296840 -> 0.262279: 1.13x faster
    Avg: 0.299616 -> 0.265387: 1.13x faster
    Significant (t=74.52)
    Stddev: 0.00331 -> 0.00319: 1.0382x smaller

    The following not significant results are hidden, use -v to show them:
    meteor_contest, raytrace.

    ======

    inline-2:

    ### raytrace ###
    Min: 1.188825 -> 1.213788: 1.02x slower
    Avg: 1.199827 -> 1.227276: 1.02x slower
    Significant (t=-18.12)
    Stddev: 0.00559 -> 0.01408: 2.5184x larger

    ### spectral_norm ###
    Min: 0.296535 -> 0.277025: 1.07x faster
    Avg: 0.299044 -> 0.278071: 1.08x faster
    Significant (t=87.40)
    Stddev: 0.00220 -> 0.00097: 2.2684x smaller

    The following not significant results are hidden, use -v to show them:
    meteor_contest, nbody.

    @1st1
    Member

    1st1 commented Feb 7, 2016

    From what I can see there is no negative impact of the patch on stable macro benchmarks.

There is quite a detectable positive impact on most integer and float operations from my patch. 13-16% on the nbody and spectral_norm benchmarks is still impressive. And you can see a huge improvement in various timeit micro-benchmarks.

fastint5 is a very compact patch that only touches the ceval.c file. It doesn't complicate the code, as the macro is very straightforward. Since the patch has passed the code review, thorough benchmarking, and discussion stages, I'd like to commit it.

    @serhiy-storchaka
    Member

Please don't commit it right now. Yes, due to using macros the patch looks simple, but the macros expand to complex code. We need more statistics.

    @1st1
    Member

    1st1 commented Feb 7, 2016

Please don't commit it right now. Yes, due to using macros the patch looks simple, but the macros expand to complex code. We need more statistics.

But what will you use to gather statistics? The test suite isn't representative, and we already know what the benchmark suite will show. I can assist with writing some code for stats, but what's the plan?

    @skrah
    Mannequin

    skrah mannequin commented Feb 7, 2016

    bpo-26288 brought a great speedup for floats. With fastint5_4.patch *on top of bpo-26288* I see no improvement for floats and a big slowdown for _decimal.

    @casevh
    Mannequin

    casevh mannequin commented Feb 7, 2016

    Can I suggest the mpmath test suite as a good benchmark? I've used it to test the various optimizations in gmpy2 and it has always been a valuable real-world benchmark. And it is slower in Python 3 than Python 2....

    @pitrou
    Member

    pitrou commented Feb 7, 2016

    Be careful with test suites: first, they might exercise code that would never be a critical point for performance in a real-world application; second and most important, unittest seems to have gotten slower between 2.x and 3.x, so you would really be comparing apples to oranges.

    @1st1
    Member

    1st1 commented Feb 7, 2016

    Attaching another patch - fastint6.patch that only optimizes longs (no FP fast path).

    bpo-26288 brought a great speedup for floats. With fastint5_4.patch *on top of bpo-26288* I see no improvement for floats and a big slowdown for _decimal.

    What benchmark did you use? What were the numbers? I'm asking because before you benchmarked different patches that are conceptually similar to fastint5, and the result was that decimal was 5% faster with fast paths for just longs, and 6% slower with fast paths for longs & floats.

    Also, some quick timeit results (quite stable from run to run):

    -m timeit -s "x=2" "x + 10 + x * 20 + x* 10 + 20 -x"
    3.6: 0.150usec 3.6+fastint: 0.112usec

    -m timeit -s "x=2" "x*2.2 + 2 + x*2.5 + 1.0 - x / 2.0 + (x+0.1)/(x-0.1)2 + (x+10)(x-30)"
    3.6: 0.425usec 3.6+fastint: 0.302usec

    @vstinner
    Member Author

    vstinner commented Feb 8, 2016

    Yury Selivanov:

    Alright, I ran a few benchmarks myself. (...)
    From what I can see there is no negative impact of the patch on stable macro benchmarks.

I'm disappointed by the results. In short, these patches have *no* impact on macro benchmarks, other than the two which are stressing the int and float types. Maybe we are just wasting our time on this issue...

I understand that the patches are only useful to get an xx% speedup (where xx% is smaller than 25%) if your whole application is bound by numeric computations. If that's the case, I would suggest moving to PyPy, Numba, Cython, etc. In that situation I expect something more interesting than xx% faster: a much more impressive speedup.

    http://speed.pypy.org/ : PyPy/CPython 2.7 for spectral_norm is 0.04: 25x faster. For nbody (nbody_modified), it's 0.09: 11x faster.

    With patches of this issue, the *best* speedup is only 1.16x faster... We are *very* far from 11x or 25x faster. It's not even 2x faster...

    Yury Selivanov:

    fastint5 is a very compact patch, that only touches the ceval.c file. It doesn't complicate the code, as the macro is very straightforward. Since the patch passed the code review, thorough benchmarking and discussion stages, I'd like to commit it.

According to my micro-benchmark msg259706, inline-2.patch is faster than fastint5_4.patch. I would suggest "finishing" inline-2.patch to optimize other operations, and *maybe* add a fast path for float.

On macro benchmarks, inline-2.patch is slower than fastint5_4.patch, but that was to be expected since I only added fast paths for int-int and only for a few operators.

The question is whether it is worth getting an xx% speedup on one or two specific benchmarks where CPython really sucks compared to other languages and other implementations of Python...

    Stefan Krah:

    With fastint5_4.patch *on top of bpo-26288* I see no improvement for floats and a big slowdown for _decimal.

    How do you run your benchmark?

    Case Van Horsen:

    Can I suggest the mpmath test suite as a good benchmark?

    Where can we find this benchmark?

    Case Van Horsen:

    it has always been a valuable real-world benchmark

    What do you mean by "real-world benchmark"? :-)

    @casevh
    Mannequin

    casevh mannequin commented Feb 8, 2016

    mpmath is a library for arbitrary-precision floating-point arithmetic. It uses Python's native long type or gmpy2's mpz type for computations. It is available at https://pypi.python.org/pypi/mpmath.

The test suite can be run directly from the source tree. The test suite includes timing information for individual tests and for the entire run. Sample invocation:

    ~/src/mpmath-0.19/mpmath/tests$ time py36 runtests.py -local

For example, I've tried to optimize gmpy2's handling of binary operations between its mpz type and short Python integers. I've found the mpmath test suite to provide useful results: improvements that are significant on a micro-benchmark (say 20%) will usually cause a small but repeatable improvement in the test suite's running time. And some changes that looked good on a micro-benchmark would slow down mpmath.

I ran the mpmath test suite with Python 3.6 and with the fastint6 patch. The overall speedup when using Python's long type was about 1%. When using gmpy2's mpz type, there was a slowdown of about 2%.

    I will run more tests tonight.

    @1st1
    Member

    1st1 commented Feb 8, 2016

I ran the mpmath test suite with Python 3.6 and with the fastint6 patch. The overall speedup when using Python's long type was about 1%. When using gmpy2's mpz type, there was a slowdown of about 2%.

    I will run more tests tonight.

    Please try to test fastint5 too (fast paths for long & floats, whereas fastint6 is only focused on longs).

    @casevh
    Mannequin

    casevh mannequin commented Feb 9, 2016

    I ran the mpmath test suite with the fastint6 and fastint5_4 patches.

    fastint6 results

    without gmpy: 0.25% faster
    with gmpy: 3% slower

    fastint5_4 results

    without gmpy: 1.5% slower
    with gmpy: 5.5% slower

    @vstinner
    Member Author

    vstinner commented Feb 9, 2016

    Case Van Horsen added the comment:

    I ran the mpmath test suite with the fastint6 and fastint5_4 patches.

    fastint6 results
    without gmpy: 0.25% faster
    with gmpy: 3% slower

    fastint5_4 results
    without gmpy: 1.5% slower
    with gmpy: 5.5% slower

I'm more and more disappointed by this issue... If even a test stressing int & float is *slower* (or less than 1% faster) with a patch supposed to optimize them, what's the point? I'm also concerned by the slowdown for other types (gmpy types).

    Maybe we should just close the issue?

    @1st1
    Member

    1st1 commented Feb 9, 2016

    Maybe we should just close the issue?

    I'll take a closer look at gmpy later. Please don't close.

    @vstinner
    Member Author

The test suite can be run directly from the source tree. The test suite includes timing information for individual tests and for the entire run. Sample invocation:

I extracted the slowest test (test_polyroots_legendre) and put it in a loop of 5 iterations: see attached mpmath_bench.py. I ran this benchmark on Linux with 4 isolated CPUs (/sys/devices/system/cpu/isolated=2-3,6-7).
    http://haypo-notes.readthedocs.org/misc.html#reliable-micro-benchmarks

    On such setup, the benchmark looks stable. Example:

    Run #1/5: 12.28 sec
    Run #2/5: 12.27 sec
    Run #3/5: 12.29 sec
    Run #4/5: 12.28 sec
    Run #5/5: 12.30 sec

    test_polyroots_legendre (min of 5 runs):

    • Original: 12.51 sec
• fastint5_4.patch: 12.27 sec (-1.9%)
    • fastint6.patch: 12.21 sec (-2.4%)

    I ran tests without GMP, to stress the Python int type.

I guess that the benchmark is dominated by CPU time spent computing operations on large Python ints, not by the time spent in ceval.c. So the speedup is low (2%). Such a use case doesn't seem to benefit from the micro-optimization discussed in this issue.

mpmath is an arbitrary-precision floating-point arithmetic library using Python int (or GMP if available).

    @vstinner
    Member Author

Maybe we should adopt a different approach.

There is something called "inline caching": the cache is stored between instructions, in the same memory block. Example of a paper on CPython:

    "Efficient Inline Caching without Dynamic Translation" by Stefan Brunthaler (2009)
    https://www.sba-research.org/wp-content/uploads/publications/sac10.pdf

Maybe we can build something on top of issue bpo-26219, "implement per-opcode cache in ceval"?
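
Purely as an illustration of the idea (not code from bpo-26219 or the paper), a per-opcode inline cache can be as small as a record of the operand types last seen at a given BINARY_ADD site, stored next to the bytecode so the eval loop can check it before falling back to the generic path:

typedef struct {
    PyTypeObject *left_type;    /* operand types observed at this site */
    PyTypeObject *right_type;
    unsigned int  hits;         /* how often the cached types matched  */
} BinaryOpCache;                /* hypothetical name, for illustration */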

    @skrah
    Mannequin

    skrah mannequin commented Apr 22, 2016

    bpo-14757 has an implementation of inline caching, which at least seemed to slow down some use cases. Then again, whenever someone posts a new speedup suggestion, it seems to slow down things I'm working on. At least Case van Horsen independently verified the phenomenon in this issue. :)

    @vstinner
    Member Author

    Between inline2.patch and fastint6.patch, it seems like inline2.patch is faster (between 9% and 12% faster than fastint6.patch).

    Microbenchmark on Python default (rev 554fb699af8c), compilation using LTO (./configure --with-lto), GCC 6.2.1 on Fedora 24, Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz, perf 0.8.3 (dev version, just after 0.8.2).

    Commands:

    ./python -m perf timeit --name='x+y' -s 'x=1; y=2' 'x+y' --dup 1000 -v -o timeit-$branch.json
    ./python -m perf timeit --name=sum -s "R=range(100)" "[x + x + 1 for x in R]" --dup 1000 -v --append timeit-$branch.json

    Results:

    $ python3 -m perf compare_to timeit-master.json timeit-inline2.json
    sum: Median +- std dev: [timeit-master] 6.23 us +- 0.13 us -> [timeit-inline2] 5.45 us +- 0.09 us: 1.14x faster
    x+y: Median +- std dev: [timeit-master] 15.0 ns +- 0.2 ns -> [timeit-inline2] 11.6 ns +- 0.2 ns: 1.29x faster
    
    $ python3 -m perf compare_to timeit-master.json timeit-fastint6.json 
    sum: Median +- std dev: [timeit-master] 6.23 us +- 0.13 us -> [timeit-fastint6] 6.09 us +- 0.11 us: 1.02x faster
    x+y: Median +- std dev: [timeit-master] 15.0 ns +- 0.2 ns -> [timeit-fastint6] 12.7 ns +- 0.2 ns: 1.18x faster
    
    $ python3 -m perf compare_to timeit-fastint6.json  timeit-inline2.json
    sum: Median +- std dev: [timeit-fastint6] 6.09 us +- 0.11 us -> [timeit-inline2] 5.45 us +- 0.09 us: 1.12x faster
    x+y: Median +- std dev: [timeit-fastint6] 12.7 ns +- 0.2 ns -> [timeit-inline2] 11.6 ns +- 0.2 ns: 1.09x faster

    @vstinner
    Member Author

    Result of performance 0.3.3 (and perf 0.8.3).

No major benchmark is faster. A few benchmarks seem to be even slower using fastint6.patch (but I don't really trust pybench).

    == fastint6.patch ==

    $ python3 -m perf compare_to master.json fastint6.json --group-by-speed --min-speed=5
    Slower (3):
    - pybench.ConcatUnicode: 52.7 ns +- 0.0 ns -> 56.1 ns +- 0.4 ns: 1.06x slower
    - pybench.ConcatStrings: 52.7 ns +- 0.3 ns -> 56.1 ns +- 0.1 ns: 1.06x slower
    - pybench.CompareInternedStrings: 16.5 ns +- 0.0 ns -> 17.4 ns +- 0.0 ns: 1.05x slower

    Faster (4):

- pybench.SimpleIntFloatArithmetic: 441 ns +- 2 ns -> 400 ns +- 6 ns: 1.10x faster
- pybench.SimpleIntegerArithmetic: 441 ns +- 2 ns -> 401 ns +- 5 ns: 1.10x faster
- pybench.SimpleLongArithmetic: 643 ns +- 4 ns -> 608 ns +- 6 ns: 1.06x faster
- genshi_text: 79.6 ms +- 0.5 ms -> 75.5 ms +- 0.8 ms: 1.05x faster

    Benchmark hidden because not significant (114): 2to3, call_method, (...)

    == inline2.patch ==

    haypo@selma$ python3 -m perf compare_to master.json inline2.json --group-by-speed --min-speed=5
    Faster (2):

- spectral_norm: 223 ms +- 1 ms -> 209 ms +- 1 ms: 1.07x faster
- pybench.SimpleLongArithmetic: 643 ns +- 4 ns -> 606 ns +- 7 ns: 1.06x faster

    Benchmark hidden because not significant (119): 2to3, call_method, (...)

    @vstinner
    Member Author

    fastint6_inline2_json.tar.gz: archive of JSON files

    • fastint6.json
    • inline2.json
    • master.json
    • timeit-fastint6.json
    • timeit-inline2.json
    • timeit-master.json

    @vstinner
    Member Author

The fastest patch (inline2.patch) has a negligible impact on benchmarks. The purpose of an optimization is to make Python faster; that's not the case here, so I'm closing the issue.

Using timeit, the largest speedup is 1.29x faster. Using performance, spectral_norm is 1.07x faster and pybench.SimpleLongArithmetic is 1.06x faster. I consider spectral_norm and pybench.SimpleLongArithmetic to be microbenchmarks, and so not representative of a real application.

The issue was fun, thank you for playing the game of micro-optimization with me ;-) Let's move on to more interesting optimizations with a larger impact on more realistic workloads, like caching global variables, optimizing method calls, fastcalls, etc.

    @python-dev
    Mannequin

    python-dev mannequin commented Oct 20, 2016

    New changeset 61fcb12a9873 by Victor Stinner in branch 'default':
    Issue bpo-21955: Please don't try to optimize int+int
    https://hg.python.org/cpython/rev/61fcb12a9873

    @vstinner
    Member Author

    vstinner commented Oct 1, 2020

    New changeset bd0a08e by Victor Stinner in branch 'master':
    bpo-21955: Change my nickname in BINARY_ADD comment (GH-22481)
    bd0a08e

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022