ceval.c: implement fast path for integers with a single digit #66154
Comments
Python 2 has a fast path in ceval.c for operations (a+b, a-b, etc.) on small integers (the "int" type) if the operation does not overflow. We lost this fast path in Python 3 when we dropped the int type in favor of the long type. Antoine Pitrou proposed a fast path, but only for int singletons (integers in the range [-5; 255]): issue bpo-10044. His patch was rejected because it introduced undefined behaviour.

I propose to reimplement the Python 2 optimization for longs with a single digit, which are the most common numbers. Pseudo-code for BINARY_ADD:

```c
if (PyLong_CheckExact(x) && Py_ABS(Py_SIZE(x)) == 1
    && PyLong_CheckExact(y) && Py_ABS(Py_SIZE(y)) == 1)
{
    stwodigits a = ..., b = ...;
    stwodigits c;
    if (... a+b will not overflow ...) {
        c = a + b;
        return PyLong_FromLongLong(c);
    }
}
/* fall back to PyNumber_Add() */
```

The code can be copied from longobject.c, which already has fast paths for single-digit numbers. See for example long_mul():

```c
/* fast path for single-digit multiplication */
if (Py_ABS(Py_SIZE(a)) <= 1 && Py_ABS(Py_SIZE(b)) <= 1) {
    ....
}
```

As with any other optimization, it should be proved to be faster with benchmarks.
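For illustration, a fleshed-out version of that pseudocode might look like the sketch below. It relies on CPython's internal long representation of that era (Py_SIZE() encodes the sign and digit count, ob_digit holds the digits); the helper name one_digit_value() is made up, not part of any attached patch, and <= 1 is used so that zero also takes the fast path:

```c
#include "Python.h"
#include "longintrepr.h"   /* PyLongObject, digit, stwodigits */

/* Value of a PyLong known to have at most one digit (|Py_SIZE| <= 1). */
static inline stwodigits
one_digit_value(PyObject *op)
{
    Py_ssize_t size = Py_SIZE(op);
    stwodigits v = (size == 0) ? 0
                 : (stwodigits)((PyLongObject *)op)->ob_digit[0];
    return (size < 0) ? -v : v;
}

/* Hypothetical single-digit fast path for BINARY_ADD. */
static PyObject *
fast_add(PyObject *x, PyObject *y)
{
    if (PyLong_CheckExact(x) && Py_ABS(Py_SIZE(x)) <= 1 &&
        PyLong_CheckExact(y) && Py_ABS(Py_SIZE(y)) <= 1)
    {
        /* The sum of two single-digit values cannot overflow stwodigits. */
        stwodigits c = one_digit_value(x) + one_digit_value(y);
        return PyLong_FromLongLong((long long)c);
    }
    return PyNumber_Add(x, y);   /* generic fallback */
}
``` |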
On: "if (... a+b will not overflow ...) {"

Since you limited the optimization for addition to single-digit numbers, at least for addition and subtraction, overflow is impossible. The signed twodigits type you use for the result is guaranteed to be able to store far larger numbers than addition of single digits can produce. In fact, due to the extra wasted bit on large (30-bit) digits, if you used a fixed-width 32-bit type for addition/subtraction, and a fixed-width 64-bit type for multiplication, overflow would be impossible regardless of whether you used 15- or 30-bit digits.

On a related note: presumably you should check abs(size) <= 1 like in longobject.c, not == 1, or you omit the fast path for 0. Doesn't come up much, and it's not worth paying extra to optimize, but it costs nothing to handle it.
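To make the overflow argument concrete, here are the bounds, assuming 30-bit digits (the 15-bit configuration is analogous with smaller numbers):

```c
#include <stdint.h>

/* With PYLONG_BITS_IN_DIGIT == 30, any single-digit value satisfies
 *   |a|, |b| <= 2**30 - 1
 * so
 *   |a + b| <= 2**31 - 2               -- fits in a signed 32-bit type
 *   |a * b| <= (2**30 - 1)**2 < 2**60  -- fits in a signed 64-bit type
 * and stwodigits (a signed 64-bit type in this configuration) can never
 * overflow on single-digit addition, subtraction or multiplication. */
_Static_assert(((INT64_C(1) << 30) - 1) * ((INT64_C(1) << 30) - 1)
               < (INT64_C(1) << 60),
               "single-digit product fits comfortably in 64 bits");
``` |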
Let's try. As I understand it, bpo-10044 was rejected because it complicated the code too much. Maybe a new attempt will be more successful. |
Serhiy Storchaka added the comment:
I read that Mark rejected issue bpo-10044 because it introduces undefined behaviour. |
I'm not interested in working on this issue right now. If anyone is interested, feel free to pick it up. |
There also used to be a fast path for binary subscript operations (BINARY_SUBSCR) with integer indexes. I would like to see that performance regression fixed if it can be done cleanly.
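For reference, the Python 2 fast path special-cased list[int]; a hedged sketch of the equivalent idea in Python 3 terms (fast_subscr is a hypothetical name, not the actual patch from this issue) could be:

```c
#include "Python.h"

/* Hypothetical BINARY_SUBSCR fast path for list[int], in the spirit of
   Python 2's ceval.c. */
static PyObject *
fast_subscr(PyObject *container, PyObject *sub)
{
    if (PyList_CheckExact(container) && PyLong_CheckExact(sub)) {
        Py_ssize_t i = PyLong_AsSsize_t(sub);
        if (i >= 0 && i < PyList_GET_SIZE(container)) {
            /* In-range, non-negative index: fetch the item directly. */
            PyObject *res = PyList_GET_ITEM(container, i);
            Py_INCREF(res);
            return res;
        }
        PyErr_Clear();   /* clear a possible OverflowError from the cast */
    }
    /* Negative, out-of-range or non-list cases: generic protocol. */
    return PyObject_GetItem(container, sub);
}
``` |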
So I'm trying something pretty similar to Victor's pseudo-code and just using timeit to look for speedups afterwards: small improvement. Haven't looked at optimizing BINARY_SUBSCR yet. |
Thank you Zach. I even found a small regression.

Before:

```
$ ./python -m timeit -s "x = 10" "x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x"
1000000 loops, best of 3: 1.51 usec per loop
```

After:

```
$ ./python -m timeit -s "x = 10" "x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x"
1000000 loops, best of 3: 1.6 usec per loop
``` |
bench_long.py: micro-benchmark for x+y. I confirm a slowdown with 21955.patch. IMO you should at least inline PyLong_AsLong(), which can be simplified if the number has 0 or 1 digit. Here is my patch "inline.patch", which is 21955.patch with PyLong_AsLong() inlined.

Benchmark result (patch=21955.patch, inline=inline.patch):

```
Common platform:
Platform of campaign orig:
Platform of campaign patch:
Platform of campaign inline:
--------------------+-------------+---------------+---------------
[table body not preserved]
```

(I removed my earlier message because I posted the wrong benchmark output; the inline column was missing.) |
Confirmed a speedup of about 20%. Surprisingly it affects even integers outside of the range of preallocated small integers (-5...255).

Before:

```
$ ./python -m timeit -s "x=10" "x+x"
10000000 loops, best of 3: 0.143 usec per loop
$ ./python -m timeit -s "x=1000" "x+x"
1000000 loops, best of 3: 0.247 usec per loop
```

After:

```
$ ./python -m timeit -s "x=10" "x+x"
10000000 loops, best of 3: 0.117 usec per loop
$ ./python -m timeit -s "x=1000" "x+x"
1000000 loops, best of 3: 0.209 usec per loop
```

All measures are made with a modified timeit (bpo-21988). |
Well, don't I feel silly. I confirmed both my regression and the inline speedup using the benchmark Victor added. I wonder if I got my binaries backwards in my first test... |
I did something similar for BINARY_SUBSCR after looking at the 2.7 source, as Raymond suggested. Hopefully I got my binaries straight this time :) The new patch includes Victor's inlining and my new subscript changes.

```
Platform of campaign orig:
Platform of campaign patch:
---------------------+-------------+---------------
[table body not preserved]
``` |
Please run the actual benchmark suite to get interesting numbers: http://hg.python.org/benchmarks |
I ran the whole benchmark suite. There are a few that are slower: call_method_slots, float, pickle_dict, and unpack_sequence.

```
Report on Linux zach-vbox 3.2.0-24-generic-pae #39-Ubuntu SMP Mon May 21 18:54:21 UTC 2012 i686 i686

Benchmarks: 2to3, call_method_slots, call_method_unknown, call_simple, chaos,
django_v2, fastpickle, float, json_dump_v2, mako, mako_v2, meteor_contest,
nbody, nqueens, pickle_dict, raytrace, regex_effbot, regex_v8, richards,
silent_logging, simple_logging, spectral_norm, tornado_http, unpack_sequence,
unpickle_list
[per-benchmark results not preserved]
``` |
What's the status of this issue? |
I haven't looked at it since I posted the benchmark results for 21955_2.patch. |
Anybody still looking at this? I can take another stab at it if it's still in scope. |
There were some visible speedups from your patch -- I think we should merge this optimization. Can you figure out why unpack_sequence and other benchmarks were slower? |
I'm assigning this patch to myself to commit it in 3.6 later. |
I took another look at this, tried applying it to 3.6, and ran the latest benchmarks. It applied cleanly, and the benchmark results were similar; this time unpack_sequence and spectral_norm were slower. spectral_norm makes sense: it's doing lots of FP addition. The UNPACK_SEQUENCE instruction looks like it already has optimizations for unpacking lists and tuples onto the stack, and running dis on the test showed that it's completely dominated by calls to UNPACK_SEQUENCE, LOAD_FAST, and STORE_FAST, so I still don't know what's going on there. |
Any change that increases the cache or branch predictor footprint of the evaluation loop may make the interpreter slower, even if the change doesn't seem related to a particular benchmark. That may be the reason here. |
unpack_sequence contains 400 lines of this: "a, b, c, d, e, f, g, h, i, j = to_unpack". This code doesn't even touch BINARY_SUBSCR or BINARY_ADD. Zach, could you please run your benchmarks in rigorous mode (perf.py -r)? I'd also suggest experimenting with putting the baseline CPython as the first arg and as the second -- maybe your machine runs the second interpreter slightly faster. |
I ran 6 benchmarks on my work machine (not the same one as the last set) overnight. unpack_sequence is nowhere to be seen and spectral_norm is faster now... |
Attaching a new patch -- rewritten to optimize -, *, +, -=, *= and +=. I also removed the optimization of the [] operator -- that should be done in a separate patch and in a separate issue.

Some nano-benchmarks (best of 3):

```
python -m timeit "sum([x + x + 1 for x in range(100)])"
python -m timeit "sum([x - x - 1 for x in range(100)])"
python -m timeit "sum([x * x * 1 for x in range(100)])"
```

Python 3.6 vs 3.5 (spectral_norm, rigorous run): [numbers not preserved]

Zach, thanks a lot for the research! I'm glad that unpack_sequence finally proved to be irrelevant. Could you please take a look at the updated patch?
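For a sense of what a compact, ceval.c-only change looks like, here is a hedged sketch of such a per-operator fast-path macro (the real fastint5 macro differs; one_digit_value() is the hypothetical helper sketched earlier in this thread, and POP/TOP/SECOND/SET_TOP/DISPATCH are ceval.c's stack macros):

```c
/* Hedged sketch, not the actual patch: used as FAST_BINARY_OP(+) at the
   top of the BINARY_ADD case, FAST_BINARY_OP(*) for BINARY_MULTIPLY, etc.
   On a type mismatch it falls through to the generic implementation. */
#define FAST_BINARY_OP(OP)                                              \
    do {                                                                \
        PyObject *_right = TOP();                                       \
        PyObject *_left = SECOND();                                     \
        if (PyLong_CheckExact(_left) && Py_ABS(Py_SIZE(_left)) <= 1 &&  \
            PyLong_CheckExact(_right) && Py_ABS(Py_SIZE(_right)) <= 1)  \
        {                                                               \
            /* single-digit operands: compute directly in C */          \
            stwodigits _a = one_digit_value(_left);                     \
            stwodigits _b = one_digit_value(_right);                    \
            PyObject *_res = PyLong_FromLongLong((long long)(_a OP _b));\
            (void)POP();            /* drop the right operand */        \
            SET_TOP(_res);          /* replace the left operand */      \
            Py_DECREF(_left);                                           \
            Py_DECREF(_right);                                          \
            if (_res == NULL)                                           \
                goto error;                                             \
            DISPATCH();                                                 \
        }                                                               \
    } while (0)
```

Since ints are immutable, the in-place variants (INPLACE_ADD, INPLACE_MULTIPLY, ...) can reuse the same macro. |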
If you only want to benchmark x*y, x+y and list comprehensions, you can simply use timeit micro-benchmarks (e.g. the bench_long.py script attached earlier). |
I don't understand what this table means (why 4 columns?). Can you explain what you did? |
I agree that optimizing only a very few cases may be better. We need to collect statistics on how often different operations are used with different types in long runs of tests or benchmarks. If, say, division is used 100 times less than addition, we shouldn't complicate the ceval loop to optimize it.
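One hedged way to gather such numbers (purely illustrative instrumentation, not an attached patch; record_add_types and add_stats are made-up names) would be to count operand-type pairs at the top of BINARY_ADD and dump the counters at interpreter exit:

```c
#include "Python.h"

/* Illustrative counters for operand-type pairs seen by BINARY_ADD. */
static unsigned long long add_stats[4];   /* int+int, float+float,
                                             str+str, everything else */

static inline void
record_add_types(PyObject *left, PyObject *right)
{
    if (PyLong_CheckExact(left) && PyLong_CheckExact(right))
        add_stats[0]++;
    else if (PyFloat_CheckExact(left) && PyFloat_CheckExact(right))
        add_stats[1]++;
    else if (PyUnicode_CheckExact(left) && PyUnicode_CheckExact(right))
        add_stats[2]++;
    else
        add_stats[3]++;
}
``` |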
Benchmark on inline-2.patch. No speedup, only slowdown. I'm now running the benchmark on fastint5_4.patch.

```
$ python3 -u perf.py --affinity=2-3,6-7 --rigorous ../default/python.orig ../default/python.inline-2
Report on Linux smithers 4.3.4-300.fc23.x86_64 #1 SMP Mon Jan 25 13:39:23 UTC 2016 x86_64 x86_64

### json_load ###
### regex_v8 ###
[results not preserved]

The following not significant results are hidden, use -v to show them:

real 58m32.662s
``` |
Benchmark on fastint5_4.patch.

```
$ python3 -u perf.py --affinity=2-3,6-7 --rigorous ../default/python.orig ../default/python_fastint5_4
Report on Linux smithers 4.3.4-300.fc23.x86_64 #1 SMP Mon Jan 25 13:39:23 UTC 2016 x86_64 x86_64

### django_v3 ###
### fastunpickle ###
### json_dump_v2 ###
### nbody ###
### regex_v8 ###
[results not preserved]

The following not significant results are hidden, use -v to show them:
``` |
I think this is a random fluctuation; that benchmark (and the re lib) doesn't use the operators much. It can't be THAT much slower just because of optimizing a few binops. |
You're also running a very small subset of all benchmarks available. Please try the '-b all' option. I'll also run benchmarks on my machines. |
Alright, I ran a few benchmarks myself. In rigorous mode regex_v8 has the same performance on my 2013 MacBook Pro and on an 8-year-old i7 CPU (Linux). Here are the results of "perf.py -b raytrace,spectral_norm,meteor_contest,nbody ../cpython/python.exe ../cpython-git/python.exe -r":

fastint5:

```
### nbody ###
### spectral_norm ###
[results not preserved]

The following not significant results are hidden, use -v to show them:
```

======

inline-2:

```
### raytrace ###
### spectral_norm ###
[results not preserved]

The following not significant results are hidden, use -v to show them:
``` |
From what I can see there is no negative impact of the patch on stable macro benchmarks. There is quite a detectable positive impact on most integer and float operations from my patch; 13-16% on the nbody and spectral_norm benchmarks is still impressive. And you can see a huge improvement in various timeit micro-benchmarks. fastint5 is a very compact patch that only touches the ceval.c file. It doesn't complicate the code, as the macro is very straightforward. Since the patch has passed the code review, thorough benchmarking and discussion stages, I'd like to commit it. |
Please don't commit it right now. Yes, due to using macros the patch looks simple, but the macros expand to complex code. We need more statistics. |
But what will you use to gather statistics? The test suite isn't representative, and we already know what the benchmark suite will show. I can assist with writing some code for stats, but what's the plan? |
Can I suggest the mpmath test suite as a good benchmark? I've used it to test the various optimizations in gmpy2 and it has always been a valuable real-world benchmark. And it is slower in Python 3 than in Python 2... |
Be careful with test suites: first, they might exercise code that would never be a critical point for performance in a real-world application; second and most important, unittest seems to have gotten slower between 2.x and 3.x, so you would really be comparing apples to oranges. |
Attaching another patch, fastint6.patch, that only optimizes longs (no FP fast path).
What benchmark did you use? What were the numbers? I'm asking because before, you benchmarked different patches that are conceptually similar to fastint5, and the result was that decimal was 5% faster with fast paths for just longs, and 6% slower with fast paths for longs & floats.

Also, some quick timeit results (quite stable from run to run):

```
-m timeit -s "x=2" "x + 10 + x*20 + x*10 + 20 - x"
-m timeit -s "x=2" "x*2.2 + 2 + x*2.5 + 1.0 - x/2.0 + (x+0.1)/(x-0.1)*2 + (x+10)*(x-30)"
``` |
Yury Selivanov:

I'm disappointed by the results. In short, these patches have *no* impact on macro benchmarks, other than the two which are stressing the int and float types. Maybe we are just losing our time on this issue... I understand that the patches are only useful to get xx% speedup (where xx% is smaller than 25%) if your whole application is blocked by numeric computations. If that's the case, I would suggest moving to PyPy, Numba, Cython, etc. I expect something more interesting than xx% faster: a much more impressive speedup. http://speed.pypy.org/ : PyPy/CPython 2.7 for spectral_norm is 0.04: 25x faster. For nbody (nbody_modified), it's 0.09: 11x faster. With the patches of this issue, the *best* speedup is only 1.16x faster... We are *very* far from 11x or 25x faster. It's not even 2x faster...

Yury Selivanov:

According to my micro-benchmark msg259706, inline-2.patch is faster than fastint5_4.patch. I would suggest "finishing" inline-2.patch to optimize other operations, and *maybe* adding a fast path for float. On the macro benchmarks, inline-2.patch is slower than fastint5_4.patch, but that was easy to expect since I only added fast paths for int-int and only for a few operators. The question is whether it is worth getting xx% speedup on one or two specific benchmarks where CPython really sucks compared to other languages and other implementations of Python...

Stefan Krah:

How do you run your benchmark?

Case Van Horsen:

Where can we find this benchmark?

Case Van Horsen:

What do you mean by "real-world benchmark"? :-) |
mpmath is a library for arbitrary-precision floating-point arithmetic. It uses Python's native long type or gmpy2's mpz type for computations. It is available at https://pypi.python.org/pypi/mpmath. The test suite can be run directly from the source tree. It includes timing information for individual tests and for the entire run. Sample invocation:

```
~/src/mpmath-0.19/mpmath/tests$ time py36 runtests.py -local
```

For example, I've tried to optimize gmpy2's handling of binary operations between its mpz type and short Python integers. I've found it to provide useful results: improvements that are significant on a micro-benchmark (say 20%) will usually cause a small but repeatable improvement, and some improvements that looked good on a micro-benchmark would slow down mpmath.

I ran the mpmath test suite with Python 3.6 and with the fastint6 patch. The overall improvement when using Python's long type was about 1%. When using gmpy2's mpz type, there was a slowdown of about 2%. I will run more tests tonight. |
Please try to test fastint5 too (fast paths for long & floats, whereas fastint6 is only focused on longs). |
I ran the mpmath test suite with the fastint6 and fastint5_4 patches.

fastint6 results without gmpy: 0.25% faster
fastint5_4 results without gmpy: 1.5% slower |
Case Van Horsen added the comment:
I'm more and more disappointed by this issue... If even a test suite doing a lot of integer operations is not measurably faster, the optimization doesn't look useful in practice. Maybe we should just close the issue? |
I'll take a closer look at gmpy later. Please don't close. |
I extracted the slowest test (test_polyroots_legendre) and put it in a loop of 5 iterations: see attached mpmath_bench.py. I ran this benchmark on Linux with 4 isolated CPUs (/sys/devices/system/cpu/isolated=2-3,6-7). On such a setup, the benchmark looks stable. Example:

```
Run #1/5: 12.28 sec
```

test_polyroots_legendre (min of 5 runs): [numbers not preserved]
I ran the tests without GMP, to stress the Python int type. I guess that the benchmark is dominated by CPU time spent computing operations on large Python ints, not by the time spent in ceval.c, so the speedup is low (2%). Such a use case doesn't seem to benefit from the micro-optimizations discussed in this issue. mpmath does arbitrary-precision floating-point arithmetic using Python ints (or GMP if available). |
Maybe we should adopt a different approach. There is something called "inline caching": put the cache between instructions, in the same memory block. Example of a paper on CPython: "Efficient Inline Caching without Dynamic Translation" by Stefan Brunthaler (2009). Maybe we can build something on top of issue bpo-26219, "implement per-opcode cache in ceval"?
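To illustrate the idea only (this is a made-up layout, not how bpo-26219 or Brunthaler's scheme is implemented): an inline cache stores per-instruction type feedback next to the instruction, so a repeated operation on the same types can skip the generic dispatch:

```c
#include "Python.h"

/* Conceptual inline-cache slot for one BINARY_ADD site (hypothetical). */
typedef PyObject *(*binop_handler)(PyObject *, PyObject *);

typedef struct {
    PyTypeObject *left_type;    /* operand types seen on the last hit */
    PyTypeObject *right_type;
    binop_handler handler;      /* handler specialized for those types */
} BinopCache;

static PyObject *
cached_binary_add(BinopCache *cache, PyObject *left, PyObject *right)
{
    if (Py_TYPE(left) == cache->left_type &&
        Py_TYPE(right) == cache->right_type)
    {
        /* Cache hit: skip the generic nb_add resolution entirely. */
        return cache->handler(left, right);
    }
    /* Cache miss: do the generic operation and remember the types.
       A real implementation would install a specialized handler here. */
    cache->left_type = Py_TYPE(left);
    cache->right_type = Py_TYPE(right);
    cache->handler = PyNumber_Add;
    return PyNumber_Add(left, right);
}
```

The win is that a monomorphic site pays a couple of pointer comparisons instead of the full binary-op resolution; the cost is exactly the extra memory and branch-predictor footprint Antoine mentioned above. |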
bpo-14757 has an implementation of inline caching, which at least seemed to slow down some use cases. Then again, whenever someone posts a new speedup suggestion, it seems to slow down things I'm working on. At least Case van Horsen independently verified the phenomenon in this issue. :) |
Between inline2.patch and fastint6.patch, it seems like inline2.patch is faster (between 9% and 12% faster than fastint6.patch).

Microbenchmark on Python default (rev 554fb699af8c), compilation using LTO (./configure --with-lto), GCC 6.2.1 on Fedora 24, Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz, perf 0.8.3 (dev version, just after 0.8.2). Commands:

```
./python -m perf timeit --name='x+y' -s 'x=1; y=2' 'x+y' --dup 1000 -v -o timeit-$branch.json
```

Results:

```
$ python3 -m perf compare_to timeit-master.json timeit-inline2.json
sum: Median +- std dev: [timeit-master] 6.23 us +- 0.13 us -> [timeit-inline2] 5.45 us +- 0.09 us: 1.14x faster
x+y: Median +- std dev: [timeit-master] 15.0 ns +- 0.2 ns -> [timeit-inline2] 11.6 ns +- 0.2 ns: 1.29x faster

$ python3 -m perf compare_to timeit-master.json timeit-fastint6.json
sum: Median +- std dev: [timeit-master] 6.23 us +- 0.13 us -> [timeit-fastint6] 6.09 us +- 0.11 us: 1.02x faster
x+y: Median +- std dev: [timeit-master] 15.0 ns +- 0.2 ns -> [timeit-fastint6] 12.7 ns +- 0.2 ns: 1.18x faster

$ python3 -m perf compare_to timeit-fastint6.json timeit-inline2.json
sum: Median +- std dev: [timeit-fastint6] 6.09 us +- 0.11 us -> [timeit-inline2] 5.45 us +- 0.09 us: 1.12x faster
x+y: Median +- std dev: [timeit-fastint6] 12.7 ns +- 0.2 ns -> [timeit-inline2] 11.6 ns +- 0.2 ns: 1.09x faster
``` |
Results with performance 0.3.3 (and perf 0.8.3). No major benchmark is faster. A few benchmarks even seem to be slower using fastint6.patch (but I don't really trust pybench).

== fastint6.patch ==

```
$ python3 -m perf compare_to master.json fastint6.json --group-by-speed --min-speed=5
Slower (3):
- pybench.ConcatUnicode: 52.7 ns +- 0.0 ns -> 56.1 ns +- 0.4 ns: 1.06x slower
- pybench.ConcatStrings: 52.7 ns +- 0.3 ns -> 56.1 ns +- 0.1 ns: 1.06x slower
- pybench.CompareInternedStrings: 16.5 ns +- 0.0 ns -> 17.4 ns +- 0.0 ns: 1.05x slower
Faster (4): (...)
Benchmark hidden because not significant (114): 2to3, call_method, (...)
```

== inline2.patch ==

```
haypo@selma$ python3 -m perf compare_to master.json inline2.json --group-by-speed --min-speed=5
Benchmark hidden because not significant (119): 2to3, call_method, (...)
``` |
fastint6_inline2_json.tar.gz: archive of JSON files
|
The fastest patch (inline2.patch) has a negligible impact on benchmarks. The purpose of an optimization is to make Python faster; that's not the case here, so I am closing the issue. Using timeit, the largest speedup is 1.29x. Using performance, spectral_norm is 1.07x faster and pybench.SimpleLongArithmetic is 1.06x faster. I consider that spectral_norm and pybench.SimpleLongArithmetic are microbenchmarks and so not representative of a real application. The issue was fun, thank you for playing the game of micro-optimization with me ;-) Let's move on to more interesting optimizations with a larger impact on more realistic workloads, like caching global variables, optimizing method calls, fastcalls, etc. |
New changeset 61fcb12a9873 by Victor Stinner in branch 'default': |