classification
Title: ceval.c: implement fast path for integers with a single digit
Type: performance Stage: patch review
Components: Interpreter Core Versions: Python 3.6
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: yselivanov Nosy List: Yury.Selivanov, casevh, josh.r, lemburg, mark.dickinson, pitrou, python-dev, rhettinger, serhiy.storchaka, skrah, vstinner, yselivanov, zbyrne
Priority: normal Keywords: patch

Created on 2014-07-11 09:10 by vstinner, last changed 2016-10-20 10:19 by python-dev. This issue is now closed.

Files
File name Uploaded Description Edit
21955.patch zbyrne, 2014-07-16 00:29 review
bench_long.py vstinner, 2014-07-16 08:13
inline.patch vstinner, 2014-07-16 08:13 review
21955_2.patch zbyrne, 2014-07-22 02:34 review
bench_results.txt zbyrne, 2016-02-03 16:40
fastint1.patch yselivanov, 2016-02-03 17:00 review
fastint2.patch yselivanov, 2016-02-04 06:02 review
fastint_alt.patch serhiy.storchaka, 2016-02-04 10:30 review
fastintfloat_alt.patch serhiy.storchaka, 2016-02-04 16:36 review
fastint4.patch yselivanov, 2016-02-05 01:37 review
fastint5.patch yselivanov, 2016-02-05 04:04 review
bench_long2.py vstinner, 2016-02-05 15:58
compare.txt vstinner, 2016-02-05 15:58
compare_to.txt vstinner, 2016-02-05 15:58
fastint5_2.patch yselivanov, 2016-02-06 00:10 review
fastint5_3.patch yselivanov, 2016-02-06 00:45 review
fastint5_4.patch yselivanov, 2016-02-06 01:29 review
inline-2.patch vstinner, 2016-02-06 01:31 review
fastint6.patch yselivanov, 2016-02-07 21:32 review
mpmath_bench.py vstinner, 2016-02-10 10:02
fastint6_inline2_json.tar.gz vstinner, 2016-10-20 09:31
Messages (110)
msg222731 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-07-11 09:10
Python 2 has a fast path in ceval.c for operations (a+b, a-b, etc.) on small integers (the "int" type) if the operation does not overflow.

We lost these fast paths in Python 3 when we dropped the int type in favor of the long type.

Antoine Pitrou proposed a fast path, but only for int singletons (integers in the range [-5; 255]) in issue #10044. His patch was rejected because it introduced undefined behaviour.

I propose to reimplement the Python 2 optimization for longs with a single digit, which are the most common numbers.

Pseudo-code for BINARY_ADD:
---
if (PyLong_CheckExact(x) && Py_ABS(Py_SIZE(x)) == 1
    && PyLong_CheckExact(y) && Py_ABS(Py_SIZE(y)) == 1)
{
   stwodigits a = ..., b = ...;
   stwodigits c;
   if (... a+b will not overflow ...) { 
      c = a + b;
      return PyLong_FromLongLong(c);
   }
}
/* fall back to PyNumber_Add() */
---

The code can be copied from longobject.c; there are already fast paths for single-digit numbers. See for example long_mul():
---
    /* fast path for single-digit multiplication */
    if (Py_ABS(Py_SIZE(a)) <= 1 && Py_ABS(Py_SIZE(b)) <= 1) {
        ....
    }
---

As any other optimization, it should be proved to be faster with benchmarks.
msg222804 - (view) Author: Josh Rosenberg (josh.r) * Date: 2014-07-11 22:23
On:  if (... a+b will not overflow ...) { 

Since you limited the optimization for addition to single digit numbers, at least for addition and subtraction, overflow is impossible. The signed twodigit you use for the result is guaranteed to be able to store far larger numbers than addition of single digits can produce. In fact, due to the extra wasted bit on large (30 bit) digits, if you used a fixed width 32 bit type for addition/subtraction, and a fixed width 64 bit type for multiplication, overflow would be impossible regardless of whether you used 15 or 30 bit digits.

On a related note: presumably you should check whether abs(size) <= 1, as in longobject.c, not == 1, or you omit the fast path for 0. It doesn't come up much and isn't worth paying extra to optimize, but it costs nothing to handle.
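As a quick sanity check of the overflow argument above (not part of any patch here; the digit widths are those of CPython's longintrepr.h, where PyLong_SHIFT is 15 or 30):

```python
# CPython stores longs as arrays of 15- or 30-bit digits (PyLong_SHIFT).
# Verify that two single-digit operands can never overflow the fixed-width
# C types discussed above, for either digit size.
for shift in (15, 30):
    max_digit = 2**shift - 1  # largest magnitude of a single-digit long

    # addition/subtraction of two single-digit longs fits a signed 32-bit type
    assert max_digit + max_digit <= 2**31 - 1

    # multiplication of two single-digit longs fits a signed 64-bit type
    assert max_digit * max_digit <= 2**63 - 1

print("single-digit add/sub/mul cannot overflow")
```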
msg222824 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-07-12 07:19
Let's try. As I understand, issue10044 was rejected because it complicates the code too much. Maybe a new attempt will be more successful.
msg222829 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-07-12 09:01
Serhiy Storchaka added the comment:
> Let's try. As I understand, issue10044 was rejected because it complicates the code too much. Maybe a new attempt will be more successful.

I read that Mark rejected issue #10044 because it introduced
undefined behaviour.
msg222830 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-07-12 09:01
I'm not interested in working on this issue right now. If anyone is
interested, please go ahead!
msg222985 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2014-07-14 00:42
There also used to be a fast path for binary subscriptions with integer indexes.  I would like to see that performance regression fixed if it can be done cleanly.
msg223162 - (view) Author: Zach Byrne (zbyrne) * Date: 2014-07-16 00:29
So I'm trying something pretty similar to Victor's pseudo-code and just using timeit to look for speedups:
timeit('x+x', 'x=10', number=10000000)
before:
1.1934231410000393
1.1988609210002323
1.1998214110003573
1.206968028999654
1.2065417159997196

after:
1.1698650090002047
1.1705158909999227
1.1752884750003432
1.1744818619999933
1.1741297110002051
1.1760422649999782

Small improvement. Haven't looked at optimizing BINARY_SUBSCR yet.
msg223177 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-07-16 06:23
Thank you, Zach. I even found a small regression.

Before:

$ ./python -m timeit -s "x = 10"  "x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x"
1000000 loops, best of 3: 1.51 usec per loop

After:

$ ./python -m timeit -s "x = 10"  "x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x"
1000000 loops, best of 3: 1.6 usec per loop
msg223180 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-07-16 08:14
bench_long.py: micro-benchmark for x+y. I confirm a slowdown with 21955.patch. IMO you should at least inline PyLong_AsLong(), which can be simplified if the number has 0 or 1 digit. Here is my patch "inline.patch", which is 21955.patch with PyLong_AsLong() inlined.

Benchmark result (patch=21955.patch, inline=inline.patch):

Common platform:
Platform: Linux-3.14.8-200.fc20.x86_64-x86_64-with-fedora-20-Heisenbug
CPU model: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
Bits: int=32, long=64, long long=64, size_t=64, void*=64
CFLAGS: -Wno-unused-result -Werror=declaration-after-statement -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
Python unicode implementation: PEP 393
Timer: time.perf_counter

Platform of campaign orig:
Date: 2014-07-16 10:04:27
Python version: 3.5.0a0 (default:08b3ee523577, Jul 16 2014, 10:04:23) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]
SCM: hg revision=08b3ee523577 tag=tip branch=default date="2014-07-15 13:23 +0300"
Timer precision: 40 ns

Platform of campaign patch:
Timer precision: 40 ns
Date: 2014-07-16 10:04:01
Python version: 3.5.0a0 (default:08b3ee523577+, Jul 16 2014, 10:02:12) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]
SCM: hg revision=08b3ee523577+ tag=tip branch=default date="2014-07-15 13:23 +0300"

Platform of campaign inline:
Timer precision: 31 ns
Date: 2014-07-16 10:11:21
Python version: 3.5.0a0 (default:08b3ee523577+, Jul 16 2014, 10:10:48) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]
SCM: hg revision=08b3ee523577+ tag=tip branch=default date="2014-07-15 13:23 +0300"

--------------------+-------------+---------------+---------------
Tests               |        orig |         patch |         inline
--------------------+-------------+---------------+---------------
1+2                 |   23 ns (*) |         24 ns |   21 ns (-12%)
"1+2" ran 100 times | 1.61 us (*) | 1.74 us (+7%) | 1.39 us (-14%)
--------------------+-------------+---------------+---------------
Total               | 1.64 us (*) | 1.76 us (+7%) | 1.41 us (-14%)
--------------------+-------------+---------------+---------------

(I removed my message because I posted the wrong benchmark output, inline column was missing.)
msg223186 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-07-16 09:28
Confirmed a speedup of about 20%. Surprisingly it affects even integers outside the range of preallocated small integers (-5...255).

Before:

$ ./python -m timeit -s "x=10"  "x+x"
10000000 loops, best of 3: 0.143 usec per loop
$ ./python -m timeit -s "x=1000"  "x+x"
1000000 loops, best of 3: 0.247 usec per loop

After:

$ ./python -m timeit -s "x=10"  "x+x"
10000000 loops, best of 3: 0.117 usec per loop
$ ./python -m timeit -s "x=1000"  "x+x"
1000000 loops, best of 3: 0.209 usec per loop

All measurements were made with a modified timeit (issue21988).
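The preallocated small-int cache mentioned above is observable from pure Python; a minimal demonstration (the exact cache bounds are a CPython implementation detail, and int(str) is used here to defeat compile-time constant folding):

```python
# CPython keeps one preallocated object per small integer, so equal small
# values are the very same object; larger ints are allocated per use.
small_a, small_b = int("100"), int("100")
big_a, big_b = int("10000"), int("10000")

print(small_a is small_b)  # True on CPython: 100 comes from the small-int cache
print(big_a is big_b)      # False on CPython: 10000 is a fresh object each time
```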
msg223214 - (view) Author: Zach Byrne (zbyrne) * Date: 2014-07-16 14:40
Well, don't I feel silly. I confirmed both my regression and the inline speedup using the benchmark Victor added. I wonder if I got my binaries backwards in my first test...
msg223623 - (view) Author: Zach Byrne (zbyrne) * Date: 2014-07-22 02:34
I did something similar to BINARY_SUBSCR after looking at the 2.7 source as Raymond suggested. Hopefully I got my binaries straight this time :) The new patch includes Victor's inlining and my new subscript changes.

Platform of campaign orig:
Python version: 3.5.0a0 (default:c8ce5bca0fcd+, Jul 15 2014, 18:11:28) [GCC 4.6.3]
Timer precision: 6 ns
Date: 2014-07-21 20:28:30

Platform of campaign patch:
Python version: 3.5.0a0 (default:c8ce5bca0fcd+, Jul 21 2014, 20:21:20) [GCC 4.6.3]
Timer precision: 20 ns
Date: 2014-07-21 20:28:39

---------------------+-------------+---------------
Tests                |        orig |          patch
---------------------+-------------+---------------
1+2                  |  118 ns (*) |  103 ns (-13%)
"1+2" ran 100 times  | 7.28 us (*) | 5.93 us (-19%)
x[1]                 |  120 ns (*) |   98 ns (-19%)
"x[1]" ran 100 times | 7.35 us (*) | 5.31 us (-28%)
---------------------+-------------+---------------
Total                | 14.9 us (*) | 11.4 us (-23%)
---------------------+-------------+---------------
msg223711 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-07-23 01:20
Please run the actual benchmark suite to get interesting numbers: http://hg.python.org/benchmarks
msg223726 - (view) Author: Zach Byrne (zbyrne) * Date: 2014-07-23 06:00
I ran the whole benchmark suite. There are a few that are slower: call_method_slots, float, pickle_dict, and unpack_sequence.

Report on Linux zach-vbox 3.2.0-24-generic-pae #39-Ubuntu SMP Mon May 21 18:54:21 UTC 2012 i686 i686
Total CPU cores: 1

### 2to3 ###
24.789549 -> 24.809551: 1.00x slower

### call_method_slots ###
Min: 1.743554 -> 1.780807: 1.02x slower
Avg: 1.751735 -> 1.792814: 1.02x slower
Significant (t=-26.32)
Stddev: 0.00576 -> 0.01823: 3.1660x larger

### call_method_unknown ###
Min: 1.828094 -> 1.739625: 1.05x faster
Avg: 1.852225 -> 1.806721: 1.03x faster
Significant (t=2.28)
Stddev: 0.01874 -> 0.24320: 12.9783x larger

### call_simple ###
Min: 1.353581 -> 1.263386: 1.07x faster
Avg: 1.397946 -> 1.302046: 1.07x faster
Significant (t=24.28)
Stddev: 0.03667 -> 0.03154: 1.1629x smaller

### chaos ###
Min: 1.199377 -> 1.115550: 1.08x faster
Avg: 1.230859 -> 1.146573: 1.07x faster
Significant (t=16.24)
Stddev: 0.02663 -> 0.02525: 1.0544x smaller

### django_v2 ###
Min: 2.682884 -> 2.633110: 1.02x faster
Avg: 2.747521 -> 2.690486: 1.02x faster
Significant (t=9.90)
Stddev: 0.02744 -> 0.03010: 1.0970x larger

### fastpickle ###
Min: 1.751475 -> 1.597340: 1.10x faster
Avg: 1.771805 -> 1.613533: 1.10x faster
Significant (t=64.81)
Stddev: 0.01177 -> 0.01263: 1.0727x larger

### float ###
Min: 1.254858 -> 1.293067: 1.03x slower
Avg: 1.336045 -> 1.365787: 1.02x slower
Significant (t=-3.30)
Stddev: 0.04851 -> 0.04135: 1.1730x smaller

### json_dump_v2 ###
Min: 17.871819 -> 16.968647: 1.05x faster
Avg: 18.428747 -> 17.483397: 1.05x faster
Significant (t=4.10)
Stddev: 1.60617 -> 0.27655: 5.8078x smaller

### mako ###
Min: 0.241614 -> 0.231678: 1.04x faster
Avg: 0.253730 -> 0.240585: 1.05x faster
Significant (t=8.93)
Stddev: 0.01912 -> 0.01327: 1.4417x smaller

### mako_v2 ###
Min: 0.225664 -> 0.213179: 1.06x faster
Avg: 0.234850 -> 0.225984: 1.04x faster
Significant (t=10.12)
Stddev: 0.01379 -> 0.01391: 1.0090x larger

### meteor_contest ###
Min: 0.777612 -> 0.758924: 1.02x faster
Avg: 0.799580 -> 0.780897: 1.02x faster
Significant (t=3.97)
Stddev: 0.02482 -> 0.02212: 1.1221x smaller

### nbody ###
Min: 0.969724 -> 0.883935: 1.10x faster
Avg: 0.996416 -> 0.918375: 1.08x faster
Significant (t=12.65)
Stddev: 0.02426 -> 0.03627: 1.4951x larger

### nqueens ###
Min: 1.142745 -> 1.128195: 1.01x faster
Avg: 1.296659 -> 1.162443: 1.12x faster
Significant (t=2.75)
Stddev: 0.34462 -> 0.02680: 12.8578x smaller

### pickle_dict ###
Min: 1.433264 -> 1.467394: 1.02x slower
Avg: 1.468122 -> 1.506908: 1.03x slower
Significant (t=-7.20)
Stddev: 0.02695 -> 0.02691: 1.0013x smaller

### raytrace ###
Min: 5.454853 -> 5.538799: 1.02x slower
Avg: 5.530943 -> 5.676983: 1.03x slower
Significant (t=-8.64)
Stddev: 0.05152 -> 0.10791: 2.0947x larger

### regex_effbot ###
Min: 0.205875 -> 0.194776: 1.06x faster
Avg: 0.211118 -> 0.198759: 1.06x faster
Significant (t=5.10)
Stddev: 0.01305 -> 0.01112: 1.1736x smaller

### regex_v8 ###
Min: 0.141628 -> 0.133819: 1.06x faster
Avg: 0.147024 -> 0.140053: 1.05x faster
Significant (t=2.72)
Stddev: 0.01163 -> 0.01388: 1.1933x larger

### richards ###
Min: 0.734472 -> 0.727501: 1.01x faster
Avg: 0.760795 -> 0.743484: 1.02x faster
Significant (t=3.50)
Stddev: 0.02778 -> 0.02127: 1.3061x smaller

### silent_logging ###
Min: 0.344678 -> 0.336087: 1.03x faster
Avg: 0.357982 -> 0.347361: 1.03x faster
Significant (t=2.76)
Stddev: 0.01992 -> 0.01852: 1.0755x smaller

### simple_logging ###
Min: 1.104831 -> 1.072921: 1.03x faster
Avg: 1.146844 -> 1.117068: 1.03x faster
Significant (t=4.02)
Stddev: 0.03552 -> 0.03848: 1.0833x larger

### spectral_norm ###
Min: 1.710336 -> 1.688910: 1.01x faster
Avg: 1.872578 -> 1.738698: 1.08x faster
Significant (t=2.35)
Stddev: 0.40095 -> 0.03331: 12.0356x smaller

### tornado_http ###
Min: 0.849374 -> 0.852209: 1.00x slower
Avg: 0.955472 -> 0.916075: 1.04x faster
Significant (t=4.82)
Stddev: 0.07059 -> 0.04119: 1.7139x smaller

### unpack_sequence ###
Min: 0.000030 -> 0.000020: 1.52x faster
Avg: 0.000164 -> 0.000174: 1.06x slower
Significant (t=-13.11)
Stddev: 0.00011 -> 0.00013: 1.2256x larger

### unpickle_list ###
Min: 1.333952 -> 1.212805: 1.10x faster
Avg: 1.373228 -> 1.266677: 1.08x faster
Significant (t=16.32)
Stddev: 0.02894 -> 0.03597: 1.2428x larger
msg238437 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-03-18 13:31
What's the status of this issue?
msg238455 - (view) Author: Zach Byrne (zbyrne) * Date: 2015-03-18 15:53
I haven't looked at it since I posted the benchmark results for 21955_2.patch.
msg258057 - (view) Author: Zach Byrne (zbyrne) * Date: 2016-01-12 02:10
Anybody still looking at this? I can take another stab at it if it's still in scope.
msg258060 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-01-12 02:41
> Anybody still looking at this? I can take another stab at it if it's still in scope.

There were some visible speedups from your patch -- I think we should merge this optimization.  Can you figure out why unpack_sequence and other benchmarks were slower?
msg258062 - (view) Author: Zach Byrne (zbyrne) * Date: 2016-01-12 03:37
> Can you figure out why unpack_sequence and other benchmarks were slower?
I didn't look really closely. A few of the slower ones were floating-point heavy, which would incur the slow-path penalty, but I can dig into unpack_sequence.
msg259417 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-02 18:55
I'm assigning this patch to myself to commit it in 3.6 later.
msg259428 - (view) Author: Zach Byrne (zbyrne) * Date: 2016-02-02 20:37
I took another look at this, and tried applying it to 3.6 and running the latest benchmarks. It applied cleanly, and the benchmark results were similar; this time unpack_sequence and spectral_norm were slower. Spectral norm makes sense: it's doing lots of FP addition. The unpack_sequence instruction looks like it already has optimizations for unpacking lists and tuples onto the stack, and running dis on the test showed that it's completely dominated by calls to unpack_sequence, load_fast, and store_fast, so I still don't know what's going on there.
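The dis observation above is easy to reproduce on a minimal version of the benchmark body (hypothetical function name; exact opcode names vary slightly across CPython versions):

```python
import dis

def unpack(to_unpack):
    # one line of the unpack_sequence benchmark body
    a, b, c, d, e, f, g, h, i, j = to_unpack

# list the opcodes the compiler emits for this function
ops = [ins.opname for ins in dis.get_instructions(unpack)]
print(ops)  # dominated by UNPACK_SEQUENCE / STORE_FAST; no BINARY_ADD or BINARY_SUBSCR
```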
msg259429 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-02-02 20:56
Any change that increases the cache or branch predictor footprint of the evaluation loop may make the interpreter slower, even if the change doesn't seem related to a particular benchmark. That may be the reason here.
msg259431 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-02 21:06
unpack_sequence contains 400 lines of this: "a, b, c, d, e, f, g, h, i, j = to_unpack".  This code doesn't even touch BINARY_SUBSCR or BINARY_ADD.

Zach, could you please run your benchmarks in rigorous mode (perf.py -r)?  I'd also suggest experimenting with passing the baseline cpython as the first arg and then as the second -- maybe your machine runs the second interpreter slightly faster.
msg259490 - (view) Author: Zach Byrne (zbyrne) * Date: 2016-02-03 16:40
I ran 6 benchmarks on my work machine(not the same one as the last set) overnight.
Two with just the BINARY_ADD change, two with the BINARY_SUBSCR change, and two with both.
I'm attaching the output from all my benchmark runs, but here are the highlights
In this table I've flipped the results for running the modified build as the reference, but in the new attachment, slower in the right column means faster, I think :)
|------------------|---------------------------------------|-----------------------------------|
|Build             | Baseline Reference                    | Modified Reference                |
|------------------|--------------------|------------------|--------------------|--------------|
|                  | Faster             | Slower           | Faster             | Slower       |
|------------------|--------------------|------------------|--------------------|--------------|
|BINARY_ADD        | chameleon_v2       | etree_parse      | chameleon_v2       | call_simple  |
|                  | chaos              | nbody            | fannkuch           | nbody        |
|                  | django             | normal_startup   | normal_startup     | pickle_dict  |
|                  | etree_generate     | pickle_dict      | nqueens            | regex_v8     |
|                  | fannkuch           | pickle_list      | regex_compile      |              |
|                  | formatted_logging  | regex_effbot     | spectral_norm      |              |
|                  | go                 |                  | unpickle_list      |              |
|                  | json_load          |                  |                    |              |
|                  | regex_compile      |                  |                    |              |
|                  | simple_logging     |                  |                    |              |
|                  | spectral_norm      |                  |                    |              |
|------------------|--------------------|------------------|--------------------|--------------|
|BINARY_SUBSCR     | chameleon_v2       | call_simple      | 2to3               | etree_parse  |
|                  | chaos              | go               | call_method_slots  | json_dump_v2 |
|                  | etree_generate     | pickle_list      | chaos              | pickle_dict  |
|                  | fannkuch           | telco            | fannkuch           |              |
|                  | fastpickle         |                  | formatted_logging  |              |
|                  | hexiom2            |                  | go                 |              |
|                  | json_load          |                  | hexiom2            |              |
|                  | mako_v2            |                  | mako_v2            |              |
|                  | meteor_contest     |                  | meteor_contest     |              |
|                  | nbody              |                  | nbody              |              |
|                  | regex_v8           |                  | normal_startup     |              |
|                  | spectral_norm      |                  | nqueens            |              |
|                  |                    |                  | pickle_list        |              |
|                  |                    |                  | simple_logging     |              |
|                  |                    |                  | spectral_norm      |              |
|                  |                    |                  | telco              |              |
|------------------|--------------------|------------------|--------------------|--------------|
|BOTH              | chameleon_v2       | call_simple      | chameleon_v2       | fastpickle   |
|                  | chaos              | etree_parse      | chaos              | pickle_dict  |
|                  | etree_generate     | pathlib          | etree_generate     | pickle_list  |
|                  | etree_process      | pickle_list      | etree_process      | telco        |
|                  | fannkuch           |                  | fannkuch           |              |
|                  | fastunpickle       |                  | float              |              |
|                  | float              |                  | formatted_logging  |              |
|                  | formatted_logging  |                  | go                 |              |
|                  | hexiom2            |                  | hexiom2            |              |
|                  | nbody              |                  | nbody              |              |
|                  | nqueens            |                  | normal_startup     |              |
|                  | regex_v8           |                  | nqueens            |              |
|                  | spectral_norm      |                  | simple_logging     |              |
|                  | unpickle_list      |                  | spectral_norm      |              |
|------------------|--------------------|------------------|--------------------|--------------|

unpack_sequence is nowhere to be seen and spectral_norm is faster now...
msg259491 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-03 17:00
Attaching a new patch -- rewritten to optimize -, *, +, -=, *= and +=.  I also removed the optimization of [] operator -- that should be done in a separate patch and in a separate issue.

Some nano-benchmarks (best of 3):

python -m timeit  "sum([x + x + 1 for x in range(100)])"
2.7: 7.71     3.5: 8.54      3.6: 7.33

python -m timeit  "sum([x - x - 1 for x in range(100)])"
2.7: 7.81     3.5: 8.59      3.6: 7.57

python -m timeit  "sum([x * x * 1 for x in range(100)])"
2.7: 9.28     3.5: 10.6      3.6: 9.44


Python 3.6 vs 3.5 (spectral_norm, rigorous run):
Min: 0.315917 -> 0.276785: 1.14x faster
Avg: 0.321006 -> 0.284909: 1.13x faster


Zach, thanks a lot for the research!  I'm glad that unpack_sequence finally proved to be irrelevant.  Could you please take a look at the updated patch?
msg259493 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-03 17:05
> python -m timeit  "sum([x * x * 1 for x in range(100)])"

If you only want to benchmark x*y, x+y and list-comprehension, you
should use a tuple for the iterator.
msg259494 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-02-03 17:07
> In this table I've flipped the results for running the modified build
> as the reference, but in the new attachment, slower in the right
> column means faster, I think :)

I don't understand what this table means (why 4 columns?). Can you explain what you did?
msg259495 - (view) Author: Zach Byrne (zbyrne) * Date: 2016-02-03 17:15
> I don't understand what this table means (why 4 columns?). Can you explain what you did?

Yury suggested running perf.py twice with the binaries swapped
So "faster" and "slower" underneath "Baseline Reference" are runs where the unmodified python binary was the first argument to perf, and the "Modified Reference" is where the patched binary is the first argument.

ie. "perf.py -r -b all python patched_python" vs "perf.py -r -b all patched_python python"

bench_results.txt has the actual output in it, and the "slower in the right column" comment was referring to the contents of that file, not the table. Sorry for the confusion.
msg259496 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-03 17:21
> Yury suggested running perf.py twice with the binaries swapped

Yeah, I had some experience with perf.py when its results were skewed depending on what you test first.  Hopefully Victor's new patch will fix that http://bugs.python.org/issue26275
msg259497 - (view) Author: Zach Byrne (zbyrne) * Date: 2016-02-03 17:47
> Could you please take a look at the updated patch?
Looks ok to me, for whatever that's worth.
msg259499 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-02-03 17:52
On 03/02/2016 18:21, Yury Selivanov wrote:
> 
> Yury Selivanov added the comment:
> 
>> Yury suggested running perf.py twice with the binaries swapped
> 
> Yeah, I had some experience with perf.py when its results were
> skewed depending on what you test first.

Have you tried disabling turbo on your CPU? (or any kind of power
management that would change the CPU clock depending on the current
workload)
msg259500 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2016-02-03 18:50
On 03.02.2016 18:05, STINNER Victor wrote:
> 
>> python -m timeit  "sum([x * x * 1 for x in range(100)])"
> 
> If you only want to benchmark x*y, x+y and list-comprehension, you
> should use a tuple for the iterator.

... and precalculate that in the setup:

python -m timeit -s "loops=tuple(range(100))" "sum([x * x * 1 for x in loops])"

# python -m timeit "sum([x * x * 1 for x in range(100)])"
100000 loops, best of 3: 5.74 usec per loop
# python -m timeit -s "loops=tuple(range(100))" "sum([x * x * 1 for x in loops])"
100000 loops, best of 3: 5.56 usec per loop

(python = Python 2.7)
msg259502 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-03 19:04
Antoine, yeah, it's probably turbo boost related.  There is no easy way to turn it off on mac os x, though.  I hope Victor's patch to perf.py will help to mitigate this. 

Victor, Marc-Andre,

Updated results of nano-bench (best of 10):

-m timeit -s "loops=tuple(range(100))" "sum([x * x * 1 for x in loops])"
2.7  8.5     3.5  10.1     3.6  8.91

-m timeit -s "loops=tuple(range(100))" "sum([x + x + 1 for x in loops])"
2.7  7.27    3.5  8.2      3.6  7.13

-m timeit -s "loops=tuple(range(100))" "sum([x - x - 1 for x in loops])"
2.7  7.01    3.5  8.1      3.6  6.95

Antoine, Serhiy, I'll upload a new patch soon.  Probably Serhiy's idea of using a switch statement will make it slightly faster.  I'll also add a fast path for integer division.
msg259503 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-03 19:19
A fast path is already implemented in long_mul(). Maybe we should just use this function if both arguments are exact ints, and apply the switch optimization inside.
msg259505 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-03 19:26
> A fast path is already implemented in long_mul(). Maybe we should just use this function if both arguments are exact ints, and apply the switch optimization inside.

Agree.

BTW, what do you think about using __int128 when available?  That way we can also optimize twodigit PyLongs.
msg259506 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-03 19:29
I don't think, I run benchmarks (for __int128) :-)
msg259508 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-03 19:35
> I don't think, I run benchmarks (for __int128) :-)

Never mind...  Seems that __int128 is still an experimental feature and some versions of clang even had bugs with it.
msg259509 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-03 19:43
> BTW, what do you think about using __int128 when available?  That way we can also optimize twodigit PyLongs.

__int128 is not always available and it would add too much complexity for possibly little gain. There are many ways to optimize the code and we should choose those that have the best gain/complexity ratio.

Let's split the patch into smaller parts: 1) directly use the long-specialized functions in ceval.c, and 2) optimize the fast path in these functions, and test them separately and combined. Maybe only one of them will add a gain.
msg259530 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-04 06:02
Attaching a second version of the patch.  (BTW, Serhiy, I tried your idea of using a switch statement to optimize branches (https://github.com/1st1/cpython/blob/fastint2/Python/ceval.c#L5390) -- no detectable speed improvement).


I decided to add fast paths for floats & single-digit longs and their combinations.  +, -, *, /, //, and their inplace versions are optimized now.


I'll have a full result of macro-benchmarks run tomorrow morning, but here's a result for spectral_norm (rigorous run, best of 3):

### spectral_norm ###
Min: 0.300269 -> 0.233037: 1.29x faster
Avg: 0.301700 -> 0.234282: 1.29x faster
Significant (t=399.89)
Stddev: 0.00147 -> 0.00083: 1.7619x smaller


Some nano-benchmarks (best of 3):

-m timeit -s "loops=tuple(range(100))" "sum([x + x + 1 for x in loops])"
2.7  7.23    3.5  8.17      3.6  7.57

-m timeit -s "loops=tuple(range(100))" "sum([x + x + 1.0 for x in loops])"
2.7  9.08    3.5  11.7      3.6  7.22

-m timeit -s "loops=tuple(range(100))" "sum([x/2.2 + 2 + x*2.5 + 1.0 for x in loops])"
2.7  17.9    3.5  24.3      3.6  11.8
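The command-line nano-benchmarks above can also be run from a script; a sketch using timeit.repeat (absolute numbers depend entirely on the machine and build, so none are reproduced here):

```python
import timeit

SETUP = "loops = tuple(range(100))"
STMTS = [
    "sum([x + x + 1 for x in loops])",
    "sum([x + x + 1.0 for x in loops])",
    "sum([x/2.2 + 2 + x*2.5 + 1.0 for x in loops])",
]

for stmt in STMTS:
    # best of 3 runs, expressed as microseconds per executed statement
    best = min(timeit.repeat(stmt, SETUP, repeat=3, number=10_000))
    print(f"{best * 100:.2f} usec  {stmt}")
```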
msg259540 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2016-02-04 07:58
On 04.02.2016 07:02, Yury Selivanov wrote:
> Attaching a second version of the patch.  (BTW, Serhiy, I tried your idea of using a switch statement to optimize branches (https://github.com/1st1/cpython/blob/fastint2/Python/ceval.c#L5390) -- no detectable speed improvement).

It would be better to consistently have the fast_*() helpers
return -1 in case of an error, instead of -1 or 1.

Overall, I see two problems with doing too many of these
fast paths:

 * the ceval loop may no longer fit into the CPU cache on
   systems with small cache sizes, since the compiler will likely
   inline all the fast_*() functions (I guess it would be possible
   to simply eliminate all fast paths using a compile time
   flag)

 * maintenance will get more difficult

In a numerics-heavy application it's likely that all fast paths
will trigger somewhere, but those will likely be better off
using numpy or numba. For a text-heavy application such as a web
server, only a few fast paths will trigger and so the various
checks only add overhead.

Since 'a'+'b' is a very common operation in the latter type of
application, please make sure that this fast path gets more
priority in your patch.

Please also check the effects of the fast paths for cases
where they don't trigger, e.g. 'a'+'b' or 'a'*2.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com
msg259541 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-04 08:01
"In a numerics heavy application it's like that all fast paths will trigger somewhere, but those will likely be better off using numpy or numba. For a text heavy application such as a web server, only few fast paths will trigger and so the various checks only add overhead."

Hum, I disagree. See benchmark results in other messages. Examples:

### django_v2 ###
Min: 2.682884 -> 2.633110: 1.02x faster

### unpickle_list ###
Min: 1.333952 -> 1.212805: 1.10x faster

These benchmarks are not written for numeric, but are more "general" benchmarks. int is just a core feature of Python, simply used everywhere, as the str type.
msg259542 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-04 08:13
+        if (Py_SIZE(left) != 0) {
+            if (Py_SIZE(right) != 0) {
+
+#ifdef HAVE_LONG_LONG
+                mul = PyLong_FromLongLong(
+                        (long long)SINGLE_DIGIT_LONG_AS_LONG(left) *
+                            SINGLE_DIGIT_LONG_AS_LONG(right));
+#else
+                mul = PyNumber_Multiply(left, right);
+#endif

Why don't you use the same code as long_mul() (you need #include "longintrepr.h")?
----------------
        stwodigits v = (stwodigits)(MEDIUM_VALUE(a)) * MEDIUM_VALUE(b);
#ifdef HAVE_LONG_LONG
        return PyLong_FromLongLong((PY_LONG_LONG)v);
#else
        /* if we don't have long long then we're almost certainly
           using 15-bit digits, so v will fit in a long.  In the
           unlikely event that we're using 30-bit digits on a platform
           without long long, a large v will just cause us to fall
           through to the general multiplication code below. */
        if (v >= LONG_MIN && v <= LONG_MAX)
            return PyLong_FromLong((long)v);
#endif
----------------

I guess that long_mul() was always well optimized, no need to experiment something new.
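For context on why the quoted long_mul() code is overflow-safe: with CPython's usual 30-bit digits, a single-digit int has absolute value below 2**30, so the product of two of them stays below 2**60 and always fits in a signed 64-bit stwodigits. A quick Python sanity check of that bound (a sketch of the arithmetic, not the C internals; PyLong_SHIFT here mirrors the constant from longintrepr.h):

```python
# With 30-bit digits, any single-digit int x satisfies |x| < 2**30,
# so |x * y| < 2**60, comfortably inside the signed 64-bit range.
PyLong_SHIFT = 30                   # digit size on typical 64-bit builds
max_digit = 2**PyLong_SHIFT - 1     # largest single-digit magnitude

product_bound = max_digit * max_digit
int64_max = 2**63 - 1

# the single-digit multiplication fast path cannot overflow
assert product_bound < int64_max
```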
msg259545 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2016-02-04 08:56
On 04.02.2016 09:01, STINNER Victor wrote:
> 
> "In a numerics-heavy application it's likely that all fast paths will trigger somewhere, but those will likely be better off using numpy or numba. For a text-heavy application such as a web server, only a few fast paths will trigger, so the various checks only add overhead."
> 
> Hum, I disagree. See benchmark results in other messages. Examples:
> 
> ### django_v2 ###
> Min: 2.682884 -> 2.633110: 1.02x faster
> 
> ### unpickle_list ###
> Min: 1.333952 -> 1.212805: 1.10x faster
> 
> These benchmarks are not written for numeric, but are more "general" benchmarks. int is just a core feature of Python, simply used everywhere, as the str type.

Sure, some integer math is used in text applications as well,
e.g. for indexing, counting and slicing, but the patch puts more
emphasis on numeric operations, e.g. fast_add() tests for integers
and floats before falling back to the Unicode check.

It would be interesting to know how often these paths trigger
or not in the various benchmarks.

BTW: The django_v2 benchmark result does not really say
much. Differences of +/- 2% do not have much meaning in
benchmark results :-)
msg259549 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-02-04 09:35
I agree with Marc-Andre, people doing FP-heavy math in Python use Numpy (possibly with Numba, Cython or any other additional library). Micro-optimizing floating-point operations in the eval loop makes little sense IMO.

The point of optimizing integers is that they are used for many purposes, not only "math" (e.g. indexing).
msg259552 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-04 09:37
> I agree with Marc-Andre, people doing FP-heavy math in Python use Numpy (possibly with Numba, Cython or any other additional library). Micro-optimizing floating-point operations in the eval loop makes little sense IMO.

Oh wait, I may have misunderstood Marc-Andre's comment. If the question is only about float: I'm OK with dropping the fast path for float. By the way, I would prefer to see PyLong_CheckExact() in the main loop, and only call fast_mul() if both operands are Python ints.
msg259554 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-04 10:30
fastint2.patch adds a small regression for string multiplication:

$ ./python -m timeit -s "x = 'x'" -- "x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; "
Unpatched:  1.46 usec per loop
Patched:    1.54 usec per loop

Here is an alternative patch. It just uses the existing specialized functions for integers: long_add, long_sub and long_mul. It doesn't add a regression for the above example with string multiplication, and it looks faster than fastint2.patch for integer multiplication.

$ ./python -m timeit -s "x = 12345" -- "x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; "
Unpatched:          0.887 usec per loop
fastint2.patch:     0.841 usec per loop
fastint_alt.patch:  0.804 usec per loop
msg259560 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-04 12:50
I prefer fastint_alt.patch design, it's simpler. I added a comment on the review.

My numbers, best of 5 timeit runs:

$ ./python -m timeit -s "x = 12345" -- "x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; "

* original: 299 ns
* fastint2.patch: 282 ns (-17 ns, -6%)
* fastint_alt.patch: 267 ns (-32 ns, -11%)
msg259562 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-04 13:54
> I agree with Marc-Andre, people doing FP-heavy math in Python use Numpy (possibly with Numba, Cython or any other additional library). Micro-optimizing floating-point operations in the eval loop makes little sense IMO.

I disagree.

30% faster floats (sic!) is a serious improvement that shouldn't just be discarded.  Many applications have floating point calculations one way or another, but don't use numpy because it's overkill.

Python 2 is much faster than Python 3 on any kind of numeric calculations.  This point is being frequently brought up in every python2 vs 3 debate.  I think it's unacceptable.


> * the ceval loop may no longer fit in to the CPU cache on
   systems with small cache sizes, since the compiler will likely
   inline all the fast_*() functions (I guess it would be possible
   to simply eliminate all fast paths using a compile time
   flag)

That's speculation.  It may still fit.  Or it never really fitted; it's already huge.  I tested the patch on an 8-year-old desktop CPU, no performance degradation on our benchmarks.

### raytrace ###
Avg: 1.858527 -> 1.652754: 1.12x faster

### nbody ###
Avg: 0.310281 -> 0.285179: 1.09x faster

### float ###
Avg: 0.392169 -> 0.358989: 1.09x faster

### chaos ###
Avg: 0.355519 -> 0.326400: 1.09x faster

### spectral_norm ###
Avg: 0.377147 -> 0.303928: 1.24x faster

### telco ###
Avg: 0.012845 -> 0.013006: 1.01x slower

The last benchmark (telco) is especially interesting.  It uses decimals for calculation, which means that it uses overloaded numeric operators.  Still no significant performance degradation.

> * maintenance will get more difficult

The fast path for floats is easy to understand for every core dev who works with ceval; there is no rocket science there (if you want rocket science that is hard to maintain, look at generators/yield from).  If you don't like inlining floating point calculations, we can export float_add, float_sub, float_div, and float_mul and use them in ceval.

Why not combine my patch and Serhiy's?  First we check if left & right are both longs.  Then we check if they are unicode (for +).  And then we have a fastpath for floats.
msg259563 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-04 14:01
> Why not combine my patch and Serhiy's?  First we check if left & right are both longs.  Then we check if they are unicode (for +).  And then we have a fastpath for floats.

See my comment on Serhiy's patch. Maybe we can start by checking that the types of both operands are the same, and then use PyLong_CheckExact and PyUnicode_CheckExact.

Using such a design, we may add a _PyFloat_Add(). But the next question is then the overhead on the "slow" path, which requires a benchmark too! For example, use a subtype of int.
msg259564 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-02-04 14:06
Le 04/02/2016 14:54, Yury Selivanov a écrit :
> 
> 30% faster floats (sic!) is a serious improvement, that shouldn't
> just be discarded. Many applications have floating point calculations one way
> or another, but don't use numpy because it's an overkill.

Can you give any example of such an application and how they would
actually benefit from "faster floats"? I'm curious why anyone who wants
fast FP calculations would use pure Python with CPython...

Discarding Numpy because it's "overkill" sounds misguided to me.
That's like discarding asyncio because it's "less overkill" to write
your own select() loop. It's often far more productive to use the
established, robust, optimized library rather than tweak your own
low-level code.

(and Numpy is easier to learn than asyncio ;-))

I'm not violently opposing the patch, but I think maintenance effort
devoted to such micro-optimizations is a bit wasted. And once you add
such a micro-optimization, trying to remove it often faces a barrage of
FUD about making Python slower, even if the micro-optimization is
practically worthless.

> Python 2 is much faster than Python 3 on any kind of numeric
> calculations.

Actually, it shouldn't really be faster on FP calculations, since the
float object hasn't changed (as opposed to int/long). So I'm skeptical
of FP-heavy code that would have been made slower by Python 3 (unless
there's also integer handling in that, e.g. indexing).
msg259565 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-04 14:18
>But the next question is then the overhead on the "slow" path, which requires a benchmark too! For example, use a subtype of int.

telco is such a benchmark (although it's very unstable).  It uses decimals extensively.  I've tested it many times on three different CPUs, and it doesn't seem to become any slower.


> Discarding Numpy because it's "overkill" sounds misguided to me.
That's like discarding asyncio because it's "less overkill" to write
your own select() loop. It's often far more productive to use the
established, robust, optimized library rather than tweak your own
low-level code.

Don't get me wrong, numpy is simply amazing!  But if you have a 100,000-line application that happens to have a few FP-related calculations here and there, you won't use numpy (unless you have had experience with it before).

My opinion on this: numeric operations in Python (and any general purpose language) should be as fast as we can make them.


> Python 2 is much faster than Python 3 on any kind of numeric
> calculations.

> Actually, it shouldn't really be faster on FP calculations, since the
float object hasn't changed (as opposed to int/long). So I'm skeptical
of FP-heavy code that would have been made slower by Python 3 (unless
there's also integer handling in that, e.g. indexing).

But it is faster.  That's visible on many benchmarks.  Even simple timeit one-liners can show that.  Probably it's because such benchmarks usually combine floats and ints, i.e. "2 * smth" instead of "2.0 * smth".
msg259567 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-02-04 14:24
Le 04/02/2016 15:18, Yury Selivanov a écrit :
> 
> But it is faster. That's visible on many benchmarks. Even simple
timeit one-liners can show that. Probably it's because such
benchmarks usually combine floats and ints, i.e. "2 * smth" instead of
"2.0 * smth".

So it's not about FP-related calculations anymore. It's about ints
having become slower ;-)
msg259568 - (view) Author: Yury Selivanov (Yury.Selivanov) * Date: 2016-02-04 14:27
>> But it is faster. That's visible on many benchmarks. Even simple
> timeit oneliners can show that. Probably it's because that such
> benchmarks usually combine floats and ints, i.e. "2 * smth" instead of
> "2.0 * smth".
> 
> So it's not about FP-related calculations anymore. It's about ints
> having become slower ;-)

I should have written 2 * smth_float vs 2.0 * smth_float
msg259571 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2016-02-04 15:40
It's instructive to run ./python Modules/_decimal/tests/bench.py (Hit Ctrl-C after the first cdecimal result, 5 repetitions or so).

fastint2.patch speeds up floats enormously and slows down decimal by 6%.
fastint_alt.patch slows down float *and* decimal (5% or so).

Overall the status quo isn't that bad, but I concede that float benchmarks like that are useful for PR.
msg259573 - (view) Author: Yury Selivanov (Yury.Selivanov) * Date: 2016-02-04 15:56
> 
> Stefan Krah added the comment:
> 
> It's instructive to run ./python Modules/_decimal/tests/bench.py (Hit Ctrl-C after the first cdecimal result, 5 repetitions or so).
> 
> fastint2.patch speeds up floats enormously and slows down decimal by 6%.
> fastint_alt.patch slows down float *and* decimal (5% or so).
> 
> Overall the status quo isn't that bad, but I concede that float benchmarks like that are useful for PR.
> 

Thanks Stefan! I'll update my patch to include Serhiy's ideas. The fact that fastint_alt slows down floats *and* decimals is not acceptable.

I'm all for keeping cpython and ceval loop simple, but we should not pass on opportunities to improve some things in a significant way.
msg259574 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-04 16:36
It is easy to extend fastint_alt.patch to support floats too. Here is a new patch.

> It's instructive to run ./python Modules/_decimal/tests/bench.py (Hit Ctrl-C after the first cdecimal result, 5 repetitions or so).

Note that this benchmark is not very stable. I ran it a few times and the difference between runs was about 20%.
msg259577 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2016-02-04 16:42
I've never seen 20% fluctuation in that benchmark between runs. The benchmark is very stable if you take the average of 10 runs.
msg259578 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2016-02-04 16:44
I mean, if you run the benchmark 10 times and the unpatched result is always between 11.3 and 12.0 for floats while the patched result is
between 12.3 and 12.9, for me the situation is clear.
msg259601 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-02-04 22:55
People should stop getting hung up about benchmark numbers and instead should first think about what they are trying to *achieve*. FP performance in pure Python does not seem like an important goal in itself. Also, some benchmarks may show variations which are randomly correlated with a patch (e.g. because of different code placement by the compiler interfering with instruction cache wayness). It is important not to block a patch because some random benchmark on some random machine shows an unexpected slowdown.

That said, both of Serhiy's patches are probably ok IMO.
msg259605 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-05 00:09
> People should stop getting hung up about benchmarks numbers and instead should first think about what they are trying to *achieve*. FP performance in pure Python does not seem like an important goal in itself.

I'm not sure how to respond to that.  Every performance aspect *is* important.  numpy isn't shipped with CPython, not everyone uses it.  In one of my programs I used colorsys extensively -- did I need to rewrite it using numpy?  Probably I could, but that was a simple shell script without any dependencies.

It also harms Python 3 adoption a little bit, since many benchmarks are still slower.  Some of them are FP related.

In any case, I think that if we can optimize something - we should.


> Also, some benchmarks may show variations which are randomly correlated with a patch (e.g. because of different code placement by the compiler interfering with instruction cache wayness). 

30-50% speed improvement is not a variation.  It's just that a lot less code gets executed if we inline some operations.


> It is important not to block a patch because some random benchmark on some random machine shows an unexpected slowdown.

Nothing is blocked atm, we're just discussing various approaches.


> That said, both of Serhiy's patches are probably ok IMO.

Current Serhiy's patches are incomplete.  In any case, I've been doing some research and will post another message shortly.
msg259607 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-02-05 01:05
Hi Yury,

> I'm not sure how to respond to that. Every performance aspect *is*
> important.

Performance is not a religion (not any more than security or any other
matter).  It is not helpful to brandish results on benchmarks which have
no relevance to real-world applications.

It helps to define what we should achieve and why we want to achieve it.
 Once you start asking "why", the prospect of speeding up FP
computations in the eval loop starts becoming dubious.

> numpy isn't shipped with CPython, not everyone uses it.

That's not the point. *People doing FP-heavy computations* should use
Numpy or any of the packages that can make FP-heavy computations faster
(Numba, Cython, Pythran, etc.).

You should use the right tool for the job.  There is no need to
micro-optimize a hammer for driving screws when you could use a
screwdriver instead.  Lists or tuples of Python float objects are an
awful representation for what should be vectorized native data.  They
eat more memory in addition to being massively slower (they will also be
slower to serialize from/to disk, etc.).

"Not using" Numpy when you would benefit from it is silly.
Numpy is not only massively faster on array-wide tasks, it also makes it
easier to write high-level, readable, reusable code instead of writing
loops and iterating by hand.  Because it has been designed explicitly
for such use cases (which the Python core was not, despite the existence
of the colorsys module ;-)).  It also gives you access to a large
ecosystem of third-party modules implementing various domain-specific
operations, actively maintained by experts in the field.

Really, the mindset of "people shouldn't need to use Numpy, they can do
FP computations in the interpreter loop" is counter-productive.  I
understand that it's seductive to think that Python core should stand on
its own, but it's also a dangerous fallacy.

You *should* advocate people use Numpy for FP computations.  It's an
excellent library, and it's currently a major selling point for Python.
Anyone doing FP-heavy computations with Python should learn to use
Numpy, even if they only use it from time to time.  Downplaying its
importance, and pretending core Python is sufficient, is not helpful.

> It also harms Python 3 adoption a little bit, since many benchmarks
> are still slower. Some of them are FP related.

The Python 3 migration is happening already. There is no need to worry
about it... Even the diehard 3.x haters have stopped talking of
releasing a 2.8 ;-)

> In any case, I think that if we can optimize something - we should.

That's not true. Some optimizations add maintenance overhead for no real
benefit. Some may even hinder performance as they add conditional
branches in a critical path (increasing the load on the CPU's branch
predictors and making them potentially less efficient).

Some optimizations are obviously good, like the method call optimization
which caters to real-world use cases (and, by the way, kudos for that...
you are doing much better than all previous attempts ;-)). But some are
solutions waiting for a problem to solve.
msg259612 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-05 01:37
tl;dr   I'm attaching a new patch - fastint4 -- the fastest of them all. It incorporates Serhiy's suggestion to export long/float functions and use them.  I think it's reasonably complete -- please review it, and let's get it committed.

== Benchmarks ==

spectral_norm (fastint_alt)    -> 1.07x faster
spectral_norm (fastintfloat)   -> 1.08x faster
spectral_norm (fastint3.patch) -> 1.29x faster
spectral_norm (fastint4.patch) -> 1.16x faster

spectral_norm (fastint**.patch)-> 1.31x faster
nbody (fastint**.patch)        -> 1.16x faster

Where:
- fastint3 - is my previous patch that nobody likes (it inlined a lot of logic from longobject/floatobject)

- fastint4 - is the patch I'm attaching and ideally want to commit

- fastint** - is a modification of fastint4.  This is very interesting -- I started to profile different approaches and found two bottlenecks that really made Serhiy's and my other patches slower than fastint3.  What I found is that PyLong_AsDouble can be significantly optimized, and PyLong_FloorDiv is super inefficient.

PyLong_AsDouble can be sped up several times if we add a fastpath for 1-digit longs:

    // longobject.c: PyLong_AsDouble
    if (PyLong_CheckExact(v) && Py_ABS(Py_SIZE(v)) <= 1) {
        /* fast path; a single digit always fits in a double */
        return (double)MEDIUM_VALUE((PyLongObject *)v);
    }
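This fast path is exact because an IEEE-754 double has a 53-bit significand, so any 30-bit (single-digit) int converts without rounding. A quick Python check of that claim (a sketch assuming the usual 30-bit digit size):

```python
# Any |v| < 2**30 round-trips through float exactly, since a double's
# 53-bit significand easily covers 30 bits; that is what makes the
# single-digit fast path in PyLong_AsDouble safe.
for v in (0, 1, -1, 2**30 - 1, -(2**30 - 1)):
    assert int(float(v)) == v      # conversion loses nothing
```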


PyLong_FloorDiv (fastint4 adds it) can be specialized for single digits, which gives it a tremendous boost.

With those two optimizations, fastint4 becomes as fast as fastint3.  I'll create separate issues for PyLong_AsDouble and FloorDiv.

== Micro-benchmarks ==

Floats + ints:  -m timeit -s "x=2" "x*2.2 + 2 + x*2.5 + 1.0 - x / 2.0 + (x+0.1)/(x-0.1)*2 + (x+10)*(x-30)"

2.7:          0.42 (usec)
3.5:          0.619
fastint_alt   0.619
fastintfloat: 0.52
fastint3:     0.289
fastint4:     0.51
fastint**:    0.314

===

Ints:  -m timeit -s "x=2" "x + 10 + x * 20 - x // 3 + x* 10 + 20 -x"

2.7:          0.151 (usec)
3.5:          0.19
fastint_alt:  0.136
fastintfloat: 0.135
fastint3:     0.135
fastint4:     0.122
fastint**:    0.122


P.S. I have another variant of fastint4 that uses fast_* functions in the ceval loop, instead of a big macro.  Its performance is slightly worse than with the macro.
msg259614 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-05 01:48
Antoine, FWIW I agree on most of your points :)  And yes, numpy, scipy, numba, etc rock.

Please take a look at my fastint4.patch.  All tests pass, no performance regressions, no crazy inlining of floating point exceptions etc.  And yet we have a nice improvement for both ints and floats.
msg259626 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-05 04:04
Attaching another approach -- fastint5.patch.

Similar to what fastint4.patch does, but doesn't export any new APIs.  Instead, similarly to abstract.c, it uses type slots directly.
msg259663 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-05 15:10
Unless there are any objections, I'll commit fastint5.patch in a day or two.
msg259664 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-05 15:14
> Unless there are any objections, I'll commit fastint5.patch in a day or two.

Please don't. I would like to have time to benchmark all these patches (there are now 9 patches attached to the issue :-)) and I would like to hear Serhiy's feedback on your latest patches.
msg259666 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2016-02-05 15:26
On 05.02.2016 16:14, STINNER Victor wrote:
> 
> Please don't. I would like to have time to benchmark all these patches (there are now 9 patches attached to the issue :-)) and I would like to hear Serhiy's feedback on your latest patches.

Regardless of the performance, the fastint5.patch looks like the
least invasive approach to me. It also doesn't incur as much
maintenance overhead as the others do.

I'd only rename the macro MAYBE_DISPATCH_FAST_NUM_OP to
TRY_FAST_NUMOP_DISPATCH :-)

BTW: I do wonder why this approach is as fast as the others. Have
compilers grown smart enough to realize that the number slot
functions will not change and can thus be inlined?
msg259667 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-05 15:32
>> Unless there are any objections, I'll commit fastint5.patch in a day or two.

> Please don't. I would like to have time to benchmark all these patches (there are now 9 patches attached to the issue :-)) and I would like to hear Serhiy's feedback on your latest patches.

Sure, I'd very appreciate a review of fastint5.

I can save you some time on benchmarking -- it's really about fastint_alt vs fastint5.  The latter optimizes ALL ops on longs AND floats.  The former only optimizes some ops on longs.  So please be sure you're comparing oranges to oranges ;)
msg259668 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-05 15:43
> Regardless of the performance, the fastint5.patch looks like the
least invasive approach to me. It also doesn't incur as much
maintenance overhead as the others do.

Thanks.  It's a result of an enlightenment that can only come
after running benchmarks all day :)

> I'd only rename the macro MAYBE_DISPATCH_FAST_NUM_OP to
TRY_FAST_NUMOP_DISPATCH :-)

Yeah, your name is better.

> BTW: I do wonder why this approach is as fast as the others. Have
compilers grown smart enough to realize that the number slot
functions will not change and can thus be inlined ?

It looks like it; I'm very impressed myself.  I'd expect fastint3 (which just inlines a lot of logic directly in ceval.c) to be the fastest one.  But it seems that the compiler does an excellent job here.

Victor, BTW, if you want to test fastint3 vs fastint5, don't forget to apply the patch from issue #26288 over fastint5 (fixes slow performance of PyLong_AsDouble)
msg259669 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-05 15:58
bench_long2.py: my updated microbenchmark to test many types and more operations.

compare.txt: compare Python original, fastint_alt.patch, fastintfloat_alt.patch and fastint5.patch. "(*)" marks the minimum of the line, percents are relative to the minimum (if larger than +/-5%).

compare_to.txt: similar to compare.txt, but percents are relative to the original Python.
msg259670 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-05 16:15
My analysis of benchmarks.

Even using CPU isolation to run benchmarks, the results look unreliable for very short benchmarks like 3 ** 2.0: I don't think that fastint_alt can make the operation 16% slower since it doesn't touch this code, no?

Well... as expected, the speedup is quite *small*: the largest difference is on "3 * 2" run 100 times: 18% faster with fastint_alt. We are talking about 1.82 us => 1.49 us: a delta of 330 ns. I expect a much larger difference if you compile a function to machine code using Cython or a JIT like Numba or PyPy. Remember that we are running *micro*-benchmarks, so we should not push overkill optimizations unless the speedup is really impressive.

It's quite obvious from the tables that fastint_alt.patch only optimizes int (float is not optimized). If we choose to optimize float too, fastintfloat_alt.patch and fastint5.patch look to have the *same* speed.

I don't see any overhead on Decimal + Decimal with any patch: good.

--

Between fastintfloat_alt.patch and fastint5.patch, I prefer fastintfloat_alt.patch, which is much easier to read, so probably much easier to debug. I hate huge macros when I have to debug code in gdb :-( I also like very much the idea of *reusing* existing functions, rather than duplicating code.

Even if Antoine doesn't seem interested in optimizations on float, I think that it's OK to add a few lines for this type; fastintfloat_alt.patch is not so complex. What do *you* think?

Why not optimize a**b? It's a common operation, especially 2**k, no?
msg259671 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-05 16:18
> Between fastintfloat_alt.patch and fastint5.patch, I prefer fastintfloat_alt.patch which is much easier to read, so probably much easier to debug. I hate huge macro when I have to debug code in gdb :-( I also like very much the idea of *reusing* existing functions, rather than duplicating code.

I disagree.

fastintfloat_alt exports a lot of functions from longobject/floatobject, something that I really don't like.  Lots of repetitive code in ceval.c also makes it harder to make sure everything is correct.
msg259672 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-05 16:22
Anyways, if it's about macro vs non-macro, I can inline the macro by hand (which I think is an inferior approach here).  But I'd like the final code to use my approach of using slots directly, instead of modifying longobject/floatobject to export lots of *internal* stuff.
msg259673 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-05 16:32
As to whether we want this patch committed or not, here's a mini-macro-something benchmark:


$ ./python.exe -m timeit -s "x=2" "x + 10 + x * 20  + x* 10 + 20 -x"
10000000 loops, best of 3: 0.115 usec per loop

$ python3.5 -m timeit -s "x=2" "x + 10 + x * 20  + x* 10 + 20 -x"
10000000 loops, best of 3: 0.141 usec per loop


$ ./python.exe -m timeit -s "x=2" "x*2.2 + 2 + x*2.5 + 1.0 - x / 2.0 + (x+0.1)/(x-0.1)*2 + (x+10)*(x-30)"
1000000 loops, best of 3: 0.308 usec per loop

$ python3.5 -m timeit -s "x=2" "x*2.2 + 2 + x*2.5 + 1.0 - x / 2.0 + (x+0.1)/(x-0.1)*2 + (x+10)*(x-30)"
1000000 loops, best of 3: 0.652 usec per loop


Still, longs are 30-50% faster, FP is 100% faster.  I think it's a very good result.  Please don't block this patch.
msg259675 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-05 17:01
My patches were just samples. I'm glad that Yury incorporated the main idea and that this helps. If we apply any patch, I would prefer fastint5.patch. But I don't quite understand why it adds any gain. Is this just due to the overhead of calling PyNumber_Add? Then we should test with other compilers and with the LTO option. fastint5.patch adds overhead for type checks and increases the size of the ceval loop. What outweighs this overhead?

As for tests, it would be more honest to test with data whose results fall outside the small-int cache range (-5..256).
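Serhiy's point matters because CPython preallocates the ints in [-5, 256]: an operation whose result lands in that range skips allocation entirely, which can flatter a benchmark relative to the general case. A quick CPython-specific demonstration (the values are built at runtime to defeat constant folding and constant deduplication):

```python
# ints in [-5, 256] are cached singletons in CPython, so producing one
# allocates nothing; results outside the range allocate a fresh object
a = int(str(100))        # cached small int
b = int(str(100))
big_a = int(str(12345))  # outside the cache
big_b = int(str(12345))

assert a is b              # same preallocated object
assert big_a is not big_b  # two distinct allocations
```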
msg259678 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-05 17:16
Thanks, Serhiy,

> But I don't quite understand why it adds any gain. 

Perhaps, and this is just a guess - the fast path does only a couple of equality tests and one call for the actual op.  If it's long+long then long_add will be called directly.

PyNumber_Add has more overhead on:
- at least one extra call
- a few extra checks to guard against NotImplemented
- abstract.c/binary_op1 has a few more checks/slot lookups

So it looks like there are just far fewer instructions to execute.  If this guess is correct, then an LTO build without fast paths will still be somewhat slower.
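The extra work listed above can be made concrete. Here is a rough Python rendering of the generic dispatch in abstract.c's binary_op1, using __add__/__radd__ as stand-ins for the single C-level nb_add slot; it shows the NotImplemented checks and slot lookups that a direct long_add call skips:

```python
def binary_op1(v, w):
    """Sketch of the generic binary-op dispatch (abstract.c:binary_op1)."""
    # both slots are looked up before any arithmetic happens
    slotv = getattr(type(v), '__add__', None)
    slotw = getattr(type(w), '__radd__', None)
    # a proper subclass on the right gets first crack at the operation
    if (slotw is not None and type(w) is not type(v)
            and issubclass(type(w), type(v))):
        res = slotw(w, v)
        if res is not NotImplemented:
            return res
        slotw = None
    if slotv is not None:
        res = slotv(v, w)          # may answer NotImplemented...
        if res is not NotImplemented:
            return res
    if slotw is not None:
        res = slotw(w, v)          # ...forcing yet another call
        if res is not NotImplemented:
            return res
    raise TypeError('unsupported operand types')
```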

> Is this just due to overhead of calling PyNumber_Add? Then we should test with other compilers and with the LTO option.

I actually tried to compile CPython with LTO -- but couldn't.  Almost all of the C extension modules failed to link.  Do we compile official binaries with LTO?
msg259695 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-05 22:37
Serhiy Storchaka: "My patches were just samples. I'm glad that Yury incorporated the main idea and that this helps."

Oh, if even Serhiy prefers Yury's patches, I should read them again :-)

--

I read fastint5.patch one more time and I finally understood the following macros:

+#define NB_SLOT(slot) offsetof(PyNumberMethods, slot)
+#define NB_BINOP(nb_methods, slot) \
+    (*(binaryfunc*)(& ((char*)nb_methods)[NB_SLOT(slot)]))
+#define PY_LONG_CALL_BINOP(slot, left, right) \
+    (NB_BINOP(PyLong_Type.tp_as_number, slot))(left, right)
+#define PY_FLOAT_CALL_BINOP(slot, left, right) \
+    (NB_BINOP(PyFloat_Type.tp_as_number, slot))(left, right)

In short, a+b calls long_add(a, b) with that. On first read, I understood it as casting objects to C long or C double (don't ask me why).


I see a difference between fastint5.patch and fastintfloat_alt.patch: fastint5.patch resolves the address of long_add() at runtime, whereas fastintfloat_alt.patch gets a direct pointer to _PyLong_Add() at compile time. I expected a subtle speedup, but I'm unable to see it in benchmarks (again, both patches have the same speed).

The float path is simpler in fastint5.patch because it uses the same code whether right is float or long, but it adds more checks on the slow path. Neither patch seems to have a real impact on the slow path. Is it worth changing the second if to PyFloat_CheckExact() and then checking the type of right in the if body, to avoid the extra checks on the slow path?

(C checks look very cheap, so I think that I already replied to my own question :-))

--

fastint5.patch optimizes a+b, a-b, a*b, a/b and a//b. Why not other operators? List of operators from my constant folding optimization in fatoptimizer:

* int, float: a+b, a-b, a*b, a/b, +x, -x, ~x, a//b, a%b, a**b
* int only: a<<b, a>>b, a&b, a|b, a^b

If we optimize a//b, I suggest to also optimize a%b to be consistent. For integers, a**b, a<<b and a>>b would make sense too. Coming from the C language, I would prefer a<<b and a>>b over a*2**k or a//2**k, since I expect better performance.

For float, -x and +x may be common, but less so than a+b, a-b, a*b, a/b.

Well, what I'm trying to say: if we choose the fastintfloat_alt.patch design, we will have to expose a lot of new C functions in headers, and duplicate a lot of code.

To support more than 4 operators, we need a macro.

If we use a macro, it's cheap (in terms of code maintenance) to use it for most or even all operators.

--

> But I don't quite understand why it adds any gain. Is this just due to overhead of calling PyNumber_Add?

Hum, that's a good question.


> Then we should test with other compilers and with the LTO option.

There are projects for that (I don't recall the issue number), but I would prefer to handle LTO separately. Python supports platforms and compilers which don't implement LTO.


> fastint5.patch adds an overhead for type checks and increases the size of ceval loop. What is outweigh this overhead?

I stopped guessing the speedup just by reading the code or a patch. I only trust benchmarks :-)

Advice: don't trust yourself! Only trust benchmarks.
msg259702 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-06 00:10
Attached is the new version of the fastint5 patch.  I fixed most of the review comments.  I also optimized the %, << and >> operators.  I didn't optimize other operators because they are less common.  I guess we have to draw a line somewhere...

Victor, thanks a lot for your suggestion to drop the NB_SLOT etc. macros!  Without them the code is even simpler.
msg259706 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-06 01:31
inline-2.patch: more complete version of inline.patch.

It optimizes the same instructions as Python 2: BINARY_ADD, INPLACE_ADD, BINARY_SUBTRACT, INPLACE_SUBTRACT.


Quick & *dirty* microbenchmark:

$ ./python -m timeit -s 'x=1' 'x+x+x+x+x+x+x+x+x+x+x+x+x+x+x+x+x+x+x+x+x+x+x'

* Original: 287 ns
* fastint5_2.patch: 261 ns (-9%)
* inline-2.patch: 212 ns (-26%)


$ ./python -m timeit -s 'x=1000; y=1' 'x-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y'

* Original: 517 ns
* fastint5_2.patch: 469 ns (-9%)
* inline-2.patch: 442 ns (-15%)


Ok. Now I'm lost. We have so many patches :-) Which one do you prefer?

In terms of speedup, I expect that the Python 2 design (inline-2.patch) cannot be beaten by any other option, since it doesn't need any call into C helper functions and does everything inline in ceval.c.
msg259707 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-06 01:36
> Ok. Now I'm lost. We have so many patches :-) Which one do you prefer?

To no-one's surprise I prefer fastint5, because it optimizes almost all binary operators on both ints and floats.

inline-2.patch optimizes just + and -, and just for ints.  If the + and - performance of inline-2 is really important, I suggest merging it into fastint5, but I'd keep it simple ;)
msg259712 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-06 01:52
msg222985: Raymond Hettinger
"There also used to be a fast path for binary subscriptions with integer indexes.  I would like to see that performance regression fixed if it can be done cleanly."

The issue #26280 was opened to track this optimization.
msg259713 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-06 02:12
msg223186, Serhiy Storchaka about inline.patch: "Confirmed speed up about 20%. Surprisingly it affects even integers outside of the of preallocated small integers (-5...255)."

The optimization applies to Python ints with 0 or 1 digit, i.e. in the range [-2^30+1; 2^30-1].

Small integers in [-5; 256] might be faster still, but that comes from PyLong_FromLong() returning preallocated singletons.
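The 0-or-1-digit range follows from CPython's internal int representation: digits of PyLong_SHIFT bits each. A small sketch, assuming 30-bit digits (typical on 64-bit builds; 15-bit-digit builds also exist):

```python
PyLong_SHIFT = 30  # digit width; an assumption, configurable at build time

def ndigits(n):
    """Number of base-2**PyLong_SHIFT digits CPython needs for abs(n);
    0 for n == 0, matching ob_size semantics."""
    return (abs(n).bit_length() + PyLong_SHIFT - 1) // PyLong_SHIFT

# The fast path covers ints with 0 or 1 digit, i.e. [-2**30 + 1; 2**30 - 1]:
assert ndigits(0) == 0
assert ndigits(2**30 - 1) == 1
assert ndigits(-(2**30 - 1)) == 1
assert ndigits(2**30) == 2
```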
msg259714 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-06 02:22
myself> Ok. Now I'm lost. We have so many patches :-) Which one do you prefer?

I re-read this *old* issue in full -- well, *almost* all the messages.

Well, it's clear that no consensus has been found yet :-) I see two main trends: optimize most cases (most operators for int and float, e.g. fastint5_4.patch) versus optimize very few cases to limit changes and limit effects on ceval.c (e.g. inline-2.patch).

Marc-Andre and Antoine asked not to stick to micro-optimizations but to think wider: run macro benchmarks, like perf.py, and suggest PyPy, Numba, Cython & co. to users who need the best performance for numeric functions.

They also warned about subtle side-effects of any kind of change in ceval.c, which may be counter-productive. The long list of patches showed that some of them introduced performance *regressions*.

I don't expect that CPython can beat any compiler emitting machine code. CPython will always have to pay the price of boxing/unboxing and of its loop evaluating bytecode. We can do *better*; the question is "how far?".

I think that we have gone far enough in investigating *all* the different options to optimize 1+2 ;-) Each option was micro-benchmarked very carefully.

Now I suggest to focus on *macro* benchmarks to help us make a decision. I will run perf.py on fastint5_4.patch and inline-2.patch.
msg259722 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-06 08:37
> I see two main trends: optimize most cases (optimize most operators for int and float,  ex: fastint5_4.patch) versus optimize very few cases to limit changes and to limit effects on ceval.c (ex: inline-2.patch).

I agree that optimizing only very few cases may be better. We need to collect statistics on the use of different operations with different types in long runs of tests or benchmarks. If, say, division is used 100 times less often than addition, we shouldn't complicate the ceval loop to optimize it.
msg259729 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-06 15:47
Benchmark on inline-2.patch. No speedup, only slowdown.

I'm now running benchmark on fastint5_4.patch.

$ python3 -u perf.py --affinity=2-3,6-7 --rigorous ../default/python.orig ../default/python.inline-2

Report on Linux smithers 4.3.4-300.fc23.x86_64 #1 SMP Mon Jan 25 13:39:23 UTC 2016 x86_64 x86_64
Total CPU cores: 8

### json_load ###
Min: 0.707290 -> 0.723411: 1.02x slower
Avg: 0.707845 -> 0.724238: 1.02x slower
Significant (t=-297.25)
Stddev: 0.00026 -> 0.00049: 1.8696x larger

### regex_v8 ###
Min: 0.066663 -> 0.070435: 1.06x slower
Avg: 0.066947 -> 0.071378: 1.07x slower
Significant (t=-17.98)
Stddev: 0.00172 -> 0.00177: 1.0313x larger

The following not significant results are hidden, use -v to show them:
2to3, chameleon_v2, django_v3, fastpickle, fastunpickle, json_dump_v2, nbody, tornado_http.

real    58m32.662s
user    57m43.058s
sys     0m47.428s
msg259730 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-06 15:49
Benchmark on fastint5_4.patch.

python3 -u perf.py --affinity=2-3,6-7 --rigorous ../default/python.orig ../default/python_fastint5_4

Report on Linux smithers 4.3.4-300.fc23.x86_64 #1 SMP Mon Jan 25 13:39:23 UTC 2016 x86_64 x86_64
Total CPU cores: 8

### django_v3 ###
Min: 0.563959 -> 0.578181: 1.03x slower
Avg: 0.565383 -> 0.579137: 1.02x slower
Significant (t=-152.48)
Stddev: 0.00075 -> 0.00050: 1.4900x smaller

### fastunpickle ###
Min: 0.551076 -> 0.563469: 1.02x slower
Avg: 0.555481 -> 0.567028: 1.02x slower
Significant (t=-27.05)
Stddev: 0.00278 -> 0.00324: 1.1687x larger

### json_dump_v2 ###
Min: 2.737429 -> 2.662615: 1.03x faster
Avg: 2.754239 -> 2.685404: 1.03x faster
Significant (t=55.63)
Stddev: 0.00610 -> 0.01077: 1.7662x larger

### nbody ###
Min: 0.228548 -> 0.212292: 1.08x faster
Avg: 0.230082 -> 0.213574: 1.08x faster
Significant (t=73.74)
Stddev: 0.00175 -> 0.00139: 1.2567x smaller

### regex_v8 ###
Min: 0.041323 -> 0.048099: 1.16x slower
Avg: 0.041624 -> 0.049318: 1.18x slower
Significant (t=-45.38)
Stddev: 0.00123 -> 0.00116: 1.0613x smaller

The following not significant results are hidden, use -v to show them:
2to3, chameleon_v2, fastpickle, json_load, tornado_http.
msg259733 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-06 17:00
> ### regex_v8 ###
> Min: 0.041323 -> 0.048099: 1.16x slower
> Avg: 0.041624 -> 0.049318: 1.18x slower

I think this is a random fluctuation; that benchmark (and the re lib) doesn't use the operators much.  It can't be THAT much slower just because of optimizing a few binops.
msg259734 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-06 17:02
You're also running a very small subset of all benchmarks available. Please try the '-b all' option.  I'll also run benchmarks on my machines.
msg259743 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-06 17:38
Alright, I ran a few benchmarks myself.  In rigorous mode regex_v8 has the same performance on my 2013 MacBook Pro and on an 8-year-old i7 CPU (Linux).

Here're results of "perf.py -b raytrace,spectral_norm,meteor_contest,nbody ../cpython/python.exe ../cpython-git/python.exe -r"


fastint5:

### nbody ###
Min: 0.227683 -> 0.197046: 1.16x faster
Avg: 0.229366 -> 0.198889: 1.15x faster
Significant (t=137.31)
Stddev: 0.00170 -> 0.00142: 1.1977x smaller

### spectral_norm ###
Min: 0.296840 -> 0.262279: 1.13x faster
Avg: 0.299616 -> 0.265387: 1.13x faster
Significant (t=74.52)
Stddev: 0.00331 -> 0.00319: 1.0382x smaller

The following not significant results are hidden, use -v to show them:
meteor_contest, raytrace.


======


inline-2:


### raytrace ###
Min: 1.188825 -> 1.213788: 1.02x slower
Avg: 1.199827 -> 1.227276: 1.02x slower
Significant (t=-18.12)
Stddev: 0.00559 -> 0.01408: 2.5184x larger

### spectral_norm ###
Min: 0.296535 -> 0.277025: 1.07x faster
Avg: 0.299044 -> 0.278071: 1.08x faster
Significant (t=87.40)
Stddev: 0.00220 -> 0.00097: 2.2684x smaller

The following not significant results are hidden, use -v to show them:
meteor_contest, nbody.
msg259790 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-07 15:01
From what I can see there is no negative impact of the patch on stable macro benchmarks.

There is a quite detectable positive impact on most integer and float operations from my patch.  13-16% on the nbody and spectral_norm benchmarks is still impressive.  And you can see a huge improvement in various timeit micro-benchmarks.

fastint5 is a very compact patch that only touches the ceval.c file.  It doesn't complicate the code, as the macro is very straightforward.  Since the patch has passed the code review, thorough benchmarking and discussion stages, I'd like to commit it.
msg259791 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-07 15:08
Please don't commit it right now. Yes, due to the use of macros the patch looks simple, but the macros expand to complex code. We need more statistics.
msg259792 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-07 15:16
> Please don't commit it right now. Yes, due to the use of macros the patch looks simple, but the macros expand to complex code. We need more statistics.

But what will you use to gather the statistics?  The test suite isn't representative, and we already know what the benchmark suite will show.  I can assist with writing some code for stats, but what's the plan?
msg259793 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2016-02-07 16:18
#26288 brought a great speedup for floats. With fastint5_4.patch *on top of #26288* I see no improvement for floats and a big slowdown for _decimal.
msg259800 - (view) Author: Case Van Horsen (casevh) Date: 2016-02-07 19:30
Can I suggest the mpmath test suite as a good benchmark? I've used it to test the various optimizations in gmpy2 and it has always been a valuable real-world benchmark. And it is slower in Python 3 than Python 2....
msg259801 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-02-07 19:42
Be careful with test suites: first, they might exercise code that would never be a critical point for performance in a real-world application; second and most important, unittest seems to have gotten slower between 2.x and 3.x, so you would really be comparing apples to oranges.
msg259804 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-07 21:32
Attaching another patch - fastint6.patch that only optimizes longs (no FP fast path).

> #26288 brought a great speedup for floats. With fastint5_4.patch *on top of #26288* I see no improvement for floats and a big slowdown for _decimal.

What benchmark did you use?  What were the numbers?  I'm asking because earlier you benchmarked different patches that are conceptually similar to fastint5, and the result was that decimal was 5% faster with fast paths for just longs, and 6% slower with fast paths for longs & floats.

Also, some quick timeit results (quite stable from run to run):


-m timeit -s "x=2" "x + 10 + x * 20  + x* 10 + 20 -x"
3.6: 0.150usec           3.6+fastint: 0.112usec


-m timeit -s "x=2" "x*2.2 + 2 + x*2.5 + 1.0 - x / 2.0 + (x+0.1)/(x-0.1)*2 + (x+10)*(x-30)"
3.6: 0.425usec           3.6+fastint: 0.302usec
msg259832 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-08 09:46
Yury Selivanov:
> Alright, I ran a few benchmarks myself. (...)
> From what I can see there is no negative impact of the patch on stable macro benchmarks.

I'm disappointed by the results. In short, these patches have *no* impact on macro benchmarks, other than the two which stress the int and float types. Maybe we are just wasting our time on this issue...

I understand that the patches are only useful to get an xx% speedup (where xx% is smaller than 25%) if your whole application is dominated by numeric computations. If that's the case, I would suggest moving to PyPy, Numba, Cython, etc. From those tools I expect something more interesting than xx% faster: a much more impressive speedup.

http://speed.pypy.org/ : PyPy/CPython 2.7 for spectral_norm is 0.04: 25x faster. For nbody (nbody_modified), it's 0.09: 11x faster.

With patches of this issue, the *best* speedup is only 1.16x faster... We are *very* far from 11x or 25x faster. It's not even 2x faster...


Yury Selivanov:
> fastint5 is a very compact patch that only touches the ceval.c file.  It doesn't complicate the code, as the macro is very straightforward.  Since the patch has passed the code review, thorough benchmarking and discussion stages, I'd like to commit it.

According to my micro-benchmark msg259706, inline-2.patch is faster than fastint5_4.patch. I would suggest "finishing" inline-2.patch to optimize other operations, and *maybe* add fast paths for float.

On macro benchmarks, inline-2.patch is slower than fastint5_4.patch, but that was to be expected since I only added fast paths for int-int and only for a few operators.

The question is whether it is worth getting an xx% speedup on one or two specific benchmarks where CPython really sucks compared to other languages and other implementations of Python...


Stefan Krah:
> With fastint5_4.patch *on top of #26288* I see no improvement for floats and a big slowdown for _decimal.

How do you run your benchmark?


Case Van Horsen:
> Can I suggest the mpmath test suite as a good benchmark?

Where can we find this benchmark?


Case Van Horsen:
> it has always been a valuable real-world benchmark

What do you mean by "real-world benchmark"? :-)
msg259859 - (view) Author: Case Van Horsen (casevh) Date: 2016-02-08 16:30
mpmath is a library for arbitrary-precision floating-point arithmetic. It uses Python's native long type or gmpy2's mpz type for computations. It is available at https://pypi.python.org/pypi/mpmath.

The test suite can be run directly from the source tree. The test suite includes timing information for individual tests and for the entire run. Sample invocation:

~/src/mpmath-0.19/mpmath/tests$ time py36 runtests.py -local

For example, I've tried to optimize gmpy2's handling of binary operations between its mpz type and short Python integers. I've found it to provide useful results: improvements that are significant on a micro-benchmark (say 20%) will usually cause a small but repeatable improvement. And some improvements that looked good on a micro-benchmark would slow down mpmath.

I ran the mpmath test suite with Python 3.6 and with the fastint6 patch. The overall speedup when using Python's long type was about 1%. When using gmpy2's mpz type, there was a slowdown of about 2%.

I will run more tests tonight.
msg259860 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-08 16:40
> I ran the mpmath test suite with Python 3.6 and with the fastint6 patch. The overall speedup when using Python's long type was about 1%. When using gmpy2's mpz type, there was a slowdown of about 2%.

> I will run more tests tonight.

Please try to test fastint5 too (fast paths for long & floats, whereas fastint6 is only focused on longs).
msg259918 - (view) Author: Case Van Horsen (casevh) Date: 2016-02-09 08:25
I ran the mpmath test suite with the fastint6 and fastint5_4 patches.

fastint6 results

without gmpy: 0.25% faster
with gmpy: 3% slower

fastint5_4 results

without gmpy: 1.5% slower
with gmpy: 5.5% slower
msg259919 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-09 09:15
Case Van Horsen added the comment:
> I ran the mpmath test suite with the fastint6 and fastint5_4 patches.
>
> fastint6 results
> without gmpy: 0.25% faster
> with gmpy: 3% slower
>
> fastint5_4 results
> without gmpy: 1.5% slower
> with gmpy: 5.5% slower

I'm more and more disappointed by this issue... If even a test
stressing int & float is *slower* (or less than 1% faster) with a
patch supposed to optimize them, what's the point? I'm also concerned
by the slowdown for other types (gmpy types).

Maybe we should just close the issue?
msg259948 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-09 18:24
> Maybe we should just close the issue?

I'll take a closer look at gmpy later. Please don't close.
msg259999 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-10 10:02
> The test suite can be run directly from the source tree. The test suite includes timing information for individual tests and for the entire run. Sample invocation:

I extracted the slowest test (test_polyroots_legendre) and put it in a loop of 5 iterations: see attached mpmath_bench.py. I ran this benchmark on Linux with 4 isolated CPUs (/sys/devices/system/cpu/isolated=2-3,6-7).
http://haypo-notes.readthedocs.org/misc.html#reliable-micro-benchmarks

On such setup, the benchmark looks stable. Example:

Run #1/5: 12.28 sec
Run #2/5: 12.27 sec
Run #3/5: 12.29 sec
Run #4/5: 12.28 sec
Run #5/5: 12.30 sec

test_polyroots_legendre (min of 5 runs):

* Original: 12.51 sec
* fastint5_4.patch: 12.27 sec (-1.9%)
* fastint6.patch: 12.21 sec (-2.4%)
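The "min of 5 runs" figures above can be reproduced with a small helper along these lines (a sketch; the actual script is the attached mpmath_bench.py, and `best_of` is a made-up name):

```python
import time

def best_of(func, runs=5):
    # run func several times and keep the minimum: for a CPU-bound job,
    # noise (scheduling, frequency scaling) only ever adds time, so the
    # minimum is the least noisy estimate
    best = float("inf")
    for i in range(runs):
        t0 = time.perf_counter()
        func()
        elapsed = time.perf_counter() - t0
        print("Run #%d/%d: %.2f sec" % (i + 1, runs, elapsed))
        best = min(best, elapsed)
    return best
```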

I ran tests without GMP, to stress the Python int type.

I guess that the benchmark is dominated by CPU time spent computing operations on large Python ints, not by the time spent in ceval.c. So the speedup is low (2%). Such a use case doesn't seem to benefit from the micro-optimizations discussed in this issue.

mpmath is an arbitrary-precision floating-point arithmetic library using Python ints (or GMP if available).
msg264018 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-22 14:05
Maybe we should adopt a different approach.

There is something called "inline caching": put the cache between the instructions, in the same memory block. Example of a paper on CPython:

"Efficient Inline Caching without Dynamic Translation" by Stefan Brunthaler (2009)
https://www.sba-research.org/wp-content/uploads/publications/sac10.pdf

Maybe we can build something on top of the issue #26219 "implement per-opcode cache in ceval"?
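The inline-caching idea can be sketched in pure Python (an illustration with made-up names; a real implementation would attach the cache to the instruction itself and handle NotImplemented fallbacks, as in the issue #26219 work):

```python
class BinopCallSite:
    """One cache slot per BINARY_ADD instruction: remember the last pair
    of operand types and the handler resolved for them."""

    def __init__(self):
        self.types = None
        self.handler = None

    def add(self, a, b):
        pair = (type(a), type(b))
        if pair != self.types:
            # cache miss: do the generic (slow) lookup once, then memoize
            self.types = pair
            self.handler = type(a).__add__
        # cache hit: call the memoized handler directly
        return self.handler(a, b)

site = BinopCallSite()
assert site.add(1, 2) == 3          # miss: resolves int.__add__
assert site.add(3, 4) == 7          # hit: cached handler reused
assert site.add(1.0, 2.0) == 3.0    # miss: re-resolves for floats
```

The win is that a loop adding ints hits the cache on every iteration after the first, skipping the generic slot lookup entirely.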
msg264019 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2016-04-22 14:24
#14757 has an implementation of inline caching, which at least seemed to slow down some use cases. Then again, whenever someone posts a new speedup suggestion, it seems to slow down things I'm working on. At least Case van Horsen independently verified the phenomenon in this issue. :)
msg279021 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-10-20 09:25
Between inline2.patch and fastint6.patch, it seems like inline2.patch is faster (between 9% and 12% faster than fastint6.patch).

Microbenchmark on Python default (rev 554fb699af8c), compilation using LTO (./configure --with-lto), GCC 6.2.1 on Fedora 24, Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz, perf 0.8.3 (dev version, just after 0.8.2).

Commands:

./python -m perf timeit --name='x+y' -s 'x=1; y=2' 'x+y' --dup 1000 -v -o timeit-$branch.json
./python -m perf timeit --name=sum -s "R=range(100)" "[x + x + 1 for x in R]" --dup 1000 -v --append timeit-$branch.json

Results:

$ python3 -m perf compare_to timeit-master.json timeit-inline2.json
sum: Median +- std dev: [timeit-master] 6.23 us +- 0.13 us -> [timeit-inline2] 5.45 us +- 0.09 us: 1.14x faster
x+y: Median +- std dev: [timeit-master] 15.0 ns +- 0.2 ns -> [timeit-inline2] 11.6 ns +- 0.2 ns: 1.29x faster

$ python3 -m perf compare_to timeit-master.json timeit-fastint6.json 
sum: Median +- std dev: [timeit-master] 6.23 us +- 0.13 us -> [timeit-fastint6] 6.09 us +- 0.11 us: 1.02x faster
x+y: Median +- std dev: [timeit-master] 15.0 ns +- 0.2 ns -> [timeit-fastint6] 12.7 ns +- 0.2 ns: 1.18x faster

$ python3 -m perf compare_to timeit-fastint6.json  timeit-inline2.json
sum: Median +- std dev: [timeit-fastint6] 6.09 us +- 0.11 us -> [timeit-inline2] 5.45 us +- 0.09 us: 1.12x faster
x+y: Median +- std dev: [timeit-fastint6] 12.7 ns +- 0.2 ns -> [timeit-inline2] 11.6 ns +- 0.2 ns: 1.09x faster
msg279022 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-10-20 09:29
Result of performance 0.3.3 (and perf 0.8.3).

No major benchmark is faster. A few benchmarks even seem slower with fastint6.patch (but I don't really trust pybench).


== fastint6.patch ==

$ python3 -m perf compare_to master.json fastint6.json --group-by-speed --min-speed=5
Slower (3):
- pybench.ConcatUnicode: 52.7 ns +- 0.0 ns -> 56.1 ns +- 0.4 ns: 1.06x slower
- pybench.ConcatStrings: 52.7 ns +- 0.3 ns -> 56.1 ns +- 0.1 ns: 1.06x slower
- pybench.CompareInternedStrings: 16.5 ns +- 0.0 ns -> 17.4 ns +- 0.0 ns: 1.05x slower

Faster (4):
- pybench.SimpleIntFloatArithmetic: 441 ns +- 2 ns -> 400 ns +- 6 ns: 1.10x faster
- pybench.SimpleIntegerArithmetic: 441 ns +- 2 ns -> 401 ns +- 5 ns: 1.10x faster
- pybench.SimpleLongArithmetic: 643 ns +- 4 ns -> 608 ns +- 6 ns: 1.06x faster
- genshi_text: 79.6 ms +- 0.5 ms -> 75.5 ms +- 0.8 ms: 1.05x faster

Benchmark hidden because not significant (114): 2to3, call_method, (...)


== inline2.patch ==

haypo@selma$ python3 -m perf compare_to master.json inline2.json --group-by-speed --min-speed=5
Faster (2):
- spectral_norm: 223 ms +- 1 ms -> 209 ms +- 1 ms: 1.07x faster
- pybench.SimpleLongArithmetic: 643 ns +- 4 ns -> 606 ns +- 7 ns: 1.06x faster

Benchmark hidden because not significant (119): 2to3, call_method, (...)
msg279023 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-10-20 09:31
fastint6_inline2_json.tar.gz: archive of JSON files

- fastint6.json
- inline2.json
- master.json
- timeit-fastint6.json
- timeit-inline2.json
- timeit-master.json
msg279026 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-10-20 10:11
The fastest patch (inline2.patch) has a negligible impact on benchmarks. The purpose of an optimization is to make Python faster; that's not the case here, so I close the issue.

Using timeit, the largest speedup is 1.29x faster. Using performance, spectral_norm is 1.07x faster and pybench.SimpleLongArithmetic is 1.06x faster. I consider that spectral_norm and pybench.SimpleLongArithmetic are microbenchmarks and so not representative of a real application.

The issue was fun, thank you for playing the micro-optimization game with me ;-) Let's move on to more interesting optimizations with a larger impact on more realistic workloads, like caching global variables, optimizing method calls, fastcalls, etc.
msg279027 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-10-20 10:19
New changeset 61fcb12a9873 by Victor Stinner in branch 'default':
Issue #21955: Please don't try to optimize int+int
https://hg.python.org/cpython/rev/61fcb12a9873
History
Date User Action Args
2016-10-20 10:19:59python-devsetnosy: + python-dev
messages: + msg279027
2016-10-20 10:12:46vstinnersetresolution: fixed -> rejected
2016-10-20 10:11:39vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg279026
2016-10-20 09:31:18vstinnersetfiles: + fastint6_inline2_json.tar.gz

messages: + msg279023
2016-10-20 09:29:35vstinnersetmessages: + msg279022
2016-10-20 09:25:38vstinnersetmessages: + msg279021
2016-04-22 14:24:45skrahsetmessages: + msg264019
2016-04-22 14:05:45vstinnersetmessages: + msg264018
2016-02-10 10:02:15vstinnersetfiles: + mpmath_bench.py

messages: + msg259999
2016-02-09 18:24:01yselivanovsetmessages: + msg259948
2016-02-09 09:15:55vstinnersetmessages: + msg259919
2016-02-09 08:25:42casevhsetmessages: + msg259918
2016-02-08 16:40:24yselivanovsetmessages: + msg259860
2016-02-08 16:30:16casevhsetmessages: + msg259859
2016-02-08 09:46:11vstinnersetmessages: + msg259832
2016-02-07 21:32:54yselivanovsetfiles: + fastint6.patch

messages: + msg259804
2016-02-07 19:42:01pitrousetmessages: + msg259801
2016-02-07 19:30:03casevhsetmessages: + msg259800
2016-02-07 16:18:24skrahsetmessages: + msg259793
2016-02-07 15:16:35yselivanovsetmessages: + msg259792
2016-02-07 15:08:44serhiy.storchakasetmessages: + msg259791
2016-02-07 15:01:06yselivanovsetmessages: + msg259790
2016-02-06 17:38:20yselivanovsetmessages: + msg259743
2016-02-06 17:02:55yselivanovsetmessages: + msg259734
2016-02-06 17:00:25yselivanovsetmessages: + msg259733
2016-02-06 15:49:02vstinnersetmessages: + msg259730
2016-02-06 15:47:45vstinnersetmessages: + msg259729
2016-02-06 08:37:30serhiy.storchakasetmessages: + msg259722
2016-02-06 02:22:08vstinnersetmessages: + msg259714
2016-02-06 02:12:02vstinnersetmessages: + msg259713
2016-02-06 01:52:51vstinnersetmessages: + msg259712
2016-02-06 01:36:45yselivanovsetmessages: + msg259707
2016-02-06 01:31:57vstinnersetfiles: + inline-2.patch

messages: + msg259706
2016-02-06 01:29:55yselivanovsetfiles: + fastint5_4.patch
2016-02-06 00:45:07yselivanovsetfiles: + fastint5_3.patch
2016-02-06 00:10:27yselivanovsetfiles: + fastint5_2.patch

messages: + msg259702
2016-02-05 22:37:27vstinnersetmessages: + msg259695
2016-02-05 17:17:00yselivanovsetmessages: + msg259678
2016-02-05 17:01:35serhiy.storchakasetmessages: + msg259675
2016-02-05 16:32:58yselivanovsetmessages: + msg259673
2016-02-05 16:22:39yselivanovsetmessages: + msg259672
2016-02-05 16:18:59yselivanovsetmessages: + msg259671
2016-02-05 16:15:24vstinnersetmessages: + msg259670
2016-02-05 15:58:30vstinnersetfiles: + compare_to.txt
2016-02-05 15:58:24vstinnersetfiles: + compare.txt
2016-02-05 15:58:18vstinnersetfiles: + bench_long2.py

messages: + msg259669
2016-02-05 15:43:39yselivanovsetmessages: + msg259668
2016-02-05 15:32:30yselivanovsetmessages: + msg259667
2016-02-05 15:26:13lemburgsetmessages: + msg259666
2016-02-05 15:14:25vstinnersetmessages: + msg259664
2016-02-05 15:10:26yselivanovsetmessages: + msg259663
2016-02-05 04:04:35yselivanovsetfiles: + fastint5.patch

messages: + msg259626
2016-02-05 01:48:02yselivanovsetmessages: + msg259614
2016-02-05 01:37:43yselivanovsetfiles: + fastint4.patch

messages: + msg259612
2016-02-05 01:06:01pitrousetmessages: + msg259607
2016-02-05 00:09:37yselivanovsetmessages: + msg259605
2016-02-04 22:55:42pitrousetmessages: + msg259601
2016-02-04 16:44:07skrahsetmessages: + msg259578
2016-02-04 16:42:10skrahsetmessages: + msg259577
2016-02-04 16:36:19serhiy.storchakasetfiles: + fastintfloat_alt.patch

messages: + msg259574
2016-02-04 15:56:36Yury.Selivanovsetmessages: + msg259573
2016-02-04 15:40:09skrahsetnosy: + skrah
messages: + msg259571
2016-02-04 14:27:21Yury.Selivanovsetnosy: + Yury.Selivanov
messages: + msg259568
2016-02-04 14:24:48pitrousetmessages: + msg259567
2016-02-04 14:18:41yselivanovsetmessages: + msg259565
2016-02-04 14:06:50pitrousetmessages: + msg259564
2016-02-04 14:01:39vstinnersetmessages: + msg259563
2016-02-04 13:54:55yselivanovsetmessages: + msg259562
2016-02-04 12:50:15vstinnersetmessages: + msg259560
2016-02-04 10:30:04serhiy.storchakasetfiles: + fastint_alt.patch

messages: + msg259554
2016-02-04 09:37:42vstinnersetmessages: + msg259552
2016-02-04 09:35:46pitrousetmessages: + msg259549
2016-02-04 08:56:21lemburgsetmessages: + msg259545
2016-02-04 08:13:35vstinnersetmessages: + msg259542
2016-02-04 08:01:51vstinnersetmessages: + msg259541
2016-02-04 07:58:06lemburgsetmessages: + msg259540
2016-02-04 06:02:49yselivanovsetfiles: + fastint2.patch

messages: + msg259530
2016-02-03 19:43:20serhiy.storchakasetmessages: + msg259509
2016-02-03 19:35:22yselivanovsetmessages: + msg259508
2016-02-03 19:29:16vstinnersetmessages: + msg259506
2016-02-03 19:26:33yselivanovsetmessages: + msg259505
2016-02-03 19:19:30serhiy.storchakasetmessages: + msg259503
2016-02-03 19:04:03yselivanovsetmessages: + msg259502
2016-02-03 18:50:03lemburgsetnosy: + lemburg
messages: + msg259500
2016-02-03 17:52:40pitrousetmessages: + msg259499
2016-02-03 17:47:22zbyrnesetmessages: + msg259497
2016-02-03 17:21:01yselivanovsetmessages: + msg259496
2016-02-03 17:15:50zbyrnesetmessages: + msg259495
2016-02-03 17:07:18pitrousetmessages: + msg259494
2016-02-03 17:05:22vstinnersetmessages: + msg259493
2016-02-03 17:00:24yselivanovsetfiles: + fastint1.patch

messages: + msg259491
2016-02-03 16:40:31zbyrnesetfiles: + bench_results.txt

messages: + msg259490
2016-02-02 21:06:54yselivanovsetmessages: + msg259431
2016-02-02 20:56:51pitrousetmessages: + msg259429
2016-02-02 20:37:15zbyrnesetmessages: + msg259428
2016-02-02 18:55:11yselivanovsetversions: + Python 3.6, - Python 3.5
messages: + msg259417

assignee: yselivanov
components: + Interpreter Core
stage: patch review
2016-01-12 03:37:31zbyrnesetmessages: + msg258062
2016-01-12 02:42:00yselivanovsetnosy: + yselivanov
messages: + msg258060
2016-01-12 02:10:11zbyrnesetmessages: + msg258057
2015-03-18 15:53:44zbyrnesetmessages: + msg238455
2015-03-18 13:31:59vstinnersetmessages: + msg238437
2014-07-23 06:00:19zbyrnesetmessages: + msg223726
2014-07-23 01:20:41pitrousetnosy: + pitrou
messages: + msg223711
2014-07-22 02:34:35zbyrnesetfiles: + 21955_2.patch

messages: + msg223623
2014-07-16 18:31:45casevhsetnosy: + casevh
2014-07-16 14:40:57zbyrnesetmessages: + msg223214
2014-07-16 09:28:43serhiy.storchakasetmessages: + msg223186
2014-07-16 08:14:29vstinnersetmessages: + msg223180
2014-07-16 08:13:51vstinnersetmessages: - msg223179
2014-07-16 08:13:29vstinnersetfiles: + inline.patch
2014-07-16 08:13:14vstinnersetfiles: + bench_long.py

messages: + msg223179
2014-07-16 06:23:25serhiy.storchakasetmessages: + msg223177
2014-07-16 00:29:28zbyrnesetfiles: + 21955.patch

nosy: + zbyrne
messages: + msg223162

keywords: + patch
2014-07-14 00:42:07rhettingersetnosy: + rhettinger
messages: + msg222985
2014-07-12 09:01:55vstinnersetmessages: + msg222830
2014-07-12 09:01:30vstinnersetmessages: + msg222829
2014-07-12 07:19:28serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg222824
2014-07-11 22:23:35josh.rsetnosy: + josh.r
messages: + msg222804
2014-07-11 09:10:27vstinnercreate