classification
Title: INCA: Inline Caching meets Quickening in Python 3.3
Type: performance Stage:
Components: Interpreter Core Versions: Python 3.3
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: cvrebert, dmalcolm, eric.snow, meador.inge, pitrou, sbrunthaler, vstinner, yselivanov
Priority: normal Keywords: patch

Created on 2012-05-08 17:37 by sbrunthaler, last changed 2016-09-09 23:26 by vstinner.

Files
File name Uploaded Description Edit
20120508-inca.patch sbrunthaler, 2012-05-08 17:37 Patch against 2f563908ebc5 (Jesus Cea, Thu Apr 26 17:05:31 2012 +0200) review
20120511-inca.patch sbrunthaler, 2012-05-11 21:32 Patches CALL_FUNCTION quickening, supersedes previous patch file. review
20120511-inca-perf.txt sbrunthaler, 2012-05-11 21:34 perf.py performance on i7-920, Linux 3.0.0-19
20120510-vanilla-perf.txt sbrunthaler, 2012-05-11 21:35 perf.py performance baseline measurement on i7-920, Linux 3.0.0-19
20120514-inca.patch sbrunthaler, 2012-05-14 20:31 Patch including quickening of INPLACE_* instructions, supersedes previous patch file (20120511-inca.patch) review
20120514-inca-perf.txt sbrunthaler, 2012-05-14 20:39 perf.py results including INPLACE_* quickening; 12 repetitions on i7-920, summarized
Repositories containing patches
https://bitbucket.org/sbrunthaler/cpython-inline-caching
Messages (12)
msg160216 - (view) Author: stefan brunthaler (sbrunthaler) * Date: 2012-05-08 17:37
The attached patch adds quickening based inline caching (INCA) to the CPython 3.3 interpreter. It uses a code generator to generate instruction derivatives using the mako template engine, and a set of utility functions to enable automatic and safe quickening.

The code generator itself resides in "cgen" and the generated files reside in "Python/opt/gen". Auxiliary files resides in "Python/opt" and only minor changes are necessary in ceval.c and places where type feedback is possible (mostly in abstract.c and object.c)

Details of the technique have been published (see my home page: http://www.ics.uci.edu/~sbruntha/.) 

On my machine (i7-920 with Intel Turbo Boost disabled) this results in average arithmetic speedups of 1.47 over the vanilla interpreter without threaded code/computed gotos, and 1.13 over an interpreter with threaded code/computed gotos enabled. (Maximal speedups are 1.61 over the vanilla interpreter and 1.17 over the threaded code interpreter.) The optimized interpreter uses 206 instructions which currently only cover the standard library, i.e., there is still ample space left for optimized instruction derivatives for popular applications/libraries, such as NumPy or Django.

Furthermore, based on the purely interpretative nature of the technique, there are no compatibility implications (modulo libraries/modules relying on concrete opcode values---I would guess that such code is rather unlikely, but one never knows...) Additional memory overhead is minimal, too, since the technique only requires space for the new derivatives and is something along the lines of 80-100 KiB.
msg160222 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2012-05-08 20:18
This looks quite impressive, so sorry for immediately jumping in with
criticism. -- I've benchmarked the things I worked on, and I can't see
any speedups but some significant slowdowns. This is on 64-bit Linux
with a Core 2 Duo, both versions compiled with just `./configure && make`:


Modules/_decimal/tests/bench.py:
--------------------------------

Not much change for floats and decimal.py, 8-10% slowdown for _decimal!


Telco benchmark [1]:
--------------------

4% slowdown.


Memoryview:
-----------

./python -m timeit -n 10000000 -s "x = memoryview(bytearray(b'x'*10000))" "x[:100]"

17% (!) slowdown.


Did I perhaps miss some option to turn on the optimizations?



[1] http://www.bytereef.org/mpdecimal/quickstart.html#telco-benchmark
msg160224 - (view) Author: stefan brunthaler (sbrunthaler) * Date: 2012-05-08 20:58
> This looks quite impressive, so sorry for immediately jumping in with
> criticism. -- I've benchmarked the things I worked on, and I can't see
> any speedups but some significant slowdowns. This is on 64-bit Linux
> with a Core 2 Duo, both versions compiled with just `./configure && make`:

Well, no problem -- I don't actually consider it criticism at all.
Build is correct, you could verify the interpreter working adequatly
by running the test suite and seeing some tests depending on specific
bytecodes fail (test_dis, and test_importlib, AFAIR).

I don't have a Core 2 Duo available for testing, though.

> Modules/_decimal/tests/bench.py:
> --------------------------------
>
> Not much change for floats and decimal.py, 8-10% slowdown for _decimal!

This result is not unexpected, as I have no inline cached versions of
functions using this module. The derivatives I generate work for Long,
Float and Complex numbers (plus Unicode strings and some others.) If
there is a clear need, of course I can look into that and add these
derivatives (as I said, there are still some 40+ opcodes unused.)

> Memoryview:
> -----------
>
> ./python -m timeit -n 10000000 -s "x = memoryview(bytearray(b'x'*10000))" "x[:100]"
>
> 17% (!) slowdown.

Hm, the 17% slowdown seems strange to me. However, I don't expect to
see any speedups in this case, as there is no repeated execution
within the benchmark code that could leverage type feedback via inline
caching.

You should see most speedups when dealing with for-loops (as FOR_ITER
has optimized derivatives), if-statements (COMPARE_OP has optimized
derivatives), and mathematical code. In addition there are some
optimizations for frequently executed function calls, unpacked
sequences, etc. Note: frequent as in how I encountered them, probably
this needs adjustments for different use cases.

> Did I perhaps miss some option to turn on the optimizations?

Does not seem to be the case, but if you could verify running the
regression tests we could easily eliminate this scenario. You could
verifiy speedups, too, on computer language benchmark game benchmarks,
primarily binarytrees, mandelbrot, nbody and spectralnorm, just to see
how much you *should* gain on your machine. Testing methodology could
also make a difference. I use the following:
- Linux 3.0.0-17 (Ubuntu)
- gcc version 4.6.1
- nice -n -20 to minimize scheduler interference
- 30 repetitions per benchmark

I hope that helps/explains,
regards,
--stefan
msg160225 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-08 21:09
Well I've just run the benchmark suite from http://hg.python.org/benchmarks/ on a Core i5-2500K under 64-bit Linux with gcc 4.5.2, and got the following results:


Report on Linux localhost.localdomain 2.6.38.8-desktop-10.mga #1 SMP Wed Jan 25 10:17:18 UTC 2012 x86_64 x86_64
Total CPU cores: 4

### call_method ###
Min: 0.281984 -> 0.272776: 1.03x faster
Avg: 0.284389 -> 0.275780: 1.03x faster
Significant (t=34.11)
Stddev: 0.00170 -> 0.00258: 1.5137x larger
Timeline: http://tinyurl.com/bvr6vgc

### call_method_slots ###
Min: 0.284536 -> 0.266709: 1.07x faster
Avg: 0.287137 -> 0.268378: 1.07x faster
Significant (t=146.85)
Stddev: 0.00132 -> 0.00084: 1.5761x smaller
Timeline: http://tinyurl.com/c5a2exr

### call_method_unknown ###
Min: 0.282038 -> 0.267772: 1.05x faster
Avg: 0.286445 -> 0.272408: 1.05x faster
Significant (t=36.76)
Stddev: 0.00368 -> 0.00289: 1.2735x smaller
Timeline: http://tinyurl.com/d9c44cg

### call_simple ###
Min: 0.209886 -> 0.199631: 1.05x faster
Avg: 0.213417 -> 0.202189: 1.06x faster
Significant (t=34.52)
Stddev: 0.00304 -> 0.00258: 1.1775x smaller
Timeline: http://tinyurl.com/dxes8w7

### fastunpickle ###
Min: 0.400666 -> 0.417532: 1.04x slower
Avg: 0.406783 -> 0.426830: 1.05x slower
Significant (t=-21.12)
Stddev: 0.00427 -> 0.00518: 1.2148x larger
Timeline: http://tinyurl.com/c4rp95k

### formatted_logging ###
Min: 0.277797 -> 0.301302: 1.08x slower
Avg: 0.281376 -> 0.308164: 1.10x slower
Significant (t=-37.95)
Stddev: 0.00239 -> 0.00438: 1.8378x larger
Timeline: http://tinyurl.com/ck9ggaj

### iterative_count ###
Min: 0.103784 -> 0.114922: 1.11x slower
Avg: 0.105549 -> 0.117207: 1.11x slower
Significant (t=-40.04)
Stddev: 0.00140 -> 0.00151: 1.0725x larger
Timeline: http://tinyurl.com/cffbdw4

### nbody ###
Min: 0.241282 -> 0.228552: 1.06x faster
Avg: 0.247193 -> 0.233512: 1.06x faster
Significant (t=19.99)
Stddev: 0.00369 -> 0.00313: 1.1766x smaller
Timeline: http://tinyurl.com/c85oabx

### nqueens ###
Min: 0.260105 -> 0.298352: 1.15x slower
Avg: 0.263279 -> 0.303216: 1.15x slower
Significant (t=-72.55)
Stddev: 0.00220 -> 0.00321: 1.4595x larger
Timeline: http://tinyurl.com/c4w9qul

### regex_effbot ###
Min: 0.061832 -> 0.064062: 1.04x slower
Avg: 0.062817 -> 0.066098: 1.05x slower
Significant (t=-19.95)
Stddev: 0.00071 -> 0.00092: 1.2962x larger
Timeline: http://tinyurl.com/bnfh5s3

### regex_v8 ###
Min: 0.063420 -> 0.066753: 1.05x slower
Avg: 0.064579 -> 0.067768: 1.05x slower
Significant (t=-8.38)
Stddev: 0.00186 -> 0.00195: 1.0464x larger
Timeline: http://tinyurl.com/bq4uwr4

### richards ###
Min: 0.157204 -> 0.161524: 1.03x slower
Avg: 0.159884 -> 0.166099: 1.04x slower
Significant (t=-13.74)
Stddev: 0.00198 -> 0.00251: 1.2709x larger
Timeline: http://tinyurl.com/c774hmy

### silent_logging ###
Min: 0.066531 -> 0.071248: 1.07x slower
Avg: 0.068753 -> 0.073617: 1.07x slower
Significant (t=-16.43)
Stddev: 0.00133 -> 0.00162: 1.2181x larger
Timeline: http://tinyurl.com/cmg433a

### simple_logging ###
Min: 0.265349 -> 0.280200: 1.06x slower
Avg: 0.269531 -> 0.285340: 1.06x slower
Significant (t=-23.12)
Stddev: 0.00295 -> 0.00383: 1.2957x larger
Timeline: http://tinyurl.com/dytv5rs

### threaded_count ###
Min: 0.102369 -> 0.115217: 1.13x slower
Avg: 0.104309 -> 0.117157: 1.12x slower
Significant (t=-58.97)
Stddev: 0.00113 -> 0.00104: 1.0893x smaller
Timeline: http://tinyurl.com/cu4b5hn

### unpack_sequence ###
Min: 0.000044 -> 0.000051: 1.18x slower
Avg: 0.000045 -> 0.000053: 1.18x slower
Significant (t=-557.67)
Stddev: 0.00000 -> 0.00000: 1.1571x larger
Timeline: http://tinyurl.com/d822sf2

The following not significant results are hidden, use -v to show them:
fastpickle, float, json_dump, json_load, normal_startup, pidigits, regex_compile, startup_nosite.
msg160228 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2012-05-08 21:28
Here are my perf.py results:


Report on Linux localhost 2.6.32-28-generic #55-Ubuntu SMP Mon Jan 10 23:42:43 UTC 2011 x86_64 
Total CPU cores: 2

### call_method_unknown ###
Min: 0.345289 -> 0.358096: 1.04x slower
Avg: 0.345678 -> 0.358167: 1.04x slower
Significant (t=-713.58)
Stddev: 0.00019 -> 0.00010: 1.7897x smaller
Timeline: b'http://tinyurl.com/c9k7oxu'

### call_simple ###
Min: 0.297635 -> 0.239782: 1.24x faster
Avg: 0.297788 -> 0.240961: 1.24x faster
Significant (t=1183.63)
Stddev: 0.00020 -> 0.00055: 2.7115x larger
Timeline: b'http://tinyurl.com/c769rmj'

### fastunpickle ###
Min: 0.640831 -> 0.629214: 1.02x faster
Avg: 0.646831 -> 0.630471: 1.03x faster
Significant (t=34.25)
Stddev: 0.00324 -> 0.00095: 3.4098x smaller
Timeline: b'http://tinyurl.com/cpyrcoy'

### float ###
Min: 0.067458 -> 0.070712: 1.05x slower
Avg: 0.069109 -> 0.072166: 1.04x slower
Significant (t=-27.33)
Stddev: 0.00119 -> 0.00131: 1.0939x larger
Timeline: b'http://tinyurl.com/c2mhk5o'


### iterative_count ###
Min: 0.132501 -> 0.157027: 1.19x slower
Avg: 0.132724 -> 0.157425: 1.19x slower
Significant (t=-628.41)
Stddev: 0.00015 -> 0.00023: 1.5093x larger
Timeline: b'http://tinyurl.com/7auyflz'

### json_dump ###
Min: 0.569771 -> 0.588654: 1.03x slower
Avg: 0.571149 -> 0.589615: 1.03x slower
Significant (t=-101.32)
Stddev: 0.00094 -> 0.00088: 1.0689x smaller
Timeline: b'http://tinyurl.com/d4hhfjh'

### nbody ###
Min: 0.283447 -> 0.274458: 1.03x faster
Avg: 0.289093 -> 0.281663: 1.03x faster
Significant (t=6.47)
Stddev: 0.00764 -> 0.00276: 2.7642x smaller
Timeline: b'http://tinyurl.com/7483dap'

### normal_startup ###
Min: 0.649437 -> 0.666486: 1.03x slower
Avg: 0.650395 -> 0.669678: 1.03x slower
Significant (t=-10.47)
Stddev: 0.00048 -> 0.01302: 27.2168x larger
Timeline: b'http://tinyurl.com/d773p3c'

### nqueens ###
Min: 0.380882 -> 0.390680: 1.03x slower
Avg: 0.382476 -> 0.391790: 1.02x slower
Significant (t=-56.25)
Stddev: 0.00069 -> 0.00095: 1.3821x larger
Timeline: b'http://tinyurl.com/c49drdo'

### regex_compile ###
Min: 0.544866 -> 0.564778: 1.04x slower
Avg: 0.546157 -> 0.565641: 1.04x slower
Significant (t=-159.84)
Stddev: 0.00076 -> 0.00041: 1.8308x smaller
Timeline: b'http://tinyurl.com/cxv3s5n'

### richards ###
Min: 0.217447 -> 0.233456: 1.07x slower
Avg: 0.220199 -> 0.235870: 1.07x slower
Significant (t=-55.49)
Stddev: 0.00154 -> 0.00127: 1.2111x smaller
Timeline: b'http://tinyurl.com/7h5zd82'

### silent_logging ###
Min: 0.088886 -> 0.086601: 1.03x faster
Avg: 0.089748 -> 0.087314: 1.03x faster
Significant (t=27.06)
Stddev: 0.00056 -> 0.00030: 1.8744x smaller
Timeline: b'http://tinyurl.com/6mub6j8'

### simple_logging ###
Min: 0.384265 -> 0.394479: 1.03x slower
Avg: 0.386462 -> 0.396793: 1.03x slower
Significant (t=-28.26)
Stddev: 0.00189 -> 0.00177: 1.0687x smaller
Timeline: b'http://tinyurl.com/cwacrva'

### threaded_count ###
Min: 0.149418 -> 0.213361: 1.43x slower
Avg: 0.195123 -> 0.224733: 1.15x slower
Significant (t=-17.77)
Stddev: 0.01011 -> 0.00604: 1.6744x smaller
Timeline: b'http://tinyurl.com/chh36kr'

### unpack_sequence ###
Min: 0.000054 -> 0.000063: 1.17x slower
Avg: 0.000054 -> 0.000064: 1.17x slower
Significant (t=-1952.49)
Stddev: 0.00000 -> 0.00000: 2.2326x smaller
Timeline: b'http://tinyurl.com/6nps6ew'
msg160230 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2012-05-08 22:00
> > Modules/_decimal/tests/bench.py:
> > --------------------------------
> >
> > Not much change for floats and decimal.py, 8-10% slowdown for _decimal!
> 
> This result is not unexpected, as I have no inline cached versions of
> functions using this module. The derivatives I generate work for Long,
> Float and Complex numbers (plus Unicode strings and some others.)

But I couldn't detect a speedup for either float or int in bench.py. Also,
in perf.py I'm consistently getting slower results for float with the patch.

> there is a clear need, of course I can look into that and add these
> derivatives (as I said, there are still some 40+ opcodes unused.)

I'd say that the minimum requirement for a performance enhancement
patch is that there aren't any slowdowns. :)

I'm getting a 9% speedup for mandelbrot and an 18% speedup for spectralnorm.
That's nice, but many benchmarks that Antoine and I have posted show
slowdowns of 15-18%.
msg160453 - (view) Author: stefan brunthaler (sbrunthaler) * Date: 2012-05-11 21:32
This is the updated patch file that fixes the performance issues measurable using the official perf.py py3k test suite.
msg160457 - (view) Author: stefan brunthaler (sbrunthaler) * Date: 2012-05-11 22:14
So I took a close look to what the performance problem was. Many of the benchmarks used by the perf.py py3k benchmarks use function calls for which there are no optimized derivatives available. In this case the function trying to do the quickening (aptly called "quickening_call_function") did not have a proper target instruction. Therefore, it was "stuck" with the additional quickening code in "quickening_call_function." This causes some overhead, or at least ate away some of the benefits of quickening. The new patch takes care of quickening to another function, if no better target is available, the only optimization being to have one instruction for calling a C function or a Python function/method (instruction labels being INCA_C_CALL, and INCA_PYTHON_CALL respectively. The new interpreter dispatch loop therefore has 208 instructions.)

Furthermore, I have also added a baseline evaluation of running perf.py on my system, where I measure the vanilla Python 3.3 version against itself. I see quite some noise there, with some benchmarks showing up to 10pct outliers. On average, my interpretation is that around five percent variance needs to be accounted for.

The new patch shows an overall improvement. For the slower ones, I guess that's within the error margin demonstrated by the baseline evaluation run. The remaining speedups are acceptable, even though I could tune the code generator to have more optimized derivatives in place. I took a closer look at the float benchmark, because my experience with the computer language benchmarks game the INCA optimization performs quite well there.

These are the dynamic instruction frequencies for the two most frequently executed functions:

maximize:
---------
   Freq.  |pos | instruction                         | arg
   1999900:   0: LOAD_FAST                             0
   1999900:   3: LOAD_ATTR                             0
   1999900:   6: LOAD_FAST                             1
   1999900:   9: LOAD_ATTR                             0
   1999900:  12: INCA_CMP_FLOAT                        4
   1999900:  15: POP_JUMP_IF_FALSE                     27
   1996400:  18: LOAD_FAST                             0
   1996400:  21: LOAD_ATTR                             0
   1996400:  24: JUMP_FORWARD                          6
      3500:  27: LOAD_FAST                             1
      3500:  30: LOAD_ATTR                             0
   1999900:  33: LOAD_FAST                             0
   1999900:  36: STORE_ATTR                            0
   1999900:  39: LOAD_FAST                             0
   1999900:  42: LOAD_ATTR                             1
   1999900:  45: LOAD_FAST                             1
   1999900:  48: LOAD_ATTR                             1
   1999900:  51: INCA_CMP_FLOAT                        4
   1999900:  54: POP_JUMP_IF_FALSE                     66
   1999900:  57: LOAD_FAST                             0
   1999900:  60: LOAD_ATTR                             1
   1999900:  63: JUMP_FORWARD                          6
   1999900:  72: LOAD_FAST                             0
   1999900:  75: STORE_ATTR                            1
   1999900:  78: LOAD_FAST                             0
   1999900:  81: LOAD_ATTR                             2
   1999900:  84: LOAD_FAST                             1
   1999900:  87: LOAD_ATTR                             2
   1999900:  90: INCA_CMP_FLOAT                        4
   1999900:  93: POP_JUMP_IF_FALSE                     105
   1993800:  96: LOAD_FAST                             0
   1993800:  99: LOAD_ATTR                             2
   1993800: 102: JUMP_FORWARD                          6
      6100: 105: LOAD_FAST                             1
      6100: 108: LOAD_ATTR                             2
   1999900: 111: LOAD_FAST                             0
   1999900: 114: STORE_ATTR                            2
   1999900: 117: LOAD_CONST                            0
   1999900: 120: RETURN_VALUE                          0

normalize:
----------
   Freq.  |pos | instruction                         | arg
   2000000:   0: LOAD_GLOBAL                           0
   2000000:   3: LOAD_FAST                             0
   2000000:   6: LOAD_ATTR                             1
   2000000:   9: LOAD_CONST                            1
   2000000:  12: INCA_FLOAT_POWER                      1
   2000000:  13: LOAD_FAST                             0
   2000000:  16: LOAD_ATTR                             2
   2000000:  19: LOAD_CONST                            1
   2000000:  22: INCA_FLOAT_POWER                      1
   2000000:  23: INCA_FLOAT_ADD                        1
   2000000:  24: LOAD_FAST                             0
   2000000:  27: LOAD_ATTR                             3
   2000000:  30: LOAD_CONST                            1
   2000000:  33: INCA_FLOAT_POWER                      1
   2000000:  34: INCA_FLOAT_ADD                        1
   2000000:  35: FAST_C_ONE                            1            (call?)
   2000000:  38: STORE_FAST                            1
   2000000:  41: LOAD_FAST                             0
   2000000:  44: LOAD_ATTR                             1
   2000000:  47: LOAD_FAST                             1
   2000000:  50: INCA_FLOAT_TRUE_DIVIDE                1
   2000000:  51: LOAD_FAST                             0
   2000000:  54: STORE_ATTR                            1
   2000000:  57: LOAD_FAST                             0
   2000000:  60: LOAD_ATTR                             2
   2000000:  63: LOAD_FAST                             1
   2000000:  66: INCA_FLOAT_TRUE_DIVIDE                1
   2000000:  67: LOAD_FAST                             0
   2000000:  70: STORE_ATTR                            2
   2000000:  73: LOAD_FAST                             0
   2000000:  76: LOAD_ATTR                             3
   2000000:  79: LOAD_FAST                             1
   2000000:  82: INCA_FLOAT_TRUE_DIVIDE                1
   2000000:  83: LOAD_FAST                             0
   2000000:  86: STORE_ATTR                            3
   2000000:  89: LOAD_CONST                            0
   2000000:  92: RETURN_VALUE                          0


It is clear from these data that INCA_FLOAT_POWER and INCA_FLOAT_TRUE_DIVIDE are the most frequently executed float operations during run time. However, the code-gen only inlines the simple cases of INCA_FLOAT_{ADD, SUBTRACT, and MULT}. In consequence, we see a lower optimization potential around here. One solution to improving this benchmarks' performance would be to inline the POWER and TRUE_DIVIDE operations into the corresponding derivatives (for TRUE_DIVIDE this is easy, I'll try and run the numbers again.)
The dynamic instruction frequencies show that there are many data movement instructions interleaved. If I find some spare time somewhere hidden in some future weekend, I'm going to focus on getting partial-stack-frame-caching to work, too, as this should have a noticeable impact on such code.
msg160666 - (view) Author: stefan brunthaler (sbrunthaler) * Date: 2012-05-14 20:39
While looking at the instruction frequency traces, I noticed that INPLACE_* instructions were not quickened to optimized instruction derivatives. The submitted patch fixes this issue and correctly generates code performing the substitution. The performance improvements are noticeably better (iterative_count, nbody and threaded_count about 1.25x faster on average.)
The summarized performance numbers make it easy to see where actual performance gains* are, and which results seem to be due to measurement error.

*: It seems that the mako benchmark suite suffers a consistent loss. I am going to look into this.
msg160870 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-16 14:33
Perhaps that's just me, but I find the performance gains rather limited given the sheer size of the changes. Is there any non-micro benchmark where the performance gains are actually substantial (say, more than 20%)?
msg160881 - (view) Author: stefan brunthaler (sbrunthaler) * Date: 2012-05-16 16:37
> Perhaps that's just me, but I find the performance gains rather limited given the sheer size of the changes.

Well there are a couple of things to keep in mind:

a) There is a substantial speedup potential in further interpretative
optimizations, but they come at increased complexity (mostly due to a
different instruction encoding). From the response on python-dev I
took away that this is not what people want.

b) The size is deceptive: the patch contains all resources, i.e., the
code gen *and* the generated files. I could split it up into three
separate patches to show that the *actual* intersection with existing
Python sources is very small. (Disregarding opcode.h, my guess is that
it's about a 100 lines.)

c) There are no reasonable compatbility implications (modulo code that
checks specific opcode values) and the memory consumption is
essentially nil (<= 100KiB, constant.)

There are further speedups available by ordering the interpreter
instructions (I have a paper on that called "Interpreter Instruction
Scheduling", and am currently working on a better algorithm [well, the
algorithm already exists, I'm just evaluating it].) I could easily add
that at no extra cost to the implementation, too.

> Is there any non-micro benchmark where the performance gains are actually substantial (say, more than 20%)?

Hm, I don't know. Are there applications/frameworks running on Python
3 that I can benchmark with?

Based on my experience, the speedups should be achievable across the
board, primarily because the most frequent CALL_FUNCTION instructions
have optimized derivatives. In addition with the arithmetic and
COMPARE_OP derivatives this covers a wide array of dynamic instruction
frequency mixes. There exist further inlining capabilities, too, which
can be easily added to the code generator.
The only reason why some benchmarks don't achieve expected speedups
isdue to them using operations where the code-gen does not contain
optimized derivatives. There is still space for ~45 derivatives to
cover those (including some important application-specific ones.)
msg275492 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-09-09 23:26
Link to the paper: https://www.sba-research.org/wp-content/uploads/publications/ecoop10.pdf
History
Date User Action Args
2016-09-09 23:26:39vstinnersetmessages: + msg275492
2016-09-09 23:25:38yselivanovsetnosy: + yselivanov
2014-10-14 15:02:42skrahsetnosy: - skrah
2013-04-07 23:25:44vstinnersetnosy: + vstinner
2012-05-20 22:21:03meador.ingesetnosy: + meador.inge
2012-05-16 16:37:51sbrunthalersetmessages: + msg160881
2012-05-16 14:33:17pitrousettype: enhancement -> performance
messages: + msg160870
2012-05-15 03:00:09cvrebertsetnosy: + cvrebert
2012-05-14 20:39:01sbrunthalersetfiles: + 20120514-inca-perf.txt

messages: + msg160666
2012-05-14 20:31:40sbrunthalersetfiles: + 20120514-inca.patch
2012-05-11 22:14:52sbrunthalersetmessages: + msg160457
2012-05-11 21:35:04sbrunthalersetfiles: + 20120510-vanilla-perf.txt
2012-05-11 21:34:03sbrunthalersetfiles: + 20120511-inca-perf.txt
2012-05-11 21:32:16sbrunthalersetfiles: + 20120511-inca.patch

messages: + msg160453
2012-05-08 22:00:12skrahsetmessages: + msg160230
2012-05-08 21:28:50skrahsetmessages: + msg160228
2012-05-08 21:09:37pitrousetnosy: + pitrou
messages: + msg160225
2012-05-08 20:58:54sbrunthalersetmessages: + msg160224
2012-05-08 20:18:42skrahsetnosy: + skrah
messages: + msg160222
2012-05-08 19:48:43dmalcolmsetnosy: + dmalcolm
2012-05-08 18:23:55eric.snowsetnosy: + eric.snow
2012-05-08 17:37:54sbrunthalercreate