Message 160457 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	sbrunthaler
Recipients	dmalcolm, eric.snow, pitrou, sbrunthaler, skrah
Date	2012-05-11.22:14:52
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1336774492.93.0.662185862617.issue14757@psf.upfronthosting.co.za>
In-reply-to

Content
So I took a close look to what the performance problem was. Many of the benchmarks used by the perf.py py3k benchmarks use function calls for which there are no optimized derivatives available. In this case the function trying to do the quickening (aptly called "quickening_call_function") did not have a proper target instruction. Therefore, it was "stuck" with the additional quickening code in "quickening_call_function." This causes some overhead, or at least ate away some of the benefits of quickening. The new patch takes care of quickening to another function, if no better target is available, the only optimization being to have one instruction for calling a C function or a Python function/method (instruction labels being INCA_C_CALL, and INCA_PYTHON_CALL respectively. The new interpreter dispatch loop therefore has 208 instructions.) Furthermore, I have also added a baseline evaluation of running perf.py on my system, where I measure the vanilla Python 3.3 version against itself. I see quite some noise there, with some benchmarks showing up to 10pct outliers. On average, my interpretation is that around five percent variance needs to be accounted for. The new patch shows an overall improvement. For the slower ones, I guess that's within the error margin demonstrated by the baseline evaluation run. The remaining speedups are acceptable, even though I could tune the code generator to have more optimized derivatives in place. I took a closer look at the float benchmark, because my experience with the computer language benchmarks game the INCA optimization performs quite well there. These are the dynamic instruction frequencies for the two most frequently executed functions: maximize: --------- Freq. \|pos \| instruction \| arg 1999900: 0: LOAD_FAST 0 1999900: 3: LOAD_ATTR 0 1999900: 6: LOAD_FAST 1 1999900: 9: LOAD_ATTR 0 1999900: 12: INCA_CMP_FLOAT 4 1999900: 15: POP_JUMP_IF_FALSE 27 1996400: 18: LOAD_FAST 0 1996400: 21: LOAD_ATTR 0 1996400: 24: JUMP_FORWARD 6 3500: 27: LOAD_FAST 1 3500: 30: LOAD_ATTR 0 1999900: 33: LOAD_FAST 0 1999900: 36: STORE_ATTR 0 1999900: 39: LOAD_FAST 0 1999900: 42: LOAD_ATTR 1 1999900: 45: LOAD_FAST 1 1999900: 48: LOAD_ATTR 1 1999900: 51: INCA_CMP_FLOAT 4 1999900: 54: POP_JUMP_IF_FALSE 66 1999900: 57: LOAD_FAST 0 1999900: 60: LOAD_ATTR 1 1999900: 63: JUMP_FORWARD 6 1999900: 72: LOAD_FAST 0 1999900: 75: STORE_ATTR 1 1999900: 78: LOAD_FAST 0 1999900: 81: LOAD_ATTR 2 1999900: 84: LOAD_FAST 1 1999900: 87: LOAD_ATTR 2 1999900: 90: INCA_CMP_FLOAT 4 1999900: 93: POP_JUMP_IF_FALSE 105 1993800: 96: LOAD_FAST 0 1993800: 99: LOAD_ATTR 2 1993800: 102: JUMP_FORWARD 6 6100: 105: LOAD_FAST 1 6100: 108: LOAD_ATTR 2 1999900: 111: LOAD_FAST 0 1999900: 114: STORE_ATTR 2 1999900: 117: LOAD_CONST 0 1999900: 120: RETURN_VALUE 0 normalize: ---------- Freq. \|pos \| instruction \| arg 2000000: 0: LOAD_GLOBAL 0 2000000: 3: LOAD_FAST 0 2000000: 6: LOAD_ATTR 1 2000000: 9: LOAD_CONST 1 2000000: 12: INCA_FLOAT_POWER 1 2000000: 13: LOAD_FAST 0 2000000: 16: LOAD_ATTR 2 2000000: 19: LOAD_CONST 1 2000000: 22: INCA_FLOAT_POWER 1 2000000: 23: INCA_FLOAT_ADD 1 2000000: 24: LOAD_FAST 0 2000000: 27: LOAD_ATTR 3 2000000: 30: LOAD_CONST 1 2000000: 33: INCA_FLOAT_POWER 1 2000000: 34: INCA_FLOAT_ADD 1 2000000: 35: FAST_C_ONE 1 (call?) 2000000: 38: STORE_FAST 1 2000000: 41: LOAD_FAST 0 2000000: 44: LOAD_ATTR 1 2000000: 47: LOAD_FAST 1 2000000: 50: INCA_FLOAT_TRUE_DIVIDE 1 2000000: 51: LOAD_FAST 0 2000000: 54: STORE_ATTR 1 2000000: 57: LOAD_FAST 0 2000000: 60: LOAD_ATTR 2 2000000: 63: LOAD_FAST 1 2000000: 66: INCA_FLOAT_TRUE_DIVIDE 1 2000000: 67: LOAD_FAST 0 2000000: 70: STORE_ATTR 2 2000000: 73: LOAD_FAST 0 2000000: 76: LOAD_ATTR 3 2000000: 79: LOAD_FAST 1 2000000: 82: INCA_FLOAT_TRUE_DIVIDE 1 2000000: 83: LOAD_FAST 0 2000000: 86: STORE_ATTR 3 2000000: 89: LOAD_CONST 0 2000000: 92: RETURN_VALUE 0 It is clear from these data that INCA_FLOAT_POWER and INCA_FLOAT_TRUE_DIVIDE are the most frequently executed float operations during run time. However, the code-gen only inlines the simple cases of INCA_FLOAT_{ADD, SUBTRACT, and MULT}. In consequence, we see a lower optimization potential around here. One solution to improving this benchmarks' performance would be to inline the POWER and TRUE_DIVIDE operations into the corresponding derivatives (for TRUE_DIVIDE this is easy, I'll try and run the numbers again.) The dynamic instruction frequencies show that there are many data movement instructions interleaved. If I find some spare time somewhere hidden in some future weekend, I'm going to focus on getting partial-stack-frame-caching to work, too, as this should have a noticeable impact on such code.

So I took a close look to what the performance problem was. Many of the benchmarks used by the perf.py py3k benchmarks use function calls for which there are no optimized derivatives available. In this case the function trying to do the quickening (aptly called "quickening_call_function") did not have a proper target instruction. Therefore, it was "stuck" with the additional quickening code in "quickening_call_function." This causes some overhead, or at least ate away some of the benefits of quickening. The new patch takes care of quickening to another function, if no better target is available, the only optimization being to have one instruction for calling a C function or a Python function/method (instruction labels being INCA_C_CALL, and INCA_PYTHON_CALL respectively. The new interpreter dispatch loop therefore has 208 instructions.)

Furthermore, I have also added a baseline evaluation of running perf.py on my system, where I measure the vanilla Python 3.3 version against itself. I see quite some noise there, with some benchmarks showing up to 10pct outliers. On average, my interpretation is that around five percent variance needs to be accounted for.

The new patch shows an overall improvement. For the slower ones, I guess that's within the error margin demonstrated by the baseline evaluation run. The remaining speedups are acceptable, even though I could tune the code generator to have more optimized derivatives in place. I took a closer look at the float benchmark, because my experience with the computer language benchmarks game the INCA optimization performs quite well there.

These are the dynamic instruction frequencies for the two most frequently executed functions:

maximize:
---------
   Freq.  |pos | instruction                         | arg
   1999900:   0: LOAD_FAST                             0
   1999900:   3: LOAD_ATTR                             0
   1999900:   6: LOAD_FAST                             1
   1999900:   9: LOAD_ATTR                             0
   1999900:  12: INCA_CMP_FLOAT                        4
   1999900:  15: POP_JUMP_IF_FALSE                     27
   1996400:  18: LOAD_FAST                             0
   1996400:  21: LOAD_ATTR                             0
   1996400:  24: JUMP_FORWARD                          6
      3500:  27: LOAD_FAST                             1
      3500:  30: LOAD_ATTR                             0
   1999900:  33: LOAD_FAST                             0
   1999900:  36: STORE_ATTR                            0
   1999900:  39: LOAD_FAST                             0
   1999900:  42: LOAD_ATTR                             1
   1999900:  45: LOAD_FAST                             1
   1999900:  48: LOAD_ATTR                             1
   1999900:  51: INCA_CMP_FLOAT                        4
   1999900:  54: POP_JUMP_IF_FALSE                     66
   1999900:  57: LOAD_FAST                             0
   1999900:  60: LOAD_ATTR                             1
   1999900:  63: JUMP_FORWARD                          6
   1999900:  72: LOAD_FAST                             0
   1999900:  75: STORE_ATTR                            1
   1999900:  78: LOAD_FAST                             0
   1999900:  81: LOAD_ATTR                             2
   1999900:  84: LOAD_FAST                             1
   1999900:  87: LOAD_ATTR                             2
   1999900:  90: INCA_CMP_FLOAT                        4
   1999900:  93: POP_JUMP_IF_FALSE                     105
   1993800:  96: LOAD_FAST                             0
   1993800:  99: LOAD_ATTR                             2
   1993800: 102: JUMP_FORWARD                          6
      6100: 105: LOAD_FAST                             1
      6100: 108: LOAD_ATTR                             2
   1999900: 111: LOAD_FAST                             0
   1999900: 114: STORE_ATTR                            2
   1999900: 117: LOAD_CONST                            0
   1999900: 120: RETURN_VALUE                          0

normalize:
----------
   Freq.  |pos | instruction                         | arg
   2000000:   0: LOAD_GLOBAL                           0
   2000000:   3: LOAD_FAST                             0
   2000000:   6: LOAD_ATTR                             1
   2000000:   9: LOAD_CONST                            1
   2000000:  12: INCA_FLOAT_POWER                      1
   2000000:  13: LOAD_FAST                             0
   2000000:  16: LOAD_ATTR                             2
   2000000:  19: LOAD_CONST                            1
   2000000:  22: INCA_FLOAT_POWER                      1
   2000000:  23: INCA_FLOAT_ADD                        1
   2000000:  24: LOAD_FAST                             0
   2000000:  27: LOAD_ATTR                             3
   2000000:  30: LOAD_CONST                            1
   2000000:  33: INCA_FLOAT_POWER                      1
   2000000:  34: INCA_FLOAT_ADD                        1
   2000000:  35: FAST_C_ONE                            1            (call?)
   2000000:  38: STORE_FAST                            1
   2000000:  41: LOAD_FAST                             0
   2000000:  44: LOAD_ATTR                             1
   2000000:  47: LOAD_FAST                             1
   2000000:  50: INCA_FLOAT_TRUE_DIVIDE                1
   2000000:  51: LOAD_FAST                             0
   2000000:  54: STORE_ATTR                            1
   2000000:  57: LOAD_FAST                             0
   2000000:  60: LOAD_ATTR                             2
   2000000:  63: LOAD_FAST                             1
   2000000:  66: INCA_FLOAT_TRUE_DIVIDE                1
   2000000:  67: LOAD_FAST                             0
   2000000:  70: STORE_ATTR                            2
   2000000:  73: LOAD_FAST                             0
   2000000:  76: LOAD_ATTR                             3
   2000000:  79: LOAD_FAST                             1
   2000000:  82: INCA_FLOAT_TRUE_DIVIDE                1
   2000000:  83: LOAD_FAST                             0
   2000000:  86: STORE_ATTR                            3
   2000000:  89: LOAD_CONST                            0
   2000000:  92: RETURN_VALUE                          0


It is clear from these data that INCA_FLOAT_POWER and INCA_FLOAT_TRUE_DIVIDE are the most frequently executed float operations during run time. However, the code-gen only inlines the simple cases of INCA_FLOAT_{ADD, SUBTRACT, and MULT}. In consequence, we see a lower optimization potential around here. One solution to improving this benchmarks' performance would be to inline the POWER and TRUE_DIVIDE operations into the corresponding derivatives (for TRUE_DIVIDE this is easy, I'll try and run the numbers again.)
The dynamic instruction frequencies show that there are many data movement instructions interleaved. If I find some spare time somewhere hidden in some future weekend, I'm going to focus on getting partial-stack-frame-caching to work, too, as this should have a noticeable impact on such code.

History
Date	User	Action	Args
2012-05-11 22:14:52	sbrunthaler	set	recipients: + sbrunthaler, pitrou, skrah, dmalcolm, eric.snow
2012-05-11 22:14:52	sbrunthaler	set	messageid: <1336774492.93.0.662185862617.issue14757@psf.upfronthosting.co.za>
2012-05-11 22:14:52	sbrunthaler	link	issue14757 messages
2012-05-11 22:14:52	sbrunthaler	create