classification
Title: Faster opcode dispatch on gcc
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.1
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: gregory.p.smith Nosy List: Ringding, aimacintyre, ajaksu2, akuchling, bboissin, belopolsky, blaisorblade, christian.heimes, collinwinter, djc, facundobatista, ggenellina, gregory.p.smith, jab, jcea, jyasskin, kevinwatters, lemburg, mdionisio, pitrou, ralph.corderoy, rhettinger, spiv, theatrus, willp
Priority: normal Keywords: patch

Created on 2008-12-26 21:09 by pitrou, last changed 2010-07-19 23:55 by pitrou. This issue is now closed.

Files
File name Uploaded Description Edit
amd-athlon64-x2-pybench.txt alexandre.vassalotti, 2008-12-31 21:20
intel-core2-mobile-pybench.txt alexandre.vassalotti, 2008-12-31 21:21
amd-athlon64-x2-suncc-pybench.txt alexandre.vassalotti, 2009-01-01 21:58
amd-athlon64-x2-gcc-sunos-pybench.txt alexandre.vassalotti, 2009-01-01 21:58
threadedceval5.patch pitrou, 2009-01-01 22:59
pybench.sum.PPC skip.montanaro, 2009-01-02 20:26 Pybench summary for Mac G5
pybench.sum.Intel skip.montanaro, 2009-01-02 20:27 Pybench summary for MacBook Pro (Intel Core2 Duo)
amd-athlon64-x2-64bit-pybench.txt alexandre.vassalotti, 2009-01-03 00:45
ceval.i.unthreaded skip.montanaro, 2009-01-03 00:58 PPC (G5) unthreaded eval assembler
ceval.i.threaded skip.montanaro, 2009-01-03 01:00 PPC (G5) threaded eval assembler
abstract-switch.diff alexandre.vassalotti, 2009-01-04 19:29
abstract-switch-reduced.diff blaisorblade, 2009-01-07 10:11 Fixed version of abstract-switch.diff, without extraneous changes (IMHO)
restore-old-oparg-load.diff blaisorblade, 2009-01-07 10:12 Additional fix, on top of threadedceval5.patch and abstract-switch-reduced.diff, to restore back performances and have no switch at all
reenable-static-prediction.diff blaisorblade, 2009-01-07 11:09 Reenable static opcode prediction on top of the rest
pitrou_dispatch_2.7.patch jyasskin, 2009-01-11 17:37 threadedceval5.patch ported to trunk (2.7)
vmgen_2.7.patch jyasskin, 2009-01-12 17:58 vmgen (GForth)-based dispatch optimization with simple superinstructions
threadedceval6.patch pitrou, 2009-01-16 23:08
threadedceval6-py254.patch Ringding, 2009-01-21 14:02 threadedceval6 for Python 2.5.4
ceval.c.threadcode-tidyup.patch aimacintyre, 2009-03-22 07:37 threaded code tidy up in eval.c
newfile.zip mdionisio, 2009-07-18 19:58 ceval.c and opcode_targets.h
Messages (109)
msg78306 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-26 21:09
This patch implements what is usually called "threaded code" for the
ceval loop on compilers which support it (only gcc). The idea is that
there is a separate opcode dispatch epilog at the end of each opcode,
which allows the CPU to make much better use of its branch prediction
capabilities. The net result is a 15-20% average speedup on pybench and
pystone, with higher speedups on very tight loops (see below for the
full pybench result chart).

The opcode jump table is generated by a separate script which is called
as part of the Makefile (just as the asdl building script already is).

On compilers other than gcc, performance will of course be unchanged.


Test                             minimum run-time        average  run-time
                                 this    other   diff    this    other 
 diff
-------------------------------------------------------------------------------
          BuiltinFunctionCalls:   100ms   107ms   -7.1%   101ms   110ms
  -8.2%
           BuiltinMethodLookup:    76ms   106ms  -28.1%    78ms   106ms
 -26.5%
                 CompareFloats:   108ms   141ms  -23.2%   108ms   141ms
 -23.2%
         CompareFloatsIntegers:   171ms   188ms   -9.4%   173ms   204ms
 -15.3%
               CompareIntegers:   165ms   213ms  -22.6%   168ms   224ms
 -25.1%
        CompareInternedStrings:   127ms   169ms  -24.6%   127ms   169ms
 -24.8%
                  CompareLongs:    95ms   124ms  -23.1%    95ms   126ms
 -24.5%
                CompareStrings:   109ms   136ms  -20.2%   111ms   139ms
 -19.9%
    ComplexPythonFunctionCalls:   131ms   150ms  -12.4%   136ms   151ms
 -10.2%
                 ConcatStrings:   159ms   171ms   -6.9%   160ms   173ms
  -7.4%
               CreateInstances:   148ms   157ms   -5.6%   150ms   158ms
  -4.9%
            CreateNewInstances:   112ms   117ms   -4.3%   112ms   118ms
  -4.6%
       CreateStringsWithConcat:   144ms   198ms  -27.3%   148ms   199ms
 -25.7%
                  DictCreation:    90ms   104ms  -13.3%    90ms   104ms
 -13.1%
             DictWithFloatKeys:   117ms   153ms  -23.7%   117ms   154ms
 -24.0%
           DictWithIntegerKeys:   104ms   153ms  -32.3%   104ms   154ms
 -32.5%
            DictWithStringKeys:    90ms   140ms  -35.7%    90ms   141ms
 -36.3%
                      ForLoops:   100ms   161ms  -38.1%   100ms   161ms
 -38.1%
                    IfThenElse:   123ms   170ms  -28.0%   125ms   171ms
 -27.1%
                   ListSlicing:   142ms   141ms   +0.3%   142ms   142ms
  +0.2%
                NestedForLoops:   135ms   190ms  -29.0%   135ms   190ms
 -29.0%
          NormalClassAttribute:   249ms   281ms  -11.5%   249ms   281ms
 -11.3%
       NormalInstanceAttribute:   110ms   153ms  -28.2%   111ms   154ms
 -28.1%
           PythonFunctionCalls:   106ms   130ms  -18.7%   108ms   131ms
 -17.2%
             PythonMethodCalls:   151ms   169ms  -10.1%   152ms   169ms
  -9.8%
                     Recursion:   183ms   242ms  -24.7%   191ms   243ms
 -21.4%
                  SecondImport:   142ms   138ms   +2.7%   144ms   139ms
  +3.4%
           SecondPackageImport:   146ms   149ms   -2.3%   148ms   150ms
  -1.5%
         SecondSubmoduleImport:   201ms   193ms   +3.9%   201ms   195ms
  +3.4%
       SimpleComplexArithmetic:    90ms   112ms  -19.6%    90ms   112ms
 -19.8%
        SimpleDictManipulation:   172ms   230ms  -25.2%   173ms   231ms
 -25.0%
         SimpleFloatArithmetic:    98ms   133ms  -26.3%    99ms   137ms
 -27.9%
      SimpleIntFloatArithmetic:   134ms   175ms  -23.6%   138ms   176ms
 -21.6%
       SimpleIntegerArithmetic:   134ms   183ms  -26.8%   141ms   183ms
 -23.1%
        SimpleListManipulation:    91ms   143ms  -36.3%    93ms   143ms
 -35.1%
          SimpleLongArithmetic:    88ms   108ms  -17.9%    91ms   109ms
 -16.2%
                    SmallLists:   127ms   162ms  -21.6%   129ms   164ms
 -21.2%
                   SmallTuples:   149ms   177ms  -15.6%   151ms   178ms
 -15.1%
         SpecialClassAttribute:   423ms   426ms   -0.7%   426ms   430ms
  -0.9%
      SpecialInstanceAttribute:   110ms   154ms  -28.2%   111ms   154ms
 -28.3%
                StringMappings:   428ms   443ms   -3.4%   432ms   449ms
  -3.7%
              StringPredicates:   124ms   161ms  -23.1%   125ms   162ms
 -22.7%
                 StringSlicing:   207ms   223ms   -7.1%   208ms   228ms
  -8.7%
                     TryExcept:    72ms   166ms  -56.3%    73ms   166ms
 -56.2%
                    TryFinally:    93ms   120ms  -22.9%    93ms   124ms
 -25.2%
                TryRaiseExcept:    52ms    64ms  -19.2%    52ms    65ms
 -19.2%
                  TupleSlicing:   177ms   195ms   -9.1%   178ms   198ms
 -10.2%
                   WithFinally:   147ms   163ms  -10.2%   147ms   164ms
 -10.1%
               WithRaiseExcept:   156ms   173ms  -10.1%   157ms   174ms
  -9.7%
-------------------------------------------------------------------------------
Totals:                          6903ms  8356ms  -17.4%  6982ms  8443ms
 -17.3%
msg78307 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-26 21:22
Armin, by reading the pypy-dev mailing-list it looks like you were
interested in this.
msg78362 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-12-27 13:21
On 2008-12-26 22:09, Antoine Pitrou wrote:
> This patch implements what is usually called "threaded code" for the
> ceval loop on compilers which support it (only gcc). The idea is that
> there is a separate opcode dispatch epilog at the end of each opcode,
> which allows the CPU to make much better use of its branch prediction
> capabilities. The net result is a 15-20% average speedup on pybench and
> pystone, with higher speedups on very tight loops (see below for the
> full pybench result chart).

Now I know why you want opcode stats in pybench :-)

This looks like a promising approach. Is is possible to backport
this change to Python 2.x as well ?
msg78363 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-27 13:27
> This looks like a promising approach. Is is possible to backport
> this change to Python 2.x as well ?

Certainly.
msg78364 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-27 13:29
This new patch uses a statically initialized jump table rather than
specific initialization code.
msg78616 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-12-31 16:02
I'm having trouble understanding the technique of the jump table. Can
you provide some links to papers that explain the threaded code? I'm
interested in learning more.
How does your implementation compare to the GForth based threaded code
speedwise?
msg78620 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-31 16:15
> I'm having trouble understanding the technique of the jump table. Can
> you provide some links to papers that explain the threaded code? I'm
> interested in learning more.

I haven't read any papers. Having a jump table in itself isn't special
(the compiler does exactly that when compiling the switch() statement).
What's special is that a dedicated indirect jump instruction at the end
of each opcode helps the CPU make a separate prediction for which opcode
follows the other one, which is not possible with a switch statement
where the jump instruction is shared by all opcodes. I believe that's
where most of the speedup comes from.

If you read the patch it will probably be easy to understand.

I had the idea to try this after a thread on pypy-dev, there are more
references there:
http://codespeak.net/pipermail/pypy-dev/2008q4/004916.html

> How does your implementation compare to the GForth based threaded code
> speedwise?

Don't know! Your experiments are welcome. My patch is far simpler to
integrate though (it's small, introduces very few changes and does not
break any existing tests).
msg78628 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-12-31 16:37
> I haven't read any papers. Having a jump table in itself isn't special
> (the compiler does exactly that when compiling the switch() statement).
> What's special is that a dedicated indirect jump instruction at the end
> of each opcode helps the CPU make a separate prediction for which opcode
> follows the other one, which is not possible with a switch statement
> where the jump instruction is shared by all opcodes. I believe that's
> where most of the speedup comes from.
> 
> If you read the patch it will probably be easy to understand.

You are right. It's easier to understand after I've learned how the
opcode_targets table is working. Previously I didn't know that one can
store the address of a label in an array. Before I got it I wondered
where the pointers were defined. Is this a special GCC feature? I
haven't seen it before.

> Don't know! Your experiments are welcome. My patch is far simpler to
> integrate though (it's small, introduces very few changes and does not
> break any existing tests).

Yes, your patch is much smaller, less intrusive and easier to understand
with a little background in CS.
msg78629 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-31 16:40
> Is this a special GCC feature?

Yes, it is.
http://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html
msg78651 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-31 20:24
This new patch adds some detailed comments, at Jason Orendorff's request.
msg78653 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2008-12-31 21:20
You may want to check out issue1408710 in which a similar patch was
provided, but failed to deliver the desired results.

I didn't get the advertised ~15% speed-up, but only 4% on my Intel Core2
laptop and 8% on my AMD Athlon64 X2 desktop. I attached the benchmark
results.

The patch looks pretty clean. Here is a few things that caught my
attention while reading your patch.

First, you should rename opcode_targets.c to opcode_targets.h. This will
make it explicit that the file is not compiled, but just included.

Also, the macro USE_THREADED_CODE should be renamed to something else;
the word "thread" is too tightly associated with multi-threading.
Furthermore, threaded code simply refers to code consisting only of
function calls. Maybe, USE_COMPUTED_GOTO or USE_DIRECT_DISPATCH would be
better.

Finally, why do you disable your patch when DYNAMIC_EXECUTION_PROFILE or
LLTRACE is enabled? I tested your patch with both enabled and I didn't
see any test failures.

By the way, SUNCC also supports GCC's syntax for labels as values
(http://docs.sun.com/app/docs/doc/819-5265/bjabt?l=en&a=view).
msg78656 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2008-12-31 22:12
Works pretty well for me on my MacBook Pro, but on my G5 it performed
abysmally.  In fact, it ran so much worse that I cleaned up my sandbox 
and did both checks all over again to make sure I didn't mess something
up.  It looks like my MacBook Pro saw about a 7% improvement while my
G5 saw a 14% degradation.  Both computers are running Mac OS X 10.5.6
with the latest Xcode - 3.1.2.  On both computers gcc -v reports 4.0.1,
Apple build 5490.

If this is applied to the core I think it will have to select for more
than just gcc.  It will also have to select based on the instruction set
architecture.
msg78657 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-31 22:17
> Works pretty well for me on my MacBook Pro, but on my G5 it performed
> abysmally.  In fact, it ran so much worse that I cleaned up my sandbox 
> and did both checks all over again to make sure I didn't mess something
> up.  It looks like my MacBook Pro saw about a 14% degradation.  Both
> computers are running Mac OS X 10.5.6 with the latest Xcode - 3.1.2.
> On both computers gcc -v reports 4.0.1, Apple build 5490.

You're sure you didn't compile in debug mode or something? Just
checking.
msg78658 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2008-12-31 22:24
Antoine> You're sure you didn't compile in debug mode or something? Just
    Antoine> checking.

There was a cut-n-paste error in that one which I noticed right after
submitting (man, do I hate the crappy editing capability of <textarea>
widgets).  I removed it within a minute or two and replaced it with a
correct version.  Short answer: Intel thumbs up, PowerPC thumbs down.

Skip
msg78660 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-31 22:41
Hello,

> You may want to check out issue1408710 in which a similar patch was
> provided, but failed to deliver the desired results.
> 
> I didn't get the advertised ~15% speed-up, but only 4% on my Intel Core2
> laptop and 8% on my AMD Athlon64 X2 desktop. I attached the benchmark
> results.

Thanks. The machine I got the 15% speedup on is in 64-bit mode with gcc
4.3.2.

If you want to investigate, you can output the assembler code for
ceval.c; the command-line should be something like:

gcc -pthread -c -fno-strict-aliasing -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes  -I. -IInclude -I./Include   -DPy_BUILD_CORE -S -dA Python/ceval.c

and then count the number of indirect jump instructions in ceval.c:

grep -E "jmp[[:space:]]\*%" ceval.s

There should be 85 to 90 of them, roughly. If there are many less, then
the compiler has tried to optimize them by "sharing" them.

> First, you should rename opcode_targets.c to opcode_targets.h. This will
> make it explicit that the file is not compiled, but just included.

Ok.

> Also, the macro USE_THREADED_CODE should be renamed to something else;
> the word "thread" is too tightly associated with multi-threading.
> Furthermore, threaded code simply refers to code consisting only of
> function calls. Maybe, USE_COMPUTED_GOTO or USE_DIRECT_DISPATCH would be
> better.

Ok.

> Finally, why do you disable your patch when DYNAMIC_EXECUTION_PROFILE or
> LLTRACE is enabled? I tested your patch with both enabled and I didn't
> see any test failures.

Because otherwise the measurements these options are meant to do would
be meaningless.

> By the way, SUNCC also supports GCC's syntax for labels as values

I don't have a Sun machine to test, so I'll leave to someone else to
check and enable if they want to.
msg78663 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-31 22:53
Attached new patch for fixes suggested by Alexandre (rename
opcode_targets.c to opcode_targets.h, replace USE_THREADED_CODE with
USE_COMPUTED_GOTOS).
msg78685 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-01 03:44
Mentioning other versions as well.
The patch is so easy that it can be backported to all supported
versions, so I'm adding all of them (2.5 is in bugfix-only mode, and as
far as I can see this patch cannot be accepted there, sadly).
msg78686 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2009-01-01 04:10
Paolo> (2.5 is in bugfix-only mode, and as far as I can see this patch
    Paolo> cannot be accepted there, sadly).

You could backport it to 2.4 & 2.5 and just put it up on PyPI...
msg78687 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-01 04:28
Some other comments.
The time saving of indirect threading are also associated with the
removal of the range check, but better branch prediction is the main
advantage.

> Also, the macro USE_THREADED_CODE should be renamed to something else;
> the word "thread" is too tightly associated with multi-threading.

That's true.

> Furthermore, threaded code simply refers to code consisting only of
> function calls. Maybe, USE_COMPUTED_GOTO or USE_DIRECT_DISPATCH would be
> better.

I'd prefer USE_DIRECT_DISPATCH (or better, USE_THREADED_DISPATCH) rather
than USE_COMPUTED_GOTO, since the latter is just the used language
construct.

"indirect threading" is the standard name in CS literature to define
this technique. "Direct threading" is a variation where in the bytecode
array, opcode is replaced by the pointer opcode_handler[opcode], so that
dispatch is slightly faster. Most interpreters use indirect threading to
save memory, and because it enables to switch the opcode handlers table
to activate for instance debugging.

The best paper about this is:
"The Structure and Performance of Efficient Interpreters, M. Anton Ertl
and David Gregg, 2003".

The original paper about (direct) threaded code is this:
"Threaded Code, James R. Bell, Comm. of ACM, 1973", but I don't feel it
relevant at all today.
Indirect threading was introduced in "Indirect Threaded Code, Robert B.
K. Dewar, Communications of the ACM, 1975" (that's just a bit more
relevant, but still).
msg78688 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-01 04:36
> You may want to check out issue1408710 in which a similar patch was
> provided, but failed to deliver the desired results.
It's not really similar, because you don't duplicate the dispatch code.
It took me some time to understand why you didn't change the "goto
fast_next_opcode", but that's where you miss the speedup.

The only difference with your change is that you save the range check
for the switch, so the slowdown probably comes from some minor output
change from GCC I guess.

Anyway, this suggests that the speedup really comes from better branch
prediction and not from saving the range check. The 1st paper I
mentioned simply states that saving the range check might make a small
differences. The point is that sometimes, when you are going to flush
the pipeline, it's like adding a few instructions, even conditional
jumps, does not make a difference. I've observed this behaviour quite a
few times while building from scratch a small Python interpreter.

I guess (but this might be wrong) that's because the execution units
were not used at their fullest, and adding conditional jumps doesn't
make a differeence because flushing a pipeline once or twice is almost
the same (the second flush removes just few instructions). Or something
like that, I'm not expert enough of CPU architecture to be sure of such
guesses.
msg78689 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-01 05:03
Topics
1) About different speedups on 32bits vs 64 bits
2) About PPC slowdown
3) PyPI

======= About different speedups on 32bits vs 64 bits =======
An interpreter is very register-hungry, so on x86_64 it spends much less
time on register spill (i.e. moving locals from/to memory), so
instruction dispatch takes a bigger share of execution time. If the rest
of the interpreter is too slow, indirect threading gives no advantage.

Look at the amount of register variables in PyEval_EvalFrameEx() (btw,
do you support any compiler where nowadays 'register' still makes a
difference? That's quite weird). Lots of them are simply cached copies
of fields of the current frame and of the current function; without
copying them to locals, the compiler should assume they could change at
any function call.

In fact, adding locals this way gave huge speedups on tight loops on the
Python interpreter I built with a colleague for our student project, to
experiment with speeding up Python.

And adding a write to memory in the dispatch code (to f->last_i) gave a
20% slowdown. Since my interpreter uses a copying garbage collector and
CPython uses reference counting, which is much slower (if you don't know
this, show me a fast JVM with reference counting), I'm even surprised
you can get such a big speedup from threading.

======= About PPC slowdown =======
Somebody should try the patch on Pentium4 as well.

During our VM course, threading slowed down a toy interpreter with 4 toy
opcodes. Our teachers suggested that with such a small interpreter,
since threaded code takes more space (in that case, ~64 vs ~100 bytes),
this could give problems with code caches, but suggested checking that
idea using performance counters. I'm not sure about why.

I don't have right now neither a Pentium4 nor a PowerPC available, so I
can't check myself. But this is the best way to analyze the performance
unexpected behaviour.

======= PyPI =======
Paolo> (2.5 is in bugfix-only mode, and as far as I can see this patch
Paolo> cannot be accepted there, sadly).

Skip> You could backport it to 2.4 & 2.5 and just put it up on PyPI...
I was thinking to a private backport as well.
I didn't know about PyPI, it looks like PyPI is more for contributed
modules than for this, would that work?
msg78691 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-01 05:15
== On the patch itself ==
Why don't you use the C preprocessor instead of that Python code?
Sample code:

#define OPCODE_LIST(DEFINE_OPCODE) \
    DEFINE_OPCODE(STOP_CODE, 0)                 \
    DEFINE_OPCODE(POP_TOP, 1)                   \
    DEFINE_OPCODE(ROT_TWO, 2)                   \
    DEFINE_OPCODE(ROT_THREE, 3)                 \
    DEFINE_OPCODE(DUP_TOP, 4)                   \
    DEFINE_OPCODE(ROT_FOUR, 5)                  \
    DEFINE_OPCODE(NOP, 9)                       \
    ...

# define DECL_OPCODE(opcode)                    \
        [opcode] = && label_ ## opcode,

        void *opcodes[] = {
                OPCODE_LIST(DECL_OPCODE)
        };
# undef DECL_OPCODE

There are also other ways to do it, but using higher-order functions
within the preprocessor in this way is something that I learned from the
V8 source code.
It has the advantage that OPCODE_LIST can be used in a lot of other
places (maybe when implementing the 'opcode' module, if it's written in C).
msg78708 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-01 13:06
Hi,

> Why don't you use the C preprocessor instead of that Python code?
> Sample code:

We would have to change opcode.h for this to be truely useful (in order
to re-use OPCODE_LIST()). I think that should be the subject of a
separate bug entry for code reorganization.

Thanks for all the explanation and pointers! About register allocation,
I wonder if the "temporary variables" u,v,w could be declared separately
in each opcode rather than at the top of the eval function, so that the
compiler doesn't try to store their values between two opcodes.

As for the "register" declarations, I think they're just remains of the
past.

Antoine.
msg78711 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2009-01-01 14:00
Skip> You could backport it to 2.4 & 2.5 and just put it up on PyPI...

    Paolo> I was thinking to a private backport as well.  I didn't know
    Paolo> about PyPI, it looks like PyPI is more for contributed modules
    Paolo> than for this, would that work?

I don't see why it wouldn't.

Skip
msg78722 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-01 19:43
> We would have to change opcode.h for this to be truely useful (in
order to re-use OPCODE_LIST()).

Yep.

> I think that should be the subject of a separate bug entry for code
reorganization.

Agreed, I'll maybe try to find time for it.

> Thanks for all the explanation and pointers!
You're welcome, thanks to you for writing the patch! And 
> About register allocation, I wonder if the "temporary variables" u,v,w
could be declared separately in each opcode rather than at the top of
the eval function, so that the compiler doesn't try to store their
values between two opcodes.

I didn't look at the output, but that shouldn't be needed with decent
optimizers, since they are local variables, so the compiler has a full
picture of their usage (this does not happen with the content of the
heap, where the frame object may lie).

I think that change would just save some compilation time for dataflow
analysis, maybe :-). Or could make clearer which variables is used
where, but that is a matter of taste (what's there is fine for me).

I just studied liveness analysis in compilers, and it computes whether a
variable is live before and after each statement; if the value of a
variable is not used in some piece of code until the next write to the
variable, it is considered dead in that piece of code, and that variable
does not take space; since u, v, w are either unused or are written to
before usage in all opcodes, they're dead at the beginning of each
opcode, so they're also dead just before dispatch.


The only important thing is that the content of the jump table are known
to the compiler and that the compiler makes use of that. Simply passing
a non-const jump table to some function defined elsewhere (which could
in theory modify it) would make the output much worse.
msg78730 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2009-01-01 21:58
Antoine Pitrou wrote:
> [...] count the number of indirect jump instructions in ceval.c:
> 
> grep -E "jmp[[:space:]]\*%" ceval.s
>
> There should be 85 to 90 of them, roughly. If there are many less, then
> the compiler has tried to optimize them by "sharing" them.

I get 86 with GCC 4.x and SUNCC. However, with GCC 3.4 I only get a
single computed goto. Is there some hidden option to make GCC avoid
sharing jumps? 

> Because otherwise the measurements these options are meant to do would
> be meaningless.

Ah, I see now. Maybe you should add a quick comment that mentions this. 

> I don't have a Sun machine to test, so I'll leave to someone else to
> check and enable if they want to.

I tested it and it worked, no test failures to report. Just change the
macro test: 

#ifdef __GNUC__ && \
...

to

#ifdef (__GNUC__ || __SUNPRO_C) && \
...


I attached some additional benchmarks on SunOS. So far, it seems the
benefits of the proposed optimization are highly compiler-dependent.
msg78731 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2009-01-01 22:00
> Attached new patch for fixes suggested by Alexandre (rename
> opcode_targets.c to opcode_targets.h, replace USE_THREADED_CODE with
> USE_COMPUTED_GOTOS).

You forgot to update your script to use the new name.
msg78735 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-01 22:48
> I get 86 with GCC 4.x and SUNCC. However, with GCC 3.4 I only get a
> single computed goto. Is there some hidden option to make GCC avoid
> sharing jumps? 

Try -fno-crossjumping.

> I tested it and it worked, no test failures to report. Just change the
> macro test: 
> 
> #ifdef __GNUC__ && \
> ...
> 
> to
> 
> #ifdef (__GNUC__ || __SUNPRO_C) && \
> ...

Thanks.

> You forgot to update your script to use the new name.

Ah, that's rather dumb :)
msg78736 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-01 22:59
I've updated the comments as per Alexandre's request, added support for
SUN CC, and fixed the generation script to use the new filename.
msg78748 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-02 02:53
> I attached some additional benchmarks on SunOS. So far, it seems the
benefits of the proposed optimization are highly compiler-dependent.

Well, it would be more correct to say that as you verified for GCC 3.4,
"miscompilation" of the code happens easily.

Any literature research shows that threading in a fast interpreter does
help. My experience shows two exceptions to this rule:
a) bad compiler output
b) interpreters which are not efficient enough - when other operations
are even slower than instruction dispatch (which is really slow due to
costly mispredictions), threading can't help.

This is shown by the number of interpreters using threading.

Wikipedia has more pointers on this:
http://en.wikipedia.org/wiki/Threaded_code
Note that what I called "indirect threading" is called there instead
"token threading".

Another example of the importance of threading is also shown in this
article:
http://webkit.org/blog/189/announcing-squirrelfish/

Some clues about why Python does not use threading:

http://mail.python.org/pipermail/python-list/1999-May/002003.html

It is important to note that people in that mail are not aware of why
threading gives a speedup.

==
For SunCC, I can't say anything without looking at:
a) the generated code; if jump targets were aligned only for switch but
not for computed gotos, for instance, that could maybe explain such a
slowdown. Lots of other details might be relevant.
b) performance counters results, especially regarding mispredictions of
indirect branches.
msg78808 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-01-02 14:57
On 2009-01-01 23:59, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
> I've updated the comments as per Alexandre's request, added support for
> SUN CC, and fixed the generation script to use the new filename.

Since the patch heavily relies on the compiler doing the right
thing (which esp. GCC often doesn't, unfortunately), I think that the
opcode dispatch code should only be enabled via a configure option.

This is safer than enabling the support unconditionally for GCC and
the SUN Pro C compiler, since it is rather likely that some GCC versions
have bugs which could render Python unusable if compiled with the
dispatching support enabled.

A configure option also has the additional benefit that you can enable
the support for compilers which support the syntax, but are not included
in the hard-coded list of compilers included in the patch.
msg78828 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-02 16:10
> This is safer than enabling the support unconditionally for GCC and
> the SUN Pro C compiler, since it is rather likely that some GCC versions
> have bugs which could render Python unusable if compiled with the
> dispatching support enabled.

What do you mean, "unusable"? 10% slower? Well, Python 3.x is already
unusable (compared to 2.x) by that metric... Until now, only Skip has
reported a slowdown on his PPC machine, while x86 systems have all seen
a positive (if tiny, sometimes) improvement.

I fear that with a configure option, disabled by default, the code will
get very poor testing and perhaps get broken in some subtle way without
anyone noticing.
msg78831 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2009-01-02 16:17
Antoine> I fear that with a configure option, disabled by default, the
    Antoine> code will get very poor testing and perhaps get broken in some
    Antoine> subtle way without anyone noticing.

That can be fixed by enabling that option on the buildbots where it is
presumed to help.  I see more slowdown on PPC than people are reporting as
speedups on Intel.  Is there something I can do to help debug the problem?
It doesn't appear the Apple version of gcc supports the -fno-crossjumping
flag.  If I dump the assembler code for ceval.c will that help others debug
the problem?

Skip
msg78834 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-02 16:28
> If I dump the assembler code for ceval.c will that help others debug
> the problem?

Well, I'm no PPC expert but it can be useful. Can you dump it with "-S
-dA"?

(also, can you try the latest patch? I've made some tiny adjustement in
the opcode epilogues, I don't think it will make a difference but who
knows)
msg78837 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-01-02 16:40
On 2009-01-02 17:10, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
>> This is safer than enabling the support unconditionally for GCC and
>> the SUN Pro C compiler, since it is rather likely that some GCC versions
>> have bugs which could render Python unusable if compiled with the
>> dispatching support enabled.
> 
> What do you mean, "unusable"? 

Well, not working. GCC versions often have optimizer bugs (esp. the
3.x series and early 4.x versions) and I wouldn't bet on them always
getting the dispatch optimizations right.

Trying to compile Python with an unconditionally enabled dispatch
patch on such a system would render Python unusable.

> 10% slower? Well, Python 3.x is already
> unusable (compared to 2.x) by that metric... Until now, only Skip has
> reported a slowdown on his PPC machine, while x86 systems have all seen
> a positive (if tiny, sometimes) improvement.
> 
> I fear that with a configure option, disabled by default, the code will
> get very poor testing and perhaps get broken in some subtle way without
> anyone noticing.

Like Skip said: the buildbots could take care of identifying such
problems.

People using the option would certainly report problems as well and
I'm sure that Linux distributions would compile Python with the switch
after verifying that their GCC version works correctly.
msg78871 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2009-01-02 20:26
OK, I think I'm misreading the output of pybench.  Let me reset.  Ignore
anything I've written previously on this topic.  Instead, I will just
post the output of my pybench comparison runs and let more expert people
interpret as appropriate.  The first file is the result of the run on
PowerPC (Mac G5).
msg78872 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2009-01-02 20:27
The next is the result of running on my MacBook Pro (Intel Core 2 Duo).
msg78898 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-02 22:57
> OK, I think I'm misreading the output of pybench.  Let me reset.  Ignore
> anything I've written previously on this topic.  Instead, I will just
> post the output of my pybench comparison runs and let more expert people
> interpret as appropriate.  The first file is the result of the run on
> PowerPC (Mac G5).

Ok, so the threaded version is actually faster by 20% on your PPC, and
slower by 5% on your Core 2 Duo. Thanks for doing the measurements!
msg78899 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2009-01-02 23:03
Antoine> Ok, so the threaded version is actually faster by 20% on your
    Antoine> PPC, and slower by 5% on your Core 2 Duo. Thanks for doing the
    Antoine> measurements!

Confirmed by pystone runs as well.  Sorry for the earlier misdirection.

Skip
msg78910 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2009-01-03 00:45
The patch make a huge difference on 64-bit Linux. I get a 20% speed-up
and the lowest run time so far. That is quite impressive!

At first glance, it seems the extra registers of the x86-64 architecture
permit GCC to avoid spilling registers onto the stack (see assembly just
below). However, I don't know why the speed up due to the patch is much
more significant on x86-64 than on x86.

This is the x86 assembly generated by GCC 4.3 (annotated and
slightly edited for readability):

    movl    -440(%ebp), %eax  # tmp = next_instr
    movl    $145, %esi        # opcode = LIST_APPEND
    movl    8(%ebp), %ecx     # f
    subl    -408(%ebp), %eax  # tmp -= first_instr
    movl    %eax, 60(%ecx)    # f->f_lasti = tmp
    movl    -440(%ebp), %ebx  # next_instr
    movzbl  (%ebx), %eax      # tmp = *next_instr
    addl    $1, %ebx          # next_instr++
    movl    %ebx, -440(%ebp)  # next_instr
    movl    opcode_targets(,%eax,4), %eax  # tmp = opcode_targets[tmp]
    jmp     *%eax             # goto *tmp


And this is the x86-64 assembly generated also by GCC 4.3:

    movl    %r15d, %eax      # tmp = next_instr
    subl    76(%rsp), %eax   # tmp -= first_instr
    movl    $145, %ebp       # opcode = LIST_APPEND
    movl    %eax, 120(%r14)  # f->f_lasti = tmp
    movzbl  (%r15), %eax     # tmp = *next_instr
    addq    $1, %r15         # next_instr++
    movq    opcode_targets(,%rax,8), %rax  # tmp = opcode_targets[tmp]
    jmp     *%rax            # goto *tmp


The above assemblies are equivalent to the following C code:

    opcode = LIST_APPEND;
    f->f_lasti = ((int)(next_instr - first_instr));
    goto *opcode_targets[*next_instr++];

On the register-starved x86 architecture, the assembly has 4 stack load
and 1 store operations. While on the x86-64 architecture, most variables
are kept in registers thus it only uses 1 stack store operation. And
from what I saw from the assemblies, the extra registers with the
traditional switch dispatch aren't much used, especially with the opcode
prediction macros which avoid manipulations of f->f_lasti.

That said, I am glad to hear the patch makes Python on PowerPC faster,
because this supports the hypothesis that extra registers are better
used with indirect threading (PowerPC has 32 general-purpose registers).
msg78911 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2009-01-03 00:48
One more thing, the patch causes the following warnings to be emited by
GCC when USE_COMPUTED_GOTOS is undefined.  

Python/ceval.c: In function ‘PyEval_EvalFrameEx’:
Python/ceval.c:2420: warning: label ‘_make_function’ defined but not used
Python/ceval.c:2374: warning: label ‘_call_function_var_kw’ defined but
not used
Python/ceval.c:2280: warning: label ‘_setup_finally’ defined but not used
msg78915 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2009-01-03 00:58
Alexandre's last comment reminded me I forgot to post the PPC assembler
code.  Next two files are the output as requested by Antoine.
msg78921 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2009-01-03 01:37
> Alexandre Vassalotti <alexandre@peadrop.com> added the comment:
> The patch make a huge difference on 64-bit Linux. I get a 20% speed-up
> and the lowest run time so far. That is quite impressive!

I'm really, REALLY impressed by the speed up. Good work!

I'm not an expert in this kind of optimizations. Could we gain more
speed by making the dispatcher table more dense? Python has less than
128 opcodes (len(opcode.opmap) == 113) so they can be squeezed in a
smaller table. I naively assume a smaller table increases the amount of
cache hits.

Christian
msg78923 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-03 01:51
About miscompilations: the current patch is a bit weird for GCC, because
you keep both the switch and the computed goto.

But actually, there is no case in which the switch is needed, and
computed goto give less room to GCC's choices.

So, can you try dropping the switch altogether, using always computed
goto and seeing how does the resulting code get compiled? I see you'll
need two labels (before and after argument fetch) per opcode and two
dispatch tabels, but that's no big deal (except for code alignment -
align just the common branch target).

An important warning is that by default, on my system, GCC 4.2 aligns
branch targets for switch to a 16-byte boundary (as recommended by the
Intel optimization guide), by adding a ".p2align 4,,7" GAS directive,
and it does not do that for computed goto.

Adding the directive by hand gave a small speedup, 2% I think; I should
try -falign-jumps=16 if it's not enabled (some -falign-jumps is enabled
by -O2), since that is supposed to give the same result.

Please use that yourself as well, and verify it works for labels, even
if I fear it doesn't.

> However, I don't know why the speed up due to the patch is much
more significant on x86-64 than on x86.

It's Amdahl's law, even if this is not about parallel code. When the
rest is faster (x86_64), the same speedup on dispatch gives a bigger
overall speedup.

To be absolutely clear: x86_64 has more registers, so the rest of the
interpreter is faster than x86, but dispatch still takes the same
absolute time, which is 70% on x86_64, but only 50% on x86 (those are
realistic figures); if this patch halved dispatch time on both (we're
not so lucky), we would save 35% on x86_64 but only 25% on x86.
In fact, on inefficient interpreters, indirect threading is useless
altogether.

So, do those extra register help _so_ much? Yes. In my toy interpreter,
computing last_i for each dispatch doesn't give any big slowdown, but
storing it in f->last_i gives a ~20% slowdown - I cross-checked multiple
times because I was astonished. Conversely, when the program counter had
to be stored in memory, I think it was like 2x slower.
msg78925 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-03 02:20
> I'm not an expert in this kind of optimizations. Could we gain more
speed by making the dispatcher table more dense? Python has less than
128 opcodes (len(opcode.opmap) == 113) so they can be squeezed in a
smaller table. I naively assume a smaller table increases the amount of
cache hits.

Well, you have no binary compatibility constraint with a new release, so
it can be tried and benchmarked, or it can be done anyway!
On x86_64 the impact of the jump table is 8 bytes per pointer * 256
pointers = 2KiB, and the L1 data cache of Pentium4 can be 8KiB or 16KiB
wide.
But I don't expect this to be noticeable in most synthetic
microbenchmarks. Matrix multiplication would be the perfect one I guess;
the repeated column access would kill the L1 data cache, if the whole
matrixes don't fit.
msg78956 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-03 13:22
> I'm not an expert in this kind of optimizations. Could we gain more
> speed by making the dispatcher table more dense? Python has less than
> 128 opcodes (len(opcode.opmap) == 113) so they can be squeezed in a
> smaller table. I naively assume a smaller table increases the amount of
> cache hits.

I don't think so. The end of the current table, which doesn't correspond
to any valid opcodes, will not be cached anyway. The upper limit to be
considered is the maximum value for a valid opcode, which is 147.
Reducing that to 113 may reduce cache pressure, but only by a tiny bit.

Of course, only experimentation could tell us for sure :)
msg79032 - (view) Author: Daniel Diniz (ajaksu2) Date: 2009-01-04 02:00
IIUC, this is what gcc 4.2.4 generates on a Celeron M for the code
Alexandre posted:
        movl    -272(%ebp), %eax
        movl    8(%ebp), %edx
        subl    -228(%ebp), %eax
        movl    %eax, 60(%edx)
        movl    -272(%ebp), %ecx
        movzbl  (%ecx), %eax
-
        addl    $1, %ecx
        movl    %ecx, -272(%ebp)
        movl    opcode_targets.9311(,%eax,4), %ecx
        movl    %eax, %ebx
-
        jmp     *%ecx


And this is what ICC 11.0 generates for (what I think is) the same bit:
        movl      360(%esp), %ecx
        movl      %esi, %edi
        subl      304(%esp), %edi
        movl      %edi, 60(%ecx)
        movzbl    (%esi), %ecx 
-
        movl      opcode_targets.2239.0.34(,%ecx,4), %eax
        incl      %esi
-
        jmp       ..B12.137     # Prob 100%
        # ..B12.137: jmp       *%eax  


Does this mean that icc handles register starvation better here?

FWIW, on this CPU, compiling with icc gives a 20% speed boost in pybench
regardless of this patch.
msg79033 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-04 02:28
1st note: is that code from the threaded version? Note that you need to
modify the source to make it accept also ICC to try that.
In case you already did that, I guess the patch is not useful at all
with ICC since, as far as I can see, the jump is shared. It is vital to
this patch that the jump is not shared, something similar to
-fno-crossjumping should be found.

2nd note: the answer to your questions seems yes, ICC has less register
spills. Look for instance at:
       movl    -272(%ebp), %ecx
       movzbl  (%ecx), %eax
       addl    $1, %ecx

and
       movzbl    (%esi), %ecx
       incl      %esi

This represents the increment of the program counter after loading the
next opcode. In the code you posted, one can see that the program
counter is spilled to memory by GCC, but isn't by ICC. Either the spill
is elsewhere, or ICC is better here. And it's widely known that ICC has
a much better optimizer in many cases, and I remember that GCC register
allocator really needs improvement.

Finally, I'm a bit surprised by "addl $1, %ecx", since any peephole
optimizer should remove that; I'm not shocked just because I've never
seen perfect GCC output.
msg79034 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-04 02:29
Daniel, I forgot to ask for the compilation command line you used, since
they make a lot of difference. Can you post them? Thanks
msg79035 - (view) Author: Daniel Diniz (ajaksu2) Date: 2009-01-04 03:07
Paolo 'Blaisorblade' Giarrusso  wrote:
>
> 1st note: is that code from the threaded version? [...] It is vital to
> this patch that the jump is not shared, something similar to
> -fno-crossjumping should be found.

Yes, threaded version by unconditionally defining USE_THREADED_CODE
(old patch version :).

Ok,  I'll try to find a way to get at -fno-crossjumping behavior. BTW,
man gcc suggests using -fno-gcse for programs that use computed gotos
(couldn't fin

[...]
>  In the code you posted, one can see that the program
> counter is spilled to memory by GCC, but isn't by ICC. Either the spill
> is elsewhere, or ICC is better here.
I can [mail you|attach here] icc's output if you want to check the
overall code, it's about 1.9M with the code annotations.

> Finally, I'm a bit surprised by "addl $1, %ecx", since any peephole
> optimizer should remove that; I'm not shocked just because I've never
> seen perfect GCC output.

I'm glad to see the same issue in Alexandre's output, not my fault then :D

The command line I used (after a clean build with gcc) was:
icc  -pthread -c -fno-strict-aliasing -DNDEBUG -g -fwrapv -O3 -w  -I.
-IInclude -I./Include   -DPy_BUILD_CORE  -S -fcode-asm -fsource-asm
-dA Python/ceval.c

(same as with gcc, except for warnings and -fcode-asm -fsource-asm).
msg79037 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2009-01-04 04:51
Paolo wrote:
> So, can you try dropping the switch altogether, using always computed
> goto and seeing how does the resulting code get compiled?

Removing the switch won't be possible unless we change the semantic
EXTENDED_ARG. In addition, I doubt the improvement, if any, would worth
the increased complexity.

> To be absolutely clear: x86_64 has more registers, so the rest of the
> interpreter is faster than x86, but dispatch still takes the same
> absolute time, which is 70% on x86_64, but only 50% on x86 (those are
> realistic figures);

I don't understand what you mean by "absolute time" here. Do you
actually mean the time spent interpreting bytecodes compared to the time
spent in the other parts of Python? If so, your figures are wrong for
CPython on x86-64. It is about 50% just like on x86 (when running
pybench). With the patch, this drops to 35% on x86-64 and to 45% on x86.

> In my toy interpreter, computing last_i for each dispatch doesn't give
> any big slowdown, but storing it in f->last_i gives a ~20% slowdown.

I patched ceval.c to minimize f->last_i manipulations in the dispatch
code.  On x86, I got an extra 9% speed up on pybench. However, the patch
is a bit clumsy and a few unit tests are failing. I will see if I can
improve it and open a new issue if worthwhile.
msg79051 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2009-01-04 13:14
I'm sure this is the wrong place to bring this up, but I had a
thought about simple JIT compilation coupled with the opcode
dispatch changes in this issue.

Consider this silly function:

    >>> def f(a, b):
    ...   result = 0
    ...   while b:
    ...     result += a
    ...     b -= 1
    ...   return result
    ... 

which compiles to

      2           0 LOAD_CONST               1 (0)
		  3 STORE_FAST               2 (result)

      3           6 SETUP_LOOP              32 (to 41)
	    >>    9 LOAD_FAST                1 (b)
		 12 JUMP_IF_FALSE           24 (to 39)
		 15 POP_TOP             

      4          16 LOAD_FAST                2 (result)
		 19 LOAD_FAST                0 (a)
		 22 INPLACE_ADD         
		 23 STORE_FAST               2 (result)

      5          26 LOAD_FAST                1 (b)
		 29 LOAD_CONST               2 (1)
		 32 INPLACE_SUBTRACT    
		 33 STORE_FAST               1 (b)
		 36 JUMP_ABSOLUTE            9
	    >>   39 POP_TOP             
		 40 POP_BLOCK           

      6     >>   41 LOAD_FAST                2 (result)
		 44 RETURN_VALUE        

What if you built and compiled a "Mini Me" version of
PyEval_EvalFrameEx on-the-fly which only contained the prologue and
epilog of the real function and a small switch statement which only
knew about the the byte-code instructions used by f()?  Would the
compiler be better able to optimize the code?  Would the
instructions' placement nearer to each other provide better cache
behavior?  Would branch prediction by CPU be improved?

Another possibility would be to eliminate the for(;;) ... switch
altogether and just inline the code for the individual instructions.
It would help if the body of each bytecode instruction was
implemented as a macro, e.g.:

    #define _POP_TOP() \
	PREDICTED(POP_TOP); \
	TARGET(POP_TOP) \
	v = POP(); \
	Py_DECREF(v); \
	FAST_DISPATCH();

The above function could (lots of hand-waving here) be "compiled" to
something like

    PyObject *
    _MiniMe(PyFrameObject *f, int throwflag)
    {
	_PyEVAL_FRAMEEX_PROLOG

	_LOAD_CONST(1)
	_STORE_FAST(2)
	_SETUP_LOOP(_41)
    _9:
	_LOAD_FAST(1)
	_JUMP_IF_FALSE(_39)
	_POP_TOP()
	_LOAD_FAST(2)
	_LOAD_FAST(0)
	_INPLACE_ADD()
	_STORE_FAST(2)
    _26:
	_LOAD_FAST(1)
	_LOAD_CONST(2)
	_INPLACE_SUBTRACT()
	_STORE_FAST(1)
	_JUMP_ABSOLUTE(_9)
    _39:
	_POP_TOP()
	_POP_BLOCK()
	_LOAD_FAST(2)
	_RETURN_VALUE()

	_PyEVAL_FRAMEEX_EPILOG
    }

and the resulting binary code saved as an attribute of the code
object.  Presumably there would be some decision made about whether
to compile a function into this form (maybe only after it's been
called N times?).
msg79053 - (view) Author: Ralph Corderoy (ralph.corderoy) Date: 2009-01-04 13:21
Regarding compressing the opcode table to make better use of cache; 
what if the most frequently occurring opcodes where placed together,
e.g. the opcodes were ordered by frequency, most frequent first.  Just
based on a one-off static analysis of a body of code.  A level one cache
line can be, what, 64 bytes == 16 32-bit pointers.
msg79056 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-04 13:40
@Alexandre:
> > So, can you try dropping the switch altogether, using always computed
> > goto and seeing how does the resulting code get compiled?

> Removing the switch won't be possible unless we change the semantic
> EXTENDED_ARG. In addition, I doubt the improvement, if any, would worth
> the increased complexity.
OK, it's time that I post code to experiment with that - there is no
need to break EXTENDED_ARG. And the point is to fight miscompilations.

> Do you actually mean the time spent interpreting bytecodes compared to
the time spent in the other parts of Python? If so, your figures are
wrong for CPython on x86-64. It is about 50% just like on x86 (when
running pybench). With the patch, this drops to 35% on x86-64 and to 45%
on x86.

More or less, I mean that, but I was making an example, and I made up
reasonable figures.
70%, or even more, just for _dispatch_ (i.e. just for the mispredicted
indirect jump), is valid for real-world Smalltalk interpreters for
instance, or for the ones in "The Structure and Performance of Efficient
Interpreters".
But, when you say "intepreting opcodes", I do not know which part you
refer to, if just the computed goto or for the whole code in the
interpreter function.
msg79058 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-04 14:38
@Skip: if one decides to generate binary code, there is no need to use
switches. Inline threading (also known as "code copying" in some
research papers) is what you are probably looking for:

http://blog.mozilla.com/dmandelin/2008/08/27/inline-threading-tracemonkey-etc/

For references and background on threading techniques mentioned there, see:

http://en.wikipedia.org/wiki/Threaded_code
http://www.complang.tuwien.ac.at/forth/threaded-code.html
msg79077 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2009-01-04 18:52
> Removing the switch won't be possible unless we change the semantic
> EXTENDED_ARG. In addition, I doubt the improvement, if any, would worth
> the increased complexity.

Nevermind what I have said. I managed to remove switch pretty easily by
moving opcode fetching in the FAST_DISPATCH macro and abstracting the
control flow of the switch. There is no speed difference on pybench on
x86; on x86-64, the code is slower due to the opcode fetching change.

> I patched ceval.c to minimize f->last_i manipulations in the dispatch
> code.  On x86, I got an extra 9% speed up on pybench. However, the
> patch is a bit clumsy and a few unit tests are failing. I will see
> if I can improve it and open a new issue if worthwhile.

Nevermind that too. I found out f->last_i can be accessed anytime via
frame.getlineno(). So, you cannot really change how f->last_i is used
like I did.
msg79080 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2009-01-04 19:29
> I managed to remove switch pretty easily by moving opcode fetching
> in the FAST_DISPATCH macro and abstracting the control flow of the
> switch.

Here is the diff against threadceval5.patch.
msg79105 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-05 02:39
@alexandre: if you add two labels per opcode and two dispatch tables,
one before (like now) and one after the parameter fetch (where we have
the 'case'), you can keep the same speed.
And under the hood we also had two dispatch tables before, with the
switch, so it's not a big deal; finally, the second table is only used
in the slow path (i.e. EXTENDED_ARG, or when additional checks are needed).

About f->last_i, when I have time I want to try optimizing it. Somewhere
you can be sure it's not going to be used.
But you have some changes about that in the abstract-switch patch, I
think that was not intended?
msg79123 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-05 10:37
Le lundi 05 janvier 2009 à 02:39 +0000, Paolo 'Blaisorblade' Giarrusso a
écrit :
> About f->last_i, when I have time I want to try optimizing it. Somewhere
> you can be sure it's not going to be used.

There are lots of places which can call into arbitrary Python code. A
few opcodes could be optimized for sure.
msg79264 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-06 14:30
FWIW, I have made a quick attempt at removing the f->f_lasti assignment
in the few places where it could be removed, but it didn't make a
difference on my machine. The problem being that there are very few
places where it is legitimate to remove the assignment (even a call to
Py_DECREF can invoke arbitrary external code through destructors, so
even a very simple opcode such as POP_TOP needs the f->f_lasti assignment).

Another approach would be to create agregate opcodes, e.g.
BINARY_ADD_FAST would combine LOAD_FAST and BINARY_ADD. That's
orthogonal to the current issue though.
msg79302 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-07 07:35
@pitrou:
<ranting mode>
Argh, reference counting hinders even that?
</ranting mode>
I just discovered another problem caused by refcounting.

Various techniques allow to create binary code from the interpreter
binary, by just pasting together the code for the common interpreters
cases and producing calls to the other. But, guess what, on most
platforms (except plain x86, but including x86_64 and maybe x86 for the
shared library case) this does not work if the copied code includes
function calls (on x86_64 that's due to RIP-relative addressing, and on
similar issues on other platforms).

So, Python could not even benefit from that! That's a real pity...
I'll have to try with subroutine threading, to see if that's faster than
indirect threading on current platforms or if it is still slower.
msg79321 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-07 11:09
I finally implemented my suggestion for the switch elimination.
On top of threadedceval5.patch, apply abstract-switch-reduced.diff and
then restore-old-oparg-load.diff to test it.

This way, only computed goto's are used. I would like who had
miscompilation problems, or didn't get advantage from the patch, to try
compiling and benchmarking this version.

I've also been able to reenable static prediction (PREDICT_*) on top of
computed gotos, and that may help CPU prediction even more (the BTB for
the computed goto will be used to predict the 2nd most frequent target);
obviously it may instead cause a slowdown, I'd need stats on opcode
frequency to try guessing in advance (I'll try gathering them later
through DYNAMIC_EXECUTION_PROFILE).

Apply reenable-static-prediction.diff on top of the rest to get this.

I'll have to finish other stuff before closing everything to run
pybench, I can't get stable timings otherwise, so it'll take some time
(too busy, sorry). However I ran the check for regressions and they show
none.

====
abstract-switch-reduced.diff is the fixed abstract-switch.diff -
actually there was just one hunk which changed the handling of f_lasti,
and that looked extraneous. See the end of the message.

--- a/Python/ceval.c    Thu Jan 01 23:54:01 2009 +0100
+++ b/Python/ceval.c    Sun Jan 04 14:21:16 2009 -0500
@@ -1063,12 +1072,12 @@
                }

        fast_next_opcode:
-               f->f_lasti = INSTR_OFFSET();

                /* line-by-line tracing support */

                if (_Py_TracingPossible &&
                    tstate->c_tracefunc != NULL && !tstate->tracing) {
+                       f->f_lasti = INSTR_OFFSET();
                        /* see maybe_call_line_trace
                           for expository comments */
                        f->f_stacktop = stack_pointer;
msg79335 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2009-01-07 15:09
Paolo> Various techniques allow to create binary code from the
    Paolo> interpreter binary, by just pasting together the code for the
    Paolo> common interpreters cases and producing calls to the other. But,
    Paolo> guess what, on most platforms (except plain x86, but including
    Paolo> x86_64 and maybe x86 for the shared library case) this does not
    Paolo> work if the copied code includes function calls (on x86_64 that's
    Paolo> due to RIP-relative addressing, and on similar issues on other
    Paolo> platforms).

I don't understand.  I know little or nothing about the details of various
instruction set architectures or linkage methods.  Can you break it down
into a simple example?

Skip
msg79365 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-07 19:26
@skip:
In simple words, the x86 call:
  call 0x2000
placed at address 0x1000 becomes:
  call %rip + 0x1000

RIP holds the instruction pointer, which will be 0x1000 in this case
(actually, I'm ignoring the detail that when executing the call, RIP
points to the first byte of the next instruction).

If I execute the same instruction from a different location (i.e.
different RIP), things will break. So, only code for opcodes without
real calls, nor access to globals can be copied like this (inlines are OK).
With refcounting, not even POP_TOP is safe since it can call
destructors. DUP_TOP is still safe, I guess.
msg79505 - (view) Author: Daniel Diniz (ajaksu2) Date: 2009-01-09 21:02
Paolo,
Applying your patches makes no difference with gcc 4.2 and gives a
barely noticeable (~2%) slowdown with icc. These results are from a
Celeron M 410 (Core Solo Yonah-based), so it's a rather old platform to
run benchmarks on.
msg79530 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-10 04:52
@ ajaksu2
> Applying your patches makes no difference with gcc 4.2 and gives a
> barely noticeable (~2%) slowdown with icc.
"Your patches" is something quite unclear :-)
Which are the patch sets you are comparing?
And on 32 or 64 bits? But does Yonah supports 64bits? IIRC no, but I'm
not sure.
I would be surprised from slowdowns for restore-old-oparg-load.diff,
really surprised.
And I would be just surprised by slowdowns on
reenable-static-prediction.diff.
Also, about ICC output, we still need to ensure that it's properly
compiled (see above the instructions for counting "jmp *" or similar).
In the measurements above, ICC did miscompile the patch with the switch.
By "properly compiled" I mean that separate indirect branches are
generated, instead of just one.

> These results are from a
> Celeron M 410 (Core Solo Yonah-based), so it's a rather old platform to
> run benchmarks on.

Not really - at the very least we should listen to results on Pentium 4,
Core (i.e. Yonah) and Core 2, and I would also add Pentium3/Pentium M to
represent the P6 family.
Anyway, I have to do my benchmarks on this, I hope this weekend I'll
have time.
msg79532 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-10 08:52
The standing question is still: can we get ICC to produce the expected 
output? It looks like we still didn't manage, and since ICC is the best 
compiler out there, this matters.
Some problems with SunCC, even if it doesn't do jump sharing, it seems 
that one doesn't get the speedups - I guess that on most platforms we 
should select the most common alternative for interpreters (i.e. no 
switch, one jump table, given by threadedceval5.patch + 
abstract-switch-reduced.diff).

On core platforms we can spend time on fine-tuning - and the definition 
of "core platforms" is given by "do developers want to test for that?".

When that's fixed, I think that we just have to choose the simpler form 
and merge that.

@alexandre:
[about removing the switch]
> There is no speed difference on pybench on x86; on x86-64, the code 
is slower due to the opcode fetching change.

Actually, on my machine it looks like the difference is caused by the 
different layout caused by switch removal, or something like that, 
because fixing the opcode fetching doesn't make a difference here (see 
below).

Indeed, I did my benchmarking duties. Results are that 
abstract-switch-reduced.diff (the one removing the switch) gives a 1-3% 
slowdown, and that all the others don't make a significant difference. 
The differences in the assembly output seem to be due to a different 
code layout for some branches, I didn't have a closer look.

However, experimenting with -falign-labels=16 can give a small speedup, 
I'm trying to improve the results (what I actually want is to align 
just the opcode handlers, I'll probably do that by hand).

reenable-static-prediction can give either a slowdown or a speedup by 
around 1%, i.e. around the statistical noise.

Note that on my machine, I get only a 10% speedup with the base patch, 
and that is more reasonable here. In the original thread on PyPy-dev, I 
got a 20% one with the Python interpreter I built for my student 
project, since that one is faster* (by a 2-3x factor, like PyVM), so 
the dispatch cost is more significant, and reducing it has a bigger 
impact. In fact, I couldn't believe that Python got the same speedup.

This is a Core 2 Duo T7200 (Merom) in 64bit mode with 4MB of L2 cache, 
and since it's a laptop I expect it to have slower RAM than a desktop.

@alexandre:
> The patch make a huge difference on 64-bit Linux. I get a 20% 
speed-up and the lowest run time so far. That is quite impressive!
Which processor is that?

@pitrou:
> The machine I got the 15% speedup on is in 64-bit mode with gcc
4.3.2.

Which is the processor? I guess the bigger speedups should be on 
Pentium4, since it has the bigger mispredict penalties.

====
*DISCLAIMER: the interpreter of our group (me and Sigurd Meldgaard) is 
not complete, has some bugs, and the source code has not yet been 
published, so discussion about why it is faster shall not happen here - 
I want to avoid any flame.
I believe it's not because of skipped runtime checks or such stuff, but 
because we used garbage collection instead of refcounting, indirect  
threading and tagged integers, but I don't have time to discuss that 
yet.
The original thread on pypy-dev has some insights if you are interested 
on this.
msg79536 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-10 09:44
> @pitrou:
> > The machine I got the 15% speedup on is in 64-bit mode with gcc
> 4.3.2.
> 
> Which is the processor? I guess the bigger speedups should be on 
> Pentium4, since it has the bigger mispredict penalties.

Athlon X2 3600+.
msg79537 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-10 09:54
> It looks like we still didn't manage, and since ICC is the best 
> compiler out there, this matters.

Well, from the perspective of Python, what matters mostly is the
commonly used compilers (that is, gcc and MSVC). I doubt many people
compile Python with icc, honestly.

Same for CPU-specific tuning: I don't think we want to ship Python with
compiler flags which depend on the particular CPU being used.
msg79539 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-01-10 11:29
On 2009-01-10 10:55, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
>> It looks like we still didn't manage, and since ICC is the best 
>> compiler out there, this matters.
> 
> Well, from the perspective of Python, what matters mostly is the
> commonly used compilers (that is, gcc and MSVC). I doubt many people
> compile Python with icc, honestly.

Agreed. Our main targets should be GCC for Linux and MSVC for Windows.

On other platforms such as Solaris and AIX, the native vendor compilers
are commonly used for compiling Python.

That said, with a configure option to turn the optimization on and
off, there shouldn't be any problem with slowdowns.

> Same for CPU-specific tuning: I don't think we want to ship Python with
> compiler flags which depend on the particular CPU being used.

Certainly not in the binaries we release on python.org.

Of course, people are still free to setup OPT to have the compiler
generate CPU specific code.
msg79541 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-10 12:08
> Same for CPU-specific tuning: I don't think we want to ship Python
with compiler flags which depend on the particular CPU being used.

I wasn't suggesting this - but since different CPUs have different
optimization rules, something like "oh, 20% performance slowdown on
PowerPC" or "on P4" is important to know (and yeah, configure options
are a good solution).

Which is the barrier for platform-specific tricks, as long as the code
is still portable? I'd like to experiment with __builtin_expect and with
manual alignment (through 'asm volatile(".p2align 4")' on x86/x86_64
with GAS - PPC might need a different alignment probably).

All hidden through macros to make it disappear on unsupported platforms,
without any configure option for them (there shouldn't be the need for
that).

> I doubt many people compile Python with icc, honestly.

Yep :-(. <rant>Why don't distributors do it?</rant> (First culprit might
be license/compatibility problems I guess, but the speedup would be
worth the time to fix the troubles IMHO).
msg79545 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-10 15:03
> (First culprit might
> be license/compatibility problems I guess, but the speedup would be
> worth the time to fix the troubles IMHO).

That would be the obvious reason IMO. And Intel is the only one who can
"fix the troubles".
msg79605 - (view) Author: Jeffrey Yasskin (jyasskin) * (Python committer) Date: 2009-01-11 17:37
Here's a port of threadedceval5.patch to trunk. It passes the tests. I
haven't benchmarked this exact patch, but on one Intel Core2, a similar
patch got an 11%-14% speedup (on 2to3 and pybench).

I've also cleaned up Jakob Sievers' vmgen patch (representing
forth-style dispatch) a bit so that it passes all the tests, and on the
same machine it got a 13%-17% speedup. The vmgen patch is not quite at
feature parity (it throws out support for LLTRACE and a couple other
#defines), and there are fairly good arguments against committing it to
python at all (it requires installing and modifying vmgen to build), but
I'll post it after I've ported it to trunk.

Re skip and paolo: JITting and machine-specific assembly will probably
be important to speeding up Python in the long run, but they'll also
take a long while to get right, so we shouldn't let them distract us
from committing the dispatch optimization.
msg79627 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2009-01-11 20:46
Benchmarking pitrou_dispatch_2.7.patch applied to trunk r68522 on a 32-
bit Efficeon (x86) using gcc 4.2.4-1ubuntu3 yields a 10% pybench 
speedup.
msg79688 - (view) Author: Jeffrey Yasskin (jyasskin) * (Python committer) Date: 2009-01-12 17:58
Here's the vmgen-based patch for comparison. Again, it passes all the
tests, but isn't complete outside of that and (unless consensus develops
that a couple percent is worth requiring vmgen) shouldn't distract from
reviewing Antoine's patch. I'll look over threadedceval5.patch in detail
next.
msg79695 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-12 19:03
A couple percent maybe is not worth vmgen-ing. But even if I'm not a
vmgen expert, I read many papers from Ertl about superinstructions and
replication, so the expected speedup from vmgen'ing is much bigger.
Is there some more advanced feature we are not using and could use?
Have the previous static predictions be converted to superinstructions?
Have other common sequences be treated like that?
Is there an existing discussion on this?
msg79699 - (view) Author: Jeffrey Yasskin (jyasskin) * (Python committer) Date: 2009-01-12 20:19
I've left some line-by-line comments at
http://codereview.appspot.com/11905. Sorry if there was already a
Rietveld thread; I didn't see one.
msg79720 - (view) Author: Jeffrey Yasskin (jyasskin) * (Python committer) Date: 2009-01-13 05:56
@Paolo: I'm going to be looking into converting more common sequences
into superinstructions. We only have LOAD_CONST+XXX so far. The others
are difficult because vmgen doesn't provide easy ways to deal with error
handling, but Jakob and I have come up with a couple ideas to get around
that.

Replication would be trickier since we want the bytecode generation to
be deterministic, but it's probably doable too. I'll post any
significant results I get.
msg79722 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-13 06:45
Ok, then vmgen adds almost just direct threading instead of indirect
threading.

Since the purpose of superinstructions is to eliminate dispatch
overhead, and that's more important when little actual work is done,
what about all ones which unconditionally end with FAST_DISPATCH and are
common? DUP_TOP, POP_TOP, DUP_TOPX(2,3) and other stack handling staff
which can't fail? To have any of them + XXX without error handling
problems? Removing handling of DUP_TOPX{4,5} is implied, you shouldn't
check functionality of the compiler during interpretation - indeed, even
the idea using a parameter for that is a waste. Have DUP_TOPX2 and
DUP_TOPX3, like JVM, is just simpler.

> Replication would be trickier since we want the bytecode generation to
be deterministic, but it's probably doable too.

Bytecode conversion during I/O is perfectly fine, to convert from the
actual bytecode to one of the chosen replicas. Conversion in a rescan
pass can be also OK (less cache friendly though, so if it's easy to
avoid, please do).
msg79734 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-13 11:37
As for superinstructions, you can find an example here: #4715.
msg79835 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-14 06:31
#4715 is interesting, but is not really about superinstructions.
Superinstructions are not created because they make sense; any common
sequence of opcodes can become a superinstruction, just for the point of
saving dispatches. And the creation can even be dynamic!

However, when I'll have substantial time for coding, I'd like to spend
it experimenting with subroutine threading. vmgen author despises it,
but nowadays it probably became even faster, as discussed by the article
"Context threading: A flexible and efficient dispatch technique for
virtual machine interpreters".
msg79980 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-16 21:58
Here is an updated patch with a dedicated configure option
(--with-computed-gotos, disabled by default), rather than a compiler
detection switch.

(sorry, the patch is very long because it seems running autoconf changes
a lot of things in the configure script)
msg79984 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2009-01-16 23:02
Antoine> (sorry, the patch is very long because it seems running
    Antoine> autoconf changes a lot of things in the configure script)

Normal practice is to not include the configure script in such patches and
indicate to people that they will need to run autoconf.

Skip
msg79985 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-16 23:08
Thanks Skip, it makes sense... so here is a patch without the configure
script.

(I wonder however if those huge configure changes, when checked into the
SVN, could break something silently somewhere)
msg80324 - (view) Author: Stefan Ring (Ringding) Date: 2009-01-21 14:02
Hi,

I ported threadedceval6.patch to Python 2.5.4, in case anyone is
interested...

Note that you need to run autoconf and autoheader.
msg80489 - (view) Author: Jeffrey Yasskin (jyasskin) * (Python committer) Date: 2009-01-25 00:15
In the comment, you might mention both -fno-crossjumping and -fno-gcse.
-fno-crossjumping's description looks like it ought to prevent combining
computed gotos, but
http://gcc.gnu.org/onlinedocs/gcc-4.3.3/gcc/Optimize-Options.html says
-fno-gcse actually does it, and in my brief tests, the manual is
actually correct (compiling with just -fno-crossjumping combined gotos
anyway).

Otherwise, threadedceval6.patch looks good to submit to me.
msg80515 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-25 16:42
Committed in py3k in r68924.
I won't backport it to trunk myself but it should be easy enough,
provided people are interested.
msg80571 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-26 15:28
-fno-gcse is controversial.
Even if it might avoid jumps sharing, the impact of that option has to
be measured, since common subexpression elimination allows omitting some
recalculations, so disabling global CSE might have a negative impact on
other code.

It would be maybe better to disable GCSE only for the interpreter loop,
but that would make some intra-file inlining impossible.
msg80669 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2009-01-27 21:40
I'll take on the two remaining tasks for this:

* add configure magic to detect when the compiler supports this so
  that it can default to --with-computed-gotos on modern systems.
* commit the back port to 2.7 trunk.
msg80717 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-28 16:42
For the record, I've compiled py3k on an embarassingly fast Core2-based
server (Xeon E5410), and the computed gotos option gives a 16% speedup
on pybench and pystone.

(with gcc 4.3.2 in 64-bit mode)
msg80828 - (view) Author: Kevin Watters (kevinwatters) Date: 2009-01-30 17:00
Does anyone know the equivalent ICC command line option for GCC's -fno-
gcse? (Or if it has one?) I can't find a related option in the docs.

It looks like ICC hits the same combining goto problem, as was 
mentioned: without changing any options, I applied pitrou_dispatch_2.7.patch to release-26maint and pybench reported 
literally no difference, +0.0%.

Even if stock Python is built with MSVC, devs like myself who ship 
Python would like to see the benefit.
msg80832 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-01-30 19:04
The x86 gentoo buildbot is failing to compile, with error:

/Python/makeopcodetargets.py ./Python/opcode_targets.h
  File "./Python/makeopcodetargets.py", line 28
    f.write(",\n".join("\t&&%s" % s for s in targets))
                                      ^
SyntaxError: invalid syntax
make: *** [Python/opcode_targets.h] Error 1

I suspect that it's because the system Python on this buildbot is Python 
2.3, which doesn't understand generator expressions.  Are there any 
objections to me adding a couple of square brackets to this line to turn 
the argument of join into a list comprehension?
msg80833 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-01-30 19:21
One other thought:  it seems that as a result of this change, the py3k 
build process now depends on having some version of Python already 
installed;  before this, it didn't.  Is this true, or am I misinterpreting 
something?

Might it be worth adding the file Python/opcode_targets.h to the 
distribution to avoid this problem?
msg80834 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-01-30 19:23
Sorry:  ignore that last.  Python/opcode_targets.h is already part of the 
distribution.  I don't know what I was doing wrong.
msg80838 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-30 21:38
Mark:

"""Are there any objections to me adding a couple of square brackets to
this line to turn the argument of join into a list comprehension?"""

No problems for me. You might also add to the top comments of the file
that it is 2.3-compatible.
msg80867 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-01-31 12:27
Square brackets added in r69133.  The gentoo x86 3.x buildbot seems to be 
passing the compile stage now.  (Though not the test stage, of course:  
one can't have everything!)
msg80881 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-31 17:17
> Square brackets added in r69133.  The gentoo x86 3.x buildbot seems to be 
> passing the compile stage now.  (Though not the test stage, of course:  
> one can't have everything!)

The test failure also happens on trunk, it may be related to the recent
tk changes.
msg80882 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-01-31 19:36
> The test failure also happens on trunk, it may be related to the recent
> tk changes.

Yes; sorry---I didn't mean to suggest that the test failures were in any 
way related to the opcode dispatch stuff.  Apart from the ttk teething 
difficulties, there's a weird 'Unknown signal 32' error that's been going 
on on the gentoo buildbot for many months now.  But that's a separate 
issue (issue #4970, to be precise).
msg81120 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2009-02-04 01:05
This has been checked in, right?  Might I suggest that the TARGET and
TARGET_WITH_IMPL macros not include the trailing colon?  I think that
will make it more friendly toward "smart" editors such as Emacs' C
mode.  I definitely get better indentation with

    TARGET(NOP):

than with

    TARGET(NOP)

S
msg81166 - (view) Author: Gabriel Genellina (ggenellina) Date: 2009-02-04 21:42
> Might I suggest that the TARGET and TARGET_WITH_IMPL macros not 
> include the trailing colon? 

Yes, please!
msg81341 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-07 17:43
Skip, removing the colon doesn't work if the macro adds code after the
colon :)
msg81361 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2009-02-08 01:31
Antoine> Skip, removing the colon doesn't work if the macro adds code
    Antoine> after the colon :)

When I looked I thought both TARGET and TARGET_WITH_IMPL ended with a colon,
but I see that's not the case.  How about removing TARGET_WITH_IMPL and just
include the goto explicitly?  There are only a few instances of the
TARGET_WITH_IMPL used.

Skip
msg83959 - (view) Author: Andrew I MacIntyre (aimacintyre) * Date: 2009-03-22 07:37
Out of interest, the attached patch against the py3k branch at r70516
cleans up the threaded code changes a little:
- gets rid of TARGET_WITH_IMPL macro;
- TARGET(op) is followed by a colon, so that it looks like a label (for
editors that make use of that).

On my systems (all older AMD with old versions of gcc), this patch has
performance slightly better than SVN r70516, and performance is
usually very close to the NO_COMPUTED_GOTOS build.
msg84736 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2009-03-31 01:19
Is a backport to 2.7 still planned?
msg84764 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-03-31 08:54
On 2009-03-31 03:19, A.M. Kuchling wrote:
> A.M. Kuchling <lists@amk.ca> added the comment:
> 
> Is a backport to 2.7 still planned?

I hope it is.
msg84770 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-31 12:58
Andrew, your patch disables the optimization that HAS_ARG(op) is a
constant when op is known by the compiler (that is, inside a
"TARGET_##op" label), so I'd rather keep the version which is currently
in SVN.
msg85838 - (view) Author: Andrew I MacIntyre (aimacintyre) * Date: 2009-04-10 02:15
Antoine, in my testing the "loss" of the HAS_ARG() optimisation in my
patch appears to have negligible cost on i386, but starts to look
significant on amd64.

On an Intel E8200 cpu running FreeBSD 7.1 amd64, with gcc 7.2.1 and the
3.1a2 sources, the computed goto version is ~8% faster (average time of
all rounds) for pybench (with warp factor set to 3 rather than the
default 10, to get the round time up over 10s) than without computed
gotos.  With my patch applied, the computed goto version is ~5.5% faster
than without computed gotos by the same measure.  On this platform,
Pystone rates at ~86k (no computed gotos), ~85k (computed gotos) and
~82k (computed gotos + my patch).

For comparison, this machine running Windows XP (32 bit) with the
python.org builds rates ~92k pystones for 2.6.1 and ~81k for 3.1a2. 
Pybench isn't distributed in the MSI installers :-(
msg90688 - (view) Author: Michele Dionisio (mdionisio) Date: 2009-07-18 19:58
I have patch the code of python3.1 to use computed goto tecnique also
with Visual Studio. The performance result is not good (I really don't
know why). But it is a good work-araound for use the computed goto also
on windows.
The only diffentes is that the opcode_targets vector is filled at run-time.
msg110842 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-07-19 23:55
This is too late for 2.x now, closing.
History
Date User Action Args
2010-07-19 23:55:22pitrousetstatus: open -> closed
resolution: accepted -> fixed
messages: + msg110842
2010-05-20 20:36:27skip.montanarosetnosy: - skip.montanaro
2010-02-14 18:23:32willpsetnosy: + willp
2009-07-18 19:58:12mdionisiosetfiles: + newfile.zip
versions: + Python 3.1, - Python 2.7
nosy: + mdionisio

messages: + msg90688
2009-07-02 16:23:23jceasetnosy: + jcea
2009-04-11 18:08:05alexandre.vassalottisetnosy: - alexandre.vassalotti
2009-04-11 09:06:18mark.dickinsonsetnosy: - mark.dickinson
2009-04-10 02:15:40aimacintyresetmessages: + msg85838
2009-03-31 12:58:47pitrousetmessages: + msg84770
versions: - Python 3.1
2009-03-31 08:54:01lemburgsetmessages: + msg84764
2009-03-31 01:19:06akuchlingsetnosy: + akuchling
messages: + msg84736
2009-03-22 07:37:30aimacintyresetfiles: + ceval.c.threadcode-tidyup.patch
nosy: + aimacintyre
messages: + msg83959

2009-02-20 22:13:16jabsetnosy: + jab
2009-02-08 01:31:50skip.montanarosetmessages: + msg81361
2009-02-07 17:43:44pitrousetmessages: + msg81341
2009-02-04 21:42:08ggenellinasetnosy: + ggenellina
messages: + msg81166
2009-02-04 01:05:18skip.montanarosetmessages: + msg81120
2009-01-31 19:36:09mark.dickinsonsetmessages: + msg80882
2009-01-31 17:17:57pitrousetmessages: + msg80881
2009-01-31 12:27:57mark.dickinsonsetmessages: + msg80867
2009-01-30 21:38:51pitrousetmessages: + msg80838
2009-01-30 19:23:20mark.dickinsonsetmessages: + msg80834
2009-01-30 19:21:01mark.dickinsonsetmessages: + msg80833
2009-01-30 19:04:28mark.dickinsonsetnosy: + mark.dickinson
messages: + msg80832
2009-01-30 17:00:12kevinwatterssetmessages: + msg80828
2009-01-28 16:42:30pitrousetmessages: + msg80717
2009-01-27 21:40:50gregory.p.smithsetstatus: pending -> open
assignee: gregory.p.smith
messages: + msg80669
versions: + Python 3.1
2009-01-26 15:56:40kevinwatterssetnosy: + kevinwatters
2009-01-26 15:28:34blaisorbladesetmessages: + msg80571
2009-01-25 16:42:16pitrousetstatus: open -> pending
messages: + msg80515
versions: - Python 2.6, Python 3.0, Python 3.1
resolution: accepted
stage: patch review -> resolved
2009-01-25 00:15:11jyasskinsetmessages: + msg80489
2009-01-21 14:02:29Ringdingsetfiles: + threadedceval6-py254.patch
nosy: + Ringding
messages: + msg80324
2009-01-16 23:08:57pitrousetfiles: - threadedceval6.patch
2009-01-16 23:08:42pitrousetfiles: + threadedceval6.patch
messages: + msg79985
2009-01-16 23:02:37skip.montanarosetmessages: + msg79984
2009-01-16 21:59:31pitrousetfiles: + threadedceval6.patch
messages: + msg79980
2009-01-14 06:31:21blaisorbladesetmessages: + msg79835
2009-01-13 11:37:23pitrousetmessages: + msg79734
2009-01-13 06:45:37blaisorbladesetmessages: + msg79722
2009-01-13 05:56:42jyasskinsetmessages: + msg79720
2009-01-12 20:19:13jyasskinsetmessages: + msg79699
2009-01-12 19:03:15blaisorbladesetmessages: + msg79695
2009-01-12 17:59:08jyasskinsetfiles: + vmgen_2.7.patch
messages: + msg79688
2009-01-11 20:46:17gregory.p.smithsetmessages: + msg79627
2009-01-11 17:37:23jyasskinsetfiles: + pitrou_dispatch_2.7.patch
messages: + msg79605
2009-01-11 13:32:09spivsetnosy: + spiv
2009-01-11 02:53:57belopolskysetnosy: + belopolsky
2009-01-10 15:03:23pitrousetmessages: + msg79545
2009-01-10 12:08:54blaisorbladesetmessages: + msg79541
2009-01-10 11:29:36lemburgsetmessages: + msg79539
2009-01-10 09:54:59pitrousetmessages: + msg79537
2009-01-10 09:44:37pitrousetmessages: + msg79536
2009-01-10 08:52:30blaisorbladesetmessages: + msg79532
2009-01-10 04:52:53blaisorbladesetmessages: + msg79530
2009-01-09 21:02:19ajaksu2setmessages: + msg79505
2009-01-08 01:23:33gregory.p.smithsetnosy: + gregory.p.smith
2009-01-07 19:26:44blaisorbladesetmessages: + msg79365
2009-01-07 15:09:28skip.montanarosetmessages: + msg79335
2009-01-07 11:09:44blaisorbladesetfiles: + reenable-static-prediction.diff
messages: + msg79321
2009-01-07 10:12:53blaisorbladesetfiles: + restore-old-oparg-load.diff
2009-01-07 10:11:24blaisorbladesetfiles: + abstract-switch-reduced.diff
2009-01-07 07:35:37blaisorbladesetmessages: + msg79302
2009-01-06 14:30:08pitrousetmessages: + msg79264
2009-01-06 04:44:56jyasskinsetnosy: + collinwinter, jyasskin
2009-01-05 10:37:39pitrousetmessages: + msg79123
2009-01-05 02:39:45blaisorbladesetmessages: + msg79105
2009-01-04 19:29:57alexandre.vassalottisetfiles: + abstract-switch.diff
messages: + msg79080
2009-01-04 18:52:25alexandre.vassalottisetmessages: + msg79077
2009-01-04 14:38:36blaisorbladesetmessages: + msg79058
2009-01-04 13:40:47blaisorbladesetmessages: + msg79056
2009-01-04 13:22:40ralph.corderoysetnosy: lemburg, skip.montanaro, rhettinger, facundobatista, pitrou, christian.heimes, ajaksu2, alexandre.vassalotti, djc, ralph.corderoy, bboissin, blaisorblade, theatrus
2009-01-04 13:21:17ralph.corderoysetnosy: + ralph.corderoy
messages: + msg79053
2009-01-04 13:14:16skip.montanarosetmessages: + msg79051
2009-01-04 13:11:35facundobatistasetnosy: + facundobatista
2009-01-04 04:51:14alexandre.vassalottisetmessages: + msg79037
2009-01-04 03:07:09ajaksu2setmessages: + msg79035
2009-01-04 02:29:09blaisorbladesetmessages: + msg79034
2009-01-04 02:28:10blaisorbladesetmessages: + msg79033
2009-01-04 02:00:50ajaksu2setnosy: + ajaksu2
messages: + msg79032
2009-01-04 00:34:54bboissinsetnosy: + bboissin
2009-01-03 22:15:14theatrussetnosy: + theatrus
2009-01-03 17:24:35djcsetnosy: + djc
2009-01-03 13:22:38pitrousetmessages: + msg78956
2009-01-03 02:20:58blaisorbladesetmessages: + msg78925
2009-01-03 01:51:47blaisorbladesetmessages: + msg78923
2009-01-03 01:37:18christian.heimessetmessages: + msg78921
2009-01-03 01:00:51skip.montanarosetfiles: + ceval.i.threaded
2009-01-03 00:59:21skip.montanarosetfiles: + ceval.i.unthreaded
messages: + msg78915
2009-01-03 00:48:07alexandre.vassalottisetmessages: + msg78911
2009-01-03 00:45:28alexandre.vassalottisetfiles: + amd-athlon64-x2-64bit-pybench.txt
messages: + msg78910
2009-01-02 23:03:16skip.montanarosetmessages: + msg78899
2009-01-02 22:57:07pitrousetmessages: + msg78898
2009-01-02 20:27:54skip.montanarosetfiles: + pybench.sum.Intel
messages: + msg78872
2009-01-02 20:26:57skip.montanarosetfiles: + pybench.sum.PPC
messages: + msg78871
2009-01-02 16:40:37lemburgsetmessages: + msg78837
2009-01-02 16:28:28pitrousetmessages: + msg78834
2009-01-02 16:17:41skip.montanarosetmessages: + msg78831
2009-01-02 16:10:34pitrousetmessages: + msg78828
2009-01-02 14:57:15lemburgsetmessages: + msg78808
2009-01-02 02:53:57blaisorbladesetmessages: + msg78748
2009-01-01 23:02:19pitrousetfiles: - threadedceval4.patch
2009-01-01 22:59:50pitrousetfiles: + threadedceval5.patch
messages: + msg78736
2009-01-01 22:48:36pitrousetmessages: + msg78735
2009-01-01 22:00:25alexandre.vassalottisetmessages: + msg78731
2009-01-01 21:58:40alexandre.vassalottisetfiles: + amd-athlon64-x2-gcc-sunos-pybench.txt
2009-01-01 21:58:26alexandre.vassalottisetfiles: + amd-athlon64-x2-suncc-pybench.txt
messages: + msg78730
2009-01-01 19:43:17blaisorbladesetmessages: + msg78722
2009-01-01 14:00:39skip.montanarosetmessages: + msg78711
2009-01-01 13:06:20pitrousetmessages: + msg78708
2009-01-01 10:09:43arigosetnosy: - arigo
2009-01-01 05:15:11blaisorbladesetmessages: + msg78691
2009-01-01 05:03:04blaisorbladesetmessages: + msg78689
2009-01-01 04:36:56blaisorbladesetmessages: + msg78688
2009-01-01 04:28:09blaisorbladesetmessages: + msg78687
2009-01-01 04:10:21skip.montanarosetmessages: + msg78686
2009-01-01 03:44:47blaisorbladesetmessages: + msg78685
versions: + Python 2.6, Python 3.0, Python 2.7
2009-01-01 03:42:49blaisorbladesetnosy: + blaisorblade
2008-12-31 23:01:37pitrousetfiles: - threadedceval2.patch
2008-12-31 23:00:50pitrousetfiles: - threadedceval3.patch
2008-12-31 22:53:44pitrousetfiles: + threadedceval4.patch
messages: + msg78663
2008-12-31 22:41:01pitrousetmessages: + msg78660
2008-12-31 22:24:15skip.montanarosetmessages: + msg78658
2008-12-31 22:17:17pitrousetmessages: + msg78657
2008-12-31 22:12:28skip.montanarosetmessages: + msg78656
2008-12-31 22:11:49skip.montanarosetmessages: - msg78655
2008-12-31 22:10:48skip.montanarosetmessages: + msg78655
2008-12-31 21:28:41rhettingersetnosy: + rhettinger
2008-12-31 21:21:35alexandre.vassalottisetfiles: + intel-core2-mobile-pybench.txt
2008-12-31 21:21:01alexandre.vassalottisetfiles: + amd-athlon64-x2-pybench.txt
nosy: + alexandre.vassalotti
messages: + msg78653
2008-12-31 20:25:01pitrousetfiles: + threadedceval3.patch
messages: + msg78651
2008-12-31 17:26:47skip.montanarosetnosy: + skip.montanaro
2008-12-31 16:40:49pitrousetmessages: + msg78629
2008-12-31 16:37:23christian.heimessetmessages: + msg78628
2008-12-31 16:15:21pitrousetmessages: + msg78620
2008-12-31 16:02:04christian.heimessetmessages: + msg78616
2008-12-31 15:45:19christian.heimessetnosy: + christian.heimes
2008-12-27 13:30:05pitrousetfiles: - threadedceval1.patch
2008-12-27 13:29:26pitrousetfiles: + threadedceval2.patch
messages: + msg78364
2008-12-27 13:27:21pitrousetmessages: + msg78363
2008-12-27 13:21:07lemburgsetnosy: + lemburg
messages: + msg78362
2008-12-26 21:22:47pitrousetnosy: + arigo
messages: + msg78307
2008-12-26 21:09:38pitroucreate