
Faster opcode dispatch on gcc #49003

Closed
pitrou opened this issue Dec 26, 2008 · 117 comments
Labels: interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage)

@pitrou
Member
pitrou commented Dec 26, 2008

BPO 4753
Nosy @malemburg, @akuchling, @rhettinger, @facundobatista, @gpshead, @jcea, @db3l, @abalkin, @pitrou, @tiran, @devdanzin, @rbtcollins, @ned-deily, @djc, @bitdancer, @jab
Files
  • amd-athlon64-x2-pybench.txt
  • intel-core2-mobile-pybench.txt
  • amd-athlon64-x2-suncc-pybench.txt
  • amd-athlon64-x2-gcc-sunos-pybench.txt
  • threadedceval5.patch
  • pybench.sum.PPC: Pybench summary for Mac G5
  • pybench.sum.Intel: Pybench summary for MacBook Pro (Intel Core2 Duo)
  • amd-athlon64-x2-64bit-pybench.txt
  • ceval.i.unthreaded: PPC (G5) unthreaded eval assembler
  • ceval.i.threaded: PPC (G5) threaded eval assembler
  • abstract-switch.diff
  • abstract-switch-reduced.diff: Fixed version of abstract-switch.diff, without extraneous changes (IMHO)
  • restore-old-oparg-load.diff: Additional fix, on top of threadedceval5.patch and abstract-switch-reduced.diff, to restore performance and have no switch at all
  • reenable-static-prediction.diff: Reenable static opcode prediction on top of the rest
  • pitrou_dispatch_2.7.patch: threadedceval5.patch ported to trunk (2.7)
  • vmgen_2.7.patch: vmgen (GForth)-based dispatch optimization with simple superinstructions
  • threadedceval6.patch
  • threadedceval6-py254.patch: threadedceval6 for Python 2.5.4
  • ceval.c.threadcode-tidyup.patch: threaded code tidy up in eval.c
  • newfile.zip: ceval.c and opcode_targets.h
  • cgoto_py2710_hg_final.patch: Computed Goto Patch for Python 2.7.10
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/gpshead'
    closed_at = <Date 2010-07-19.23:55:22.123>
    created_at = <Date 2008-12-26.21:09:38.283>
    labels = ['interpreter-core', 'performance']
    title = 'Faster opcode dispatch on gcc'
    updated_at = <Date 2015-06-02.21:38:03.306>
    user = 'https://github.com/pitrou'

    bugs.python.org fields:

    activity = <Date 2015-06-02.21:38:03.306>
    actor = 'db3l'
    assignee = 'gregory.p.smith'
    closed = True
    closed_date = <Date 2010-07-19.23:55:22.123>
    closer = 'pitrou'
    components = ['Interpreter Core']
    creation = <Date 2008-12-26.21:09:38.283>
    creator = 'pitrou'
    dependencies = []
    files = ['12512', '12513', '12521', '12522', '12524', '12545', '12546', '12551', '12553', '12555', '12584', '12633', '12634', '12635', '12687', '12705', '12769', '12824', '13392', '14521', '39527']
    hgrepos = []
    issue_num = 4753
    keywords = ['patch']
    message_count = 117.0
    messages = ['78306', '78307', '78362', '78363', '78364', '78616', '78620', '78628', '78629', '78651', '78653', '78656', '78657', '78658', '78660', '78663', '78685', '78686', '78687', '78688', '78689', '78691', '78708', '78711', '78722', '78730', '78731', '78735', '78736', '78748', '78808', '78828', '78831', '78834', '78837', '78871', '78872', '78898', '78899', '78910', '78911', '78915', '78921', '78923', '78925', '78956', '79032', '79033', '79034', '79035', '79037', '79051', '79053', '79056', '79058', '79077', '79080', '79105', '79123', '79264', '79302', '79321', '79335', '79365', '79505', '79530', '79532', '79536', '79537', '79539', '79541', '79545', '79605', '79627', '79688', '79695', '79699', '79720', '79722', '79734', '79835', '79980', '79984', '79985', '80324', '80489', '80515', '80571', '80669', '80717', '80828', '80832', '80833', '80834', '80838', '80867', '80881', '80882', '81120', '81166', '81341', '81361', '83959', '84736', '84764', '84770', '85838', '90688', '110842', '244229', '244231', '244256', '244329', '244617', '244620', '244628', '244693']
    nosy_count = 31.0
    nosy_names = ['lemburg', 'akuchling', 'collinwinter', 'rhettinger', 'facundobatista', 'gregory.p.smith', 'jcea', 'spiv', 'db3l', 'aimacintyre', 'belopolsky', 'ggenellina', 'pitrou', 'christian.heimes', 'ajaksu2', 'jyasskin', 'rbcollins', 'kevinwatters', 'ned.deily', 'djc', 'ralph.corderoy', 'bboissin', 'r.david.murray', 'blaisorblade', 'theatrus', 'Ringding', 'jab', 'mdionisio', 'willp', 'python-dev', 'parasa']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue4753'
    versions = ['Python 3.1']

    @pitrou
    Member Author

    pitrou commented Dec 26, 2008

    This patch implements what is usually called "threaded code" for the
    ceval loop on compilers which support it (only gcc). The idea is that
    there is a separate opcode dispatch epilog at the end of each opcode,
    which allows the CPU to make much better use of its branch prediction
    capabilities. The net result is a 15-20% average speedup on pybench and
    pystone, with higher speedups on very tight loops (see below for the
    full pybench result chart).

    The opcode jump table is generated by a separate script which is called
    as part of the Makefile (just as the asdl building script already is).

    On compilers other than gcc, performance will of course be unchanged.
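
    For readers who want to see the shape of the change, here is a toy
    standalone interpreter contrasting the two dispatch styles (an
    illustrative sketch only: the opcodes and the little program are made
    up, not taken from ceval.c):

    #include <stdio.h>

    enum { OP_INCR, OP_DOUBLE, OP_HALT };

    static int run(const unsigned char *pc)
    {
            /* GCC extension: &&label yields the address of a label, which
               can be stored in a table and jumped to with "goto *".  Each
               handler ends with its own DISPATCH(), i.e. its own indirect
               jump, so the branch predictor keeps a separate prediction
               per opcode.  A switch-based loop instead funnels every
               handler back through one shared indirect jump. */
            static void *targets[] = { &&TARGET_INCR, &&TARGET_DOUBLE,
                                       &&TARGET_HALT };
    #define DISPATCH() goto *targets[*pc++]
            int acc = 0;
            DISPATCH();
    TARGET_INCR:   acc += 1; DISPATCH();
    TARGET_DOUBLE: acc *= 2; DISPATCH();
    TARGET_HALT:   return acc;
    #undef DISPATCH
    }

    int main(void)
    {
            const unsigned char program[] = { OP_INCR, OP_DOUBLE, OP_INCR,
                                              OP_HALT };
            printf("%d\n", run(program));  /* prints 3 */
            return 0;
    }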

    Test                          minimum run-time         average run-time
                                   this  other   diff      this  other   diff
    -------------------------------------------------------------------------------
    BuiltinFunctionCalls:         100ms  107ms  -7.1%     101ms  110ms  -8.2%
    BuiltinMethodLookup:           76ms  106ms -28.1%      78ms  106ms -26.5%
    CompareFloats:                108ms  141ms -23.2%     108ms  141ms -23.2%
    CompareFloatsIntegers:        171ms  188ms  -9.4%     173ms  204ms -15.3%
    CompareIntegers:              165ms  213ms -22.6%     168ms  224ms -25.1%
    CompareInternedStrings:       127ms  169ms -24.6%     127ms  169ms -24.8%
    CompareLongs:                  95ms  124ms -23.1%      95ms  126ms -24.5%
    CompareStrings:               109ms  136ms -20.2%     111ms  139ms -19.9%
    ComplexPythonFunctionCalls:   131ms  150ms -12.4%     136ms  151ms -10.2%
    ConcatStrings:                159ms  171ms  -6.9%     160ms  173ms  -7.4%
    CreateInstances:              148ms  157ms  -5.6%     150ms  158ms  -4.9%
    CreateNewInstances:           112ms  117ms  -4.3%     112ms  118ms  -4.6%
    CreateStringsWithConcat:      144ms  198ms -27.3%     148ms  199ms -25.7%
    DictCreation:                  90ms  104ms -13.3%      90ms  104ms -13.1%
    DictWithFloatKeys:            117ms  153ms -23.7%     117ms  154ms -24.0%
    DictWithIntegerKeys:          104ms  153ms -32.3%     104ms  154ms -32.5%
    DictWithStringKeys:            90ms  140ms -35.7%      90ms  141ms -36.3%
    ForLoops:                     100ms  161ms -38.1%     100ms  161ms -38.1%
    IfThenElse:                   123ms  170ms -28.0%     125ms  171ms -27.1%
    ListSlicing:                  142ms  141ms  +0.3%     142ms  142ms  +0.2%
    NestedForLoops:               135ms  190ms -29.0%     135ms  190ms -29.0%
    NormalClassAttribute:         249ms  281ms -11.5%     249ms  281ms -11.3%
    NormalInstanceAttribute:      110ms  153ms -28.2%     111ms  154ms -28.1%
    PythonFunctionCalls:          106ms  130ms -18.7%     108ms  131ms -17.2%
    PythonMethodCalls:            151ms  169ms -10.1%     152ms  169ms  -9.8%
    Recursion:                    183ms  242ms -24.7%     191ms  243ms -21.4%
    SecondImport:                 142ms  138ms  +2.7%     144ms  139ms  +3.4%
    SecondPackageImport:          146ms  149ms  -2.3%     148ms  150ms  -1.5%
    SecondSubmoduleImport:        201ms  193ms  +3.9%     201ms  195ms  +3.4%
    SimpleComplexArithmetic:       90ms  112ms -19.6%      90ms  112ms -19.8%
    SimpleDictManipulation:       172ms  230ms -25.2%     173ms  231ms -25.0%
    SimpleFloatArithmetic:         98ms  133ms -26.3%      99ms  137ms -27.9%
    SimpleIntFloatArithmetic:     134ms  175ms -23.6%     138ms  176ms -21.6%
    SimpleIntegerArithmetic:      134ms  183ms -26.8%     141ms  183ms -23.1%
    SimpleListManipulation:        91ms  143ms -36.3%      93ms  143ms -35.1%
    SimpleLongArithmetic:          88ms  108ms -17.9%      91ms  109ms -16.2%
    SmallLists:                   127ms  162ms -21.6%     129ms  164ms -21.2%
    SmallTuples:                  149ms  177ms -15.6%     151ms  178ms -15.1%
    SpecialClassAttribute:        423ms  426ms  -0.7%     426ms  430ms  -0.9%
    SpecialInstanceAttribute:     110ms  154ms -28.2%     111ms  154ms -28.3%
    StringMappings:               428ms  443ms  -3.4%     432ms  449ms  -3.7%
    StringPredicates:             124ms  161ms -23.1%     125ms  162ms -22.7%
    StringSlicing:                207ms  223ms  -7.1%     208ms  228ms  -8.7%
    TryExcept:                     72ms  166ms -56.3%      73ms  166ms -56.2%
    TryFinally:                    93ms  120ms -22.9%      93ms  124ms -25.2%
    TryRaiseExcept:                52ms   64ms -19.2%      52ms   65ms -19.2%
    TupleSlicing:                 177ms  195ms  -9.1%     178ms  198ms -10.2%
    WithFinally:                  147ms  163ms -10.2%     147ms  164ms -10.1%
    WithRaiseExcept:              156ms  173ms -10.1%     157ms  174ms  -9.7%
    -------------------------------------------------------------------------------
    Totals:                      6903ms 8356ms -17.4%    6982ms 8443ms -17.3%

    @pitrou pitrou added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Dec 26, 2008
    @pitrou
    Member Author

    pitrou commented Dec 26, 2008

    Armin, from reading the pypy-dev mailing-list it looks like you were
    interested in this.

    @malemburg
    Member

    On 2008-12-26 22:09, Antoine Pitrou wrote:

    This patch implements what is usually called "threaded code" for the
    ceval loop on compilers which support it (only gcc). The idea is that
    there is a separate opcode dispatch epilog at the end of each opcode,
    which allows the CPU to make much better use of its branch prediction
    capabilities. The net result is a 15-20% average speedup on pybench and
    pystone, with higher speedups on very tight loops (see below for the
    full pybench result chart).

    Now I know why you want opcode stats in pybench :-)

    This looks like a promising approach. Is it possible to backport
    this change to Python 2.x as well?

    @pitrou
    Member Author

    pitrou commented Dec 27, 2008

    This looks like a promising approach. Is it possible to backport
    this change to Python 2.x as well?

    Certainly.

    @pitrou
    Member Author

    pitrou commented Dec 27, 2008

    This new patch uses a statically initialized jump table rather than
    specific initialization code.

    @tiran
    Member

    tiran commented Dec 31, 2008

    I'm having trouble understanding the technique of the jump table. Can
    you provide some links to papers that explain the threaded code? I'm
    interested in learning more.
    How does your implementation compare to the GForth based threaded code
    speedwise?

    @pitrou
    Member Author

    pitrou commented Dec 31, 2008

    I'm having trouble understanding the technique of the jump table. Can
    you provide some links to papers that explain the threaded code? I'm
    interested in learning more.

    I haven't read any papers. Having a jump table in itself isn't special
    (the compiler does exactly that when compiling the switch() statement).
    What's special is that a dedicated indirect jump instruction at the end
    of each opcode helps the CPU make a separate prediction for which opcode
    follows the other one, which is not possible with a switch statement
    where the jump instruction is shared by all opcodes. I believe that's
    where most of the speedup comes from.

    If you read the patch it will probably be easy to understand.

    I had the idea to try this after a thread on pypy-dev, there are more
    references there:
    http://codespeak.net/pipermail/pypy-dev/2008q4/004916.html

    How does your implementation compare to the GForth based threaded code
    speedwise?

    Don't know! Your experiments are welcome. My patch is far simpler to
    integrate though (it's small, introduces very few changes and does not
    break any existing tests).

    @tiran
    Member

    tiran commented Dec 31, 2008

    I haven't read any papers. Having a jump table in itself isn't special
    (the compiler does exactly that when compiling the switch() statement).
    What's special is that a dedicated indirect jump instruction at the end
    of each opcode helps the CPU make a separate prediction for which opcode
    follows the other one, which is not possible with a switch statement
    where the jump instruction is shared by all opcodes. I believe that's
    where most of the speedup comes from.

    If you read the patch it will probably be easy to understand.

    You are right. It's easier to understand now that I've learned how the
    opcode_targets table works. Previously I didn't know that one could
    store the address of a label in an array. Before I got it, I wondered
    where the pointers were defined. Is this a special GCC feature? I
    haven't seen it before.

    Don't know! Your experiments are welcome. My patch is far simpler to
    integrate though (it's small, introduces very few changes and does not
    break any existing tests).

    Yes, your patch is much smaller, less intrusive and easier to understand
    with a little background in CS.

    @pitrou
    Member Author

    pitrou commented Dec 31, 2008

    Is this a special GCC feature?

    Yes, it is.
    http://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html
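
    For instance, a minimal standalone illustration of the extension (not
    code from the patch):

    #include <stdio.h>

    int main(void)
    {
            /* &&label takes the address of a label; "goto *" jumps to it. */
            static void *table[] = { &&first, &&second };
            goto *table[1];
    first:
            puts("first");
            return 0;
    second:
            puts("second");  /* this branch is taken */
            return 0;
    }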

    @pitrou
    Member Author

    pitrou commented Dec 31, 2008

    This new patch adds some detailed comments, at Jason Orendorff's request.

    @avassalotti
    Member

    You may want to check out bpo-1408710 in which a similar patch was
    provided, but failed to deliver the desired results.

    I didn't get the advertised ~15% speed-up, but only 4% on my Intel Core2
    laptop and 8% on my AMD Athlon64 X2 desktop. I attached the benchmark
    results.

    The patch looks pretty clean. Here are a few things that caught my
    attention while reading your patch.

    First, you should rename opcode_targets.c to opcode_targets.h. This will
    make it explicit that the file is not compiled, but just included.

    Also, the macro USE_THREADED_CODE should be renamed to something else;
    the word "thread" is too tightly associated with multi-threading.
    Furthermore, threaded code simply refers to code consisting only of
    function calls. Maybe, USE_COMPUTED_GOTO or USE_DIRECT_DISPATCH would be
    better.

    Finally, why do you disable your patch when DYNAMIC_EXECUTION_PROFILE or
    LLTRACE is enabled? I tested your patch with both enabled and I didn't
    see any test failures.

    By the way, SUNCC also supports GCC's syntax for labels as values
    (http://docs.sun.com/app/docs/doc/819-5265/bjabt?l=en&a=view).

    @smontanaro
    Contributor

    Works pretty well for me on my MacBook Pro, but on my G5 it performed
    abysmally. In fact, it ran so much worse that I cleaned up my sandbox
    and did both checks all over again to make sure I didn't mess something
    up. It looks like my MacBook Pro saw about a 7% improvement while my
    G5 saw a 14% degradation. Both computers are running Mac OS X 10.5.6
    with the latest Xcode - 3.1.2. On both computers gcc -v reports 4.0.1,
    Apple build 5490.

    If this is applied to the core I think it will have to select for more
    than just gcc. It will also have to select based on the instruction set
    architecture.

    @pitrou
    Member Author

    pitrou commented Dec 31, 2008

    Works pretty well for me on my MacBook Pro, but on my G5 it performed
    abysmally. In fact, it ran so much worse that I cleaned up my sandbox
    and did both checks all over again to make sure I didn't mess something
    up. It looks like my MacBook Pro saw about a 14% degradation. Both
    computers are running Mac OS X 10.5.6 with the latest Xcode - 3.1.2.
    On both computers gcc -v reports 4.0.1, Apple build 5490.

    You're sure you didn't compile in debug mode or something? Just
    checking.

    @smontanaro
    Contributor

    Antoine> You're sure you didn't compile in debug mode or something? Just
    Antoine> checking.

    There was a cut-n-paste error in that one which I noticed right after
    submitting (man, do I hate the crappy editing capability of <textarea>
    widgets). I removed it within a minute or two and replaced it with a
    correct version. Short answer: Intel thumbs up, PowerPC thumbs down.

    Skip

    @pitrou
    Member Author

    pitrou commented Dec 31, 2008

    Hello,

    You may want to check out bpo-1408710 in which a similar patch was
    provided, but failed to deliver the desired results.

    I didn't get the advertised ~15% speed-up, but only 4% on my Intel Core2
    laptop and 8% on my AMD Athlon64 X2 desktop. I attached the benchmark
    results.

    Thanks. The machine I got the 15% speedup on is in 64-bit mode with gcc
    4.3.2.

    If you want to investigate, you can output the assembler code for
    ceval.c; the command-line should be something like:

    gcc -pthread -c -fno-strict-aliasing -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -IInclude -I./Include -DPy_BUILD_CORE -S -dA Python/ceval.c

    and then count the number of indirect jump instructions in ceval.c:

    grep -E "jmp[[:space:]]\*%" ceval.s

    There should be 85 to 90 of them, roughly. If there are many fewer, then
    the compiler has tried to optimize them by "sharing" them.

    First, you should rename opcode_targets.c to opcode_targets.h. This will
    make it explicit that the file is not compiled, but just included.

    Ok.

    Also, the macro USE_THREADED_CODE should be renamed to something else;
    the word "thread" is too tightly associated with multi-threading.
    Furthermore, threaded code simply refers to code consisting only of
    function calls. Maybe, USE_COMPUTED_GOTO or USE_DIRECT_DISPATCH would be
    better.

    Ok.

    Finally, why do you disable your patch when DYNAMIC_EXECUTION_PROFILE or
    LLTRACE is enabled? I tested your patch with both enabled and I didn't
    see any test failures.

    Because otherwise the measurements these options are meant to produce
    would be meaningless.

    By the way, SUNCC also supports GCC's syntax for labels as values

    I don't have a Sun machine to test, so I'll leave to someone else to
    check and enable if they want to.

    @pitrou
    Member Author

    pitrou commented Dec 31, 2008

    Attached new patch for fixes suggested by Alexandre (rename
    opcode_targets.c to opcode_targets.h, replace USE_THREADED_CODE with
    USE_COMPUTED_GOTOS).

    @blaisorblade
    Mannequin

    blaisorblade mannequin commented Jan 1, 2009

    Mentioning other versions as well.
    The patch is so easy that it can be backported to all supported
    versions, so I'm adding all of them (2.5 is in bugfix-only mode, and as
    far as I can see this patch cannot be accepted there, sadly).

    @smontanaro
    Contributor

    Paolo> (2.5 is in bugfix-only mode, and as far as I can see this patch
    Paolo> cannot be accepted there, sadly).

    You could backport it to 2.4 & 2.5 and just put it up on PyPI...

    @blaisorblade
    Mannequin

    blaisorblade mannequin commented Jan 1, 2009

    Some other comments.
    The time savings of indirect threading are also associated with the
    removal of the range check, but better branch prediction is the main
    advantage.

    Also, the macro USE_THREADED_CODE should be renamed to something else;
    the word "thread" is too tightly associated with multi-threading.

    That's true.

    Furthermore, threaded code simply refers to code consisting only of
    function calls. Maybe, USE_COMPUTED_GOTO or USE_DIRECT_DISPATCH would be
    better.

    I'd prefer USE_DIRECT_DISPATCH (or better, USE_THREADED_DISPATCH) rather
    than USE_COMPUTED_GOTO, since the latter merely names the language
    construct used.

    "indirect threading" is the standard name in CS literature to define
    this technique. "Direct threading" is a variation where in the bytecode
    array, opcode is replaced by the pointer opcode_handler[opcode], so that
    dispatch is slightly faster. Most interpreters use indirect threading to
    save memory, and because it enables to switch the opcode handlers table
    to activate for instance debugging.
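
    A toy illustration of the two variants (hypothetical opcodes, not
    CPython code):

    #include <stdio.h>

    int main(void)
    {
            static void *handler[] = { &&op_incr, &&op_halt };

            /* Indirect threading: the code array holds opcode numbers,
               and every dispatch loads handler[opcode] from the table
               (which can therefore be swapped, e.g. for debugging). */
            unsigned char code[] = { 0, 0, 1 };
            unsigned char *pc = code;
            int acc = 0;
            goto *handler[*pc++];
    op_incr: acc += 1; goto *handler[*pc++];
    op_halt: printf("indirect: %d\n", acc);  /* prints 2 */

            /* Direct threading: the code array holds the handler
               addresses themselves (each opcode replaced by
               handler[opcode] at load time), saving one table load per
               dispatch at the cost of a pointer-sized slot per
               instruction. */
            void *dcode[] = { &&d_incr, &&d_incr, &&d_halt };
            void **dpc = dcode;
            acc = 0;
            goto **dpc++;
    d_incr: acc += 1; goto **dpc++;
    d_halt: printf("direct: %d\n", acc);     /* prints 2 */
            return 0;
    }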

    The best paper about this is:
    "The Structure and Performance of Efficient Interpreters, M. Anton Ertl
    and David Gregg, 2003".

    The original paper about (direct) threaded code is this:
    "Threaded Code, James R. Bell, Comm. of ACM, 1973", but I don't feel it
    relevant at all today.
    Indirect threading was introduced in "Indirect Threaded Code, Robert B.
    K. Dewar, Communications of the ACM, 1975" (that's just a bit more
    relevant, but still).

    @blaisorblade
    Mannequin

    blaisorblade mannequin commented Jan 1, 2009

    You may want to check out bpo-1408710 in which a similar patch was
    provided, but failed to deliver the desired results.

    It's not really similar, because you don't duplicate the dispatch code.
    It took me some time to understand why you didn't change the "goto
    fast_next_opcode", but that's where you miss the speedup.

    The only difference with your change is that you save the range check
    for the switch, so the slowdown probably comes from some minor output
    change from GCC I guess.

    Anyway, this suggests that the speedup really comes from better branch
    prediction and not from saving the range check. The 1st paper I
    mentioned simply states that saving the range check might make a small
    difference. The point is that sometimes, when you are going to flush
    the pipeline anyway, adding a few instructions, even conditional
    jumps, does not make a difference. I've observed this behaviour quite a
    few times while building a small Python interpreter from scratch.

    I guess (but this might be wrong) that's because the execution units
    were not used at their fullest, and adding conditional jumps doesn't
    make a difference because flushing a pipeline once or twice is almost
    the same (the second flush removes just a few instructions). Or something
    like that; I'm not expert enough in CPU architecture to be sure of such
    guesses.

    @blaisorblade
    Mannequin

    blaisorblade mannequin commented Jan 1, 2009

    Topics

    1. About different speedups on 32bits vs 64 bits
    2. About PPC slowdown
    3. PyPI

    ======= About different speedups on 32bits vs 64 bits =======
    An interpreter is very register-hungry, so on x86_64 it spends much less
    time on register spill (i.e. moving locals from/to memory), so
    instruction dispatch takes a bigger share of execution time. If the rest
    of the interpreter is too slow, indirect threading gives no advantage.

    Look at the number of register variables in PyEval_EvalFrameEx() (btw,
    do you support any compiler where 'register' still makes a difference
    nowadays? That would be quite weird). Lots of them are simply cached
    copies of fields of the current frame and of the current function;
    without copying them to locals, the compiler would have to assume they
    could change at any function call.

    In fact, adding locals this way gave huge speedups on tight loops on the
    Python interpreter I built with a colleague for our student project, to
    experiment with speeding up Python.

    And adding a write to memory in the dispatch code (to f->last_i) gave a
    20% slowdown. Since my interpreter uses a copying garbage collector and
    CPython uses reference counting, which is much slower (if you don't know
    this, show me a fast JVM with reference counting), I'm even surprised
    you can get such a big speedup from threading.

    ======= About PPC slowdown =======
    Somebody should try the patch on Pentium4 as well.

    During our VM course, threading slowed down a toy interpreter with 4 toy
    opcodes. Our teachers suggested that with such a small interpreter,
    since threaded code takes more space (in that case, ~64 vs ~100 bytes),
    this could cause problems with code caches, but suggested checking that
    idea using performance counters. I'm not sure why.

    Right now I have neither a Pentium4 nor a PowerPC available, so I
    can't check myself. But this would be the best way to analyze the
    unexpected performance behaviour.

    ======= PyPI =======
    Paolo> (2.5 is in bugfix-only mode, and as far as I can see this patch
    Paolo> cannot be accepted there, sadly).

    Skip> You could backport it to 2.4 & 2.5 and just put it up on PyPI...
    I was thinking of a private backport as well.
    I didn't know about PyPI; it looks like PyPI is more for contributed
    modules than for this, would that work?

    @blaisorblade
    Mannequin

    blaisorblade mannequin commented Jan 1, 2009

    == On the patch itself ==
    Why don't you use the C preprocessor instead of that Python code?
    Sample code:

    #define OPCODE_LIST(DEFINE_OPCODE) \
        DEFINE_OPCODE(STOP_CODE, 0)                 \
        DEFINE_OPCODE(POP_TOP, 1)                   \
        DEFINE_OPCODE(ROT_TWO, 2)                   \
        DEFINE_OPCODE(ROT_THREE, 3)                 \
        DEFINE_OPCODE(DUP_TOP, 4)                   \
        DEFINE_OPCODE(ROT_FOUR, 5)                  \
        DEFINE_OPCODE(NOP, 9)                       \
        ...

    # define DECL_OPCODE(opcode, value) \
    [value] = && label_ ## opcode,

            void *opcodes[] = {
                    OPCODE_LIST(DECL_OPCODE)
            };
    # undef DECL_OPCODE

    There are also other ways to do it, but using higher-order functions
    within the preprocessor in this way is something that I learned from the
    V8 source code.
    It has the advantage that OPCODE_LIST can be used in a lot of other
    places (maybe when implementing the 'opcode' module, if it's written in C).
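
    For what it's worth, a hypothetical second expansion of the same list
    (building on the OPCODE_LIST above, e.g. for a C implementation of the
    'opcode' module):

    # define OPCODE_NAME(opcode, value) \
            [value] = #opcode,

            static const char *opcode_names[] = {
                    OPCODE_LIST(OPCODE_NAME)
            };
    # undef OPCODE_NAME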

    @pitrou
    Member Author

    pitrou commented Jan 1, 2009

    Hi,

    Why don't you use the C preprocessor instead of that Python code?
    Sample code:

    We would have to change opcode.h for this to be truly useful (in order
    to re-use OPCODE_LIST()). I think that should be the subject of a
    separate bug entry for code reorganization.

    Thanks for all the explanation and pointers! About register allocation,
    I wonder if the "temporary variables" u,v,w could be declared separately
    in each opcode rather than at the top of the eval function, so that the
    compiler doesn't try to store their values between two opcodes.

    As for the "register" declarations, I think they're just remains of the
    past.

    Antoine.

    @smontanaro
    Contributor

    Skip> You could backport it to 2.4 & 2.5 and just put it up on PyPI...

    Paolo> I was thinking of a private backport as well.  I didn't know
    Paolo> about PyPI, it looks like PyPI is more for contributed modules
    Paolo> than for this, would that work?
    

    I don't see why it wouldn't.

    Skip

    @blaisorblade
    Mannequin

    blaisorblade mannequin commented Jan 1, 2009

    We would have to change opcode.h for this to be truly useful (in
    order to re-use OPCODE_LIST()).

    Yep.

    I think that should be the subject of a separate bug entry for code
    reorganization.

    Agreed, I'll maybe try to find time for it.

    Thanks for all the explanation and pointers!
    You're welcome, thanks to you for writing the patch! And
    About register allocation, I wonder if the "temporary variables" u,v,w
    could be declared separately in each opcode rather than at the top of
    the eval function, so that the compiler doesn't try to store their
    values between two opcodes.

    I didn't look at the output, but that shouldn't be needed with decent
    optimizers, since they are local variables, so the compiler has a full
    picture of their usage (this does not happen with the content of the
    heap, where the frame object may lie).

    I think that change would just save some compilation time for dataflow
    analysis, maybe :-). Or it could make clearer which variables are used
    where, but that is a matter of taste (what's there is fine for me).

    I just studied liveness analysis in compilers, and it computes whether a
    variable is live before and after each statement; if the value of a
    variable is not used in some piece of code until the next write to the
    variable, it is considered dead in that piece of code, and that variable
    does not take space; since u, v, w are either unused or are written to
    before usage in all opcodes, they're dead at the beginning of each
    opcode, so they're also dead just before dispatch.

    The only important thing is that the contents of the jump table are
    known to the compiler and that the compiler makes use of that. Simply
    passing a non-const jump table to some function defined elsewhere (which
    could in theory modify it) would make the output much worse.

    @avassalotti
    Member

    Antoine Pitrou wrote:

    [...] count the number of indirect jump instructions in ceval.c:

    grep -E "jmp[[:space:]]\*%" ceval.s

    There should be 85 to 90 of them, roughly. If there are many less, then
    the compiler has tried to optimize them by "sharing" them.

    I get 86 with GCC 4.x and SUNCC. However, with GCC 3.4 I only get a
    single computed goto. Is there some hidden option to make GCC avoid
    sharing jumps?

    Because otherwise the measurements these options are meant to do would
    be meaningless.

    Ah, I see now. Maybe you should add a quick comment that mentions this.

    I don't have a Sun machine to test, so I'll leave to someone else to
    check and enable if they want to.

    I tested it and it worked, no test failures to report. Just change the
    macro test:

    #if defined(__GNUC__) && \
    ...

    to

    #if (defined(__GNUC__) || defined(__SUNPRO_C)) && \
    ...

    I attached some additional benchmarks on SunOS. So far, it seems the
    benefits of the proposed optimization are highly compiler-dependent.

    @avassalotti
    Member

    Attached new patch for fixes suggested by Alexandre (rename
    opcode_targets.c to opcode_targets.h, replace USE_THREADED_CODE with
    USE_COMPUTED_GOTOS).

    You forgot to update your script to use the new name.

    @pitrou
    Member Author

    pitrou commented Jan 1, 2009

    I get 86 with GCC 4.x and SUNCC. However, with GCC 3.4 I only get a
    single computed goto. Is there some hidden option to make GCC avoid
    sharing jumps?

    Try -fno-crossjumping.

    I tested it and it worked, no test failures to report. Just change the
    macro test:

    #if defined(__GNUC__) && \
    ...

    to

    #if (defined(__GNUC__) || defined(__SUNPRO_C)) && \
    ...

    Thanks.

    You forgot to update your script to use the new name.

    Ah, that's rather dumb :)

    @pitrou
    Member Author

    pitrou commented Jan 28, 2009

    For the record, I've compiled py3k on an embarrassingly fast Core2-based
    server (Xeon E5410), and the computed gotos option gives a 16% speedup
    on pybench and pystone.

    (with gcc 4.3.2 in 64-bit mode)

    @kevinwatters
    Mannequin

    kevinwatters mannequin commented Jan 30, 2009

    Does anyone know the equivalent ICC command-line option for GCC's
    -fno-gcse? (Or if it has one?) I can't find a related option in the docs.

    It looks like ICC hits the same combining goto problem, as was
    mentioned: without changing any options, I applied pitrou_dispatch_2.7.patch to release-26maint and pybench reported
    literally no difference, +0.0%.

    Even if stock Python is built with MSVC, devs like myself who ship
    Python would like to see the benefit.

    @mdickinson
    Member

    The x86 gentoo buildbot is failing to compile, with error:

    /Python/makeopcodetargets.py ./Python/opcode_targets.h
    File "./Python/makeopcodetargets.py", line 28
    f.write(",\n".join("\t&&%s" % s for s in targets))
    ^
    SyntaxError: invalid syntax
    make: *** [Python/opcode_targets.h] Error 1

    I suspect that it's because the system Python on this buildbot is Python
    2.3, which doesn't understand generator expressions. Are there any
    objections to me adding a couple of square brackets to this line to turn
    the argument of join into a list comprehension?
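
    For reference, the proposed change is a one-liner (the generator
    expression needs Python 2.4+; the bracketed list comprehension is
    accepted by 2.3 as well):

    f.write(",\n".join("\t&&%s" % s for s in targets))    # 2.4+ only
    f.write(",\n".join(["\t&&%s" % s for s in targets]))  # also OK on 2.3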

    @mdickinson
    Member

    One other thought: it seems that as a result of this change, the py3k
    build process now depends on having some version of Python already
    installed; before this, it didn't. Is this true, or am I misinterpreting
    something?

    Might it be worth adding the file Python/opcode_targets.h to the
    distribution to avoid this problem?

    @mdickinson
    Member

    Sorry: ignore that last. Python/opcode_targets.h is already part of the
    distribution. I don't know what I was doing wrong.

    @pitrou
    Member Author

    pitrou commented Jan 30, 2009

    Mark:

    """Are there any objections to me adding a couple of square brackets to
    this line to turn the argument of join into a list comprehension?"""

    No problems for me. You might also add to the top comments of the file
    that it is 2.3-compatible.

    @mdickinson
    Member

    Square brackets added in r69133. The gentoo x86 3.x buildbot seems to be
    passing the compile stage now. (Though not the test stage, of course:
    one can't have everything!)

    @pitrou
    Member Author

    pitrou commented Jan 31, 2009

    Square brackets added in r69133. The gentoo x86 3.x buildbot seems to be
    passing the compile stage now. (Though not the test stage, of course:
    one can't have everything!)

    The test failure also happens on trunk, it may be related to the recent
    tk changes.

    @mdickinson
    Member

    The test failure also happens on trunk, it may be related to the recent
    tk changes.

    Yes; sorry---I didn't mean to suggest that the test failures were in any
    way related to the opcode dispatch stuff. Apart from the ttk teething
    difficulties, there's a weird 'Unknown signal 32' error that's been going
    on on the gentoo buildbot for many months now. But that's a separate
    issue (issue bpo-4970, to be precise).

    @smontanaro
    Contributor

    This has been checked in, right? Might I suggest that the TARGET and
    TARGET_WITH_IMPL macros not include the trailing colon? I think that
    will make it more friendly toward "smart" editors such as Emacs' C
    mode. I definitely get better indentation with

    TARGET(NOP):
    

    than with

        TARGET(NOP)

    S

    @ggenellina
    Mannequin

    ggenellina mannequin commented Feb 4, 2009

    Might I suggest that the TARGET and TARGET_WITH_IMPL macros not
    include the trailing colon?

    Yes, please!

    @pitrou
    Member Author

    pitrou commented Feb 7, 2009

    Skip, removing the colon doesn't work if the macro adds code after the
    colon :)
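
    To illustrate the constraint, the macro is roughly of this shape (a
    sketch, not the exact ceval.c definition):

    #ifdef USE_COMPUTED_GOTOS
    #define TARGET(op)               \
            TARGET_##op:             \
                    opcode = op;     \
                    if (HAS_ARG(op)) \
                            oparg = NEXTARG();
    #else
    #define TARGET(op) case op:
    #endif

    /* Usage is TARGET(NOP) with no trailing colon: the label's colon and
       the statements after it are hidden inside the expansion, so writing
       "TARGET(NOP):" would leave a stray colon behind the expanded code. */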

    @smontanaro
    Contributor

    Antoine> Skip, removing the colon doesn't work if the macro adds code
    Antoine> after the colon :)

    When I looked I thought both TARGET and TARGET_WITH_IMPL ended with a colon,
    but I see that's not the case. How about removing TARGET_WITH_IMPL and just
    including the goto explicitly? There are only a few uses of
    TARGET_WITH_IMPL.

    Skip

    @aimacintyre
    Mannequin

    aimacintyre mannequin commented Mar 22, 2009

    Out of interest, the attached patch against the py3k branch at r70516
    cleans up the threaded code changes a little:

    • gets rid of TARGET_WITH_IMPL macro;
    • TARGET(op) is followed by a colon, so that it looks like a label (for
      editors that make use of that).

    On my systems (all older AMD with old versions of gcc), this patch has
    performance slightly better than SVN r70516, and performance is
    usually very close to the NO_COMPUTED_GOTOS build.

    @akuchling
    Member

    Is a backport to 2.7 still planned?

    @malemburg
    Member

    On 2009-03-31 03:19, A.M. Kuchling wrote:

    A.M. Kuchling <lists@amk.ca> added the comment:

    Is a backport to 2.7 still planned?

    I hope it is.

    @pitrou
    Member Author

    pitrou commented Mar 31, 2009

    Andrew, your patch disables the optimization that HAS_ARG(op) is a
    constant when op is known by the compiler (that is, inside a
    "TARGET_##op" label), so I'd rather keep the version which is currently
    in SVN.
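
    The point being (a sketch; HAS_ARG and HAVE_ARGUMENT are the real
    definitions from opcode.h):

    #define HAVE_ARGUMENT 90
    #define HAS_ARG(op) ((op) >= HAVE_ARGUMENT)

    /* Inside the expansion of TARGET(LOAD_FAST), the macro argument is
       the literal opcode, so HAS_ARG(124) folds to (124 >= 90), i.e. the
       constant 1, and GCC can delete the dead branch.  Testing
       HAS_ARG(opcode) on the run-time variable instead forces a real
       comparison on every dispatch. */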

    @aimacintyre
    Mannequin

    aimacintyre mannequin commented Apr 10, 2009

    Antoine, in my testing the "loss" of the HAS_ARG() optimisation in my
    patch appears to have negligible cost on i386, but starts to look
    significant on amd64.

    On an Intel E8200 cpu running FreeBSD 7.1 amd64, with gcc 4.2.1 and the
    3.1a2 sources, the computed goto version is ~8% faster (average time of
    all rounds) for pybench (with warp factor set to 3 rather than the
    default 10, to get the round time up over 10s) than without computed
    gotos. With my patch applied, the computed goto version is ~5.5% faster
    than without computed gotos by the same measure. On this platform,
    Pystone rates at ~86k (no computed gotos), ~85k (computed gotos) and
    ~82k (computed gotos + my patch).

    For comparison, this machine running Windows XP (32 bit) with the
    python.org builds rates ~92k pystones for 2.6.1 and ~81k for 3.1a2.
    Pybench isn't distributed in the MSI installers :-(

    @mdionisio
    Mannequin

    mdionisio mannequin commented Jul 18, 2009

    I have patched the code of python3.1 to use the computed goto technique
    with Visual Studio as well. The performance result is not good (I really
    don't know why), but it is a good work-around for using computed gotos
    on Windows too.
    The only difference is that the opcode_targets vector is filled at run-time.

    @pitrou
    Member Author

    pitrou commented Jul 19, 2010

    This is too late for 2.x now, closing.

    @pitrou pitrou closed this as completed Jul 19, 2010
    @parasa
    Mannequin

    parasa mannequin commented May 28, 2015

    Hi All,

    This is Vamsi from Server Scripting Languages Optimization team at Intel Corporation.

    Would like to submit a request to enable the computed goto based dispatch in Python 2.x (which happens to be enabled by default in Python 3 given its performance benefits on a wide range of workloads). We talked about this patch with Guido and he encouraged us to submit a request on Python-dev (email conversation with Guido shown at the bottom of this email).

    Attached is the computed goto patch (along with instructions to run) for Python 2.7.10 (based on the patch submitted by Jeffrey Yasskin at http://bugs.python.org/issue4753). We built and tested this patch for Python 2.7.10 on a Linux machine (Ubuntu 14.04 LTS server, Intel Xeon – Haswell EP CPU with 18 cores, hyper-threading off, turbo off).

    Below is a summary of the performance we saw on the “grand unified python benchmarks” suite (available at https://hg.python.org/benchmarks/). We made 3 rigorous runs of the following benchmarks. In each rigorous run, a benchmark is run 100 times with and without the computed goto patch. Below we show the average performance boost for the 3 rigorous runs.
    -----
    Instructions to run the computed goto patch

    1. Apply the patch and then generate the new configure script (autoconf configure.ac > configure)
    2. Enable execute permissions for Python/makeopcodetargets.py (sudo chmod +x Python/makeopcodetargets.py)
    3. To enable computed gotos, do: ./configure --with-computed-gotos
    4. Build the new python binary using make && sudo make install

    Python 2.7.10 (original) vs Computed Goto performance
    Benchmark                Delta run #1 %  Delta run #2 %  Delta run #3 %  Avg. delta %
    iterative_count                   24.48           24.36           23.78          24.2
    unpack_sequence                   19.06           18.47           19.48          19.0
    slowspitfire                      14.36           13.41           16.65          14.8
    threaded_count                    15.85           13.43           13.93          14.4
    pystone                           10.68           11.67           11.08          11.1
    nbody                             10.25            8.93            9.28           9.5
    go                                 7.96            8.76            7.69           8.1
    pybench                            6.3             6.8             7.2            6.8
    spectral_norm                      5.49            9.37            4.62           6.5
    float                              6.09            6.2             6.96           6.4
    richards                           6.19            6.41            6.42           6.3
    slowunpickle                       6.37            8.78            3.55           6.2
    json_dump_v2                       1.96           12.53            3.57           6.0
    call_simple                        6.37            5.91            3.92           5.4
    chaos                              4.57            5.34            3.85           4.6
    call_method_slots                  2.63            3.27            7.71           4.5
    telco                              5.18            1.83            6.47           4.5
    simple_logging                     3.48            1.57            7.4            4.2
    call_method                        2.61            5.4             3.88           4.0
    chameleon                          2.03            6.26            3.2            3.8
    fannkuch                           3.89            3.19            4.39           3.8
    silent_logging                     4.33            3.07            3.39           3.6
    slowpickle                         5.72           -1.12            6.06           3.6
    2to3                               2.99            3.6             3.45           3.3
    etree_iterparse                    3.41            2.51            3              3.0
    regex_compile                      3.44            2.48            2.84           2.9
    mako_v2                            2.14            1.29            5.22           2.9
    meteor_contest                     2.01            2.2             3.88           2.7
    django                             6.68           -1.23            2.56           2.7
    formatted_logging                  1.97            5.82           -0.11           2.6
    hexiom2                            2.83            2.1             2.55           2.5
    django_v2                          1.93            2.53            2.92           2.5
    etree_generate                     2.38            2.13            2.51           2.3
    mako                              -0.3             9.66           -3.11           2.1
    bzr_startup                        0.35            1.97            3              1.8
    etree_process                      1.84            1.01            1.9            1.6
    spambayes                          1.76            0.76            0.48           1.0
    regex_v8                           1.96           -0.66            1.63           1.0
    html5lib                           0.83            0.72            0.97           0.8
    normal_startup                     1.41            0.39            0.24           0.7
    startup_nosite                     1.2             0.41            0.42           0.7
    etree_parse                        0.24            0.9             0.79           0.6
    json_load                          1.38            0.56           -0.25           0.6
    pidigits                           0.45            0.33            0.28           0.4
    hg_startup                         0.32            2.07           -1.41           0.3
    rietveld                           0.05            0.91           -0.43           0.2
    tornado_http                       2.34           -0.92           -1.27           0.1
    call_method_unknown                0.72            1.26           -1.85           0.0
    raytrace                          -0.35           -0.75            0.94          -0.1
    regex_effbot                       1.97           -1.18           -2.57          -0.6
    fastunpickle                      -1.65            0.5            -0.88          -0.7
    nqueens                           -2.24           -1.53           -0.81          -1.5
    fastpickle                        -0.74            1.98           -6.26          -1.7

    Thanks,
    Vamsi

    ------------------------------------------------------------------------------------------------------------------------------------------------------------
    From: gvanrossum@gmail.com [mailto:gvanrossum@gmail.com] On Behalf Of Guido van Rossum
    Sent: Tuesday, May 19, 2015 1:59 PM
    To: Cohn, Robert S
    Cc: R. David Murray (r.david.murray@murrayandwalker.com)
    Subject: Re: meeting at PyCon

    Hi Robert and David,
    I just skimmed that thread. There were a lot of noises about backporting it to 2.7 but the final message on the topic, by Antoine, claimed it was too late for 2.7. However, that was before we had announced the EOL extension of 2.7 till 2020, and perhaps we were also in denial about 3.x uptake vs. 2.x. So I think it's definitively worth bringing this up. I would start with a post on python-dev linking to the source code for your patch, and adding a message to the original tracker issue too (without reopening it though -- just so the people who were on the bug will be pinged about it).
    Because of backwards compatibility with previous 2.7.x releases, it's very important that the patch not break anything -- in particular this means you can't add opcodes or change their specification. You will also undoubtedly be asked to test this on a variety of platforms 32-bit and 64-bit that people care about. But I'm sure you're expecting all that. :-)

    You might also check with Benjamin Peterson, who is the 2.7 release manager. I think he just announced 2.7.10, so it's too late for that, but I assume we'll keep doing 2.7.x releases until 2020.
    Good luck,

    --Guido

    PS. I am assuming you are contributing this under a PSF-accepted license, e.g. Apache 2.0, otherwise it's an automatic nogo.

    On Tue, May 19, 2015 at 9:33 AM, Cohn, Robert S <robert.s.cohn@intel.com> wrote:
    Hi Guido,

    When we met for lunch at pycon, I asked if performance related patches would be ok for python 2.x. My understanding is that you thought it was possible if it did not create a maintainability problem. We have an example now, a 2.7 patch for computed goto based on the implementation in python 3 http://bugs.python.org/issue4753 It increases performance by up to 10% across a wide range of workloads.

    As I mentioned at lunch, we hired David Murray’s company, and he is guiding intel through the development process for cpython. David and I thought it would be good to run this by you before raising the issue on python-dev. Do you have a specific concern about this patch or a more general concern about performance patches to 2.7? Thanks.

    Robert
    --------

    @rbtcollins
    Member

    FWIW I'm interested and willing to poke at this if more testers/reviewers are needed.

    @ned-deily
    Member

    @vamsi, could you please open a new issue and attach your patch there so it can be properly tracked for 2.7? This issue has been closed for five years and the code has been out in the field for a long time in Python 3. Thanks!

    @python-dev
    Mannequin

    python-dev mannequin commented May 28, 2015

    New changeset 17d3bbde60d2 by Benjamin Peterson in branch '2.7':
    backport computed gotos (bpo-4753)
    https://hg.python.org/cpython/rev/17d3bbde60d2

    @db3l
    Contributor

    db3l commented Jun 1, 2015

    The 2.7 back-ported version of this patch appears to have broken compilation on the Windows XP buildbot, during the OpenSSL build process, when the newly built Python is used to execute the build_ssl.py script.

    After this patch, when that stage executes, and prior to any output from the build script, the python_d process goes to 100% CPU and sticks there until the build process times out 1200s later and kills it.

    I don't think it's really ssl related though, as after doing some debugging the exact same thing happens if I simply run python_d (I never see a prompt - it just sits there burning CPU). So I think build_ssl.py is just the first use of the generated python_d during the build process.

    I did try attaching to the CPU-stuck version of python_d from VS, and so far from what I can see, the process never gets past the Py_Initialize() call in Py_Main(). It's all over the place in terms of locations if I try interrupting it, but it's always stuck inside that first Py_Initialize call.

    I'm not sure if it's something environmental on my slave, or a difference with a debug vs. production build (I haven't had a chance to try building a release version yet).

    -- David

    @db3l
    Contributor

    db3l commented Jun 1, 2015

    I ran a few more tests, and the generated executable hangs in both release and debug builds. The closest I can get at the moment is that it's stuck importing errno from the "import sys, errno" line in os.py - at least no matter how long I wait after starting a process before breaking out, output with -v looks like:

    > python_d -v
    # installing zipimport hook
    import zipimport # builtin
    # installed zipimport hook
    # D:\cygwin\home\db3l\test\python2.7\lib\site.pyc matches D:\cygwin\home\db3l\test\python2.7\lib\site.py
    import site # precompiled from D:\cygwin\home\db3l\test\python2.7\lib\site.pyc
    # D:\cygwin\home\db3l\test\python2.7\lib\os.pyc matches D:\cygwin\home\db3l\test\python2.7\lib\os.py
    import os # precompiled from D:\cygwin\home\db3l\test\python2.7\lib\os.pyc
    import errno # builtin
    Traceback (most recent call last):
      File "D:\cygwin\home\db3l\test\python2.7\lib\site.py", line 62, in <module>
        import os
      File "D:\cygwin\home\db3l\test\python2.7\lib\os.py", line 26, in <module>
        import sys, errno
    KeyboardInterrupt
    # clear __builtin__._
    # clear sys.path
    # clear sys.argv
    # clear sys.ps1
    # clear sys.ps2
    # clear sys.exitfunc
    # clear sys.exc_type
    # clear sys.exc_value
    # clear sys.exc_traceback
    # clear sys.last_type
    # clear sys.last_value
    # clear sys.last_traceback
    # clear sys.path_hooks
    # clear sys.path_importer_cache
    # clear sys.meta_path
    # clear sys.flags
    # clear sys.float_info
    # restore sys.stdin
    # restore sys.stdout
    # restore sys.stderr
    # cleanup __main__
    # cleanup[1] zipimport
    # cleanup[1] errno
    # cleanup[1] signal
    # cleanup[1] exceptions
    # cleanup[1] _warnings
    # cleanup sys
    # cleanup __builtin__
    [8991 refs]
    # cleanup ints: 6 unfreed ints
    # cleanup floats

    I never have a problem interrupting the process, so KeyboardInterrupt is processed normally - it just looks like it gets stuck in an infinite loop during startup.

    -- David

    @bitdancer
    Member

    Please open a new issue with the details about your problem.

    @db3l
    Contributor

    db3l commented Jun 2, 2015

    Oops, sorry, I had just followed the commit comment to this issue. For the record here, it looks like Benjamin has committed an update (5e8fa1b13516) that resolves the problem.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022