This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Author blaisorblade
Recipients alexandre.vassalotti, arigo, blaisorblade, christian.heimes, lemburg, pitrou, rhettinger, skip.montanaro
Date 2009-01-01.05:03:02
SpamBayes Score 4.47613e-11
Marked as misclassified No
Message-id <>
1) About different speedups on 32bits vs 64 bits
2) About PPC slowdown
3) PyPI

======= About different speedups on 32bits vs 64 bits =======
An interpreter is very register-hungry, and x86_64 has twice as many
general-purpose registers as x86, so on x86_64 the interpreter spends much
less time on register spills (i.e. moving locals to/from memory) and
instruction dispatch takes a bigger share of execution time. If the rest
of the interpreter is too slow, indirect threading gives no advantage.
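The switch-vs-threaded dispatch difference can be sketched with a toy
interpreter. The opcode set here is made up for illustration, and the
threaded version relies on GCC/Clang's "labels as values" extension (the
kind of computed-goto dispatch the patch under discussion uses):

```c
#include <assert.h>

/* Toy opcodes for illustration only (not CPython's real opcode set). */
enum { OP_INC, OP_DEC, OP_HALT };

/* Switch dispatch: every opcode returns to one shared indirect jump,
 * which the branch predictor handles poorly. */
static int run_switch(const unsigned char *code) {
    int acc = 0;
    for (;;) {
        switch (*code++) {
        case OP_INC: acc++; break;
        case OP_DEC: acc--; break;
        case OP_HALT: return acc;
        }
    }
}

/* Threaded dispatch (GCC/Clang "labels as values"): each opcode ends
 * with its own indirect jump, so the predictor sees per-opcode history. */
static int run_threaded(const unsigned char *code) {
    static void *dispatch[] = { &&op_inc, &&op_dec, &&op_halt };
    int acc = 0;
#define DISPATCH() goto *dispatch[*code++]
    DISPATCH();
op_inc: acc++; DISPATCH();
op_dec: acc--; DISPATCH();
op_halt: return acc;
#undef DISPATCH
}
```

Both loops compute the same result; the difference is only in how the
next opcode's handler is reached.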

Look at the number of register variables in PyEval_EvalFrameEx() (by the
way, is there any compiler nowadays where 'register' still makes a
difference? That would be quite weird). Lots of them are simply cached
copies of fields of the current frame and of the current function;
without copying them to locals, the compiler would have to assume those
fields could change at any function call.
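The caching pattern looks roughly like this; the struct and field names
below are illustrative stand-ins, not CPython's real frame layout:

```c
#include <assert.h>

/* Toy stand-in for an interpreter frame; field names are made up. */
typedef struct {
    int *stack;
    int stack_top;
} Frame;

/* Reads the fields through the pointer on every iteration. In real
 * interpreter code with opaque calls inside the loop, the compiler
 * must reload them after each call, since the callee might have
 * modified the frame. */
static int pop_all_slow(Frame *f) {
    int total = 0;
    while (f->stack_top > 0)
        total += f->stack[--f->stack_top];
    return total;
}

/* Caches the hot fields in locals (which can live in registers) and
 * writes back once at the end -- the pattern PyEval_EvalFrameEx uses
 * for its frame and function fields. */
static int pop_all_fast(Frame *f) {
    int *stack = f->stack;   /* cached copies of frame fields */
    int top = f->stack_top;
    int total = 0;
    while (top > 0)
        total += stack[--top];
    f->stack_top = top;      /* write the cached value back */
    return total;
}
```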

In fact, adding locals this way gave huge speedups on tight loops in the
Python interpreter I built with a colleague for our student project, to
experiment with speeding up Python.

And adding a write to memory in the dispatch code (to f->last_i) gave a
20% slowdown. My interpreter uses a copying garbage collector, while
CPython uses reference counting, which is much slower (if you don't
believe this, show me a fast JVM with reference counting), so I'm
actually surprised you can get such a big speedup from threading.
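The extra store is one line in the dispatch macro. A minimal sketch,
again with an invented opcode set and a toy frame whose last_i field
mimics CPython's bookkeeping (GCC/Clang labels-as-values):

```c
#include <assert.h>

typedef struct { int last_i; } Frame;

enum { OP_INC, OP_HALT };

/* Threaded dispatch that records the current bytecode offset in the
 * frame before every jump -- one extra store to memory per opcode,
 * which is the kind of write measured above at ~20%. */
static int run_tracked(Frame *f, const unsigned char *first_instr) {
    static void *dispatch[] = { &&op_inc, &&op_halt };
    const unsigned char *next_instr = first_instr;
    int acc = 0;
#define DISPATCH() do { \
        f->last_i = (int)(next_instr - first_instr); /* the extra store */ \
        goto *dispatch[*next_instr++]; \
    } while (0)
    DISPATCH();
op_inc: acc++; DISPATCH();
op_halt: return acc;
#undef DISPATCH
}
```

Dropping the `f->last_i = ...` line from the macro removes one store
per executed opcode, which is what made the 20% difference in my tests.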

======= About PPC slowdown =======
Somebody should try the patch on Pentium4 as well.

During our VM course, threading slowed down a toy interpreter with 4 toy
opcodes. Since threaded code takes more space (in that case, ~100 bytes
threaded vs. ~64), our teachers suggested that with such a small
interpreter this could cause problems with the code cache, and
recommended verifying that hypothesis with performance counters. I'm not
sure why, though.

Right now I have neither a Pentium4 nor a PowerPC available, so I can't
check myself. But performance counters are the best way to analyze this
unexpected performance behaviour.

======= PyPI =======
Paolo> (2.5 is in bugfix-only mode, and as far as I can see this patch
Paolo> cannot be accepted there, sadly).

Skip> You could backport it to 2.4 & 2.5 and just put it up on PyPI...
I was thinking of a private backport as well.
I didn't know about PyPI; it looks like PyPI is more for contributed
modules than for something like this. Would that work?