
Author blaisorblade
Recipients ajaksu2, alexandre.vassalotti, bboissin, blaisorblade, christian.heimes, collinwinter, djc, facundobatista, gregory.p.smith, jyasskin, lemburg, pitrou, ralph.corderoy, rhettinger, skip.montanaro, theatrus
Date 2009-01-10.08:52:26
The standing question is still: can we get ICC to produce the expected 
output? It looks like we still haven't managed, and since ICC is the 
best compiler out there, this matters.
There are also some problems with SunCC: even though it doesn't do jump 
sharing, it seems that one doesn't get the speedups. I guess that on 
most platforms we should select the most common alternative for 
interpreters (i.e. no switch, one jump table, given by 
threadedceval5.patch + 
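For reference, the two dispatch schemes being compared here - the plain switch and the one-jump-table threaded form - can be sketched with toy opcodes. This is not CPython's actual eval loop, just a minimal illustration; the threaded version uses GCC's labels-as-values extension, which is what the patch relies on:

```c
/* Toy bytecode, purely illustrative -- not CPython's opcodes. */
enum { OP_PUSH1, OP_ADD, OP_HALT };

/* Switch dispatch: the compiler typically emits one shared indirect jump
 * for the whole switch, so every opcode transition mispredicts alike. */
static int run_switch(const unsigned char *code) {
    int stack[16], sp = 0;
    for (;;) {
        switch (*code++) {
        case OP_PUSH1: stack[sp++] = 1; break;
        case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
        case OP_HALT:  return stack[sp - 1];
        }
    }
}

/* Threaded dispatch (GCC labels-as-values): one jump table, and each
 * handler ends with its own indirect jump, so the branch predictor sees
 * a separate branch per opcode and can learn opcode-pair correlations. */
static int run_threaded(const unsigned char *code) {
    static void *dispatch[] = { &&op_push1, &&op_add, &&op_halt };
    int stack[16], sp = 0;
#define DISPATCH() goto *dispatch[*code++]
    DISPATCH();
op_push1: stack[sp++] = 1;                   DISPATCH();
op_add:   sp--; stack[sp - 1] += stack[sp];  DISPATCH();
op_halt:  return stack[sp - 1];
#undef DISPATCH
}
```

Both functions run the same bytecode; the only difference is where the indirect jumps live, which is exactly what the compilers above treat differently.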

On core platforms we can spend time on fine-tuning - and the definition 
of "core platforms" is given by "do developers want to test for that?".

When that's fixed, I think that we just have to choose the simpler form 
and merge that.

[about removing the switch]
> There is no speed difference on pybench on x86; on x86-64, the code 
> is slower due to the opcode fetching change.

Actually, on my machine it looks like the difference is caused by the 
different code layout resulting from the switch removal, or something 
like that, because fixing the opcode fetching doesn't make a difference 
here (see 

Indeed, I did my benchmarking duties. The results are that 
abstract-switch-reduced.diff (the one removing the switch) gives a 1-3% 
slowdown, and that all the others don't make a significant difference. 
The differences in the assembly output seem to be due to a different 
code layout for some branches; I didn't take a closer look.

However, experimenting with -falign-labels=16 can give a small speedup; 
I'm trying to improve the results (what I actually want is to align 
just the opcode handlers, and I'll probably do that by hand).
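To spell out what the alignment buys: -falign-labels=16 asks GCC to start every label on a 16-byte boundary, so each handler begins on a fresh instruction-fetch line - but it applies to all labels, which is why doing only the opcode handlers by hand is attractive. A small sketch of the property being enforced, using GCC's aligned attribute on a function in place of a label (hypothetical, just to show the alignment predicate):

```c
#include <stdint.h>

/* GCC's aligned attribute forces a 16-byte-aligned entry point -- the
 * same constraint -falign-labels=16 imposes on every label. */
__attribute__((aligned(16)))
void handler(void) {}

/* An address is 16-byte aligned iff its low four bits are all zero. */
int is_16_byte_aligned(uintptr_t addr) {
    return (addr & 15u) == 0;
}
```

Hand-aligning only the handlers would keep this property where it matters for dispatch without padding every other label in the function.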

reenable-static-prediction can give either a slowdown or a speedup of 
around 1%, i.e. within the statistical noise.
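For readers following along, the static prediction in question is the PREDICT/PREDICTED-style pattern: after an opcode that is very often followed by a specific successor, peek at the next opcode and jump straight to its handler, skipping the generic dispatch. A hedged sketch with the same toy opcodes as above (not CPython's actual macros):

```c
enum { OP_PUSH1, OP_ADD, OP_HALT };

/* If the next opcode matches the prediction, consume it and jump
 * directly to the predicted handler, bypassing the switch dispatch. */
#define PREDICT(op) do { if (*code == op) { code++; goto PRED_##op; } } while (0)

static int run_predicted(const unsigned char *code) {
    int stack[16], sp = 0;
    for (;;) {
        switch (*code++) {
        case OP_PUSH1:
            stack[sp++] = 1;
            PREDICT(OP_ADD);  /* assume PUSH1 is very often followed by ADD */
            break;
        case OP_ADD:
        PRED_OP_ADD:          /* predicted entry point, reached by goto */
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_HALT:
            return stack[sp - 1];
        }
    }
}
```

With threaded dispatch the predictor already learns such opcode pairs on its own, which is presumably why re-enabling this on top of the patch stays within the noise.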

Note that on my machine I get only a 10% speedup with the base patch, 
and that is more reasonable here. In the original thread on PyPy-dev I 
got a 20% speedup with the Python interpreter I built for my student 
project, since that one is faster* (by a 2-3x factor, like PyVM); the 
dispatch cost is therefore more significant, and reducing it has a 
bigger impact. In fact, I couldn't believe that Python got the same 
speedup.
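To make that arithmetic concrete, here is a toy model (my numbers are illustrative, not measurements): treat the patch as removing a fixed absolute amount of dispatch time per run, which is then a larger fraction of a shorter run.

```c
/* Model the patch as removing a constant absolute amount of dispatch time.
 * base_speedup: speedup the patch gives the slow interpreter (e.g. 1.10);
 * factor: how much faster the other interpreter is overall (e.g. 2.5). */
double speedup_on_faster_interp(double base_speedup, double factor) {
    double base  = 1.0;                         /* slow interpreter, normalized */
    double saved = base - base / base_speedup;  /* absolute dispatch time removed */
    double fast  = base / factor;               /* the faster interpreter's run time */
    return fast / (fast - saved);               /* same saving, bigger relative win */
}
```

With base_speedup = 1.10 and factor = 2.5 this gives roughly 1.29x: the same absolute saving worth 10% on the slow interpreter is worth nearly 30% on one that is 2.5x faster, which is the direction of the 10%-vs-20% gap described above.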

This is a Core 2 Duo T7200 (Merom) in 64bit mode with 4MB of L2 cache, 
and since it's a laptop I expect it to have slower RAM than a desktop.

> The patch make a huge difference on 64-bit Linux. I get a 20% 
> speed-up and the lowest run time so far. That is quite impressive!

Which processor is that?

> The machine I got the 15% speedup on is in 64-bit mode with gcc

Which processor is it? I guess the biggest speedups should be on the 
Pentium 4, since it has the biggest mispredict penalties.

*DISCLAIMER: the interpreter of our group (me and Sigurd Meldgaard) is 
not complete, has some bugs, and the source code has not yet been 
published, so discussion about why it is faster shall not happen here - 
I want to avoid any flame.
I believe it's not because of skipped runtime checks or such stuff, but 
because we used garbage collection instead of refcounting, indirect 
threading, and tagged integers - but I don't have time to discuss that. 
The original thread on pypy-dev has some insights if you are interested 
in this.