Message 79532 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	blaisorblade
Recipients	ajaksu2, alexandre.vassalotti, bboissin, blaisorblade, christian.heimes, collinwinter, djc, facundobatista, gregory.p.smith, jyasskin, lemburg, pitrou, ralph.corderoy, rhettinger, skip.montanaro, theatrus
Date	2009-01-10.08:52:26
SpamBayes Score	0.015545695
Marked as misclassified	No
Message-id	<1231577551.9.0.477655297426.issue4753@psf.upfronthosting.co.za>
In-reply-to

Content
The standing question is still: can we get ICC to produce the expected output? It looks like we still didn't manage, and since ICC is the best compiler out there, this matters. Some problems with SunCC, even if it doesn't do jump sharing, it seems that one doesn't get the speedups - I guess that on most platforms we should select the most common alternative for interpreters (i.e. no switch, one jump table, given by threadedceval5.patch + abstract-switch-reduced.diff). On core platforms we can spend time on fine-tuning - and the definition of "core platforms" is given by "do developers want to test for that?". When that's fixed, I think that we just have to choose the simpler form and merge that. @alexandre: [about removing the switch] > There is no speed difference on pybench on x86; on x86-64, the code is slower due to the opcode fetching change. Actually, on my machine it looks like the difference is caused by the different layout caused by switch removal, or something like that, because fixing the opcode fetching doesn't make a difference here (see below). Indeed, I did my benchmarking duties. Results are that abstract-switch-reduced.diff (the one removing the switch) gives a 1-3% slowdown, and that all the others don't make a significant difference. The differences in the assembly output seem to be due to a different code layout for some branches, I didn't have a closer look. However, experimenting with -falign-labels=16 can give a small speedup, I'm trying to improve the results (what I actually want is to align just the opcode handlers, I'll probably do that by hand). reenable-static-prediction can give either a slowdown or a speedup by around 1%, i.e. around the statistical noise. Note that on my machine, I get only a 10% speedup with the base patch, and that is more reasonable here. In the original thread on PyPy-dev, I got a 20% one with the Python interpreter I built for my student project, since that one is faster* (by a 2-3x factor, like PyVM), so the dispatch cost is more significant, and reducing it has a bigger impact. In fact, I couldn't believe that Python got the same speedup. This is a Core 2 Duo T7200 (Merom) in 64bit mode with 4MB of L2 cache, and since it's a laptop I expect it to have slower RAM than a desktop. @alexandre: > The patch make a huge difference on 64-bit Linux. I get a 20% speed-up and the lowest run time so far. That is quite impressive! Which processor is that? @pitrou: > The machine I got the 15% speedup on is in 64-bit mode with gcc 4.3.2. Which is the processor? I guess the bigger speedups should be on Pentium4, since it has the bigger mispredict penalties. ==== *DISCLAIMER: the interpreter of our group (me and Sigurd Meldgaard) is not complete, has some bugs, and the source code has not yet been published, so discussion about why it is faster shall not happen here - I want to avoid any flame. I believe it's not because of skipped runtime checks or such stuff, but because we used garbage collection instead of refcounting, indirect threading and tagged integers, but I don't have time to discuss that yet. The original thread on pypy-dev has some insights if you are interested on this.

The standing question is still: can we get ICC to produce the expected 
output? It looks like we still didn't manage, and since ICC is the best 
compiler out there, this matters.
Some problems with SunCC, even if it doesn't do jump sharing, it seems 
that one doesn't get the speedups - I guess that on most platforms we 
should select the most common alternative for interpreters (i.e. no 
switch, one jump table, given by threadedceval5.patch + 
abstract-switch-reduced.diff).

On core platforms we can spend time on fine-tuning - and the definition 
of "core platforms" is given by "do developers want to test for that?".

When that's fixed, I think that we just have to choose the simpler form 
and merge that.

@alexandre:
[about removing the switch]
> There is no speed difference on pybench on x86; on x86-64, the code 
is slower due to the opcode fetching change.

Actually, on my machine it looks like the difference is caused by the 
different layout caused by switch removal, or something like that, 
because fixing the opcode fetching doesn't make a difference here (see 
below).

Indeed, I did my benchmarking duties. Results are that 
abstract-switch-reduced.diff (the one removing the switch) gives a 1-3% 
slowdown, and that all the others don't make a significant difference. 
The differences in the assembly output seem to be due to a different 
code layout for some branches, I didn't have a closer look.

However, experimenting with -falign-labels=16 can give a small speedup, 
I'm trying to improve the results (what I actually want is to align 
just the opcode handlers, I'll probably do that by hand).

reenable-static-prediction can give either a slowdown or a speedup by 
around 1%, i.e. around the statistical noise.

Note that on my machine, I get only a 10% speedup with the base patch, 
and that is more reasonable here. In the original thread on PyPy-dev, I 
got a 20% one with the Python interpreter I built for my student 
project, since that one is faster* (by a 2-3x factor, like PyVM), so 
the dispatch cost is more significant, and reducing it has a bigger 
impact. In fact, I couldn't believe that Python got the same speedup.

This is a Core 2 Duo T7200 (Merom) in 64bit mode with 4MB of L2 cache, 
and since it's a laptop I expect it to have slower RAM than a desktop.

@alexandre:
> The patch make a huge difference on 64-bit Linux. I get a 20% 
speed-up and the lowest run time so far. That is quite impressive!
Which processor is that?

@pitrou:
> The machine I got the 15% speedup on is in 64-bit mode with gcc
4.3.2.

Which is the processor? I guess the bigger speedups should be on 
Pentium4, since it has the bigger mispredict penalties.

====
*DISCLAIMER: the interpreter of our group (me and Sigurd Meldgaard) is 
not complete, has some bugs, and the source code has not yet been 
published, so discussion about why it is faster shall not happen here - 
I want to avoid any flame.
I believe it's not because of skipped runtime checks or such stuff, but 
because we used garbage collection instead of refcounting, indirect  
threading and tagged integers, but I don't have time to discuss that 
yet.
The original thread on pypy-dev has some insights if you are interested 
on this.

History
Date	User	Action	Args
2009-01-10 08:52:32	blaisorblade	set	recipients: + blaisorblade, lemburg, skip.montanaro, collinwinter, rhettinger, facundobatista, gregory.p.smith, pitrou, christian.heimes, ajaksu2, alexandre.vassalotti, jyasskin, djc, ralph.corderoy, bboissin, theatrus
2009-01-10 08:52:31	blaisorblade	set	messageid: <1231577551.9.0.477655297426.issue4753@psf.upfronthosting.co.za>
2009-01-10 08:52:30	blaisorblade	link	issue4753 messages
2009-01-10 08:52:26	blaisorblade	create