Author blaisorblade
Recipients alexandre.vassalotti, blaisorblade, christian.heimes, lemburg, pitrou, rhettinger, skip.montanaro
Date 2009-01-03.01:51:45
SpamBayes Score 3.18594e-11
Marked as misclassified No
Message-id <1230947507.57.0.625781170189.issue4753@psf.upfronthosting.co.za>
In-reply-to
Content
About miscompilations: the current patch is a bit weird for GCC, because
you keep both the switch and the computed goto.

But actually, there is no case in which the switch is needed, and
computed goto give less room to GCC's choices.

So, can you try dropping the switch altogether, using always computed
goto and seeing how does the resulting code get compiled? I see you'll
need two labels (before and after argument fetch) per opcode and two
dispatch tabels, but that's no big deal (except for code alignment -
align just the common branch target).

An important warning is that by default, on my system, GCC 4.2 aligns
branch targets for switch to a 16-byte boundary (as recommended by the
Intel optimization guide), by adding a ".p2align 4,,7" GAS directive,
and it does not do that for computed goto.

Adding the directive by hand gave a small speedup, 2% I think; I should
try -falign-jumps=16 if it's not enabled (some -falign-jumps is enabled
by -O2), since that is supposed to give the same result.

Please use that yourself as well, and verify it works for labels, even
if I fear it doesn't.

> However, I don't know why the speed up due to the patch is much
more significant on x86-64 than on x86.

It's Amdahl's law, even if this is not about parallel code. When the
rest is faster (x86_64), the same speedup on dispatch gives a bigger
overall speedup.

To be absolutely clear: x86_64 has more registers, so the rest of the
interpreter is faster than x86, but dispatch still takes the same
absolute time, which is 70% on x86_64, but only 50% on x86 (those are
realistic figures); if this patch halved dispatch time on both (we're
not so lucky), we would save 35% on x86_64 but only 25% on x86.
In fact, on inefficient interpreters, indirect threading is useless
altogether.

So, do those extra register help _so_ much? Yes. In my toy interpreter,
computing last_i for each dispatch doesn't give any big slowdown, but
storing it in f->last_i gives a ~20% slowdown - I cross-checked multiple
times because I was astonished. Conversely, when the program counter had
to be stored in memory, I think it was like 2x slower.
History
Date User Action Args
2009-01-03 01:51:47blaisorbladesetrecipients: + blaisorblade, lemburg, skip.montanaro, rhettinger, pitrou, christian.heimes, alexandre.vassalotti
2009-01-03 01:51:47blaisorbladesetmessageid: <1230947507.57.0.625781170189.issue4753@psf.upfronthosting.co.za>
2009-01-03 01:51:47blaisorbladelinkissue4753 messages
2009-01-03 01:51:45blaisorbladecreate