Message 78923 - Python tracker

Author	blaisorblade
Recipients	alexandre.vassalotti, blaisorblade, christian.heimes, lemburg, pitrou, rhettinger, skip.montanaro
Date	2009-01-03.01:51:45
Message-id	<1230947507.57.0.625781170189.issue4753@psf.upfronthosting.co.za>
In-reply-to

Content

About miscompilations: the current patch is a bit weird for GCC, because
you keep both the switch and the computed goto.

But actually, there is no case in which the switch is needed, and
computed gotos leave GCC less room for choices of its own.

So, can you try dropping the switch altogether, always using computed
goto, and see how the resulting code gets compiled? I see you'll
need two labels (before and after argument fetch) per opcode and two
dispatch tables, but that's no big deal (except for code alignment -
align just the common branch target).
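
To make the "two labels, two tables" idea concrete, here is a minimal
sketch of a switch-free loop. It is a toy, not the patch and not
ceval.c: the opcodes, the one-byte argument encoding and the names
(run_threaded, fetch_targets, ready_targets) are all made up for
illustration, and it needs GCC's labels-as-values extension (Clang
accepts it too). I am assuming the second entry point exists for
cases like EXTENDED_ARG, where the argument is already assembled when
the opcode is entered.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy opcodes; PUSH and EXTENDED_ARG carry a one-byte argument. */
enum { OP_HALT, OP_PUSH, OP_ADD, OP_EXTENDED_ARG, OP_COUNT };

/* Runs toy bytecode and returns the top of the stack at OP_HALT.
 * No bounds checks on opcodes, stack depth or code length: toy only. */
long run_threaded(const unsigned char *code)
{
    /* Table 1: entry points that fetch their own argument. */
    static void *const fetch_targets[OP_COUNT] = {
        &&halt_fetch, &&push_fetch, &&add_fetch, &&extended_arg_fetch
    };
    /* Table 2: entry points used when oparg has already been set. */
    static void *const ready_targets[OP_COUNT] = {
        &&halt_ready, &&push_ready, &&add_ready, &&extended_arg_ready
    };
    long stack[64];
    long *sp = stack;
    const unsigned char *ip = code;
    unsigned oparg = 0;

    assert(code != NULL);

/* Normal dispatch: jump to the "before argument fetch" label. */
#define DISPATCH() goto *fetch_targets[*ip++]

    DISPATCH();

push_fetch:
    oparg = *ip++;
push_ready:
    *sp++ = (long)oparg;
    DISPATCH();

add_fetch:          /* no argument, so both labels coincide */
add_ready:
    sp--;
    sp[-1] += sp[0];
    DISPATCH();

extended_arg_fetch:
    oparg = *ip++;  /* high byte of the next opcode's argument */
extended_arg_ready:
    {
        /* The next opcode must be one that takes an argument. */
        unsigned char next = *ip++;
        oparg = (oparg << 8) | *ip++;
        /* Argument is complete: enter the "after fetch" label. */
        goto *ready_targets[next];
    }

halt_fetch:
halt_ready:
    return sp > stack ? sp[-1] : 0;
#undef DISPATCH
}
```

For example, the byte sequence OP_EXTENDED_ARG, 1, OP_PUSH, 0x2C pushes
300, because the PUSH is entered through ready_targets with oparg
already combined. Every handler ends in its own indirect jump, so the
only shared branch target left to align is the initial dispatch.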

An important warning is that by default, on my system, GCC 4.2 aligns
branch targets for switch to a 16-byte boundary (as recommended by the
Intel optimization guide), by adding a ".p2align 4,,7" GAS directive,
and it does not do that for computed goto.

Adding the directive by hand gave a small speedup, 2% I think; I should
try -falign-jumps=16 if it's not enabled (some -falign-jumps is enabled
by -O2), since that is supposed to give the same result.

Please use that option yourself as well, and verify that it works for
labels, though I fear it doesn't.

> However, I don't know why the speed up due to the patch is much
> more significant on x86-64 than on x86.

It's Amdahl's law, even if this is not about parallel code. When the
rest is faster (x86_64), the same speedup on dispatch gives a bigger
overall speedup.

To be absolutely clear: x86_64 has more registers, so the rest of the
interpreter is faster than on x86, but dispatch still takes the same
absolute time, which is 70% of the total on x86_64 but only 50% on x86
(those are realistic figures); if this patch halved dispatch time on
both (we're not so lucky), we would save 35% on x86_64 but only 25% on
x86. In fact, on inefficient interpreters, indirect threading is useless
altogether.
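
The arithmetic above can be checked with a few lines of C. This is
only a sketch of the usual Amdahl formula; the function names are
mine, and the 70%/50% dispatch fractions are the illustrative figures
from this message, not measurements.

```c
#include <assert.h>

/* Fraction of total run time saved when only dispatch gets faster.
 * dispatch_fraction: share of run time spent in dispatch (0..1).
 * dispatch_speedup:  factor by which dispatch alone is sped up. */
double overall_saving(double dispatch_fraction, double dispatch_speedup)
{
    assert(dispatch_fraction >= 0.0 && dispatch_fraction <= 1.0);
    assert(dispatch_speedup > 0.0);
    return dispatch_fraction * (1.0 - 1.0 / dispatch_speedup);
}

/* Overall speedup factor: old time divided by new time. */
double overall_speedup(double dispatch_fraction, double dispatch_speedup)
{
    return 1.0 / (1.0 - overall_saving(dispatch_fraction, dispatch_speedup));
}
```

With dispatch halved (dispatch_speedup = 2.0), overall_saving gives
0.35 for a 0.70 dispatch fraction and 0.25 for a 0.50 one, i.e. overall
speedups of roughly 1.54x and 1.33x.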

So, do those extra registers help _so_ much? Yes. In my toy interpreter,
computing last_i for each dispatch doesn't give any big slowdown, but
storing it in f->last_i gives a ~20% slowdown - I cross-checked multiple
times because I was astonished. Similarly, when the program counter had
to be stored in memory, I think it was about 2x slower.
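
For reference, the two variants being compared look roughly like this.
This is a reconstruction for illustration, not the toy interpreter I
measured: struct toy_frame, the opcodes and both function names are
hypothetical, and the sketch only shows where the store happens. It
makes no claim about what a given compiler does with it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a frame object holding last_i. */
struct toy_frame { int last_i; };

enum { T_HALT, T_ADD_CONST };   /* T_ADD_CONST has a one-byte argument */

/* Variant A: the offset is computed into a local on every dispatch,
 * but written to the frame only when the loop exits. */
long run_lazy_last_i(struct toy_frame *f, const unsigned char *code)
{
    const unsigned char *ip = code;
    long acc = 0;

    assert(f != NULL && code != NULL);
    for (;;) {
        int last_i = (int)(ip - code);   /* stays in a register */
        switch (*ip++) {
        case T_ADD_CONST:
            acc += *ip++;
            break;
        default:                         /* T_HALT */
            f->last_i = last_i;
            return acc;
        }
    }
}

/* Variant B: the offset is stored to memory on every dispatch. */
long run_eager_last_i(struct toy_frame *f, const unsigned char *code)
{
    const unsigned char *ip = code;
    long acc = 0;

    assert(f != NULL && code != NULL);
    for (;;) {
        f->last_i = (int)(ip - code);    /* memory write per opcode */
        switch (*ip++) {
        case T_ADD_CONST:
            acc += *ip++;
            break;
        default:                         /* T_HALT */
            return acc;
        }
    }
}
```

Both functions return the same accumulator and leave the same value in
f->last_i; the only difference is the extra store per opcode in the
second one.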

History
Date	User	Action	Args
2009-01-03 01:51:47	blaisorblade	set	recipients: + blaisorblade, lemburg, skip.montanaro, rhettinger, pitrou, christian.heimes, alexandre.vassalotti
2009-01-03 01:51:47	blaisorblade	set	messageid: <1230947507.57.0.625781170189.issue4753@psf.upfronthosting.co.za>
2009-01-03 01:51:47	blaisorblade	link	issue4753 messages
2009-01-03 01:51:45	blaisorblade	create