It finally occurred to me that what might be killing 32-bit performance
is the divisions, rather than the multiplications.
To test this, here's a version of 30bit_longdigit17.patch that replaces
just *two* of the divisions in Objects/longobject.c by the appropriate
x86 divl assembler instruction. The result for pydigits is an
astonishing 10% speedup!
Results of running python pydigits_bestof.py 2000 on 32-bit OS X
10.5.6/Core 2 Duo:
unpatched py3k
--------------
Best Time: 2212.6 ms
30bit_longdigit17.patch
-----------------------
Best Time: 2283.9 ms (-3.1% relative to py3k)
30bit_longdigit17+asm.patch
---------------------------
Best Time: 2085.7 ms (+6.1% relative to py3k)
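For the curious, the asm change amounts to something like the following
GCC inline-assembly macro. This is a minimal sketch: the macro name and
the exact operand constraints are illustrative, not copied from
30bit_longdigit17+asm.patch.

    /* 64-bit-by-32-bit unsigned division via the x86 divl
       instruction.  divl takes the dividend in edx:eax and leaves
       the quotient in eax and the remainder in edx.  The caller
       must guarantee that the quotient fits in 32 bits; otherwise
       divl raises a divide error. */
    #define DIV64_32(hi, lo, divisor, quot, rem)        \
        __asm__("divl %4"                               \
                : "=a" (quot), "=d" (rem)               \
                : "0" (lo), "1" (hi), "rm" (divisor))

With this, each quotient/remainder pair in the inner loop costs a single
divide instruction instead of a call into the compiler's runtime library.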
The problem is that (e.g., in the main loop of x_divrem) we're doing a
64-bit by 32-bit division, expecting a 32-bit quotient and a 32-bit
remainder. From the analysis of the algorithm, *we* know that the
quotient will always fit into 32 bits, so that e.g., on x86, a divl
instruction is appropriate. But unless the compiler is extraordinarily
clever it doesn't know this, so it emits an expensive call to a library
function that probably involves multiple divisions and/or some branches
and produces the full 64-bit quotient.
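In C terms, the inner-loop division has roughly the following shape (a
sketch with illustrative names and types, not the exact x_divrem code).
On a 32-bit target, gcc compiles the / and % below into calls to runtime
helpers such as __udivdi3:

    #include <stdint.h>

    typedef uint32_t digit;       /* one 30-bit digit */
    typedef uint64_t twodigits;   /* two-digit intermediate values */
    #define PyLong_SHIFT 30       /* bits per digit */

    /* Estimate one quotient digit: divide the top two digits of the
       dividend by the top digit of the divisor.  The algorithm
       guarantees that q fits in a single 32-bit word, but the
       compiler cannot prove that, so it falls back to a full
       64-bit division. */
    static digit
    estimate_quotient(digit vtop, digit vnext, digit d, digit *rem)
    {
        twodigits vv = ((twodigits)vtop << PyLong_SHIFT) | vnext;
        digit q = (digit)(vv / d);
        *rem = (digit)(vv % d);
        return q;
    }

This is exactly the shape of division that the divl macro sketched above
replaces.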
On 32-bit PowerPC things are even worse, since there isn't even a
64-bit-by-32-bit divide instruction, only a 32-bit by 32-bit division.
So I could still be persuaded that 30-bit digits should only be enabled
by default on 64-bit machines...