Message 82599 - Python tracker

Author	mark.dickinson
Recipients	christian.heimes, collinwinter, gregory.p.smith, jyasskin, loewis, mark.dickinson, pernici, pitrou, schuppenies, tim.peters, vstinner
Date	2009-02-22.12:19:18
Message-id	<1235305173.04.0.505536544801.issue4258@psf.upfronthosting.co.za>
In-reply-to

Content
It finally occurred to me that what might be killing 32-bit performance 
is the divisions, rather than the multiplications.

To test this, here's a version of 30bit_longdigit17.patch that replaces 
just *two* of the divisions in Objects/longobject.c with the appropriate 
x86 divl assembler instruction.  The result for pydigits is an 
astonishing 10% speedup!

Results of running python pydigits_bestof.py 2000 on 32-bit OS X 
10.5.6/Core 2 Duo:

unpatched py3k
-------------
Best Time; 2212.6 ms

30bit_longdigit17.patch
-----------------------
Best Time; 2283.9 ms (-3.1% relative to py3k)

30bit_longdigit17+asm.patch
---------------------------
Best Time; 2085.7 ms (+6.1% relative to py3k)
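
The percentages above are consistent with a ratio of best times, 
(baseline / patched - 1) * 100.  That formula is inferred from the 
numbers, not taken from pydigits_bestof.py, and the helper name below is 
hypothetical.  A minimal sketch:

```c
/* Hypothetical helper, not part of pydigits_bestof.py: percentage by
   which a run taking t_ms beats a run taking base_ms, as a ratio of
   times.  Positive means faster than the baseline, negative slower. */
static double
relative_speedup_percent(double base_ms, double t_ms)
{
    return (base_ms / t_ms - 1.0) * 100.0;
}
```

With the timings above this gives roughly -3.1 for the plain 30-bit 
patch and +6.1 for the asm version, both against py3k.  Comparing the 
asm version against the plain 30-bit patch (2283.9 ms vs 2085.7 ms) 
gives roughly 9.5, which is the "10%" figure.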

The problem is that (e.g., in the main loop of x_divrem) we're doing a 
64-bit by 32-bit division, expecting a 32-bit quotient and a 32-bit 
remainder.  From the analysis of the algorithm, *we* know that the 
quotient will always fit into 32 bits, so that e.g., on x86, a divl 
instruction is appropriate.  But unless the compiler is extraordinarily 
clever it doesn't know this, so it produces an expensive library call to 
a function that probably involves multiple divisions and/or some 
branches, that produces the full 64-bit quotient.
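
To illustrate the idea (this is a sketch, not the code from the attached 
patch), here is one way a 64-by-32 division with a known-small quotient 
could be expressed.  The function name div_64_by_32 is hypothetical, the 
inline assembly uses GCC-style syntax and is only enabled on x86 
targets, and other platforms fall back to plain C division:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: divide the 64-bit value (hi << 32 | lo) by d,
   returning the 32-bit quotient and storing the 32-bit remainder in
   *rem.  The caller must guarantee hi < d; otherwise the quotient does
   not fit in 32 bits and divl raises a divide error. */
static uint32_t
div_64_by_32(uint32_t hi, uint32_t lo, uint32_t d, uint32_t *rem)
{
    assert(hi < d);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t q, r;
    /* divl divides edx:eax by its operand, leaving the quotient in
       eax and the remainder in edx. */
    __asm__("divl %4"
            : "=a"(q), "=d"(r)
            : "0"(lo), "1"(hi), "rm"(d)
            : "cc");
    *rem = r;
    return q;
#else
    /* Portable fallback: the compiler may turn this into a call to a
       full 64-bit division routine on 32-bit targets. */
    uint64_t n = ((uint64_t)hi << 32) | lo;
    *rem = (uint32_t)(n % d);
    return (uint32_t)(n / d);
#endif
}
```

The hi < d precondition is the piece of information the compiler lacks: 
it is what guarantees that the quotient fits in 32 bits.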

On 32-bit PowerPC things are even worse, since there isn't even a 
64-by-32-bit divide instruction; only a 32-bit by 32-bit division.

So I could still be persuaded that 30-bit digits should only be enabled 
by default on 64-bit machines...

History
Date	User	Action	Args
2009-02-22 12:19:33	mark.dickinson	set	recipients: + mark.dickinson, tim.peters, loewis, collinwinter, gregory.p.smith, pitrou, pernici, vstinner, christian.heimes, jyasskin, schuppenies
2009-02-22 12:19:33	mark.dickinson	set	messageid: <1235305173.04.0.505536544801.issue4258@psf.upfronthosting.co.za>
2009-02-22 12:19:28	mark.dickinson	link	issue4258 messages
2009-02-22 12:19:28	mark.dickinson	create