Author mark.dickinson
Recipients christian.heimes, collinwinter, gregory.p.smith, jyasskin, loewis, mark.dickinson, pernici, pitrou, schuppenies, vstinner
Date 2009-02-20.16:01:35
SpamBayes Score 3.50906e-10
Marked as misclassified No
Message-id <1235145713.49.0.825099294592.issue4258@psf.upfronthosting.co.za>
In-reply-to
Content
Here's an updated patch that includes the x_divrem fixes and
optimizations.  I've also updated the Rietveld issue with these
fixes.

I think this is (modulo any requested changes) the version
that I'd like to get into py3k.  After that we can look
at the possibility of optimizing the multiplication algorithm;
the discussion for this should probably go back to issue 3944.

Here's a detailed list of the changes to x_divrem.

0. Most importantly, in the inner loop, we make sure that the
   multiplication is digit x digit -> twodigits; previously it was
   digit x twodigits -> twodigits, which is likely to expand to
   several instructions on a 32-bit machine.  I suspect that
   this is the main cause of the slowdown that Victor was
   seeing.

1. Make all variables unsigned.  This eliminates the need for
   Py_ARITHMETIC_RIGHT_SHIFT, and also removes the only essential use
   of stwodigits in the entire long object implementation.

2. Save an iteration of the outer loop when possible by comparing top
   digits.

3. Remove double tests in the main inner loop and correction loop.

4. Quotient digit correction step follows Knuth more closely, and uses
   fewer operations.  The extra exit condition (r >= PyLong_BASE) will
   be true more than 50% of the time, and should be cheaper than the
   main test.

5. In quotient digit estimation step, remove unnecessary special case
   when vj == w->ob_digits[w_size-1].  Knuth only needs this special 
case
   because his base is the same as the wordsize.

6. There's no need to fill the eliminated digits of v with zero;
   instead, set Py_SIZE(v) at the end of the outer loop.

7. More comments; some extra asserts.

There are many other optimizations possible;  I've tried only to include 
those that don't significantly complicate the code.  An obvious 
remaining inefficiency is that on every iteration of the outer loop we 
divide by the top word of w;  on many machines I suspect that it would 
be more efficient to precompute an inverse and multiply by that instead.
History
Date User Action Args
2009-02-20 16:01:53mark.dickinsonsetrecipients: + mark.dickinson, loewis, collinwinter, gregory.p.smith, pitrou, pernici, vstinner, christian.heimes, jyasskin, schuppenies
2009-02-20 16:01:53mark.dickinsonsetmessageid: <1235145713.49.0.825099294592.issue4258@psf.upfronthosting.co.za>
2009-02-20 16:01:52mark.dickinsonlinkissue4258 messages
2009-02-20 16:01:51mark.dickinsoncreate