classification
Title: Streamline integer division
Type: performance Stage:
Components: Interpreter Core Versions: Python 3.1, Python 2.7
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: mark.dickinson Nosy List: haypo, mark.dickinson, pitrou
Priority: normal Keywords: patch

Created on 2009-03-18 22:11 by mark.dickinson, last changed 2009-03-24 09:11 by haypo. This issue is now closed.

Files
File name Uploaded Description Edit
faster_integer_division.patch mark.dickinson, 2009-03-18 22:11
pidigits_bestof.py mark.dickinson, 2009-03-18 22:54 Integer benchmark
pidigits-2.py haypo, 2009-03-18 23:22
faster_integer_division2.patch mark.dickinson, 2009-03-21 22:19
Messages (16)
msg83781 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-03-18 22:11
Here's a patch that streamlines the x_divrem function in 
Objects/longobject.c.  More benchmarks are needed, but the initial 
results are promising:  for py3k, on a 32-bit machine with 15-bit 
digits, Victor's pidigits benchmark runs almost twice as fast with this 
patch (numbers below).

Patch details
=============

- Normalize inputs by shifting instead of multiplying and dividing.
  This halves the number of divisions used in the algorithm.

- Streamline innermost loop.

- Save an iteration of outer loop around half the time.

- Save an object allocation:  only 3 allocations per x_divrem call
  instead of 4.

- Remove special case where initial quotient estimate is >= PyLong_BASE.
  There's no need for this, since the 'digit' type holds values up
  to 2*PyLong_BASE - 1.

- Make q type digit instead of type twodigits:  this halves the size
  of the multiplication in the innermost loop.

Benchmark results
=================

Using the pidigits_bestof.py script that's posted in the issue 4258 
discussion, on a non-debug build of py3k (r70465), on OS X 10.5.6/Core 2 
Duo:

Unpatched
---------
Macintosh-3:py3k dickinsm$ ./python.exe ../pidigits_bestof.py 2000
performing a warm up run...
running
sys.int_info= sys.int_info(bits_per_digit=15, sizeof_digit=2)
Time; 2234.6 ms
Time; 2232.2 ms
Time; 2227.9 ms
Time; 2225.7 ms
Time; 2229.8 ms
Best Time; 2225.7 ms

Patched
-------
dickinsm$ ./python.exe ../pidigits_bestof.py 2000
performing a warm up run...
running
sys.int_info= sys.int_info(bits_per_digit=15, sizeof_digit=2)
Time; 1175.6 ms
Time; 1176.5 ms
Time; 1177.3 ms
Time; 1179.5 ms
Time; 1168.5 ms
Best Time; 1168.5 ms

So the patch gives a speedup of around 90%.  This particular benchmark 
is heavy on the divisions though, so I'd expect lesser speedups for 
other benchmarks.
msg83782 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2009-03-18 22:17
Would it be possible to include integer benchmark tools to upstream? I 
mean the "pidigit" tool (I didn't wrote this test! I just ported it to 
Python3) and maybe my "bench_int" tool. It could useful to reproduce 
the benchmark on different Python versions and different computers. We 
may open a new issue for each tool?
msg83785 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-03-18 22:31
Sounds good to me!  An integer benchmark script in the Tools/ directory
would be handy.
msg83786 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-18 22:43
Well, the original code in pidigits is by me :-) (although it's not
stated in the source file).
I haven't tried the patch but the performance work you're doing is
really impressive, Mark!
msg83788 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-18 22:47
Results here (Athlon X2 3600+ with gcc in 64-bit mode with 30-bit
digits), "pidigits_bestof.py 2000":
- unpatched: 1644.9 ms
- patched: 694.8 ms
!
msg83789 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2009-03-18 22:48
Useful informations for an integer benchmark:
 - CPU type: 32 or 64 bits, endian
 - Memory size
 - Python version
 - Informations about how Python was compiled
 - Integer implementation: use int_info when it's available, otherwise 
base=2^15

Who has the last version of pidigit tool? Can you attach it here? :-)
msg83790 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-03-18 22:52
[Antoine]
> Well, the original code in pidigits is by me :-)

Apologies for the misattribution!
msg83791 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-03-18 22:54
Here's the version of pidigits that I was using.
msg83793 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2009-03-18 23:22
New version of pidigit:
 - works on python 2.6, trunk and py3k
 - display Python version, CPU info (bits, endian), PyLong info (base, 
digit size)
 - display usage if no argument is given
msg83794 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2009-03-18 23:22
Oh, there is no version number of the benchmark tool itself!
msg83795 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2009-03-18 23:27
My numbers:

# No patch
$ ./python /home/haypo/pidigits-2.py 3000
Python 3.1a1+ (py3k:70466, Mar 18 2009, 23:56:06)
[GCC 4.3.2]
CPU: 64 bits, little endian
PyLong: base=2^30, sizeof(digit)=32 bits
(...)
Best Time; 2300.2 ms

# With faster_integer_division.patch
(...)
Best Time; 1138.1 ms

Ok, it's two times faster on 64 bits CPU!!!

Other notes about pidigits:
 - missing header (no author name, no description)
 - pidigit should also write the number of digits in its output to 
avoid mistakes in comparaisons
msg83953 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-03-21 22:19
Updated patch.  Lots of cleanup, but only one significant change:  the 
inner loop now uses signed arithmetic instead of unsigned arithmetic.  This saves a negation and fixes a subtle bug: the previous inner loop code 
was incorrect when using 15-bit digits on machines with sizeof(short) == 
sizeof(long).  Not that I know of any such machines:  the Cray T3E 
famously has no 16-bit integer type, but there sizeof(short) is 4 and 
sizeof(long) is 8.

A few more timings, this time from doing a single huge integer division:  
10**1000000//10**500000 (so this effectively times just the inner loop, 
since all else will be insignificant).  All timings are best-of-5, from 
non-debug builds of py3k, on the same system: OS X 10.5.6/2.4 GHz Core 2 
Duo.  Times in brackets are the approximate per-inner-loop times 
(remembering that there are 4 times as many iterations of the inner loop 
for 15-bit digits).

32-bit build, 15-bit digits, unpatched:  92382.2 ms   (~7.5 ns)
32-bit build, 15-bit digits, patched:    36473.3 ms   (~3.0 ns)
64-bit build, 30-bit digits, unpatched:  14581.4 ms   (~4.8 ns)
64-bit build, 30-bit digits, patched:     7385.1 ms   (~2.4 ns)

... and just for fun, the other combinations:

64-bit build, 15-bit digits, unpatched:  61927.5 ms   (~5.1 ns)
64-bit build, 15-bit digits, patched:    43632.9 ms   (~3.6 ns)
32-bit build, 30-bit digits, unpatched:  62374.1 ms  (~20.3 ns)
32-bit build, 30-bit digits, patched:    26928.3 ms   (~8.8 ns)


Thanks for the updated pidigits script, Victor!  Maybe this is too small 
right now to be worth including in the Tools directory, but I hope we can 
fatten it up with some other benchmarks.  What do you think?
msg84027 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-03-23 20:04
Applied, r70542 and r70547.
msg84066 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2009-03-24 09:00
Would it be possible to port the patch to python trunk?
msg84067 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-03-24 09:10
Hi Victor!  I already applied the x_divrem patch to the trunk and to py3k.  
Did you mean the release branch?

I'd prefer not to backport this to 2.6 or 3.0:  it's not a new feature, 
but it's not a bugfix either and there's always some risk of breakage...
msg84068 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2009-03-24 09:11
> I already applied the x_divrem patch to the trunk and to py3k

Oops! I taught that you applied it to py3k and release30-maint :-/

It's ok, trunk and py3k is enough ;-)
History
Date User Action Args
2009-03-24 09:11:59hayposetmessages: + msg84068
2009-03-24 09:10:01mark.dickinsonsetmessages: + msg84067
2009-03-24 09:00:07hayposetmessages: + msg84066
2009-03-23 20:04:49mark.dickinsonsetstatus: open -> closed
resolution: accepted
messages: + msg84027
2009-03-21 22:19:31mark.dickinsonsetfiles: + faster_integer_division2.patch

messages: + msg83953
2009-03-18 23:27:21hayposetmessages: + msg83795
2009-03-18 23:22:35hayposetmessages: + msg83794
2009-03-18 23:22:07hayposetfiles: + pidigits-2.py

messages: + msg83793
2009-03-18 22:54:54mark.dickinsonsetfiles: + pidigits_bestof.py

messages: + msg83791
2009-03-18 22:52:05mark.dickinsonsetmessages: + msg83790
2009-03-18 22:48:53hayposetmessages: + msg83789
2009-03-18 22:47:45pitrousetmessages: + msg83788
2009-03-18 22:43:56pitrousetnosy: + pitrou
messages: + msg83786
2009-03-18 22:31:29mark.dickinsonsetmessages: + msg83785
2009-03-18 22:17:09hayposetnosy: + haypo
messages: + msg83782
2009-03-18 22:11:26mark.dickinsoncreate