This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Title: Asymptotically faster divmod and str(long)
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.2
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: Carl.Friedrich.Bolz, eric.smith, fredrikj, mark.dickinson, pernici, serhiy.storchaka, steve21, tim.peters, vstinner, xuanji
Priority: normal Keywords: patch

Created on 2008-07-27 01:11 by fredrikj, last changed 2022-04-11 14:56 by admin. This issue is now closed.

File name Uploaded Description Edit fredrikj, 2008-07-27 16:04 python implementation of Newton division
div_timings.txt fredrikj, 2008-07-27 16:17 Comparison of performance between builtin and Newton-based divmod and str mark.dickinson, 2008-08-04 20:38 Fast recursive division
diff pernici, 2008-09-08 09:39 division with Burnikel , Ziegler algorithm
longobject2.diff pernici, 2008-10-13 10:14 review
longformat.diff pernici, 2009-03-30 22:59 str(n) recursive algorithm review
longformat_BZ.diff pernici, 2009-03-30 23:04 str(n) and BZ division algorithm review
Messages (26)
msg70309 - (view) Author: Fredrik Johansson (fredrikj) Date: 2008-07-27 01:11
A few weeks ago, I blogged about taking advantage of Karatsuba
multiplication and Newton's method to divide big integers quickly (some
of you may have read it, as it was posted to Daily Python-URL among
other places):

To summarize, this method blows the builtin division out of the water 
already at ~(2000 digits)/(1000 digits).

The primary catch is that the result of the Newton division may be
slightly wrong (typically 1 ulp). However, a single extra multiplication
and a subtraction at the end allows one to compute a remainder, and
since the remainder must satisfy 0 <= r < q, the error is easily
corrected. From a quick test, the cost of the extra multiplication seems
to move the break-even point with the builtin division up to around
5000/2500 digits.

A pure Python implementation of divmod, with error correction based on
the remainder, can be found in this file:

(See the function idivmod)

Of particular note is that fast divmod gives a fast way to do radix
conversion, by recursively splitting the number in half. The function
numeral (see same .py file) does this, e.g:

>>> from time import clock
>>> a = 2**1257787-1
>>> t1=clock(); s1=str(a); t2=clock(); t2-t1
>>> t1=clock(); s2=numeral(a); t2=clock(); t2-t1
>>> s1 == s2

(This recursive algorithm, by the way, is actually faster than str()
even with the slow builtin divmod.)

Would there be any interest in porting these algorithms to C and using
them in the standard Python long implementation?

There are likely some problems that I have overlooked. A critical review
will be most welcome.
msg70313 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-07-27 06:34
> Would there be any interest in porting these algorithms to C and using
> them in the standard Python long implementation?

Yes, definitely!  Though it depends a little bit how much complication 
is involved.

A subquadratic algorithm for converting strings to longs might also fit 
well here.

Now I'm just waiting for you to propose an implementation of integer 
square root  :-).
msg70316 - (view) Author: Fredrik Johansson (fredrikj) Date: 2008-07-27 07:07
> Yes, definitely!  Though it depends a little bit how much complication
is involved.

Not that much, I think, the pure Python version being just a few lines
of code (plus a few more lines of boilerplate). But I'm not too familiar
with converting Python to C myself, so I may underestimate the
difficulties. It might get a little more complicated if you want to
minimize temporary allocations, for example.

> Now I'm just waiting for you to propose an implementation of integer
square root  :-).

Yes, there is also code for fast (O(M(n)) integer square roots in This function would be much less useful overall as a
builtin, though.
msg70325 - (view) Author: Fredrik Johansson (fredrikj) Date: 2008-07-27 16:04
For your convenience, I have split the division and numeral code off to
a standalone .py file which I'm attaching here.

I also updated the remainder logic to use the default divmod instead of
repeated subtraction, which ensures worst-case performance similar to
the builtin divmod in the very unlikely case that div_newton would
return junk.
msg70326 - (view) Author: Fredrik Johansson (fredrikj) Date: 2008-07-27 16:17
And here some timings on my laptop.

Both int->str and balanced division become faster somewhere between 1000
and 2000 digits. This is after editing the division benchmark to use
random numbers instead of exact powers of ten (interestingly causing a
bit of slowdown). String conversion might be a little slower for lengths
that aren't close to powers of two.
msg70702 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-08-04 15:51
There's also the recursive division algorithm due
to Burnikel and Ziegler;  this might be worth a look.
I think it's the same asymptotic complexity (constant
times karatsuba multiplication complexity), but may
turn out to be faster for one reason or another.

I had a Python implementation of this somewhere;
I'll see if I can dig it out.

Assigning this to me so that it doesn't get lost or
forgotten;  but note that I don't intend to do
anything about it before 2.6/3.0 final.  If anyone else
wants to take it off my hands before then, feel free.
msg70704 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-08-04 15:58
See also issue 1746088.
msg70719 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-08-04 20:38
Here's a pure Python implementation of the Burnikel and Ziegler recursive 
division algorithm.  I've no idea whether it's faster or slower than 
Newton, but it might be worth a look.  It depends heavily on bit 
operations, which ought to be much faster when coded in C.  (Some of the 
shifts would be completely unnecessary---replaced by changes in indexing 

The original paper describing the algorithm is available here:
msg70720 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-08-04 20:38
Oops.  Wrong file.  Here's the right one.
msg70776 - (view) Author: Fredrik Johansson (fredrikj) Date: 2008-08-06 08:51
Indeed, that seems to be much faster. In the mean time, do you mind if I
steal the code? :-)
msg71010 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-08-11 10:40
> Indeed, that seems to be much faster. In the mean time, do you mind if I
> steal the code? :-)

Not at all!

I guess constant factors could well appear and/or disappear when
recoding in C; it might well be worth trying to implement both
methods to see which is faster.

There are some semi-obvious tricks available that might speed up the
recursive version.  For one thing, all shifts should be in multiples
of 15 bits (because longs are stored in base 2**15).  For another, it
ought to be possible to avoid the single-bit normalization shifts
every time the number of bits ('n') is odd---one can do a single
shift at the beginning of the calculation instead.
msg72772 - (view) Author: Pernici Mario (pernici) Date: 2008-09-08 09:39
I have translated in C the algorithm for long division by
Burnikel and Ziegler (BZ), using the Python version
and the suggestions by Mark.

Here is a benchmark for divmod(p. q), p = 7**np, q = 5**nq
digits = q_digits = p_digits/2; the time taken by Python division
is normalized to 1
tests on Debian, BZ with Python-2.6b3
        Pentium 1.60GHz    Athlon XP 2600+     Athlon 64 3800+
digits  BZ      fast_div   BZ      fast_div    BZ      fast_div
500     1.01    1.27       1.00    1.18        1.00    1.21
700     0.88    1.22       0.76    1.08        0.81    1.14
1000    0.82    1.17       0.72    1.04        0.76    1.10
2000    0.66    0.85       0.55    0.73        0.59    0.78
4000    0.51    0.62       0.43    0.52        0.45    0.56
10000   0.32    0.38       0.31    0.37        0.33    0.39
20000   0.24    0.25       0.23    0.25        0.25    0.27
100000  0.14    0.14       0.11    0.11        0.12    0.12

BZ starts being faster than Python division around 2000 bits, apart
from two cases:
 for q with less than 4000 bits and p much smaller than q**2
 x_divrem is faster than divmod_pos;
 for p very much larger than q**2 x_divrem becomes
 again faster.
I put a bound in long_divrem to fall back to x_divrem in those cases,
based on tests on my laptop Pentium 1.60GHz.

The treatment of exceptions is very rudimentary;
I use a simple macro STUB_EXC to return NULL,
but without releasing memory.
Help for doing this properly is welcome.

Please find attached the diff to the longobject.c file in Python-2.6b3 .

msg73698 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-09-24 09:55
Thanks for the patch, Mario!  I'll give a look when I get the chance.
msg74679 - (view) Author: Pernici Mario (pernici) Date: 2008-10-13 10:14
In this patch I added to the patch by Mark in issue 3944 the code
in the previous patch, modified to release memory in case of exceptions.

Benchmark for division on Athlon 64 3800+  with respect to Python3.0:
(1)  Python with patch 30bit.patch
(2)  Python with this patch
up to 1000 digits (1) and (2) perform in the same way
digits    (1)    (2)
2000      4x     5x
4000      4x     7x
10000     4x     10x
20000     4x     13x
100000    4x     27x
msg84105 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-03-24 18:32
The longobject2.diff patch probably doesn't apply cleanly any more.  
Anyone interested in updating it?

I think this patch looks like a promising beginning, but there's still 
quite a lot of work to do.  The main concerns at the moment are:

(1) the huge numbers of Py_DECREFs and Py_XDECREFs.  Some refactoring 
might help here ("goto" isn't all bad: it's fairly common to use lots of 
"goto error" statements in Python's source code).

(2) I suspect that many of the operations could be turned into in-place 
operations on digit vectors, thus saving lots of object allocations and 
deallocations.  This should also help out with (1).

I'm not yet 100% sold on getting subquadratic division into Python---it 
remains to be seen how much complexity it adds.  If it makes it easy to 
implement subquadratic integer <-> string conversions that would be a big 
msg84702 - (view) Author: Pernici Mario (pernici) Date: 2009-03-30 22:59
In this patch there is an implementation of the algorithm to convert
numbers in strings by recursively splitting the number in half, adapted
from Fredrik's
msg84704 - (view) Author: Pernici Mario (pernici) Date: 2009-03-30 23:04
In this second patch to the above patch it is added the recursive
division algorithm by Burnikel and Ziegler (BZ) 
from longobject2.diff, unmodified,
to see the effect of the subquadratic algorithm; there is still a lot of
work to be done on it, as Mark pointed out.

Here is a benchmark on hp pavilion Q8200  2.33GHz
a = 7**n; compute str(a) N times
n        N        unpatched  patch1   patch2
100      100000   0.25       0.25     0.25
200      100000   0.63       0.64     0.63
300      100000   1.19       1.22     1.23
400      100000   1.87       1.74     1.75
500      10000    0.27       0.24     0.24
1000     10000    0.98       0.59     0.60
2000     1000     0.36       0.16     0.17
5000     1000     2.17       0.65     0.66
10000    100      0.86       0.20     0.19
20000    100      3.42       0.70     0.55
50000    10       2.13       0.37     0.24
100000   1        0.85       0.13     0.08
500000   1        21         2.9      0.94
1000000  1        85         11       2.8
2000000  1        339        44       8.4

str(n) in the first patch uses a quadratic algorithm, but asymptotically
it is almost 8 times faster than the unpatched version;
in the second patch the subquadratic algorithm starts showing 
after 10000 digits.
msg103028 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-04-13 09:27
Unassigning myself for now.  The most recent patch still needs work before it can be considered for inclusion.
msg190407 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2013-05-31 15:05
See also issue18107, in particular
msg221270 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2014-06-22 16:12
As issue18107 has been closed as a duplicate of this, should this be moved to open from languishing?
msg390940 - (view) Author: Carl Friedrich Bolz-Tereick (Carl.Friedrich.Bolz) * Date: 2021-04-13 08:55
FWIW, we have implemented a faster algorithm for divmod for big numbers using Mark's in PyPy. In particular, this speeds up str(long) for large numbers significantly (eg calling str on the result of math.factorial(2**17) is now 15x faster than CPython ;-)).
msg390966 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-04-13 13:55
> we have implemented a faster algorithm for divmod for big numbers using Mark's in PyPy

Urk! I hope your implementation included a healthy dose of validation, testing and cleanup. :-) (I have no doubts that it did.)

I'd consider to be proof-of-concept code rather than something suitable for immediate use.
msg390968 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-04-13 14:57
FWIW, I think we should close this; the same comments as in issue 26256 apply. See msg259408 onwards in that discussion.
msg390971 - (view) Author: Carl Friedrich Bolz-Tereick (Carl.Friedrich.Bolz) * Date: 2021-04-13 15:29
yes, that sounds fair. In PyPy we improve things occasionally if somebody feels like working on it, but in general competing against GMP is a fools errand.
msg390972 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-04-13 15:47
Python has a reasonable efficiency up to 1000 decimal digits (according to benchmark) which is already really great! Operations with more than 1000 decimal digits is an uncommon. IMO you should use a dedicated library like GMP for that.

I would prefer to keep CPython code simple enough to maintain. Objects/longobject.c is already 5744 lines of C code. longformat_BZ.diff adds around 700 lines of code, but it only makes str(int) starting at 2000 decimal digits.

Yeah, we could do better for integers with <many digits>, but I would prefer not to :-) Python has a limited number of volunteers working on it, and it's not specialized in numbers. We should keep the maintenance burden at an acceptable level ;-)
msg412121 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2022-01-30 01:41
Ha! This will never die. More discussion in bpo-46558.

Ya, I already closed it, but don't want to. I opened it to begin with to record an int->str method that doesn't use division, so it didn't really belong on this report.

What if we _didn't_ convert these things to C? That's the primary question the new report gets around to asking. These things are far easier to write, understand, and maintain in Python, and converting to C is generally only improving a constant factor, or _maybe_ removing a log(n) factor. When we're looking at massive O() improvements, squeezing those out is of comparatively little value.
Date User Action Args
2022-04-11 14:56:36adminsetgithub: 47701
2022-01-30 01:41:10tim.peterssetnosy: + tim.peters
messages: + msg412121
2021-04-13 15:47:42vstinnersetstatus: languishing -> closed
resolution: wont fix
messages: + msg390972

stage: patch review -> resolved
2021-04-13 15:29:44Carl.Friedrich.Bolzsetmessages: + msg390971
2021-04-13 15:05:16BreamoreBoysetnosy: - BreamoreBoy
2021-04-13 14:57:33mark.dickinsonsetmessages: + msg390968
2021-04-13 13:55:02mark.dickinsonsetmessages: + msg390966
2021-04-13 08:55:52Carl.Friedrich.Bolzsetnosy: + Carl.Friedrich.Bolz
messages: + msg390940
2014-06-22 16:12:51BreamoreBoysetnosy: + serhiy.storchaka, BreamoreBoy
messages: + msg221270
2013-05-31 15:05:07eric.smithsetnosy: + eric.smith
messages: + msg190407
2011-04-27 11:57:55xuanjisetnosy: + xuanji
2010-04-13 09:27:29mark.dickinsonsetstatus: open -> languishing
assignee: mark.dickinson ->
messages: + msg103028

versions: - Python 2.7
2010-01-03 14:07:45mark.dickinsonsetversions: - Python 3.1
2009-09-21 16:32:16steve21setnosy: + steve21
2009-05-16 20:48:09ajaksu2setnosy: + vstinner
versions: + Python 3.2

stage: patch review
2009-03-30 23:04:52pernicisetfiles: + longformat_BZ.diff

messages: + msg84704
2009-03-30 22:59:13pernicisetfiles: + longformat.diff

messages: + msg84702
2009-03-24 18:32:11mark.dickinsonsetmessages: + msg84105
2008-10-13 10:14:34pernicisetfiles: + longobject2.diff
keywords: + patch
messages: + msg74679
2008-09-24 09:55:34mark.dickinsonsetmessages: + msg73698
2008-09-08 09:39:16pernicisetfiles: + diff
nosy: + pernici
messages: + msg72772
2008-08-11 10:40:27mark.dickinsonsetmessages: + msg71010
2008-08-06 08:51:51fredrikjsetmessages: + msg70776
2008-08-04 20:39:07mark.dickinsonsetfiles: -
2008-08-04 20:38:59mark.dickinsonsetfiles: +
messages: + msg70720
2008-08-04 20:38:23mark.dickinsonsetfiles: +
messages: + msg70719
2008-08-04 15:58:06mark.dickinsonsetmessages: + msg70704
2008-08-04 15:57:54mark.dickinsonlinkissue1746088 superseder
2008-08-04 15:51:56mark.dickinsonsetpriority: normal
assignee: mark.dickinson
type: performance
messages: + msg70702
versions: + Python 3.1, Python 2.7
2008-07-27 16:17:46fredrikjsetfiles: + div_timings.txt
messages: + msg70326
2008-07-27 16:04:54fredrikjsetfiles: +
messages: + msg70325
2008-07-27 07:07:39fredrikjsetmessages: + msg70316
2008-07-27 06:34:28mark.dickinsonsetmessages: + msg70313
2008-07-27 01:13:09facundobatistasetnosy: + mark.dickinson
2008-07-27 01:11:04fredrikjcreate