
Asymptotically faster divmod and str(long) #47701

Closed
fredrikj mannequin opened this issue Jul 27, 2008 · 26 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) · performance (Performance or resource usage)

Comments


fredrikj mannequin commented Jul 27, 2008

BPO 3451
Nosy @tim-one, @mdickinson, @vstinner, @cfbolz, @ericvsmith, @IIIIllllIIIIllllIIIIllllIIIIllllIIIIll, @serhiy-storchaka
Files
  • div.py: python implementation of Newton division
  • div_timings.txt: Comparison of performance between builtin and Newton-based divmod and str
  • fast_div.py: Fast recursive division
  • diff: division with the Burnikel-Ziegler algorithm
  • longobject2.diff
  • longformat.diff: str(n) recursive algorithm
  • longformat_BZ.diff: str(n) and BZ division algorithm
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2021-04-13.15:47:42.644>
    created_at = <Date 2008-07-27.01:11:04.477>
    labels = ['interpreter-core', 'performance']
    title = 'Asymptotically faster divmod and str(long)'
    updated_at = <Date 2022-01-30.01:41:10.989>
    user = 'https://bugs.python.org/fredrikj'

    bugs.python.org fields:

    activity = <Date 2022-01-30.01:41:10.989>
    actor = 'tim.peters'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-04-13.15:47:42.644>
    closer = 'vstinner'
    components = ['Interpreter Core']
    creation = <Date 2008-07-27.01:11:04.477>
    creator = 'fredrikj'
    dependencies = []
    files = ['10990', '10991', '11060', '11423', '11773', '13496', '13497']
    hgrepos = []
    issue_num = 3451
    keywords = ['patch']
    message_count = 26.0
    messages = ['70309', '70313', '70316', '70325', '70326', '70702', '70704', '70719', '70720', '70776', '71010', '72772', '73698', '74679', '84105', '84702', '84704', '103028', '190407', '221270', '390940', '390966', '390968', '390971', '390972', '412121']
    nosy_count = 10.0
    nosy_names = ['tim.peters', 'mark.dickinson', 'pernici', 'vstinner', 'Carl.Friedrich.Bolz', 'eric.smith', 'fredrikj', 'steve21', 'xuanji', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'wont fix'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue3451'
    versions = ['Python 3.2']


    fredrikj mannequin commented Jul 27, 2008

    A few weeks ago, I blogged about taking advantage of Karatsuba
    multiplication and Newton's method to divide big integers quickly (some
    of you may have read it, as it was posted to Daily Python-URL among
    other places):

    http://fredrik-j.blogspot.com/2008/07/making-division-in-python-faster.html

    To summarize, this method blows the builtin division out of the water
    already at ~(2000 digits)/(1000 digits).

    The primary catch is that the result of the Newton division may be
    slightly wrong (typically 1 ulp). However, a single extra multiplication
    and a subtraction at the end allows one to compute a remainder, and
    since the remainder must satisfy 0 <= r < q, the error is easily
    corrected. From a quick test, the cost of the extra multiplication seems
    to move the break-even point with the builtin division up to around
    5000/2500 digits.

    A pure Python implementation of divmod, with error correction based on
    the remainder, can be found in this file:

    http://www.dd.chalmers.se/~frejohl/code/libarith/libarith.py

    (See the function idivmod)
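    The idea behind such a division can be sketched in a few lines. This is a hedged, unoptimized sketch, not the idivmod from libarith.py: the name newton_divmod and the fixed full working precision are illustrative choices (a tuned version doubles the working precision at each Newton step instead of iterating at full precision throughout).

```python
def newton_divmod(a, b):
    """divmod for nonnegative a and positive b via a Newton reciprocal.

    Compute x ~= 2**k // b by the iteration x <- x*(2*2**k - b*x) >> k,
    form a trial quotient q = a*x >> k, then fix the (at most a few ulp)
    error with an exact remainder check, as described above.
    """
    if a < b:
        return 0, a
    k = 2 * a.bit_length()          # fixed-point working precision
    t = b.bit_length()
    x = 1 << (k - t)                # initial guess, relative error <= 1/2
    # each Newton step roughly squares the relative error, so about
    # log2(k) iterations reach full precision
    for _ in range(k.bit_length() + 2):
        x = (x * ((2 << k) - b * x)) >> k
    q = (a * x) >> k                # trial quotient (never too large,
                                    # since x stays below 2**k / b)
    r = a - q * b
    # correct the off-by-a-few error using the remainder
    while r >= b:
        q += 1
        r -= b
    while r < 0:
        q -= 1
        r += b
    return q, r
```

    The final remainder check is what makes the result exact regardless of how sloppy the reciprocal is; as noted above, that extra multiplication is also what pushes the break-even point up.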

    Of particular note is that fast divmod gives a fast way to do radix
    conversion, by recursively splitting the number in half. The function
    numeral (see same .py file) does this, e.g:

    >>> from time import clock
    >>> a = 2**1257787-1
    >>> t1=clock(); s1=str(a); t2=clock(); t2-t1
    109.08065159119386
    >>> t1=clock(); s2=numeral(a); t2=clock(); t2-t1
    7.1394886780303182
    >>> s1 == s2
    True

    (This recursive algorithm, by the way, is actually faster than str()
    even with the slow builtin divmod.)
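    The recursive splitting described above can be sketched as follows. The names fast_str and _rec are illustrative, not the actual numeral from the linked file, and a production version would cache the powers of ten instead of recomputing them.

```python
def _rec(n, width):
    """Decimal string of n, zero-padded to exactly `width` digits.

    One divmod by 10**(width//2) splits n into two halves; each level
    halves the operand size, so with subquadratic multiplication and
    division the whole conversion is subquadratic as well.
    """
    if width <= 18:                            # small enough for the builtin
        return '%0*d' % (width, n)
    lo = width // 2
    hi_part, lo_part = divmod(n, 10 ** lo)     # a real version caches 10**lo
    return _rec(hi_part, width - lo) + _rec(lo_part, lo)

def fast_str(n):
    """str(n) for ints via recursive halving (a sketch)."""
    if n < 0:
        return '-' + fast_str(-n)
    if n == 0:
        return '0'
    # upper bound on the digit count: log10(n) < bit_length(n) * log10(2)
    width = int(n.bit_length() * 0.30103) + 1
    return _rec(n, width).lstrip('0')
```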

    Would there be any interest in porting these algorithms to C and using
    them in the standard Python long implementation?

    There are likely some problems that I have overlooked. A critical review
    will be most welcome.

    @fredrikj fredrikj mannequin added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Jul 27, 2008
    @mdickinson (Member):

    Would there be any interest in porting these algorithms to C and using
    them in the standard Python long implementation?

    Yes, definitely! Though it depends a little bit how much complication
    is involved.

    A subquadratic algorithm for converting strings to longs might also fit
    well here.

    Now I'm just waiting for you to propose an implementation of integer
    square root :-).


    fredrikj mannequin commented Jul 27, 2008

    Yes, definitely! Though it depends a little bit how much complication
    is involved.

    Not that much, I think, the pure Python version being just a few lines
    of code (plus a few more lines of boilerplate). But I'm not too familiar
    with converting Python to C myself, so I may underestimate the
    difficulties. It might get a little more complicated if you want to
    minimize temporary allocations, for example.

    Now I'm just waiting for you to propose an implementation of integer
    square root :-).

    Yes, there is also code for fast, O(M(n)) integer square roots in
    libarith.py. This function would be much less useful overall as a
    builtin, though.
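    For reference, the basic integer Newton iteration for square roots looks like this. This is a simple monotone variant, not the O(M(n)) routine from libarith.py (which would double its working precision per step), and modern Python ships math.isqrt (3.8+) for this job; the function name here is illustrative.

```python
def isqrt_newton(n):
    """Largest x with x*x <= n, by Newton's method on integers.

    Starting from a guess >= the true root, the iterate decreases
    monotonically; the first time it stops decreasing, we are done.
    """
    if n < 0:
        raise ValueError("isqrt of negative number")
    if n == 0:
        return 0
    x = 1 << ((n.bit_length() + 1) // 2)   # initial guess >= isqrt(n)
    while True:
        y = (x + n // x) // 2              # Newton step for f(x) = x**2 - n
        if y >= x:
            return x
        x = y
```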


    fredrikj mannequin commented Jul 27, 2008

    For your convenience, I have split the division and numeral code off to
    a standalone .py file which I'm attaching here.

    I also updated the remainder logic to use the default divmod instead of
    repeated subtraction, which ensures worst-case performance similar to
    the builtin divmod in the very unlikely case that div_newton would
    return junk.


    fredrikj mannequin commented Jul 27, 2008

    And here some timings on my laptop.

    Both int->str and balanced division become faster somewhere between 1000
    and 2000 digits. This is after editing the division benchmark to use
    random numbers instead of exact powers of ten (interestingly causing a
    bit of slowdown). String conversion might be a little slower for lengths
    that aren't close to powers of two.

    @mdickinson (Member):

    There's also the recursive division algorithm due
    to Burnikel and Ziegler; this might be worth a look.
    I think it's the same asymptotic complexity (constant
    times karatsuba multiplication complexity), but may
    turn out to be faster for one reason or another.

    I had a Python implementation of this somewhere;
    I'll see if I can dig it out.

    Assigning this to me so that it doesn't get lost or
    forgotten; but note that I don't intend to do
    anything about it before 2.6/3.0 final. If anyone else
    wants to take it off my hands before then, feel free.

    @mdickinson mdickinson self-assigned this Aug 4, 2008
    @mdickinson mdickinson added the performance Performance or resource usage label Aug 4, 2008
    @mdickinson (Member):

    See also bpo-1746088.

    @mdickinson (Member):

    Here's a pure Python implementation of the Burnikel and Ziegler recursive
    division algorithm. I've no idea whether it's faster or slower than
    Newton, but it might be worth a look. It depends heavily on bit
    operations, which ought to be much faster when coded in C. (Some of the
    shifts would be completely unnecessary---replaced by changes in indexing
    instead.)

    The original paper describing the algorithm is available here:

    http://cr.yp.to/bib/1998/burnikel.ps
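    The core of the algorithm is a recursive 2n-by-n division step built on a 3n-by-2n helper. Here is a hedged pure-Python sketch of that structure; the names _div2n1n/_div3n2n/bz_divmod and the 64-bit base-case cutoff are illustrative choices, not the attached file.

```python
def _div2n1n(a, b, n):
    """Divide a < b << n by an n-bit b (top bit set); return (q, r).

    Splits the divisor in half and reduces the problem to two
    3-halves-by-2-halves steps, each recursing on an n//2-bit divisor.
    """
    if n <= 64:                        # cutoff: fall back to the builtin
        return divmod(a, b)
    pad = n & 1
    if pad:                            # make n even; undo the shift on r below
        a <<= 1
        b <<= 1
        n += 1
    half = n >> 1
    mask = (1 << half) - 1
    b1, b2 = b >> half, b & mask       # high and low halves of the divisor
    q1, r = _div3n2n(a >> n, (a >> half) & mask, b, b1, b2, half)
    q2, r = _div3n2n(r, a & mask, b, b1, b2, half)
    if pad:
        r >>= 1
    return q1 << half | q2, r

def _div3n2n(a12, a3, b, b1, b2, n):
    """Divide the 3n-bit a12*2**n + a3 by the 2n-bit b = b1*2**n + b2."""
    if a12 >> n == b1:                 # quotient estimate would overflow
        q, r = (1 << n) - 1, a12 - (b1 << n) + b1
    else:
        q, r = _div2n1n(a12, b1, n)    # estimate q from the top halves
    r = (r << n | a3) - q * b2         # remainder for this estimate
    while r < 0:                       # estimate can be slightly too big
        q -= 1
        r += b
    return q, r

def bz_divmod(a, b):
    """divmod(a, b) for a >= 0, b > 0: long division in n-bit chunks,
    with _div2n1n as the per-'digit' step."""
    if a < b:
        return 0, a
    n = b.bit_length()                 # chunk size; b's top bit is set
    chunks = -(-a.bit_length() // n)   # ceiling division: chunk count
    q = r = 0
    for i in reversed(range(chunks)):
        piece = (a >> (i * n)) & ((1 << n) - 1)
        q_i, r = _div2n1n((r << n) | piece, b, n)
        q = (q << n) | q_i
    return q, r
```

    The correction loop in _div3n2n is the same trick as in the Newton method: the quotient estimate from the top halves is off by at most a small constant, and the exact remainder fixes it.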

    @mdickinson (Member):

    Oops. Wrong file. Here's the right one.


    fredrikj mannequin commented Aug 6, 2008

    Indeed, that seems to be much faster. In the mean time, do you mind if I
    steal the code? :-)

    @mdickinson (Member):

    Indeed, that seems to be much faster. In the mean time, do you mind if I
    steal the code? :-)

    Not at all!

    I guess constant factors could well appear and/or disappear when
    recoding in C; it might well be worth trying to implement both
    methods to see which is faster.

    There are some semi-obvious tricks available that might speed up the
    recursive version. For one thing, all shifts should be in multiples
    of 15 bits (because longs are stored in base 2**15). For another, it
    ought to be possible to avoid the single-bit normalization shifts
    every time the number of bits ('n') is odd---one can do a single
    shift at the beginning of the calculation instead.


    pernici mannequin commented Sep 8, 2008

    I have translated in C the algorithm for long division by
    Burnikel and Ziegler (BZ), using the Python version fast_div.py
    and the suggestions by Mark.

    Here is a benchmark for divmod(p, q), with p = 7**np and q = 5**nq,
    where q_digits = p_digits/2 (the "digits" column gives q's digit
    count); the time taken by Python division is normalized to 1.
    Tests on Debian, BZ built with Python-2.6b3:

                Pentium 1.60GHz     Athlon XP 2600+     Athlon 64 3800+
    digits      BZ     fast_div     BZ     fast_div     BZ     fast_div
    500         1.01   1.27         1.00   1.18         1.00   1.21
    700         0.88   1.22         0.76   1.08         0.81   1.14
    1000        0.82   1.17         0.72   1.04         0.76   1.10
    2000        0.66   0.85         0.55   0.73         0.59   0.78
    4000        0.51   0.62         0.43   0.52         0.45   0.56
    10000       0.32   0.38         0.31   0.37         0.33   0.39
    20000       0.24   0.25         0.23   0.25         0.25   0.27
    100000      0.14   0.14         0.11   0.11         0.12   0.12

    BZ starts being faster than Python division around 2000 bits, apart
    from two cases: for q with less than 4000 bits and p much smaller
    than q**2, x_divrem is faster than divmod_pos; and for p very much
    larger than q**2, x_divrem becomes faster again.
    I put a bound in long_divrem to fall back to x_divrem in those cases,
    based on tests on my laptop (Pentium 1.60GHz).

    The treatment of exceptions is very rudimentary;
    I use a simple macro STUB_EXC to return NULL,
    but without releasing memory.
    Help for doing this properly is welcome.

    Please find attached the diff to the longobject.c file in Python-2.6b3 .

    Mario

    @mdickinson (Member):

    Thanks for the patch, Mario! I'll give a look when I get the chance.


    pernici mannequin commented Oct 13, 2008

    In this patch I added to the patch by Mark in bpo-3944 the code
    in the previous patch, modified to release memory in case of exceptions.

    Benchmark for division on Athlon 64 3800+, speedup relative to Python 3.0:
    (1) Python with patch 30bit.patch
    (2) Python with this patch
    Up to 1000 digits, (1) and (2) perform the same.

    digits    (1)   (2)
    2000      4x    5x
    4000      4x    7x
    10000     4x    10x
    20000     4x    13x
    100000    4x    27x

    @mdickinson (Member):

    The longobject2.diff patch probably doesn't apply cleanly any more.
    Anyone interested in updating it?

    I think this patch looks like a promising beginning, but there's still
    quite a lot of work to do. The main concerns at the moment are:

    (1) the huge numbers of Py_DECREFs and Py_XDECREFs. Some refactoring
    might help here ("goto" isn't all bad: it's fairly common to use lots of
    "goto error" statements in Python's source code).

    (2) I suspect that many of the operations could be turned into in-place
    operations on digit vectors, thus saving lots of object allocations and
    deallocations. This should also help out with (1).

    I'm not yet 100% sold on getting subquadratic division into Python---it
    remains to be seen how much complexity it adds. If it makes it easy to
    implement subquadratic integer <-> string conversions that would be a big
    plus.


    pernici mannequin commented Mar 30, 2009

    In this patch there is an implementation of the algorithm to convert
    numbers in strings by recursively splitting the number in half, adapted
    from Fredrik's div.py


    pernici mannequin commented Mar 30, 2009

    This second patch adds, on top of the previous one, the recursive
    division algorithm by Burnikel and Ziegler (BZ) from longobject2.diff,
    unmodified, to show the effect of the subquadratic algorithm; there is
    still a lot of work to be done on it, as Mark pointed out.

    Here is a benchmark on an HP Pavilion Q8200 2.33GHz.
    a = 7**n; compute str(a) N times (times in seconds):

    n        N       unpatched  patch1  patch2
    100      100000  0.25       0.25    0.25
    200      100000  0.63       0.64    0.63
    300      100000  1.19       1.22    1.23
    400      100000  1.87       1.74    1.75
    500      10000   0.27       0.24    0.24
    1000     10000   0.98       0.59    0.60
    2000     1000    0.36       0.16    0.17
    5000     1000    2.17       0.65    0.66
    10000    100     0.86       0.20    0.19
    20000    100     3.42       0.70    0.55
    50000    10      2.13       0.37    0.24
    100000   1       0.85       0.13    0.08
    500000   1       21         2.9     0.94
    1000000  1       85         11      2.8
    2000000  1       339        44      8.4

    str(n) in the first patch uses a quadratic algorithm, but asymptotically
    it is almost 8 times faster than the unpatched version;
    in the second patch the subquadratic algorithm starts showing
    after 10000 digits.

    @mdickinson (Member):

    Unassigning myself for now. The most recent patch still needs work before it can be considered for inclusion.

    @mdickinson mdickinson added the stale Stale PR or inactive for long period of time. label Apr 13, 2010
    @mdickinson mdickinson removed their assignment Apr 13, 2010


    BreamoreBoy mannequin commented Jun 22, 2014

    As bpo-18107 has been closed as a duplicate of this, should this be moved to open from languishing?


    cfbolz mannequin commented Apr 13, 2021

    FWIW, we have implemented a faster algorithm for divmod for big numbers using Mark's fast_div.py in PyPy. In particular, this speeds up str(long) for large numbers significantly (e.g. calling str on the result of math.factorial(2**17) is now 15x faster than in CPython ;-)).

    @mdickinson (Member):

    we have implemented a faster algorithm for divmod for big numbers using Mark's fast_div.py in PyPy

    Urk! I hope your implementation included a healthy dose of validation, testing and cleanup. :-) (I have no doubts that it did.)

    I'd consider fast_div.py to be proof-of-concept code rather than something suitable for immediate use.

    @mdickinson (Member):

    FWIW, I think we should close this; the same comments as in bpo-26256 apply. See msg259408 onwards in that discussion.


    cfbolz mannequin commented Apr 13, 2021

    Yes, that sounds fair. In PyPy we improve things occasionally if somebody feels like working on it, but in general competing against GMP is a fool's errand.

    @vstinner (Member):

    Python has reasonable efficiency up to 1000 decimal digits (according to benchmarks), which is already really great! Operations on more than 1000 decimal digits are uncommon; IMO you should use a dedicated library like GMP for that.

    I would prefer to keep the CPython code simple enough to maintain. Objects/longobject.c is already 5744 lines of C code. longformat_BZ.diff adds around 700 lines of code, but it only speeds up str(int) starting at around 2000 decimal digits.

    Yeah, we could do better for integers with <many digits>, but I would prefer not to :-) Python has a limited number of volunteers working on it, and it's not specialized in numbers. We should keep the maintenance burden at an acceptable level ;-)

    @vstinner vstinner removed the stale Stale PR or inactive for long period of time. label Apr 13, 2021

    tim-one commented Jan 30, 2022

    Ha! This will never die. More discussion in bpo-46558.

    Ya, I already closed it, but don't want to. I opened it to begin with to record an int->str method that doesn't use division, so it didn't really belong on this report.

    What if we _didn't_ convert these things to C? That's the primary question the new report gets around to asking. These things are far easier to write, understand, and maintain in Python, and converting to C is generally only improving a constant factor, or _maybe_ removing a log(n) factor. When we're looking at massive O() improvements, squeezing those out is of comparatively little value.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022