
classification
Title: Improve performance of math.prod with bignums (and functools.reduce?)
Type: enhancement
Stage:
Components: Library (Lib)
Versions: Python 3.11

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: benrg, mark.dickinson, rhettinger, tim.peters
Priority: normal
Keywords:

Created on 2022-02-26 23:35 by benrg, last changed 2022-04-11 14:59 by admin.

Messages (9)
msg414126 - (view) Author: (benrg) Date: 2022-02-26 23:35
math.prod is slow at multiplying arbitrary-precision numbers. E.g., compare the run time of factorial(50000) to prod(range(2, 50001)).

factorial has some special-case optimizations, but the bulk of the difference is due to prod evaluating an expression tree of depth n. If you re-parenthesize the product so that the tree has depth log n, as factorial does, it's much faster. The evaluation order of prod isn't documented, so I think the change would be safe.
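
For concreteness, a minimal sketch of that re-parenthesization (prod_balanced is a made-up name used only for illustration, not a proposed API; it assumes a sequence input):

    from math import prod

    def prod_balanced(xs, lo=0, hi=None):
        # Multiply xs[lo:hi] by splitting the range in half, so the
        # expression tree has depth ~log2(n) instead of depth n.
        if hi is None:
            hi = len(xs)
        n = hi - lo
        if n == 0:
            return 1
        if n == 1:
            return xs[lo]
        mid = (lo + hi) // 2
        return prod_balanced(xs, lo, mid) * prod_balanced(xs, mid, hi)

    assert prod_balanced(list(range(2, 5001))) == prod(range(2, 5001))
    # Timing prod_balanced(list(range(2, 50001))) against prod(range(2, 50001))
    # shows the gap described above.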

factorial uses recursion to build the tree, but it can be done iteratively with no advance knowledge of the total number of nodes.

This trick is widely useful for turning a way of combining two things into a way of combining many things, so I wouldn't mind seeing a generic version of it in the standard library, e.g. reduce(..., order='mid'). For many specific cases there are more efficient alternatives (''.join, itertools.chain, set.union, heapq.merge), but it's nice to have a recipe that saves you the trouble of writing special-case algorithms at the cost of a log factor that's often negligible.
msg414138 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2022-02-27 03:07
I don't know that there's a good use case for this. For floating addition, tree-based summation can greatly reduce total roundoff error, but that's not true of floating multiplication.

The advantage for prod(range(2, 50001)) doesn't really stem from turning it into a tree reduction; rather, for that specific input sequence, "the tree" happens to do a decent job of keeping multiplicands more-or-less balanced in size (bit length), which eventually allows Karatsuba multiplication to come into play. If CPython didn't implement Karatsuba multiplication, tree reduction wouldn't improve speed: if the inputs are in sequence xs, regardless of how school-book multiplication is grouped or rearranged, the time needed is proportional to

    sum(a * b for a, b in combinations([x.bit_length() for x in xs], 2))

So if the real point is to speed large products of integers, a different approach is wanted: keep track of intermediate products' bit lengths, and at each step strive to multiply partial products near the same length. This will reliably get Karatsuba into play if possible, and caters too to that Karatsuba is most valuable on multiplicands of the same bit length. When any of that happens from blind tree reduction, it's down to luck.

I've seen decent results from doing that with a fixed, small array A, which (very roughly speaking) combines "the next" integer `i` to multiply with the array entry A[i.bit_length().bit_length()] (and continues that with the new partial product, & so on, until hitting an array slot containing 1).
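
A rough sketch of that kind of bookkeeping (illustrative only; the exact bucket function, the 64-slot array, and the name prod_by_size are assumptions, not the precise scheme described above):

    import math

    def prod_by_size(iterable):
        # Keep partial products in a small array, bucketed by the (rough)
        # log of their bit length, so operands multiplied together tend to
        # have similar sizes -- which is where Karatsuba helps most.
        A = [1] * 64  # slot k loosely holds a partial product with bit length near 2**k
        for x in iterable:
            k = x.bit_length().bit_length()
            while A[k] != 1:
                # Merge with the partial product already in this slot, then
                # re-bucket the (larger) result and keep going.
                x *= A[k]
                A[k] = 1
                k = x.bit_length().bit_length()
            A[k] = x
        result = 1
        for p in A:
            result *= p
        return result

    assert prod_by_size(range(2, 5001)) == math.prod(range(2, 5001))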
msg414144 - (view) Author: (benrg) Date: 2022-02-27 08:36
My example used ints, but I was being deliberately vague when I said "bignums". Balanced-tree reduction certainly isn't optimal for ints, and may not be optimal for anything, but it's pretty good for a lot of things. It's the comparison-based sorting of reduction algorithms.

* When the inputs are of similar sizes, it tends to produce intermediate operands of similar sizes, which helps with Karatsuba multiplication (as you said).

* When the inputs are of different sizes, it limits the "damage" any one of them can do, since they only participate in log2(n) operations each.

* It doesn't look at the values, so it works with third-party types that are unknown to the stdlib.

There's always a fallback case, and balanced reduction is good for that. If there's a faster path for ints that looks at their bit lengths, great.
msg414164 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2022-02-27 21:12
Hi. "It's pretty good for a lot of things" is precisely what I'm questioning. Name some, and please be specific ;-)

Tree reduction is very popular in the parallel processing world, for obvious reasons. But we're talking about a single thread here, and there aren't all that many associative binary operators (or, depending on implementation, they may need also to be commutative).

I gave floating * as an example: tree reduction buys nothing for accuracy, and would be both slower and consume more memory. Other associative operators for which all the same are true include bitwise &, bitwise |, bitwise ^, min, and max.

Floating + can benefit for accuracy, but we already have math.fsum() for that, which delivers the most accurate possible result.

The only unaddressed example we have here so far that could actually benefit is bigint *. If that's the real use case, fine, but there are better ways to address that case.

I've searched in vain for other languages that try to address this "in general", except in the context of parallelization. As Guido will tell you, the only original idea in Python is adding an "else" clause to loops ;-)

In any case, I don't believe this belongs hiding in reduce(). As above, it's a net loss for most binary operators. Too close to "attractive nuisance" territory. Adding a new, e.g., treereduce() would be better - provided that it is in fact "pretty good for a lot of things".
msg414216 - (view) Author: (benrg) Date: 2022-02-28 19:09
Anything that produces output of O(m+n) size in O(m+n) time. Ordered merging operations. Mergesort is a binary ordered merge with log-depth tree reduction, and insertion sort is the same binary operation with linear-depth tree reduction. Say you're merging sorted lists of intervals, and overlapping intervals need special treatment. It's easier to write a manifestly correct binary merge than an n-way merge, or a filter after heapq.merge that needs to deal with complex interval clusters. I've written that sort of code.
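
A hedged sketch of that pattern (merge2 and merge_all are made-up names): a plain binary merge of sorted lists, applied to many lists by balanced pairwise reduction:

    def merge2(a, b):
        # Ordinary two-way merge of sorted lists -- easy to write and verify,
        # including any special handling of "overlapping" items if needed.
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if b[j] < a[i]:
                out.append(b[j]); j += 1
            else:
                out.append(a[i]); i += 1
        out.extend(a[i:])
        out.extend(b[j:])
        return out

    def merge_all(lists):
        # Balanced pairwise reduction: each element passes through only about
        # log2(len(lists)) merges, so total work is O(total_size * log(#lists)),
        # versus O(total_size * #lists) for a plain left-to-right reduce.
        lists = list(lists)
        if not lists:
            return []
        while len(lists) > 1:
            paired = []
            for i in range(0, len(lists), 2):
                if i + 1 < len(lists):
                    paired.append(merge2(lists[i], lists[i + 1]))
                else:
                    paired.append(lists[i])
            lists = paired
        return lists[0]

    assert merge_all([[1, 4], [2, 3], [0, 5]]) == [0, 1, 2, 3, 4, 5]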

Any situation that resembles a fast path but doesn't qualify for the fast path. For example, there's an optimized factorial function in math, but you need double factorial. Or math.prod is optimized for ints as you suggested, but you have a class that uses ints internally but doesn't pass the CheckExact test. Usually when you miss out on a fast path, you just take a (sometimes large) constant-factor penalty, but here it pushes you into a different complexity class. Or you have a class that uses floats internally and wants to limit accumulated roundoff errors, but the structure of the computation doesn't fit fsum.

>Tree reduction is very popular in the parallel processing world, for obvious reasons.

It's the same reason in every case: the log depth limits the accumulation of some bad thing. In parallel computing it's critical-path length, in factorial and mergesort it's size, in fsum it's roundoff error. Log depth helps in a range of situations.

>I've searched in vain for other languages that try to address this "in general"

You've got me there.

>As Guido will tell you, the only original idea in Python is adding an "else" clause to loops ;-)

I don't think that's really true, except in the sense that there's nothing new under the sun. No one would use Python if it was just like other languages except slower and with for-else.
msg414219 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2022-02-28 19:47
Too abstract for me to find compelling. I suspect the title of this report referenced "math.prod with bignums" because it's the only actual concrete use case you had ;-)

Here's another: math.lcm. That can benefit for the same reason as math.prod - provoking Karatsuba multiplication. However, applying lcm to a largish collection of ints is so rare I can't recall ever seeing it done.

Here's a related anti-example: math.gcd. Tree reduction hurts that. It typically falls to 1 quickly, and tree reduction just delays that.

So I'm at best -0 on this, and I'll stop now.

For reference, here's a Python implementation that strives to match functools.reduce's signature and endcase behaviors. It accepts any iterable, and requires temp space at most about log2(number of elements the iterable delivers in all).

That memory frugality adds a log2 factor to the runtime. The O() speed penalty could be eliminated by using temp memory that grows to about half the number of elements in the iterable.

    def treereduce(function, iterable, initializer=None):
        # levels[i] holds either NULL or a partial result combining (roughly)
        # 2**i consecutive items, so at most ~log2(n) entries are live.
        levels = []
        if initializer is not None:
            levels.append(initializer)
        NULL = object()
        for y in iterable:
            # Like incrementing a binary counter: combine with each occupied
            # slot, carrying the result upward, until a free slot is found.
            for i, x in enumerate(levels):
                if x is NULL:
                    levels[i] = y
                    break
                y = function(x, y)
                levels[i] = NULL
            else:
                levels.append(y)
        # Fold the surviving partial results together, lowest level first,
        # keeping earlier items on the left.
        y = NULL
        for x in levels:
            if x is not NULL:
                y = x if y is NULL else function(x, y)
        if y is NULL:
            raise TypeError("treereduce() of empty iterable with no initial value")
        return y

Then, for example,

>>> treereduce(lambda x, y: f"({x}+{y})", "abcdefghijk")
'((((a+b)+(c+d))+((e+f)+(g+h)))+((i+j)+k))'
msg414225 - (view) Author: (benrg) Date: 2022-02-28 23:27
>That memory frugality adds a log2 factor to the runtime.

Your iterative algorithm is exactly the one I had in mind, but it doesn't have the run time that you seem to think. Is that the whole reason for our disagreement?

It does only O(1) extra work (not even amortized O(1), really O(1)) for each call of the binary function, and there are exactly n-1 calls. There's a log(n) term (not factor) for expanding the array and skipping NULLs in the final cleanup. The constant factor for it is tiny since the array is so small.
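
For what it's worth, the call count is easy to check against the treereduce() sketch above (illustrative only; counting_or is just a wrapper used for the check):

    calls = 0
    def counting_or(x, y):
        global calls
        calls += 1
        return x | y

    n = 10_000
    treereduce(counting_or, [1] * n)
    assert calls == n - 1  # exactly n - 1 calls, the same as functools.reduce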

I implemented it in C and benchmarked it against reduce with unvarying arguments (binary | on identical ints), and it's slightly slower around 75% of the time, and slightly faster around 25% of the time, seemingly at random, even in the same test, which I suppose is related to where the allocator decides to put the temporaries. The reordering only needs to have a sliver of a benefit for it to come out on top.

When I said "at the cost of a log factor" in the first message, I meant relative to algorithms like ''.join, not left-reduce.


>I suspect the title of this report referenced "math.prod with bignums" because it's the only actual concrete use case you had ;-)

I led with math.prod because its evaluation order isn't documented, so it can be changed (and I guess I should have said explicitly that there is no up-front penalty to changing it beyond tricky cache locality issues). I said "bignums" because I had in mind third-party libraries and the custom classes that I mentioned in my last message. I put ? after reduce because its left-associativity is documented and useful (e.g. with nonassociative functions), so it would have to be extended or a new function added, which is always a hard sell. I also wanted the title to be short. I did the best I could.
msg414231 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2022-03-01 01:32
About runtime, you're right. I did a ballpark "OK, if there are N incoming values, the inner loop has to go around, for each one, looking for a NULL, across a vector of at most log2(N) entries. So N * log2(N)". But, in fact, it's highly skewed toward getting out early, and 2*N is an upper bound on the total number of inner loop iterations. Strongly related to that: the total number of trailing 1 bits in the integers from 1 through N inclusive is N - N.bit_count().

For the rest, I'll only repeat that if this goes in, it should be as a new function. Special-casing, e.g., math.prod() is a Bad Idea. We can have no idea in advance whether the iterable is type-homogenous, or even whether the __mul__ methods the types involved implement are intended to be associative.

functools.reduce() clearly documents strict "left to right" evaluation.

But a new treereduce() can do anything it feels like promising.
msg414233 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2022-03-01 02:30
> the total number of trailing 1 bits in the integers from 1
> through N inclusive is N - N.bit_count()

Sorry, that's the total number of trailing 0 bits. The total number of trailing 1 bits is (N+1) - (N+1).bit_count().
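
Both identities are easy to sanity-check for small N (illustration only; int.bit_count() needs Python 3.10+):

    for N in range(1, 1000):
        # trailing zeros of i: position of the lowest set bit
        trailing_zeros = sum((i & -i).bit_length() - 1 for i in range(1, N + 1))
        assert trailing_zeros == N - N.bit_count()
        # trailing ones of i: trailing zeros of i + 1
        trailing_ones = sum(((i + 1) & -(i + 1)).bit_length() - 1 for i in range(1, N + 1))
        assert trailing_ones == (N + 1) - (N + 1).bit_count()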
History
Date                 User        Action  Args
2022-04-11 14:59:56  admin       set     github: 91024
2022-03-01 02:30:42  tim.peters  set     messages: + msg414233
2022-03-01 01:32:22  tim.peters  set     messages: + msg414231
2022-02-28 23:27:29  benrg       set     messages: + msg414225
2022-02-28 19:47:30  tim.peters  set     messages: + msg414219
2022-02-28 19:09:45  benrg       set     messages: + msg414216
2022-02-27 21:12:26  tim.peters  set     messages: + msg414164
2022-02-27 08:36:09  benrg       set     messages: + msg414144
2022-02-27 03:07:19  tim.peters  set     messages: + msg414138
2022-02-27 00:05:23  ned.deily   set     nosy: + tim.peters, rhettinger, mark.dickinson
2022-02-26 23:35:53  benrg       create