I was thinking about
comb(1000000, 500000)
The simple "* --n / ++k" loop does 499,999 each of multiplication and division, and in all instances the second operand is a single Python digit. Cheap as can be.
In contrast, despite that it short-circuits all "small k" cases, comb_pole2.py executes
return C(n, j) * C(n-j, k-j) // C(k, j) # Recursive case
7,299,598 times. Far more top-level arithmetic operations, including extra-expensive long division with both operands multi-digit. At the very top of the tree, "// C(500000, 250000)" is dividing out nearly half a million bits.
But a tiny Python function coding the simple loop takes about 150 seconds, while Raymond's C() about 30 (under the released 3.10.1). The benefits from provoking Karatsuba are major.
Just for the heck of it, I coded a complication of the simple loop that just tries to provoke Karatsuba on the numerator (prod(range(n, n-k, -1))), and dividing that just once at the end, by factorial(k). That dropped the time to about 17.5 seconds. Over 80% of that time is spent computing the "return" expression, where len(p) is 40 and only 7 entries pass the x>1 test:
return prod(x for x in p if x > 1) // factorial(k)
That is, with really big results, almost everything is mostly noise compared to the time needed to do the relative handful of * and // on the very largest intermediate results.
Stefan's code runs much faster still, but requires storage proportional to k. The crude hack I used needs a fixed (independent of k and n) and small amount of temp storage, basically grouping inputs into buckets roughly based on the log _of_ their log, and keeping only one int per bucket.
Here's the code:
def pcomb2(n, k):
from math import prod, factorial
assert 0 <= k <= n
k = min(k, n-k)
PLEN = 40 # good enough for ints with over a trillion bits
p = [1] * PLEN
def fold_into_p(x):
if x == 1:
return
while True:
i = max(x.bit_length().bit_length() - 5, 0)
if p[i] == 1:
p[i] = x
break
x *= p[i]
p[i] = 1
def showp():
for i in range(PLEN):
pi = p[i]
if pi > 1:
print(i, pi.bit_length())
for i in range(k):
fold_into_p(n)
n -= 1
showp()
return prod(x for x in p if x > 1) // factorial(k)
I'm not sure it's practical. For example, while the list p[] can be kept quite small, factorial(k) can require a substantial amount of temp storage of its own - factorial(500000) in the case at hand is an int approaching 9 million bits.
Note: again in the case at hand, computing it via
factorial(1000000) // factorial(500000)**2
takes about 10% longer than Raymond's C() function. |