Clever, Mark! Very nice.
The justification for the shift count isn't self-evident, and appears to me to be an instance of the generalization of Kummer's theorem to multinomial coefficients. I think it would be clearer at first sight to rely instead on that 2**i/(2**j * 2**k) = 2**(i-j-k), which is shallow.
So here's a minor rewrite doing that instead; it would add 68 bytes to the precomputed static data.
import math
# Largest n such that comb(n, k) fits in 64 bits for all k.
Nmax = 67
# Express n! % 2**64 as Fodd[n] << Fntz[n] where Fodd[n] is odd.
Fodd = [] # unsigned 8-byte ints
Fntz = [] # unsigned 1-byte ints
for i in range(Nmax + 1):
f = math.factorial(i)
lsb = f & -f # isolate least-significant 1 bit
Fntz.append(lsb.bit_length() - 1)
Fodd.append((f >> Fntz[-1]) % 2**64)
Finv = [pow(fodd, -1, 2**64) for fodd in Fodd]
# All of the above is meant to be precomputed; just static tables in C.
# Fast comb for small values.
def comb_small(n, k):
if not 0 <= k <= n <= Nmax:
raise ValueError("k or n out of range")
return ((Fodd[n] * Finv[k] * Finv[n-k] % 2**64)
<< (Fntz[n] - Fntz[k] - Fntz[n-k]))
# Exhaustive test
for n in range(Nmax+1):
for k in range(0, n+1):
assert comb_small(n, k) == math.comb(n, k)
Since 99.86% of comb() calls in real life are applied to a deck of cards ;-) , it's valuable that Nmax be >= 52. |