Ya, I don't expect anyone will check in a change without doing comparative timings in C first. Not worried about that.
I'd be happy to declare victory and move on at this point ;-) But that's me. Near the start of this, I noted that we just won't compete with GMP's vast array of tricks.
I noted that they use a special routine for division when it's known in advance that the remainder is 0 (as it is, e.g., in every division performed by "our" recursion (which GMP also uses, in some cases)).
But I didn't let on just how much that can buy them. Under 3.10.1 on my box, the final division alone in math.factorial(1000000) // math.factorial(500000)**2 takes over 20 seconds. But a pure Python implementation of what I assume (don't know for sure) is the key idea(*) in their exact-division algorithm does the same thing in under 0.4 seconds. Huge difference - and while the pure Python version only ever wants the lower N bits of an NxN product, there's no real way to do that in pure Python except via throwing away the higher N bits of a double-width int product. In C, of course, the high half of the bits wouldn't be computed to begin with.
(*) Modular arithmetic again. Given n and d such that it's known n = q*d for some integer q, shift n and d right until d is odd. q is unaffected. A good upper bound on the bit length of q is then n.bit_length() - d.bit_length() + 1. Do the remaining work modulo 2 raised to that power. Call that base B.
We "merely" need to solve for q in the equation n = q*d (mod B). Because d is odd, I = pow(d, -1, B) exists. Just multiply both sides by I to get n * I = q (mod B).
No divisions of any kind are needed. More, there's also a very efficient, division-free algorithm for finding an inverse modulo a power of 2. To start with, every odd int is its own inverse mod 8, so we start with 3 good bits. A modular Newton-like iteration can double the number of correct bits on each iteration.
But I won't post code (unless someone asks) because I don't want to encourage anyone :-) |