Some random notes:
- 1425089352415399815 appears to be derived from using the golden ratio to contrive a worst case for the Euclid egcd method. Which it's good at :-) Even so, the current code runs well over twice as fast as when replacing the pow(that, -1, P) with pow(that, P-2, P).
- Using binary gcd on the same thing requires only 46 iterations - and, of course, no divisions at all. So that could be a big win. There's no possible way to get exponentiation to require less than 60 iterations, because it requires that many squarings just to reach the high bit of P-2. However, I never finishing working out how to extend binary gcd to return inverses too.
- All cases are "bad" for pow(whatever, P-2, P) because P-2 has 60 bits set. So we currently need 60 multiplies to account for those, in addition to 60 squarings to reach P-2's high bit. A significant speedup could probably have been gotten just by rewriting whatever**(P-2) as
(whatever ** 79511827903920481) ** 29
That is, express P-2 as its prime factorization. There are 28 zero bits in the larger factor, so would save 28 multiply steps right there. Don't know how far that may yet be from an optimal addition chain for P-2.
- The worst burden of the P-2-power method is that there's no convenient way to exploit that % P _can_ be very cheap, because P has a very special value. The power method actually needs to divide by P on each step. As currently coded, the egcd method never needs to divide by P (although _in_ the egcd part it divides narrower & narrower numbers < P).
- Something on my todo list forever was writing an internal routine for squaring, based on that (a+b)**2 = a**2 + 2ab + b**2. That gives Karatsuba-like O() speedup but with simpler code, enough simpler that it could probably be profitably applied even to a relatively small argument.
Which of those do I intend to pursue? Probably none :-( But I confess to be _annoyed_ at giving up on extending binary gcd to compute an inverse, so I may at least do that in Python before I die ;-) |