Some comments, now that I've had a chance to look properly at the suggestion.
For reference, here's the "near square root" function that forms the basis of Python's isqrt algorithm. For clarity, I've written it recursively, but it's equivalent to the iterative version described in mathmodule.c. (Definition: for a positive integer n, a "near square root" of n is an integer a such that |a - √n| < 1; or in other words the near square roots of n are the floor and the ceiling of √n.)
def nsqrt(n):
"""Compute a near square root for a positive integer n."""
if n < 4:
return 1
else:
e = (n.bit_length() - 3) // 4
a = nsqrt(n >> 2*e + 2)
return (a << e) + (n >> e + 2) // a
Juraj's suggestion, applied to each step of the recursion rather than just the outer step, amounts to computing the expression in the last line in a different way. (What follows isn't *identical* to what Juraj is suggesting in all the details, but it's essentially equivalent and has the same key performance implications.) Here's the proposed new version, identical to the previous one except for the last line:
def nsqrt(n):
"""Compute a near square root for a positive integer n."""
if n < 4:
return 1
else:
e = (n.bit_length() - 3) // 4
a = nsqrt(n >> 2*e + 2)
return (a << e + 1) + ((n >> e + 2) - (a * a << e)) // a
With regards to proof, it's straightforward to see that this is equivalent: we have
(a << e) + (n >> e + 2) // a
== (a << e) + (a << e) + (n >> e + 2) // a - (a << e)
== (a << e + 1) + (n >> e + 2) // a - (a << e)
== (a << e + 1) + ((n >> e + 2) - (a * a << e)) // a
It's interesting to note that with this version we *do* rely on Python's floor division semantics for negative numbers, since the quantity ((n >> e + 2) - (a * a << e)) could be negative; that last equality is not valid with C-style sign-of-result-is-sign-of-dividend division. The first version works entirely with nonnegative integers. (And the version proposed by Juraj also works entirely with nonnegative integers, as a result of the extra correction step.)
And now the key point is that the dividend (n >> e + 2) - (a * a << e) in the second version is significantly smaller (roughly two-thirds the bit count) than the original divisor n >> e + 2. That should make the division roughly twice as fast (in the limit as n gets large), and since the division is the main bottleneck in the algorithm for large n, it should speed up the algorithm overall, again in the limit for large n, despite the fact that we have a greater number of arithmetic operations per iteration. And with Python's current subquadratic multiplication but quadratic division, the advantage becomes more significant as n gets large.
The second version is a little more complicated than the first, but the complication probably amounts to no more than 10-20 extra lines of C code. Still, there's a maintenance cost in adding that complication to be considered.
But here's the thing: I don't actually care about the performance for *huge* n - people who care about that sort of thing would be better off using gmpy2. I'm much more interested in the performance implications for *smaller* n: that is, integers of length 64-256 bits, say. (For n smaller than 2**64 the difference is irrelevant, since we have a pure C fast path there.) If the second version performs better across the board it may be worth the extra complexity. If it doesn't, then what's the cutoff? That is, where does the second version start outperforming the first? I'm not really so interested in having a hybrid algorithm that switches from one solution to the other at some threshold - that's a step too far complexity-wise.
So I propose that the next step be to code up the second variant in mathmodule.c and do some performance testing.
Juraj: are you interested in working on a PR? |