classification
Title: Speed up math.isqrt
Type: performance Stage: resolved
Components: Extension Modules Versions: Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: mark.dickinson Nosy List: mark.dickinson, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2019-05-18 15:43 by mark.dickinson, last changed 2019-05-19 16:53 by mark.dickinson. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 13405 merged mark.dickinson, 2019-05-18 15:46
PR 13416 merged serhiy.storchaka, 2019-05-19 07:03
Messages (9)
msg342799 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2019-05-18 15:43
The `math.isqrt` algorithm introduce in GH-36887 currently works entirely with Python long integers. That's unnecessarily inefficient for small inputs.

For n < 2**64, `math.isqrt(n)` can be computed, via exactly the same algorithm, using entirely C integer arithmetic.

For larger n, the first 5 iterations of the algorithm can similarly be performed entirely in C integer arithmetic, and we can then switch to long integer arithmetic for subsequent iterations.

On my machine, these simple changes make a substantial difference (4x faster) for small inputs, and a significant but less substantial difference (70% speedup) for inputs not much larger than 2**64. The speedup for huge integers is likely to be much smaller, percentage-wise.

Some timings:

Unpatched
---------
lovelace:cpython mdickinson$ ./python.exe -m timeit -s "from math import isqrt" "[isqrt(n) for n in range(2, 1000)]"
1000 loops, best of 5: 327 usec per loop
lovelace:cpython mdickinson$ ./python.exe -m timeit -s "from math import isqrt; x = range(2**63-1000, 2**63+1000)" "[isqrt(n) for n in x]"
200 loops, best of 5: 1.44 msec per loop
lovelace:cpython mdickinson$ ./python.exe -m timeit -s "from math import isqrt; x = range(2**95-1000, 2**95+1000)" "[isqrt(n) for n in x]"
200 loops, best of 5: 1.64 msec per loop

Patched (PR imminent)
-------
lovelace:cpython mdickinson$ ./python.exe -m timeit -s "from math import isqrt" "[isqrt(n) for n in range(2, 1000)]"
5000 loops, best of 5: 78.1 usec per loop
lovelace:cpython mdickinson$ ./python.exe -m timeit -s "from math import isqrt; x = range(2**63-1000, 2**63+1000)" "[isqrt(n) for n in x]"
1000 loops, best of 5: 355 usec per loop
lovelace:cpython mdickinson$ ./python.exe -m timeit -s "from math import isqrt; x = range(2**95-1000, 2**95+1000)" "[isqrt(n) for n in x]"
500 loops, best of 5: 954 usec per loop
msg342800 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2019-05-18 15:47
> introduce in GH-36887

Sorry, that should have been: introduced in GH-13244. #36887 was the corresponding b.p.o. issue.
msg342801 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-05-18 16:06
Did you try the floating point implementation?
msg342802 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2019-05-18 17:07
> Did you try the floating point implementation?

The aim here was to use exactly the same algorithm, but speed it up by working with C integers where possible; that's a fairly simple change.

Using floating-point would require more complex changes. Again, the biggest issue with using a floating-point sqrt as an initial value is that we can't assume either IEEE 754 floating-point format, *or* a correctly rounded libm sqrt, even though both those things are highly likely on a typical modern machine. So any use of floating-point would also have to have an accuracy check and a fallback integer-only implementation for the case where the floating-point fails. It's possible to make those changes, but I think we'd end up crossing the threshold to "too complicated" for the implementation of a simple function.

It's a bit of a shame, because if we _are_ allowed to assume IEEE 754, and a correctly-rounded sqrt implementation (using round-ties-to-even), then it turns out that one can prove that for any value `n` smaller than 2**106 and `a := int(math.sqrt(float(n)))` (assuming that the `int` and `float` conversions are *also* correctly rounded), we have (a - 1)**2 < n < (a + 1)**2, which is exactly the loop invariant that the current algorithm needs to maintain.
msg342836 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-05-19 07:07
It is possible to get yet 10-20% by avoiding to create temporary Python integers for right arguments of shift operations. PR 13416 adds two private functions _PyLong_Rshift() and _PyLong_Lshift() which take the second argument as C size_t instead of Python integer. _PyLong_Lshift() can also be used in factorial() and in float to int comparison.

$ ./python -m timeit -s "from math import isqrt; x = range(2**63-1000, 2**63+1000)" "[isqrt(n) for n in x]"
Unpatched:  200 loops, best of 5: 1.84 msec per loop
Patched:    200 loops, best of 5: 1.51 msec per loop

$ ./python -m timeit -s "from math import isqrt; x = range(2**95-1000, 2**95+1000)" "[isqrt(n) for n in x]"
Unpatched:  100 loops, best of 5: 2.09 msec per loop
Patched:    200 loops, best of 5: 1.75 msec per loop
msg342837 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-05-19 07:16
I have also some ideas about algorithmic optimizations (they need to be tested). In classic formula $a_{i+1} = a_i + (n - a_i^2)/(2*a_i)$ we can calculate $n - a_i^2$ as $(n - a_{i-1}^2) - (a_i^2 - a_{i-1})^2 = (n - a_{i-1}^2) - (a_i^2 - a_{i-1})*(a_i^2 + a_{i-1})$. $n - a_i^2$ usually is much smaller than $n$, so this can speed up subtraction and division. Things become more complicated when use shifts as in your formula, but I think that we can get benefit even in this case. This can also speed up the final check $a_i^2 <= n$.
msg342840 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-05-19 11:14
New changeset a5119e7d75c9729fc36c059d05f3d7132e7f6bb4 by Serhiy Storchaka in branch 'master':
bpo-36957: Add _PyLong_Rshift() and _PyLong_Lshift(). (GH-13416)
https://github.com/python/cpython/commit/a5119e7d75c9729fc36c059d05f3d7132e7f6bb4
msg342864 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2019-05-19 16:51
New changeset 5c08ce9bf712acbb3f05a3a57baf51fcb534cdf0 by Mark Dickinson in branch 'master':
bpo-36957: Speed up math.isqrt (#13405)
https://github.com/python/cpython/commit/5c08ce9bf712acbb3f05a3a57baf51fcb534cdf0
msg342865 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2019-05-19 16:53
Calling this done for now.
History
Date User Action Args
2019-05-19 16:53:34mark.dickinsonsetstatus: open -> closed
resolution: fixed
messages: + msg342865

stage: patch review -> resolved
2019-05-19 16:51:59mark.dickinsonsetmessages: + msg342864
2019-05-19 11:14:48serhiy.storchakasetmessages: + msg342840
2019-05-19 07:16:22serhiy.storchakasetmessages: + msg342837
2019-05-19 07:07:01serhiy.storchakasetmessages: + msg342836
2019-05-19 07:03:24serhiy.storchakasetpull_requests: + pull_request13327
2019-05-18 17:07:57mark.dickinsonsetmessages: + msg342802
2019-05-18 16:06:24serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg342801
2019-05-18 15:47:56mark.dickinsonsetmessages: + msg342800
2019-05-18 15:46:48mark.dickinsonsetkeywords: + patch
stage: needs patch -> patch review
pull_requests: + pull_request13316
2019-05-18 15:43:46mark.dickinsoncreate