classification
Title: Improve performance of integer exponentiation
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.11
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: mark.dickinson, serhiy.storchaka, steven.daprano, tim.peters
Priority: normal Keywords: patch

Created on 2021-06-10 12:12 by steven.daprano, last changed 2021-06-12 16:49 by tim.peters. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 26662 merged tim.peters, 2021-06-11 02:36
Messages (7)
msg395527 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2021-06-10 12:12
Naively, I assumed that `x**2` would be faster than `x*x` as there is only one name lookup in the first, and two in the second. But it is slower.

The performance of `x**2` relative to `x*x` has gradually deteriorated compared to `x*x` over many versions.

I have found the ratio of `x**2` to `x*x` using timeit:

    pythonX.Y -m timeit -s "x=115" "x**2"
    pythonX.Y -m timeit -s "x=115" "x*x"

for various X.Y:

2.4: 1.1  # ratio of time for x**2 / time for x*x
2.5: 1.5
2.6: 1.0
2.7: 1.6
3.2: 4.2
3.3: 4.2
3.5: 3.8
3.7: 5.9
3.9: 7.3


In the 2.x series, performance was quite close. In 3.x, the ratio has increased significantly. Either integer multiplication has gotten much faster, or exponentiation much slower, or both.

Shockingly (to me at least), an exponent of 1 is an order of magnitude slower than an multiplicand of 1:

2.7: 1.3  # ratio of time for x**1 / time for x*1
3.9: 10.2


Even an exponent of 10 is a little slower than repeated multiplication in 3.9:

`x*x*x*x*x*x*x*x*x*x` is slightly faster than `x**10`.


It would be nice if we could improve the performance of exponentiation.


(OS: Linux)
msg395546 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-06-10 16:01
Is it because exponentiation becomes slower or because multiplication becomes faster? What are absolute numbers?
msg395555 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-06-10 16:51
I can't reproduce this on my Mac laptop (using Python builds from MacPorts). Numbers for both x**2 and x*x are fairly stable across Python 3.2 to Python 3.10. There's some variation, but nothing close to the same extent that Steven is seeing.

Here are my raw numbers:

lovelace:cpython mdickinson$ /opt/local/bin/python3.2 -m timeit -s "x=115" "x*x"
10000000 loops, best of 3: 0.031 usec per loop
lovelace:cpython mdickinson$ /opt/local/bin/python3.3 -m timeit -s "x=115" "x*x"
10000000 loops, best of 3: 0.0297 usec per loop
lovelace:cpython mdickinson$ /opt/local/bin/python3.4 -m timeit -s "x=115" "x*x"
10000000 loops, best of 3: 0.0286 usec per loop
lovelace:cpython mdickinson$ /opt/local/bin/python3.5 -m timeit -s "x=115" "x*x"
10000000 loops, best of 3: 0.03 usec per loop
lovelace:cpython mdickinson$ /opt/local/bin/python3.6 -m timeit -s "x=115" "x*x"
10000000 loops, best of 3: 0.0312 usec per loop
lovelace:cpython mdickinson$ /opt/local/bin/python3.7 -m timeit -s "x=115" "x*x"
10000000 loops, best of 5: 28.7 nsec per loop
lovelace:cpython mdickinson$ /opt/local/bin/python3.8 -m timeit -s "x=115" "x*x"
10000000 loops, best of 5: 32 nsec per loop
lovelace:cpython mdickinson$ /opt/local/bin/python3.9 -m timeit -s "x=115" "x*x"
10000000 loops, best of 5: 33.5 nsec per loop
lovelace:cpython mdickinson$ /opt/local/bin/python3.10 -m timeit -s "x=115" "x*x"
10000000 loops, best of 5: 32.3 nsec per loop
lovelace:cpython mdickinson$ /opt/local/bin/python3.2 -m timeit -s "x=115" "x**2"
1000000 loops, best of 3: 0.249 usec per loop
lovelace:cpython mdickinson$ /opt/local/bin/python3.3 -m timeit -s "x=115" "x**2"
1000000 loops, best of 3: 0.224 usec per loop
lovelace:cpython mdickinson$ /opt/local/bin/python3.4 -m timeit -s "x=115" "x**2"
1000000 loops, best of 3: 0.221 usec per loop
lovelace:cpython mdickinson$ /opt/local/bin/python3.5 -m timeit -s "x=115" "x**2"
1000000 loops, best of 3: 0.213 usec per loop
lovelace:cpython mdickinson$ /opt/local/bin/python3.6 -m timeit -s "x=115" "x**2"
1000000 loops, best of 3: 0.235 usec per loop
lovelace:cpython mdickinson$ /opt/local/bin/python3.7 -m timeit -s "x=115" "x**2"
1000000 loops, best of 5: 204 nsec per loop
lovelace:cpython mdickinson$ /opt/local/bin/python3.8 -m timeit -s "x=115" "x**2"
1000000 loops, best of 5: 217 nsec per loop
lovelace:cpython mdickinson$ /opt/local/bin/python3.9 -m timeit -s "x=115" "x**2"
1000000 loops, best of 5: 245 nsec per loop
lovelace:cpython mdickinson$ /opt/local/bin/python3.10 -m timeit -s "x=115" "x**2"
1000000 loops, best of 5: 230 nsec per loop
msg395560 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2021-06-10 18:18
Under the released 3.9.5 for 64-bit Win10, raising to the power 2 is clearly much slower than multiplying directly:

C:\Windows\System32>py -3 -m timeit -s "x=151" "x*x"
10000000 loops, best of 5: 30 nsec per loop

C:\Windows\System32>py -3 -m timeit -s "x=151" "x**2"
1000000 loops, best of 5: 194 nsec per loop

Since the multiplication itself is cheap, overheads must account for it. Offhand, looks to me like the `x**2` spelling is actually doing 31 multiplications under the covers, although most of them are as cheap as Python-int multiplies get.

        for (i = Py_SIZE(b) - 1; i >= 0; --i) {
            digit bi = b->ob_digit[i];

            for (j = (digit)1 << (PyLong_SHIFT-1); j != 0; j >>= 1) {
                MULT(z, z, z);
                if (bi & j)
                    MULT(z, a, z);
            }
        }

Python ints on a 64-bit box are stored internally in base 2**30 (PyLong_SHIFT is 30). z starts life at 1. The first 28 trips through the loop are chewing up the 28 leading zero bits in exponent 2, so MULT(z, z, z) multiplies 1 by 1 to get 1, 28 times. Then again on the 29th iteration, but now "bi & j" is finally true (we finally found the leading one bit in exponent 2), so z is replaced by 1 times the base = the base. On the final, 30th, iteration, MULT(z, z, z) replaces z with its square, and we're done.

It would probably be worthwhile to add code special-casing the leading Python "digit" of the exponent, fast-forwarding without any multiplies to the leading one bit, and setting z directly to the base then.
msg395597 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2021-06-11 02:51
This is a stab at reducing overhead for small exponents, along the lines I sketched:

https://github.com/python/cpython/pull/26662

Unfortunately, I've been unable to convince BPO and GitHub to recognize that the PR is related to this report. Did something basic change?

Incidentally, at first this change triggered rare shutdown deaths due to negative refcounts, in the collection of small integer objects. That was a head-scratcher! Turns that was, I believe, due to a "temp = NULL" line missing from earlier code introduced to implement powers of modular inverses.
msg395691 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2021-06-12 16:30
New changeset 9d8dd8f08aae4ad6e73a9322a4e9dee965afebbc by Tim Peters in branch 'main':
bpo-44376 - reduce pow() overhead for small exponents (GH-26662)
https://github.com/python/cpython/commit/9d8dd8f08aae4ad6e73a9322a4e9dee965afebbc
msg395692 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2021-06-12 16:49
Closing this now because the pull request did, I believe, all that can be done at the function level. Exponents of 1 and 2 are well within a factor of 2 of repeated multiplication now, and it's essentially a tie at exponent 3 now. Above that, pow() wins now. On my box.

Doing better would require a smarter compiler, which, e.g., knew that `pow(x, 2)` is the same as `x*x`. But, as is, `pow` is just another identifier to CPython's compiler, and may refer to any code at all. `i**2` isn't really much better, because CPython just redirects to type(i)'s __pow__ function at runtime. Which, again to the compiler, may refer to any code at all.

`pow()` is quite an involved function, needing to cater to all sorts of things, including possible reduction by the optional modulus, and possibly negative exponents.

`pow(i, 2)` (same as `i**2` under the covers) does exactly one Python-int multiplication now, same as `i*i`. That's fast. In either case overheads account for the bulk of the elapsed time. The overhead of going around the eval loop an "extra" time (in `i*i`) and doing another name lookup is simply smaller than all the overheads `pow()` incurs to _deduce_, at runtime, that it's only being asked to do one multiply.
History
Date User Action Args
2021-06-12 16:49:02tim.peterssetstatus: open -> closed
resolution: fixed
messages: + msg395692

stage: patch review -> resolved
2021-06-12 16:30:03tim.peterssetmessages: + msg395691
2021-06-11 02:52:00tim.peterssetmessages: + msg395597
2021-06-11 02:36:44tim.peterssetkeywords: + patch
stage: patch review
pull_requests: + pull_request25248
2021-06-10 18:18:00tim.peterssetnosy: + tim.peters
messages: + msg395560
2021-06-10 16:51:54mark.dickinsonsetmessages: + msg395555
2021-06-10 16:01:36serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg395546
2021-06-10 15:20:19xtreaksetnosy: + mark.dickinson
2021-06-10 12:12:15steven.dapranocreate