
Improve performance of integer exponentiation #88542

Closed
stevendaprano opened this issue Jun 10, 2021 · 7 comments
Labels
3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@stevendaprano
Member

BPO 44376
Nosy @tim-one, @mdickinson, @stevendaprano, @serhiy-storchaka
PRs
  • bpo-44376 - reduce pow() overhead for small exponents #26662
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2021-06-12.16:49:02.736>
    created_at = <Date 2021-06-10.12:12:15.241>
    labels = ['interpreter-core', '3.11', 'performance']
    title = 'Improve performance of integer exponentiation'
    updated_at = <Date 2021-06-12.16:49:02.736>
    user = 'https://github.com/stevendaprano'

    bugs.python.org fields:

    activity = <Date 2021-06-12.16:49:02.736>
    actor = 'tim.peters'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-06-12.16:49:02.736>
    closer = 'tim.peters'
    components = ['Interpreter Core']
    creation = <Date 2021-06-10.12:12:15.241>
    creator = 'steven.daprano'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 44376
    keywords = ['patch']
    message_count = 7.0
    messages = ['395527', '395546', '395555', '395560', '395597', '395691', '395692']
    nosy_count = 4.0
    nosy_names = ['tim.peters', 'mark.dickinson', 'steven.daprano', 'serhiy.storchaka']
    pr_nums = ['26662']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue44376'
    versions = ['Python 3.11']

    @stevendaprano
    Member Author

    Naively, I assumed that x**2 would be faster than x*x as there is only one name lookup in the first, and two in the second. But it is slower.

    The performance of x**2 relative to x*x has gradually deteriorated over many versions.

    I measured the ratio of x**2 to x*x using timeit:

    pythonX.Y -m timeit -s "x=115" "x**2"
    pythonX.Y -m timeit -s "x=115" "x*x"
    
    for various X.Y:

    2.4: 1.1 # ratio of time for x**2 / time for x*x
    2.5: 1.5
    2.6: 1.0
    2.7: 1.6
    3.2: 4.2
    3.3: 4.2
    3.5: 3.8
    3.7: 5.9
    3.9: 7.3

    In the 2.x series, performance was quite close. In 3.x, the ratio has increased significantly. Either integer multiplication has gotten much faster, or exponentiation much slower, or both.
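    For reproducibility, the same comparison can be scripted with the timeit module. This is a minimal sketch mirroring the command-line runs above; absolute times and the resulting ratio will vary by machine and interpreter version:

```python
import timeit

# Time both spellings on the current interpreter and report the ratio,
# mirroring the `python -m timeit -s "x=115" ...` runs above.
t_pow = timeit.timeit("x**2", setup="x = 115", number=1_000_000)
t_mul = timeit.timeit("x*x", setup="x = 115", number=1_000_000)
print(f"x**2 / x*x time ratio: {t_pow / t_mul:.2f}")
```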

    Shockingly (to me at least), an exponent of 1 is an order of magnitude slower than a multiplicand of 1:

    2.7: 1.3 # ratio of time for x**1 / time for x*1
    3.9: 10.2

    Even an exponent of 10 is a little slower than repeated multiplication in 3.9:

    x*x*x*x*x*x*x*x*x*x is slightly faster than x**10.

    It would be nice if we could improve the performance of exponentiation.

    (OS: Linux)

    @stevendaprano stevendaprano added 3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Jun 10, 2021
    @serhiy-storchaka
    Member

    Is it because exponentiation became slower, or because multiplication became faster? What are the absolute numbers?

    @mdickinson
    Member

    I can't reproduce this on my Mac laptop (using Python builds from MacPorts). Numbers for both x**2 and x*x are fairly stable across Python 3.2 to Python 3.10. There's some variation, but nothing close to the same extent that Steven is seeing.

    Here are my raw numbers:

    lovelace:cpython mdickinson$ /opt/local/bin/python3.2 -m timeit -s "x=115" "x*x"
    10000000 loops, best of 3: 0.031 usec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.3 -m timeit -s "x=115" "x*x"
    10000000 loops, best of 3: 0.0297 usec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.4 -m timeit -s "x=115" "x*x"
    10000000 loops, best of 3: 0.0286 usec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.5 -m timeit -s "x=115" "x*x"
    10000000 loops, best of 3: 0.03 usec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.6 -m timeit -s "x=115" "x*x"
    10000000 loops, best of 3: 0.0312 usec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.7 -m timeit -s "x=115" "x*x"
    10000000 loops, best of 5: 28.7 nsec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.8 -m timeit -s "x=115" "x*x"
    10000000 loops, best of 5: 32 nsec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.9 -m timeit -s "x=115" "x*x"
    10000000 loops, best of 5: 33.5 nsec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.10 -m timeit -s "x=115" "x*x"
    10000000 loops, best of 5: 32.3 nsec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.2 -m timeit -s "x=115" "x**2"
    1000000 loops, best of 3: 0.249 usec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.3 -m timeit -s "x=115" "x**2"
    1000000 loops, best of 3: 0.224 usec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.4 -m timeit -s "x=115" "x**2"
    1000000 loops, best of 3: 0.221 usec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.5 -m timeit -s "x=115" "x**2"
    1000000 loops, best of 3: 0.213 usec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.6 -m timeit -s "x=115" "x**2"
    1000000 loops, best of 3: 0.235 usec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.7 -m timeit -s "x=115" "x**2"
    1000000 loops, best of 5: 204 nsec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.8 -m timeit -s "x=115" "x**2"
    1000000 loops, best of 5: 217 nsec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.9 -m timeit -s "x=115" "x**2"
    1000000 loops, best of 5: 245 nsec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.10 -m timeit -s "x=115" "x**2"
    1000000 loops, best of 5: 230 nsec per loop

    @tim-one
    Member

    tim-one commented Jun 10, 2021

    Under the released 3.9.5 for 64-bit Win10, raising to the power 2 is clearly much slower than multiplying directly:

    C:\Windows\System32>py -3 -m timeit -s "x=151" "x*x"
    10000000 loops, best of 5: 30 nsec per loop

    C:\Windows\System32>py -3 -m timeit -s "x=151" "x**2"
    1000000 loops, best of 5: 194 nsec per loop

    Since the multiplication itself is cheap, overheads must account for it. Offhand, looks to me like the x**2 spelling is actually doing 31 multiplications under the covers, although most of them are as cheap as Python-int multiplies get.

            for (i = Py_SIZE(b) - 1; i >= 0; --i) {
                digit bi = b->ob_digit[i];
    
                for (j = (digit)1 << (PyLong_SHIFT-1); j != 0; j >>= 1) {
                    MULT(z, z, z);
                    if (bi & j)
                        MULT(z, a, z);
                }
            }

    Python ints on a 64-bit box are stored internally in base 2**30 (PyLong_SHIFT is 30). z starts life at 1. The first 28 trips through the loop are chewing up the 28 leading zero bits in exponent 2, so MULT(z, z, z) multiplies 1 by 1 to get 1, 28 times. Then again on the 29th iteration, but now "bi & j" is finally true (we finally found the leading one bit in exponent 2), so z is replaced by 1 times the base = the base. On the final, 30th, iteration, MULT(z, z, z) replaces z with its square, and we're done.
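    The loop above can be modelled in pure Python to confirm that count. This is a hypothetical sketch with illustrative names (`pow_unoptimized`, the `mults` counter), not CPython's actual implementation (which lives in Objects/longobject.c), and it assumes the exponent fits in one base-2**30 "digit":

```python
# Pure-Python model of the C square-and-multiply loop shown above.
PyLong_SHIFT = 30

def pow_unoptimized(a, b_digit):
    """Walk all 30 bits of a one-digit exponent, counting multiplications."""
    z = 1
    mults = 0
    j = 1 << (PyLong_SHIFT - 1)    # start at the top bit of the digit
    while j != 0:
        z = z * z                  # MULT(z, z, z)
        mults += 1
        if b_digit & j:            # this exponent bit is set
            z = z * a              # MULT(z, a, z)
            mults += 1
        j >>= 1
    return z, mults

# Exponent 2 costs 31 multiplications, though only one does real work:
print(pow_unoptimized(115, 2))   # (13225, 31)
```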

    It would probably be worthwhile to add code special-casing the leading Python "digit" of the exponent, fast-forwarding without any multiplies to the leading one bit, and setting z directly to the base then.
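    A sketch of that fast-forwarding idea, again in pure Python and again assuming a single-digit exponent (`pow_fastforward` is an illustrative name, not the code from the actual patch): skip straight past the leading zero bits and start z at the base, so exponent 2 needs only one multiplication.

```python
def pow_fastforward(a, b_digit):
    """Square-and-multiply starting from the exponent's leading one bit."""
    if b_digit == 0:
        return 1, 0
    z = a                          # z set directly to the base
    mults = 0
    bits = b_digit.bit_length()
    j = (1 << (bits - 2)) if bits > 1 else 0   # bit below the leading one
    while j:
        z = z * z                  # square
        mults += 1
        if b_digit & j:            # fold in the next exponent bit
            z = z * a
            mults += 1
        j >>= 1
    return z, mults

# One multiplication instead of 31 for exponent 2:
print(pow_fastforward(115, 2))   # (13225, 1)
```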

    @tim-one
    Member

    tim-one commented Jun 11, 2021

    This is a stab at reducing overhead for small exponents, along the lines I sketched:

    #26662

    Unfortunately, I've been unable to convince BPO and GitHub to recognize that the PR is related to this report. Did something basic change?

    Incidentally, at first this change triggered rare shutdown deaths due to negative refcounts in the collection of small integer objects. That was a head-scratcher! Turns out that was, I believe, due to a "temp = NULL" line missing from earlier code introduced to implement powers of modular inverses.

    @tim-one
    Member

    tim-one commented Jun 12, 2021

    New changeset 9d8dd8f by Tim Peters in branch 'main':
    bpo-44376 - reduce pow() overhead for small exponents (GH-26662)

    @tim-one
    Member

    tim-one commented Jun 12, 2021

    Closing this now because the pull request did, I believe, all that can be done at the function level. Exponents of 1 and 2 are well within a factor of 2 of repeated multiplication now, and it's essentially a tie at exponent 3 now. Above that, pow() wins now. On my box.

    Doing better would require a smarter compiler, which, e.g., knew that pow(x, 2) is the same as x*x. But, as is, pow is just another identifier to CPython's compiler, and may refer to any code at all. i**2 isn't really much better, because CPython just redirects to type(i)'s pow function at runtime. Which, again to the compiler, may refer to any code at all.
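    That compiler limitation is visible with the dis module: i**2 compiles to a single generic power instruction whose meaning is only resolved at runtime (the exact opcode name varies by version, e.g. BINARY_POWER before 3.11 and BINARY_OP from 3.11 on):

```python
import dis

# The compiler emits one generic power instruction for x**2; it cannot
# know at compile time what type(x)'s power slot will do at runtime.
dis.dis(compile("x**2", "<demo>", "eval"))
```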

    pow() is quite an involved function, needing to cater to all sorts of things, including possible reduction by the optional modulus, and possibly negative exponents.

    pow(i, 2) (same as i**2 under the covers) does exactly one Python-int multiplication now, same as i*i. That's fast. In either case overheads account for the bulk of the elapsed time. The overhead of going around the eval loop an "extra" time (in i*i) and doing another name lookup is simply smaller than all the overheads pow() incurs to deduce, at runtime, that it's only being asked to do one multiply.

    @tim-one tim-one closed this as completed Jun 12, 2021
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022