Issue44376
This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2021-06-10 12:12 by steven.daprano, last changed 2022-04-11 14:59 by admin. This issue is now closed.
Pull Requests | |||
---|---|---|---|
URL | Status | Linked | Edit |
PR 26662 | merged | tim.peters, 2021-06-11 02:36 |
Messages (7) | |||
---|---|---|---|
msg395527 - (view) | Author: Steven D'Aprano (steven.daprano) * | Date: 2021-06-10 12:12 | |
Naively, I assumed that `x**2` would be faster than `x*x` as there is only one name lookup in the first, and two in the second. But it is slower. The performance of `x**2` relative to `x*x` has gradually deteriorated compared to `x*x` over many versions. I have found the ratio of `x**2` to `x*x` using timeit: pythonX.Y -m timeit -s "x=115" "x**2" pythonX.Y -m timeit -s "x=115" "x*x" for various X.Y: 2.4: 1.1 # ratio of time for x**2 / time for x*x 2.5: 1.5 2.6: 1.0 2.7: 1.6 3.2: 4.2 3.3: 4.2 3.5: 3.8 3.7: 5.9 3.9: 7.3 In the 2.x series, performance was quite close. In 3.x, the ratio has increased significantly. Either integer multiplication has gotten much faster, or exponentiation much slower, or both. Shockingly (to me at least), an exponent of 1 is an order of magnitude slower than an multiplicand of 1: 2.7: 1.3 # ratio of time for x**1 / time for x*1 3.9: 10.2 Even an exponent of 10 is a little slower than repeated multiplication in 3.9: `x*x*x*x*x*x*x*x*x*x` is slightly faster than `x**10`. It would be nice if we could improve the performance of exponentiation. (OS: Linux) |
|||
msg395546 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2021-06-10 16:01 | |
Is it because exponentiation becomes slower or because multiplication becomes faster? What are absolute numbers? |
|||
msg395555 - (view) | Author: Mark Dickinson (mark.dickinson) * | Date: 2021-06-10 16:51 | |
I can't reproduce this on my Mac laptop (using Python builds from MacPorts). Numbers for both x**2 and x*x are fairly stable across Python 3.2 to Python 3.10. There's some variation, but nothing close to the same extent that Steven is seeing. Here are my raw numbers: lovelace:cpython mdickinson$ /opt/local/bin/python3.2 -m timeit -s "x=115" "x*x" 10000000 loops, best of 3: 0.031 usec per loop lovelace:cpython mdickinson$ /opt/local/bin/python3.3 -m timeit -s "x=115" "x*x" 10000000 loops, best of 3: 0.0297 usec per loop lovelace:cpython mdickinson$ /opt/local/bin/python3.4 -m timeit -s "x=115" "x*x" 10000000 loops, best of 3: 0.0286 usec per loop lovelace:cpython mdickinson$ /opt/local/bin/python3.5 -m timeit -s "x=115" "x*x" 10000000 loops, best of 3: 0.03 usec per loop lovelace:cpython mdickinson$ /opt/local/bin/python3.6 -m timeit -s "x=115" "x*x" 10000000 loops, best of 3: 0.0312 usec per loop lovelace:cpython mdickinson$ /opt/local/bin/python3.7 -m timeit -s "x=115" "x*x" 10000000 loops, best of 5: 28.7 nsec per loop lovelace:cpython mdickinson$ /opt/local/bin/python3.8 -m timeit -s "x=115" "x*x" 10000000 loops, best of 5: 32 nsec per loop lovelace:cpython mdickinson$ /opt/local/bin/python3.9 -m timeit -s "x=115" "x*x" 10000000 loops, best of 5: 33.5 nsec per loop lovelace:cpython mdickinson$ /opt/local/bin/python3.10 -m timeit -s "x=115" "x*x" 10000000 loops, best of 5: 32.3 nsec per loop lovelace:cpython mdickinson$ /opt/local/bin/python3.2 -m timeit -s "x=115" "x**2" 1000000 loops, best of 3: 0.249 usec per loop lovelace:cpython mdickinson$ /opt/local/bin/python3.3 -m timeit -s "x=115" "x**2" 1000000 loops, best of 3: 0.224 usec per loop lovelace:cpython mdickinson$ /opt/local/bin/python3.4 -m timeit -s "x=115" "x**2" 1000000 loops, best of 3: 0.221 usec per loop lovelace:cpython mdickinson$ /opt/local/bin/python3.5 -m timeit -s "x=115" "x**2" 1000000 loops, best of 3: 0.213 usec per loop lovelace:cpython mdickinson$ /opt/local/bin/python3.6 -m timeit -s "x=115" "x**2" 1000000 loops, best of 3: 0.235 usec per loop lovelace:cpython mdickinson$ /opt/local/bin/python3.7 -m timeit -s "x=115" "x**2" 1000000 loops, best of 5: 204 nsec per loop lovelace:cpython mdickinson$ /opt/local/bin/python3.8 -m timeit -s "x=115" "x**2" 1000000 loops, best of 5: 217 nsec per loop lovelace:cpython mdickinson$ /opt/local/bin/python3.9 -m timeit -s "x=115" "x**2" 1000000 loops, best of 5: 245 nsec per loop lovelace:cpython mdickinson$ /opt/local/bin/python3.10 -m timeit -s "x=115" "x**2" 1000000 loops, best of 5: 230 nsec per loop |
|||
msg395560 - (view) | Author: Tim Peters (tim.peters) * | Date: 2021-06-10 18:18 | |
Under the released 3.9.5 for 64-bit Win10, raising to the power 2 is clearly much slower than multiplying directly: C:\Windows\System32>py -3 -m timeit -s "x=151" "x*x" 10000000 loops, best of 5: 30 nsec per loop C:\Windows\System32>py -3 -m timeit -s "x=151" "x**2" 1000000 loops, best of 5: 194 nsec per loop Since the multiplication itself is cheap, overheads must account for it. Offhand, looks to me like the `x**2` spelling is actually doing 31 multiplications under the covers, although most of them are as cheap as Python-int multiplies get. for (i = Py_SIZE(b) - 1; i >= 0; --i) { digit bi = b->ob_digit[i]; for (j = (digit)1 << (PyLong_SHIFT-1); j != 0; j >>= 1) { MULT(z, z, z); if (bi & j) MULT(z, a, z); } } Python ints on a 64-bit box are stored internally in base 2**30 (PyLong_SHIFT is 30). z starts life at 1. The first 28 trips through the loop are chewing up the 28 leading zero bits in exponent 2, so MULT(z, z, z) multiplies 1 by 1 to get 1, 28 times. Then again on the 29th iteration, but now "bi & j" is finally true (we finally found the leading one bit in exponent 2), so z is replaced by 1 times the base = the base. On the final, 30th, iteration, MULT(z, z, z) replaces z with its square, and we're done. It would probably be worthwhile to add code special-casing the leading Python "digit" of the exponent, fast-forwarding without any multiplies to the leading one bit, and setting z directly to the base then. |
|||
msg395597 - (view) | Author: Tim Peters (tim.peters) * | Date: 2021-06-11 02:51 | |
This is a stab at reducing overhead for small exponents, along the lines I sketched: https://github.com/python/cpython/pull/26662 Unfortunately, I've been unable to convince BPO and GitHub to recognize that the PR is related to this report. Did something basic change? Incidentally, at first this change triggered rare shutdown deaths due to negative refcounts, in the collection of small integer objects. That was a head-scratcher! Turns that was, I believe, due to a "temp = NULL" line missing from earlier code introduced to implement powers of modular inverses. |
|||
msg395691 - (view) | Author: Tim Peters (tim.peters) * | Date: 2021-06-12 16:30 | |
New changeset 9d8dd8f08aae4ad6e73a9322a4e9dee965afebbc by Tim Peters in branch 'main': bpo-44376 - reduce pow() overhead for small exponents (GH-26662) https://github.com/python/cpython/commit/9d8dd8f08aae4ad6e73a9322a4e9dee965afebbc |
|||
msg395692 - (view) | Author: Tim Peters (tim.peters) * | Date: 2021-06-12 16:49 | |
Closing this now because the pull request did, I believe, all that can be done at the function level. Exponents of 1 and 2 are well within a factor of 2 of repeated multiplication now, and it's essentially a tie at exponent 3 now. Above that, pow() wins now. On my box. Doing better would require a smarter compiler, which, e.g., knew that `pow(x, 2)` is the same as `x*x`. But, as is, `pow` is just another identifier to CPython's compiler, and may refer to any code at all. `i**2` isn't really much better, because CPython just redirects to type(i)'s __pow__ function at runtime. Which, again to the compiler, may refer to any code at all. `pow()` is quite an involved function, needing to cater to all sorts of things, including possible reduction by the optional modulus, and possibly negative exponents. `pow(i, 2)` (same as `i**2` under the covers) does exactly one Python-int multiplication now, same as `i*i`. That's fast. In either case overheads account for the bulk of the elapsed time. The overhead of going around the eval loop an "extra" time (in `i*i`) and doing another name lookup is simply smaller than all the overheads `pow()` incurs to _deduce_, at runtime, that it's only being asked to do one multiply. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:59:46 | admin | set | github: 88542 |
2021-06-12 16:49:02 | tim.peters | set | status: open -> closed resolution: fixed messages: + msg395692 stage: patch review -> resolved |
2021-06-12 16:30:03 | tim.peters | set | messages: + msg395691 |
2021-06-11 02:52:00 | tim.peters | set | messages: + msg395597 |
2021-06-11 02:36:44 | tim.peters | set | keywords:
+ patch stage: patch review pull_requests: + pull_request25248 |
2021-06-10 18:18:00 | tim.peters | set | nosy:
+ tim.peters messages: + msg395560 |
2021-06-10 16:51:54 | mark.dickinson | set | messages: + msg395555 |
2021-06-10 16:01:36 | serhiy.storchaka | set | nosy:
+ serhiy.storchaka messages: + msg395546 |
2021-06-10 15:20:19 | xtreak | set | nosy:
+ mark.dickinson |
2021-06-10 12:12:15 | steven.daprano | create |