
Improve performance of integer exponentiation #88542

Closed
stevendaprano opened this issue Jun 10, 2021 · 7 comments
Labels
3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@stevendaprano
Member

BPO 44376
Nosy @tim-one, @mdickinson, @stevendaprano, @serhiy-storchaka
PRs
  • bpo-44376 - reduce pow() overhead for small exponents #26662
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2021-06-12.16:49:02.736>
    created_at = <Date 2021-06-10.12:12:15.241>
    labels = ['interpreter-core', '3.11', 'performance']
    title = 'Improve performance of integer exponentiation'
    updated_at = <Date 2021-06-12.16:49:02.736>
    user = 'https://github.com/stevendaprano'

    bugs.python.org fields:

    activity = <Date 2021-06-12.16:49:02.736>
    actor = 'tim.peters'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-06-12.16:49:02.736>
    closer = 'tim.peters'
    components = ['Interpreter Core']
    creation = <Date 2021-06-10.12:12:15.241>
    creator = 'steven.daprano'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 44376
    keywords = ['patch']
    message_count = 7.0
    messages = ['395527', '395546', '395555', '395560', '395597', '395691', '395692']
    nosy_count = 4.0
    nosy_names = ['tim.peters', 'mark.dickinson', 'steven.daprano', 'serhiy.storchaka']
    pr_nums = ['26662']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue44376'
    versions = ['Python 3.11']

    @stevendaprano
    Member Author

    Naively, I assumed that x**2 would be faster than x*x as there is only one name lookup in the first, and two in the second. But it is slower.

    The performance of x**2 relative to x*x has gradually deteriorated over many versions.

    I measured the ratio of x**2 to x*x using timeit:

    pythonX.Y -m timeit -s "x=115" "x**2"
    pythonX.Y -m timeit -s "x=115" "x*x"
    
    for various X.Y:

    2.4: 1.1 # ratio of time for x**2 / time for x*x
    2.5: 1.5
    2.6: 1.0
    2.7: 1.6
    3.2: 4.2
    3.3: 4.2
    3.5: 3.8
    3.7: 5.9
    3.9: 7.3

    In the 2.x series, performance was quite close. In 3.x, the ratio has increased significantly. Either integer multiplication has gotten much faster, or exponentiation much slower, or both.
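    For reproducibility, the same comparison can be scripted with the timeit module. This is a minimal sketch mirroring the command-line runs above; absolute times and the resulting ratio will vary by machine and interpreter version:

```python
import timeit

# Time both spellings on the current interpreter and report the ratio,
# mirroring the `python -m timeit -s "x=115" ...` runs above.
t_pow = timeit.timeit("x**2", setup="x = 115", number=1_000_000)
t_mul = timeit.timeit("x*x", setup="x = 115", number=1_000_000)
print(f"x**2 / x*x time ratio: {t_pow / t_mul:.2f}")
```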

    Shockingly (to me at least), an exponent of 1 is an order of magnitude slower than a multiplicand of 1:

    2.7: 1.3 # ratio of time for x**1 / time for x*1
    3.9: 10.2

    Even an exponent of 10 is a little slower than repeated multiplication in 3.9:

    x*x*x*x*x*x*x*x*x*x is slightly faster than x**10.

    It would be nice if we could improve the performance of exponentiation.

    (OS: Linux)

    @stevendaprano stevendaprano added 3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Jun 10, 2021
    @serhiy-storchaka
    Member

    Is it because exponentiation became slower, or because multiplication became faster? What are the absolute numbers?

    @mdickinson
    Member

    I can't reproduce this on my Mac laptop (using Python builds from MacPorts). Numbers for both x**2 and x*x are fairly stable across Python 3.2 to Python 3.10. There's some variation, but nothing close to the same extent that Steven is seeing.

    Here are my raw numbers:

    lovelace:cpython mdickinson$ /opt/local/bin/python3.2 -m timeit -s "x=115" "x*x"
    10000000 loops, best of 3: 0.031 usec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.3 -m timeit -s "x=115" "x*x"
    10000000 loops, best of 3: 0.0297 usec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.4 -m timeit -s "x=115" "x*x"
    10000000 loops, best of 3: 0.0286 usec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.5 -m timeit -s "x=115" "x*x"
    10000000 loops, best of 3: 0.03 usec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.6 -m timeit -s "x=115" "x*x"
    10000000 loops, best of 3: 0.0312 usec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.7 -m timeit -s "x=115" "x*x"
    10000000 loops, best of 5: 28.7 nsec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.8 -m timeit -s "x=115" "x*x"
    10000000 loops, best of 5: 32 nsec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.9 -m timeit -s "x=115" "x*x"
    10000000 loops, best of 5: 33.5 nsec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.10 -m timeit -s "x=115" "x*x"
    10000000 loops, best of 5: 32.3 nsec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.2 -m timeit -s "x=115" "x**2"
    1000000 loops, best of 3: 0.249 usec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.3 -m timeit -s "x=115" "x**2"
    1000000 loops, best of 3: 0.224 usec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.4 -m timeit -s "x=115" "x**2"
    1000000 loops, best of 3: 0.221 usec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.5 -m timeit -s "x=115" "x**2"
    1000000 loops, best of 3: 0.213 usec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.6 -m timeit -s "x=115" "x**2"
    1000000 loops, best of 3: 0.235 usec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.7 -m timeit -s "x=115" "x**2"
    1000000 loops, best of 5: 204 nsec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.8 -m timeit -s "x=115" "x**2"
    1000000 loops, best of 5: 217 nsec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.9 -m timeit -s "x=115" "x**2"
    1000000 loops, best of 5: 245 nsec per loop
    lovelace:cpython mdickinson$ /opt/local/bin/python3.10 -m timeit -s "x=115" "x**2"
    1000000 loops, best of 5: 230 nsec per loop

    @tim-one
    Member

    tim-one commented Jun 10, 2021

    Under the released 3.9.5 for 64-bit Win10, raising to the power 2 is clearly much slower than multiplying directly:

    C:\Windows\System32>py -3 -m timeit -s "x=151" "x*x"
    10000000 loops, best of 5: 30 nsec per loop

    C:\Windows\System32>py -3 -m timeit -s "x=151" "x**2"
    1000000 loops, best of 5: 194 nsec per loop

    Since the multiplication itself is cheap, overheads must account for it. Offhand, looks to me like the x**2 spelling is actually doing 31 multiplications under the covers, although most of them are as cheap as Python-int multiplies get.

            for (i = Py_SIZE(b) - 1; i >= 0; --i) {
                digit bi = b->ob_digit[i];
    
                for (j = (digit)1 << (PyLong_SHIFT-1); j != 0; j >>= 1) {
                    MULT(z, z, z);
                    if (bi & j)
                        MULT(z, a, z);
                }
            }

    Python ints on a 64-bit box are stored internally in base 2**30 (PyLong_SHIFT is 30). z starts life at 1. The first 28 trips through the loop are chewing up the 28 leading zero bits in exponent 2, so MULT(z, z, z) multiplies 1 by 1 to get 1, 28 times. Then again on the 29th iteration, but now "bi & j" is finally true (we finally found the leading one bit in exponent 2), so z is replaced by 1 times the base = the base. On the final, 30th, iteration, MULT(z, z, z) replaces z with its square, and we're done.
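    The loop above can be modelled in pure Python to confirm that count. This is a hypothetical sketch with illustrative names (`pow_unoptimized`, the `mults` counter), not CPython's actual implementation (which lives in Objects/longobject.c), and it assumes the exponent fits in one base-2**30 "digit":

```python
# Pure-Python model of the C square-and-multiply loop shown above.
PyLong_SHIFT = 30

def pow_unoptimized(a, b_digit):
    """Walk all 30 bits of a one-digit exponent, counting multiplications."""
    z = 1
    mults = 0
    j = 1 << (PyLong_SHIFT - 1)    # start at the top bit of the digit
    while j != 0:
        z = z * z                  # MULT(z, z, z)
        mults += 1
        if b_digit & j:            # this exponent bit is set
            z = z * a              # MULT(z, a, z)
            mults += 1
        j >>= 1
    return z, mults

# Exponent 2 costs 31 multiplications, though only one does real work:
print(pow_unoptimized(115, 2))   # (13225, 31)
```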

    It would probably be worthwhile to add code special-casing the leading Python "digit" of the exponent, fast-forwarding without any multiplies to the leading one bit, and setting z directly to the base then.
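    A sketch of that fast-forwarding idea, again in pure Python and again assuming a single-digit exponent (`pow_fastforward` is an illustrative name, not the code from the actual patch): skip straight past the leading zero bits and start z at the base, so exponent 2 needs only one multiplication.

```python
def pow_fastforward(a, b_digit):
    """Square-and-multiply starting from the exponent's leading one bit."""
    if b_digit == 0:
        return 1, 0
    z = a                          # z set directly to the base
    mults = 0
    bits = b_digit.bit_length()
    j = (1 << (bits - 2)) if bits > 1 else 0   # bit below the leading one
    while j:
        z = z * z                  # square
        mults += 1
        if b_digit & j:            # fold in the next exponent bit
            z = z * a
            mults += 1
        j >>= 1
    return z, mults

# One multiplication instead of 31 for exponent 2:
print(pow_fastforward(115, 2))   # (13225, 1)
```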

    @tim-one
    Member

    tim-one commented Jun 11, 2021

    This is a stab at reducing overhead for small exponents, along the lines I sketched:

    #26662

    Unfortunately, I've been unable to convince BPO and GitHub to recognize that the PR is related to this report. Did something basic change?

    Incidentally, at first this change triggered rare shutdown deaths due to negative refcounts in the collection of small integer objects. That was a head-scratcher! Turns out that was, I believe, due to a "temp = NULL" line missing from earlier code introduced to implement powers of modular inverses.

    @tim-one
    Member

    tim-one commented Jun 12, 2021

    New changeset 9d8dd8f by Tim Peters in branch 'main':
    bpo-44376 - reduce pow() overhead for small exponents (GH-26662)

    @tim-one
    Member

    tim-one commented Jun 12, 2021

    Closing this now because the pull request did, I believe, all that can be done at the function level. Exponents of 1 and 2 are well within a factor of 2 of repeated multiplication now, and it's essentially a tie at exponent 3 now. Above that, pow() wins now. On my box.

    Doing better would require a smarter compiler, which, e.g., knew that pow(x, 2) is the same as x*x. But, as is, pow is just another identifier to CPython's compiler, and may refer to any code at all. i**2 isn't really much better, because CPython just redirects to type(i)'s pow function at runtime. Which, again to the compiler, may refer to any code at all.
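    That compiler limitation is visible with the dis module: i**2 compiles to a single generic power instruction whose meaning is only resolved at runtime (the exact opcode name varies by version, e.g. BINARY_POWER before 3.11 and BINARY_OP from 3.11 on):

```python
import dis

# The compiler emits one generic power instruction for x**2; it cannot
# know at compile time what type(x)'s power slot will do at runtime.
dis.dis(compile("x**2", "<demo>", "eval"))
```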

    pow() is quite an involved function, needing to cater to all sorts of things, including possible reduction by the optional modulus, and possibly negative exponents.

    pow(i, 2) (same as i**2 under the covers) does exactly one Python-int multiplication now, same as i*i. That's fast. In either case overheads account for the bulk of the elapsed time. The overhead of going around the eval loop an "extra" time (in i*i) and doing another name lookup is simply smaller than all the overheads pow() incurs to deduce, at runtime, that it's only being asked to do one multiply.

    @tim-one tim-one closed this as completed Jun 12, 2021
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022