Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize floor division for ints #70477

Closed
1st1 opened this issue Feb 5, 2016 · 19 comments
Closed

Optimize floor division for ints #70477

1st1 opened this issue Feb 5, 2016 · 19 comments
Assignees
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@1st1
Copy link
Member

1st1 commented Feb 5, 2016

BPO 26289
Nosy @mdickinson, @pitrou, @vstinner, @serhiy-storchaka, @1st1
Files
  • floor_div.patch
  • floor_div_2.patch
  • floor_div_3.patch
  • floor_div_4.patch
  • fast_divmod.patch
  • fast_divmod_2.patch
  • fast_divmod_3.patch
  • fast_divmod_5.patch
  • fast_divmod_6.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/1st1'
    closed_at = <Date 2016-02-11.15:29:32.206>
    created_at = <Date 2016-02-05.02:07:54.315>
    labels = ['interpreter-core', 'performance']
    title = 'Optimize floor division for ints'
    updated_at = <Date 2016-02-11.15:29:32.205>
    user = 'https://github.com/1st1'

    bugs.python.org fields:

    activity = <Date 2016-02-11.15:29:32.205>
    actor = 'yselivanov'
    assignee = 'yselivanov'
    closed = True
    closed_date = <Date 2016-02-11.15:29:32.206>
    closer = 'yselivanov'
    components = ['Interpreter Core']
    creation = <Date 2016-02-05.02:07:54.315>
    creator = 'yselivanov'
    dependencies = []
    files = ['41813', '41860', '41862', '41870', '41871', '41875', '41876', '41886', '41887']
    hgrepos = []
    issue_num = 26289
    keywords = ['patch']
    message_count = 19.0
    messages = ['259617', '259803', '259889', '259893', '259896', '259909', '259932', '259934', '259935', '259938', '259940', '259942', '259944', '259955', '259994', '260019', '260025', '260110', '260112']
    nosy_count = 6.0
    nosy_names = ['mark.dickinson', 'pitrou', 'vstinner', 'python-dev', 'serhiy.storchaka', 'yselivanov']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue26289'
    versions = ['Python 3.6']

    @1st1
    Copy link
    Member Author

    1st1 commented Feb 5, 2016

    The attached patch optimizes floor division for ints.

    ### spectral_norm ###
    Min: 0.319087 -> 0.289172: 1.10x faster
    Avg: 0.322564 -> 0.294319: 1.10x faster
    Significant (t=21.71)
    Stddev: 0.00249 -> 0.01277: 5.1180x larger

    -m timeit -s "x=22331" "x//2;x//3;x//4;x//5;x//6;x//7;x//8;x/99;x//100;"

    with patch: 0.298
    without:    0.515

    @1st1 1st1 added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Feb 5, 2016
    @serhiy-storchaka
    Copy link
    Member

    Added comments on Rietveld.

    @1st1
    Copy link
    Member Author

    1st1 commented Feb 8, 2016

    Serhiy, Victor, thanks for the review. Attaching an updated version of the patch.

    @vstinner
    Copy link
    Member

    vstinner commented Feb 8, 2016

    This change looks related to the issue bpo-21955. IMHO we should take the same decision. I mean, maybe it's better to implement the fast-path only in ceval.c? Or maybe in ceval.c and longobject.c?

    @1st1
    Copy link
    Member Author

    1st1 commented Feb 8, 2016

    There is no drastic difference on where you implement the fast path. I'd implement all specializations/optimizations in longobject.c and optimize ceval to call slots directly. That way, the implact on ceval performance would be minimal.

    @1st1 1st1 self-assigned this Feb 8, 2016
    @1st1
    Copy link
    Member Author

    1st1 commented Feb 9, 2016

    Also, every other operation for longs (except %, for which I created issue bpo-26315) is optimized for single digit longs. This optimization is also important for users of operator.floordiv etc. Even if we decide to provide a fast path in ceval, it's going to be another matter.

    @mdickinson
    Copy link
    Member

    A slightly neater way to compute the result in the case that the signs differ is

    div = ~((left - 1) / right)

    That saves the extra multiplication and equality check.

    @mdickinson
    Copy link
    Member

    ... though on second thoughts, it's probably better to spell that

    div = -1 - (left - 1) / right
    

    to avoid compiler warnings about bit operations on signed types. A good compiler should be able to optimise -1 - x to ~x anyway.

    @vstinner
    Copy link
    Member

    vstinner commented Feb 9, 2016

    STINNER Victor:

    > This change looks related to the issue bpo-21955. IMHO we should take the same decision. I mean, maybe it's better to implement the fast-path only in ceval.c? Or maybe in ceval.c and longobject.c?

    Yury Selivanov:

    There is no drastic difference on where you implement the fast path. I'd implement all specializations/optimizations in longobject.c and optimize ceval to call slots directly. That way, the implact on ceval performance would be minimal.

    Oh wait, I was confused by my own patch for bpo-21955 inlining int operations in ceval.c.

    Since recent benchmarks showed slow-down when ceval.c is modified, I think that it's ok to modify longobject.c rather than ceval.c (and maybe only longobject.c, but let's discuss that in issue bpo-21955).

    @serhiy-storchaka
    Copy link
    Member

    Since the patch is changed (and may be changed further if accept Mark's suggestion), benchmarks results should be updated.

    Is it worth to move the optimization inside l_divmod? Will this speed up or slow down other operations that use l_divmod?

    @1st1
    Copy link
    Member Author

    1st1 commented Feb 9, 2016

    Attaching a new patch -- big thanks to Mark and Serhiy.

    div = ~((left - 1) / right)

    The updated code works slightly faster - ~0.285 usec vs ~0.3 usec.

    @1st1
    Copy link
    Member Author

    1st1 commented Feb 9, 2016

    Is it worth to move the optimization inside l_divmod? Will this speed up or slow down other operations that use l_divmod?

    Attaching a new patch -- fast_divmod.patch

    It combines patches for this issue and issue bpo-26315.

    Individual timeit benchmarks work as fast, but ** op becomes faster:

    -m timeit -s "x=223" "x**2;x**-2;x**2;x**-3;x**3;x**-3;x**4.5;x**-4.5"
    with patch: 1.2usec
    without: 1.5usec

    @mdickinson
    Copy link
    Member

    Just for the record, there's a less-branchy analog of the floor division code I suggested for modulo. In Python-land, it looks like this:

    def propermod(a, b):
        # mimic the C setup
        assert a != 0 and b != 0
        left = abs(a)
        size_a = -1 if a < 0 else 1
        right = abs(b)
        size_b = -1 if b < 0 else 1
    
        # Compute mod: only one branch needed.
        mod = left % right if size_a == size_b else right - 1 - (left - 1) % right
        return mod * size_b
    
    # Verify that we get the same results as the regular mod.
    for n in range(-100, 100):
        if n == 0:
            continue
        for d in range(-100, 100):
            if d == 0:
                continue
            assert propermod(n, d) == n % d

    It may well not have any significant effect here, though.

    @1st1
    Copy link
    Member Author

    1st1 commented Feb 9, 2016

    mod = left % right if size_a == size_b else right - 1 - (left - 1) % right

    This works, Mark! Again, the difference in performance is very subtle, but the code is more compact.

    -m timeit -s "x=22331" "x//2;x//-3;x//4;x//5;x//-6;x//7;x//8;x//-99;x//100;"
    with patch: 0.321 without patch: 0.633

    -m timeit -s "x=22331" "x%2;x%3;x%-4;x%5;x%6;x%-7;x%8;x%99;x%-100;"
    with patch: 0.224 without patch: 0.66

    -m timeit "divmod(100,20);divmod(7,3);divmod(121,99);divmod(123121,123);divmod(-1000,7);divmod(23423,-111)"
    with patch: 0.839 without patch: 1.07

    I'm in favour of fast_divmod_2.patch for solving both this issue and issue bpo-26315. Serhiy, what do you think?

    @mdickinson
    Copy link
    Member

    A couple more observations:

    1. I think you're missing the final multiplication by Py_SIZE(b) in the fast_mod code. Which is odd, because your tests should catch that. So either you didn't run the tests, or that code path isn't being used somehow.

    2. Talking of tests, it would be good to have tests (for both // and %) for the case where the dividend is an exact multiple of the divisor.

    3. Negative divisors almost never come up in real life, so you might also consider restricting the optimisations to the case Py_SIZE(b) == 1 (rather than ABS(Py_SIZE(b)) == 1). If that makes any speed difference at all for the case of positive divisors, then it's probably worth it. If not, then don't worry about it.

    @1st1
    Copy link
    Member Author

    1st1 commented Feb 10, 2016

    Attaching an updated patch.

    1. I think you're missing the final multiplication by Py_SIZE(b) in the fast_mod code. Which is odd, because your tests should catch that. So either you didn't run the tests, or that code path isn't being used somehow.

    Thanks. Not sure how this happened :(

    1. Talking of tests, it would be good to have tests (for both // and %) for the case where the dividend is an exact multiple of the divisor.

    Done.

    1. Negative divisors almost never come up in real life, so you might also consider restricting the optimisations to the case Py_SIZE(b) == 1 (rather than ABS(Py_SIZE(b)) == 1). If that makes any speed difference at all for the case of positive divisors, then it's probably worth it. If not, then don't worry about it.

    Tried it, the difference is very small. For modulo division it's ~0.225 usec vs ~0.23 usec for [-m timeit -s "x=22331" "x%2;x%3;x%4;x%5;x%6;x%7;x%8;x%99;x%100;"]

    @mdickinson
    Copy link
    Member

    Thanks for the updates! No further comments from me - the patch looks good as far as I'm concerned.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Feb 11, 2016

    New changeset 37bacf3fa1f5 by Yury Selivanov in branch 'default':
    Issues bpo-26289 and bpo-26315: Optimize floor/modulo div for single-digit longs
    https://hg.python.org/cpython/rev/37bacf3fa1f5

    @1st1
    Copy link
    Member Author

    1st1 commented Feb 11, 2016

    Committed. Thank you Serhiy, Mark and Victor for helping with the patch!

    @1st1 1st1 closed this as completed Feb 11, 2016
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants