classification
Title: Optimize floor division for ints
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: yselivanov Nosy List: mark.dickinson, pitrou, python-dev, serhiy.storchaka, vstinner, yselivanov
Priority: normal Keywords: patch

Created on 2016-02-05 02:07 by yselivanov, last changed 2016-02-11 15:29 by yselivanov. This issue is now closed.

Files
File name Uploaded Description Edit
floor_div.patch yselivanov, 2016-02-05 02:07 review
floor_div_2.patch yselivanov, 2016-02-08 21:33 review
floor_div_3.patch yselivanov, 2016-02-08 22:36 review
floor_div_4.patch yselivanov, 2016-02-09 15:59 review
fast_divmod.patch yselivanov, 2016-02-09 16:03 review
fast_divmod_2.patch yselivanov, 2016-02-09 21:57 review
fast_divmod_3.patch yselivanov, 2016-02-09 22:07 review
fast_divmod_5.patch yselivanov, 2016-02-10 15:32 review
fast_divmod_6.patch yselivanov, 2016-02-10 17:09 review
Messages (19)
msg259617 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-05 02:07
The attached patch optimizes floor division for ints.

### spectral_norm ###
Min: 0.319087 -> 0.289172: 1.10x faster
Avg: 0.322564 -> 0.294319: 1.10x faster
Significant (t=21.71)
Stddev: 0.00249 -> 0.01277: 5.1180x larger


-m timeit -s "x=22331" "x//2;x//3;x//4;x//5;x//6;x//7;x//8;x/99;x//100;"

with patch: 0.298
without:    0.515
msg259803 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-07 21:07
Added comments on Rietveld.
msg259889 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-08 21:33
Serhiy, Victor, thanks for the review. Attaching an updated version of the patch.
msg259893 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-08 22:18
This change looks related to the issue #21955. IMHO we should take the same decision. I mean, maybe it's better to implement the fast-path only in ceval.c? Or maybe in ceval.c and longobject.c?
msg259896 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-08 22:23
There is no drastic difference on where you implement the fast path.  I'd implement all specializations/optimizations in longobject.c and optimize ceval to call slots directly.  That way, the implact on ceval performance would be minimal.
msg259909 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-09 02:35
Also, every other operation for longs (except %, for which I created issue #26315) is optimized for single digit longs.  This optimization is also important for users of operator.floordiv etc.  Even if we decide to provide a fast path in ceval, it's going to be another matter.
msg259932 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-09 13:41
A slightly neater way to compute the result in the case that the signs differ is 

  div = ~((left - 1) / right)

That saves the extra multiplication and equality check.
msg259934 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-09 14:05
... though on second thoughts, it's probably better to spell that

    div = -1 - (left - 1) / right

to avoid compiler warnings about bit operations on signed types. A good compiler should be able to optimise `-1 - x` to `~x` anyway.
msg259935 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-09 14:08
STINNER Victor:
>> This change looks related to the issue #21955. IMHO we should take the same decision. I mean, maybe it's better to implement the fast-path only in ceval.c? Or maybe in ceval.c and longobject.c?

Yury Selivanov:
> There is no drastic difference on where you implement the fast path.  I'd implement all specializations/optimizations in longobject.c and optimize ceval to call slots directly.  That way, the implact on ceval performance would be minimal.

Oh wait, I was confused by my own patch for #21955 inlining int operations in ceval.c.

Since recent benchmarks showed slow-down when ceval.c is modified, I think that it's ok to modify longobject.c rather than ceval.c (and maybe only longobject.c, but let's discuss that in issue #21955).
msg259938 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-09 15:20
Since the patch is changed (and may be changed further if accept Mark's suggestion), benchmarks results should be updated.

Is it worth to move the optimization inside l_divmod? Will this speed up or slow down other operations that use l_divmod?
msg259940 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-09 15:59
Attaching a new patch -- big thanks to Mark and Serhiy.

> div = ~((left - 1) / right)

The updated code works slightly faster - ~0.285 usec vs ~0.3 usec.
msg259942 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-09 16:03
> Is it worth to move the optimization inside l_divmod? Will this speed up or slow down other operations that use l_divmod?

Attaching a new patch -- fast_divmod.patch

It combines patches for this issue and issue #26315.

Individual timeit benchmarks work as fast, but ** op becomes faster:

-m timeit -s "x=223" "x**2;x**-2;x**2;x**-3;x**3;x**-3;x**4.5;x**-4.5"
with patch: 1.2usec
without: 1.5usec
msg259944 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-09 16:35
Just for the record, there's a less-branchy analog of the floor division code I suggested for modulo. In Python-land, it looks like this:

def propermod(a, b):
    # mimic the C setup
    assert a != 0 and b != 0
    left = abs(a)
    size_a = -1 if a < 0 else 1
    right = abs(b)
    size_b = -1 if b < 0 else 1

    # Compute mod: only one branch needed.
    mod = left % right if size_a == size_b else right - 1 - (left - 1) % right
    return mod * size_b

# Verify that we get the same results as the regular mod.
for n in range(-100, 100):
    if n == 0:
        continue
    for d in range(-100, 100):
        if d == 0:
            continue
        assert propermod(n, d) == n % d

It may well not have any significant effect here, though.
msg259955 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-09 21:57
> mod = left % right if size_a == size_b else right - 1 - (left - 1) % right

This works, Mark!  Again, the difference in performance is very subtle, but the code is more compact.


-m timeit -s "x=22331" "x//2;x//-3;x//4;x//5;x//-6;x//7;x//8;x//-99;x//100;"
with patch: 0.321          without patch: 0.633

-m timeit -s "x=22331" "x%2;x%3;x%-4;x%5;x%6;x%-7;x%8;x%99;x%-100;"
with patch: 0.224          without patch: 0.66

-m timeit "divmod(100,20);divmod(7,3);divmod(121,99);divmod(123121,123);divmod(-1000,7);divmod(23423,-111)"
with patch: 0.839          without patch: 1.07


I'm in favour of fast_divmod_2.patch for solving both this issue and issue #26315.  Serhiy, what do you think?
msg259994 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-10 08:52
A couple more observations:

1. I think you're missing the final multiplication by Py_SIZE(b) in the fast_mod code. Which is odd, because your tests should catch that. So either you didn't run the tests, or that code path isn't being used somehow.

2. Talking of tests, it would be good to have tests (for both // and %) for the case where the dividend is an exact multiple of the divisor.

3. Negative divisors almost never come up in real life, so you might also consider restricting the optimisations to the case Py_SIZE(b) == 1 (rather than ABS(Py_SIZE(b)) == 1). If that makes any speed difference at all for the case of positive divisors, then it's probably worth it. If not, then don't worry about it.
msg260019 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-10 15:32
Attaching an updated patch.

> 1. I think you're missing the final multiplication by Py_SIZE(b) in the fast_mod code. Which is odd, because your tests should catch that. So either you didn't run the tests, or that code path isn't being used somehow.

Thanks.  Not sure how this happened :(

> 2. Talking of tests, it would be good to have tests (for both // and %) for the case where the dividend is an exact multiple of the divisor.

Done.

> 3. Negative divisors almost never come up in real life, so you might also consider restricting the optimisations to the case Py_SIZE(b) == 1 (rather than ABS(Py_SIZE(b)) == 1). If that makes any speed difference at all for the case of positive divisors, then it's probably worth it. If not, then don't worry about it.

Tried it, the difference is very small.  For modulo division it's ~0.225 usec vs ~0.23 usec for [-m timeit -s "x=22331" "x%2;x%3;x%4;x%5;x%6;x%7;x%8;x%99;x%100;"]
msg260025 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-10 16:12
Thanks for the updates! No further comments from me - the patch looks good as far as I'm concerned.
msg260110 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-02-11 15:27
New changeset 37bacf3fa1f5 by Yury Selivanov in branch 'default':
Issues #26289 and #26315: Optimize floor/modulo div for single-digit longs
https://hg.python.org/cpython/rev/37bacf3fa1f5
msg260112 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-11 15:29
Committed.  Thank you Serhiy, Mark and Victor for helping with the patch!
History
Date User Action Args
2016-02-11 15:29:32yselivanovsetstatus: open -> closed
resolution: fixed
messages: + msg260112

stage: patch review -> resolved
2016-02-11 15:27:00python-devsetnosy: + python-dev
messages: + msg260110
2016-02-10 17:09:49yselivanovsetfiles: + fast_divmod_6.patch
2016-02-10 16:12:31mark.dickinsonsetmessages: + msg260025
2016-02-10 15:32:27yselivanovsetfiles: + fast_divmod_5.patch

messages: + msg260019
2016-02-10 08:52:05mark.dickinsonsetmessages: + msg259994
2016-02-09 22:07:20yselivanovsetfiles: + fast_divmod_3.patch
2016-02-09 21:57:49yselivanovsetfiles: + fast_divmod_2.patch

messages: + msg259955
2016-02-09 16:35:40mark.dickinsonsetmessages: + msg259944
2016-02-09 16:03:38yselivanovsetfiles: + fast_divmod.patch

messages: + msg259942
2016-02-09 15:59:22yselivanovsetfiles: + floor_div_4.patch

messages: + msg259940
2016-02-09 15:20:58serhiy.storchakasetmessages: + msg259938
2016-02-09 14:08:46vstinnersetmessages: + msg259935
2016-02-09 14:05:59mark.dickinsonsetmessages: + msg259934
2016-02-09 13:41:09mark.dickinsonsetmessages: + msg259932
2016-02-09 13:30:41mark.dickinsonsetnosy: + mark.dickinson
2016-02-09 02:35:36yselivanovsetmessages: + msg259909
2016-02-08 22:40:39yselivanovsetassignee: yselivanov
2016-02-08 22:36:50yselivanovsetfiles: + floor_div_3.patch
2016-02-08 22:23:09yselivanovsetmessages: + msg259896
2016-02-08 22:18:52vstinnersetmessages: + msg259893
2016-02-08 21:33:29yselivanovsetfiles: + floor_div_2.patch

messages: + msg259889
2016-02-07 21:07:46serhiy.storchakasetmessages: + msg259803
2016-02-05 02:07:54yselivanovcreate