classification
Title: Optimise PyLong division by 1 or -1
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.5
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: akuchling, mark.dickinson, pitrou, scoder, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2014-09-26 07:52 by scoder, last changed 2015-04-22 23:42 by akuchling. This issue is now closed.

Files
File name Uploaded Description Edit
div_by_1_fast_path.patch scoder, 2014-09-26 07:52 fast paths for division by 1 or -1 and 0/x review
mul_by_1_fast_path.patch scoder, 2014-09-26 09:03 fast paths for multiplication by 0, 1 and -1 review
mul_div_by_1_fast_path.patch scoder, 2014-09-26 09:47 review
mul_div_by_1_fast_path_3.patch scoder, 2014-09-26 10:43 updated patch that moves div optimisation into l_divmod() review
add_sub_0_fast_path.patch scoder, 2014-09-26 10:57 fast paths for +/- 0
add_sub_0_fast_path_2.patch scoder, 2014-09-26 11:43 fast paths for +/- 0 that always return PyLong
mul_div_by_1_fast_path_4.patch scoder, 2014-09-26 11:44 updated patch that always returns PyLong from the special cases review
mul_div_by_1_fast_path_5.patch scoder, 2014-09-26 16:37 review
mul_div_by_1_fast_path_6.patch scoder, 2014-09-27 18:32 updated patch without changes to long_true_div() review
Messages (21)
msg227590 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2014-09-26 07:52
The attached patch adds fast paths for PyLong division by 1 and -1, as well as dividing 0 by something. This was found helpful for fractions normalisation, as the GCD that is divided by can often be |1|, but firing up the whole division machinery for this eats a lot of CPU cycles for nothing.

There are currently two test failures in test_long.py because dividing a huge number by 1 or -1 no longer raises an OverflowError. This is a behavioural change, but I find it acceptable. If others agree, I'll fix the tests and submit a new patch.
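The logic of the proposed fast paths can be sketched in pure Python (this only illustrates the C patch's idea; `floordiv_fast` is a hypothetical name, not part of the patch). The try/except at the end shows the behavioural change mentioned above: in stock CPython, true division of a huge int must produce a float, so it overflows even when the divisor is 1.

```python
def floordiv_fast(a, b):
    """Floor division with the fast paths the patch proposes."""
    if b == 1:                # x // 1 == x
        return a
    if b == -1:               # x // -1 == -x
        return -a
    if a == 0 and b != 0:     # 0 // x == 0
        return 0
    return a // b             # fall back to the general division machinery

print(floordiv_fast(7, -1))   # -7

# Stock CPython: true division converts to float, so a huge result
# overflows even for division by 1.
try:
    (2**2000) / 1
except OverflowError as exc:
    print("stock CPython raises:", exc)
```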
msg227594 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-09-26 08:32
Perhaps it would be worth special-casing multiplication by 0, 1 and -1, and addition of 0, 1 and -1 as well.
msg227595 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-09-26 08:51
Any optimization requires a benchmark. What is the speedup?
msg227596 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-09-26 08:54
I proposed an optimization for "x << 0" (as part of a larger patch to optimize 2 ** x) but the issue was rejected:
http://bugs.python.org/issue21420#msg217802

Mark Dickinson wrote (msg217863):
"There are many, many tiny optimisations we *could* be making in Objects/longobject.c; each of those potential optimisations adds to the cost of maintaining the code, detracts from readability, and potentially even slows down the common cases fractionally.  In general, I think we should only be applying this sort of optimization when there's a clear benefit to real-world code.  I don't think this one crosses that line."
msg227597 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2014-09-26 09:03
Attaching a similar patch for long_mul().
msg227598 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2014-09-26 09:08
> Any optimization requires a benchmark. What is the speedup?

I gave numbers in ticket #22464.

"""
Since many Fraction input values can already be normalised for some reason, the following change shaves off almost 30% of the calls to PyNumber_InPlaceFloorDivide() in the telco benchmark during Fraction instantiation according to callgrind, thus saving 20% of the CPU instructions that go into tp_new().
"""

I then proposed to move this into the PyLong type in general, rather than letting Fraction itself do it less efficiently.
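The Fraction-side shortcut can be sketched as follows (`normalise` is a made-up helper, not `Fraction.__new__` itself; sign handling is omitted for brevity):

```python
from math import gcd  # available since Python 3.5

def normalise(numerator, denominator):
    """Reduce a fraction, skipping the divisions when the gcd is 1."""
    g = gcd(numerator, denominator)
    if g != 1:                 # common case for pre-normalised input: g == 1
        numerator //= g
        denominator //= g
    return numerator, denominator

print(normalise(6, 4))   # (3, 2)
print(normalise(3, 7))   # (3, 7): gcd is 1, divisions skipped
```

Moving the same check into PyLong itself makes every `x // g` with `g == 1` cheap, without each caller having to special-case it.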
msg227599 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2014-09-26 09:22
@Serhiy: moving the fast path into l_divmod() has the disadvantage of making it even more complex because we'd then also want to determine the modulus, but only if requested, and it can be 1, 0 or -1, depending on the second value. Sounds like a lot more if's.
msg227600 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2014-09-26 09:47
Combined patch for both mul and div that fixes the return value of long_true_div(), as found by Serhiy, and removes the useless change in long_divrem(), as found by Antoine. Thanks!

All test_long.py tests pass now.
msg227601 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2014-09-26 10:05
@Serhiy: please ignore my comment in msg227599. I'll submit a patch that moves the specialisation to l_divmod().
msg227602 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2014-09-26 10:27
Thanks for the reviews, here's a new patch.
msg227604 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2014-09-26 10:43
Sorry, last patch version contained a "use before type check" bug.
msg227605 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2014-09-26 10:57
Here is an incremental patch that adds fast paths for adding and subtracting 0.

Question: the module calls long_long() in some places (e.g. long_abs()) and thus forces the return type to be exactly a PyLong and not a subtype. My changes use a plain "incref+return input value" in some places. Should they call long_long() on it instead?
msg227607 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-09-26 11:19
On 26/09/2014 12:57, Stefan Behnel wrote:
> 
> Question: the module calls long_long() in some places (e.g.
> long_abs()) and thus forces the return type to be exactly a PyLong and
> not a subtype. My changes use a plain "incref+return input value" in
> some places. Should they call long_long() on it instead?

Ah, yes, they should. The return type should not depend on the input
*values* :-)
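The invariant Antoine describes can be checked from Python: arithmetic on an `int` subclass returns plain `int`, so a fast path that merely increfs and returns its input (which might be a subclass instance) would break it. A small check, with `MyInt` a made-up subclass:

```python
class MyInt(int):
    pass

x = MyInt(5)

# CPython normalises results to exact int, whatever the operand values:
assert type(x + 0) is int
assert type(x * 1) is int
assert type(abs(x)) is int
print("return types are exact int")
```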
msg227609 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2014-09-26 11:43
Ok, updating both patches.
msg227611 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2014-09-26 12:37
I reran the fractions benchmark over the final result and the overall gain turned out to be, well, small. It's a clearly reproducible 2-3% faster. That's not bad for the macro impact of a micro-optimisation, but it's not a clear argument for throwing more code at it either.

I'll leave it to you to decide.
msg227625 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-09-26 15:34
Oh, such a small gain, and only on one specific benchmark that is not yet included in the standard benchmark suite — that looks discouraging. Maybe other benchmarks would gain from these changes?
msg227627 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-09-26 15:47
> 2-3% faster

3% is not enough to justify the change.
msg227633 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2014-09-26 16:37
Since Serhiy gave another round of valid feedback, here's an updated patch.
msg227644 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2014-09-26 17:57
I ran it under callgrind again, which confirmed that the gain from doing this inside long_div() and friends is much lower than doing it directly in Fraction.__new__(). It's not safe to do there, though, as "is" tests on integers are generally not a good idea in Python code. (Although nothing breaks if the test fails, since it's a pure optimisation to avoid useless overhead.)

The micro benchmarks with timeit confirm that it's as fast as expected.

Large numbers before:

$ ./python -m timeit -s 'x = 2**2000 + 3**234 + 5**891 + 7**1234' ' -x'
10000000 loops, best of 3: 0.177 usec per loop
$ ./python -m timeit -s 'x = 2**2000 + 3**234 + 5**891 + 7**1234' 'x * -2'
1000000 loops, best of 3: 0.329 usec per loop
$ ./python -m timeit -s 'x = 2**2000 + 3**234 + 5**891 + 7**1234' 'x // -2'
100000 loops, best of 3: 2.8 usec per loop

$ ./python -m timeit -s 'x = 2**2000 + 3**234 + 5**891 + 7**1234' 'x * -1'
1000000 loops, best of 3: 0.329 usec per loop
$ ./python -m timeit -s 'x = 2**2000 + 3**234 + 5**891 + 7**1234' 'x // -1'
100000 loops, best of 3: 2.36 usec per loop

$ ./python -m timeit -s 'x = 2**2000 + 3**234 + 5**891 + 7**1234' 'x * 1'
1000000 loops, best of 3: 0.333 usec per loop
$ ./python -m timeit -s 'x = 2**2000 + 3**234 + 5**891 + 7**1234' 'x // 1'
100000 loops, best of 3: 2.37 usec per loop


Patched:

$ ./python -m timeit -s 'x = 2**2000 + 3**234 + 5**891 + 7**1234' ' -x'
10000000 loops, best of 3: 0.176 usec per loop
$ ./python -m timeit -s 'x = 2**2000 + 3**234 + 5**891 + 7**1234' 'x * -2'
1000000 loops, best of 3: 0.328 usec per loop
$ ./python -m timeit -s 'x = 2**2000 + 3**234 + 5**891 + 7**1234' 'x // -2'
100000 loops, best of 3: 2.8 usec per loop

$ ./python -m timeit -s 'x = 2**2000 + 3**234 + 5**891 + 7**1234' 'x * -1'
10000000 loops, best of 3: 0.177 usec per loop
$ ./python -m timeit -s 'x = 2**2000 + 3**234 + 5**891 + 7**1234' 'x // -1'
10000000 loops, best of 3: 0.178 usec per loop

$ ./python -m timeit -s 'x = 2**2000 + 3**234 + 5**891 + 7**1234' 'x * 1'
10000000 loops, best of 3: 0.0244 usec per loop
$ ./python -m timeit -s 'x = 2**2000 + 3**234 + 5**891 + 7**1234' 'x // 1'
10000000 loops, best of 3: 0.0258 usec per loop


Small numbers before:

$ ./python -m timeit -s 'x = 5' 'x * -2'
10000000 loops, best of 3: 0.0408 usec per loop
$ ./python -m timeit -s 'x = 5' 'x // -2'
10000000 loops, best of 3: 0.0714 usec per loop

$ ./python -m timeit -s 'x = 5' 'x * -1'
10000000 loops, best of 3: 0.0293 usec per loop
$ ./python -m timeit -s 'x = 5' 'x * 1'
10000000 loops, best of 3: 0.0282 usec per loop

$ ./python -m timeit -s 'x = 5' 'x // 1'
10000000 loops, best of 3: 0.0529 usec per loop
$ ./python -m timeit -s 'x = 5' 'x // -1'
10000000 loops, best of 3: 0.0536 usec per loop


Patched:

$ ./python -m timeit -s 'x = 5' 'x * -2'
10000000 loops, best of 3: 0.0391 usec per loop
$ ./python -m timeit -s 'x = 5' 'x // -2'
10000000 loops, best of 3: 0.0718 usec per loop

$ ./python -m timeit -s 'x = 5' 'x * -1'
10000000 loops, best of 3: 0.0267 usec per loop
$ ./python -m timeit -s 'x = 5' 'x * 1'
10000000 loops, best of 3: 0.0265 usec per loop

$ ./python -m timeit -s 'x = 5' 'x // 1'
10000000 loops, best of 3: 0.0259 usec per loop
$ ./python -m timeit -s 'x = 5' 'x // -1'
10000000 loops, best of 3: 0.0285 usec per loop


Note: we're talking µsecs here, not usually something to worry about. And it's unlikely that other benchmarks see similarly "high" speedups as the one for fractions (due to the relatively high likelihood of the GCD being 1 there).

I'm ok with closing this ticket as "won't fix".
msg227709 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2014-09-27 18:32
One more comment: I also benchmarked the change in long_true_div() now and found that it's only a minor improvement for large numbers and a *pessimisation* for small numbers:

Before:

$ ./python -m timeit -s 'x = 5' 'x / -1'
10000000 loops, best of 3: 0.0313 usec per loop
$ ./python -m timeit -s 'x = 5' 'x / 1'
10000000 loops, best of 3: 0.0307 usec per loop

$ ./python -m timeit -s 'x = 2**200 + 3**234 + 5**89 + 7**123' 'x / 1'
10000000 loops, best of 3: 0.101 usec per loop
$ ./python -m timeit -s 'x = 2**200 + 3**234 + 5**89 + 7**123' 'x / -1'
10000000 loops, best of 3: 0.104 usec per loop


Patched:

$ ./python -m timeit -s 'x = 5' 'x / 1'                                  
10000000 loops, best of 3: 0.0569 usec per loop
$ ./python -m timeit -s 'x = 5' 'x / -1'
10000000 loops, best of 3: 0.0576 usec per loop

$ ./python -m timeit -s 'x = 2**200 + 3**234 + 5**89 + 7**123' 'x / -1'
10000000 loops, best of 3: 0.056 usec per loop
$ ./python -m timeit -s 'x = 2**200 + 3**234 + 5**89 + 7**123' 'x / 1' 
10000000 loops, best of 3: 0.056 usec per loop

$ ./python -m timeit -s 'x = 2**200 + 3**234 + 5**89 + 7**123' 'x / -2'
10000000 loops, best of 3: 0.106 usec per loop


So, just for completeness, here's the patch without that part, with changes only in l_divmod() and long_mul(), with the timeit results as given in my previous comment.
msg241834 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2015-04-22 23:42
From reading the discussion thread, it looks like the consensus is to not apply this set of patches because the speed-up is unfortunately small.  Closing as won't-fix; please re-open if someone wishes to pursue this again.
History
Date User Action Args
2015-04-22 23:42:55akuchlingsetstatus: open -> closed

nosy: + akuchling
messages: + msg241834

resolution: wont fix
stage: patch review -> resolved
2014-09-27 18:32:08scodersetfiles: + mul_div_by_1_fast_path_6.patch

messages: + msg227709
2014-09-26 17:57:23scodersetmessages: + msg227644
2014-09-26 16:37:23scodersetfiles: + mul_div_by_1_fast_path_5.patch

messages: + msg227633
2014-09-26 15:47:46vstinnersetmessages: + msg227627
2014-09-26 15:34:27serhiy.storchakasetmessages: + msg227625
2014-09-26 12:37:45scodersetmessages: + msg227611
2014-09-26 11:44:24scodersetfiles: + mul_div_by_1_fast_path_4.patch
2014-09-26 11:43:41scodersetfiles: + add_sub_0_fast_path_2.patch

messages: + msg227609
2014-09-26 11:19:58pitrousetmessages: + msg227607
2014-09-26 10:57:27scodersetfiles: + add_sub_0_fast_path.patch

messages: + msg227605
2014-09-26 10:43:16scodersetfiles: + mul_div_by_1_fast_path_3.patch

messages: + msg227604
2014-09-26 10:42:06scodersetfiles: - mul_div_by_1_fast_path_2.patch
2014-09-26 10:27:27scodersetfiles: + mul_div_by_1_fast_path_2.patch

messages: + msg227602
2014-09-26 10:05:48scodersetmessages: + msg227601
2014-09-26 09:47:04scodersetfiles: + mul_div_by_1_fast_path.patch

messages: + msg227600
2014-09-26 09:22:31scodersetmessages: + msg227599
2014-09-26 09:08:42scodersetmessages: + msg227598
2014-09-26 09:03:21scodersetfiles: + mul_by_1_fast_path.patch

messages: + msg227597
2014-09-26 08:54:19vstinnersetmessages: + msg227596
2014-09-26 08:51:50vstinnersetnosy: + vstinner
messages: + msg227595
2014-09-26 08:32:17serhiy.storchakasetmessages: + msg227594
stage: patch review
2014-09-26 07:52:31scodercreate