Classification
Title: Optimize 2 ** n: implement it as 1 << n
Type: performance
Stage:
Components:
Versions: Python 3.5

Process
Status: closed
Resolution: rejected
Dependencies:
Superseder:
Assigned To:
Nosy List: Arfrever, haypo, josh.r, mark.dickinson, pitrou, scoder, serhiy.storchaka
Priority: normal
Keywords: patch

Created on 2014-05-02 21:54 by haypo, last changed 2014-05-12 20:46 by haypo. This issue is now closed.

Files
File name                Uploaded
pow2.patch               haypo, 2014-05-02 21:54
bench_pow2.py            haypo, 2014-05-02 21:54
pow2_specialized.patch   haypo, 2014-05-03 12:45
Messages (8)
msg217800 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-05-02 21:54
The algorithm for 2 ** n (long_pow) is slower than the 1 << n algorithm (long_lshift). I propose computing x**y as 1<<y when x==2 and y >= 0.

The attached patch implements this idea. According to my microbenchmark, it is 4x faster for small powers (2**0 .. 2**1024) and up to 340x faster for large powers (2**(2**28)).
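The full harness is in the attached bench_pow2.py; a minimal stand-alone sketch of the same comparison (using plain timeit, so absolute numbers will differ from the tables below) could look like this. The operands are kept in variables so the compiler cannot constant-fold the power expression away:

```python
import timeit

# Time base ** n against 1 << n for a small and a large exponent.
for n in (10, 2 ** 15):
    t_pow = timeit.timeit("base ** n", setup=f"base = 2; n = {n}", number=10_000)
    t_shift = timeit.timeit("one << n", setup=f"one = 1; n = {n}", number=10_000)
    print(f"n={n}: 2**n {t_pow:.4f}s  1<<n {t_shift:.4f}s")

# The two spellings agree for every non-negative exponent:
assert all(2 ** n == 1 << n for n in range(1025))
```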
msg217801 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-05-02 21:54
Results of bench_pow2.py on Linux.

Common platform:
CPU model: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
Platform: Linux-3.13.9-200.fc20.x86_64-x86_64-with-fedora-20-Heisenbug
Timer: time.perf_counter
Python unicode implementation: PEP 393
Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
CFLAGS: -Wno-unused-result -Werror=declaration-after-statement -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
Bits: int=32, long=64, long long=64, size_t=64, void*=64

Platform of campaign orig:
Date: 2014-05-02 23:45:29
SCM: hg revision=62438d1b11c7 tag=tip branch=default date="2014-05-02 23:26 +0200"
Timer precision: 49 ns
Python version: 3.5.0a0 (default:62438d1b11c7, May 2 2014, 23:45:22) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]

Platform of campaign pow2:
Python version: 3.5.0a0 (default:62438d1b11c7+, May 2 2014, 23:47:50) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]
SCM: hg revision=62438d1b11c7+ tag=tip branch=default date="2014-05-02 23:26 +0200"
Timer precision: 51 ns
Date: 2014-05-02 23:48:03

---------------+--------------+----------------
Tests          |         orig |            pow2
---------------+--------------+----------------
2 ** 0         |    40 ns (*) |     44 ns (+9%)
2 ** 1         |   264 ns (*) |    62 ns (-76%)
2 ** 3         |   270 ns (*) |    62 ns (-77%)
2 ** 5         |   269 ns (*) |    62 ns (-77%)
2 ** 10        |   278 ns (*) |    63 ns (-78%)
2 ** 15        |   299 ns (*) |    62 ns (-79%)
2 ** 20        |   287 ns (*) |    62 ns (-78%)
2 ** 25        |   302 ns (*) |    62 ns (-79%)
2 ** 28        |   293 ns (*) |    62 ns (-79%)
2 ** (2 ** 0)  |   264 ns (*) |    62 ns (-77%)
2 ** (2 ** 1)  |   264 ns (*) |    62 ns (-76%)
2 ** (2 ** 3)  |   264 ns (*) |    62 ns (-77%)
2 ** (2 ** 5)  |   286 ns (*) |    68 ns (-76%)
2 ** (2 ** 10) |   740 ns (*) |    68 ns (-91%)
2 ** (2 ** 15) |  93.1 us (*) |  180 ns (-100%)
2 ** (2 ** 20) |  3.94 ms (*) | 3.89 us (-100%)
2 ** (2 ** 25) |   154 ms (*) |  213 us (-100%)
2 ** (2 ** 28) | 1.36 sec (*) | 4.01 ms (-100%)
---------------+--------------+----------------
Total          | 1.52 sec (*) | 4.23 ms (-100%)
---------------+--------------+----------------
msg217802 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-05-02 21:57
Oh, the patch implements another micro-optimization: "(x << 0) is x" becomes True. I included it in the same patch to reduce the overhead for 2 ** 0.
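To illustrate what that micro-optimization changes: without it, a zero shift builds a fresh int object equal to x; with it, the very same object comes back. Only equality is guaranteed across interpreter versions, so the identity check is shown but not asserted here:

```python
x = 10 ** 30          # large enough to stay out of the small-int cache
y = x << 0

assert y == x         # always true
# y is x              # True only with the "(x << 0) is x" optimization;
#                     # implementation-dependent, so not asserted
```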
msg217820 - (view) Author: Stefan Behnel (scoder) * Date: 2014-05-03 10:17
LGTM, can't see a case where this might go wrong (errors and type checks are handled before the added code).

It also seems a sufficiently common case to optimise internally. The 2**n spelling is easier to read and to get right than the shifting, so it's good to make both similarly fast and remove an excuse for micro-optimisations in Python code.
msg217823 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-05-03 12:45
pow2_specialized.patch: here is a different approach, a little bit faster. I wrote a dedicated long_pow2() function which computes 2**k.

Arguments in favor of pow2_specialized.patch instead of pow2.patch:

- it's a little bit faster
- it doesn't touch long_lshift(), so there is no risk of a performance regression
- it makes it easier to implement the #21419 optimization: use calloc() instead of malloc() for very large powers of 2, to avoid allocating physical memory pages for the zeros
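The C long_pow2() itself is not reproduced here, but the idea behind it (together with the zero-filled-allocation point from #21419) can be sketched in pure Python. This is illustrative only; the real fast path works on the int's digit array directly:

```python
def pow2(k):
    """Illustrative Python analogue of a dedicated 2**k fast path."""
    if k < 0:
        raise ValueError("exponent must be non-negative")
    nbytes = k // 8 + 1
    buf = bytearray(nbytes)        # zero-filled, like calloc()
    buf[0] = 1 << (k % 8)          # set the single bit in the top byte
    return int.from_bytes(buf, "big")

assert pow2(0) == 1 and pow2(100) == 2 ** 100
```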

Results of bench_pow2.py comparing the original Python, pow2.patch and pow2_specialized.patch:

Common platform:
Platform: Linux-3.13.9-200.fc20.x86_64-x86_64-with-fedora-20-Heisenbug
Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
Python unicode implementation: PEP 393
Bits: int=32, long=64, long long=64, size_t=64, void*=64
Timer: time.perf_counter
CPU model: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
CFLAGS: -Wno-unused-result -Werror=declaration-after-statement -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes

Platform of campaign orig:
SCM: hg revision=62438d1b11c7 tag=tip branch=default date="2014-05-02 23:26 +0200"
Date: 2014-05-03 14:42:56
Timer precision: 46 ns
Python version: 3.5.0a0 (default:62438d1b11c7, May 3 2014, 14:41:50) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]

Platform of campaign pow2:
SCM: hg revision=62438d1b11c7+ tag=tip branch=default date="2014-05-02 23:26 +0200"
Python version: 3.5.0a0 (default:62438d1b11c7+, May 3 2014, 14:41:13) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]
Date: 2014-05-03 14:41:21
Timer precision: 51 ns

Platform of campaign pow2_specialized:
SCM: hg revision=62438d1b11c7+ tag=tip branch=default date="2014-05-02 23:26 +0200"
Date: 2014-05-03 14:40:26
Python version: 3.5.0a0 (default:62438d1b11c7+, May 3 2014, 14:40:18) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]
Timer precision: 46 ns

---------------+--------------------+--------------+-----------------
Tests          |               orig |         pow2 | pow2_specialized
---------------+--------------------+--------------+-----------------
2 ** 0         |          39 ns (*) | 44 ns (+11%) |            41 ns
2 ** 1         |     264 ns (+531%) | 62 ns (+48%) |        42 ns (*)
2 ** 3         |     269 ns (+554%) | 62 ns (+50%) |        41 ns (*)
2 ** 5         |     269 ns (+560%) | 62 ns (+51%) |        41 ns (*)
2 ** 10        |     279 ns (+410%) | 63 ns (+15%) |        55 ns (*)
2 ** 15        |     299 ns (+460%) | 62 ns (+16%) |        53 ns (*)
2 ** 20        |     287 ns (+426%) | 62 ns (+14%) |        55 ns (*)
2 ** 25        |     302 ns (+466%) | 62 ns (+16%) |        53 ns (*)
2 ** 28        |     293 ns (+438%) | 62 ns (+14%) |        55 ns (*)
2 ** (2 ** 0)  |     264 ns (+549%) | 62 ns (+52%) |        41 ns (*)
2 ** (2 ** 1)  |     264 ns (+524%) | 62 ns (+48%) |        42 ns (*)
2 ** (2 ** 3)  |     263 ns (+548%) | 62 ns (+52%) |        41 ns (*)
2 ** (2 ** 5)  |     283 ns (+346%) |  68 ns (+8%) |        63 ns (*)
2 ** (2 ** 10) |     715 ns (+950%) |    68 ns (*) |            69 ns
2 ** (2 ** 15) |  89.6 us (+51109%) |   175 ns (*) |           177 ns
2 ** (2 ** 20) | 3.89 ms (+100603%) |  3.86 us (*) |          3.86 us
2 ** (2 ** 25) |   152 ms (+66557%) | 240 us (+5%) |       228 us (*)
2 ** (2 ** 28) | 1.35 sec (+33851%) |  3.99 ms (*) |          3.99 ms
---------------+--------------------+--------------+-----------------
Total          | 1.51 sec (+35664%) |      4.23 ms |      4.22 ms (*)
---------------+--------------------+--------------+-----------------
msg217863 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2014-05-04 09:34
Victor, can you demonstrate any cases of real code where this optimization makes a significant difference?

There are many, many tiny optimisations we *could* be making in Objects/longobject.c; each of those potential optimisations adds to the cost of maintaining the code, detracts from readability, and potentially even slows down the common cases fractionally.  In general, I think we should only be applying this sort of optimization when there's a clear benefit to real-world code.  I don't think this one crosses that line.

In the (I suspect rare) cases where a piece of real-world code is slowed down significantly due to a non-optimized 2**n, the code author still has the option of replacing that piece of code with 1<<n manually.  And in some cases, that's probably the wrong optimization anyway: an expression like `x * 2**n` would be better hand-optimized to `x << n`.
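The hand-optimization mentioned above rests on the identity x * 2**n == x << n, which holds for any int x and any n >= 0 (Python's << is value-based, so negative x works too). A quick check:

```python
x, n = -12345, 20

# Multiplying by a power of two and shifting agree exactly:
assert x * 2 ** n == x << n
assert 7 * 2 ** 0 == 7 << 0 == 7
```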

IOW, I'm -1 on making this change.
msg217897 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-05-04 22:03
I agree with Mark, there doesn't seem to be a strong point in favour of this optimization.
msg218360 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-05-12 20:46
My main motivation was to use calloc() (issue #21419) for 2**n. It looks like using calloc() for this is only useful in very rare cases (very large values of n, where the result is larger than 1 MB). I am closing the issue.
History
Date                 User            Action  Args
2014-05-12 20:46:21  haypo           set     status: open -> closed; resolution: rejected; messages: + msg218360
2014-05-11 12:01:29  Arfrever        set     nosy: + Arfrever
2014-05-04 22:03:15  pitrou          set     messages: + msg217897
2014-05-04 09:34:29  mark.dickinson  set     messages: + msg217863
2014-05-03 12:45:15  haypo           set     files: + pow2_specialized.patch; messages: + msg217823
2014-05-03 10:17:54  scoder          set     nosy: + scoder; messages: + msg217820
2014-05-02 21:57:58  haypo           set     nosy: + mark.dickinson
2014-05-02 21:57:35  haypo           set     nosy: + pitrou, serhiy.storchaka, josh.r; messages: + msg217802
2014-05-02 21:54:54  haypo           set     files: + bench_pow2.py; messages: + msg217801
2014-05-02 21:54:22  haypo           create