Classification
Title: Optimize 2 ** n: implement it as 1 << n
Type: performance
Stage:
Components:
Versions: Python 3.5

Process
Status: closed
Resolution: rejected
Dependencies:
Superseder:
Assigned To:
Nosy List: Arfrever, haypo, josh.r, mark.dickinson, pitrou, scoder, serhiy.storchaka
Priority: normal
Keywords: patch

Created on 2014-05-02 21:54 by haypo, last changed 2014-05-12 20:46 by haypo. This issue is now closed.

Files
File name                Uploaded
pow2.patch               haypo, 2014-05-02 21:54
bench_pow2.py            haypo, 2014-05-02 21:54
pow2_specialized.patch   haypo, 2014-05-03 12:45
Messages (8)
msg217800 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-05-02 21:54
The algorithm for 2 ** n (long_pow) is slower than the 1 << n algorithm (long_lshift). I propose computing x**y as 1<<y when x==2 and y >= 0.

The attached patch implements this idea. According to my microbenchmark, it is 4x faster for small powers (2**0 .. 2**1024) and up to 340x faster for large powers (2**(2**28)).
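The full harness is in the attached bench_pow2.py; a minimal stand-alone sketch of the same comparison (using plain timeit, so absolute numbers will differ from the tables below) could look like this. The operands are kept in variables so the compiler cannot constant-fold the power expression away:

```python
import timeit

# Time base ** n against 1 << n for a small and a large exponent.
for n in (10, 2 ** 15):
    t_pow = timeit.timeit("base ** n", setup=f"base = 2; n = {n}", number=10_000)
    t_shift = timeit.timeit("one << n", setup=f"one = 1; n = {n}", number=10_000)
    print(f"n={n}: 2**n {t_pow:.4f}s  1<<n {t_shift:.4f}s")

# The two spellings agree for every non-negative exponent:
assert all(2 ** n == 1 << n for n in range(1025))
```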
msg217801 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-05-02 21:54
Results of bench_pow2.py on Linux.

Common platform:
CPU model: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
Platform: Linux-3.13.9-200.fc20.x86_64-x86_64-with-fedora-20-Heisenbug
Timer: time.perf_counter
Python unicode implementation: PEP 393
Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
CFLAGS: -Wno-unused-result -Werror=declaration-after-statement -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
Bits: int=32, long=64, long long=64, size_t=64, void*=64

Platform of campaign orig:
Date: 2014-05-02 23:45:29
SCM: hg revision=62438d1b11c7 tag=tip branch=default date="2014-05-02 23:26 +0200"
Timer precision: 49 ns
Python version: 3.5.0a0 (default:62438d1b11c7, May 2 2014, 23:45:22) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]

Platform of campaign pow2:
Python version: 3.5.0a0 (default:62438d1b11c7+, May 2 2014, 23:47:50) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]
SCM: hg revision=62438d1b11c7+ tag=tip branch=default date="2014-05-02 23:26 +0200"
Timer precision: 51 ns
Date: 2014-05-02 23:48:03

---------------+--------------+----------------
Tests          |         orig |            pow2
---------------+--------------+----------------
2 ** 0         |    40 ns (*) |     44 ns (+9%)
2 ** 1         |   264 ns (*) |    62 ns (-76%)
2 ** 3         |   270 ns (*) |    62 ns (-77%)
2 ** 5         |   269 ns (*) |    62 ns (-77%)
2 ** 10        |   278 ns (*) |    63 ns (-78%)
2 ** 15        |   299 ns (*) |    62 ns (-79%)
2 ** 20        |   287 ns (*) |    62 ns (-78%)
2 ** 25        |   302 ns (*) |    62 ns (-79%)
2 ** 28        |   293 ns (*) |    62 ns (-79%)
2 ** (2 ** 0)  |   264 ns (*) |    62 ns (-77%)
2 ** (2 ** 1)  |   264 ns (*) |    62 ns (-76%)
2 ** (2 ** 3)  |   264 ns (*) |    62 ns (-77%)
2 ** (2 ** 5)  |   286 ns (*) |    68 ns (-76%)
2 ** (2 ** 10) |   740 ns (*) |    68 ns (-91%)
2 ** (2 ** 15) |  93.1 us (*) |  180 ns (-100%)
2 ** (2 ** 20) |  3.94 ms (*) | 3.89 us (-100%)
2 ** (2 ** 25) |   154 ms (*) |  213 us (-100%)
2 ** (2 ** 28) | 1.36 sec (*) | 4.01 ms (-100%)
---------------+--------------+----------------
Total          | 1.52 sec (*) | 4.23 ms (-100%)
---------------+--------------+----------------
msg217802 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-05-02 21:57
Oh, the patch implements another micro-optimization: "(x << 0) is x" becomes True. I included it in the same patch to reduce the overhead for 2 ** 0.
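To illustrate what that micro-optimization changes: without it, a zero shift builds a fresh int object equal to x; with it, the very same object comes back. Only equality is guaranteed across interpreter versions, so the identity check is shown but not asserted here:

```python
x = 10 ** 30          # large enough to stay out of the small-int cache
y = x << 0

assert y == x         # always true
# y is x              # True only with the "(x << 0) is x" optimization;
#                     # implementation-dependent, so not asserted
```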
msg217820 - (view) Author: Stefan Behnel (scoder) * Date: 2014-05-03 10:17
LGTM, can't see a case where this might go wrong (errors and type checks are handled before the added code).

It also seems a sufficiently common case to optimise internally. The 2**n spelling is easier to read and to get right than the shifting, so it's good to make both similarly fast and remove an excuse for micro-optimisations in Python code.
msg217823 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-05-03 12:45
pow2_specialized.patch: here is a different approach, a little bit faster. I wrote a dedicated long_pow2() function which computes 2**k.

Arguments in favor of pow2_specialized.patch instead of pow2.patch:

- it's a little bit faster
- it doesn't touch long_lshift(), so there is no risk of a performance regression
- it makes it easier to implement the #21419 optimization: use calloc() instead of malloc() for very large powers of 2, to avoid allocating physical memory pages for the zeros
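The C long_pow2() itself is not reproduced here, but the idea behind it (together with the zero-filled-allocation point from #21419) can be sketched in pure Python. This is illustrative only; the real fast path works on the int's digit array directly:

```python
def pow2(k):
    """Illustrative Python analogue of a dedicated 2**k fast path."""
    if k < 0:
        raise ValueError("exponent must be non-negative")
    nbytes = k // 8 + 1
    buf = bytearray(nbytes)        # zero-filled, like calloc()
    buf[0] = 1 << (k % 8)          # set the single bit in the top byte
    return int.from_bytes(buf, "big")

assert pow2(0) == 1 and pow2(100) == 2 ** 100
```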

Results of bench_pow2.py comparing the original Python, pow2.patch and pow2_specialized.patch:

Common platform:
Platform: Linux-3.13.9-200.fc20.x86_64-x86_64-with-fedora-20-Heisenbug
Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
Python unicode implementation: PEP 393
Bits: int=32, long=64, long long=64, size_t=64, void*=64
Timer: time.perf_counter
CPU model: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
CFLAGS: -Wno-unused-result -Werror=declaration-after-statement -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes

Platform of campaign orig:
SCM: hg revision=62438d1b11c7 tag=tip branch=default date="2014-05-02 23:26 +0200"
Date: 2014-05-03 14:42:56
Timer precision: 46 ns
Python version: 3.5.0a0 (default:62438d1b11c7, May 3 2014, 14:41:50) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]

Platform of campaign pow2:
SCM: hg revision=62438d1b11c7+ tag=tip branch=default date="2014-05-02 23:26 +0200"
Python version: 3.5.0a0 (default:62438d1b11c7+, May 3 2014, 14:41:13) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]
Date: 2014-05-03 14:41:21
Timer precision: 51 ns

Platform of campaign pow2_specialized:
SCM: hg revision=62438d1b11c7+ tag=tip branch=default date="2014-05-02 23:26 +0200"
Date: 2014-05-03 14:40:26
Python version: 3.5.0a0 (default:62438d1b11c7+, May 3 2014, 14:40:18) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]
Timer precision: 46 ns

---------------+--------------------+--------------+-----------------
Tests          |               orig |         pow2 | pow2_specialized
---------------+--------------------+--------------+-----------------
2 ** 0         |          39 ns (*) | 44 ns (+11%) |            41 ns
2 ** 1         |     264 ns (+531%) | 62 ns (+48%) |        42 ns (*)
2 ** 3         |     269 ns (+554%) | 62 ns (+50%) |        41 ns (*)
2 ** 5         |     269 ns (+560%) | 62 ns (+51%) |        41 ns (*)
2 ** 10        |     279 ns (+410%) | 63 ns (+15%) |        55 ns (*)
2 ** 15        |     299 ns (+460%) | 62 ns (+16%) |        53 ns (*)
2 ** 20        |     287 ns (+426%) | 62 ns (+14%) |        55 ns (*)
2 ** 25        |     302 ns (+466%) | 62 ns (+16%) |        53 ns (*)
2 ** 28        |     293 ns (+438%) | 62 ns (+14%) |        55 ns (*)
2 ** (2 ** 0)  |     264 ns (+549%) | 62 ns (+52%) |        41 ns (*)
2 ** (2 ** 1)  |     264 ns (+524%) | 62 ns (+48%) |        42 ns (*)
2 ** (2 ** 3)  |     263 ns (+548%) | 62 ns (+52%) |        41 ns (*)
2 ** (2 ** 5)  |     283 ns (+346%) |  68 ns (+8%) |        63 ns (*)
2 ** (2 ** 10) |     715 ns (+950%) |    68 ns (*) |            69 ns
2 ** (2 ** 15) |  89.6 us (+51109%) |   175 ns (*) |           177 ns
2 ** (2 ** 20) | 3.89 ms (+100603%) |  3.86 us (*) |          3.86 us
2 ** (2 ** 25) |   152 ms (+66557%) | 240 us (+5%) |       228 us (*)
2 ** (2 ** 28) | 1.35 sec (+33851%) |  3.99 ms (*) |          3.99 ms
---------------+--------------------+--------------+-----------------
Total          | 1.51 sec (+35664%) |      4.23 ms |      4.22 ms (*)
---------------+--------------------+--------------+-----------------
msg217863 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2014-05-04 09:34
Victor, can you demonstrate any cases of real code where this optimization makes a significant difference?

There are many, many tiny optimisations we *could* be making in Objects/longobject.c; each of those potential optimisations adds to the cost of maintaining the code, detracts from readability, and potentially even slows down the common cases fractionally.  In general, I think we should only be applying this sort of optimization when there's a clear benefit to real-world code.  I don't think this one crosses that line.

In the (I suspect rare) cases where a piece of real-world code is slowed down significantly due to a non-optimized 2**n, the code author still has the option of replacing that piece of code with 1<<n manually.  And in some cases, that's probably the wrong optimization anyway: an expression like `x * 2**n` would be better hand-optimized to `x << n`.
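The hand-optimization mentioned above rests on the identity x * 2**n == x << n, which holds for any int x and any n >= 0 (Python's << is value-based, so negative x works too). A quick check:

```python
x, n = -12345, 20

# Multiplying by a power of two and shifting agree exactly:
assert x * 2 ** n == x << n
assert 7 * 2 ** 0 == 7 << 0 == 7
```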

IOW, I'm -1 on making this change.
msg217897 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-05-04 22:03
I agree with Mark, there doesn't seem to be a strong point in favour of this optimization.
msg218360 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-05-12 20:46
My main motivation was to use calloc() (issue #21419) for 2**n. It looks like using calloc() for this is only useful in very rare cases (very large values of n, where the result is larger than 1 MB). I am closing the issue.
History
Date                 User            Action  Args
2014-05-12 20:46:21  haypo           set     status: open -> closed; resolution: rejected; messages: + msg218360
2014-05-11 12:01:29  Arfrever        set     nosy: + Arfrever
2014-05-04 22:03:15  pitrou          set     messages: + msg217897
2014-05-04 09:34:29  mark.dickinson  set     messages: + msg217863
2014-05-03 12:45:15  haypo           set     files: + pow2_specialized.patch; messages: + msg217823
2014-05-03 10:17:54  scoder          set     nosy: + scoder; messages: + msg217820
2014-05-02 21:57:58  haypo           set     nosy: + mark.dickinson
2014-05-02 21:57:35  haypo           set     nosy: + pitrou, serhiy.storchaka, josh.r; messages: + msg217802
2014-05-02 21:54:54  haypo           set     files: + bench_pow2.py; messages: + msg217801
2014-05-02 21:54:22  haypo           create