
classification
Title: Random._randbelow() loses time by caching an attribute lookup
Type: performance
Stage: resolved
Components: Library (Lib)
Versions: Python 3.11

process
Status: closed
Resolution: rejected
Dependencies:
Superseder:
Assigned To: rhettinger
Nosy List: onethreeseven, python-dev, rhettinger
Priority: normal
Keywords: patch

Created on 2021-12-04 02:20 by onethreeseven, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name: randbelow_timing.py
Uploaded by: onethreeseven, 2021-12-04 04:37

Pull Requests
PR 29911: closed (linked by python-dev, 2021-12-04 02:22)
Messages (10)
msg407625 - (view) Author: Andrew Lin (onethreeseven) * Date: 2021-12-04 02:20
This PR obtains a performance improvement in the random module by removing a cached attribute lookup in Random._randbelow_with_getrandbits() that costs time on average.  In the best cases (on my machine) I get 10% improvement for randrange(), 7% for shuffle(), and 11% for sample(), with no change in the worst cases.

To elaborate...  If n is slightly less than a power of 2, _randbelow_with_getrandbits(n) almost always makes just one call to getrandbits(), so caching the attribute lookup never pays for itself.  Even in the worst case, when n is exactly a power of 2, the expected number of calls to getrandbits() is still only 2, and on my machine caching just breaks even.  (A friend ran it on ARM and got even more favorable results.)
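For reference, the two shapes being compared look roughly like this (a sketch modeled on Lib/random.py with illustrative names; the exact upstream code differs in docstrings and details):

    # Current form: cache the bound method in a local before the loop.
    def _randbelow_cached(self, n):
        getrandbits = self.getrandbits   # attribute lookup paid once, up front
        k = n.bit_length()
        r = getrandbits(k)               # 0 <= r < 2**k
        while r >= n:                    # re-draw only when r lands in [n, 2**k)
            r = getrandbits(k)
        return r

    # Proposed form: look the method up at each call site instead.
    def _randbelow_direct(self, n):
        k = n.bit_length()
        r = self.getrandbits(k)
        while r >= n:
            r = self.getrandbits(k)
        return r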

I included a similar change to _randbelow_without_getrandbits() on similar grounds, although I didn't benchmark it.

(Brand new here; let me know if I've left anything out!)
msg407626 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-12-04 02:32
Thanks for the suggestion.  I'll try this out on various builds and machines to see how it works out.  The speed of locals is consistently fast, but the speed of method calls has varied over the last few versions of Python (generally getting faster).  When this code was first written, it was benchmarked and localizing was a net win.
msg407628 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-12-04 02:46
* If you post a timing script, it will make it easier for me to verify this across versions, across machines, and for various input sizes.

* If you have time, do run some benchmarks for _randbelow_without_getrandbits()

* Try your change with and without the walrus operator.  If it doesn't make a difference in the timings, let's use the non-walrus version, which is more readable.
msg407630 - (view) Author: Andrew Lin (onethreeseven) * Date: 2021-12-04 03:03
Timings are unaffected by updating to non-walrus, so I'll do so in the PR.

Using _randbelow_without_getrandbits() I get 5% improvement to sample([None] * 2047, 50); 6% to shuffle([None] * 2000); and approximately 6% to randrange() for any number significantly less than 2**32.  It's not surprising that the absolute speedup is smaller, because _randbelow_without_getrandbits() does a lot more work that is untouched by the change.  On the other hand, the speedup is more consistent, since _randbelow_without_getrandbits(n) almost never calls random() more than once for any n significantly less than 2**32.
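For reference, a simplified sketch of that variant (the real Lib/random.py version also warns when n exceeds the 53-bit precision of random() and handles n == 0; names and details may differ):

    def _randbelow_without_getrandbits(self, n, maxsize=1 << 53):
        random = self.random                 # the cached lookup the patch would drop
        rem = maxsize % n
        limit = (maxsize - rem) / maxsize    # accept r only where int(r * maxsize) % n is unbiased
        r = random()
        while r >= limit:                    # for n far below maxsize, limit is close to 1.0,
            r = random()                     # so a second call is almost never needed
        return int(r * maxsize) % n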

I will clean up my benchmark script and post it.

Thanks for the feedback!
msg407634 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-12-04 04:06
I just did a few timings using the stock python.org Mac builds.  Only Python 3.10 gave the expected speed-up.  Python 3.8 and Python 3.9 got slower.  Python 3.11 was slightly slower.

I think we should pass on this proposed change.  The current code is safer in that it has been consistently good while the proposed change is inconsistent and can't be relied upon across builds.
msg407635 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-12-04 04:22
On the current 3.11, I do get a speedup below a power of two, but a slight degradation (as expected) at a power of two.

python3.11 -m timeit -s 'from random import randrange' 'randrange(65535)'
python3.11 -m timeit -s 'from random import randrange' 'randrange(65536)'

Also, I used PyPy3 on its 3.7 version of random.py but with the modern code for _getrandbits.  There was no discernible change with and without the proposed edit.

Please go ahead and update to the best version of your patch.

I'll mark this BPO issue as "later" and will run tests again as 3.11 gets close to a release.  Right now, CPython's performance is still in flux and it is too early to try to tune application code for the current version.
msg407636 - (view) Author: Andrew Lin (onethreeseven) * Date: 2021-12-04 04:36
I finally cleaned up my benchmark script, which feels like a big hack.  I should really learn to use pyperf or something.
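For what it's worth, the earlier timeit one-liners translate directly to pyperf (assuming it is installed, e.g. via pip install pyperf), which takes care of warmup and run-to-run variance:

    python3 -m pyperf timeit -s 'from random import randrange' 'randrange(65535)'
    python3 -m pyperf timeit -s 'from random import randrange' 'randrange(65536)'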

Inspired by your comment about 3.8 on the Mac, I tried my Ubuntu installation's stock 3.8.  I get a small degradation (~0.5%) for powers of two and 4-5% speedups just below powers of two and for shuffle.  (This is under the slightly strange situation where _randbelow is using the main branch code but everything else in the class is imported from 3.8.)

_randbelow_without_getrandbits() is consistently (if only slightly) faster, as you would expect for cases that are basically guaranteed to call self.random() only once.

I am impressed (if not necessarily surprised) by the variation between builds, especially with my friend's ARM builds, which reached nearly 20% for one case.  So I completely understand if it's safer just to pass.  Thanks for your attention, in any event!
msg407679 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-12-04 18:38
Notes
=====

Without Boundmethod
-------------------
  LOAD_FAST self
  LOAD_METHOD getrandbits
  LOAD_FAST k
  CALL_METHOD 1

Form Boundmethod
----------------
  LOAD_FAST self
  LOAD_ATTR getrandbits
  STORE_FAST getrandbits

Call Boundmethod
----------------
  LOAD_FAST getrandbits
  LOAD_FAST k
  CALL_FUNCTION 1
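
A quick way to reproduce the listings above with the dis module (a sketch; exact opcode names vary across CPython versions, and 3.11 reworks the call opcodes):

    import dis

    def call_direct(self, k):
        return self.getrandbits(k)

    def call_cached(self, k):
        getrandbits = self.getrandbits
        return getrandbits(k)

    dis.dis(call_direct)   # method-call path: LOAD_METHOD ... CALL_METHOD (3.10)
    dis.dis(call_cached)   # bound-method path: LOAD_ATTR + STORE_FAST, then a plain call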

Expected Call Frequency
-----------------------

n = 1000
r = 0 .. 1023  (k = 10 bits)
P(r >= n) = 24 / 1024 ≈ 2.3%
E[calls] = 1 / (1 - 24/1024) = 1.024

n = 1024
r = 0 .. 2047  (k = 11 bits)
P(r >= n) = 1024 / 2048 = 50%
E[calls] = 1 / (1 - 1024/2048) = 2.0
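
The same arithmetic as a quick check: each getrandbits(k) draw is accepted with probability n / 2**k, so the expected number of draws is the mean of a geometric distribution, 1 / (n / 2**k) = 2**k / n.

    def expected_calls(n):
        k = n.bit_length()
        return 2 ** k / n      # equals 1 / (1 - (2**k - n) / 2**k)

    print(expected_calls(1000))   # 1.024
    print(expected_calls(1024))   # 2.0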
msg407680 - (view) Author: Andrew Lin (onethreeseven) * Date: 2021-12-04 18:53
E[calls] reduces to 2**k / n.  I only realized in working that out that it therefore doesn't quite vary linearly over [2**k, 2**(k+1)), though by what weight one would want to average it I couldn't say.

Personally, if the change adversely impacts performance for any n on any major platform, I will retract the PR.  I was under the impression, based on my own machine, that this would not happen, but it is only my one machine.
msg413186 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2022-02-13 15:04
I'm thinking that we care more about the unhappy cases (the extreme values) than about a mild and implementation-dependent improvement to the average case.
History
Date                 User           Action  Args
2022-04-11 14:59:53  admin          set     github: 90134
2022-02-13 15:04:51  rhettinger     set     resolution: later -> rejected
                                            messages: + msg413186
2021-12-04 18:53:34  onethreeseven  set     messages: + msg407680
2021-12-04 18:38:16  rhettinger     set     messages: + msg407679
2021-12-04 04:37:12  onethreeseven  set     files: + randbelow_timing.py
2021-12-04 04:36:47  onethreeseven  set     messages: + msg407636
2021-12-04 04:22:33  rhettinger     set     status: open -> closed
                                            resolution: later
                                            messages: + msg407635
                                            stage: patch review -> resolved
2021-12-04 04:06:10  rhettinger     set     messages: + msg407634
2021-12-04 03:03:35  onethreeseven  set     messages: + msg407630
2021-12-04 02:46:49  rhettinger     set     messages: + msg407628
2021-12-04 02:32:54  rhettinger     set     assignee: rhettinger
                                            messages: + msg407626
                                            nosy: + rhettinger
2021-12-04 02:22:38  python-dev     set     keywords: + patch
                                            nosy: + python-dev
                                            pull_requests: + pull_request28136
                                            stage: patch review
2021-12-04 02:20:56  onethreeseven  create