
classification
Title: Random._randbelow() loses time by caching an attribute lookup
Type: performance
Stage: resolved
Components: Library (Lib)
Versions: Python 3.11

process
Status: closed
Resolution: rejected
Dependencies:
Superseder:
Assigned To: rhettinger
Nosy List: onethreeseven, python-dev, rhettinger
Priority: normal
Keywords: patch

Created on 2021-12-04 02:20 by onethreeseven, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name: randbelow_timing.py
Uploaded by: onethreeseven, 2021-12-04 04:37

Pull Requests
PR 29911: closed (linked by python-dev, 2021-12-04 02:22)
Messages (10)
msg407625 - (view) Author: Andrew Lin (onethreeseven) * Date: 2021-12-04 02:20
This PR obtains a performance improvement in the random module by removing a cached attribute lookup in Random._randbelow_with_getrandbits() that costs time on average.  In the best cases (on my machine) I get 10% improvement for randrange(), 7% for shuffle(), and 11% for sample(), with no change in the worst cases.

To elaborate...  If n is slightly less than a power of 2, _randbelow_with_getrandbits(n) almost always makes just one call to getrandbits(), so caching the attribute lookup never pays for itself.  Even in the worst case, when n is exactly a power of 2, the expected number of calls to getrandbits() is still only 2, and on my machine caching just breaks even.  (A friend ran it on ARM and got even more favorable results.)
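For reference, the two shapes being compared look roughly like this (a sketch modeled on Lib/random.py with illustrative names; the exact upstream code differs in docstrings and details):

    # Current form: cache the bound method in a local before the loop.
    def _randbelow_cached(self, n):
        getrandbits = self.getrandbits   # attribute lookup paid once, up front
        k = n.bit_length()
        r = getrandbits(k)               # 0 <= r < 2**k
        while r >= n:                    # re-draw only when r lands in [n, 2**k)
            r = getrandbits(k)
        return r

    # Proposed form: look the method up at each call site instead.
    def _randbelow_direct(self, n):
        k = n.bit_length()
        r = self.getrandbits(k)
        while r >= n:
            r = self.getrandbits(k)
        return r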

I included a similar change to _randbelow_without_getrandbits() on similar grounds, although I didn't benchmark it.

(Brand new here; let me know if I've left anything out!)
msg407626 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-12-04 02:32
Thanks for the suggestion.  I'll try this out on various builds and machines to see how it works out.  The speed of locals is consistently fast, but the speed of method calls has varied over the last few versions of Python (generally getting faster).  When this code was first written, it was benchmarked and localizing was a net win.
msg407628 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-12-04 02:46
* If you post a timing script, it will make it easier for me to verify this across versions, across machines, and for various input sizes.

* If you have time, do run some benchmarks for _randbelow_without_getrandbits()

* Try your change with and without the walrus operator.  If it doesn't make a difference in the timings, let's use the non-walrus version, which is more readable.
msg407630 - (view) Author: Andrew Lin (onethreeseven) * Date: 2021-12-04 03:03
Timings are unaffected by updating to non-walrus, so I'll do so in the PR.

Using _randbelow_without_getrandbits() I get 5% improvement to sample([None] * 2047, 50); 6% to shuffle([None] * 2000); and approximately 6% to randrange() for any number significantly less than 2**32.  It's not surprising that the absolute speedup is smaller, because _randbelow_without_getrandbits() does a lot more work that is untouched by the change.  On the other hand, the speedup is more consistent, since _randbelow_without_getrandbits(n) almost never calls random() more than once for any n significantly less than 2**32.
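For reference, a simplified sketch of that variant (the real Lib/random.py version also warns when n exceeds the 53-bit precision of random() and handles n == 0; names and details may differ):

    def _randbelow_without_getrandbits(self, n, maxsize=1 << 53):
        random = self.random                 # the cached lookup the patch would drop
        rem = maxsize % n
        limit = (maxsize - rem) / maxsize    # accept r only where int(r * maxsize) % n is unbiased
        r = random()
        while r >= limit:                    # for n far below maxsize, limit is close to 1.0,
            r = random()                     # so a second call is almost never needed
        return int(r * maxsize) % n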

I will clean up my benchmark script and post it.

Thanks for the feedback!
msg407634 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-12-04 04:06
I just did a few timings using the stock python.org Mac builds.  Only Python 3.10 gave the expected speed-up.  Python 3.8 and Python 3.9 got slower.  Python 3.11 was slightly slower.

I think we should pass on this proposed change.  The current code is safer in that it has been consistently good while the proposed change is inconsistent and can't be relied upon across builds.
msg407635 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-12-04 04:22
On the current 3.11, I do get a speedup below a power of two, but a slight degradation (as expected) at a power of two.

python3.11 -m timeit -s 'from random import randrange' 'randrange(65535)'
python3.11 -m timeit -s 'from random import randrange' 'randrange(65536)'

Also, I used PyPy3 on its 3.7 version of random.py but with the modern code for _getrandbits.  There was no discernible change with and without the proposed edit.

Please go ahead and update to the best version of your patch.

I'll mark this BPO issue as "later" and will run tests again as 3.11 gets close to a release.  Right now, CPython's performance is still in flux and it is too early to try to tune application code for the current version.
msg407636 - (view) Author: Andrew Lin (onethreeseven) * Date: 2021-12-04 04:36
I finally cleaned up my benchmark script, which feels like a big hack.  I should really learn to use pyperf or something.
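For what it's worth, the earlier timeit one-liners translate directly to pyperf (assuming it is installed, e.g. via pip install pyperf), which takes care of warmup and run-to-run variance:

    python3 -m pyperf timeit -s 'from random import randrange' 'randrange(65535)'
    python3 -m pyperf timeit -s 'from random import randrange' 'randrange(65536)'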

Inspired by your comment about 3.8 on the Mac, I tried my Ubuntu installation's stock 3.8.  I get a small degradation (~0.5%) for powers of two and 4-5% speedups just below powers of two and for shuffle.  (This is under the slightly strange situation where _randbelow is using the main branch code but everything else in the class is imported from 3.8.)

_randbelow_without_getrandbits() is consistently (if only slightly) faster, as you would expect for cases that are basically guaranteed to call self.random() only once.

I am impressed (if not necessarily surprised) by the variation between builds, especially with my friend's ARM builds, which reached nearly 20% for one case.  So I completely understand if it's safer just to pass.  Thanks for your attention, in any event!
msg407679 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-12-04 18:38
Notes
=====

Without Boundmethod
-------------------
  LOAD_FAST self
  LOAD_METHOD getrandbits
  LOAD_FAST k
  CALL_METHOD 1

Form Boundmethod
----------------
  LOAD_FAST self
  LOAD_ATTR getrandbits
  STORE_FAST getrandbits

Call Boundmethod
----------------
  LOAD_FAST getrandbits
  LOAD_FAST k
  CALL_FUNCTION 1
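
A quick way to reproduce the listings above with the dis module (a sketch; exact opcode names vary across CPython versions, and 3.11 reworks the call opcodes):

    import dis

    def call_direct(self, k):
        return self.getrandbits(k)

    def call_cached(self, k):
        getrandbits = self.getrandbits
        return getrandbits(k)

    dis.dis(call_direct)   # method-call path: LOAD_METHOD ... CALL_METHOD (3.10)
    dis.dis(call_cached)   # bound-method path: LOAD_ATTR + STORE_FAST, then a plain call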

Expected Call Frequency
-----------------------

n = 1000
r = 0 .. 1023  (k = 10 bits)
P(r >= n) = 24 / 1024 ≈ 2.3%
E[calls] = 1 / (1 - 24/1024) = 1.024

n = 1024
r = 0 .. 2047  (k = 11 bits)
P(r >= n) = 1024 / 2048 = 50%
E[calls] = 1 / (1 - 1024/2048) = 2.0
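
The same arithmetic as a quick check: each getrandbits(k) draw is accepted with probability n / 2**k, so the expected number of draws is the mean of a geometric distribution, 1 / (n / 2**k) = 2**k / n.

    def expected_calls(n):
        k = n.bit_length()
        return 2 ** k / n      # equals 1 / (1 - (2**k - n) / 2**k)

    print(expected_calls(1000))   # 1.024
    print(expected_calls(1024))   # 2.0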
msg407680 - (view) Author: Andrew Lin (onethreeseven) * Date: 2021-12-04 18:53
E[calls] reduces to 2**k / n.  I only realized in working that out that it therefore doesn't quite vary linearly over [2**k, 2**(k+1)), though by what weight one would want to average it I couldn't say.

Personally, if the change adversely impacts performance for any n on any major platform, I will retract the PR.  I was under the impression, based on my own machine, that this would not happen, but it is only my one machine.
msg413186 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2022-02-13 15:04
I'm thinking that we care more about the unhappy cases (the extreme values) than about a mild and implementation-dependent improvement to the average case.
History
Date                 User           Action  Args
2022-04-11 14:59:53  admin          set     github: 90134
2022-02-13 15:04:51  rhettinger     set     resolution: later -> rejected
                                            messages: + msg413186
2021-12-04 18:53:34  onethreeseven  set     messages: + msg407680
2021-12-04 18:38:16  rhettinger     set     messages: + msg407679
2021-12-04 04:37:12  onethreeseven  set     files: + randbelow_timing.py
2021-12-04 04:36:47  onethreeseven  set     messages: + msg407636
2021-12-04 04:22:33  rhettinger     set     status: open -> closed
                                            resolution: later
                                            messages: + msg407635
                                            stage: patch review -> resolved
2021-12-04 04:06:10  rhettinger     set     messages: + msg407634
2021-12-04 03:03:35  onethreeseven  set     messages: + msg407630
2021-12-04 02:46:49  rhettinger     set     messages: + msg407628
2021-12-04 02:32:54  rhettinger     set     assignee: rhettinger
                                            messages: + msg407626
                                            nosy: + rhettinger
2021-12-04 02:22:38  python-dev     set     keywords: + patch
                                            nosy: + python-dev
                                            pull_requests: + pull_request28136
                                            stage: patch review
2021-12-04 02:20:56  onethreeseven  create