Issue 9025: Non-uniformity in randrange for large arguments.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/53271

classification

Title:	Non-uniformity in randrange for large arguments.
Type:	behavior	Stage:
Components:	Library (Lib)	Versions:	Python 3.2

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	rhettinger	Nosy List:	belopolsky, mark.dickinson, orsenthil, pitrou, rhettinger, terry.reedy, vstinner
Priority:	high	Keywords:	patch

Created on 2010-06-18 10:20 by mark.dickinson, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
issue9025.patch	mark.dickinson, 2010-06-18 12:22
issue9025_v2.patch	mark.dickinson, 2010-06-18 15:01
_smallrandbelow.diff	mark.dickinson, 2010-06-23 20:43
randint.py	vstinner, 2010-06-24 01:41

Messages (25)
msg108095 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2010-06-18 10:20
Not a serious bug, but worth noting: the result of randrange(n) is not even close to uniform for large n. Witness the obvious skew in the following (this takes a minute or two to run, so you might want to reduce the range argument):

Python 3.2a0 (py3k:81980, Jun 14 2010, 11:23:36)
[GCC 4.2.1 (SUSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from random import randrange
>>> from collections import Counter
>>> Counter(randrange(6755399441055744) % 3 for _ in range(100000000))
Counter({1: 37508130, 0: 33323818, 2: 29168052})

(The actual probabilities here are, as you might guess from the above numbers: {0: 1/3, 1: 3/8, 2: 7/24}.)

The cause: for n < 2**53, randrange(n) is effectively computed as int(random() * n). For small n, there's a tiny bias involved, but this is still an effective method. However, as n increases towards 2**53, the bias increases significantly. (For n >= 2**53, the random module uses a different strategy that does produce uniformly distributed results.)

A solution would be to lower the cutoff point where randrange() switches from using int(random() * n) to using the _randbelow method.
msg108097 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2010-06-18 10:24
Note: the number 6755399441055744 is special: it's 0.75 * 2**53, and was deliberately chosen so that the non-uniformity is easily exhibited by looking at residues modulo 3. For other numbers of this size, the non-uniformity is just as bad, but demonstrating the non-uniformity clearly would have taken a little more work.
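The skew described above can be reproduced on a much smaller sample. The following is a minimal sketch, not code from this issue: `old_randrange` is a hypothetical stand-in that emulates the int(random() * n) computation (later Python versions no longer compute randrange this way), and the sample size and seed are arbitrary choices.

```python
import random
from collections import Counter

N = 6755399441055744  # 0.75 * 2**53, the value used in the report


def old_randrange(n, rng):
    """Hypothetical stand-in for the pre-fix computation: int(random() * n)."""
    return int(rng.random() * n)


def residue_counts(samples=200_000, seed=9025):
    """Count residues modulo 3 of old_randrange(N) over a seeded sample."""
    rng = random.Random(seed)
    return Counter(old_randrange(N, rng) % 3 for _ in range(samples))


if __name__ == "__main__":
    counts = residue_counts()
    total = sum(counts.values())
    for residue in sorted(counts):
        print(residue, counts[residue] / total)
```

With a uniform generator the three frequencies would each be close to 1/3; here they should come out near the 1/3, 3/8 and 7/24 quoted in the report.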
msg108101 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2010-06-18 12:22
Here's an example patch that removes any bias from randrange(n) (except for bias resulting from the imperfectness of the core MT generator). I added a small private method to Modules/_randommodule.c to aid the computation. This only fixes one instance of int(random() * n) in the Lib/random.py source; the other instances should be modified accordingly. With this patch, randrange is a touch faster than before (20-30% speedup) for small arguments. Is this worth pursuing?
msg108108 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2010-06-18 13:47
The nonuniformity of randrange has a knock-on effect in other random module functions. For example, take a sample of 100 elements from range(6004799503160661), and take the smallest element from that sample. Then the exact distribution of that smallest element is somewhat complicated, but you'd expect it to be even with probability very close to 50%. But it turns out that it's roughly twice as likely to be even as to be odd.

>>> from random import sample
>>> from collections import Counter
>>> population = range(6004799503160661)
>>> Counter(min(sample(population, 100)) % 2 for _ in range(100000))
Counter({0: 66810, 1: 33190})
msg108111 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2010-06-18 15:01
Here's a more careful Python-only patch that fixes the bias in randrange and randint (but not in shuffle, choice or sample). It should work well both for Mersenne Twister and for subclasses of Random that use a poorer PRNG with badly-behaved low-order bits.
msg108118 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2010-06-18 17:56
Will take a look at this in the next few days. Am tempted to just either provide a recipe or provide a new method. That way sequences generated by earlier Python versions are still reproducible.
msg108120 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2010-06-18 18:02
I would prefer to see the correct algorithm in the stdlib and a recipe for how to reproduce old sequences for the users who care.
msg108123 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2010-06-18 18:34
FWIW, we spent ten years maintaining the ability to reproduce sequences. It has become an implicit promise. I'll take a look at the patch in the next few days.
msg108138 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2010-06-18 20:43
Hmm. I hadn't considered the reproducibility problem. Does the module aim for reproducibility across all platforms and all versions of Python? Or just one of those? For small n, I think the patched version of randrange(n) produces the same sequence as before with very high probability, but not with certainty. Since that sounds like a recipe for hard-to-find bugs, it might be better to deliberately perturb the outputs somehow so that the sequence is obviously different from before, rather than subtly different.
msg108310 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2010-06-21 19:55
'Random', without qualification, is commonly taken to mean 'with uniform distribution'. Otherwise it has no specific meaning and could well be a synonym for 'arbitrary' or 'haphazard'.

The behavior reported is buggy and in my opinion should be fixed if possible. I have done simulation research in the past and do not consider such bugs minor. If I had results that depended on these functions, I might want to rerun with the fixed versions to make sure the end results were not affected. I would certainly want the fixed behavior for any future work.

I do not see any promise of reproducibility of sequences from version to version. I do not really see the point, as one can rerun with the old Python version or copy the older random.py. The old versions could be kept with an 'old_' prefix and documented in a separate subsection that starts with "Do not use these buggy old versions of x and y in new code. They are only present for those who want to reproduce old sequences." But I wonder how many people would use them.
msg108438 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2010-06-23 07:54
FWIW, here are two approaches to getting an equi-distributed version of int(n*random()) where 0 < n <= 2**53. The first mirrors the approach currently in the code. The second approach makes fewer calls to random().

from random import random

def rnd1(n):
    assert 0 < n <= 2**53
    N = 1 << (n-1).bit_length()
    r = int(N * random())
    while r >= n:
        r = int(N * random())
    return r

def rnd2(n, N=1<<53):
    assert 0 < n <= N
    NN = N - (N % n) + 0.0  # largest multiple of n <= N
    r = N * random()        # a float with an integral value
    while r >= NN:
        r = N * random()
    return int(r) % n
msg108441 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2010-06-23 08:41
Either of these looks good to me. If the last line of the second is changed from "return int(r) % n" to "return int(r) // (N // n)" then it'll use the high-order bits of random() instead of the low-order bits. This doesn't matter for MT, but might matter for subclasses of Random using a different underlying generator.
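To make the suggested change concrete, here is a minimal sketch of rnd2 with the final line swapped for the high-order-bits mapping. It is an illustration rather than code from any of the attached patches: the name `rnd2_high` and the `rand` parameter are hypothetical, the latter added only so that another generator returning floats in [0.0, 1.0) can be substituted.

```python
from random import random


def rnd2_high(n, N=1 << 53, rand=random):
    """Variant of rnd2 that maps through the high-order bits.

    `rand` is a hypothetical hook (not in the original) for substituting
    another generator that returns floats in [0.0, 1.0).
    """
    assert 0 < n <= N
    bucket = N // n          # number of raw values mapped to each result
    limit = bucket * n       # largest multiple of n <= N
    r = int(N * rand())      # an integer in range(N) when N == 2**53
    while r >= limit:        # reject the incomplete final bucket
        r = int(N * rand())
    return r // bucket       # the high-order part of r picks the result
```

Each result in range(n) is reached from exactly N // n raw values, so the output is as uniform as the underlying random() allows, and consecutive raw values map to the same result instead of cycling through residues.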
msg108453 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-06-23 14:37
This wouldn't be the first time reproducibility is dropped, since reading from the docs:

“As an example of subclassing, the random module provides the WichmannHill class that implements an alternative generator in pure Python. The class provides a backward compatible way to reproduce results from earlier versions of Python, which used the Wichmann-Hill algorithm as the core generator.”

Also:

> FWIW, we spent ten years maintaining the ability to reproduce
> sequences. It has become an implicit promise.

IMO it should either be documented explicitly, or be taken less dearly. There's not much value in an "implicit promise" that's only known by a select few.

(besides, as Terry said, I think most people are more concerned by the quality of the random distribution than by the reproducibility of sequences)
msg108460 - (view)	Author: Senthil Kumaran (orsenthil) *	Date: 2010-06-23 17:00
I guess Antoine wanted to point out this: "Changed in version 2.3: MersenneTwister replaced Wichmann-Hill as the default generator." But as the paragraph points out, Python did provide the non-default WichmannHill class for generating repeatable sequences with older Python versions. My brief reading on this topic does suggest that 'repeatability' is an important requirement for any PRNG.
msg108463 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2010-06-23 17:33
BTW, the Wichmann-Hill code is gone in py3k, so that doc paragraph needs removing or updating.
msg108466 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2010-06-23 18:01
Thanks guys, I've got it from here.

Some considerations for the PRNG are:
* equidistribution (for quality)
* repeatability from the same seed (even in multithreaded environments)
* quality and simplicity of API (for usability)
* speed (it matters whether a Monte Carlo simulation takes 5 minutes or 30 minutes).

I'm looking at several ideas:
* updating the new randrange() to use the rnd2() algorithm shown above (for equidistribution).
* possibly providing a C version of rnd2() and using it in randrange() for speed and for thread-safety.
* possibly updating shuffle() and choice() to use rnd2().
* moving the existing randrange() to randrange_quick() -- uses int(n * random()) for speed and for reproducibility of previously created sequences. Alternatively, adding a recipe to the docs for recreating old sequences and not cluttering the code with backwards compatibility cruft.

Am raising the priority to normal because I think some effort needs to be made to address equidistribution in 3.2. Will take some time to come up with a good patch that balances quality, simplicity, speed, thread-safety, and reproducibility.

May also consult with Tim Peters, who has previously voiced concerns about stripping bits off of multiple calls to random() because the MT proofs make no guarantees about quality in those cases. I don't think this is an issue in practice, but in theory when we start tossing out some of the calls to random(), we're also throwing away the guarantees of a long period, 623 dimensions, uniformity, and equidistribution.
msg108478 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-06-23 19:26
> Some considerations for the PRNG are:
> * equidistribution (for quality)
> * repeatability from the same seed (even in multithreaded environments)

I believe a reasonable (com)promise would be to guarantee repeatability across a given set of bugfix releases (for example, across all 2.6.x releases). We shouldn't necessarily commit to repeatability across feature releases, especially if it conflicts with desirable opportunities for improvement.
msg108481 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2010-06-23 19:37
> * possibly providing a C version of rnd2()

If recoding in C is acceptable, I think there may be better ( = simpler and faster) ways than doing a direct translation of rnd2. For example, for small k, the following algorithm for randrange(k) suffices:

- take a single 32-bit deviate (generated using genrand_int32)
- multiply by k (a 32-by-32 to 64-bit widening multiply) and return the high 32 bits of the result, provided that the bottom half of the product is <= 2**32 - k (almost always true, for small k).
- consume extra random words as necessary in the case that the bottom half of the product is > 2**32 - k.

I can provide code (with that 3rd step fully expanded) if you're interested in this approach.

This is likely to be significantly faster than a direct translation of rnd2, since in the common case it requires only: one 32-bit deviate from MT, one integer multiplication, one subtraction, and one comparison. By comparison, rnd2 uses (at least) two 32-bit deviates and massages them into a float, before doing arithmetic with that float.

Though it's possible (even probable) that any speed gain would be insignificant in comparison to the rest of the Python machinery involved in a single randrange call.
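The three steps above can be sketched in Python as follows. This is one possible reading of the description, not the contents of the attached _smallrandbelow.diff: the name `small_randbelow` and the `getword` parameter are hypothetical, and `random.getrandbits(32)` stands in for the C-level genrand_int32. The third step is handled by appending another 32-bit word and repeating the same check at higher precision.

```python
import random


def small_randbelow(k, getword=lambda: random.getrandbits(32)):
    """Return a uniform integer in range(k), for 0 < k <= 2**32.

    `getword` is a hypothetical hook returning one 32-bit deviate per call.
    """
    if not 0 < k <= 2**32:
        raise ValueError("k must satisfy 0 < k <= 2**32")
    x = 0       # random words seen so far, read as one big integer
    bits = 0    # total number of random bits in x
    while True:
        x = (x << 32) | getword()
        bits += 32
        product = x * k
        high = product >> bits                 # candidate result
        low = product & ((1 << bits) - 1)      # bottom part of the product
        # If low <= 2**bits - k, no further random bits could carry into
        # the high part, so the result is already determined.
        if low <= (1 << bits) - k:
            return high
```

In the common case the loop body runs once, which matches the cost described in the message: one 32-bit deviate, one multiplication, one subtraction and one comparison.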
msg108487 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2010-06-23 20:43
Just to illustrate, here's a patch that adds a method Random._smallrandbelow, based on the algorithm I described above.
msg108495 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2010-06-23 23:36
Antoine, there does need to be repeatability; there's no question about that. The open question for me is how to offer that repeatability in the cleanest manner. People use random.seed() for reproducible tests. They need it to have their studies become independently validatable, etc. Some people are pickling the state of the RNG and need to restart where they left off, etc.

Mark, thanks for the alternative formulation. I'll take a look when I get a chance. I've got it from here.
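For readers unfamiliar with the two use cases mentioned above, here is a minimal sketch using only documented random module calls (seed, getstate, setstate). The seed value and sequence lengths are arbitrary, and the equalities hold within a single Python version; whether they should hold across versions is exactly what this thread is debating.

```python
import pickle
import random

rng = random.Random(12345)               # seed value is arbitrary
first_run = [rng.randrange(1000) for _ in range(5)]

rng.seed(12345)                          # reseeding replays the same sequence
second_run = [rng.randrange(1000) for _ in range(5)]

saved = pickle.dumps(rng.getstate())     # snapshot the generator state
expected_next = [rng.randrange(1000) for _ in range(5)]

resumed = random.Random()
resumed.setstate(pickle.loads(saved))    # restart where the snapshot was taken
resumed_next = [resumed.randrange(1000) for _ in range(5)]
```

Here `first_run` equals `second_run`, and `resumed_next` equals `expected_next`.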
msg108498 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-06-24 01:41
randint.py: another algorithm to generate a random integer in a range. It uses only operations on unsigned integers (no evil floating point number). It calls tick() multiple times to generate enough entropy. It has a uniform distribution (if the input generator has a uniform distribution). tick() is simply the output of the Mersenne Twister generator. The algorithm can be optimized (especially the part computing ndigits and scale).
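The attached randint.py is not reproduced on this page. As a rough illustration of the general idea of an integer-only generator, the sketch below uses plain rejection sampling on top of `getrandbits`. It is an assumption about one way to do this, not Victor's algorithm: the name `randbelow_bits` and the `getrandbits` parameter are hypothetical.

```python
import random


def randbelow_bits(n, getrandbits=random.getrandbits):
    """Return a uniform integer in range(n) using only integer operations."""
    if n <= 0:
        raise ValueError("n must be positive")
    k = n.bit_length()       # enough bits to cover every value below n
    r = getrandbits(k)       # uniform on range(2**k)
    while r >= n:            # reject out-of-range draws and retry
        r = getrandbits(k)
    return r
```

Because 2**k is less than or equal to 2*n, each draw is accepted with probability at least 1/2, so the expected number of calls to `getrandbits` is at most two.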
msg108499 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-06-24 02:35
> Antoine, there does need to be repeatability; there's no question about
> that.

Well, that doesn't address my proposal of making it repeatable across bugfix releases only. There doesn't seem to be a strong use case for perpetual repeatability.

If some people really need perpetual repeatability, I don't understand how they can rely on Python anyway, since we don't make any such promise explicitly (or do they somehow manage to read your mind?). So, realistically, they should already be using their custom-written deterministic generators.
msg108503 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-06-24 07:46
Distribution with my algorithm:

...
from collections import Counter
print Counter(_randint(6755399441055744) % 3 for _ in xrange(100000000))
=> Counter({0L: 33342985, 2L: 33335781, 1L: 33321234})

Distribution: {0: 0.33342985000000003, 2: 0.33335780999999998, 1: 0.33321234}
msg108504 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2010-06-24 09:33
A couple of points:

(1) In addition to documenting the extent of the repeatability, it would be good to have tests to prevent changes that inadvertently change the sequence of randrange values.

(2) For large arguments, cross-platform reproducibility is already a bit fragile. For example, the _randbelow function depends on the system _log function---see the line

    k = int(1.00001 + _log(n-1, 2.0))

Now that we have the bit_length method available, it might be better to use that.
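To illustrate point (2), the sketch below puts the log-based bit count quoted above next to a bit_length-based one. The function names are hypothetical and `math.log` stands in for the module's `_log`; this is a comparison aid, not the code that was eventually committed.

```python
from math import log


def bits_via_log(n):
    """Bit count as in the quoted line; goes through floating-point log."""
    return int(1.00001 + log(n - 1, 2.0))


def bits_via_bit_length(n):
    """Exact number of bits needed to represent n - 1, integers only."""
    return (n - 1).bit_length()
```

The two agree for most arguments, but the integer version does not depend on the platform's log at all. Just below a power of two they can differ by one: for n - 1 == 2**60 - 1 the log-based count comes out as 61 while bit_length gives 60. An overestimate of k costs extra rejections in the sampling loop rather than correctness, but it is the kind of float-dependent detail that makes exact sequences fragile.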
msg115738 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2010-09-07 04:47
Put in a fix with r84576. May come back to it to see if it can or should be optimized with C. For now, this gets the job done.

History
Date	User	Action	Args
2022-04-11 14:57:02	admin	set	github: 53271
2010-09-07 04:47:26	rhettinger	set	status: open -> closed resolution: fixed messages: + msg115738
2010-08-08 01:39:10	rhettinger	set	priority: normal -> high
2010-06-24 09:33:39	mark.dickinson	set	messages: + msg108504
2010-06-24 07:46:57	vstinner	set	messages: + msg108503
2010-06-24 02:35:28	pitrou	set	messages: + msg108499
2010-06-24 01:41:32	vstinner	set	files: + randint.py messages: + msg108498
2010-06-23 23:36:59	rhettinger	set	messages: + msg108495
2010-06-23 20:43:53	mark.dickinson	set	files: + _smallrandbelow.diff messages: + msg108487
2010-06-23 19:37:54	mark.dickinson	set	messages: + msg108481
2010-06-23 19:26:21	pitrou	set	messages: + msg108478
2010-06-23 18:01:21	rhettinger	set	priority: low -> normal messages: + msg108466
2010-06-23 17:33:32	mark.dickinson	set	messages: + msg108463
2010-06-23 17:00:06	orsenthil	set	nosy: + orsenthil messages: + msg108460
2010-06-23 14:37:16	pitrou	set	nosy: + pitrou messages: + msg108453
2010-06-23 08:41:51	mark.dickinson	set	messages: + msg108441
2010-06-23 07:54:30	rhettinger	set	messages: + msg108438
2010-06-21 19:55:48	terry.reedy	set	nosy: + terry.reedy messages: + msg108310
2010-06-18 20:43:09	mark.dickinson	set	messages: + msg108138
2010-06-18 18:34:08	rhettinger	set	messages: + msg108123
2010-06-18 18:02:32	belopolsky	set	messages: + msg108120
2010-06-18 17:56:39	rhettinger	set	messages: + msg108118
2010-06-18 17:07:01	rhettinger	set	assignee: rhettinger
2010-06-18 15:01:03	mark.dickinson	set	files: + issue9025_v2.patch messages: + msg108111
2010-06-18 14:46:39	vstinner	set	nosy: + vstinner
2010-06-18 14:11:40	belopolsky	set	nosy: + belopolsky
2010-06-18 13:47:12	mark.dickinson	set	messages: + msg108108
2010-06-18 12:22:11	mark.dickinson	set	files: + issue9025.patch keywords: + patch messages: + msg108101
2010-06-18 10:24:53	mark.dickinson	set	messages: + msg108097
2010-06-18 10:20:50	mark.dickinson	create