Issue 23974: random.randrange() biased output

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/68162

classification

Title:	random.randrange() biased output
Type:	behavior	Stage:	needs patch
Components:	Library (Lib)	Versions:	Python 2.7

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:		Nosy List:	christian.heimes, gurnec, mark.dickinson, rhettinger, serhiy.storchaka, skip.montanaro
Priority:	normal	Keywords:	patch

Created on 2015-04-16 19:09 by gurnec, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
issue23974.patch	gurnec, 2015-07-02 17:52		review

Messages (11)
msg241261 - (view)	Author: Christopher Gurnee (gurnec) *	Date: 2015-04-16 19:09
Due to an optimization in random.randrange() only in Python 2, as the stop-start range approaches 2^53 the output becomes noticeably biased. This bug also affects random.SystemRandom. For example, counting the number of even ints in a set of 10^6 random ints each in the range [0, 2^52 * 2/3) produces normal results; about half are even: >>> sum(randrange(2*52 2//3) % 2 for i in xrange(1000000)) / 1000000.0 0.499932 Change the range to [0, 2^53 * 2/3), and you get a degenerate case where evens are two times more likely to occur than odds: >>> sum(randrange(2*53 2//3) % 2 for i in xrange(1000000)) / 1000000.0 0.333339 The issue occurs in three places inside randrange(), here's one: if istart >= _maxwidth: return self._randbelow(istart) return _int(self.random() * istart) _maxwidth is the max size of a double where every digit to the left of the decimal point can still be represented w/o loss of precision (2^53, where a double has 53 mantissa bits). With istart >= _maxwidth, _randbelow() behaves correctly. With istart < _maxwidth, the rounding error in random() * istart begins to cause problems as istart approaches 2^53. Changing _maxwidth to be significantly less should (practically speaking anyways) fix this, although I'm open to suggestion on just how conservatively it should be set.
msg241266 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2015-04-16 19:24
I'm completely unqualified to offer a concrete, expert opinion here, but it seems like defaulting _maxwidth to 1L<<52 should work. I have no idea of the resulting performance implications though. >>> for bpf in (45, 46, 47, 48, 49, 50, 51, 52, 53): ... print bpf, ... print sum(randrange(2*53 2//3, _maxwidth=1L<<bpf) % 2 for i in xrange(1000000)) / 1000000.0, ... print sum(randrange(2*52 2//3, _maxwidth=1L<<bpf) % 2 for i in xrange(1000000)) / 1000000.0 ... 45 0.499959 0.500144 46 0.501622 0.500371 47 0.500257 0.499692 48 0.499567 0.499942 49 0.499789 0.50028 50 0.499882 0.500488 51 0.500479 0.50006 52 0.500553 0.500234 53 0.33275 0.500296
msg241268 - (view)	Author: Christopher Gurnee (gurnec) *	Date: 2015-04-16 19:40
I shouldn't have called this a rounding error issue, that's not really what it is. A smaller example might help. If I'm given a random int, x, in the range [0, 12), and asked to produce from it a random int, y, in the range (0,8], I've got (at least?) two choices: 1. y = x If x < 8 Else fail 2. y = f(x), where f maps values from [0, 12) -> (0,8] The problem with method 2 is you end up with a mapping like this: 0,1 -> 0 2 -> 1 3,4 -> 2 5 -> 3 6,7 -> 4 8 -> 5 9,10 -> 6 11 -> 7 _randbelow() uses method 1 above. _int(self.random() * istart) is more like method 2. I chose 2^53 * 2/3 just because the bias was easy to demonstrate. There will always be some bias when stop-start % 2^53 != 0, but it might not manifest itself as easily as checking for evenness. Personally, I think 2^52 is still way too high as a cutoff point for using the (presumably faster, I didn't timeit) method 2, but I don't claim to be an expert here....
msg241315 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2015-04-17 06:29
See also #9025, which contains a fix for this exact problem in Python 3.
msg241322 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2015-04-17 08:20
In the #9025 discussion, reproducibility was a key concern. Though I note that despite the comments there, we still have no documented guarantees of reproducibility, so maybe it's safe to go ahead and change this. Raymond? IMO, the fix from #9025 should be backported. Note that that fix fixes the issue completely: all outputs will occur with equal probability. That's under the unrealistic assumption of a perfect source of random bits/words, of course; but the key point is that randrange shouldn't introduce any new biases that aggravate existing ones in the source generator. Reducing `_maxwidth` would just reduce the size of the randrange bias without eliminating it completely: if we're going to make any change at all to the source, we should fix the problem once and for all. Another option that avoids breaking reproducibility would be to note the bias in the docs, and provide a doc recipe for an unbiased randrange, for those who need it.
msg241696 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2015-04-21 03:25
> we still have no documented guarantees of reproducibility, > so maybe it's safe to go ahead and change this. Raymond? It is documented here: https://docs.python.org/3/library/random.html#notes-on-reproducibility The idea that is that algorithms (and the generated sequences) may change in between minor releases, but not in micro releases. And random() itself if more restricted (guaranteed to be the same across minor releases as well). The policy was new in Python 3. It was a liberalization of the implied policy in Python 2 that we didn't change the sequences at all (except for flat-out brokenness). Accordingly, the #9025 debiasing was intentionally not backported so we won't break reproducibility and adversely affect performance of existing code. We could add a note as Mark suggests, but please keep it terse and affirmatively worded (perhaps something "See also: Recipe 31xx for a way to eliminate rounding biases in randrange()". The docs are not well served by being littered with Danger-signs and security warnings.
msg241863 - (view)	Author: Christian Heimes (christian.heimes) *	Date: 2015-04-23 12:07
IMO it's not a security issue at all. If you have to care about security, you shouldn't use the random module at all. random.SystemRandom() merely uses a CPRNG as entropy source. But It also manipulates numbers in ways that may or may not be safe. Only os.getrandom() returns unmodified and unbiased random numbers -- iff the operating system provides a proper CPRNG.
msg241866 - (view)	Author: Christopher Gurnee (gurnec) *	Date: 2015-04-23 16:14
> If you have to care about security, you shouldn't use the random module > at all. random.SystemRandom() merely uses a CPRNG as entropy source. But > It also manipulates numbers in ways that may or may not be safe. I must respectfully disagree with this. The current docs say: > Use os.urandom() or SystemRandom if you require a cryptographically > secure pseudo-random number generator. That's a pretty strong statement, and IMO it would lead most to believe that SystemRandom along with all of its member functions is safe to use for cryptographic purposes[1] (assuming of course that os.urandom() is also a safe CSPRNG). As a compromise, perhaps SystemRandom could provide its own randrange() with the #9025 fix, while keeping random.randrange() unmodified to preserve the implied same-sequence rule. [1] I don't mean to imply that this bias bug necessarily is a cryptographic safety issue--it seems unlikely to me that it is one, however not being a cryptographer myself, I'd rather not draw any conclusions either way, and instead I'd prefer to err on the side of safety.
msg246091 - (view)	Author: Christopher Gurnee (gurnec) *	Date: 2015-07-02 17:52
There's been no activity on this issue in a few months.... The three options as I see it are: 1. Fix it for both randrange and SystemRandom.randrange, breaking randrange's implied stability between minor versions. 2. Fix it only for SystemRandom.randrange. 3. Close it as wont fix (for performance reasons I'd assume?). Since I'm in favor of option 2, I've attached a simple patch which implements it. Here are some quick-and-dirty performance numbers showing the decrease in performance (3 tests of the original code followed by 3 of the patched code): $ python -m timeit -r 10 -s 'import random; s = random.SystemRandom(); r = 28' 's.randrange(r)' 10000 loops, best of 10: 22.5 usec per loop $ python -m timeit -r 10 -s 'import random; s = random.SystemRandom(); r = 231' 's.randrange(r)' 10000 loops, best of 10: 22.6 usec per loop $ python -m timeit -r 10 -s 'import random; s = random.SystemRandom(); r = 2*53 2//3' 's.randrange(r)' 10000 loops, best of 10: 22.4 usec per loop $ python -m timeit -r 10 -s 'import random; s = random.SystemRandom(); r = 28' 's.randrange(r)' 10000 loops, best of 10: 23.7 usec per loop $ python -m timeit -r 10 -s 'import random; s = random.SystemRandom(); r = 231' 's.randrange(r)' 10000 loops, best of 10: 46.2 usec per loop $ python -m timeit -r 10 -s 'import random; s = random.SystemRandom(); r = 2*53 2//3' 's.randrange(r)' 10000 loops, best of 10: 34.8 usec per loop The patch also includes a unit test (with a false negative rate of 1 in 8.5 * 10^-8: http://www.wolframalpha.com/input/?i=probability+of+417+or+fewer+successes+in+1000+trials+with+p%3D0.5). Any opinions on which of the three options should be taken?
msg246130 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2015-07-03 04:41
I choose option 3, close as won't fix. The ship for 2.7 sailed a long time ago. The code for randrange() was designed by Tim Peters and has been in-place for a decade and half without causing suffering in the world. Also, changing the type to "behavior" from "security" -- that latter classification is quite a stretch.
msg246179 - (view)	Author: Christopher Gurnee (gurnec) *	Date: 2015-07-03 14:20
Option 3 of course wasn't my first choice (given how small the patch is and how minimal its potential negative impact), but it's certainly better than allowing an issue to linger in limbo. Thank you, all.

History
Date	User	Action	Args
2022-04-11 14:58:15	admin	set	github: 68162
2015-07-03 14:20:13	gurnec	set	messages: + msg246179
2015-07-03 04:41:24	rhettinger	set	status: open -> closed type: security -> behavior resolution: wont fix messages: + msg246130
2015-07-02 17:52:21	gurnec	set	files: + issue23974.patch keywords: + patch messages: + msg246091
2015-04-23 16:14:47	gurnec	set	messages: + msg241866
2015-04-23 12:07:37	christian.heimes	set	nosy: + christian.heimes messages: + msg241863
2015-04-21 03:25:22	rhettinger	set	messages: + msg241696
2015-04-17 08:20:56	mark.dickinson	set	messages: + msg241322
2015-04-17 06:29:07	mark.dickinson	set	messages: + msg241315
2015-04-16 19:40:54	gurnec	set	messages: + msg241268
2015-04-16 19:24:47	skip.montanaro	set	nosy: + skip.montanaro messages: + msg241266
2015-04-16 19:17:27	serhiy.storchaka	set	nosy: + rhettinger, mark.dickinson, serhiy.storchaka type: security stage: needs patch
2015-04-16 19:09:59	gurnec	create