This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Bias in random.choices(long_sequence)
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.11, Python 3.10
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: Dennis Sweeney, mark.dickinson, rhettinger
Priority: normal Keywords: patch

Created on 2021-05-08 21:10 by Dennis Sweeney, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
int_choices.diff Dennis Sweeney, 2021-05-09 17:06
Messages (5)
msg393285 - (view) Author: Dennis Sweeney (Dennis Sweeney) * (Python committer) Date: 2021-05-08 21:10
Problem: random.choices() never returns entries at odd indices of a
sufficiently long sequence. For example:

>>> import random
>>> [x % 2 for x in random.choices(range(2**53), k=20)]
[0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1]
>>> [x % 2 for x in random.choices(range(2**54), k=20)]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

This might require a 64-bit build.
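The parity artifact can also be reproduced deterministically. A minimal sketch, assuming the usual 53-bit double mantissa: random() always returns k / 2**53 for some integer k, so random() * 2**54 is exactly 2*k, an even integer, and floor() preserves it.

```python
import random
from math import floor

# random() returns k / 2**53 for some integer 0 <= k < 2**53, so
# random() * 2**54 == 2*k exactly -- always an even integer.
# (Assumes a standard 53-bit double mantissa.)
draws = [floor(random.random() * 2**54) for _ in range(1000)]
assert all(d % 2 == 0 for d in draws)
```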

Possible solutions:

1) Do nothing.

    Document the quirk, and recommend using
    [seq[randrange(len(seq))] for i in range(k)]

2) Use integer arithmetic rather than random().

    Using _randbelow() in the unweighted case alone breaks
    test_random.test_choices_algorithms(), but adding a case for
    `if isinstance(total, int)` and using _randbelow() there as
    well creates a difference between the reproducibility
    of weights=[1]*n versus weights=[1.0]*n.

3) Raise an Exception or Warning when loss of precision is likely to occur.

    Something like `if n > 2.0 ** BPF: raise OverflowError(...)`.
    All smaller integers are exactly representable, but that doesn't
    quite eliminate the bias (see below).
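As a sanity check on option 1's workaround (a sketch, not the stdlib's code): indexing with randrange() uses integer arithmetic throughout, so odd indices reappear even for populations larger than 2**53.

```python
import random

seq = range(2**54)
k = 1000
# Option 1's workaround: index with randrange() instead of
# floor(random() * n), so no float rounding is involved.
sample = [seq[random.randrange(len(seq))] for _ in range(k)]
# With overwhelming probability some draws are odd
# (P(all 1000 draws even) == 2**-1000).
assert any(x % 2 == 1 for x in sample)
```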


-----------------------

There is bias in random.choices() even when n <= 2**53:

>>> Counter(val % 3 for val in
...         choices(range(2**53 - 2**51), k=1_000_000))
Counter({1: 374976, 0: 333613, 2: 291411})

>>> Counter(val % 3 for val in
...         choices(range(2**53 - 2**51), k=1_000_000)
...         if val < 2**51)
Counter({0: 166669, 1: 83664, 2: 83293})


For some explanation, imagine floats have
only 3 bits of precision. Then random.choices() would do something
like this on a sequence of length 7:

# Random 3-significant-bit floats
def random():
    return randrange(8) / 8

# selecting from a pool of size 2**BPF - 1
>>> Counter(floor(random() * 7) for _ in range(1_000_000))
Counter({0: 249566, 5: 125388, 6: 125251, 2: 125202, 1: 125149, 3: 124994, 4: 124450})

Issue: both 0/8 and 7/8 fall in [0, 1).

Similar biases:

>>> Counter(floor(randrange(16)/16*15) for _ in range(1_000_000))
Counter({0: 124549, 5: 62812, 14: 62807, 6: 62766, 10: 62740, 3: 62716, 12: 62658, 13: 62649, 9: 62473, 8: 62376, 2: 62373, 4: 62346, 11: 62293, 1: 62278, 7: 62164})

>>> Counter(floor(randrange(16)/16*14) for _ in range(1_000_000))
Counter({7: 125505, 0: 124962, 11: 62767, 9: 62728, 6: 62692, 10: 62684, 4: 62625, 3: 62465, 12: 62428, 13: 62397, 2: 62332, 8: 62212, 5: 62176, 1: 62027})

>>> def bias(bits, n):
...     c = Counter(floor(randrange(2**bits)/(2**bits) * n)
...                 for _ in range(1_000_000))
...     m = min(c.values())
...     preferred = [k for k, v in c.items() if v / m > 1.5]
...     preferred.sort()
...     return preferred

>>> bias(bits=4, n=15)
[0]
>>> bias(bits=4, n=14)
[0, 7]
>>> bias(bits=8, n=250)
[0, 41, 83, 125, 166, 208]

# Observed pattern for the indices that receive extra probability;
# I have no proof that this holds universally
>>> [250 * i // (2**8 - 250) for i in range(6)]
[0, 41, 83, 125, 166, 208]
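The observation can be checked exactly, without sampling, by pushing every possible randrange(2**bits) outcome through the same float mapping. A sketch (`overrepresented` is a hypothetical helper, not part of the issue's patch):

```python
from collections import Counter
from math import floor

def overrepresented(bits, n):
    # Enumerate all 2**bits equally likely outcomes and count how many
    # map to each index; indices hit more often than the minimum are
    # the biased "bins".
    counts = Counter(floor(j / 2**bits * n) for j in range(2**bits))
    m = min(counts.values())
    return sorted(k for k, v in counts.items() if v > m)

assert overrepresented(4, 15) == [0]
assert overrepresented(4, 14) == [0, 7]
# Agrees with the conjectured formula for bits=8, n=250:
assert overrepresented(8, 250) == [250 * i // (2**8 - 250) for i in range(6)]
```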

If using the OverflowError approach, I think using a threshold of 2**BPF/2
would only make some "bins" (indices) have 1.5x the probability of
others rather than 2x, and thus would not remove the problem either.
msg393312 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-05-09 09:46
This is similar to #9025 (for randrange), and in principle the random.choices case without explicit weights could be fixed similarly. But I'm not sure it's worth it, for a couple of reasons:

- The problem only really affects "virtual" population sequences like ``range``, where the actual population isn't realised in memory. For populations in list form, in practice the population size will be limited to a few billion, and I think you'd be hard pushed to demonstrate statistically detectable bias at that size. (That is, my claim is that it would be essentially impossible to distinguish the existing "random.choices" implementation from a "fixed" unbiased version based on inputs and outputs alone.)

- The speed of random.choices is a particular selling point - we don't have anything else in the random module that lets you generate millions of variates at once. I'd expect replacing `floor(random() * n)` with `self._randbelow(n)` to cause a significant speed regression.

- For explicit weights, there isn't a practical route to handling those that doesn't involve use of floating-point in general, so for those we have to accept some tiny bias anyway. (Yes, we could technically do something unbiased for the case of integer-only weights, but that seems like significant work for a special case that I'd expect just doesn't matter that much in practice.) And again in this case, the weights list will usually be a "real" list, so we're again limited to populations of a few billion and it'll be hard to demonstrate statistically detectable bias.

@Dennis: for interests sake, do you have bandwidth to do some timings of the current "[population[floor(random() * n)] for i in _repeat(None, k)]" versus something like "[population[_randbelow(n)] for i in _repeat(None, k)]" (after putting "_randbelow = self._randbelow" to save an attribute lookup on each iteration)?
msg393327 - (view) Author: Dennis Sweeney (Dennis Sweeney) * (Python committer) Date: 2021-05-09 17:06
Your suspicion looks correct; random() is faster:

.\python.bat -m pyperf timeit -s "from random import choices" "choices(range(100), k=10_000)"

Before int_choices.diff: Mean +- std dev: 1.49 ms +- 0.09 ms
After int_choices.diff:  Mean +- std dev: 3.50 ms +- 0.33 ms
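A rough version of the same comparison can be run on any build with timeit. This only mirrors the two inner expressions, not the actual int_choices.diff patch, and absolute numbers will vary by machine:

```python
import timeit
from math import floor
from random import random, randrange

n, k = 100, 10_000

# Current approach: scale a float and truncate.
t_float = timeit.timeit(
    "[floor(random() * n) for _ in range(k)]",
    globals=globals(), number=20)

# Integer-arithmetic alternative.
t_int = timeit.timeit(
    "[randrange(n) for _ in range(k)]",
    globals=globals(), number=20)

print(f"float path: {t_float:.3f}s  int path: {t_int:.3f}s")
```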
msg393329 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-05-09 17:58
This is known and an intentional design decision.  It isn't just a speed issue.  Because the weights can be floats, we have floats involved at the outset and some round-off is unavoidable.  To keep the method internally consistent, the same technique is used even when the weights aren't specified:

    >>> from random import choices, seed
    >>> seed(8675309**3)
    >>> s = choices('abcdefg', k=20)
    >>> seed(8675309**3)
    >>> t = choices('abcdefg', [0.7] * 7, k=20)
    >>> s == t
    True

FWIW, this is documented: 
"""
For a given seed, the choices() function with equal weighting typically produces a different sequence than repeated calls to choice(). The algorithm used by choices() uses floating point arithmetic for internal consistency and speed. The algorithm used by choice() defaults to integer arithmetic with repeated selections to avoid small biases from round-off error.
"""
msg393333 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-05-09 18:58
FWIW, the principal use case that choices() was designed for is resampling/bootstrapping.  In that use case, speed matters and small imbalances in large sequences don't matter at all.  Also, the API was designed to make it easier to select from an itemized population of individuals than from a large range followed by a modulo calculation (we already have randrange() to meet that need).

I could add a sentence to the last paragraph recommending "[choice(pop) for i in range(k)]" for the case where 1) the population is large, 2) speed doesn't matter, 3) the weights are equal, and 4) integer math is desired to avoid any trace of bias.  That said, I don't think users would benefit from it; this is largely a theoretical concern that doesn't warrant more than the existing paragraph.
History
Date User Action Args
2022-04-11 14:59:45 admin set github: 88246
2021-05-09 18:58:59 rhettinger set messages: + msg393333
2021-05-09 17:58:19 rhettinger set status: open -> closed
messages: + msg393329
assignee: rhettinger
resolution: not a bug
stage: resolved
2021-05-09 17:06:46 Dennis Sweeney set files: + int_choices.diff
keywords: + patch
messages: + msg393327
2021-05-09 09:46:03 mark.dickinson set messages: + msg393312
2021-05-09 07:00:46 mark.dickinson set nosy: + mark.dickinson
2021-05-08 21:10:37 Dennis Sweeney create