This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Bias in random.choices(long_sequence)
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.11, Python 3.10
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: Dennis Sweeney, mark.dickinson, rhettinger
Priority: normal Keywords: patch

Created on 2021-05-08 21:10 by Dennis Sweeney, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
int_choices.diff Dennis Sweeney, 2021-05-09 17:06
Messages (5)
msg393285 - (view) Author: Dennis Sweeney (Dennis Sweeney) * (Python committer) Date: 2021-05-08 21:10
Problem: random.choices() never returns entries at odd indices of a
sufficiently long sequence. For example:

>>> import random
>>> [x % 2 for x in random.choices(range(2**53), k=20)]
[0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1]
>>> [x % 2 for x in random.choices(range(2**54), k=20)]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

This might require a 64-bit build.
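The parity artifact can also be reproduced deterministically. A minimal sketch, assuming the usual 53-bit double mantissa: random() always returns k / 2**53 for some integer k, so random() * 2**54 is exactly 2*k, an even integer, and floor() preserves it.

```python
import random
from math import floor

# random() returns k / 2**53 for some integer 0 <= k < 2**53, so
# random() * 2**54 == 2*k exactly -- always an even integer.
# (Assumes a standard 53-bit double mantissa.)
draws = [floor(random.random() * 2**54) for _ in range(1000)]
assert all(d % 2 == 0 for d in draws)
```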

Possible solutions:

1) Do nothing.

    Document the quirk, and recommend using
    [seq[randrange(len(seq))] for i in range(k)]

2) Use integer arithmetic rather than random().

    Using _randbelow() in the unweighted case alone breaks
    test_random.test_choices_algorithms(), but adding a case for
    `if isinstance(total, int)` and using _randbelow() there as
    well creates a difference between the reproducibility
    of weights=[1]*n versus weights=[1.0]*n.

3) Raise an Exception or Warning when loss of precision is likely to occur.

    Something like `if n > 2.0 ** BPF: raise OverflowError(...)`.
    All smaller integers are exactly representable, but that doesn't
    quite eliminate the bias (see below).
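As a sanity check on option 1's workaround (a sketch, not the stdlib's code): indexing with randrange() uses integer arithmetic throughout, so odd indices reappear even for populations larger than 2**53.

```python
import random

seq = range(2**54)
k = 1000
# Option 1's workaround: index with randrange() instead of
# floor(random() * n), so no float rounding is involved.
sample = [seq[random.randrange(len(seq))] for _ in range(k)]
# With overwhelming probability some draws are odd
# (P(all 1000 draws even) == 2**-1000).
assert any(x % 2 == 1 for x in sample)
```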


-----------------------

There is bias in random.choices() even when n <= 2**53:

>>> Counter(val % 3 for val in
...         choices(range(2**53 - 2**51), k=1_000_000))
Counter({1: 374976, 0: 333613, 2: 291411})

>>> Counter(val % 3 for val in
...         choices(range(2**53 - 2**51), k=1_000_000)
...         if val < 2**51)
Counter({0: 166669, 1: 83664, 2: 83293})


For some explanation, imagine floats have
only 3 bits of precision. Then random.choices() would do something
like this on a sequence of length 7:

# Random 3-significant-bit floats
def random():
    return randrange(8) / 8

# selecting from a pool of size 2**BPF - 1
>>> Counter(floor(random() * 7) for _ in range(1_000_000))
Counter({0: 249566, 5: 125388, 6: 125251, 2: 125202, 1: 125149, 3: 124994, 4: 124450})

Issue: both 0/8 and 7/8 fall in [0, 1).

Similar biases:

>>> Counter(floor(randrange(16)/16*15) for _ in range(1_000_000))
Counter({0: 124549, 5: 62812, 14: 62807, 6: 62766, 10: 62740, 3: 62716, 12: 62658, 13: 62649, 9: 62473, 8: 62376, 2: 62373, 4: 62346, 11: 62293, 1: 62278, 7: 62164})

>>> Counter(floor(randrange(16)/16*14) for _ in range(1_000_000))
Counter({7: 125505, 0: 124962, 11: 62767, 9: 62728, 6: 62692, 10: 62684, 4: 62625, 3: 62465, 12: 62428, 13: 62397, 2: 62332, 8: 62212, 5: 62176, 1: 62027})

>>> def bias(bits, n):
...     c = Counter(floor(randrange(2**bits)/(2**bits) * n)
...                 for _ in range(1_000_000))
...     m = min(c.values())
...     preferred = [k for k, v in c.items() if v / m > 1.5]
...     preferred.sort()
...     return preferred

>>> bias(bits=4, n=15)
[0]
>>> bias(bits=4, n=14)
[0, 7]
>>> bias(bits=8, n=250)
[0, 41, 83, 125, 166, 208]

# Observed pattern for the indices that receive extra probability;
# I have no proof that this holds universally
>>> [250 * i // (2**8 - 250) for i in range(6)]
[0, 41, 83, 125, 166, 208]
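The observation can be checked exactly, without sampling, by pushing every possible randrange(2**bits) outcome through the same float mapping. A sketch (`overrepresented` is a hypothetical helper, not part of the issue's patch):

```python
from collections import Counter
from math import floor

def overrepresented(bits, n):
    # Enumerate all 2**bits equally likely outcomes and count how many
    # map to each index; indices hit more often than the minimum are
    # the biased "bins".
    counts = Counter(floor(j / 2**bits * n) for j in range(2**bits))
    m = min(counts.values())
    return sorted(k for k, v in counts.items() if v > m)

assert overrepresented(4, 15) == [0]
assert overrepresented(4, 14) == [0, 7]
# Agrees with the conjectured formula for bits=8, n=250:
assert overrepresented(8, 250) == [250 * i // (2**8 - 250) for i in range(6)]
```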

If using the OverflowError approach, I think using a threshold of 2**BPF/2
would only make some "bins" (indices) have 1.5x the probability of
others rather than 2x, and thus would not remove the problem either.
msg393312 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-05-09 09:46
This is similar to #9025 (for randrange), and in principle the random.choices case without explicit weights could be fixed similarly. But I'm not sure it's worth it, for a couple of reasons:

- The problem only really affects "virtual" population sequences like ``range``, where the actual population isn't realised in memory. For populations in list form, in practice the population size will be limited to a few billion, and I think you'd be hard pushed to demonstrate statistically detectable bias at that size. (That is, my claim is that it would be essentially impossible to distinguish the existing "random.choices" implementation from a "fixed" unbiased version based on inputs and outputs alone.)

- The speed of random.choices is a particular selling point - we don't have anything else in the random module that lets you generate millions of variates at once. I'd expect replacing `floor(random() * n)` with `self._randbelow(n)` to cause a significant speed regression.

- For explicit weights, there isn't a practical route to handling those that doesn't involve use of floating-point in general, so for those we have to accept some tiny bias anyway. (Yes, we could technically do something unbiased for the case of integer-only weights, but that seems like significant work for a special case that I'd expect just doesn't matter that much in practice.) And again in this case, the weights list will usually be a "real" list, so we're again limited to populations of a few billion and it'll be hard to demonstrate statistically detectable bias.

@Dennis: for interests sake, do you have bandwidth to do some timings of the current "[population[floor(random() * n)] for i in _repeat(None, k)]" versus something like "[population[_randbelow(n)] for i in _repeat(None, k)]" (after putting "_randbelow = self._randbelow" to save an attribute lookup on each iteration)?
msg393327 - (view) Author: Dennis Sweeney (Dennis Sweeney) * (Python committer) Date: 2021-05-09 17:06
Your suspicion looks correct; random() is faster:

.\python.bat -m pyperf timeit -s "from random import choices" "choices(range(100), k=10_000)"

Before int_choices.diff: Mean +- std dev: 1.49 ms +- 0.09 ms
After int_choices.diff:  Mean +- std dev: 3.50 ms +- 0.33 ms
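A rough version of the same comparison can be run on any build with timeit. This only mirrors the two inner expressions, not the actual int_choices.diff patch, and absolute numbers will vary by machine:

```python
import timeit
from math import floor
from random import random, randrange

n, k = 100, 10_000

# Current approach: scale a float and truncate.
t_float = timeit.timeit(
    "[floor(random() * n) for _ in range(k)]",
    globals=globals(), number=20)

# Integer-arithmetic alternative.
t_int = timeit.timeit(
    "[randrange(n) for _ in range(k)]",
    globals=globals(), number=20)

print(f"float path: {t_float:.3f}s  int path: {t_int:.3f}s")
```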
msg393329 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-05-09 17:58
This is known and an intentional design decision.  It isn't just a speed issue.  Because the weights can be floats, we have floats involved at the outset and some round-off is unavoidable.  To keep the method internally consistent, the same technique is used even when the weights aren't specified:

    >>> from random import choices, seed
    >>> seed(8675309**3)
    >>> s = choices('abcdefg', k=20)
    >>> seed(8675309**3)
    >>> t = choices('abcdefg', [0.7] * 7, k=20)
    >>> s == t
    True

FWIW, this is documented: 
"""
For a given seed, the choices() function with equal weighting typically produces a different sequence than repeated calls to choice(). The algorithm used by choices() uses floating point arithmetic for internal consistency and speed. The algorithm used by choice() defaults to integer arithmetic with repeated selections to avoid small biases from round-off error.
"""
msg393333 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-05-09 18:58
FWIW, the principal use case that choices() was designed for is resampling/bootstrapping.  In that use case, speed matters and small imbalances in large sequences don't matter at all.  Also, the API was designed to make it easier to select from an itemized population of individuals than from a large range followed by a modulo calculation (we already have randrange() to meet that need).

I could add a sentence to the last paragraph recommending "[choice(pop) for i in range(k)]" for the case where 1) the population is large, 2) speed doesn't matter, 3) the weights are equal, and 4) integer math is desired to avoid any trace of bias.  That said, I don't think users would benefit from it; this is largely a theoretical concern that doesn't warrant more than the existing paragraph.
History
Date User Action Args
2022-04-11 14:59:45 admin set github: 88246
2021-05-09 18:58:59 rhettinger set messages: + msg393333
2021-05-09 17:58:19 rhettinger set status: open -> closed
messages: + msg393329
assignee: rhettinger
resolution: not a bug
stage: resolved
2021-05-09 17:06:46 Dennis Sweeney set files: + int_choices.diff
keywords: + patch
messages: + msg393327
2021-05-09 09:46:03 mark.dickinson set messages: + msg393312
2021-05-09 07:00:46 mark.dickinson set nosy: + mark.dickinson
2021-05-08 21:10:37 Dennis Sweeney create