This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: random.choice and random.choices have different distributions
Type: Stage: resolved
Components: Library (Lib) Versions: Python 3.11
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: Dennis Sweeney, Mark.Bell, rhettinger, tim.peters
Priority: normal Keywords:

Created on 2022-03-24 23:56 by Mark.Bell, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (5)
msg415981 - (view) Author: Mark Bell (Mark.Bell) * Date: 2022-03-24 23:56
The docstring for `random.choices` indicates that
```
import random
random.choices(population, k=1)
```
should produce a list containing one item, where each item of `population` has equal likelihood of being selected. However `random.choices` draws elements for its sample by doing `population[floor(random() * len(population)]` and so relies on floating point numbers. Therefore not each item is equally likely to be chosen since floats are not uniformly dense in [0, 1] and this problem becomes worse as `population` becomes larger. 

Note that this issue does not apply to `random.choice(population)` since this uses `random.randint` to choose a random element of `population` and performs exact integer arithmetic. Compare https://github.com/python/cpython/blob/main/Lib/random.py#L371 and https://github.com/python/cpython/blob/main/Lib/random.py#L490

Could `random.choices` fall back to doing `return [choice(population) for _ in _repeat(None, k)]` if no weights are given? Similarly, is it also plausible to only rely on `random.randint` and integer arithmetic if all of the (cumulative) weights given to `random.choices` are integers?
msg415982 - (view) Author: Mark Bell (Mark.Bell) * Date: 2022-03-25 00:15
To give two more consequences of `random.choices` using floating point arithmetic:

1) When doing `random.choices([A, B, C], weights=[2**55, 1, 1])` the cumulative weight to bisect for is selected using `floor(random() * (2**55 + 1 + 1 + 0.0))`. Since this is always even, it is impossible for `B` to be chosen.

2) When doing `random.choices([A, B], weights=[2**54, 1])` although the `cum_weights` is [18014398509481984, 18014398509481985] the `total` used by `random.choices` is `cum_weights[-1] + 0.0`. Due to floating point rounding this is 18014398509481984.0 and so is actually smaller than `cum_weights[-1]`. Therefore it is impossible for `B` to be chosen.
msg415983 - (view) Author: Dennis Sweeney (Dennis Sweeney) * (Python committer) Date: 2022-03-25 01:19
Possible duplicate of https://bugs.python.org/issue44080
msg415985 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2022-03-25 02:53
Definitely a duplicate, and I doubt Mark or Raymond will change their mind.

One observation: while floats are not uniformly dense in [0, 1), random() results are uniformly spaced. Each is of the form I / 2**53 for an integer I in range(2**53).
msg415986 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2022-03-25 02:56
This was an intentional decision.  It is documented and tested.  The rationale is that is keeps choices() internally consistent so that choices with equal weights produces the same result as if no weights are specified.

For anyone who wants an alternative, it is trivial just call choice() in a list comprehension:

   [choice(seq) for i in range(k)]

Another thought is that choices() is more performant than calling choice() in a loop.  This matters because one of the principal use cases for choices() in bootstrapping where choices() can be called many times.

Lastly, we are strongly averse to changing the algorithms after they are published because it impacts reproducible research where people need to be able to recreate their random selections for a given seed.
History
Date User Action Args
2022-04-11 14:59:57adminsetgithub: 91270
2022-03-25 02:56:34rhettingersetassignee: rhettinger
2022-03-25 02:56:22rhettingersetstatus: open -> closed

nosy: + rhettinger
messages: + msg415986

resolution: not a bug
stage: resolved
2022-03-25 02:53:43tim.peterssetnosy: + tim.peters
messages: + msg415985
2022-03-25 01:19:55Dennis Sweeneysetnosy: + Dennis Sweeney
messages: + msg415983
2022-03-25 00:15:11Mark.Bellsetmessages: + msg415982
2022-03-24 23:56:38Mark.Bellcreate