
Author mark.dickinson
Recipients Dennis Sweeney, mark.dickinson, rhettinger
Date 2021-05-09.09:46:01
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1620553563.14.0.594649128313.issue44080@roundup.psfhosted.org>
In-reply-to
Content
This is similar to #9025 (for randrange), and in principle the random.choices case without explicit weights could be fixed similarly. But I'm not sure it's worth it, for a couple of reasons:
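(For readers unfamiliar with the underlying issue: `random()` returns a multiple of 2**-53, so `floor(random() * n)` maps a finite grid of equally likely values onto `n` indices, and when `n` doesn't divide the grid size evenly, some indices get more grid points than others. A toy model with a deliberately tiny 3-bit "random()" makes the effect visible; the real 53-bit case is the same phenomenon at a much smaller magnitude.)

```python
from math import floor

# Toy model: pretend random() had only 3 bits of resolution, so it returns
# one of the 8 equally likely values k/8 for k in range(8).  Mapping those
# through floor(x * n) with n = 6 gives an uneven distribution: indices
# 0 and 3 each receive two grid points, the rest only one.
n = 6
counts = [0] * n
for k in range(8):
    counts[floor(k / 8 * n)] += 1
print(counts)  # [2, 1, 1, 2, 1, 1] -> indices 0 and 3 are twice as likely
```

With the real 53-bit `random()`, the same unevenness exists whenever n doesn't divide 2**53, just at a far smaller scale.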

- The problem only really affects "virtual" population sequences like ``range``, where the actual population isn't realised in memory. For populations in list form, in practice the population size will be limited to a few billion, and I think you'd be hard pushed to demonstrate statistically detectable bias at that size. (That is, my claim is that it would be essentially impossible to distinguish the existing "random.choices" implementation from a "fixed" unbiased version based on inputs and outputs alone.)
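A rough back-of-envelope calculation (my arithmetic, not from the message above) supports the "hard pushed to detect" claim: with a population of n elements, each index is hit by either floor(2**53 / n) or ceil(2**53 / n) of the 2**53 equally likely grid points, so the relative bias between the most- and least-likely indices is on the order of n / 2**53.

```python
# For a population of "a few billion" elements, the relative bias of
# floor(random() * n) is roughly n / 2**53 -- far too small to detect
# statistically with any feasible number of samples.
n = 4_000_000_000
print(n / 2**53)  # ~4.4e-07
```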

- The speed of random.choices is a particular selling point - we don't have anything else in the random module that lets you generate millions of variates at once. I'd expect replacing `floor(random() * n)` with `self._randbelow(n)` to cause a significant speed regression.

- For explicit weights, there isn't a practical route to handling those that doesn't involve use of floating-point in general, so for those we have to accept some tiny bias anyway. (Yes, we could technically do something unbiased for the case of integer-only weights, but that seems like significant work for a special case that I'd expect just doesn't matter that much in practice.) And again in this case, the weights list will usually be a "real" list, so we're again limited to populations of a few billion and it'll be hard to demonstrate statistically detectable bias.
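For what it's worth, the "unbiased for integer-only weights" idea mentioned above could look something like the sketch below (the function name is illustrative, not part of the random module): draw an exact uniform integer below the total weight and bisect into the cumulative sums, avoiding floating point entirely.

```python
from bisect import bisect
from itertools import accumulate
from random import randrange

def choices_int_weights(population, weights, k=1):
    """Unbiased weighted choice for integer weights (illustrative sketch).

    randrange(total) is exactly uniform, and bisect maps the result
    onto the cumulative weights, so no floating-point rounding enters.
    """
    cum = list(accumulate(weights))  # cumulative integer weights
    total = cum[-1]
    return [population[bisect(cum, randrange(total))] for _ in range(k)]
```

As the message says, the extra bookkeeping probably isn't worth it for a special case whose bias is undetectable in practice.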

@Dennis: for interest's sake, do you have bandwidth to do some timings of the current "[population[floor(random() * n)] for i in _repeat(None, k)]" versus something like "[population[_randbelow(n)] for i in _repeat(None, k)]" (after putting "_randbelow = self._randbelow" to save an attribute lookup on each iteration)?
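A quick measurement sketch along the lines requested (numbers will vary by machine, and note that `Random._randbelow` is a private CPython implementation detail, not a public API):

```python
import timeit
from itertools import repeat as _repeat
from math import floor
from random import Random

rng = Random(12345)
population = range(10**6)   # a "virtual" population, as discussed above
n = len(population)
k = 10**5

def current():
    # mirrors the existing random.choices inner loop
    random = rng.random
    return [population[floor(random() * n)] for i in _repeat(None, k)]

def proposed():
    # the unbiased variant, with the attribute lookup hoisted out
    _randbelow = rng._randbelow
    return [population[_randbelow(n)] for i in _repeat(None, k)]

print("current :", timeit.timeit(current, number=10))
print("proposed:", timeit.timeit(proposed, number=10))
```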
History
Date                 User            Action  Args
2021-05-09 09:46:03  mark.dickinson  set     recipients: + mark.dickinson, rhettinger, Dennis Sweeney
2021-05-09 09:46:03  mark.dickinson  set     messageid: <1620553563.14.0.594649128313.issue44080@roundup.psfhosted.org>
2021-05-09 09:46:03  mark.dickinson  link    issue44080 messages
2021-05-09 09:46:01  mark.dickinson  create