This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Add optional weights to random.sample()
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: cryvate, dheiberg, mark.dickinson, rhettinger, tim.peters
Priority: normal Keywords: patch

Created on 2020-05-08 19:56 by rhettinger, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL Status Linked Edit
PR 20014 closed rhettinger, 2020-05-08 23:18
Messages (5)
msg368458 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-05-08 19:56
Weighted sampling without replacement isn't easy for a user to do with the current tooling. There is a StackOverflow question about how to do it.  Also, this service is currently offered by numpy.

Use it like this:

    >>> sample(['katniss', 'prim', 'gale', 'peeta'] , weights=[20, 1, 42, 10], k=2)
    ['prim', 'peeta']

Here's an efficient implementation similar to how numpy does it:

--- a/Lib/random.py
+++ b/Lib/random.py
@@ -331,7 +331,7 @@ class Random(_random.Random):
                 j = _int(random() * (i+1))
                 x[i], x[j] = x[j], x[i]

-    def sample(self, population, k, *, counts=None):
+    def sample(self, population, k, *, counts=None, weights=None):
         """Chooses k unique random elements from a population sequence or set.

         Returns a new list containing elements from the population while
@@ -392,6 +392,18 @@ class Random(_random.Random):
         if not isinstance(population, _Sequence):
             raise TypeError("Population must be a sequence.  For dicts or sets, use sorted(d).")
         n = len(population)
+        if weights is not None:
+            if counts is not None:
+                raise TypeError('Cannot specify both counts and weights')
+            weights = list(weights)
+            positions = range(n)
+            indices = []
+            while (needed := k - len(indices)):
+                for i in choices(positions, weights, k=needed):
+                    if weights[i]:
+                        weights[i] = 0.0
+                        indices.append(i)
+            return [population[i] for i in indices]
         if counts is not None:
             cum_counts = list(_accumulate(counts))
             if len(cum_counts) != n:
msg368559 - (view) Author: David Heiberg (dheiberg) * Date: 2020-05-10 01:19
Just playing around with this and I think one thing to consider is how to handle negative weights. Should they even be allowed? One interesting behaviour I encountered with the current draft is the following:

    >>> r.sample(['katniss', 'prim', 'gale', 'peeta'] , weights=[-2,1,1,1], k=1)
    ['peeta']

This will always return ['peeta'], or whatever happens to be in the last index. I believe this is because in `choices`, the `cum_weights` list would be [-2, -1, 0, 1], causing it to only be able to select the last value in the population. 

Looking into random.choices, it suggests that the weights were designed with negative ones in mind. However they were not accounted for and end up influencing the weights of the indexes around it. Another example is this:

    >>> r.choices(['alice', 'bob', 'camile', 'david'], weights=[1, 1, -2, 1])
    ['alice']

This will always return ['alice'], showing that the negative weight of 'camile' is affecting indexes around it. 
Is there something I am missing? Otherwise this seems like it may warrant its own bug.
msg368562 - (view) Author: Henk-Jaap Wagenaar (cryvate) * Date: 2020-05-10 02:39
@rhettinger: I like this proposal.

@dheiberg: to me, it seems negative weights are explicitly not supported, from: https://docs.python.org/3/library/random.html#random.choices

"Weights are assumed to be non-negative."

Unless I am missing something. In my mind negative weights make no sense and the fact it produces garbage in your example is garbage-in-garbage-out.

Maybe an enhancement could be made (but it would be breaking for some abusing the behaviour you showed) for it to error (or do something sane? If such a thing exists) but that should be a separate issue.
msg368565 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-05-10 03:41
Negative weights are undefined for choices() as well.  Just like bisect() and merge() don't verify their inputs are sorted.  Usually, full scan precondition checks are expensive in pure python and aren't worth penalizing correct code. As Henk-Jaap points-out, negative weights are just a special case of meaningless inputs.
msg368573 - (view) Author: David Heiberg (dheiberg) * Date: 2020-05-10 09:06
Ahh I see, thanks for clearing that up!

On Sun, May 10, 2020, 04:41 Raymond Hettinger <report@bugs.python.org>
wrote:

>
> Raymond Hettinger <raymond.hettinger@gmail.com> added the comment:
>
> Negative weights are undefined for choices() as well.  Just like bisect()
> and merge() don't verify their inputs are sorted.  Usually, full scan
> precondition checks are expensive in pure python and aren't worth
> penalizing correct code. As Henk-Jaap points-out, negative weights are just
> a special case of meaningless inputs.
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <https://bugs.python.org/issue40569>
> _______________________________________
>
History
Date User Action Args
2022-04-11 14:59:30adminsetgithub: 84749
2020-05-10 09:06:35dheibergsetmessages: + msg368573
2020-05-10 03:41:56rhettingersetmessages: + msg368565
2020-05-10 02:39:39cryvatesetnosy: + cryvate
messages: + msg368562
2020-05-10 01:19:05dheibergsetnosy: + dheiberg
messages: + msg368559
2020-05-08 23:18:07rhettingersetkeywords: + patch
stage: patch review
pull_requests: + pull_request19325
2020-05-08 19:56:47rhettingercreate