classification
Title: Clarify handling of infinity and nan in random.choices
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.10
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: cool-RR, rhettinger
Priority: normal Keywords: patch

Created on 2020-09-12 15:31 by cool-RR, last changed 2020-09-29 01:33 by rhettinger. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 22441 merged cool-RR, 2020-09-28 20:27
Messages (13)
msg376802 - (view) Author: Ram Rachum (cool-RR) * Date: 2020-09-12 15:31
I was recently tripped up by a bug caused by passing infinite weights to random.choices. I toyed around with that function, and it seemed that when it's given weights that include infinity or NaN, it selects a specific element, always without being random. That's regardless of whether multiple items have infinity weights. It chooses a different specific item if the infinity is negative. The specific item isn't always the one that has the infinite weight.

I don't know whether that's the intended behavior for some reason, or whether it's even possible to define a logical behavior for infinite weights. If it's not possible, then maybe this function should raise an errors when given weights that include infinity or nan.

I see that the documentation states that the weights must be non-negative; maybe we should have a check for that too.
msg376812 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-09-12 21:14
Sorry, I don't see the point in cluttering the documentation or over-specifying the function for these non-sensensical toy experiments.  AFAICT, no actual problem exists in practice.  

Also, we don't want to slow down the function by running scans for infinities or NaNs.  That would make the function worse for all users and better for almost no one.
msg376816 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-09-13 01:17
In general NaNs wreak havoc on any function that uses comparisons:

    >>> max(10, nan)
    10
    >>> max(nan, 10)
    nan

It isn't the responsibility of the functions to test for NaNs.  Likewise, it would not make sense to document the NaN behavior in every function that uses a comparison.  That would clutter the docs and it puts the responsibility in the wrong place.

It is the NaN itself that is responsible for the behavior you're seeing.  It is up to the NaN documentation to discuss its effects on downstream code.  For example, sorting isn't consistent when NaNs are present:

    >>> sorted([10, nan, 5])
    [10, nan, 5]
    >>> sorted([5, nan, 10])
[   5, nan, 10]

Also, a NaN is just one of many possible objects that create weird downstream effects.  For examples, sets have a partial ordering and would also create odd results with sorted(), bisect(), max(), min(), etc.

The situation with Infinity is similar.  It is a special object that has the unusual property that inf+1==inf, so bisection of cumulative sums will give the rightmost infinity.

Consider a population 'ABC' and weights of [5, 7, 2].  We have

    Element   Weight   Cumulative Weight  Range (half-open interval)
    -------   ------   -----------------  --------------------------
      A         5      5                  0.0  <= X < 5.0
      B         7      12                 5.0  <= X < 12.0
      C         2      14                 12.0 <= X < 14.0
              ------
               14

The selector X comes from:  random() * 14.0 which gives a range of: 0.0 <= X < 14.0.

Now consider a population 'ABC' and weights of [5, 0, 2].  We have

    Element   Weight   Cumulative Weight  Range (half-open interval)
    -------   ------   -----------------  --------------------------
      A         5      5                  0.0  <= X < 5.0
      B         0      5                  5.0  <= X < 5.0
      C         2      7                  5.0  <= X < 7.0
              ------
                7

If X is 5.0, we have to choose C because B has a zero selection probabliity.  So have to pick the rightmost (bottommost) range that has 5.0, giving the C.

Now, replace B's weight with float('Inf'):

    Element   Weight   Cumulative Weight  Range (half-open interval)
    -------   ------   -----------------  --------------------------
      A         5      5                  0.0  <= X < 5.0
      B        Inf     Inf                5.0  <= X < Inf
      C         2      Inf                Inf  <= X < Inf
              ------
               Inf

Since Inf+2 is Inf and Inf==Inf, the latter two ranges are undifferentiated.  The selector itself in always Inf because "X = random() * inf" always gives inf.  Using the previous rule, we have to choose the rightmost Inf which is C.  This is in fact what choices() does:

    >>> choices('ABC', [5, float('Inf'), 2])
    ['C']

It isn't an error.  That is the same behavior that bisect() has when searching for a infinite value (or any other object that universally compares larger than anything except itself):

    >>> bisect([10, 20, inf], inf)
    3
    >>> bisect([10, 20, 30], inf)
    3

When searching for infinity, you always get the rightmost insertion point even if the cuts points are finite.  Accordingly, it makes sense that if one of the weights is infinite, then the total infinite, and the selector is infinite, so you always get the rightmost value.
msg376821 - (view) Author: Ram Rachum (cool-RR) * Date: 2020-09-13 04:50
I disagree, especially the bit about slowing down the function, since you're checking the total, not every element. Like the check for negative numbers that you do on line 493. But it's your call.
msg376827 - (view) Author: Ram Rachum (cool-RR) * Date: 2020-09-13 08:24
Speaking of that check, the error message is 'Total of weights must be greater than zero', which is a bit misleading. 'All weights must be positive or zero' would be more accurate. Would you like a PR for that?
msg376828 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-09-13 09:11
> 'All weights must be positive or zero' would be more accurate.

That's not what the test checks.  It literally checks for "total <= 0.0".
msg376829 - (view) Author: Ram Rachum (cool-RR) * Date: 2020-09-13 09:30
Yes, but the docs say that the weights are assumed to be nonnegative. I'm assuming the check is done on total because it'd be expensive to do it on each item. A user reading that error message could understand that it's okay for weights to be negative as long as the total isn't, which as far as I understand isn't true.
msg376831 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-09-13 10:08
>  A user reading that error message could understand that it's 
> okay for weights to be negative as long as the total isn't, 
> which as far as I understand isn't true.

This seems like another made up issue.  One could also argue the opposite case that if the error message were changed, a user could reasonably assume that the function was in fact checking all the weights individually (which it isn't).

The docs already state that the function assumes non-negative inputs, which is in fact what it does.  The function already has reasonable error reporting for common missteps.  Also, it behaves gracefully if someone is nudging weights up and down in a way that goes slightly negative due to rounding.  Beyond that, we're hitting the limits of what it can or should do when fed garbage inputs for ill-posed problems.

It's time for this one die now.  It's already eaten an hour of my development time explaining how infinities and nans flow through functions and explaining what the Python norms are for letting them flow through versus treating them as errors.
msg376832 - (view) Author: Ram Rachum (cool-RR) * Date: 2020-09-13 10:21
Thanks for your time, and I accept your decision here. I'd appreciate it though if you wouldn't use the term "made up issue" on something that happened to me a couple of days ago. Our opposing positions each have their pros and cons, and it makes sense to me that your approach has more pros than mine. Ease up on the "It's time for this one die now" though.
msg377618 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-09-28 20:04
I've discussed this with other developers and now agree that a test should be added.  While the current code's handing of Inf or NaN is correct in a technical sense, it is neither intuitive or useful.

Even though it will have a small time cost for the common case of two weights, we should add a test just after the check for the total being greater than zero:

    if not _isfinite(total):
        raise ValueError('Total of weights must be finite')

Also edit the docs to say:

   Weights are assumed to be non-negative and finite.

Ram, since this was your finding, do you want to make a PR for it (with a test, news entry, and doc edit)?
msg377619 - (view) Author: Ram Rachum (cool-RR) * Date: 2020-09-28 20:11
Sounds good, I'm on it.
msg377650 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-09-29 01:32
New changeset b0dfc7581697f20385813582de7e92ba6ba0105f by Ram Rachum in branch 'master':
bpo-41773: Raise exception for non-finite weights in random.choices().  (GH-22441)
https://github.com/python/cpython/commit/b0dfc7581697f20385813582de7e92ba6ba0105f
msg377651 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-09-29 01:33
Thanks for the PR and for filing the issue.
History
Date User Action Args
2020-09-29 01:33:05rhettingersetstatus: open -> closed
resolution: fixed
messages: + msg377651

stage: patch review -> resolved
2020-09-29 01:32:18rhettingersetmessages: + msg377650
2020-09-28 20:27:15cool-RRsetkeywords: + patch
stage: resolved -> patch review
pull_requests: + pull_request21470
2020-09-28 20:11:36cool-RRsetmessages: + msg377619
2020-09-28 20:04:47rhettingersetstatus: closed -> open
type: behavior -> enhancement
resolution: not a bug -> (no value)
messages: + msg377618
2020-09-13 10:21:23cool-RRsetmessages: + msg376832
2020-09-13 10:08:29rhettingersetmessages: + msg376831
2020-09-13 09:30:30cool-RRsetmessages: + msg376829
2020-09-13 09:11:46rhettingersetmessages: + msg376828
2020-09-13 08:24:23cool-RRsetmessages: + msg376827
2020-09-13 04:50:20cool-RRsetmessages: + msg376821
2020-09-13 01:17:12rhettingersetmessages: + msg376816
2020-09-12 21:14:56rhettingersetstatus: open -> closed
resolution: not a bug
messages: + msg376812

stage: resolved
2020-09-12 15:33:14cool-RRsetnosy: + rhettinger
2020-09-12 15:31:17cool-RRcreate