classification
Title: random.choices ought to check that cumulative weights are in ascending order
Type: enhancement Stage: resolved
Components: Documentation Versions: Python 3.8
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: PaulSFO, mark.dickinson, miss-islington, paul.moore, rhettinger, serhiy.storchaka, steven.daprano
Priority: normal Keywords: patch

Created on 2018-05-14 11:36 by steven.daprano, last changed 2018-05-27 17:58 by rhettinger. This issue is now closed.

Messages (10)
msg316499 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2018-05-14 11:36
As mentioned on the Python-List:

https://mail.python.org/pipermail/python-list/2018-May/733061.html

random.choices() silently returns the wrong values when cumulative weights are not given, i.e. if the user misreads the documentation and provides non-cumulative weights, expecting that cumulative weights will be constructed for them.

I think that the documentation should be more clear, and preferably the choices() function ought to fail early when given invalid weights.
msg316500 - (view) Author: Paul Moore (paul.moore) * (Python committer) Date: 2018-05-14 11:54
Supplying cum_weights allows the code to use bisection to locate the correct value to return. This is O(log n), and is significantly faster for large populations than supplying weights (which need to be totalled for the calculation).

Requiring a pre-check on cum_weights (for example, the obvious check that the sequence is nondecreasing) would add an O(n) step, and so significantly impact performance for that case.
msg316501 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2018-05-14 12:32
On Mon, May 14, 2018 at 11:54:32AM +0000, Paul Moore wrote:

> Requiring a pre-check on cum_weights (for example, the obvious check 
> that the sequence is nondecreasing) would add an O(n) step, and so 
> significantly impact performance for that case.

You may very well be right, but we should at least think about ways to 
mitigate this. After all, it doesn't matter how fast a function is if it 
returns the wrong value.

If an ahead-of-time check is too slow, can we make it just-in-time? 
Perhaps bisect can be made to fail if it finds values in the wrong 
order. That might not detect all out-of-order input (perhaps it only 
checks the values it actually looks at), it might be "good enough" to at 
least catch some bad input.
msg316516 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-05-14 15:01
This will just add an overhead for all correct use of bisect.

We are adult here. If the user want to break bisect or random.choices, let him to do this.
msg316524 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-05-14 15:49
PR 6798 and PR 6806 are related to issue33495. Fixed links on GitHub, but commit messages and NEWS entries can contain wrong references.
msg316637 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-05-15 10:58
I concur with Serhiy.  The proposed check would cripple the intended use of cumulative weights which is provided as a way to avoid the cost of rebuilding the weights on every call.  FWIW, bisect() has a long history of requiring sorted inputs and has a similar possibility of misuse, but it too would cripple its core use case if it were to have to validate that the user input was sorted.

I'm open to a docs change but am dubious that that would have helped the OP on python-ideas.  The docs already have a specific example showing the relationship between weights and cum_weights, as well as indicating that the purpose of cum_weights is to save the work of computing the weights accumulation.  There is an additional example in the recipes section.
msg316727 - (view) Author: Paul Czyzewski (PaulSFO) Date: 2018-05-15 20:56
btw, this was my suggestion. Steven opened the issue on my behalf (I'm new).

1) Documentation change.
 I would suggest that, to this paragraph:
"If neither weights nor cum_weights are specified, selections are made with equal probability. If a weights sequence is supplied, it must be the same length as the population sequence. It is a TypeError to specify both weights and cum_weights."

The following sentence be added:
"A cum_weights sequence, if supplied, must be in strictly-ascending order, else incorrect results will be (silently) returned."

[BTW, in the current documentation, when I read this sentence:
:For example, the relative weights [10, 5, 30, 5] are equivalent to the cumulative weights [10, 15, 45, 50]," it wasn't clear to me that this was the format of the cum_weights *argument*.  I thought that this conversion happened internally.  So, I'd prefer that something more explicit be stated (especially the part about silently giving bad results).]

2) I'm giving up on suggesting a code change.  However, I'll just 
respond that
  a) I believe that the big win of the cum_weights option is for people who already have the sequence in that form, rather than that they will save the O(n) cost of having the list built.  
  b) If I have big list, but also an other-than-tiny k value (eg, k=100), then the time (with the change) would be 400 time the O(log n) plus one times O(n), so this may or may not be significant.
  c) I agree that, if someone did, eg, 400 *separate* calls, each with k=1, the cost would be higher.  This seems unlikely to me but...

thanks
  Paul Czyzewski
msg316728 - (view) Author: Paul Czyzewski (PaulSFO) Date: 2018-05-15 21:03
oops, if "k=400"
msg316729 - (view) Author: Paul Czyzewski (PaulSFO) Date: 2018-05-15 21:16
or, for a minimal doc change, change this sentence:
"For example, the relative weights [10, 5, 30, 5] are equivalent to the cumulative weights [10, 15, 45, 50],"

to:
"For example, the relative call 'weights[10, 5, 30, 5]' is equivalent to the cumulative call 'cum_weights[10, 15, 45, 50]',"
msg317800 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-05-27 17:58
> or, for a minimal doc change, change this sentence:
> "For example, the relative weights [10, 5, 30, 5] are 
> equivalent to the cumulative weights [10, 15, 45, 50],"
>
> to:
> "For example, the relative call 'weights[10, 5, 30, 5]' 
> is equivalent to the cumulative call 'cum_weights[10, 15, 45, 50]',"

Sorry, that doesn't seem like an improvement at all to me.  Adding "call" just makes the sentence read awkwardly.

Also, this week I did some user testing on the existing docs and didn't find a single case of misreading what "cumulative weights" meant.

I'm marking this as closed.  The suggestion was appreciated but adding additional input checks would defeat the entire purpose of the feature.  The user testing suggest that the docs are okay as-is (and there are additional examples in the recipes section below).
History
Date User Action Args
2018-05-27 17:58:56rhettingersetstatus: open -> closed
resolution: rejected
messages: + msg317800

stage: patch review -> resolved
2018-05-15 21:16:11PaulSFOsetmessages: + msg316729
2018-05-15 21:03:02PaulSFOsetmessages: + msg316728
2018-05-15 20:56:58PaulSFOsetnosy: + PaulSFO
messages: + msg316727
2018-05-15 10:58:18rhettingersetassignee: rhettinger
messages: + msg316637
components: + Documentation, - Library (Lib)
2018-05-14 17:36:45serhiy.storchakasetmessages: - msg316540
2018-05-14 17:31:13miss-islingtonsetnosy: + miss-islington
messages: + msg316540
2018-05-14 15:49:54serhiy.storchakasetpull_requests: - pull_request6494
2018-05-14 15:49:48serhiy.storchakasetpull_requests: - pull_request6492
2018-05-14 15:49:02serhiy.storchakasetmessages: + msg316524
2018-05-14 15:41:53miss-islingtonsetpull_requests: + pull_request6494
2018-05-14 15:40:58eric.smithsetpull_requests: + pull_request6492
2018-05-14 15:40:04eric.smithsetpull_requests: - pull_request6490
2018-05-14 15:39:46eric.smithsetpull_requests: - pull_request6483
2018-05-14 15:37:44miss-islingtonsetpull_requests: + pull_request6490
2018-05-14 15:01:40serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg316516
2018-05-14 14:33:49eric.smithsetkeywords: + patch
stage: patch review
pull_requests: + pull_request6483
2018-05-14 12:32:26steven.dapranosetmessages: + msg316501
2018-05-14 11:54:32paul.mooresetnosy: + paul.moore
messages: + msg316500
2018-05-14 11:36:38steven.dapranocreate