random.choices ought to check that cumulative weights are in ascending order #77675

stevendaprano · 2018-05-14T11:36:39Z

BPO	33494
Nosy	@rhettinger, @pfmoore, @mdickinson, @stevendaprano, @serhiy-storchaka, @miss-islington

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/rhettinger'
closed_at = <Date 2018-05-27.17:58:56.157>
created_at = <Date 2018-05-14.11:36:38.866>
labels = ['type-feature', '3.8', 'docs']
title = 'random.choices ought to check that cumulative weights are in ascending order'
updated_at = <Date 2018-05-27.17:58:56.156>
user = 'https://github.com/stevendaprano'

bugs.python.org fields:

activity = <Date 2018-05-27.17:58:56.156>
actor = 'rhettinger'
assignee = 'rhettinger'
closed = True
closed_date = <Date 2018-05-27.17:58:56.157>
closer = 'rhettinger'
components = ['Documentation']
creation = <Date 2018-05-14.11:36:38.866>
creator = 'steven.daprano'
dependencies = []
files = []
hgrepos = []
issue_num = 33494
keywords = ['patch']
message_count = 10.0
messages = ['316499', '316500', '316501', '316516', '316524', '316637', '316727', '316728', '316729', '317800']
nosy_count = 7.0
nosy_names = ['rhettinger', 'paul.moore', 'mark.dickinson', 'steven.daprano', 'serhiy.storchaka', 'miss-islington', 'PaulSFO']
pr_nums = []
priority = 'normal'
resolution = 'rejected'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue33494'
versions = ['Python 3.8']

stevendaprano · 2018-05-14T11:36:39Z

As mentioned on the Python-List:

https://mail.python.org/pipermail/python-list/2018-May/733061.html

random.choices() silently returns the wrong values when cumulative weights are not given, i.e. if the user misreads the documentation and provides non-cumulative weights, expecting that cumulative weights will be constructed for them.

I think that the documentation should be more clear, and preferably the choices() function ought to fail early when given invalid weights.

pfmoore · 2018-05-14T11:54:33Z

Supplying cum_weights allows the code to use bisection to locate the correct value to return. This is O(log n), and is significantly faster for large populations than supplying weights (which need to be totalled for the calculation).

Requiring a pre-check on cum_weights (for example, the obvious check that the sequence is nondecreasing) would add an O(n) step, and so significantly impact performance for that case.

stevendaprano · 2018-05-14T12:32:26Z

On Mon, May 14, 2018 at 11:54:32AM +0000, Paul Moore wrote:

Requiring a pre-check on cum_weights (for example, the obvious check
that the sequence is nondecreasing) would add an O(n) step, and so
significantly impact performance for that case.

You may very well be right, but we should at least think about ways to
mitigate this. After all, it doesn't matter how fast a function is if it
returns the wrong value.

If an ahead-of-time check is too slow, can we make it just-in-time?
Perhaps bisect can be made to fail if it finds values in the wrong
order. That might not detect all out-of-order input (perhaps it only
checks the values it actually looks at), it might be "good enough" to at
least catch some bad input.

serhiy-storchaka · 2018-05-14T15:01:40Z

This will just add an overhead for all correct use of bisect.

We are adult here. If the user want to break bisect or random.choices, let him to do this.

serhiy-storchaka · 2018-05-14T15:49:03Z

PR 6798 and PR 6806 are related to bpo-33495. Fixed links on GitHub, but commit messages and NEWS entries can contain wrong references.

rhettinger · 2018-05-15T10:58:19Z

I concur with Serhiy. The proposed check would cripple the intended use of cumulative weights which is provided as a way to avoid the cost of rebuilding the weights on every call. FWIW, bisect() has a long history of requiring sorted inputs and has a similar possibility of misuse, but it too would cripple its core use case if it were to have to validate that the user input was sorted.

I'm open to a docs change but am dubious that that would have helped the OP on python-ideas. The docs already have a specific example showing the relationship between weights and cum_weights, as well as indicating that the purpose of cum_weights is to save the work of computing the weights accumulation. There is an additional example in the recipes section.

paulsfo · 2018-05-15T20:56:58Z

btw, this was my suggestion. Steven opened the issue on my behalf (I'm new).

Documentation change.
I would suggest that, to this paragraph:
"If neither weights nor cum_weights are specified, selections are made with equal probability. If a weights sequence is supplied, it must be the same length as the population sequence. It is a TypeError to specify both weights and cum_weights."

The following sentence be added:
"A cum_weights sequence, if supplied, must be in strictly-ascending order, else incorrect results will be (silently) returned."

[BTW, in the current documentation, when I read this sentence:
:For example, the relative weights [10, 5, 30, 5] are equivalent to the cumulative weights [10, 15, 45, 50]," it wasn't clear to me that this was the format of the cum_weights *argument*. I thought that this conversion happened internally. So, I'd prefer that something more explicit be stated (especially the part about silently giving bad results).]

I'm giving up on suggesting a code change. However, I'll just
respond that
a) I believe that the big win of the cum_weights option is for people who already have the sequence in that form, rather than that they will save the O(n) cost of having the list built.
b) If I have big list, but also an other-than-tiny k value (eg, k=100), then the time (with the change) would be 400 time the O(log n) plus one times O(n), so this may or may not be significant.
c) I agree that, if someone did, eg, 400 *separate* calls, each with k=1, the cost would be higher. This seems unlikely to me but...

thanks
Paul Czyzewski

paulsfo · 2018-05-15T21:03:03Z

oops, if "k=400"

paulsfo · 2018-05-15T21:16:12Z

or, for a minimal doc change, change this sentence:
"For example, the relative weights [10, 5, 30, 5] are equivalent to the cumulative weights [10, 15, 45, 50],"

to:
"For example, the relative call 'weights[10, 5, 30, 5]' is equivalent to the cumulative call 'cum_weights[10, 15, 45, 50]',"

rhettinger · 2018-05-27T17:58:56Z

or, for a minimal doc change, change this sentence:
"For example, the relative weights [10, 5, 30, 5] are
equivalent to the cumulative weights [10, 15, 45, 50],"

to:
"For example, the relative call 'weights[10, 5, 30, 5]'
is equivalent to the cumulative call 'cum_weights[10, 15, 45, 50]',"

Sorry, that doesn't seem like an improvement at all to me. Adding "call" just makes the sentence read awkwardly.

Also, this week I did some user testing on the existing docs and didn't find a single case of misreading what "cumulative weights" meant.

I'm marking this as closed. The suggestion was appreciated but adding additional input checks would defeat the entire purpose of the feature. The user testing suggest that the docs are okay as-is (and there are additional examples in the recipes section below).

stevendaprano added stdlib Python modules in the Lib dir 3.8 only security fixes type-feature A feature request or enhancement labels May 14, 2018

rhettinger added docs Documentation in the Doc dir and removed stdlib Python modules in the Lib dir labels May 15, 2018

rhettinger self-assigned this May 15, 2018

rhettinger closed this as completed May 27, 2018

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

random.choices ought to check that cumulative weights are in ascending order #77675

random.choices ought to check that cumulative weights are in ascending order #77675

stevendaprano commented May 14, 2018

stevendaprano commented May 14, 2018

pfmoore commented May 14, 2018

stevendaprano commented May 14, 2018

serhiy-storchaka commented May 14, 2018

serhiy-storchaka commented May 14, 2018

rhettinger commented May 15, 2018

paulsfo mannequin commented May 15, 2018

paulsfo mannequin commented May 15, 2018

paulsfo mannequin commented May 15, 2018

rhettinger commented May 27, 2018

random.choices ought to check that cumulative weights are in ascending order #77675

random.choices ought to check that cumulative weights are in ascending order #77675

Comments

stevendaprano commented May 14, 2018

stevendaprano commented May 14, 2018

pfmoore commented May 14, 2018

stevendaprano commented May 14, 2018

serhiy-storchaka commented May 14, 2018

serhiy-storchaka commented May 14, 2018

rhettinger commented May 15, 2018

paulsfo mannequin commented May 15, 2018

paulsfo mannequin commented May 15, 2018

paulsfo mannequin commented May 15, 2018

rhettinger commented May 27, 2018