Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

random.choices ought to check that cumulative weights are in ascending order #77675

Closed
stevendaprano opened this issue May 14, 2018 · 10 comments
Closed
Assignees
Labels
3.8 only security fixes docs Documentation in the Doc dir type-feature A feature request or enhancement

Comments

@stevendaprano
Copy link
Member

BPO 33494
Nosy @rhettinger, @pfmoore, @mdickinson, @stevendaprano, @serhiy-storchaka, @miss-islington

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = 'https://github.com/rhettinger'
closed_at = <Date 2018-05-27.17:58:56.157>
created_at = <Date 2018-05-14.11:36:38.866>
labels = ['type-feature', '3.8', 'docs']
title = 'random.choices ought to check that cumulative weights are in ascending order'
updated_at = <Date 2018-05-27.17:58:56.156>
user = 'https://github.com/stevendaprano'

bugs.python.org fields:

activity = <Date 2018-05-27.17:58:56.156>
actor = 'rhettinger'
assignee = 'rhettinger'
closed = True
closed_date = <Date 2018-05-27.17:58:56.157>
closer = 'rhettinger'
components = ['Documentation']
creation = <Date 2018-05-14.11:36:38.866>
creator = 'steven.daprano'
dependencies = []
files = []
hgrepos = []
issue_num = 33494
keywords = ['patch']
message_count = 10.0
messages = ['316499', '316500', '316501', '316516', '316524', '316637', '316727', '316728', '316729', '317800']
nosy_count = 7.0
nosy_names = ['rhettinger', 'paul.moore', 'mark.dickinson', 'steven.daprano', 'serhiy.storchaka', 'miss-islington', 'PaulSFO']
pr_nums = []
priority = 'normal'
resolution = 'rejected'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue33494'
versions = ['Python 3.8']

@stevendaprano
Copy link
Member Author

As mentioned on the Python-List:

https://mail.python.org/pipermail/python-list/2018-May/733061.html

random.choices() silently returns the wrong values when cumulative weights are not given, i.e. if the user misreads the documentation and provides non-cumulative weights, expecting that cumulative weights will be constructed for them.

I think that the documentation should be more clear, and preferably the choices() function ought to fail early when given invalid weights.

@stevendaprano stevendaprano added stdlib Python modules in the Lib dir 3.8 only security fixes type-feature A feature request or enhancement labels May 14, 2018
@pfmoore
Copy link
Member

pfmoore commented May 14, 2018

Supplying cum_weights allows the code to use bisection to locate the correct value to return. This is O(log n), and is significantly faster for large populations than supplying weights (which need to be totalled for the calculation).

Requiring a pre-check on cum_weights (for example, the obvious check that the sequence is nondecreasing) would add an O(n) step, and so significantly impact performance for that case.

@stevendaprano
Copy link
Member Author

On Mon, May 14, 2018 at 11:54:32AM +0000, Paul Moore wrote:

Requiring a pre-check on cum_weights (for example, the obvious check
that the sequence is nondecreasing) would add an O(n) step, and so
significantly impact performance for that case.

You may very well be right, but we should at least think about ways to
mitigate this. After all, it doesn't matter how fast a function is if it
returns the wrong value.

If an ahead-of-time check is too slow, can we make it just-in-time?
Perhaps bisect can be made to fail if it finds values in the wrong
order. That might not detect all out-of-order input (perhaps it only
checks the values it actually looks at), it might be "good enough" to at
least catch some bad input.

@serhiy-storchaka
Copy link
Member

This will just add an overhead for all correct use of bisect.

We are adult here. If the user want to break bisect or random.choices, let him to do this.

@serhiy-storchaka
Copy link
Member

PR 6798 and PR 6806 are related to bpo-33495. Fixed links on GitHub, but commit messages and NEWS entries can contain wrong references.

@rhettinger
Copy link
Contributor

I concur with Serhiy. The proposed check would cripple the intended use of cumulative weights which is provided as a way to avoid the cost of rebuilding the weights on every call. FWIW, bisect() has a long history of requiring sorted inputs and has a similar possibility of misuse, but it too would cripple its core use case if it were to have to validate that the user input was sorted.

I'm open to a docs change but am dubious that that would have helped the OP on python-ideas. The docs already have a specific example showing the relationship between weights and cum_weights, as well as indicating that the purpose of cum_weights is to save the work of computing the weights accumulation. There is an additional example in the recipes section.

@rhettinger rhettinger added docs Documentation in the Doc dir and removed stdlib Python modules in the Lib dir labels May 15, 2018
@rhettinger rhettinger self-assigned this May 15, 2018
@paulsfo
Copy link
Mannequin

paulsfo mannequin commented May 15, 2018

btw, this was my suggestion. Steven opened the issue on my behalf (I'm new).

  1. Documentation change.
    I would suggest that, to this paragraph:
    "If neither weights nor cum_weights are specified, selections are made with equal probability. If a weights sequence is supplied, it must be the same length as the population sequence. It is a TypeError to specify both weights and cum_weights."

The following sentence be added:
"A cum_weights sequence, if supplied, must be in strictly-ascending order, else incorrect results will be (silently) returned."

[BTW, in the current documentation, when I read this sentence:
:For example, the relative weights [10, 5, 30, 5] are equivalent to the cumulative weights [10, 15, 45, 50]," it wasn't clear to me that this was the format of the cum_weights *argument*. I thought that this conversion happened internally. So, I'd prefer that something more explicit be stated (especially the part about silently giving bad results).]

  1. I'm giving up on suggesting a code change. However, I'll just
    respond that
    a) I believe that the big win of the cum_weights option is for people who already have the sequence in that form, rather than that they will save the O(n) cost of having the list built.
    b) If I have big list, but also an other-than-tiny k value (eg, k=100), then the time (with the change) would be 400 time the O(log n) plus one times O(n), so this may or may not be significant.
    c) I agree that, if someone did, eg, 400 *separate* calls, each with k=1, the cost would be higher. This seems unlikely to me but...

thanks
Paul Czyzewski

@paulsfo
Copy link
Mannequin

paulsfo mannequin commented May 15, 2018

oops, if "k=400"

@paulsfo
Copy link
Mannequin

paulsfo mannequin commented May 15, 2018

or, for a minimal doc change, change this sentence:
"For example, the relative weights [10, 5, 30, 5] are equivalent to the cumulative weights [10, 15, 45, 50],"

to:
"For example, the relative call 'weights[10, 5, 30, 5]' is equivalent to the cumulative call 'cum_weights[10, 15, 45, 50]',"

@rhettinger
Copy link
Contributor

or, for a minimal doc change, change this sentence:
"For example, the relative weights [10, 5, 30, 5] are
equivalent to the cumulative weights [10, 15, 45, 50],"

to:
"For example, the relative call 'weights[10, 5, 30, 5]'
is equivalent to the cumulative call 'cum_weights[10, 15, 45, 50]',"

Sorry, that doesn't seem like an improvement at all to me. Adding "call" just makes the sentence read awkwardly.

Also, this week I did some user testing on the existing docs and didn't find a single case of misreading what "cumulative weights" meant.

I'm marking this as closed. The suggestion was appreciated but adding additional input checks would defeat the entire purpose of the feature. The user testing suggest that the docs are okay as-is (and there are additional examples in the recipes section below).

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.8 only security fixes docs Documentation in the Doc dir type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

4 participants