Issue 33494: random.choices ought to check that cumulative weights are in ascending order

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/77675

classification

Title:	random.choices ought to check that cumulative weights are in ascending order
Type:	enhancement	Stage:	resolved
Components:	Documentation	Versions:	Python 3.8

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:	rhettinger	Nosy List:	PaulSFO, mark.dickinson, miss-islington, paul.moore, rhettinger, serhiy.storchaka, steven.daprano
Priority:	normal	Keywords:	patch

Created on 2018-05-14 11:36 by steven.daprano, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (10)
msg316499 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2018-05-14 11:36
As mentioned on the Python-List: https://mail.python.org/pipermail/python-list/2018-May/733061.html random.choices() silently returns the wrong values when cumulative weights are not given, i.e. if the user misreads the documentation and provides non-cumulative weights, expecting that cumulative weights will be constructed for them. I think that the documentation should be more clear, and preferably the choices() function ought to fail early when given invalid weights.
msg316500 - (view)	Author: Paul Moore (paul.moore) *	Date: 2018-05-14 11:54
Supplying cum_weights allows the code to use bisection to locate the correct value to return. This is O(log n), and is significantly faster for large populations than supplying weights (which need to be totalled for the calculation). Requiring a pre-check on cum_weights (for example, the obvious check that the sequence is nondecreasing) would add an O(n) step, and so significantly impact performance for that case.
msg316501 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2018-05-14 12:32
On Mon, May 14, 2018 at 11:54:32AM +0000, Paul Moore wrote: > Requiring a pre-check on cum_weights (for example, the obvious check > that the sequence is nondecreasing) would add an O(n) step, and so > significantly impact performance for that case. You may very well be right, but we should at least think about ways to mitigate this. After all, it doesn't matter how fast a function is if it returns the wrong value. If an ahead-of-time check is too slow, can we make it just-in-time? Perhaps bisect can be made to fail if it finds values in the wrong order. That might not detect all out-of-order input (perhaps it only checks the values it actually looks at), it might be "good enough" to at least catch some bad input.
msg316516 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2018-05-14 15:01
This will just add an overhead for all correct use of bisect. We are adult here. If the user want to break bisect or random.choices, let him to do this.
msg316524 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2018-05-14 15:49
PR 6798 and PR 6806 are related to issue33495. Fixed links on GitHub, but commit messages and NEWS entries can contain wrong references.
msg316637 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2018-05-15 10:58
I concur with Serhiy. The proposed check would cripple the intended use of cumulative weights which is provided as a way to avoid the cost of rebuilding the weights on every call. FWIW, bisect() has a long history of requiring sorted inputs and has a similar possibility of misuse, but it too would cripple its core use case if it were to have to validate that the user input was sorted. I'm open to a docs change but am dubious that that would have helped the OP on python-ideas. The docs already have a specific example showing the relationship between weights and cum_weights, as well as indicating that the purpose of cum_weights is to save the work of computing the weights accumulation. There is an additional example in the recipes section.
msg316727 - (view)	Author: Paul Czyzewski (PaulSFO)	Date: 2018-05-15 20:56
btw, this was my suggestion. Steven opened the issue on my behalf (I'm new). 1) Documentation change. I would suggest that, to this paragraph: "If neither weights nor cum_weights are specified, selections are made with equal probability. If a weights sequence is supplied, it must be the same length as the population sequence. It is a TypeError to specify both weights and cum_weights." The following sentence be added: "A cum_weights sequence, if supplied, must be in strictly-ascending order, else incorrect results will be (silently) returned." [BTW, in the current documentation, when I read this sentence: :For example, the relative weights [10, 5, 30, 5] are equivalent to the cumulative weights [10, 15, 45, 50]," it wasn't clear to me that this was the format of the cum_weights argument. I thought that this conversion happened internally. So, I'd prefer that something more explicit be stated (especially the part about silently giving bad results).] 2) I'm giving up on suggesting a code change. However, I'll just respond that a) I believe that the big win of the cum_weights option is for people who already have the sequence in that form, rather than that they will save the O(n) cost of having the list built. b) If I have big list, but also an other-than-tiny k value (eg, k=100), then the time (with the change) would be 400 time the O(log n) plus one times O(n), so this may or may not be significant. c) I agree that, if someone did, eg, 400 separate calls, each with k=1, the cost would be higher. This seems unlikely to me but... thanks Paul Czyzewski
msg316728 - (view)	Author: Paul Czyzewski (PaulSFO)	Date: 2018-05-15 21:03
oops, if "k=400"
msg316729 - (view)	Author: Paul Czyzewski (PaulSFO)	Date: 2018-05-15 21:16
or, for a minimal doc change, change this sentence: "For example, the relative weights [10, 5, 30, 5] are equivalent to the cumulative weights [10, 15, 45, 50]," to: "For example, the relative call 'weights[10, 5, 30, 5]' is equivalent to the cumulative call 'cum_weights[10, 15, 45, 50]',"
msg317800 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2018-05-27 17:58
> or, for a minimal doc change, change this sentence: > "For example, the relative weights [10, 5, 30, 5] are > equivalent to the cumulative weights [10, 15, 45, 50]," > > to: > "For example, the relative call 'weights[10, 5, 30, 5]' > is equivalent to the cumulative call 'cum_weights[10, 15, 45, 50]'," Sorry, that doesn't seem like an improvement at all to me. Adding "call" just makes the sentence read awkwardly. Also, this week I did some user testing on the existing docs and didn't find a single case of misreading what "cumulative weights" meant. I'm marking this as closed. The suggestion was appreciated but adding additional input checks would defeat the entire purpose of the feature. The user testing suggest that the docs are okay as-is (and there are additional examples in the recipes section below).

History
Date	User	Action	Args
2022-04-11 14:59:00	admin	set	github: 77675
2018-05-27 17:58:56	rhettinger	set	status: open -> closed resolution: rejected messages: + msg317800 stage: patch review -> resolved
2018-05-15 21:16:11	PaulSFO	set	messages: + msg316729
2018-05-15 21:03:02	PaulSFO	set	messages: + msg316728
2018-05-15 20:56:58	PaulSFO	set	nosy: + PaulSFO messages: + msg316727
2018-05-15 10:58:18	rhettinger	set	assignee: rhettinger messages: + msg316637 components: + Documentation, - Library (Lib)
2018-05-14 17:36:45	serhiy.storchaka	set	messages: - msg316540
2018-05-14 17:31:13	miss-islington	set	nosy: + miss-islington messages: + msg316540
2018-05-14 15:49:54	serhiy.storchaka	set	pull_requests: - pull_request6494
2018-05-14 15:49:48	serhiy.storchaka	set	pull_requests: - pull_request6492
2018-05-14 15:49:02	serhiy.storchaka	set	messages: + msg316524
2018-05-14 15:41:53	miss-islington	set	pull_requests: + pull_request6494
2018-05-14 15:40:58	eric.smith	set	pull_requests: + pull_request6492
2018-05-14 15:40:04	eric.smith	set	pull_requests: - pull_request6490
2018-05-14 15:39:46	eric.smith	set	pull_requests: - pull_request6483
2018-05-14 15:37:44	miss-islington	set	pull_requests: + pull_request6490
2018-05-14 15:01:40	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg316516
2018-05-14 14:33:49	eric.smith	set	keywords: + patch stage: patch review pull_requests: + pull_request6483
2018-05-14 12:32:26	steven.daprano	set	messages: + msg316501
2018-05-14 11:54:32	paul.moore	set	nosy: + paul.moore messages: + msg316500
2018-05-14 11:36:38	steven.daprano	create