Message 373861 - Python tracker

Author	rhettinger
Recipients	mark.dickinson, oscarbenjamin, rhettinger, tim.peters
Date	2020-07-17.23:34:03
Message-id	<1595028843.73.0.0763171261535.issue41311@roundup.psfhosted.org>
In-reply-to

Content
I've put more thought into the proposal and am going to recommend against it.

At its heart, this is a CPython optimization to take advantage of list() being slower than a handful of islice() calls. It also gains a speed benefit by dropping the antibias logic because random() is faster than _randbelow(). IMO, this doesn't warrant an API extension. I'm not looking forward to years of teaching people that there are two separate APIs for sampling without replacement and that the second one is almost never what they should use.
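
For readers unfamiliar with the antibias point, here is a rough sketch of the two ways to pick an index. It is an illustration only, not CPython's source: the helper names index_via_float() and index_via_rejection() are hypothetical, and the rejection loop is just my approximation of the kind of work _randbelow() does.

```python
import random


def index_via_float(n, rng=random):
    """Pick an index in range(n) by scaling a single float.

    Fast, but random() only produces multiples of 2**-53, so when n does
    not divide 2**53 evenly some indices come up marginally more often.
    """
    return int(rng.random() * n)


def index_via_rejection(n, rng=random):
    """Pick an index in range(n) by rejection sampling (hypothetical helper).

    Draw just enough random bits to cover n and retry whenever the draw
    falls outside range(n). Slower, but every index is equally likely.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    k = n.bit_length()
    r = rng.getrandbits(k)
    while r >= n:
        r = rng.getrandbits(k)
    return r
```

The float version makes one call per index, while the rejection version may loop, which is where the speed difference comes from.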

A few years ago, GvR rejected adding a pre-sizing argument to dicts even though there were some cases where it gave improved performance. His rationale was that it was challenging for a user to know when they were better off and when they weren't. It added a new complication that easily led to suboptimal choices. IMO, this new API puts the users in a similar situation. There are a number of cases where a person is worse off, sometimes much worse off.

This new code runs O(n) instead of O(k). It eats more entropy. It loses the antibias protections. The API makes it less explicit that the entire input iterable is consumed. It can only be beneficial if the input is not a sequence. When k gets bigger, the repeated calls to islice() become more expensive than a single call to list(). And given the math library functions involved, I'm not even sure that this code can guarantee it gives the same results across platforms.
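
To make these costs concrete, here is a minimal sketch of reservoir sampling in the style of "Algorithm L". It is not the code attached to the issue: the name reservoir_sample(), its signature, and the guard clauses are assumptions made for illustration.

```python
import itertools
import math
import random
import sys


def reservoir_sample(iterable, k, rng=random):
    """Return k items chosen without replacement from any iterable.

    Hypothetical sketch of Algorithm L, not the patch under review.
    """
    it = iter(iterable)
    reservoir = list(itertools.islice(it, k))
    if len(reservoir) < k:
        raise ValueError("sample larger than population")
    if k == 0:
        return reservoir
    # 1.0 - random() lies in (0.0, 1.0], so log() never sees zero.
    w = math.exp(math.log(1.0 - rng.random()) / k)
    while True:
        if w >= 1.0:
            skip = 0  # guard against rounding: every item is accepted
        else:
            # Geometric skip length, computed with libm log functions.
            skip = math.floor(math.log(1.0 - rng.random()) / math.log1p(-w))
            skip = min(skip, sys.maxsize - 1)  # keep islice() arguments valid
        # One islice() call per replacement: drop `skip` items, take the next.
        picked = list(itertools.islice(it, skip, skip + 1))
        if not picked:
            return reservoir  # the input has been consumed to the end, O(n)
        # Float-scaled index, so no antibias protection here.
        reservoir[int(rng.random() * k)] = picked[0]
        w *= math.exp(math.log(1.0 - rng.random()) / k)
```

Every replacement costs several random() calls plus log() and exp(), and the loop only ends once the iterator is exhausted. The existing spelling, random.sample(list(iterable), k), pays for one list() call up front and then does O(k) selection work with the antibias logic intact.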

Even if the user makes a correct initial decision about which API to use, the decision can become invalidated when the population sizes or sample sizes change over time.

Lastly, giving users choices between two substantially similar tools typically makes them worse off. It creates a new burden to learn, remember, and distinguish the two. It's really nice that we currently have just one sample() and that it behaves well across a broad range of cases — you generally get a good result without having to think about it. Presumably, that was the wisdom behind having one-way-to-do-it.
