Author rhettinger
Recipients mark.dickinson, oscarbenjamin, rhettinger, tim.peters
Date 2020-07-18.04:22:43
Message-id <1595046163.89.0.496719771133.issue41311@roundup.psfhosted.org>
In-reply-to
Content
> This comment suggests that you have missed the general
> motivation for reservoir sampling.

Please don't get personal.  I've devoted a good deal of time to thinking about your proposal.  Tim is also giving it an honest look.  Please devote some time to honestly thinking about what we have to say.

FWIW, this is an area of expertise for me.  I too have been fascinated with reservoir sampling for several decades, have done a good deal of reading on the topic, and routinely performed statistical sampling as part of my job (where it needed to be done in a legally defensible manner).


> The idea of reservoir sampling is that you want to sample from
> an iterator, you only get one chance to iterate over it, and 
> you don't know a priori how many items it will yield.

Several thoughts:

* The need for sampling a generator or one-time stream of data is in the "almost never" category.  Presumably, that is why you don't find it in numpy or Julia.

* The examples you gave involved dicts or sets.  These aren't one-chance examples and we do know the length in advance.

* Whether talking about sets, dicts, generators, or arbitrary iterators, "sample(list(it), k)" would still work.  Both ways still have to consume the entire input before returning.  So really this is just an optimization, one that under some circumstances runs a bit faster, but one that forgoes a number of desirable characteristics of the existing tool.  

* IMO, sample_iter() is hard to use correctly.  In most cases, the users would be worse off than they are now and it would be challenging to communicate clearly under what circumstances they would be marginally better off.
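
To make the comparison in the points above concrete, here is a minimal sketch of the two approaches. The names sample_via_list() and reservoir_sample() are hypothetical and are not the sample_iter() implementation from the proposal. The second function is a plain textbook "Algorithm R" reservoir sampler, shown only for illustration.

```python
import random
from itertools import islice


def sample_via_list(iterable, k, rng=random):
    """Materialize the whole input, then defer to the existing random.sample()."""
    return rng.sample(list(iterable), k)


def reservoir_sample(iterable, k, rng=random):
    """Hypothetical one-pass sampler (Algorithm R) that keeps only k items in memory."""
    it = iter(iterable)
    reservoir = list(islice(it, k))  # the first k items fill the reservoir
    if len(reservoir) < k:
        raise ValueError("sample larger than population")
    # n is the 1-based position of the current item in the stream.
    for n, item in enumerate(it, start=k + 1):
        j = rng.randrange(n)  # uniform integer in [0, n)
        if j < k:  # happens with probability k/n
            reservoir[j] = item
    return reservoir


stream = (x * x for x in range(1000))  # a one-shot generator
print(reservoir_sample(stream, 5))
print(sample_via_list((x * x for x in range(1000)), 5))
```

Both functions read the entire input before returning. The reservoir version saves memory, but its result is not in selection order, so (unlike the list returned by random.sample()) a slice of it is not itself guaranteed to be a valid random sample. That may be one example of the characteristics given up.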

At any rate, my recommendation stands.  This should not be part of the standard library random module API.  Perhaps it could be a recipe or a see-also link.  We really don't have to do this.