Author rhettinger
Recipients mark.dickinson, rhettinger, serhiy.storchaka, thomasahle
Date 2019-07-26.06:09:20
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1564121361.06.0.570868393467.issue37682@roundup.psfhosted.org>
In-reply-to
Content
ISTM that if a generator produces so much data that it is infeasible to fit in memory, then it will also take a long time to loop over it and generate a random value for each entry.  

FWIW, every time we've looked at reservoir sampling it has been less performant than what we have now.  The calls to randbelow() are the slowest part, so doing more calls makes the overall performance worse.  Also, doing more calls eats more entropy.  

In general, it is okay for functions to accept only sequences if they exploit indexing in some way.  For example, the current approach works great with sample(range(100_000_000_000), k=50).  We really don't have to make everything accept all iterators.  Besides, it is trivially easy to call list() if needed.

Overall, I'm -1 on redesigning the sampling algorithm to accommodate non-sequence iterators.  AFAICT, it isn't important at all and as Serhiy pointed out, writing your own reservoir sampling is easy do.  Lastly, the standard library doesn't try to be all things to all people, it is okay to leave many things for external packages -- we mostly provide a baseline of tools that cover common use cases and defer the rest to the Python ecosystem.
History
Date User Action Args
2019-07-26 06:09:21rhettingersetrecipients: + rhettinger, mark.dickinson, thomasahle, serhiy.storchaka
2019-07-26 06:09:21rhettingersetmessageid: <1564121361.06.0.570868393467.issue37682@roundup.psfhosted.org>
2019-07-26 06:09:21rhettingerlinkissue37682 messages
2019-07-26 06:09:20rhettingercreate