Message 348466 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	rhettinger
Recipients	mark.dickinson, rhettinger, serhiy.storchaka, thomasahle
Date	2019-07-26.06:09:20
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1564121361.06.0.570868393467.issue37682@roundup.psfhosted.org>
In-reply-to

Content
ISTM that if a generator produces so much data that it is infeasible to fit in memory, then it will also take a long time to loop over it and generate a random value for each entry. FWIW, every time we've looked at reservoir sampling it has been less performant than what we have now. The calls to randbelow() are the slowest part, so doing more calls makes the overall performance worse. Also, doing more calls eats more entropy. In general, it is okay for functions to accept only sequences if they exploit indexing in some way. For example, the current approach works great with sample(range(100_000_000_000), k=50). We really don't have to make everything accept all iterators. Besides, it is trivially easy to call list() if needed. Overall, I'm -1 on redesigning the sampling algorithm to accommodate non-sequence iterators. AFAICT, it isn't important at all and as Serhiy pointed out, writing your own reservoir sampling is easy do. Lastly, the standard library doesn't try to be all things to all people, it is okay to leave many things for external packages -- we mostly provide a baseline of tools that cover common use cases and defer the rest to the Python ecosystem.

ISTM that if a generator produces so much data that it is infeasible to fit in memory, then it will also take a long time to loop over it and generate a random value for each entry.  

FWIW, every time we've looked at reservoir sampling it has been less performant than what we have now.  The calls to randbelow() are the slowest part, so doing more calls makes the overall performance worse.  Also, doing more calls eats more entropy.  

In general, it is okay for functions to accept only sequences if they exploit indexing in some way.  For example, the current approach works great with sample(range(100_000_000_000), k=50).  We really don't have to make everything accept all iterators.  Besides, it is trivially easy to call list() if needed.

Overall, I'm -1 on redesigning the sampling algorithm to accommodate non-sequence iterators.  AFAICT, it isn't important at all and as Serhiy pointed out, writing your own reservoir sampling is easy do.  Lastly, the standard library doesn't try to be all things to all people, it is okay to leave many things for external packages -- we mostly provide a baseline of tools that cover common use cases and defer the rest to the Python ecosystem.

History
Date	User	Action	Args
2019-07-26 06:09:21	rhettinger	set	recipients: + rhettinger, mark.dickinson, thomasahle, serhiy.storchaka
2019-07-26 06:09:21	rhettinger	set	messageid: <1564121361.06.0.570868393467.issue37682@roundup.psfhosted.org>
2019-07-26 06:09:21	rhettinger	link	issue37682 messages
2019-07-26 06:09:20	rhettinger	create