Author serhiy.storchaka
Recipients mark.dickinson, rhettinger, serhiy.storchaka, thomasahle
Date 2019-07-26.07:22:58
Content
> ISTM that if a generator produces so much data that it is infeasible to fit in memory, then it will also take a long time to loop over it and generate a random value for each entry.

Good point!

$ ./python -m timeit -s 'from random import sample as s' 's(range(10**6), 50)'
10000 loops, best of 5: 25.6 usec per loop
$ ./python -m timeit -s 'from random import sample as s' 's(list(range(10**6)), 50)'
10 loops, best of 5: 31.5 msec per loop
$ ./python -m timeit -s 'from random import reservoir_sample as s' 's(range(10**6), 50)'
1 loop, best of 5: 328 msec per loop

$ ./python -m timeit -s 'from random import sample as s' 's(range(10**8), 50)'
10000 loops, best of 5: 26.9 usec per loop
$ ./python -m timeit -s 'from random import sample as s' 's(list(range(10**8)), 50)'
1 loop, best of 5: 3.41 sec per loop
$ ./python -m timeit -s 'from random import reservoir_sample as s' 's(range(10**8), 50)'
1 loop, best of 5: 36.5 sec per loop
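
For context on why the first run in each pair is so fast: given a sequence, random.sample() only needs to draw k positions, so the lazy range is never iterated at all; the middle runs pay mostly for materializing the list inside the timed statement. A minimal sketch of that index-based selection, assuming a rejection strategy similar in spirit to CPython's (sample_by_index is a hypothetical helper, not the stdlib code):

import random

def sample_by_index(seq, k):
    # Draw k distinct positions and index into the sequence.
    # This is O(k) work regardless of len(seq), which is why
    # sampling from a lazy range(10**8) above takes microseconds.
    n = len(seq)
    chosen = set()
    result = []
    while len(result) < k:
        i = random.randrange(n)
        if i not in chosen:        # reject duplicates and redraw
            chosen.add(i)
            result.append(seq[i])
    return result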

It is possible that a generator produces relatively few items, but each item is so large that the total does not fit in memory. But I suppose the time to generate such large items would grow proportionally, so reservoir_sample() would be just as slow.
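
For reference, a minimal one-pass sketch of what reservoir_sample() presumably does (Algorithm R; the actual patch attached to this issue may use a different variant):

import random
from itertools import islice

def reservoir_sample(iterable, k):
    it = iter(iterable)
    # Fill the reservoir with the first k items.
    reservoir = list(islice(it, k))
    if len(reservoir) < k:
        raise ValueError("sample larger than population")
    # Item number i (1-based) replaces a random slot with
    # probability k/i, leaving every item an equal k/n chance.
    for i, item in enumerate(it, k + 1):
        j = random.randrange(i)
        if j < k:
            reservoir[j] = item
    return reservoir

The one RNG call per consumed item is what makes the timings above grow with the population size rather than with k.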