classification
Title: random.sample should support iterators
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.9
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: mark.dickinson, rhettinger, serhiy.storchaka, thomasahle
Priority: normal Keywords:

Created on 2019-07-25 17:03 by thomasahle, last changed 2019-07-26 09:25 by rhettinger. This issue is now closed.

Messages (6)
msg348445 - (view) Author: Thomas Dybdahl Ahle (thomasahle) * Date: 2019-07-25 17:03
Given a generator `f()` we can use `random.sample(list(f()), 10)` to get a uniform sample of the values generated.
This is fine, and fast, as long as `list(f())` easily fits in memory.
However, if it doesn't, one has to implement the reservoir sampling algorithm as a pure python function, which is much slower, and not so easy.

It seems that having a fast reservoir sampling implementation in `random.sample` to use for iterators would be both useful and make the API more predictable.

Currently when passing an iterator `random.sample` throws `TypeError: Population must be a sequence or set.`.
This is inconsistent with most of the standard library which accepts lists and iterators transparently.

I apologize if this enhancement has already been discussed.
I wasn't able to find it.
If wanted, I can write up a pull request.
I believe questions like this: https://stackoverflow.com/questions/12581437/python-random-sample-with-a-generator-iterable-iterator makes it clear that such functionality is wanted and non-obvious.
msg348446 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-07-25 17:35
FYI, random.sample() (as most of other functions in the random module) is implemented in pure Python.
msg348447 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-07-25 17:50
Possible implementation:

from itertools import islice as _islice

def reservoir_sample(self, population, k):
    if k < 0:
        raise ValueError("Sample is negative")
    it = iter(population)
    result = list(_islice(it, k))
    if len(result) < k:
        raise ValueError("Sample larger than population")
    self.shuffle(result)
    randbelow = self._randbelow
    for i, x in enumerate(it, k+1):
        j = randbelow(i)
        if j < k:
            result[j] = x
    return result
msg348466 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-07-26 06:09
ISTM that if a generator produces so much data that it is infeasible to fit in memory, then it will also take a long time to loop over it and generate a random value for each entry.  

FWIW, every time we've looked at reservoir sampling it has been less performant than what we have now.  The calls to randbelow() are the slowest part, so doing more calls makes the overall performance worse.  Also, doing more calls eats more entropy.  

In general, it is okay for functions to accept only sequences if they exploit indexing in some way.  For example, the current approach works great with sample(range(100_000_000_000), k=50).  We really don't have to make everything accept all iterators.  Besides, it is trivially easy to call list() if needed.

Overall, I'm -1 on redesigning the sampling algorithm to accommodate non-sequence iterators.  AFAICT, it isn't important at all and as Serhiy pointed out, writing your own reservoir sampling is easy do.  Lastly, the standard library doesn't try to be all things to all people, it is okay to leave many things for external packages -- we mostly provide a baseline of tools that cover common use cases and defer the rest to the Python ecosystem.
msg348474 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-07-26 07:22
> ISTM that if a generator produces so much data that it is infeasible to fit in memory, then it will also take a long time to loop over it and generate a random value for each entry.

Good point!

$ ./python -m timeit -s 'from random import sample as s' 's(range(10**6), 50)'
10000 loops, best of 5: 25.6 usec per loop
$ ./python -m timeit -s 'from random import sample as s' 's(list(range(10**6)), 50)'
10 loops, best of 5: 31.5 msec per loop
$ ./python -m timeit -s 'from random import reservoir_sample as s' 's(range(10**6), 50)'
1 loop, best of 5: 328 msec per loop

$ ./python -m timeit -s 'from random import sample as s' 's(range(10**8), 50)'
10000 loops, best of 5: 26.9 usec per loop
$ ./python -m timeit -s 'from random import sample as s' 's(list(range(10**8)), 50)'
1 loop, best of 5: 3.41 sec per loop
$ ./python -m timeit -s 'from random import reservoir_sample as s' 's(range(10**8), 50)'
1 loop, best of 5: 36.5 sec per loop

It is possible that a generator produces not so much data, but every item takes much memory so the total size does not fit in memory. But I suppose that the generation time of larger items will be proportionally larger, so reservoir_sample() will be just as slow.
msg348486 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-07-26 09:25
Thomas, thank you for the suggestion but I think we should decline.  The API for sample() is consistent with choice(), choices() and shuffle().  For the most part the sequence based API has worked out well.
History
Date User Action Args
2019-07-26 09:25:43rhettingersetstatus: open -> closed
resolution: rejected
messages: + msg348486

stage: resolved
2019-07-26 07:22:58serhiy.storchakasetmessages: + msg348474
2019-07-26 06:09:21rhettingersetassignee: rhettinger
messages: + msg348466
versions: - Python 2.7, Python 3.5, Python 3.6, Python 3.7, Python 3.8
2019-07-25 17:50:06serhiy.storchakasetmessages: + msg348447
2019-07-25 17:35:03serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg348446
2019-07-25 17:14:47mentalsetnosy: + rhettinger, mark.dickinson
2019-07-25 17:03:28thomasahlecreate