Issue 37682: random.sample should support iterators

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/81863

classification

Title:	random.sample should support iterators
Type:	enhancement	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.9

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:	rhettinger	Nosy List:	mark.dickinson, rhettinger, serhiy.storchaka, thomasahle
Priority:	normal	Keywords:

Created on 2019-07-25 17:03 by thomasahle, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (6)
msg348445 - (view)	Author: Thomas Dybdahl Ahle (thomasahle) *	Date: 2019-07-25 17:03
Given a generator `f()` we can use `random.sample(list(f()), 10)` to get a uniform sample of the values generated. This is fine, and fast, as long as `list(f())` easily fits in memory. However, if it doesn't, one has to implement the reservoir sampling algorithm as a pure python function, which is much slower, and not so easy. It seems that having a fast reservoir sampling implementation in `random.sample` to use for iterators would be both useful and make the API more predictable. Currently when passing an iterator `random.sample` throws `TypeError: Population must be a sequence or set.`. This is inconsistent with most of the standard library which accepts lists and iterators transparently. I apologize if this enhancement has already been discussed. I wasn't able to find it. If wanted, I can write up a pull request. I believe questions like this: https://stackoverflow.com/questions/12581437/python-random-sample-with-a-generator-iterable-iterator makes it clear that such functionality is wanted and non-obvious.
msg348446 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2019-07-25 17:35
FYI, random.sample() (as most of other functions in the random module) is implemented in pure Python.
msg348447 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2019-07-25 17:50
Possible implementation: from itertools import islice as _islice def reservoir_sample(self, population, k): if k < 0: raise ValueError("Sample is negative") it = iter(population) result = list(_islice(it, k)) if len(result) < k: raise ValueError("Sample larger than population") self.shuffle(result) randbelow = self._randbelow for i, x in enumerate(it, k+1): j = randbelow(i) if j < k: result[j] = x return result
msg348466 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2019-07-26 06:09
ISTM that if a generator produces so much data that it is infeasible to fit in memory, then it will also take a long time to loop over it and generate a random value for each entry. FWIW, every time we've looked at reservoir sampling it has been less performant than what we have now. The calls to randbelow() are the slowest part, so doing more calls makes the overall performance worse. Also, doing more calls eats more entropy. In general, it is okay for functions to accept only sequences if they exploit indexing in some way. For example, the current approach works great with sample(range(100_000_000_000), k=50). We really don't have to make everything accept all iterators. Besides, it is trivially easy to call list() if needed. Overall, I'm -1 on redesigning the sampling algorithm to accommodate non-sequence iterators. AFAICT, it isn't important at all and as Serhiy pointed out, writing your own reservoir sampling is easy do. Lastly, the standard library doesn't try to be all things to all people, it is okay to leave many things for external packages -- we mostly provide a baseline of tools that cover common use cases and defer the rest to the Python ecosystem.
msg348474 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2019-07-26 07:22
> ISTM that if a generator produces so much data that it is infeasible to fit in memory, then it will also take a long time to loop over it and generate a random value for each entry. Good point! $ ./python -m timeit -s 'from random import sample as s' 's(range(106), 50)' 10000 loops, best of 5: 25.6 usec per loop $ ./python -m timeit -s 'from random import sample as s' 's(list(range(106)), 50)' 10 loops, best of 5: 31.5 msec per loop $ ./python -m timeit -s 'from random import reservoir_sample as s' 's(range(106), 50)' 1 loop, best of 5: 328 msec per loop $ ./python -m timeit -s 'from random import sample as s' 's(range(108), 50)' 10000 loops, best of 5: 26.9 usec per loop $ ./python -m timeit -s 'from random import sample as s' 's(list(range(108)), 50)' 1 loop, best of 5: 3.41 sec per loop $ ./python -m timeit -s 'from random import reservoir_sample as s' 's(range(108), 50)' 1 loop, best of 5: 36.5 sec per loop It is possible that a generator produces not so much data, but every item takes much memory so the total size does not fit in memory. But I suppose that the generation time of larger items will be proportionally larger, so reservoir_sample() will be just as slow.
msg348486 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2019-07-26 09:25
Thomas, thank you for the suggestion but I think we should decline. The API for sample() is consistent with choice(), choices() and shuffle(). For the most part the sequence based API has worked out well.

History
Date	User	Action	Args
2022-04-11 14:59:18	admin	set	github: 81863
2019-07-26 09:25:43	rhettinger	set	status: open -> closed resolution: rejected messages: + msg348486 stage: resolved
2019-07-26 07:22:58	serhiy.storchaka	set	messages: + msg348474
2019-07-26 06:09:21	rhettinger	set	assignee: rhettinger messages: + msg348466 versions: - Python 2.7, Python 3.5, Python 3.6, Python 3.7, Python 3.8
2019-07-25 17:50:06	serhiy.storchaka	set	messages: + msg348447
2019-07-25 17:35:03	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg348446
2019-07-25 17:14:47	mental	set	nosy: + rhettinger, mark.dickinson
2019-07-25 17:03:28	thomasahle	create