Author oscarbenjamin
Recipients oscarbenjamin
Date 2020-07-16.00:10:30
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1594858231.04.0.366290640753.issue41311@roundup.psfhosted.org>
In-reply-to
Content
The random.choice/random.sample functions will only accept a sequence to select from. Can there be a function in the random module for selecting from an arbitrary iterable?

It is possible to make an efficient function that can make random selections from an arbitrary iterable e.g.:


from math import exp, log, floor
from random import random, randrange
from itertools import islice

def sample_iter(iterable, k=1):
    """Select k items uniformly from iterable.

    Returns the whole population if there are k or fewer items
    """
    iterator = iter(iterable)
    values = list(islice(iterator, k))

    W = exp(log(random())/k)
    while True:
        # skip is geometrically distributed
        skip = floor( log(random())/log(1-W) )
        selection = list(islice(iterator, skip, skip+1))
        if selection:
            values[randrange(k)] = selection[0]
            W *= exp(log(random())/k)
        else:
            return values


https://en.wikipedia.org/wiki/Reservoir_sampling#An_optimal_algorithm


This could be used for random sampling from sets/dicts or also to choose something like a random line from a text file. The algorithm needs to fully consume the iterable but does so efficiently using islice. In the case of a dict this is faster than converting to a list and using random.choice:


In [2]: n = 6

In [3]: d = dict(zip(range(10**n), range(10**n)))

In [4]: %timeit sample_iter(d)
15.5 ms ± 326 µs per loop (mean ± std. dev. of 7 runs, 100 loops each

In [5]: %timeit list(d)
26.1 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %timeit sample_iter(d, 2)
15.8 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit sample_iter(d, 20)
17.6 ms ± 2.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: %timeit sample_iter(d, 100)
19.9 ms ± 297 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


This was already discussed on python-ideas:
https://mail.python.org/archives/list/python-ideas@python.org/thread/4OZTRD7FLXXZ6R6RU4BME6DYR3AXHOBD/
History
Date User Action Args
2020-07-16 00:10:31oscarbenjaminsetrecipients: + oscarbenjamin
2020-07-16 00:10:31oscarbenjaminsetmessageid: <1594858231.04.0.366290640753.issue41311@roundup.psfhosted.org>
2020-07-16 00:10:31oscarbenjaminlinkissue41311 messages
2020-07-16 00:10:30oscarbenjamincreate