Author oscarbenjamin
Recipients mark.dickinson, oscarbenjamin, rhettinger, tim.peters
Date 2020-07-17.11:41:36
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1594986097.28.0.830565563453.issue41311@roundup.psfhosted.org>
In-reply-to
Content
All good points :)

Here's an implementation with those changes and that shuffles but gives the option to preserve order. It also handles the case W=1.0 which can happen at the first step with probability 1 - (1 - 2**53)**k.

Attempting to preserve order makes the storage requirements expected O(k*log(k)) rather than deterministic O(k) but note that the log(k) part just refers to the values list growing larger with references to None: only k of the items from iterable are stored at any time. This can be simplified by removing the option to preserve order which would also make it faster in the small-iterable case.

There are a few timings below for choosing from a dict vs converting to a list and using sample (I don't have a 3.9 build immediately available to use choices). Note that these benchmarks are not the primary motivation for sample_iter though which is the case where the underlying iterable is much more expensive in memory and/or time and where the length is not known ahead of time.



from math import exp, log, log1p, floor
from random import random, randrange, shuffle as _shuffle
from itertools import islice


def sample_iter(iterable, k=1, shuffle=True):
    """Choose a sample of k items from iterable

    shuffle=True (default) gives the items in random order
    shuffle=False preserves the original ordering of the items
    """
    iterator = iter(iterable)
    values = list(islice(iterator, k))

    irange = range(len(values))
    indices = dict(zip(irange, irange))

    kinv = 1 / k
    W = 1.0
    while True:
        W *= random() ** kinv
        # random() < 1.0 but random() ** kinv might not be
        # W == 1.0 implies "infinite" skips
        if W == 1.0:
            break
        # skip is geometrically distributed with parameter W
        skip = floor( log(random())/log1p(-W) )
        try:
            newval = next(islice(iterator, skip, skip+1))
        except StopIteration:
            break
        # Append new, replace old with dummy, and keep track of order
        remove_index = randrange(k)
        values[indices[remove_index]] = None
        indices[remove_index] = len(values)
        values.append(newval)

    values = [values[indices[i]] for i in irange]

    if shuffle:
        _shuffle(values)

    return values


Timings for a large dict (1,000,000 items):

In [8]: n = 6                                                                                                                                  

In [9]: d = dict(zip(range(10**n), range(10**n)))                                                                                              

In [10]: %timeit sample_iter(d, 10)                                                                                                            
16.1 ms ± 363 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: %timeit sample(list(d), 10)                                                                                                           
26.3 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Timings for a small dict (5 items):

In [14]: d2 = dict(zip(range(5), range(5)))                                                                                                    

In [15]: %timeit sample_iter(d2, 2)                                                                                                            
14.8 µs ± 539 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [16]: %timeit sample(list(d2), 2)                                                                                                           
6.27 µs ± 457 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


The crossover point for this benchmark is around 10,000 items with k=2. Profiling at 10,000 items with k=2 shows that in either case the time is dominated by list/next so the time difference is just about how efficiently we can iterate vs build the list. For small dicts it is probably possible to get a significant factor speed up by removing the no shuffle option and simplifying the routine.


> Although why it keeps taking k'th roots remains a mystery to me ;-)

Thinking of sample_iter_old, before doing a swap the uvals in our reservoir look like:

  U0 = {u[1], u[2], ... u[k-1], W0}
  W0 = max(V0)

Here u[1] ... u[k-1] are uniform in (0, W0). We find a new u[n] < W0 which we swap in while removing W0 and afterwards we have

  U1 = {u[1], u[2], ... u[k-1], u[k]}
  W1 = max(U1)

Given that U1 is k iid uniform variates in (0, W0) we have that

  W1 = W0 * max(random() for _ in range(k)) = W0 * W'

Here W' has cdf x**k and so by the inverse sampling method we can generate it as random()**(1/k). That gives the update rule for sample_iter:

  W *= random() ** (1/k)
History
Date User Action Args
2020-07-17 11:41:37oscarbenjaminsetrecipients: + oscarbenjamin, tim.peters, rhettinger, mark.dickinson
2020-07-17 11:41:37oscarbenjaminsetmessageid: <1594986097.28.0.830565563453.issue41311@roundup.psfhosted.org>
2020-07-17 11:41:37oscarbenjaminlinkissue41311 messages
2020-07-17 11:41:36oscarbenjamincreate