Author oscarbenjamin mark.dickinson, oscarbenjamin, rhettinger, tim.peters 2020-07-17.11:41:36 -1.0 Yes <1594986097.28.0.830565563453.issue41311@roundup.psfhosted.org>
Content
```All good points :)

Here's an implementation with those changes and that shuffles but gives the option to preserve order. It also handles the case W=1.0 which can happen at the first step with probability 1 - (1 - 2**53)**k.

Attempting to preserve order makes the storage requirements expected O(k*log(k)) rather than deterministic O(k) but note that the log(k) part just refers to the values list growing larger with references to None: only k of the items from iterable are stored at any time. This can be simplified by removing the option to preserve order which would also make it faster in the small-iterable case.

There are a few timings below for choosing from a dict vs converting to a list and using sample (I don't have a 3.9 build immediately available to use choices). Note that these benchmarks are not the primary motivation for sample_iter though which is the case where the underlying iterable is much more expensive in memory and/or time and where the length is not known ahead of time.

from math import exp, log, log1p, floor
from random import random, randrange, shuffle as _shuffle
from itertools import islice

def sample_iter(iterable, k=1, shuffle=True):
"""Choose a sample of k items from iterable

shuffle=True (default) gives the items in random order
shuffle=False preserves the original ordering of the items
"""
iterator = iter(iterable)
values = list(islice(iterator, k))

irange = range(len(values))
indices = dict(zip(irange, irange))

kinv = 1 / k
W = 1.0
while True:
W *= random() ** kinv
# random() < 1.0 but random() ** kinv might not be
# W == 1.0 implies "infinite" skips
if W == 1.0:
break
# skip is geometrically distributed with parameter W
skip = floor( log(random())/log1p(-W) )
try:
newval = next(islice(iterator, skip, skip+1))
except StopIteration:
break
# Append new, replace old with dummy, and keep track of order
remove_index = randrange(k)
values[indices[remove_index]] = None
indices[remove_index] = len(values)
values.append(newval)

values = [values[indices[i]] for i in irange]

if shuffle:
_shuffle(values)

return values

Timings for a large dict (1,000,000 items):

In [8]: n = 6

In [9]: d = dict(zip(range(10**n), range(10**n)))

In [10]: %timeit sample_iter(d, 10)
16.1 ms ± 363 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: %timeit sample(list(d), 10)
26.3 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timings for a small dict (5 items):

In [14]: d2 = dict(zip(range(5), range(5)))

In [15]: %timeit sample_iter(d2, 2)
14.8 µs ± 539 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [16]: %timeit sample(list(d2), 2)
6.27 µs ± 457 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

The crossover point for this benchmark is around 10,000 items with k=2. Profiling at 10,000 items with k=2 shows that in either case the time is dominated by list/next so the time difference is just about how efficiently we can iterate vs build the list. For small dicts it is probably possible to get a significant factor speed up by removing the no shuffle option and simplifying the routine.

> Although why it keeps taking k'th roots remains a mystery to me ;-)

Thinking of sample_iter_old, before doing a swap the uvals in our reservoir look like:

U0 = {u[1], u[2], ... u[k-1], W0}
W0 = max(V0)

Here u[1] ... u[k-1] are uniform in (0, W0). We find a new u[n] < W0 which we swap in while removing W0 and afterwards we have

U1 = {u[1], u[2], ... u[k-1], u[k]}
W1 = max(U1)

Given that U1 is k iid uniform variates in (0, W0) we have that

W1 = W0 * max(random() for _ in range(k)) = W0 * W'

Here W' has cdf x**k and so by the inverse sampling method we can generate it as random()**(1/k). That gives the update rule for sample_iter:

W *= random() ** (1/k)```
History
Date User Action Args
2020-07-17 11:41:37oscarbenjaminsetrecipients: + oscarbenjamin, tim.peters, rhettinger, mark.dickinson
2020-07-17 11:41:37oscarbenjaminsetmessageid: <1594986097.28.0.830565563453.issue41311@roundup.psfhosted.org>