Message373828
All good points :)
Here's an implementation with those changes and that shuffles but gives the option to preserve order. It also handles the case W=1.0 which can happen at the first step with probability 1 - (1 - 2**53)**k.
Attempting to preserve order makes the storage requirements expected O(k*log(k)) rather than deterministic O(k) but note that the log(k) part just refers to the values list growing larger with references to None: only k of the items from iterable are stored at any time. This can be simplified by removing the option to preserve order which would also make it faster in the small-iterable case.
There are a few timings below for choosing from a dict vs converting to a list and using sample (I don't have a 3.9 build immediately available to use choices). Note that these benchmarks are not the primary motivation for sample_iter though which is the case where the underlying iterable is much more expensive in memory and/or time and where the length is not known ahead of time.
from math import exp, log, log1p, floor
from random import random, randrange, shuffle as _shuffle
from itertools import islice
def sample_iter(iterable, k=1, shuffle=True):
"""Choose a sample of k items from iterable
shuffle=True (default) gives the items in random order
shuffle=False preserves the original ordering of the items
"""
iterator = iter(iterable)
values = list(islice(iterator, k))
irange = range(len(values))
indices = dict(zip(irange, irange))
kinv = 1 / k
W = 1.0
while True:
W *= random() ** kinv
# random() < 1.0 but random() ** kinv might not be
# W == 1.0 implies "infinite" skips
if W == 1.0:
break
# skip is geometrically distributed with parameter W
skip = floor( log(random())/log1p(-W) )
try:
newval = next(islice(iterator, skip, skip+1))
except StopIteration:
break
# Append new, replace old with dummy, and keep track of order
remove_index = randrange(k)
values[indices[remove_index]] = None
indices[remove_index] = len(values)
values.append(newval)
values = [values[indices[i]] for i in irange]
if shuffle:
_shuffle(values)
return values
Timings for a large dict (1,000,000 items):
In [8]: n = 6
In [9]: d = dict(zip(range(10**n), range(10**n)))
In [10]: %timeit sample_iter(d, 10)
16.1 ms ± 363 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: %timeit sample(list(d), 10)
26.3 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Timings for a small dict (5 items):
In [14]: d2 = dict(zip(range(5), range(5)))
In [15]: %timeit sample_iter(d2, 2)
14.8 µs ± 539 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [16]: %timeit sample(list(d2), 2)
6.27 µs ± 457 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The crossover point for this benchmark is around 10,000 items with k=2. Profiling at 10,000 items with k=2 shows that in either case the time is dominated by list/next so the time difference is just about how efficiently we can iterate vs build the list. For small dicts it is probably possible to get a significant factor speed up by removing the no shuffle option and simplifying the routine.
> Although why it keeps taking k'th roots remains a mystery to me ;-)
Thinking of sample_iter_old, before doing a swap the uvals in our reservoir look like:
U0 = {u[1], u[2], ... u[k-1], W0}
W0 = max(V0)
Here u[1] ... u[k-1] are uniform in (0, W0). We find a new u[n] < W0 which we swap in while removing W0 and afterwards we have
U1 = {u[1], u[2], ... u[k-1], u[k]}
W1 = max(U1)
Given that U1 is k iid uniform variates in (0, W0) we have that
W1 = W0 * max(random() for _ in range(k)) = W0 * W'
Here W' has cdf x**k and so by the inverse sampling method we can generate it as random()**(1/k). That gives the update rule for sample_iter:
W *= random() ** (1/k) |
|
Date |
User |
Action |
Args |
2020-07-17 11:41:37 | oscarbenjamin | set | recipients:
+ oscarbenjamin, tim.peters, rhettinger, mark.dickinson |
2020-07-17 11:41:37 | oscarbenjamin | set | messageid: <1594986097.28.0.830565563453.issue41311@roundup.psfhosted.org> |
2020-07-17 11:41:37 | oscarbenjamin | link | issue41311 messages |
2020-07-17 11:41:36 | oscarbenjamin | create | |
|