Message 373828 - Python tracker


Author	oscarbenjamin
Recipients	mark.dickinson, oscarbenjamin, rhettinger, tim.peters
Date	2020-07-17.11:41:36
Message-id	<1594986097.28.0.830565563453.issue41311@roundup.psfhosted.org>
In-reply-to

Content

All good points :)

Here's an implementation with those changes that shuffles by default but gives the option to preserve order. It also handles the case W == 1.0, which can happen at the first step with probability 1 - (1 - 2**-53)**k.
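
As an illustration of why that check matters (a minimal sketch, not part of the original message), the snippet below feeds the largest value random() can return into the k-th root. The value k = 10 is an arbitrary choice, and the result assumes IEEE 754 doubles with a reasonably accurate pow().

```python
# Largest float that random() can return: 1 - 2**-53.
u_max = 1.0 - 2.0 ** -53
assert u_max < 1.0

k = 10  # illustrative choice, not taken from the message
root = u_max ** (1 / k)

# The exact k-th root lies much closer to 1.0 than to u_max,
# so the floating-point result is expected to round to exactly 1.0.
print(root == 1.0)
```

With W == 1.0 the skip formula would divide by log1p(-1.0), which is not defined, so the loop has to stop before computing a skip.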

Attempting to preserve order makes the storage requirement expected O(k*log(k)) rather than deterministic O(k), but note that the log(k) part only reflects the values list growing with references to None: only k of the items from iterable are stored at any time. Removing the option to preserve order would simplify this and would also make it faster in the small-iterable case.
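
For comparison, the simplification mentioned above could look roughly like the following. This is a sketch under assumptions, not the code proposed in this message: the name sample_iter_unordered is hypothetical, and the guards against W == 0.0 and log(0.0) are additions made for the sketch.

```python
from itertools import islice
from math import floor, log, log1p
from random import random, randrange


def sample_iter_unordered(iterable, k=1):
    """Hypothetical variant: O(k) storage, items returned in arbitrary order."""
    iterator = iter(iterable)
    values = list(islice(iterator, k))
    if not values or len(values) < k:
        # k == 0, or the iterable had fewer than k items: nothing to replace.
        return values

    kinv = 1 / k
    W = 1.0
    while True:
        W *= random() ** kinv
        # W == 1.0 means "infinite" skips; W == 0.0 would divide by zero below.
        if not 0.0 < W < 1.0:
            break
        # 1.0 - random() lies in (0.0, 1.0], so log() is always defined.
        skip = floor(log(1.0 - random()) / log1p(-W))
        try:
            newval = next(islice(iterator, skip, skip + 1))
        except StopIteration:
            break
        # Overwrite a random slot in place instead of appending.
        values[randrange(k)] = newval
    return values


print(sample_iter_unordered(range(1000), 5))
```

Because replaced items are overwritten rather than marked with None, the list never grows beyond k entries, at the cost of losing the original ordering.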

Below are a few timings for choosing from a dict vs converting to a list and using sample (I don't have a 3.9 build immediately available to use choices). Note that these benchmarks are not the primary motivation for sample_iter, though: that is the case where the underlying iterable is much more expensive in memory and/or time and where the length is not known ahead of time.



from math import log, log1p, floor
from random import random, randrange, shuffle as _shuffle
from itertools import islice


def sample_iter(iterable, k=1, shuffle=True):
    """Choose a sample of k items from iterable

    shuffle=True (default) gives the items in random order
    shuffle=False preserves the original ordering of the items
    """
    iterator = iter(iterable)
    values = list(islice(iterator, k))

    irange = range(len(values))
    indices = dict(zip(irange, irange))

    kinv = 1 / k
    W = 1.0
    while True:
        W *= random() ** kinv
        # random() < 1.0 but random() ** kinv might not be
        # W == 1.0 implies "infinite" skips
        if W == 1.0:
            break
        # skip is geometrically distributed with parameter W
        # 1.0 - random() lies in (0.0, 1.0], so log() never receives 0.0
        skip = floor(log(1.0 - random()) / log1p(-W))
        try:
            newval = next(islice(iterator, skip, skip+1))
        except StopIteration:
            break
        # Append new, replace old with dummy, and keep track of order
        remove_index = randrange(k)
        values[indices[remove_index]] = None
        indices[remove_index] = len(values)
        values.append(newval)

    # Live positions sorted ascending give the items in their original order
    values = [values[j] for j in sorted(indices.values())]

    if shuffle:
        _shuffle(values)

    return values
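
The skip computation relies on inversion sampling: if U is uniform on (0, 1], then floor(log(U) / log(1 - W)) counts the failures before the first success with success probability W. The sketch below checks this empirically; the seed, W = 0.25 and the trial count are arbitrary choices, and the helper name draw_skip is hypothetical.

```python
from math import floor, log, log1p
from random import Random

rng = Random(2020)  # fixed seed so the run is repeatable (arbitrary choice)
W = 0.25            # illustrative acceptance probability
n_trials = 200_000


def draw_skip(W):
    # 1.0 - rng.random() lies in (0.0, 1.0], so log() is always defined.
    return floor(log(1.0 - rng.random()) / log1p(-W))


skips = [draw_skip(W) for _ in range(n_trials)]
mean_skip = sum(skips) / n_trials
frac_zero = skips.count(0) / n_trials

# A geometric variate counting failures before the first success has
# mean (1 - W) / W and takes the value 0 with probability W.
print(mean_skip, (1 - W) / W)
print(frac_zero, W)
```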


Timings for a large dict (1,000,000 items):

In [8]: n = 6

In [9]: d = dict(zip(range(10**n), range(10**n)))

In [10]: %timeit sample_iter(d, 10)
16.1 ms ± 363 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: %timeit sample(list(d), 10)
26.3 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Timings for a small dict (5 items):

In [14]: d2 = dict(zip(range(5), range(5)))

In [15]: %timeit sample_iter(d2, 2)
14.8 µs ± 539 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [16]: %timeit sample(list(d2), 2)
6.27 µs ± 457 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


The crossover point for this benchmark is around 10,000 items with k=2. Profiling at 10,000 items with k=2 shows that in either case the time is dominated by list/next, so the difference comes down to how efficiently we can iterate vs build the list. For small dicts it is probably possible to get a significant speedup by removing the no-shuffle option and simplifying the routine.
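
A rough way to compare the two dominant costs outside IPython is sketched below. It is not the profiling run described above: the helper names, the repeat count, and the use of islice to advance the iterator are assumptions made for illustration, and the timings will vary by machine.

```python
from itertools import islice
from timeit import timeit

n = 10_000  # size near the reported crossover
d = dict(zip(range(n), range(n)))


def build_list():
    # Cost paid by sample(list(d), k): materialize every key.
    return list(d)


def consume_iterator():
    # Cost paid by an iterator-based sampler: advance past every key
    # without storing any of them.
    next(islice(iter(d), n, None), None)


t_list = timeit(build_list, number=200)
t_iter = timeit(consume_iterator, number=200)
print(f"list(d): {t_list:.4f} s, iterate only: {t_iter:.4f} s")
```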


> Although why it keeps taking k'th roots remains a mystery to me ;-)

Thinking of sample_iter_old, before doing a swap the uvals in our reservoir look like:

  U0 = {u[1], u[2], ... u[k-1], W0}
  W0 = max(U0)

Here u[1] ... u[k-1] are uniform in (0, W0). We find a new u[n] < W0, which we swap in as u[k] while removing W0, and afterwards we have

  U1 = {u[1], u[2], ... u[k-1], u[k]}
  W1 = max(U1)

Given that U1 is k iid uniform variates in (0, W0) we have that

  W1 = W0 * max(random() for _ in range(k)) = W0 * W'

Here W' has cdf x**k and so by the inverse sampling method we can generate it as random()**(1/k). That gives the update rule for sample_iter:

  W *= random() ** (1/k)
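
The identity behind that update can be checked numerically. The sketch below (seed, k = 5 and trial count are arbitrary choices) compares the maximum of k uniform variates with a single k-th root draw; both should have mean k / (k + 1) if they share the cdf x**k.

```python
from random import Random

rng = Random(41311)  # arbitrary fixed seed for repeatability
k = 5
n_trials = 200_000

# Direct approach: draw k uniforms and keep the largest.
max_of_k = [max(rng.random() for _ in range(k)) for _ in range(n_trials)]
# Inverse-cdf approach: a single uniform raised to the power 1/k.
kth_root = [rng.random() ** (1 / k) for _ in range(n_trials)]

mean_max = sum(max_of_k) / n_trials
mean_root = sum(kth_root) / n_trials

# Both have cdf x**k on (0, 1), so both means should be close to k / (k + 1).
print(mean_max, mean_root, k / (k + 1))
```

The k-th root form needs one call to random() per update instead of k, which is why the update rule uses it.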

History
Date	User	Action	Args
2020-07-17 11:41:37	oscarbenjamin	set	recipients: + oscarbenjamin, tim.peters, rhettinger, mark.dickinson
2020-07-17 11:41:37	oscarbenjamin	set	messageid: <1594986097.28.0.830565563453.issue41311@roundup.psfhosted.org>
2020-07-17 11:41:37	oscarbenjamin	link	issue41311 messages
2020-07-17 11:41:36	oscarbenjamin	create