Message373828

Author | oscarbenjamin |
---|---|

Recipients | mark.dickinson, oscarbenjamin, rhettinger, tim.peters |

Date | 2020-07-17.11:41:36 |

SpamBayes Score | -1.0 |

Marked as misclassified | Yes |

Message-id | <1594986097.28.0.830565563453.issue41311@roundup.psfhosted.org> |

In-reply-to |

Content | |
---|---|

All good points :) Here's an implementation with those changes and that shuffles but gives the option to preserve order. It also handles the case W=1.0 which can happen at the first step with probability 1 - (1 - 2**53)**k. Attempting to preserve order makes the storage requirements expected O(k*log(k)) rather than deterministic O(k) but note that the log(k) part just refers to the values list growing larger with references to None: only k of the items from iterable are stored at any time. This can be simplified by removing the option to preserve order which would also make it faster in the small-iterable case. There are a few timings below for choosing from a dict vs converting to a list and using sample (I don't have a 3.9 build immediately available to use choices). Note that these benchmarks are not the primary motivation for sample_iter though which is the case where the underlying iterable is much more expensive in memory and/or time and where the length is not known ahead of time. from math import exp, log, log1p, floor from random import random, randrange, shuffle as _shuffle from itertools import islice def sample_iter(iterable, k=1, shuffle=True): """Choose a sample of k items from iterable shuffle=True (default) gives the items in random order shuffle=False preserves the original ordering of the items """ iterator = iter(iterable) values = list(islice(iterator, k)) irange = range(len(values)) indices = dict(zip(irange, irange)) kinv = 1 / k W = 1.0 while True: W *= random() ** kinv # random() < 1.0 but random() ** kinv might not be # W == 1.0 implies "infinite" skips if W == 1.0: break # skip is geometrically distributed with parameter W skip = floor( log(random())/log1p(-W) ) try: newval = next(islice(iterator, skip, skip+1)) except StopIteration: break # Append new, replace old with dummy, and keep track of order remove_index = randrange(k) values[indices[remove_index]] = None indices[remove_index] = len(values) values.append(newval) values = [values[indices[i]] for i in irange] if shuffle: _shuffle(values) return values Timings for a large dict (1,000,000 items): In [8]: n = 6 In [9]: d = dict(zip(range(10**n), range(10**n))) In [10]: %timeit sample_iter(d, 10) 16.1 ms ± 363 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) In [11]: %timeit sample(list(d), 10) 26.3 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) Timings for a small dict (5 items): In [14]: d2 = dict(zip(range(5), range(5))) In [15]: %timeit sample_iter(d2, 2) 14.8 µs ± 539 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) In [16]: %timeit sample(list(d2), 2) 6.27 µs ± 457 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) The crossover point for this benchmark is around 10,000 items with k=2. Profiling at 10,000 items with k=2 shows that in either case the time is dominated by list/next so the time difference is just about how efficiently we can iterate vs build the list. For small dicts it is probably possible to get a significant factor speed up by removing the no shuffle option and simplifying the routine. > Although why it keeps taking k'th roots remains a mystery to me ;-) Thinking of sample_iter_old, before doing a swap the uvals in our reservoir look like: U0 = {u[1], u[2], ... u[k-1], W0} W0 = max(V0) Here u[1] ... u[k-1] are uniform in (0, W0). We find a new u[n] < W0 which we swap in while removing W0 and afterwards we have U1 = {u[1], u[2], ... u[k-1], u[k]} W1 = max(U1) Given that U1 is k iid uniform variates in (0, W0) we have that W1 = W0 * max(random() for _ in range(k)) = W0 * W' Here W' has cdf x**k and so by the inverse sampling method we can generate it as random()**(1/k). That gives the update rule for sample_iter: W *= random() ** (1/k) |

History | |||
---|---|---|---|

Date | User | Action | Args |

2020-07-17 11:41:37 | oscarbenjamin | set | recipients: + oscarbenjamin, tim.peters, rhettinger, mark.dickinson |

2020-07-17 11:41:37 | oscarbenjamin | set | messageid: <1594986097.28.0.830565563453.issue41311@roundup.psfhosted.org> |

2020-07-17 11:41:37 | oscarbenjamin | link | issue41311 messages |

2020-07-17 11:41:36 | oscarbenjamin | create |