Author oscarbenjamin
Recipients mark.dickinson, oscarbenjamin, rhettinger, tim.peters
Date 2020-07-16.22:37:04
Message-id <1594939024.78.0.0222216370569.issue41311@roundup.psfhosted.org>
In-reply-to
Content
To be clear, I suggest that this could be a separate function from the existing sample rather than a replacement or a routine used internally.

The intended use-cases for the separate function are:

1. Selecting from something where you really do not want to build a list and just want/need to use a single pass. For example, to select a random line from a text file you need to read the entire file in any case just to know how many lines there are. The method here means you could make the selection in a single pass using O(k) memory. The same applies to many situations involving generators etc.

2. Convenient, possibly faster selection from a most-likely small dict/set (this was the original context from python-ideas).
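To make use-case 1 concrete, a standard single-pass reservoir selection of one line (the function name choose_line is just illustrative, not part of the proposal) could look like:

```python
import random

def choose_line(path):
    """Return one uniformly random line from a file in a single pass,
    keeping only one line in memory (standard reservoir sampling)."""
    chosen = None
    with open(path) as f:
        for n, line in enumerate(f, start=1):
            # Replace the current choice with probability 1/n; after the
            # loop each of the n lines has been kept with probability 1/n.
            if random.randrange(n) == 0:
                chosen = line
    return chosen
```

The proposed function would generalise this to k items from any iterable.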

The algorithm naturally gives something between the original order and a randomised order. There are two possibilities for changing that:

a. Randomise the order: call random.shuffle or equivalent, either on the initial k items or at the end before returning.

b. Preserve the original ordering from the iterable: append all new items and use a sentinel for removed items (remove sentinels at the end).

Since randomised order does not come naturally and requires explicit shuffling, my preference would be to preserve the original order (b.), because there is no other way to recover the information lost by shuffling (assuming that only a single pass is possible). The user can always call shuffle afterwards if they want a random order.
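As an illustration of option (b), here is a rough sketch using the simple one-replacement-at-a-time rule; the name sample_iter_ordered and the sentinel mechanics are mine, just to show the idea, and the scratch list grows by one slot per replacement rather than staying at exactly k entries:

```python
import random

_SENTINEL = object()  # marks a slot whose item was evicted

def sample_iter_ordered(iterable, k=1):
    """Reservoir-sample k items, preserving the iterable's original order.
    New items are always appended; evicted items are overwritten with a
    sentinel, and the sentinels are stripped at the end."""
    slots = []   # selected items in encounter order (plus sentinels)
    live = []    # indices into slots of the k current selections
    for n, val in enumerate(iterable, start=1):
        if n <= k:
            live.append(len(slots))
            slots.append(val)
        elif random.randrange(n) < k:
            # Evict a uniformly chosen current selection; append the
            # newcomer so encounter order is preserved in slots.
            evict = random.randrange(k)
            slots[live[evict]] = _SENTINEL
            live[evict] = len(slots)
            slots.append(val)
    return [v for v in slots if v is not _SENTINEL]
```

Because items are only ever appended, the returned sample is always in the order the items were encountered, with no shuffle needed.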

To explain what "geometrically distributed" means I need to refer to the precursor algorithm from which this is derived. A simple Python version could be:


from random import random

def sample_iter_old(iterable, k=1):
    uvals_vals = []
    # iter(random, None) is an endless stream of uniform (0, 1) variates;
    # zip pairs one variate uval with each element val of the iterable:
    for uval, val in zip(iter(random, None), iterable):
        uvals_vals.append((uval, val))
        uvals_vals.sort()
        uvals_vals = uvals_vals[:k]   # keep the k items with smallest uval
    return [val for uval, val in uvals_vals]


In sample_iter_old each element val of the iterable is associated with a uniform (0, 1) variate uval. At each step we keep the k elements having the smallest uval variates. This is relatively inefficient because we need to generate a uniform variate for every element of the iterable, and most of the time the new val is simply discarded, so sample_iter instead calculates how many items to discard.

The quantity W in sample_iter is the maximum of the uvals kept by sample_iter_old:

    W = max(uval for uval, val in uvals_vals)

A new item from the iterable will be swapped in only if its uval is less than W. Since each item independently has uval < W with probability W, the number of items skipped before finding one with uval < W is geometrically distributed with parameter W.
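To make the skip idea concrete, here is a rough self-contained sketch along the lines of Li's "Algorithm L" (the names sample_iter and W follow the discussion above, but this is my own illustration, not necessarily the exact implementation being proposed):

```python
import random
from itertools import islice
from math import exp, log

def sample_iter(iterable, k=1):
    """Single-pass reservoir sample of k items using geometric skips.
    W tracks the largest of the k smallest uniform variates seen so far;
    the number of items skipped before the next replacement is
    geometrically distributed with parameter W."""
    it = iter(iterable)
    reservoir = list(islice(it, k))
    if len(reservoir) < k:
        return reservoir
    # Simulate the max of the k smallest uvals without generating them all.
    W = exp(log(random.random()) / k)
    while True:
        # Geometric number of items whose uval would have been >= W.
        skip = int(log(random.random()) / log(1.0 - W))
        try:
            for _ in range(skip):
                next(it)
            val = next(it)
        except StopIteration:
            return reservoir
        # The next item would have had uval < W: swap it in and update W.
        reservoir[random.randrange(k)] = val
        W *= exp(log(random.random()) / k)
```

Only O(log(n/k)) expected random numbers per kept item are needed, instead of one per element as in sample_iter_old.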