Author rhettinger
Recipients oscarbenjamin, rhettinger
Date 2020-07-16.05:14:48
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1594876489.31.0.633551467749.issue41311@roundup.psfhosted.org>
In-reply-to
Content
Thanks for the suggestion.  I'll give it some thought over the next few days.

Here are a few initial thoughts:

* The use of islice() really helps going through a small population quickly.

* The current sample() uses _randbelow() instead of random() to guarantee equidistribution and to eliminate the small biases that come from using random().  So this is a step backwards.

* The current sample() guarantees that slices of the sample are all in randomized order.  The proposed algorithm doesn't:

    # Fully shuffled
    >>> sample(range(20), k=19)
    >>> [3, 19, 11, 16, 5, 12, 10, 7, 14, 0, 6, 18, 8, 9, 4, 13, 15, 2, 1]

    # Not random at all 
    >>> sample_iter(range(20), k=19)
    >>> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 19, 11, 12, 13, 14, 15, 16, 17, 18]

* The current sample runs in speed proportional to *k* rather than *n*.  This means that the proposed algorithm slows down considerably for large populations:

    In [12]: %timeit sample(range(100_000_000), k=20)
    15.7 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

    In [13]: %timeit sample_iter(range(100_000_000), k=20)
    1.59 s ± 9.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

* For an apples-to-apples comparison with choices(), use the version in Python 3.9 which was updated to use floor() which is several times faster than int().

* Instead of listing the islice object, consider using next() instead.  That would likely be faster:

    def sample_iter(iterable, k=1):
        iterator = iter(iterable)
        values = list(islice(iterator, k))
        W = exp(log(random())/k)
        try:
            while True:
                # skip is geometrically distributed
                skip = floor( log(random())/log(1-W) )
                values[randrange(k)] = next(islice(iterator, skip, skip+1))
                W *= exp(log(random())/k)
        except StopIteration:
            return values

* Using mock's call-count feature shows that the proposed algorithm uses more entropy that the current sample() code.  It seems that random() and _randbelow() are both being called for every selection.  I haven't yet investigated to what causes this, but it is unfavorable especially for slow generators like SystemRandom().
History
Date User Action Args
2020-07-16 05:14:49rhettingersetrecipients: + rhettinger, oscarbenjamin
2020-07-16 05:14:49rhettingersetmessageid: <1594876489.31.0.633551467749.issue41311@roundup.psfhosted.org>
2020-07-16 05:14:49rhettingerlinkissue41311 messages
2020-07-16 05:14:48rhettingercreate