Author rhettinger oscarbenjamin, rhettinger 2020-07-16.05:14:48 -1.0 Yes <1594876489.31.0.633551467749.issue41311@roundup.psfhosted.org>
Content
```Thanks for the suggestion.  I'll give it some thought over the next few days.

Here are a few initial thoughts:

* The use of islice() really helps going through a small population quickly.

* The current sample() uses _randbelow() instead of random() to guarantee equidistribution and to eliminate the small biases that come from using random().  So this is a step backwards.

* The current sample() guarantees that slices of the sample are all in randomized order.  The proposed algorithm doesn't:

# Fully shuffled
>>> sample(range(20), k=19)
>>> [3, 19, 11, 16, 5, 12, 10, 7, 14, 0, 6, 18, 8, 9, 4, 13, 15, 2, 1]

# Not random at all
>>> sample_iter(range(20), k=19)
>>> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 19, 11, 12, 13, 14, 15, 16, 17, 18]

* The current sample runs in speed proportional to *k* rather than *n*.  This means that the proposed algorithm slows down considerably for large populations:

In [12]: %timeit sample(range(100_000_000), k=20)
15.7 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [13]: %timeit sample_iter(range(100_000_000), k=20)
1.59 s ± 9.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

* For an apples-to-apples comparison with choices(), use the version in Python 3.9 which was updated to use floor() which is several times faster than int().

* Instead of listing the islice object, consider using next() instead.  That would likely be faster:

def sample_iter(iterable, k=1):
iterator = iter(iterable)
values = list(islice(iterator, k))
W = exp(log(random())/k)
try:
while True:
# skip is geometrically distributed
skip = floor( log(random())/log(1-W) )
values[randrange(k)] = next(islice(iterator, skip, skip+1))
W *= exp(log(random())/k)
except StopIteration:
return values

* Using mock's call-count feature shows that the proposed algorithm uses more entropy that the current sample() code.  It seems that random() and _randbelow() are both being called for every selection.  I haven't yet investigated to what causes this, but it is unfavorable especially for slow generators like SystemRandom().```
History
Date User Action Args
2020-07-16 05:14:49rhettingersetrecipients: + rhettinger, oscarbenjamin
2020-07-16 05:14:49rhettingersetmessageid: <1594876489.31.0.633551467749.issue41311@roundup.psfhosted.org>