Message 373742 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	rhettinger
Recipients	oscarbenjamin, rhettinger
Date	2020-07-16.05:14:48
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1594876489.31.0.633551467749.issue41311@roundup.psfhosted.org>
In-reply-to

Content
Thanks for the suggestion. I'll give it some thought over the next few days. Here are a few initial thoughts: * The use of islice() really helps going through a small population quickly. * The current sample() uses _randbelow() instead of random() to guarantee equidistribution and to eliminate the small biases that come from using random(). So this is a step backwards. * The current sample() guarantees that slices of the sample are all in randomized order. The proposed algorithm doesn't: # Fully shuffled >>> sample(range(20), k=19) >>> [3, 19, 11, 16, 5, 12, 10, 7, 14, 0, 6, 18, 8, 9, 4, 13, 15, 2, 1] # Not random at all >>> sample_iter(range(20), k=19) >>> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 19, 11, 12, 13, 14, 15, 16, 17, 18] * The current sample runs in speed proportional to k rather than n. This means that the proposed algorithm slows down considerably for large populations: In [12]: %timeit sample(range(100_000_000), k=20) 15.7 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) In [13]: %timeit sample_iter(range(100_000_000), k=20) 1.59 s ± 9.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) * For an apples-to-apples comparison with choices(), use the version in Python 3.9 which was updated to use floor() which is several times faster than int(). * Instead of listing the islice object, consider using next() instead. That would likely be faster: def sample_iter(iterable, k=1): iterator = iter(iterable) values = list(islice(iterator, k)) W = exp(log(random())/k) try: while True: # skip is geometrically distributed skip = floor( log(random())/log(1-W) ) values[randrange(k)] = next(islice(iterator, skip, skip+1)) W = exp(log(random())/k) except StopIteration: return values Using mock's call-count feature shows that the proposed algorithm uses more entropy that the current sample() code. It seems that random() and _randbelow() are both being called for every selection. I haven't yet investigated to what causes this, but it is unfavorable especially for slow generators like SystemRandom().

Thanks for the suggestion.  I'll give it some thought over the next few days.

Here are a few initial thoughts:

* The use of islice() really helps going through a small population quickly.

* The current sample() uses _randbelow() instead of random() to guarantee equidistribution and to eliminate the small biases that come from using random().  So this is a step backwards.

* The current sample() guarantees that slices of the sample are all in randomized order.  The proposed algorithm doesn't:

    # Fully shuffled
    >>> sample(range(20), k=19)
    >>> [3, 19, 11, 16, 5, 12, 10, 7, 14, 0, 6, 18, 8, 9, 4, 13, 15, 2, 1]

    # Not random at all 
    >>> sample_iter(range(20), k=19)
    >>> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 19, 11, 12, 13, 14, 15, 16, 17, 18]

* The current sample runs in speed proportional to *k* rather than *n*.  This means that the proposed algorithm slows down considerably for large populations:

    In [12]: %timeit sample(range(100_000_000), k=20)
    15.7 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

    In [13]: %timeit sample_iter(range(100_000_000), k=20)
    1.59 s ± 9.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

* For an apples-to-apples comparison with choices(), use the version in Python 3.9 which was updated to use floor() which is several times faster than int().

* Instead of listing the islice object, consider using next() instead.  That would likely be faster:

    def sample_iter(iterable, k=1):
        iterator = iter(iterable)
        values = list(islice(iterator, k))
        W = exp(log(random())/k)
        try:
            while True:
                # skip is geometrically distributed
                skip = floor( log(random())/log(1-W) )
                values[randrange(k)] = next(islice(iterator, skip, skip+1))
                W *= exp(log(random())/k)
        except StopIteration:
            return values

* Using mock's call-count feature shows that the proposed algorithm uses more entropy that the current sample() code.  It seems that random() and _randbelow() are both being called for every selection.  I haven't yet investigated to what causes this, but it is unfavorable especially for slow generators like SystemRandom().

History
Date	User	Action	Args
2020-07-16 05:14:49	rhettinger	set	recipients: + rhettinger, oscarbenjamin
2020-07-16 05:14:49	rhettinger	set	messageid: <1594876489.31.0.633551467749.issue41311@roundup.psfhosted.org>
2020-07-16 05:14:49	rhettinger	link	issue41311 messages
2020-07-16 05:14:48	rhettinger	create