This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Improved algorithms for random.sample
Type: resource usage Stage: resolved
Components: Library (Lib) Versions: Python 3.8
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: ciphergoth, mark.dickinson, rhettinger
Priority: normal Keywords: patch

Created on 2018-10-28 19:00 by ciphergoth, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 10192 closed ciphergoth, 2018-10-28 19:05
Messages (6)
msg328728 - (view) Author: Paul Crowley (ciphergoth) * Date: 2018-10-28 19:00
random.sample currently uses either a Fisher-Yates shuffle, or rejection sampling, to achieve sampling without replacement. I propose using reservoir sampling or "cardchoose"; these are similar performance or sometimes faster, and don't need to allocate anything except the list used for the results.
msg328733 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-10-28 19:54
Sorry, this has been discussed and rejected multiple times.  Reservoir sampling makes far too many calls to the underlying random number generator.
msg328736 - (view) Author: Paul Crowley (ciphergoth) * Date: 2018-10-28 20:09
I would be very grateful for your help finding those dicussions! I've tried this search:

https://www.google.com/search?q=python+%22Reservoir+sampling%22+rhettinger

and found this discussion:

https://mail.python.org/pipermail/python-ideas/2016-April/039708.html

but if I've missed any I'm keen to know.

In my pull request reservoir sampling is only used if 2k>=n, so it makes at most twice as many random requests as any other algorithm.
msg328749 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-10-28 20:58
For the common case, where k is small and n is large, reservoir sampling makes n calls rather the current k plus a few reselections -- the only cost is a temporary k-sized set using superfast int hashing.  This technique works even for a very large values of n, such as sample(range(1_000_000_000), k=10_000).

For the case where k is a large percentage of n, the size of the set becomes noticeable and reselections would become more common (placing a higher load on the underlying RNG and eating its entropy, keep in mind that the underlying RNG can be SystemRandom for example).  So, we switch algorithms to a partial shuffle, which has no reselections and uses a temporary list of references to the sampled objects.   The places the absolute minimum burden on the underlying RNG and makes the minimum number of data swaps (neither of those virtues can be claimed by reservoir sampling).

The only advantage of the reservoir approach is not requiring auxiliary
storage.  Keep in mind, the temporary list only holds references to the sampled data, so tends to be small relative to that data.  It is no more disadvantageous than typical applications of list comprehensions which make new lists while looping over some other iterable.

In any case, the algorithms prefer to optimize for fewest calls the the random number generator rather than aspiring to zero auxiliary storage.  The randbelow() calls are not superfast, so we don't want to use many of them.  Likewise, calls to SystemRandom() eat available entropy,  so we don't want to use many of them either.  Also, the Random class is designed to be subclassed to allow uses of other RNGs which are likely to have a smaller period that the MersenneTwister so we don't want many calls to them either (it eats their limited state space just like shuffle() exceeds the capabilities of MT with an input size as small as 2081).

Lastly, I have a vaguely held concern that reservoir sampling uses the RNG in a way that would magnify any weaknesses in that RNG (by virtues to making more calls and by virtue of using selections in the full range from n-k to n), so we would get lower quality shuffles.

FWIW, all of these things were considered when shuffle() was designed, it wasn't like other methods weren't considered.  The design we have now was deemed to be the best fit for most of our users, most of time (we've never gotten a complaint about the temporary storage being a problem for any user, ever).  I would however expect complaints about an increased number of calls to the user's rngs.
msg328753 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-10-28 21:07
One other thought.  We don't guarantee that sample() will always produce the same sequences, so that does give us some freedom; however, we have a strong aversion to doing so unless there is a compelling advantage.  It will almost certainly cause some disruption for users if their samples stop being reproducible from a given seed.  There is some value to the implementation remaining stable.
msg328764 - (view) Author: Paul Crowley (ciphergoth) * Date: 2018-10-28 22:15
Thank you for a very comprehensive and helpful answer!

Yep, reservoir sampling makes n calls not k calls, and so should only be used when k is a large fraction of n; in my patch it's k/n >= 1/2.

Because modern CPRNGs are so fast, I had been assuming that overall runtime, rather than calls to the RNG; I'll have to bear that in mind here, though in general "use a secure seed to whatever secure RNG is fastest" is the right strategy.

I don't think hedging against the quality of the RNG is the right thing to do here.

I don't mean to suggest you didn't think about this problem hard! It's just that I've been obsessing about this problem for the last few weeks for some reason (see my repo) so I thought I might be able to help. Thanks again for you reply!
History
Date User Action Args
2022-04-11 14:59:07adminsetgithub: 79275
2018-10-28 22:15:38ciphergothsetmessages: + msg328764
2018-10-28 21:07:43rhettingersetmessages: + msg328753
2018-10-28 20:58:33rhettingersetmessages: + msg328749
2018-10-28 20:09:12ciphergothsetmessages: + msg328736
2018-10-28 19:54:05rhettingersetstatus: open -> closed
messages: + msg328733

assignee: rhettinger
resolution: rejected
stage: patch review -> resolved
2018-10-28 19:07:55pablogsalsetnosy: + rhettinger, mark.dickinson
2018-10-28 19:05:24ciphergothsetkeywords: + patch
stage: patch review
pull_requests: + pull_request9510
2018-10-28 19:00:35ciphergothcreate