Message 328749 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	rhettinger
Recipients	ciphergoth, mark.dickinson, rhettinger
Date	2018-10-28.20:58:33
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1540760313.21.0.788709270274.issue35094@psf.upfronthosting.co.za>
In-reply-to

Content
For the common case, where k is small and n is large, reservoir sampling makes n calls rather the current k plus a few reselections -- the only cost is a temporary k-sized set using superfast int hashing. This technique works even for a very large values of n, such as sample(range(1_000_000_000), k=10_000). For the case where k is a large percentage of n, the size of the set becomes noticeable and reselections would become more common (placing a higher load on the underlying RNG and eating its entropy, keep in mind that the underlying RNG can be SystemRandom for example). So, we switch algorithms to a partial shuffle, which has no reselections and uses a temporary list of references to the sampled objects. The places the absolute minimum burden on the underlying RNG and makes the minimum number of data swaps (neither of those virtues can be claimed by reservoir sampling). The only advantage of the reservoir approach is not requiring auxiliary storage. Keep in mind, the temporary list only holds references to the sampled data, so tends to be small relative to that data. It is no more disadvantageous than typical applications of list comprehensions which make new lists while looping over some other iterable. In any case, the algorithms prefer to optimize for fewest calls the the random number generator rather than aspiring to zero auxiliary storage. The randbelow() calls are not superfast, so we don't want to use many of them. Likewise, calls to SystemRandom() eat available entropy, so we don't want to use many of them either. Also, the Random class is designed to be subclassed to allow uses of other RNGs which are likely to have a smaller period that the MersenneTwister so we don't want many calls to them either (it eats their limited state space just like shuffle() exceeds the capabilities of MT with an input size as small as 2081). Lastly, I have a vaguely held concern that reservoir sampling uses the RNG in a way that would magnify any weaknesses in that RNG (by virtues to making more calls and by virtue of using selections in the full range from n-k to n), so we would get lower quality shuffles. FWIW, all of these things were considered when shuffle() was designed, it wasn't like other methods weren't considered. The design we have now was deemed to be the best fit for most of our users, most of time (we've never gotten a complaint about the temporary storage being a problem for any user, ever). I would however expect complaints about an increased number of calls to the user's rngs.

For the common case, where k is small and n is large, reservoir sampling makes n calls rather the current k plus a few reselections -- the only cost is a temporary k-sized set using superfast int hashing. This technique works even for a very large values of n, such as sample(range(1_000_000_000), k=10_000).

For the case where k is a large percentage of n, the size of the set becomes noticeable and reselections would become more common (placing a higher load on the underlying RNG and eating its entropy, keep in mind that the underlying RNG can be SystemRandom for example). So, we switch algorithms to a partial shuffle, which has no reselections and uses a temporary list of references to the sampled objects. The places the absolute minimum burden on the underlying RNG and makes the minimum number of data swaps (neither of those virtues can be claimed by reservoir sampling).

The only advantage of the reservoir approach is not requiring auxiliary
storage. Keep in mind, the temporary list only holds references to the sampled data, so tends to be small relative to that data. It is no more disadvantageous than typical applications of list comprehensions which make new lists while looping over some other iterable.

In any case, the algorithms prefer to optimize for fewest calls the the random number generator rather than aspiring to zero auxiliary storage. The randbelow() calls are not superfast, so we don't want to use many of them. Likewise, calls to SystemRandom() eat available entropy, so we don't want to use many of them either. Also, the Random class is designed to be subclassed to allow uses of other RNGs which are likely to have a smaller period that the MersenneTwister so we don't want many calls to them either (it eats their limited state space just like shuffle() exceeds the capabilities of MT with an input size as small as 2081).

Lastly, I have a vaguely held concern that reservoir sampling uses the RNG in a way that would magnify any weaknesses in that RNG (by virtues to making more calls and by virtue of using selections in the full range from n-k to n), so we would get lower quality shuffles.

FWIW, all of these things were considered when shuffle() was designed, it wasn't like other methods weren't considered. The design we have now was deemed to be the best fit for most of our users, most of time (we've never gotten a complaint about the temporary storage being a problem for any user, ever). I would however expect complaints about an increased number of calls to the user's rngs.

History
Date	User	Action	Args
2018-10-28 20:58:33	rhettinger	set	recipients: + rhettinger, mark.dickinson, ciphergoth
2018-10-28 20:58:33	rhettinger	set	messageid: <1540760313.21.0.788709270274.issue35094@psf.upfronthosting.co.za>
2018-10-28 20:58:33	rhettinger	link	issue35094 messages
2018-10-28 20:58:33	rhettinger	create